Model routing: running cheap when you can, expensive when you must

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Most production workloads are not uniform. The median prompt is easy; a long tail is hard. Routing means sending the easy prompts to a small, fast model and the hard ones to a large, slow one. Done well, most calls (on the order of 80%) land on the small model and the bill drops sharply.

Static routing

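The simplest routing is type-based: each endpoint or task type is pinned to a model ahead of time. Classification → Haiku. Long-form generation → Sonnet. Hard reasoning → Opus with extended thinking. Nothing is decided at request time, so there is no extra call to pay for.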
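A minimal sketch of a type-based routing table, assuming the caller already knows the task type. The `TaskType` enum, the `ROUTES` table, and the lowercase tier labels are illustrative names, not real API model identifiers; substitute whatever identifiers your provider uses.

```python
from enum import Enum


class TaskType(Enum):
    CLASSIFICATION = "classification"
    LONG_FORM = "long_form"
    HARD_REASONING = "hard_reasoning"


# Hypothetical tier labels standing in for real model identifiers.
ROUTES = {
    TaskType.CLASSIFICATION: {"model": "haiku", "extended_thinking": False},
    TaskType.LONG_FORM: {"model": "sonnet", "extended_thinking": False},
    TaskType.HARD_REASONING: {"model": "opus", "extended_thinking": True},
}


def route(task_type: TaskType) -> dict:
    """Return the model settings pinned to this task type."""
    return ROUTES[task_type]
```

Because the table is fixed at deploy time, the lookup adds no model call and no measurable latency.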

Dynamic routing

Use a small model to classify the difficulty of each incoming prompt, then route on the result. The classifier call itself costs something, so routing only pays off once enough traffic moves to the cheaper model; the breakeven is roughly when at least 30% of prompts can be handled by a model 5× cheaper than the default.
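The breakeven can be checked with simple arithmetic. The sketch below assumes a deliberately simple cost model: the classifier runs on every prompt, routing is never wrong, and costs are expressed relative to one default-model call. The function names and the 0.24 classifier cost are assumptions chosen for illustration; the guide does not state a classifier cost.

```python
def routed_cost(default_cost: float, cheap_ratio: float,
                cheap_fraction: float, classifier_cost: float) -> float:
    """Expected cost per call when a classifier runs on every prompt."""
    cheap_cost = default_cost / cheap_ratio
    return (classifier_cost
            + cheap_fraction * cheap_cost
            + (1 - cheap_fraction) * default_cost)


def breakeven_fraction(default_cost: float, cheap_ratio: float,
                       classifier_cost: float) -> float:
    """Share of prompts that must go to the cheap model for routing to pay for itself."""
    saving_per_cheap_call = default_cost - default_cost / cheap_ratio
    return classifier_cost / saving_per_cheap_call


# Assumed numbers: default call = 1.0, cheap model 5x cheaper,
# classifier overhead = 0.24 of a default call.
fraction = breakeven_fraction(default_cost=1.0, cheap_ratio=5, classifier_cost=0.24)
print(f"{fraction:.2f}")  # prints 0.30
```

Under this model, the 30% figure corresponds to classifier overhead of about a quarter of a default call. A cheaper classifier lowers the breakeven; misrouted prompts, which this sketch ignores, raise it.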

Fallback routing

Try the cheap model first. If the response is malformed or the confidence is low, retry on the large model. Every failed first attempt adds a full extra round trip, so operators with strict latency budgets often skip this; operators with strict cost budgets adopt it.
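The try-then-retry flow can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: `cheap_model` and `large_model` are hypothetical callables that return a `(text, confidence)` pair, "malformed" is taken to mean "does not parse as JSON", and the 0.7 threshold is arbitrary.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical model interface: prompt in, (text, confidence or None) out.
ModelCall = Callable[[str], Tuple[str, Optional[float]]]


def is_acceptable(text: str, confidence: Optional[float],
                  min_confidence: float) -> bool:
    """Accept a response that parses as JSON and meets the confidence floor."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    # A missing confidence score is treated as acceptable (an assumption).
    return confidence is None or confidence >= min_confidence


def call_with_fallback(prompt: str, cheap_model: ModelCall,
                       large_model: ModelCall,
                       min_confidence: float = 0.7) -> Tuple[str, str]:
    """Return (response text, tier used), retrying on the large model if needed."""
    text, confidence = cheap_model(prompt)
    if is_acceptable(text, confidence, min_confidence):
        return text, "cheap"
    # The cheap attempt is discarded; its cost and latency are already spent.
    text, _ = large_model(prompt)
    return text, "large"
```

Returning the tier alongside the text makes it easy to log how often the fallback fires, which is the number that decides whether the pattern is saving money.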

Frequently asked

Does routing hurt latency?

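Static routing, no. Dynamic and fallback, yes. Dynamic routing adds a classification hop to every call; fallback routing adds a retry hop only to the calls where the cheap model's answer is rejected.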
