Cost · 8 min
Model routing: running cheap when you can, expensive when you must
Routing can turn a $20/day workload into something close to a $4/day workload without losing capability.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Most production workloads are not uniform. The median prompt is easy; a long tail is hard. Routing means using a small fast model for the easy ones and a large slow one for the hard ones. Done correctly, 80% of calls land on the small model and the bill drops sharply.
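To make the size of the saving concrete, here is a minimal sketch of the blended-cost arithmetic. It assumes every call costs the same on a given model and that the small model is some fixed multiple cheaper per call; the function name and the price ratios are illustrative, not figures from any price list.

```python
def blended_cost(baseline_cost: float, small_share: float, price_ratio: float) -> float:
    """Daily cost after routing.

    baseline_cost: daily cost when every call goes to the large model
    small_share:   fraction of calls routed to the small model (0..1)
    price_ratio:   how many times cheaper the small model is per call (assumed)
    """
    large_part = (1.0 - small_share) * baseline_cost
    small_part = small_share * baseline_cost / price_ratio
    return large_part + small_part


# A $20/day baseline with 80% of calls on the small model,
# under two hypothetical price ratios.
for ratio in (5, 15):
    print(f"{ratio}x cheaper: ${blended_cost(20.0, 0.80, ratio):.2f}/day")
```

Under these assumptions the 20% of calls that stay on the large model account for $4 of the $20 on their own, so the total approaches $4/day only as the small model's price becomes negligible. If the hard calls are also the longest ones, the saving is smaller than this sketch suggests.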
Static routing
The simplest routing is type-based: different endpoints get different models. Classification → Haiku. Long-form generation → Sonnet. Hard reasoning → Opus with extended thinking. The choice is fixed per endpoint, so nothing is decided at request time.
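A type-based router can be as small as a lookup table. The sketch below uses the tier names from the text as plain labels; they are not real API model identifiers, and the task-type keys and the default tier are assumptions made for illustration.

```python
# Task type -> model tier. Labels only; substitute real model IDs in practice.
ROUTES = {
    "classification": "haiku",
    "long_form_generation": "sonnet",
    "hard_reasoning": "opus-extended-thinking",
}

# Assumption: unknown task types go to the middle tier.
DEFAULT_TIER = "sonnet"


def pick_tier(task_type: str) -> str:
    """Return the tier for a task type, falling back to the default tier."""
    return ROUTES.get(task_type, DEFAULT_TIER)


print(pick_tier("classification"))
print(pick_tier("something_else"))
```

Because the table is keyed by endpoint or task type, the lookup adds no model call to the request path.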
Dynamic routing
Use a small model to classify the difficulty of an incoming prompt, then route on the result. The classifier adds its own cost to every call; as a rough rule, routing breaks even once at least 30% of prompts can be handled by a model 5× cheaper than the default.
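The breakeven can be written down directly. If a share p of prompts goes to the cheap model, the routed cost per call is classifier + p × cheap + (1 − p) × default, which beats the default alone when p > classifier / (default − cheap). The sketch below computes that threshold and shows the routing step itself. The names are hypothetical, and `classify_difficulty` is a crude word-count stand-in for a real classifier model.

```python
from typing import Callable


def breakeven_share(classifier_cost: float, default_cost: float, price_ratio: float) -> float:
    """Minimum share of prompts on the cheap model for routing to pay off."""
    cheap_cost = default_cost / price_ratio
    return classifier_cost / (default_cost - cheap_cost)


def classify_difficulty(prompt: str) -> str:
    """Hypothetical stand-in for the small classifier model."""
    return "easy" if len(prompt.split()) < 50 else "hard"


def route(prompt: str,
          classify: Callable[[str], str],
          cheap: Callable[[str], str],
          default: Callable[[str], str]) -> str:
    """Classify the prompt, then send it to the cheap or the default handler."""
    handler = cheap if classify(prompt) == "easy" else default
    return handler(prompt)


# Costs are in units of one default-model call; 0.24 is an assumed classifier cost.
print(f"breakeven share: {breakeven_share(0.24, 1.0, 5):.2f}")
```

One reading consistent with the rule of thumb above: with a model 5× cheaper, a 30% breakeven corresponds to a classifier that costs roughly a quarter of a default call. A cheaper classifier lowers the threshold.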
Fallback routing
Try the cheap model first. If the response is malformed or its confidence is low, retry on the smart model. Operators with strict latency budgets often skip this, since a retry means two calls in sequence; operators with strict cost budgets adopt it.
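A minimal sketch of the try-then-retry pattern follows. It assumes the model is asked to reply with a JSON object containing an `answer` and a self-reported `confidence` field; that schema, the 0.7 threshold, and the `cheap` and `smart` callables are all hypothetical placeholders for real model calls.

```python
import json
from typing import Callable, Optional, Tuple


def parse_reply(raw: str, min_confidence: float) -> Optional[dict]:
    """Return the parsed reply, or None if it is malformed or low-confidence."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "answer" not in data:
        return None
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or confidence < min_confidence:
        return None
    return data


def answer_with_fallback(prompt: str,
                         cheap: Callable[[str], str],
                         smart: Callable[[str], str],
                         min_confidence: float = 0.7) -> Tuple[dict, str]:
    """Try the cheap model; retry once on the smart model if the check fails."""
    reply = parse_reply(cheap(prompt), min_confidence)
    if reply is not None:
        return reply, "cheap"
    # The smart model's reply must still be well-formed, but is not gated on confidence.
    reply = parse_reply(smart(prompt), 0.0)
    if reply is None:
        raise ValueError("both models returned malformed replies")
    return reply, "smart"
```

The second element of the returned tuple records which model answered, which makes the fallback rate easy to log. Self-reported confidence is only one possible signal; a schema check or a task-specific validator could serve the same role.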
Frequently asked
Does routing hurt latency?
Static routing: no. Dynamic and fallback routing: yes. Dynamic routing adds a classification hop to every request; fallback routing adds a retry hop to the requests that fail on the cheap model.
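The added latency can be estimated with simple arithmetic. The sketch below assumes the hops run strictly in sequence; the function names and timings are hypothetical, not measurements.

```python
def dynamic_latency(classifier_s: float, model_s: float) -> float:
    """Every request pays the classification hop before the chosen model runs."""
    return classifier_s + model_s


def fallback_latency(cheap_s: float, smart_s: float, fell_back: bool) -> float:
    """Only requests that fail the cheap model's check pay for the second call."""
    return cheap_s + smart_s if fell_back else cheap_s


def mean_fallback_latency(cheap_s: float, smart_s: float, fallback_rate: float) -> float:
    """Average latency when a given fraction of requests falls back."""
    return cheap_s + fallback_rate * smart_s


# Hypothetical timings in seconds: cheap 0.5, smart 3.0, 20% of requests fall back.
print(f"worst case: {fallback_latency(0.5, 3.0, True):.1f}s")
print(f"mean:       {mean_fallback_latency(0.5, 3.0, 0.20):.1f}s")
```

For a strict latency budget the worst case matters more than the mean: a request that falls back takes the cheap call plus the smart call.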