Selection · 12 min
How to pick an LLM for your workload
A decision tree that takes thirty minutes to walk and saves six months of switching costs.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Most teams pick a model the way most people pick tax software: they tried one once, it worked, and they never looked again. The cost of that is real. A team running the wrong model on a workload at scale can pay double for half the capability, then write a thousand-line prompt to compensate. This guide walks the decision tree.
Step one: characterise the workload
Three axes matter: capability ceiling, throughput, and privacy. Capability ceiling is the hardest task the model has to handle. Throughput is the number of calls per day at peak. Privacy is whether the data can leave your VPC.
Write these down. Operators who skip this step end up benchmarking the wrong axis and choosing models that excel on a dimension that does not matter for the workload.
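One way to "write these down" is a small record that travels with the rest of the evaluation. The sketch below is illustrative only: the class name, field names, and capability categories are assumptions chosen to mirror the three axes, not a standard schema.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    """Hardest kind of task the model must handle (hypothetical categories)."""
    AGENTIC_CODING = "agentic_coding"
    RESEARCH_REASONING = "research_reasoning"
    BULK_TEXT = "bulk_text"  # classification or summarisation at scale


@dataclass(frozen=True)
class WorkloadProfile:
    """The three axes from step one, recorded once so later steps can refer to them."""
    capability_ceiling: Capability
    peak_calls_per_day: int
    data_may_leave_vpc: bool

    def summary(self) -> str:
        privacy = "may leave VPC" if self.data_may_leave_vpc else "must stay in VPC"
        return (
            f"{self.capability_ceiling.value}, "
            f"{self.peak_calls_per_day:,} calls/day at peak, data {privacy}"
        )


# Example values are placeholders, not recommendations.
profile = WorkloadProfile(Capability.BULK_TEXT, 250_000, False)
print(profile.summary())
```

Freezing the dataclass is a deliberate choice here: the profile is meant to be fixed before benchmarking starts, so later steps cannot quietly redefine the workload.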
Step two: shortlist by capability
If the hardest tasks are agentic coding loops longer than thirty minutes, the shortlist is Claude Opus and Claude Sonnet. If the hardest tasks are math or research-grade reasoning, the shortlist is GPT-5 and Gemini 2.5 Pro. If the hardest tasks are classification or summarisation at scale, the shortlist is Haiku, GPT-4o mini, and the open-weights tier.
Do not let pricing eliminate candidates at this stage. A capability mismatch costs more than a price mismatch, every time.
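The shortlist rule above is a lookup keyed on the hardest task. A minimal sketch follows. The category keys and the `shortlist` function are hypothetical names, and the model lists simply restate this guide's shortlists as of its publication date. Price is deliberately absent from the lookup.

```python
# Shortlists as stated in this guide; the category keys are illustrative labels.
SHORTLISTS = {
    "agentic_coding": ["Claude Opus", "Claude Sonnet"],
    "research_reasoning": ["GPT-5", "Gemini 2.5 Pro"],
    "bulk_text": ["Haiku", "GPT-4o mini", "open-weights tier"],
}


def shortlist(hardest_task: str) -> list[str]:
    """Return candidate models for the workload's hardest task category."""
    try:
        # Copy so callers cannot mutate the shared table.
        return list(SHORTLISTS[hardest_task])
    except KeyError:
        raise ValueError(f"unknown task category: {hardest_task!r}") from None


print(shortlist("research_reasoning"))
```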
Step three: build your operator benchmark
Take a hundred real prompts from the workload. Run them against every shortlisted model. Score each output on a five-point scale. Compare the scores per model.
This single step distinguishes operators who choose well from operators who choose by leaderboard.
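The benchmark loop can be sketched in a few lines. Everything named here is an assumption for illustration: `run_benchmark`, the stub models, and the toy scorer stand in for real API calls and for a human or rubric-based grader.

```python
from statistics import mean
from typing import Callable

ModelFn = Callable[[str], str]        # prompt -> model output
ScoreFn = Callable[[str, str], int]   # (prompt, output) -> score from 1 to 5


def run_benchmark(
    prompts: list[str],
    models: dict[str, ModelFn],
    score: ScoreFn,
) -> dict[str, float]:
    """Run every prompt through every model and return the mean score per model."""
    results: dict[str, float] = {}
    for name, call_model in models.items():
        scores = []
        for prompt in prompts:
            s = score(prompt, call_model(prompt))
            if not 1 <= s <= 5:
                raise ValueError(f"score {s} is outside the five-point scale")
            scores.append(s)
        results[name] = float(mean(scores))
    return results


# Stub models and a toy scorer; replace with real calls and a real rubric.
def echo_model(prompt: str) -> str:
    return prompt


def terse_model(prompt: str) -> str:
    return prompt[:5]


def length_score(prompt: str, output: str) -> int:
    # Toy rule: full-length outputs score 5, truncated ones score 2.
    return 5 if len(output) >= len(prompt) else 2


prompts = ["classify this ticket", "summarise this thread"]
results = run_benchmark(
    prompts, {"model_a": echo_model, "model_b": terse_model}, length_score
)
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Passing models and the scorer in as plain callables keeps the harness independent of any vendor SDK, so adding a shortlist candidate means adding one dictionary entry.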
Step four: forecast the bill
Take the median tokens per call (input and output are usually priced separately), the daily volume, and the chosen model's pricing. Multiply. Add the reasoning-token budget if applicable. Subtract prompt-cache savings if applicable. The honest number is usually 2-5× the naive estimate.
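The arithmetic is worth writing out once. The sketch below rests on stated assumptions: prices are quoted per million tokens, reasoning tokens are billed at the output rate, and cached input tokens are billed at a separate lower rate. All prices are made-up placeholders, so check the provider's current price sheet before trusting any figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per million tokens (placeholder values, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


def daily_cost(
    calls_per_day: int,
    input_tokens: int,
    output_tokens: int,
    pricing: Pricing,
    reasoning_tokens: int = 0,
    cached_fraction: float = 0.0,
) -> float:
    """Estimate daily spend from median per-call token counts."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Assumption: reasoning tokens are billed like output tokens.
    per_call_token_dollars = (
        fresh * pricing.input_per_mtok
        + cached * pricing.cached_input_per_mtok
        + (output_tokens + reasoning_tokens) * pricing.output_per_mtok
    )
    return calls_per_day * per_call_token_dollars / 1_000_000


pricing = Pricing(input_per_mtok=2.0, output_per_mtok=10.0, cached_input_per_mtok=0.5)
naive = daily_cost(100_000, 2_000, 500, pricing)
adjusted = daily_cost(
    100_000, 2_000, 500, pricing, reasoning_tokens=1_500, cached_fraction=0.5
)
print(f"naive ${naive:,.2f}/day, adjusted ${adjusted:,.2f}/day")
```

With these placeholder inputs the reasoning budget outweighs the cache savings, and the adjusted figure comes out at 2.5 times the naive one, inside the 2-5× range quoted above.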
Step five: privacy and provisioning
If data privacy is non-negotiable, the choice narrows to a dedicated cloud deployment or self-hosted open-weights models. Each adds operational burden that operators should price in.
Frequently asked
How often should I re-do this exercise?
Quarterly at minimum. Frontier capability shifts faster than internal benchmarks can settle.
Is one model enough?
Rarely. Most serious deployments end up with two — a smart-and-slow tier and a fast-and-cheap tier — routed by task type.
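A two-tier setup routed by task type can be as simple as a table. This is a minimal sketch: the tier labels, task-type keys, and `route` function are hypothetical, and a production router would likely also consider latency budgets and fallbacks.

```python
SMART_TIER = "smart-and-slow"
FAST_TIER = "fast-and-cheap"

# Hypothetical task types mapped to the tier that should serve them.
ROUTES = {
    "agentic_coding": SMART_TIER,
    "research_reasoning": SMART_TIER,
    "classification": FAST_TIER,
    "summarisation": FAST_TIER,
}


def route(task_type: str, default: str = SMART_TIER) -> str:
    """Pick a tier for a task; unknown task types fall back to the default tier."""
    return ROUTES.get(task_type, default)


print(route("classification"))
print(route("contract_review"))
```

Defaulting unknown task types to the smart tier trades cost for safety. Flipping the default is reasonable when volume dominates and mistakes are cheap.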