How to pick an LLM for your workload

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Most teams pick a model the way most people pick tax software: they tried one once, it worked, and they never looked again. That habit has a real cost. A team running the wrong model on a workload at scale pays double for half the capability, then writes a thousand-line prompt to compensate. This guide walks through the decision tree.

Step one: characterise the workload

Three axes matter: capability ceiling, throughput, and privacy. Capability ceiling is what the model has to do at its hardest. Throughput is how many calls per day at peak. Privacy is whether the data can leave your VPC.

Write these down. Operators who skip this step end up benchmarking the wrong axis and choosing a model that excels on a dimension that does not matter for their workload.
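One way to "write these down" is a small record that travels with the project. The sketch below is illustrative only: the class name `WorkloadProfile`, its field names, and the example values are assumptions, not part of any library or of the guide's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """The three axes from step one, recorded in one place (hypothetical schema)."""

    hardest_task: str          # capability ceiling: the hardest thing the model must do
    peak_calls_per_day: int    # throughput: calls per day at peak
    data_may_leave_vpc: bool   # privacy: whether the data can leave your VPC

    def validate(self) -> None:
        """Reject profiles that leave an axis blank or nonsensical."""
        if not self.hardest_task.strip():
            raise ValueError("describe the hardest task the model must handle")
        if self.peak_calls_per_day <= 0:
            raise ValueError("peak_calls_per_day must be positive")


# Example values are made up for illustration.
profile = WorkloadProfile(
    hardest_task="multi-file refactor in an agentic coding loop",
    peak_calls_per_day=40_000,
    data_may_leave_vpc=False,
)
profile.validate()
```

Keeping the profile frozen makes it awkward to quietly revise the axes after benchmarking has started, which is the failure this step is meant to prevent.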

Step two: shortlist by capability

If the hardest tasks are agentic coding loops longer than thirty minutes, the shortlist is Claude Opus and Claude Sonnet. If the hardest tasks are math or research-grade reasoning, the shortlist is GPT-5 and Gemini 2.5 Pro. If the hardest tasks are classification or summarisation at scale, the shortlist is Haiku, GPT-4o mini, and the open-weights tier.

Do not let pricing eliminate candidates at this stage. A capability mismatch costs more than a price mismatch, every time.

Step three: build your operator benchmark

Take a hundred real prompts from the workload. Run them against every model on the shortlist. Score each output on a five-point scale. Compare the results model by model.

This single step distinguishes operators who choose well from operators who choose by leaderboard.
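The benchmark in this step can be sketched as a short loop. This is a minimal sketch under stated assumptions: a "model" here is any Python function from prompt to completion, and the scorer is a function you supply (a human rubric or a grading script) that returns an integer from 1 to 5. The names `run_benchmark`, `echo_model`, `shouting_model`, and `toy_scorer` are hypothetical, and the toy models stand in for real API calls.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-ins: swap these for real API clients and a real rubric.
Model = Callable[[str], str]
Scorer = Callable[[str, str], int]


def run_benchmark(
    prompts: List[str], models: Dict[str, Model], score: Scorer
) -> Dict[str, float]:
    """Return the mean five-point score per model over the same prompts."""
    results: Dict[str, float] = {}
    for name, model in models.items():
        scores = []
        for prompt in prompts:
            s = score(prompt, model(prompt))
            if not 1 <= s <= 5:
                raise ValueError(f"score {s} is outside the five-point scale")
            scores.append(s)
        results[name] = float(mean(scores))
    return results


# Toy demonstration with fake models and a fake scorer.
def echo_model(prompt: str) -> str:
    return prompt


def shouting_model(prompt: str) -> str:
    return prompt.upper()


def toy_scorer(prompt: str, output: str) -> int:
    # Pretend that an exact echo is the ideal answer.
    return 5 if output == prompt else 2


demo = run_benchmark(
    prompts=["summarise ticket 1", "summarise ticket 2"],
    models={"model_a": echo_model, "model_b": shouting_model},
    score=toy_scorer,
)
```

Every model sees the same prompts and the same scorer, so differences in the result reflect the models rather than the test set. A mean is the simplest summary; looking at the lowest-scoring prompts per model is often more informative.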

Step four: forecast the bill

Take the median tokens per call, the daily volume, and the chosen model's pricing. Multiply. Add the reasoning-token budget if applicable. Subtract prompt-cache savings if applicable. The honest number is usually 2-5× the naive estimate.
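The arithmetic in this step can be sketched as a single function. Everything here is an assumption for illustration: the function name `daily_cost_usd`, the parameter names, and the example prices are made up. It also assumes reasoning tokens are billed at the output rate and that cached input tokens are billed at a fraction of the input rate; check your provider's pricing page, because billing rules differ.

```python
def daily_cost_usd(
    calls_per_day: int,
    median_input_tokens: int,
    median_output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    reasoning_tokens_per_call: int = 0,
    cached_input_fraction: float = 0.0,
    cached_price_multiplier: float = 1.0,
) -> float:
    """Estimate daily spend in USD from median token counts and per-million-token prices."""
    if not 0.0 <= cached_input_fraction <= 1.0:
        raise ValueError("cached_input_fraction must be between 0 and 1")

    # Split input tokens into cached and uncached portions.
    cached = median_input_tokens * cached_input_fraction
    uncached = median_input_tokens - cached
    input_cost = (uncached + cached * cached_price_multiplier) * input_price_per_mtok

    # Assumption: reasoning tokens are billed like output tokens.
    billed_output = median_output_tokens + reasoning_tokens_per_call
    output_cost = billed_output * output_price_per_mtok

    # Prices are per million tokens, so divide once at the end.
    return calls_per_day * (input_cost + output_cost) / 1_000_000


# Made-up prices: 1.00 USD per million input tokens, 4.00 USD per million output tokens.
naive = daily_cost_usd(10_000, 2_000, 500, 1.0, 4.0)
with_reasoning = daily_cost_usd(10_000, 2_000, 500, 1.0, 4.0, reasoning_tokens_per_call=1_000)
with_cache = daily_cost_usd(
    10_000, 2_000, 500, 1.0, 4.0,
    reasoning_tokens_per_call=1_000,
    cached_input_fraction=0.5,
    cached_price_multiplier=0.1,
)
```

Comparing `naive` with `with_reasoning` shows how quickly a reasoning budget moves the bill. Medians also hide long-tail calls, so running the same function with a high-percentile token count gives a rough upper bound.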

Step five: privacy and provisioning

If data privacy is non-negotiable, the choice narrows to a dedicated cloud deployment or self-hosted open-weights models. Each adds operational burden that operators should price into the comparison.

Frequently asked

How often should I re-do this exercise?

Quarterly at minimum. Frontier capability shifts faster than internal benchmarks can settle.

Is one model enough?

Rarely. Most serious deployments end up with two — a smart-and-slow tier and a fast-and-cheap tier — routed by task type.
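Routing by task type can be as simple as a lookup table. The sketch below is illustrative: the task-type labels, the `Tier` enum, and the `route` function are hypothetical names, and the fallback to the smart tier for unknown tasks is one possible design choice rather than a rule.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    SMART_SLOW = "smart-and-slow"
    FAST_CHEAP = "fast-and-cheap"


# Hypothetical task-type labels; use whatever taxonomy your workload already has.
ROUTES: Dict[str, Tier] = {
    "classification": Tier.FAST_CHEAP,
    "summarisation": Tier.FAST_CHEAP,
    "agentic_coding": Tier.SMART_SLOW,
    "research_reasoning": Tier.SMART_SLOW,
}


def route(task_type: str, default: Tier = Tier.SMART_SLOW) -> Tier:
    """Pick a tier by task type, falling back to the default for unknown types."""
    return ROUTES.get(task_type, default)


chosen = route("classification")
fallback = route("something_new")
```

Defaulting unknown task types to the smart tier trades cost for safety. A deployment that is more sensitive to cost than to quality might default the other way.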
