Operations · 10 min
Evaluating an agent the way operators actually do
Capability benchmarks measure capability. Operators want to measure deployability. The two are different.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Operators care about three measures benchmarks rarely surface: cost-per-correct-answer, time-to-correct-answer, and failure-mode quality. This guide describes how to measure all three.
Cost-per-correct-answer
The total bill for a workload divided by the number of correct outcomes. It replaces the misleading 'cost per call' and shows what an expensive-but-accurate model is worth against a cheap-but-flaky one.
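As a minimal sketch of that division, the snippet below compares two made-up models. The function name `cost_per_correct`, the dollar totals, and the correct counts are all assumptions for illustration, not measurements.

```python
def cost_per_correct(total_cost: float, correct: int) -> float:
    """Total workload bill divided by the number of correct outcomes."""
    if correct == 0:
        # No correct outcomes: the metric is unbounded rather than zero.
        return float("inf")
    return total_cost / correct


# Hypothetical workload of 1,000 calls per model (illustrative numbers only).
# Cheap model: $10.00 total bill, 400 correct outcomes.
cheap = cost_per_correct(total_cost=10.0, correct=400)
# Accurate model: $20.00 total bill, 950 correct outcomes.
accurate = cost_per_correct(total_cost=20.0, correct=950)

print(f"cheap:    ${cheap:.4f} per correct answer")
print(f"accurate: ${accurate:.4f} per correct answer")
```

With these invented numbers the cheap model costs half as much per call but more per correct answer, which is the gap 'cost per call' hides.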
Time-to-correct-answer
Wall-clock time from request to correct outcome. For agent loops this includes retries. A fast-but-wrong model can have a worse time-to-correct than a slow-but-right one.
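One way to picture the retry effect is to sum attempt durations until the first correct one. This is a sketch under assumptions: the `Attempt` record, the `time_to_correct` helper, and the timings are hypothetical, and it treats retries as strictly sequential.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Attempt:
    seconds: float   # wall-clock duration of this attempt
    correct: bool    # whether the attempt produced a correct outcome


def time_to_correct(attempts: Sequence[Attempt]) -> Optional[float]:
    """Elapsed time up to and including the first correct attempt.

    Returns None if no attempt was correct.
    """
    elapsed = 0.0
    for attempt in attempts:
        elapsed += attempt.seconds
        if attempt.correct:
            return elapsed
    return None


# Hypothetical traces: a fast model that needs three retries
# versus a slow model that is right the first time.
fast_but_wrong = [Attempt(2.0, False), Attempt(2.0, False),
                  Attempt(2.0, False), Attempt(2.0, True)]
slow_but_right = [Attempt(5.0, True)]

print(time_to_correct(fast_but_wrong))
print(time_to_correct(slow_but_right))
```

In this made-up trace the fast model takes 8 seconds to reach a correct outcome and the slow one takes 5, even though each fast attempt is quicker.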
Failure-mode quality
When the agent fails, does it fail loudly, quietly, or wrongly? Loudly is fine: operator intervenes. Quietly is acceptable: operator finds the gap on review. Wrongly is dangerous: operator believes a wrong answer.
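A rough way to track this is to label each eval outcome and report how the failures split across the three modes. The `Outcome` labels and the `failure_profile` helper below are hypothetical, and the sketch assumes a reviewer or grader has already assigned a label to every run.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Outcome(Enum):
    CORRECT = "correct"
    LOUD = "loud"     # fails visibly, so the operator intervenes
    QUIET = "quiet"   # fails silently, so the gap is found on review
    WRONG = "wrong"   # returns a wrong answer the operator may believe


def failure_profile(outcomes: Sequence[Outcome]) -> Dict[Outcome, float]:
    """Share of each failure mode among failed runs only."""
    failures = [o for o in outcomes if o is not Outcome.CORRECT]
    if not failures:
        return {}
    counts = Counter(failures)
    modes = (Outcome.LOUD, Outcome.QUIET, Outcome.WRONG)
    return {mode: counts[mode] / len(failures) for mode in modes}


# Hypothetical labels for ten runs (illustrative only).
labels = ([Outcome.CORRECT] * 6 + [Outcome.LOUD] * 2
          + [Outcome.QUIET] + [Outcome.WRONG])
profile = failure_profile(labels)

for mode, share in profile.items():
    print(f"{mode.value}: {share:.0%} of failures")
```

Two agents with the same accuracy can have very different profiles; the share labeled `WRONG` is the one to watch most closely.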
Frequently asked
How big should my eval set be?
Big enough that two consecutive runs agree to within 2 points. Usually a few hundred prompts.
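The agreement check can be sketched as follows, assuming "points" means percentage points of accuracy and that each run yields one pass/fail result per prompt. The helper names, the 2.0 default tolerance, and the sample runs are illustrative assumptions.

```python
from typing import Sequence


def accuracy_points(results: Sequence[bool]) -> float:
    """Accuracy as a percentage (0 to 100) over per-prompt pass/fail results."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


def runs_agree(run_a: Sequence[bool], run_b: Sequence[bool],
               tolerance: float = 2.0) -> bool:
    """True if two runs' accuracies differ by at most `tolerance` points."""
    return abs(accuracy_points(run_a) - accuracy_points(run_b)) <= tolerance


# Hypothetical 200-prompt eval set: 85.0% vs 83.5%, a 1.5-point gap.
large_a = [True] * 170 + [False] * 30
large_b = [True] * 167 + [False] * 33

# Hypothetical 10-prompt eval set: one flipped prompt moves accuracy 10 points.
small_a = [True] * 8 + [False] * 2
small_b = [True] * 7 + [False] * 3

print(runs_agree(large_a, large_b))
print(runs_agree(small_a, small_b))
```

On a tiny set a single flipped prompt swamps the tolerance, which is why the set has to grow until consecutive runs settle.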