The Domesday Book of KWJ · AI

Operations · 10 min

Evaluating an agent the way operators actually do

Capability benchmarks measure capability. Operators want to measure deployability. The two are different.

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Operators care about three measures benchmarks rarely surface: cost-per-correct-answer, time-to-correct-answer, and failure-mode quality. This guide describes how to measure all three.

Cost-per-correct-answer

The total bill for a workload divided by the number of correct outcomes. It replaces the misleading 'cost per call' and shows when an expensive-but-accurate model is cheaper per useful result than a cheap-but-flaky one.
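A minimal sketch of the calculation, assuming you already have a total bill and a count of correct outcomes per workload. The `WorkloadResult` type, the model labels, and every number below are hypothetical and chosen only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass
class WorkloadResult:
    """One evaluated workload. All fields are illustrative assumptions."""
    name: str
    total_bill_usd: float   # everything billed for the workload
    correct_outcomes: int   # outcomes graded as correct


def cost_per_correct(result: WorkloadResult) -> float:
    """Total bill divided by correct outcomes; infinite if nothing was correct."""
    if result.correct_outcomes == 0:
        return float("inf")
    return result.total_bill_usd / result.correct_outcomes


# Hypothetical figures: the cheap model costs half as much per call
# but answers far fewer prompts correctly.
cheap = WorkloadResult("cheap-but-flaky", total_bill_usd=2.00, correct_outcomes=400)
accurate = WorkloadResult("expensive-but-accurate", total_bill_usd=4.00, correct_outcomes=950)

for r in (cheap, accurate):
    print(f"{r.name}: ${cost_per_correct(r):.4f} per correct answer")
```

With these made-up numbers the model with the larger bill comes out cheaper per correct answer, which is the effect the 'cost per call' figure hides.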

Time-to-correct-answer

Wall-clock time from the initial request to a correct outcome. For agent loops this includes retries. A fast-but-wrong model can have a worse time-to-correct than a slow-but-right one.
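One way to sketch this, assuming each task is logged as an ordered list of attempts with a duration and a graded result. The log shape and the timings are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Each attempt is (duration in seconds, was the outcome correct?).
Attempt = Tuple[float, bool]


def time_to_correct(attempts: Sequence[Attempt]) -> Optional[float]:
    """Sum wall-clock time across attempts up to and including the first
    correct one. Returns None if no attempt was ever correct."""
    elapsed = 0.0
    for duration_s, is_correct in attempts:
        elapsed += duration_s
        if is_correct:
            return elapsed
    return None


# Hypothetical logs: the fast model needs two retries, the slow one does not.
fast_but_wrong = [(2.0, False), (2.0, False), (2.0, True)]
slow_but_right = [(5.0, True)]

print(time_to_correct(fast_but_wrong))
print(time_to_correct(slow_but_right))
```

Here the model that is faster per call ends up slower to a correct answer once its retries are counted. This sketch ignores any delay between retries, which a real harness would also need to include.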

Failure-mode quality

When the agent fails, does it fail loudly, quietly, or wrongly? Loudly is fine: operator intervenes. Quietly is acceptable: operator finds the gap on review. Wrongly is dangerous: operator believes a wrong answer.
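The three categories can be tallied over a batch of failed cases. The sketch below rests on an assumed mapping that the guide does not spell out: a failure that raised or surfaced an error counts as loud, a failure that produced no answer and no error counts as quiet, and a failure that returned an answer with no error counts as wrong. The field names are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, Tuple


class FailureMode(Enum):
    LOUD = "loud"    # operator is alerted and can intervene
    QUIET = "quiet"  # gap is only found on review
    WRONG = "wrong"  # operator may believe an incorrect answer


def classify_failure(raised_error: bool, produced_answer: bool) -> FailureMode:
    """Assumed mapping from observable signals to a failure mode."""
    if raised_error:
        return FailureMode.LOUD
    if not produced_answer:
        return FailureMode.QUIET
    return FailureMode.WRONG


def failure_profile(failures: Iterable[Tuple[bool, bool]]) -> Dict[FailureMode, float]:
    """Share of each failure mode among failed cases (empty dict if none)."""
    counts = Counter(classify_failure(err, ans) for err, ans in failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: counts[mode] / total for mode in FailureMode}


# Hypothetical failed cases as (raised_error, produced_answer).
failed_cases = [(True, False), (False, False), (False, True), (False, True)]
profile = failure_profile(failed_cases)
for mode, share in profile.items():
    print(f"{mode.value}: {share:.0%}")
```

Tracking the share of wrong failures separately from the overall failure rate keeps the dangerous category visible even when total accuracy looks good.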

Frequently asked

How big should my eval set be?

Big enough that two consecutive runs agree to within 2 points. Usually a few hundred prompts.
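A rough check for this rule, assuming "points" means percentage points of accuracy and that each run is recorded as a list of pass/fail outcomes over the same prompts. The run data is hypothetical.

```python
from typing import Sequence


def accuracy_pct(outcomes: Sequence[bool]) -> float:
    """Accuracy of one run as a percentage of prompts answered correctly."""
    if not outcomes:
        raise ValueError("run has no outcomes")
    return 100.0 * sum(outcomes) / len(outcomes)


def runs_agree(run_a: Sequence[bool], run_b: Sequence[bool],
               tolerance_points: float = 2.0) -> bool:
    """True if two runs' accuracies differ by at most tolerance_points."""
    return abs(accuracy_pct(run_a) - accuracy_pct(run_b)) <= tolerance_points


# Hypothetical consecutive runs over the same 300 prompts.
run_1 = [True] * 240 + [False] * 60   # 240 of 300 correct
run_2 = [True] * 236 + [False] * 64   # 236 of 300 correct

print(runs_agree(run_1, run_2))
```

If the check fails, growing the eval set and re-running both passes is the obvious next step; a single agreeing pair is weak evidence, so repeating the comparison a few times is a reasonable precaution.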

