Benchmarks: which ones to trust, which to discount
Every benchmark eventually gets gamed. The trick is knowing which are useful in their current form.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Benchmarks tell you something, but almost never what their marketing implies. The honest discipline is to maintain a private benchmark built from your own workload, score every model against it, and treat published benchmarks as priors, not evidence.
Public general-capability (MMLU, MMLU-Pro, GPQA)
Largely saturated at the top of the leaderboard. Useful for ruling out incompetence, not for distinguishing capability among frontier models.
Coding (SWE-bench, HumanEval, MBPP)
SWE-bench is the gold-standard agentic-coding benchmark as of mid-2026. HumanEval and MBPP are both contaminated, so scores on them are unreliable.
Reasoning (AIME, IMO, ICPC)
These have shifted from research curiosities to flagship marketing benchmarks. The frontier labs all train on previous years' material, so results on a contest held after a model's training cutoff are the only honest signal.
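The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Problem`, `held_out`, the problem names, and every date below are hypothetical, and in practice a model's training cutoff is often only approximately known.

```python
from datetime import date
from typing import List, NamedTuple


class Problem(NamedTuple):
    name: str        # hypothetical identifier for a contest problem
    released: date   # date the problem first became public


def held_out(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Illustrative data only; these names and dates are made up.
PROBLEMS = [
    Problem("contest-a-q1", date(2024, 2, 1)),
    Problem("contest-b-q1", date(2025, 2, 1)),
    Problem("contest-c-q1", date(2026, 2, 1)),
]

FRESH = held_out(PROBLEMS, training_cutoff=date(2025, 6, 30))
```

Scores computed only over `FRESH` cannot have been inflated by training on the published problems, whatever else may be wrong with them.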
Long-context (NIAH, RULER, BABILong)
Needle-in-a-haystack (NIAH) tests recall, not reasoning. RULER and BABILong are closer to operator needs. Most published 'we support 1M context' claims rest on NIAH results only.
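To see why a needle-in-a-haystack result says little about reasoning, it helps to look at how small such a test is. The sketch below is a simplified illustration, not the code of any published harness; `build_haystack`, `recalled`, the filler sentence, and the secret value are all assumptions made for the example.

```python
def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Bury `needle` among `n_filler` copies of `filler`.

    `depth` is the relative insertion point: 0.0 puts the needle first,
    1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(int(depth * n_filler), needle)
    return " ".join(sentences)


def recalled(answer: str, secret: str) -> bool:
    """Pass if the model's answer repeats the secret verbatim."""
    return secret in answer


# Hypothetical example: four filler sentences, needle in the middle.
HAYSTACK = build_haystack(
    needle="The code is 7421.",
    filler="The sky is blue.",
    n_filler=4,
    depth=0.5,
)
```

The pass condition is a substring match on a fact stated once in the context. Nothing in it requires combining facts, tracking state, or ignoring distractors, which is the gap that multi-step long-context suites try to close.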
Operator benchmarks
The only reliable signal. A few hundred prompts from your actual workload, scored by you, run against every candidate model quarterly. Build this before you commit to a provider.
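A private benchmark of this kind needs very little machinery. The following is a minimal sketch under stated assumptions: each model is represented as a plain Python callable from prompt to output, and each case carries its own pass/fail check. `Case`, `score_model`, `compare`, and the two stand-in models are hypothetical names; a real harness would call a provider's API where the stand-ins are.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]  # assumption: a model is a prompt -> output function


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def score_model(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose output passes that case's own check."""
    if not cases:
        raise ValueError("benchmark has no cases")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Score every candidate model against the same set of cases."""
    return {name: score_model(model, cases) for name, model in models.items()}


# Stand-in models for illustration only; replace with real API calls.
def echo_model(prompt: str) -> str:
    return prompt


def upper_model(prompt: str) -> str:
    return prompt.upper()


CASES = [
    Case("return the word ok", lambda out: "ok" in out),
    Case("say HELLO", lambda out: "HELLO" in out),
]

SCORES = compare({"echo": echo_model, "upper": upper_model}, CASES)
```

Keeping the check next to each prompt means the scoring rules are yours and stay fixed between quarterly runs, so a change in score reflects a change in the model rather than a change in the grader.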
Tells
| Marker | Meaning |
|---|---|
| Published score above 95% on an old benchmark | Probably training-set contamination. |
| Lab announces a benchmark and tops it at release | Worth waiting for an independent reproduction. |
Frequently asked
Is LMArena trustworthy?
Useful for vibes, not for tasks. Vibes don't measure tool-use, schema-adherence, or refusal calibration.
Should I share my operator benchmark with the labs?
No. The moment you do, the next model is trained against it.