Benchmarks: which ones to trust, which to discount

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Benchmarks tell you something, but almost never what their marketing implies. The honest discipline is to maintain a private benchmark built from your own workload, score every model against it, and treat published benchmarks as priors, not evidence.

Public general-capability (MMLU, MMLU-Pro, GPQA)

Largely saturated at the top of the leaderboard. Useful for ruling out incompetence, not for distinguishing capability among frontier models.

Coding (SWE-bench, HumanEval, MBPP)

SWE-bench is the gold-standard agentic-coding benchmark as of mid-2026. HumanEval and MBPP are both contaminated, so scores on either are unreliable.

Reasoning (AIME, IMO, ICPC)

These have shifted from research curiosities to flagship marketing benchmarks. The frontier labs all train on previous years' problems, so results on a contest held after a model was trained are the only honest signal.
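One way to apply this rule to your own reading of a results table is to keep only problems released after the model's training cutoff. The sketch below assumes you know, or can estimate, that cutoff date; the `Problem` type, its fields, and the `held_out` helper are hypothetical names used for illustration, and the dates in the usage section are placeholders, not real contest dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    source: str      # hypothetical label for a contest or problem set
    released: date   # date the problems became public


def held_out(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the training cutoff.

    Anything released on or before the cutoff may have been trained on,
    so it is excluded from the honest-signal set.
    """
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    # Placeholder data: labels and dates are invented for the example.
    problems = [
        Problem("contest-A", date(2025, 2, 1)),
        Problem("contest-B", date(2026, 2, 1)),
    ]
    cutoff = date(2025, 9, 1)  # assumed cutoff for some candidate model
    for p in held_out(problems, cutoff):
        print(p.source, p.released.isoformat())
```

A score computed only over the `held_out` set is usually based on far fewer problems, so it is noisier; that is the price of removing the contamination risk.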

Long-context (NIAH, RULER, BABILong)

Needle-in-a-haystack (NIAH) tests recall, not reasoning. RULER and BABILong are closer to operator needs. Most published 'we support 1M context' claims rest on NIAH results only.
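To see why a needle-in-a-haystack result says little about reasoning, it helps to look at how such a test is typically built. The sketch below is a simplified illustration, not the procedure of any particular published benchmark: `build_niah_prompt` and `niah_pass` are hypothetical helpers, and the filler and needle sentences are invented. The answer sits verbatim in the context, so passing only requires finding and copying it.

```python
def build_niah_prompt(filler: str, needle: str, question: str,
                      n_filler: int, depth: float) -> str:
    """Bury one needle sentence inside repeated filler, then ask about it.

    depth is the relative position of the needle: 0.0 puts it at the
    start of the context, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + "\n\n" + question


def niah_pass(output: str, expected: str) -> bool:
    """Pass if the expected string appears in the output (pure recall)."""
    return expected.lower() in output.lower()


if __name__ == "__main__":
    prompt = build_niah_prompt(
        filler="The sky is blue.",
        needle="The access code is 7413.",
        question="What is the access code?",
        n_filler=1000,
        depth=0.5,
    )
    print(len(prompt.split()), "words in prompt")
    print(niah_pass("The access code is 7413.", "7413"))
```

Nothing in this setup asks the model to combine facts spread across the context, which is the gap that multi-step long-context benchmarks try to cover.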

Operator benchmarks

The only reliable signal. A few hundred prompts from your actual workload, scored by you, run against every candidate model quarterly. Build this before you commit to a provider.
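A minimal sketch of what such a harness could look like, assuming each candidate model is wrapped as a plain Python callable that takes a prompt and returns text. The names `Case`, `score_model`, and `rank_models` are hypothetical, the per-case pass/fail check stands in for whatever scoring you actually apply, and the two lambda "models" are stand-ins for real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # assumed wrapper: prompt in, completion out


@dataclass(frozen=True)
class Case:
    prompt: str                    # taken from your real workload
    check: Callable[[str], bool]   # your own pass/fail rule for this prompt


def score_model(model: Model, cases: List[Case]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("benchmark has no cases")
    passed = sum(1 for c in cases if c.check(model(c.prompt)))
    return passed / len(cases)


def rank_models(models: Dict[str, Model],
                cases: List[Case]) -> List[Tuple[str, float]]:
    """Score every candidate on the same cases, best score first."""
    scores = {name: score_model(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    cases = [
        Case("Reply with the word OK.", lambda out: out.strip() == "OK"),
        Case("What is 2 + 2? Answer with a number only.",
             lambda out: out.strip() == "4"),
    ]
    # Stand-in models for illustration only; replace with real clients.
    models: Dict[str, Model] = {
        "model_a": lambda p: "OK" if "OK" in p else "4",
        "model_b": lambda p: "OK",
    }
    for name, score in rank_models(models, cases):
        print(f"{name}: {score:.2f}")
```

Keeping the cases and checks in your own code, rather than in a shared dataset, is what lets you rerun the same comparison each quarter without the prompts leaking.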

Tells

Marker	Meaning
Published score above 95% on an old benchmark	Probably training-set contamination.
Lab announces a benchmark and tops it at release	Worth waiting for an independent reproduction.

Frequently asked

Is LMArena trustworthy?

Useful for vibes, not for tasks. Vibes don't measure tool-use, schema-adherence, or refusal calibration.

Should I share my operator benchmark with the labs?

No. The moment you do, the next model is trained against it.
