Evaluation · 7 min
Benchmarks Nobody Uses That Nobody Should Dismiss
By C.W. Jameson · Published 5 December 2025 · Last reviewed 5 December 2025
MMLU is the benchmark every press release cites. HellaSwag is the one I actually look at first.
HellaSwag, WinoGrande, GSM8K, HumanEval: why the obscure benchmarks predict real-world performance better than MMLU.