The Domesday Book of KWJ · AI
Evaluation · 7 min

Benchmarks Nobody Uses That Nobody Should Dismiss

By C.W. Jameson · Published 5 December 2025 · Last reviewed 5 December 2025

MMLU is the benchmark every press release cites. HellaSwag is the one I actually look at first.

HellaSwag, WinoGrande, GSM8K, HumanEval: why the obscure benchmarks predict real-world performance better than MMLU.