Evaluation · 8 min

What SWE-bench Verified Actually Tests (and What It Doesn't)

By C.W. Jameson · Published 28 January 2026 · Last reviewed 28 January 2026

SWE-bench Verified measures pull-request-level code repair on real GitHub issues. It measures that very precisely. It measures nothing else.

Inside SWE-bench Verified: the task distribution, the harness, what scores mean for operator selection, and the gaps.
