Evaluation · 8 min
What SWE-bench Verified Actually Tests (and What It Doesn't)
By C.W. Jameson · Published 28 January 2026 · Last reviewed 28 January 2026
SWE-bench Verified measures pull-request-level code repair on real GitHub issues. It measures that very precisely. It measures nothing else.
Inside SWE-bench Verified: the task distribution, the harness, what scores mean for operator selection, and the gaps.