The Domesday Book of KWJ · AI

Context-window claims, real and theoretical

Published context lengths and effective context lengths are different numbers. Here is how to test the gap.

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Every frontier lab publishes a maximum context length. Some of those numbers are useful. Others are marketing. The needle-in-a-haystack benchmark is the standard tool for distinguishing them, but it tests recall, not reasoning. A model can recall a fact buried at token 950,000 and still fail to reason across two facts buried near each other. Below is the methodology.

The advertised number

Frontier model cards report a maximum context window. As of mid-2026: Claude Opus 4.7 reports 1M tokens (200K standard, 1M opt-in), Gemini 2.5 Pro reports 2M, GPT-5 reports 400K, GPT-4o reports 128K, DeepSeek R1 reports 64K. These are the numbers that go in marketing copy.

The needle-in-a-haystack test

Inject a single fact ('the secret code is 4719') at varying depths into a long context, ask for it, and measure recall at each depth and context length. The result is a heatmap of depth against length. Models that report 1M often retain useful recall to about 60-80% of the window; past that, recall degrades.
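The procedure above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: `ask_model` is a hypothetical callable standing in for whatever client you use, the filler sentence is arbitrary, and context length is counted in filler sentences rather than tokens (a real run would size the haystack with the model's tokenizer).

```python
from typing import Callable, Dict, Iterable, Tuple

FILLER = "The quick brown fox jumps over the lazy dog."
NEEDLE = "The secret code is 4719."
QUESTION = "What is the secret code?"
ANSWER = "4719"


def build_haystack(n_sentences: int, depth: float, needle: str = NEEDLE) -> str:
    """Return filler text with the needle inserted at a fractional depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def recall_grid(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    lengths: Iterable[int],
    depths: Iterable[float],
    trials: int = 1,
) -> Dict[Tuple[int, float], float]:
    """Map (length, depth) to the fraction of trials where the answer came back."""
    depths = list(depths)
    grid: Dict[Tuple[int, float], float] = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(n, d) + "\n\n" + QUESTION
            hits = sum(ANSWER in ask_model(prompt) for _ in range(trials))
            grid[(n, d)] = hits / trials
    return grid
```

Each cell of the returned grid is one square of the heatmap. Substring matching on the answer is a deliberately crude scorer; a stricter harness would vary the needle per trial so a model cannot score by guessing.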

The multi-hop test

Inject two related facts at different depths. Require the model to combine them. Most models that score 100% on single-needle drop to 60-80% on two-needle even at modest depths. This is the practical limit operators care about.
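A two-needle variant might look like the sketch below. The facts, the distractor, and the `ask_model` callable are all hypothetical choices for illustration. The distractor is there so that recalling a single number is not enough: the model has to link the key to Mara and Mara to the locker.

```python
from typing import Callable, Dict, Iterable, List, Tuple

FILLER = "The sky was grey over the harbour that morning."
FACT_A = "Mara's locker number is 4719."
FACT_B = "The spare key is kept in the locker owned by Mara."
DISTRACTOR = "Ivo's locker number is 2306."
QUESTION = "What is the number of the locker that holds the spare key?"
ANSWER, WRONG = "4719", "2306"


def build_prompt(n_sentences: int, placed: List[Tuple[float, str]], question: str) -> str:
    """Insert each (depth, fact) pair into filler text, then append the question."""
    sentences = [FILLER] * n_sentences
    # Insert the deepest fact first so earlier positions are not shifted.
    for depth, fact in sorted(placed, reverse=True):
        sentences.insert(round(depth * n_sentences), fact)
    return " ".join(sentences) + "\n\n" + question


def two_hop_accuracy(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    n_sentences: int,
    depth_pairs: Iterable[Tuple[float, float]],
    trials: int = 1,
) -> Dict[Tuple[float, float], float]:
    """Map (depth of fact A, depth of fact B) to the fraction of correct replies."""
    results: Dict[Tuple[float, float], float] = {}
    for depth_a, depth_b in depth_pairs:
        placed = [(depth_a, FACT_A), (depth_b, FACT_B), (0.5, DISTRACTOR)]
        prompt = build_prompt(n_sentences, placed, QUESTION)
        hits = 0
        for _ in range(trials):
            reply = ask_model(prompt)
            # Correct only if the right number appears and the distractor does not.
            hits += int(ANSWER in reply and WRONG not in reply)
        results[(depth_a, depth_b)] = hits / trials
    return results
```

Running this alongside a single-needle grid at the same lengths shows the gap directly: the difference between the two scores is the multi-hop penalty.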

The cost wall

Per-token pricing past a certain depth doubles or triples. Claude Opus past 200K is double the rate. Gemini past 200K is double. Operators rarely use the full window even when the model is capable of it, because the bill already scales linearly with tokens sent and the rate step makes long prompts cost more than linearly.
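The arithmetic of the cost wall can be sketched as follows. The rates are placeholders, not any provider's published prices, and the `whole_prompt` flag is an assumption about billing: some schemes charge the premium rate on the entire prompt once it crosses the threshold, others only on the tokens past it. Check the provider's pricing page for which applies.

```python
def input_cost_usd(
    prompt_tokens: int,
    base_rate_per_mtok: float,   # placeholder rate in USD per million input tokens
    threshold: int = 200_000,    # depth at which the premium rate starts
    multiplier: float = 2.0,     # "double the rate" past the threshold
    whole_prompt: bool = True,   # assumption: premium applies to the entire prompt
) -> float:
    """Estimate input cost for one request under a two-tier price schedule."""
    if prompt_tokens <= threshold:
        micro_usd = prompt_tokens * base_rate_per_mtok
    elif whole_prompt:
        micro_usd = prompt_tokens * base_rate_per_mtok * multiplier
    else:
        over = prompt_tokens - threshold
        micro_usd = threshold * base_rate_per_mtok + over * base_rate_per_mtok * multiplier
    return micro_usd / 1_000_000
```

With a placeholder rate of $10 per million tokens, a 100K prompt costs $1, while a 400K prompt costs $8 under whole-prompt billing or $6 under marginal billing: four times the tokens for six to eight times the price.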

Tells

Marker | Meaning
--- | ---
Model recalls a fact at 90% depth but cannot combine it with a fact at 10% depth | Effective multi-hop ceiling is below the advertised length.
Latency rises faster than linearly past 200K tokens, even on 'long-context' models | Quadratic attention, not a Mamba or linear-attention substitute.
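One way to check the latency tell is to estimate how latency grows with prompt length. The sketch below fits a straight line to log-latency against log-tokens; the slope is the growth exponent, with a value near 1 meaning roughly linear growth and a value near 2 meaning quadratic. The function name is illustrative, and the measurements are assumed to come from your own timing runs.

```python
import math
from typing import Sequence


def scaling_exponent(token_counts: Sequence[int], latencies_s: Sequence[float]) -> float:
    """Least-squares slope of log(latency) against log(tokens)."""
    if len(token_counts) != len(latencies_s) or len(token_counts) < 2:
        raise ValueError("need at least two paired measurements")
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(t) for t in latencies_s]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("token counts must not all be equal")
    return cov / var
```

End-to-end latency also includes terms that grow linearly with length, plus fixed overhead, so measured exponents tend to land between 1 and 2. Treat the number as a hint about the architecture, and take measurements well past 200K tokens, where the attention term has the most weight.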

Frequently asked

Why does Gemini advertise 2M when nothing tests above 1M usefully?

Holding the largest headline number has marketing value, even though operators rarely fill the window. The 2M figure is real for needle-style recall.

Are infinite-context architectures real yet?

Only at research level. The commercial frontier remains quadratic attention plus context engineering.
