Why Context Windows Matter More Than Parameter Count
In 2024, the field learned that the window mattered more than the weights. Here is what changed and why it still surprises people.
Practitioner notes from C.W. Jameson. Production observations, cost analyses, and field reports on what actually works.
For fourteen months, GPT-4 was the only serious choice. Then Anthropic benchmarked past it on a Tuesday, and the market never went back.
I ran the same 5,000-token system prompt through 10,000 calls. With caching: $12. Without: $112. The setup took twenty minutes.
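The arithmetic behind those two figures can be sketched as follows. Only the prompt length, the call count, and the two dollar totals come from the text above. The function names are illustrative, and the per-million rates are back-derived under the assumption that both totals cover only the system prompt tokens; they are not taken from any provider's price list.

```python
# Figures quoted in the text above.
PROMPT_TOKENS = 5_000
CALLS = 10_000
COST_UNCACHED = 112.0  # USD, total without caching
COST_CACHED = 12.0     # USD, total with caching


def total_prompt_tokens(prompt_tokens: int, calls: int) -> int:
    """System prompt tokens sent across all calls."""
    return prompt_tokens * calls


def implied_rate_per_million(total_cost: float, tokens: int) -> float:
    """Effective USD per million tokens implied by a total bill.

    Assumption: the bill covers only these tokens (no output tokens).
    """
    return total_cost / tokens * 1_000_000


def savings_fraction(uncached: float, cached: float) -> float:
    """Share of the uncached bill removed by caching."""
    return (uncached - cached) / uncached


tokens = total_prompt_tokens(PROMPT_TOKENS, CALLS)
print(f"prompt tokens sent: {tokens:,}")
print(f"implied uncached rate: ${implied_rate_per_million(COST_UNCACHED, tokens):.2f}/M")
print(f"implied cached rate:   ${implied_rate_per_million(COST_CACHED, tokens):.2f}/M")
print(f"savings: {savings_fraction(COST_UNCACHED, COST_CACHED):.1%}")
```

Under that assumption the saving works out to roughly 89% of the uncached bill, which is why twenty minutes of setup pays for itself quickly at this call volume.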
After running five different loop architectures in production, I ended up using three of them. Here are the five patterns and which two I stopped using.
The January 2025 release of R1 changed three things: the floor for open-weights reasoning, the confidence of Chinese AI labs, and the board slide at every US lab.
MCP was supposed to be the USB of AI tool connections. After six months: it mostly is, with two significant caveats.
At 5 million tokens per day, self-hosting wins. Below 1 million, it is almost always cheaper to use the API. Here is the full calculation.
The modal failure mode is not the wrong click. It is the agent deciding the task is done when it is 70% complete.
I watched the tool call logs from a three-hour refactor session. Here is what happened in each phase.
You asked for ten lines of code. The model thought for four thousand tokens before writing them. You paid for both.
The surprise: the routing classifier itself was the hardest part. The model selection, once you have the signal, is straightforward.
There are exactly twelve ways a model will break your schema. Seven of them have deterministic fixes. Five require prompt engineering.
SWE-bench Verified measures pull-request-level code repair on real GitHub issues. It measures that very precisely. It measures nothing else.
The citation quality comparison was the most surprising result: Perplexity cited more, but Claude was more accurate about what the cited source said.
The transformer is not magic. It is a very well-designed function approximator with three specific properties that make it scale. Understanding those three properties explains most model behaviour operators observe.
Fine-tuning helps with format and style. It rarely helps with knowledge. I have the invoice to prove it.
Confabulation (inventing plausible details), source drift (misattributing real content), and contradiction (contradicting the prompt) are three different problems requiring three different solutions.
The trend over three years has been toward more specific refusals with better alternatives. The outlier remains xAI's Grok, which is trending in the other direction.
The right answer depends on three variables: update frequency, query specificity, and document density. Here is the decision matrix.
MMLU is the benchmark every press release cites. HellaSwag is the one I actually look at first.
Mode 7 — the agent decides it is done when it is not — accounts for 38% of the production failures I have logged. It is also the hardest to detect.
o3 wins on competition math. Opus wins on long-horizon code. For everything else, the answer depends on the specific task more than the model.
The cache expects your prefix to be stable. Your prefix is not stable. Here is what to do about it.
When an agent can spawn sub-agents, who is responsible for the sub-agent's actions? The answer is 'nobody' until you design it to be otherwise.
Short answer: Cursor for line-level edits, Claude Code for refactors that take more than twenty minutes. Longer answer follows.
Meta shipped a model that can be self-hosted on eight H100s and matches GPT-4o on most benchmarks. The inference industry responded within two weeks.
Grounding is not just RAG. It is a claim verification process. The retrieval is the easy part.
Most model names obscure their lineage. This is the family tree with the actual derivative relationships.
Mistral's moat is not the best model. It is the only frontier model with a credible EU data-residency story.
The most common tool design mistake is an error handling contract that returns nothing. The model has no idea what to do with silence.
Competition math, architecture design, legal contract review: yes. Customer support, classification, summarisation: almost never.
Batch processing typically cuts cost 50% and increases throughput 5×. It requires accepting up to 24 hours of latency. For most workloads, that is a straightforward trade.
In-context memory is the simplest and breaks first. Episodic memory is the most human-like and the hardest to implement correctly.
The reliable part: agent loops that run for hours on well-defined tasks. The unreliable part: knowing whether the task is actually well-defined, because it looks that way until it does not.
Qwen handles Chinese-language content as well as Claude handles English. That sounds obvious; it was not obvious until someone built it.
A 500-page PDF is 250,000 tokens. At Opus pricing, reading it costs $3.75 per pass. Caching it costs $0.75 after the first.
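A rough sketch of how those numbers compound over repeated reads. The page count, token count, and the two per-pass prices come from the text above; the function names are illustrative. It assumes the first cached pass costs the same as an uncached pass (no cache-write surcharge) and that the cache never expires between passes, neither of which is stated in the text.

```python
# Figures quoted in the text above.
PAGES = 500
TOKENS = 250_000
COST_PER_PASS = 3.75         # USD, uncached read of the whole PDF
COST_PER_CACHED_PASS = 0.75  # USD, each read after the first, with caching


def uncached_cost(passes: int) -> float:
    """Total cost when every pass re-reads the document at full price."""
    return passes * COST_PER_PASS


def cached_cost(passes: int) -> float:
    """Total cost when the first pass is full price and the rest hit the cache.

    Assumption: no cache-write surcharge and no cache expiry.
    """
    if passes <= 0:
        return 0.0
    return COST_PER_PASS + (passes - 1) * COST_PER_CACHED_PASS


print(TOKENS // PAGES)                     # tokens per page: 500
print(COST_PER_PASS / TOKENS * 1_000_000)  # implied USD per million tokens: 15.0
print(uncached_cost(10))                   # 37.5
print(cached_cost(10))                     # 10.5
```

Ten passes over the same document cost $37.50 without caching and $10.50 with it under these assumptions, so the gap widens with every additional question asked of the PDF.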
The feature still works exactly as advertised. The content is still shallow. For onboarding to a new domain, it remains genuinely useful.
Permissive is not the same as accurate. Grok will tell you things other models won't. Some of them are true.
LLM-as-judge is surprisingly reliable for quality dimensions. It is completely unreliable for factual accuracy.
Copilot has the distribution. Cursor has the speed. Cline has the trust. Pick two.
Prompt injection is not a solved problem. It is a managed risk. The operators who do it well have accepted that and built accordingly.
Pricing tables lie. The number that matters is cost per completed task, not cost per million tokens.
Most agents fail before the 30-minute mark. The reasons are structural, not just model limitations.
MCP standardises tool definitions across models. The implications for multi-agent systems are significant.
Fine-tuning felt necessary in 2022. By 2025, the base models were improving faster than the fine-tunes could keep up.
Constrained decoding is not magic. Here is what actually guarantees schema adherence and what just reduces violations.
Vision models handle PDFs and screenshots well. They fail predictably on handwriting and dense tables.
The wrong model is either too slow or too dumb for your use case. Here is how to find the boundary.
Choosing the wrong embedding model for your language pair or domain quietly degrades RAG quality for months.
Prompt injection is the SQL injection of the LLM era. Most published defenses do not hold under adversarial conditions.
Reasoning models are not better chat models. They are a different tool with different failure modes.
Most agent memory fails because engineers treat it like a database when it needs to behave like human recall.
MMLU hit 90%. The frontier moved to harder tests. Here is a guide to which benchmarks still separate good from great.
RAG pipelines look simple until they are in production. Here are the failure modes that surprise every team.
Three different theories of how an AI should help you code. Only one is right for long-horizon autonomous work.
Open weights give control. Closed APIs give capability. The right answer depends on what you are optimising for.
You cannot test an LLM application the way you test a deterministic function. Here is what actually works.
Distillation is not copying. It is teaching a student model to match the distribution of a larger teacher.
Function calling looks simple. It breaks in production when schemas are ambiguous or error messages are not informative.
GPT-5 is not GPT-4 with more parameters. The architecture and training approach changed in observable ways.
Constitutional AI replaces human labellers with a model judging its own outputs against a written constitution.
CoT is one of the most replicated findings in LLM research. Here is why it works and when it does not.
Hallucination is not a bug to be patched. It is a property of the architecture with known mitigation strategies.
Multi-agent systems fail in specific ways. Knowing the patterns lets you build around the failure modes from the start.
System prompts are your model's constitution. Most are written badly and produce variable, unreliable behaviour.
Unit tests do not capture agent behaviour. Here are the evaluation approaches that do.
Sending every query to the frontier model is expensive. Smart routing cuts costs while preserving quality.
DeepSeek showed that frontier-level capability did not require frontier-level compute. The implications rippled through the industry for months.
Voice pipelines live and die on latency. Here is how architecture decisions interact with model selection.
Web research agents are easy to build and hard to make reliable. Source quality and citation accuracy are the hard problems.
Production AI costs surprised most teams. Here are the ten approaches that deliver the highest cost reduction per unit of effort.
Prompt caching cuts costs by up to 90% on repeated system prompts. Most teams leave this saving on the table.
Enterprise security teams are still building mental models for LLM-specific risks. This is the current map.
SWE-bench tasks agents with fixing real GitHub issues. The progression from 5% to over 50% in 18 months was extraordinary.
Document processing looks like a solved problem until you hit PDFs with tables, charts, and multi-column layouts.
You do not need to understand backpropagation to understand attention. Here is the intuition that matters for building applications.
Regulation is arriving faster than most teams are preparing for. Here is what actually applies to most commercial LLM products.
Agent state management is where most frameworks fall short. Here are the patterns that handle real-world failure conditions.
The AI feature graveyard is full of products that users tried once. Here is the pattern that separates sticky AI from abandoned AI.
A million tokens is not a bigger context window. It is a different way of thinking about what you can ask a model to do.
Extended thinking is powerful on hard reasoning tasks and wasteful on simple ones. Here is how to tell which is which.
Running open-source models sounds like cost savings. It is, until you account for operations, GPU rental, and engineering time.
All frontier models write well. They differ on consistency, instruction-following, and what they refuse to write.
Red teaming an LLM is not pen testing. The attack surface is fundamentally different and the failures are harder to categorise.
Long context windows change the calculus for RAG. Here is how to decide which approach is right for your use case.
LLM monitoring is not like service monitoring. The failure modes are different and the metrics that matter are counterintuitive.
The same base model can be helpless or highly capable depending on instruction tuning quality. Here is what that means in practice.
Computer use agents are impressive in demos and unreliable in production. Here is the current honest assessment.
Qwen-3 scored at frontier level while remaining openly available. The strategic implications extend beyond model quality.
Real-time AI is a latency problem wrapped in a model selection problem. The solutions interact in non-obvious ways.
Customer service agents work until they meet an upset customer or an unusual product edge case. Here is how to handle both.
RLHF is not the alignment technique. It is the usability technique. The distinction matters more than most people realise.
Tool design is half of agent design. A poorly defined tool is as damaging as a poorly designed prompt.
Synthetic data sounds like a shortcut. Used correctly, it is one of the highest-leverage data strategies available.
LLMs moderate nuanced content better than classifiers do. They are slower and more expensive, which changes the architecture.
Examples help models generalise correctly. They also anchor models to distributions you did not intend. Both effects are real.
Scaling laws predicted GPT-4. Whether they will predict GPT-6 is a genuinely open question in 2025.
AI in science has genuine successes and genuine hype. Here is a clear-eyed assessment of what is working.
Batch APIs offer 50% cost reduction on non-realtime workloads. Most teams do not use them because they require different job management patterns.
Conversation design is part writing, part systems architecture, and part psychology. Here is how the three interact.
AI products have higher marginal costs than SaaS. The pricing models that work are different as a result.
Models get deprecated faster than most engineering teams plan for. Here is the migration architecture that handles it cleanly.
MoE models activate only a fraction of parameters per token. The efficiency gains are real and the tradeoffs are specific.
Legal is high-stakes. The tasks where AI helps and the tasks where it can damage client outcomes are not always obvious.
Financial analysis and LLMs have a complicated relationship. The models are capable; the liability frameworks are not ready.
Context windows are large but costs scale with tokens. Prompt compression is the highest-leverage optimisation most teams skip.
AI in education is one of the most-discussed and least-measured application domains. Here is an evidence-first assessment.
AI code review has moved from experiment to standard practice in two years. Here is what good deployment actually looks like.