Field identification
Prompt caching, the 90%-discount most operators don't use
Anthropic, OpenAI, and Gemini all offer cached-prefix discounts. The architectures that take advantage of them look different from the ones that don't.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Prompt caching converts a recurring prefix into a 90%-discounted read for a short window. Operators who structure their prompts around it pay 5-10× less than operators who don't. The architectural pattern is: put the static parts of the prompt at the top, the dynamic parts at the bottom, and never put a fresh timestamp anywhere near the cache breakpoint.
Anthropic cache
Mark up to four cache breakpoints in a prompt. The longest prefix that hits a breakpoint is cached for 5 minutes. Subsequent reads of that prefix bill at 10% of input rate. Cache writes are 25% above input rate, so the breakeven is approximately 2-3 reads before the cache pays for itself.
OpenAI cache
Automatic, no explicit breakpoints. Cache TTL is shorter (typically 5-10 minutes). Discount is similar (~50%, varies by model).
Architecture
System prompts, tool definitions, and document context belong above the cache breakpoint. User turn and dynamic state belong below. A common mistake: putting the current timestamp at the top of the prompt for 'context', which invalidates every cache.
Tells
| Marker | Meaning |
|---|---|
| Cache hit-rate visible in API response is below 30% on a long-running agent | Prefix is fluctuating; restructure to stabilise it. |
| Bill stays high across agent turns despite identical-looking prompts | Cache misses; usually a timestamp or randomised field at the top. |
Frequently asked
Does caching survive a model switch?
No; cache is per-model.
What about cross-provider caching?
Doesn't exist. Each provider's cache is local.
From the Almanac shop
Model Tells — Flashcard Deck
Identify any frontier model from a paragraph of output. 60 cards.
$14 — Coming soon