Prompt caching, the 90%-discount most operators don't use

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Prompt caching converts a recurring prefix into a 90%-discounted read for a short window. Operators who structure their prompts around it pay 5-10× less than operators who don't. The architectural pattern is: put the static parts of the prompt at the top, the dynamic parts at the bottom, and never put a fresh timestamp anywhere near the cache breakpoint.

Anthropic cache

Mark up to four cache breakpoints in a prompt. The longest prefix that hits a breakpoint is cached for 5 minutes. Subsequent reads of that prefix bill at 10% of input rate. Cache writes are 25% above input rate, so the breakeven is approximately 2-3 reads before the cache pays for itself.

OpenAI cache

Automatic, no explicit breakpoints. Cache TTL is shorter (typically 5-10 minutes). Discount is similar (~50%, varies by model).

Architecture

System prompts, tool definitions, and document context belong above the cache breakpoint. User turn and dynamic state belong below. A common mistake: putting the current timestamp at the top of the prompt for 'context', which invalidates every cache.

Tells

Marker	Meaning
Cache hit-rate visible in API response is below 30% on a long-running agent	Prefix is fluctuating; restructure to stabilise it.
Bill stays high across agent turns despite identical-looking prompts	Cache misses; usually a timestamp or randomised field at the top.

Frequently asked

Does caching survive a model switch?

No; cache is per-model.

What about cross-provider caching?

Doesn't exist. Each provider's cache is local.

From the Almanac shop

Model Tells — Flashcard Deck

Identify any frontier model from a paragraph of output. 60 cards.

$14 — Coming soon

← All identification topics