
Every LLM API call reprocesses the entire input from scratch. In multi-turn conversations or agentic workflows, the same system prompt, tool definitions, and conversation history get re-tokenized and re-computed on every request. Prompt caching stores the key-value (KV) pairs from previously processed prefixes server-side, allowing subsequent requests that share the same prefix to skip that computation. The result is lower latency (up to 85% reduction in time-to-first-token) and lower cost (up to 90% off input token pricing) [1][2].
As of mid-2026, all three major providers — Anthropic, OpenAI, and Google — offer prompt caching, but with fundamentally different design philosophies: explicit control, fully automatic, and a hybrid approach respectively.
Anthropic offers the most granular caching system. Developers place cache_control markers on content blocks to designate what gets cached, or use a top-level automatic caching mode that manages breakpoints as conversations grow [1][3].
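A minimal sketch of the explicit style, assuming the Python `anthropic` SDK; the model ID and document contents are placeholders, and the block marked with `cache_control` becomes the cached prefix [1][3]:

```python
import anthropic

# Placeholder: stable system instructions + reference docs (must exceed the minimum cacheable size)
LARGE_REFERENCE_DOCUMENT = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_REFERENCE_DOCUMENT,
            # Everything up to and including this block is cached server-side
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3 of the document."}],
)

# usage reports how many prompt tokens were written to vs. read from the cache
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```

On the first call the marked block is written at the premium rate; identical prefixes within the TTL are read back at the discounted rate.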
Anthropic uses a write-premium / read-discount model. Cache writes cost more than standard input; cache reads cost 90% less. There are no separate storage fees [1].
| Model | Base Input | 5-min Cache Write | 1-hr Cache Write | Cache Read (Hit) | Output |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00/MTok | $6.25/MTok (1.25×) | $10.00/MTok (2×) | $0.50/MTok (0.1×) | $25.00/MTok |
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok (1.25×) | $6.00/MTok (2×) | $0.30/MTok (0.1×) | $15.00/MTok |
| Claude Haiku 4.5 | $0.80/MTok | $1.00/MTok (1.25×) | $1.60/MTok (2×) | $0.08/MTok (0.1×) | $4.00/MTok |
| Claude Haiku 3 | $0.25/MTok | $0.30/MTok (1.25×) | $0.50/MTok (2×) | $0.03/MTok (0.1×) | $1.25/MTok |
Source: Anthropic official pricing page, April 2026 [1]
For Sonnet 4.6 with a 5-minute cache window [5]:

- Write cost: $3.75/MTok
- Savings per read: $3.00 - $0.30 = $2.70/MTok
- Break-even: $3.75 / $2.70 ≈ 1.39 reads
You need approximately 1.4 cache reads within a 5-minute window to recover the write cost — in practice, 2 reads is the minimum viable threshold. For the 1-hour window (2× write), the break-even rises to ~2.2 reads, meaning 3 reads within an hour [5].
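The same arithmetic generalizes to any row of the pricing table above. A small illustrative helper (not from any SDK):

```python
def breakeven_reads(base_input: float, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cache reads needed within the TTL to recover the cache-write premium."""
    write_cost = base_input * write_multiplier              # e.g. $3.75/MTok for a 5-min write
    savings_per_read = base_input * (1 - read_multiplier)   # e.g. $2.70/MTok saved per hit
    return write_cost / savings_per_read

# Claude Sonnet 4.6, $3.00/MTok base input
print(breakeven_reads(3.00, 1.25))  # ~1.39 reads (5-minute window)
print(breakeven_reads(3.00, 2.00))  # ~2.22 reads (1-hour window)
```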
A real-world test with Claude Sonnet 4.5 over 10 requests against a 6,313-token cached prefix showed: first request cost $0.39 (cache write + query), subsequent 9 requests cost $0.04 each, for a total of $0.75 — a 75.9% savings versus uncached [6].
OpenAI takes the opposite approach: zero configuration. Caching activates automatically on all prompts over 1,024 tokens with no code changes required and no write surcharge [7][8].
OpenAI charges no premium for cache writes. Cache hits receive a discount that varies by model — 50% for GPT-4o, 75% for the GPT-4.1 series, and up to 90% for o3/o4-mini reasoning models [8][9].
| Model | Standard Input | Cached Input | Discount | Output |
|---|---|---|---|---|
| GPT-4o | $2.50/MTok | $1.25/MTok | 50% | $10.00/MTok |
| GPT-4o mini | $0.15/MTok | $0.075/MTok | 50% | $0.60/MTok |
| GPT-4.1 | $2.00/MTok | $0.50/MTok | 75% | $8.00/MTok |
| GPT-4.1 mini | $0.40/MTok | $0.10/MTok | 75% | $1.60/MTok |
| GPT-4.1 nano | $0.10/MTok | $0.025/MTok | 75% | $0.40/MTok |
| o3 | $2.50/MTok | $0.25/MTok | 90% | $10.00/MTok |
| o4-mini | $0.75/MTok | $0.075/MTok | 90% | $3.00/MTok |
Source: OpenAI official pricing page, April 2026 [8]
A test with GPT-4o over 10 requests showed: first request $0.26 (no write surcharge), subsequent 9 requests $0.13 each, total $1.45 — a 53.4% savings [6]. The lower percentage versus Anthropic reflects the 50% discount (vs. 90%), but the absence of a write surcharge means every request after the first benefits, with no break-even threshold to clear.

To raise hit rates further, the optional prompt_cache_key parameter can be supplied to influence routing for requests that share long common prefixes [7], as in the sketch below.
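A minimal sketch, assuming the official `openai` Python SDK and that `prompt_cache_key` is accepted as a top-level Chat Completions field as described in [7]; the model ID, cache key, and prompt contents are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: system instructions + tool docs, >= 1,024 tokens so caching can trigger
STABLE_PREFIX = "..."

response = client.chat.completions.create(
    model="gpt-4.1",                    # placeholder model ID
    prompt_cache_key="support-bot-v3",  # groups requests sharing this prefix onto the same cache
    messages=[
        {"role": "system", "content": STABLE_PREFIX},
        {"role": "user", "content": "Where is my order #1234?"},
    ],
)

# cached_tokens > 0 indicates part of the prompt was served from the cache
print(response.usage.prompt_tokens_details.cached_tokens)
```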
For GPT-4.1 series models with 75% cache discounts, the economics improve significantly. A workload processing 55M input tokens/day on GPT-4.1 drops from $110/day to $27.50/day — monthly savings of ~$2,475 [9].
Google offers two caching modes: implicit caching (automatic, enabled by default since May 2025) and explicit caching (manual, with storage fees) [10][11].
Implicit caching works automatically with no code changes and no storage fees. When the system detects a cache hit, you receive a 90% discount on cached tokens for Gemini 2.5+ models [10][11].
Explicit caching gives developers control over what gets cached and for how long, but introduces per-hour storage fees [10][11].
| Model | Standard Input | Cached Read | Discount | Storage Cost |
|---|---|---|---|---|
| Gemini 2.5 Pro (≤200K) | $1.25/MTok | $0.3125/MTok | 75% → 90%* | $4.50/MTok/hr |
| Gemini 2.5 Flash | $0.30/MTok | $0.075/MTok | 75% → 90%* | $1.00/MTok/hr |
| Gemini 2.5 Flash-Lite | $0.10/MTok | $0.025/MTok | 75% → 90%* | $0.25/MTok/hr |
(*) Gemini 2.5+ models receive a 90% discount on cached reads; Gemini 2.0 models receive 75% [10].
Implicit caching is the right default for most workloads — it's free, automatic, and provides the same 90% discount when hits occur. Explicit caching makes sense only for very high-volume workloads where you need guaranteed cache availability and can justify the storage overhead [10][11].
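For the explicit mode, a minimal sketch assuming the `google-genai` Python SDK's caches API; the model ID, TTL, and document contents are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

LONG_CONTRACT_TEXT = "..."  # placeholder: a large document exceeding the explicit-caching minimum

# Create a cache holding the stable context; storage is billed per token-hour
cache = client.caches.create(
    model="gemini-2.5-flash",  # placeholder model ID
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached contract.",
        contents=[LONG_CONTRACT_TEXT],
        ttl="3600s",  # keep the cache alive for one hour
    ),
)

# Later requests reference the cache instead of resending the full context
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What is the termination clause?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
print(response.usage_metadata.cached_content_token_count)  # tokens billed at the cached rate
```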
| Feature | Anthropic | OpenAI | Google Gemini |
|---|---|---|---|
| Trigger mechanism | Explicit cache_control or automatic mode | Fully automatic | Implicit (auto) + Explicit (manual) |
| Min tokens | 1,024–4,096 | 1,024 | 2,048–4,096 (implicit), 32,768 (explicit) |
| Cache read discount | 90% | 50–90% (model-dependent) | 90% (Gemini 2.5+) |
| Write cost premium | 1.25× (5-min) / 2× (1-hr) | None | None (implicit) / Storage fees (explicit) |
| TTL | 5 min or 1 hour | 5–10 min auto, up to 24 hr extended | Auto (implicit), user-defined (explicit) |
| Storage fees | No | No | No (implicit) / Yes (explicit) |
| Break-even reads | ~1.4 per window | 1 (no write premium) | 1 (implicit) / ~40/hr at 10M tokens (explicit) |
| Max latency reduction | 85% TTFT | 80% TTFT | Not officially published |
| Multimodal caching | Text, images, tools | Text, images, tools | Text, images, audio, video, PDFs |
ProjectDiscovery's AI security pipeline provides one of the best-documented case studies. Their initial implementation achieved only a 7.4% cache hit rate because dynamic content (timestamps, request IDs, session tokens) was embedded near the top of the system prompt, causing every request to hash differently. Moving dynamic content after the stable prefix — a purely structural change — pushed hit rates to 84% [4][5].
On their most demanding task (1,225 steps, 67.5 million input tokens), they achieved a 91.8% cache rate and reduced total LLM costs by 59–70% [4].
Without prompt caching, a long Claude Opus coding session (100 turns with compaction cycles) can cost $50–100 in input tokens. With caching, that drops to $10–19 — making the $20/month Claude Code Pro subscription economically viable [3].
Data synthesized from multiple 2025–2026 production reports [12]:
| Workload Type | Typical Hit Rate | Cost Reduction |
|---|---|---|
| RAG-based customer support (Anthropic) | 85%+ | ~85% on input costs |
| Code analysis pipeline (OpenAI) | 70–80% | ~60% |
| Multi-step agent tasks | 80–92% | 59–70% |
| Single-step tasks | 30–40% | 15–25% |
| General SaaS (50% hit rate, 100K daily requests) | 50% | ~49% |
| Target for healthy implementations | >70–80% | — |
The single most impactful optimization is prompt structure. Place all stable content (system instructions, tool definitions, few-shot examples, reference documents) at the beginning of the prompt. Place all variable content (user query, timestamps, session-specific data) at the end [3][4][5].
- Cached block 1: System prompt — never changes
- Cached block 2: Tool definitions — rarely changes
- Cached block 3: Conversation history — grows each turn
- Uncached: Current user message
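A minimal, provider-agnostic sketch of assembling a request in that order, keeping the volatile pieces (timestamp, session data, user query) out of the cacheable prefix; the function and field names are illustrative:

```python
from datetime import datetime, timezone

def build_messages(system_prompt: str, history: list[dict], user_query: str, session_id: str) -> list[dict]:
    """Stable content first (cacheable prefix), volatile content last."""
    dynamic_context = (
        f"Current time: {datetime.now(timezone.utc).isoformat()}\n"
        f"Session: {session_id}\n\n"
        f"{user_query}"
    )
    return (
        [{"role": "system", "content": system_prompt}]    # never changes -> cache hit every turn
        + history                                         # grows each turn, but the prefix still matches
        + [{"role": "user", "content": dynamic_context}]  # changes every request -> kept out of the prefix
    )
```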
Any dynamic content placed before the stable prefix (a timestamp, request ID, or session token) will invalidate the cache and force a full-price recomputation [3][5]. For example, "Today is April 27, 2026" at the top of a prompt breaks caching for everything after it.

A separate pitfall is the parallel-request race. When firing many parallel requests simultaneously, the first request triggers a cache write that takes 2–4 seconds to materialize. All concurrent requests arrive before the cache is available, so every request pays full price (or the write premium) with zero reads. The fix: issue a single warm-up request and wait for it to complete before launching the parallel batch, as in the sketch below [5].
| Scenario | Best Provider | Why |
|---|---|---|
| Zero-config prototyping | OpenAI | Automatic, no write surcharge, immediate benefit |
| High-volume production with stable prompts | Anthropic | 90% discount + granular control = maximum savings |
| Multimodal caching (video/audio) | Google Gemini | Native support for caching video, audio, PDFs |
| Cost-sensitive with variable traffic | OpenAI | No write premium means no penalty for cache misses |
| Very large context (>32K tokens), high reuse | Google Gemini (explicit) | Long TTLs, guaranteed cache availability |
| Agent loops / multi-turn conversations | Anthropic | 90% savings on growing conversation prefixes |
Discounts can be combined. On Anthropic, prompt caching multipliers stack with the Batch API's 50% discount and data residency pricing [1]. On OpenAI, cached input pricing stacks with the Batch API's 50% discount [8].
Example — Maximum savings on Anthropic Opus 4.7: a cache read at $0.50/MTok (90% off) processed through the Batch API (an additional 50% off) works out to $0.25/MTok, 95% below the $5.00/MTok base input price.

Example — Maximum savings on OpenAI GPT-4.1: cached input at $0.50/MTok (75% off) combined with the Batch API's 50% discount works out to $0.25/MTok, 87.5% below the $2.00/MTok base input price.
Anthropic offers the deepest discounts (90% on reads) but charges a write premium (1.25–2×). Best for high-volume, stable-prefix workloads where the write cost amortizes quickly.
OpenAI is the safest default — automatic caching with no write surcharge means you never pay more than standard pricing. The 50–90% read discount (model-dependent) is lower than Anthropic's flat 90%, but the zero-risk model is compelling for variable workloads.
Gemini's implicit caching is the best free option — 90% discount with no storage fees when hits occur. Explicit caching's storage fees ($4.50/MTok/hr on 2.5 Pro) make it viable only for very high-throughput workloads.
Hit rate is an architectural property, not a prompt property. The difference between 7% and 84% hit rates is where you place dynamic content in the prompt, not whether you enable caching.
Break-even is remarkably low: ~1.4 reads per cache window on Anthropic, 1 read on OpenAI (no write premium). At scale, caching is almost always correct. At low volume, verify before shipping.
[1] Anthropic. "Pricing." https://docs.anthropic.com/en/docs/about-claude/pricing (accessed April 2026).
[2] Anthropic. "Token-saving updates on the Anthropic API." https://www.anthropic.com/news/token-saving-updates (October 2024).
[3] ClaudeCodeCamp. "How Prompt Caching Actually Works in Claude Code." https://www.claudecodecamp.com/p/how-prompt-caching-actually-works-in-claude-code (July 2026).
[4] AppXLab. "Prompt Caching LLM Cost Savings: Claude vs GPT vs Gemini." https://blog.appxlab.io/2026/04/13/prompt-caching-llm-cost-savings/ (April 2026).
[5] Tian Pan. "The Exact Math on When Provider-Side Prefix Caching Actually Pays Off." https://tianpan.co/blog/2026-04-17-prompt-cache-break-even-math (April 2026).
[6] Will McGinnis. "I Tested LLM Prompt Caching With Anthropic and OpenAI." https://mcginniscommawill.com/posts/2025-11-17-llm-prompt-caching-comparison/ (November 2025).
[7] OpenAI. "Prompt Caching Guide." https://developers.openai.com/api/docs/guides/prompt-caching (accessed April 2026).
[8] OpenAI. "API Pricing." https://www.openai.com/api/pricing (accessed April 2026).
[9] TokenMix. "How to Reduce OpenAI API Cost by 80%." https://tokenmix.ai/blog/how-to-reduce-openai-api-cost (April 2026).
[10] Google Cloud. "Context Caching Overview — Vertex AI." https://cloud.google.com/vertex-ai/docs/generative-ai/context-cache/context-cache-overview (accessed April 2026).
[11] Google Developers Blog. "Gemini 2.5 Models now support implicit caching." https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/ (May 2025).
[12] Zylos Research. "Prompt Caching and KV Cache Optimization for Long-Running AI Agent Sessions." https://zylos.ai/research/2026-03-27-prompt-caching-kv-cache-optimization-long-running-ai-agents (March 2026).