
Every major frontier model provider — OpenAI, Anthropic, and Google — now ships a dedicated batch / asynchronous inference tier that trades real-time latency for a flat 50% discount on both input and output tokens [1][2][3]. What started as a quiet OpenAI feature in 2024 has become a standard pricing lane in 2025–2026: submit a job, wait minutes-to-hours, pay half.
For workloads where a response can wait — nightly summarization, evaluation harnesses, bulk classification, embeddings backfills, dataset labeling — batch APIs are often the single largest cost lever available, bigger than model-tier downgrades and competitive with prompt caching. One industry analysis estimates that in typical production LLM bills, 60%+ of spend goes to offline jobs (nightly summarization, enrichment, classification) that nobody reads in real time [4]. Moving that traffic to batch halves the bill with essentially no code change beyond a different endpoint.
This report covers the three major batch offerings, concrete per-million-token pricing, latency tradeoffs, stacking with prompt caching, and when batch is the right (and wrong) tool.
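Concretely, that migration is small. Here is a minimal sketch of the OpenAI Batch workflow using the official `openai` Python SDK; the model name and file path are placeholders, but the calls (`files.create`, `batches.create`) are the documented API surface [5]:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the input file is one ordinary chat.completions request,
# tagged with a custom_id so results can be matched back later.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",  # placeholder; any batch-eligible model
            "messages": [{"role": "user", "content": f"Summarize document {i}."}],
        },
    }
    for i in range(3)
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload the file and submit the batch; the same requests now bill at
# half the synchronous rate.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # currently the only offered window
)
print(batch.id, batch.status)
```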
All three providers landed on the same headline number: -50% versus the synchronous/standard tier, applied to both input and output tokens.
| Provider | Batch/Async Product | Discount | SLA |
|---|---|---|---|
| OpenAI | Batch API | 50% off standard [1][5] | 24 hours |
| Anthropic | Message Batches API | 50% off standard [2][6] | 24 hours (most < 1 hour) |
| Google | Gemini Batch API / Batch tier | 50% off standard [3][7] | up to 24 hours |
The OpenAI pricing page lists the batch column explicitly as "Standard / Batch -50% / Data residency +10%", confirming the discount is a flat adjuster applied across essentially the entire model catalog [1]. Anthropic makes the same claim in its official product announcement: "Each batch is processed in less than 24 hours and costs 50% less than standard API calls" [6]. Google's Vertex AI documentation states the same: "Batch processing is offered at a 50% discounted rate compared to real-time inference" [3].
The symmetry is not a coincidence. Batch inference lets providers schedule work against spare GPU capacity during off-peak windows, smoothing utilization curves that would otherwise be dominated by daytime chat traffic. Google explicitly describes the Flex tier (a close cousin of Batch) as "leveraging underutilized computing resources during off-peak periods" in exchange for 1–15 minute latency [8]. Batch takes the same tradeoff further: looser SLA, same 50% off.
Providers compete heavily on headline per-token pricing for real-time tiers; competing on batch rates would signal capacity oversupply. 50% has become a Schelling point — large enough to change buying behavior, consistent across providers, easy to reason about.
Batch pricing is always derived: take the standard per-million rate and halve it. The absolute dollar figures below use publicly listed standard pricing from late-2025 / early-2026 snapshots.
Standard GPT-5 pricing is $1.25 input / $10.00 output per million tokens [9][10]. Under Batch:
| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| GPT-5 | $1.25 / $10.00 | $0.625 / $5.00 [11] |
| GPT-5 mini | $0.25 / $2.00 [10] | $0.125 / $1.00 |
| GPT-5 nano | ~$0.05 / $0.40 | ~$0.025 / $0.20 |
| GPT-4o (legacy) | $2.50 / $10.00 | $1.25 / $5.00 [12] |
A Redress Compliance analysis confirms the math: "GPT-5 is OpenAI's most capable model at $1.25 per million input tokens (standard) or as low as $0.625 with batch API" [11].
Note: Historically there was one reported pricing discrepancy for gpt-4o-2024-08-06 where batch output tokens were listed at $7.50/M rather than the expected $5.00/M (a 25% rather than 50% discount); OpenAI corrected this after community reports [12]. In general, verify the live pricing page before committing production budget.
Anthropic's headline batch rates for Claude Opus / Sonnet / Haiku follow the same 50% rule [13][14]:
| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| Claude Opus 4.x | $15 / $75 (historical) → $5 / $25 (Opus 4.6) [13] | $2.50 / $12.50 (at Opus 4.6 rates) |
| Claude Sonnet 4.x | $3 / $15 [14] | $1.50 / $7.50 |
| Claude Haiku 4.5 | $1 / $5 [14] | $0.50 / $2.50 [15] |
An independent pricing tracker records Haiku batch pricing precisely as "Batch API: $0.50/$2.50 per 1M tokens (50% off)" [15].
Gemini 2.5 Pro standard pricing is $1.25 input / $10.00 output per 1M tokens (≤200K prompts) and $2.50 / $15.00 for prompts >200K [16][17]. Gemini 2.5 Flash is $0.30 / $2.50 per 1M [18].
| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| Gemini 2.5 Pro | $1.25 / $10.00 | $0.625 / $5.00 [19] |
| Gemini 2.5 Flash | $0.30 / $2.50 | $0.15 / $1.25 |
| Gemini 2.5 Flash-Lite | $0.10 / $0.40 | $0.05 / $0.20 |
| Embeddings (text-embedding) | $0.15 / 1M (input only) | $0.075 / 1M (input only) [20] |
Google announced the Batch-mode embedding discount explicitly: "leverage the model with the Batch API at much higher rate limits and at half the price — $0.075 per 1M input tokens" [20].
A realistic backfill of 10 billion input tokens through Gemini 2.5 Pro standard costs $12,500. The same job through Batch costs $6,250 — a $6,250 saving for code that waits until morning. Scale to a 50B-token knowledge-base embedding refresh on text-embedding-3-large at $0.13/1M standard, and batch flips that from $6,500 to ~$3,250 [21].
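The arithmetic generalizes: batch cost is tokens times the standard per-1M rate, halved. A quick sketch that reproduces the two worked examples above (rates as listed in this report, not an official calculator):

```python
def batch_cost(tokens: int, standard_rate_per_m: float, discount: float = 0.5) -> float:
    """Cost in dollars for `tokens` at a per-1M standard rate, after the batch discount."""
    return (tokens / 1_000_000) * standard_rate_per_m * (1 - discount)

# 10B input tokens through Gemini 2.5 Pro ($1.25 per 1M standard)
print(batch_cost(10_000_000_000, 1.25))   # 6250.0 (vs 12500.0 standard)

# 50B tokens through text-embedding-3-large ($0.13 per 1M standard)
print(batch_cost(50_000_000_000, 0.13))   # 3250.0 (vs 6500.0 standard)
```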
The 50% discount is free only if the latency budget permits it. Real-world observed latencies diverge substantially from the advertised SLA.
| Provider | Advertised SLA | Typical observed | Pathological cases |
|---|---|---|---|
| OpenAI Batch | 24 hours | Minutes–hours when load is light; 2 hours common for nightly jobs [22] | 16-hour successes and 24-hour timeouts during high-load periods [22] |
| Anthropic Batches | 24 hours; "most < 1 hour" [2][23] | Usually <1 hour for small batches | One production report documented 4+ hour completion times with no per-item progress and no cancellation support [24] |
| Gemini Batch | Up to 24 hours | Hours | Developer reports of jobs stuck >72 hours in PROCESSING [25]; occasional >24h runs on gemini-3.1-pro [26] |
Key implication: treat "24 hours" as the actual SLA, not the long tail. Build pipelines that tolerate delays beyond 24 hours (retry loops, cancel-and-resubmit paths) rather than assuming the typical 30-minute turnaround.
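A defensive polling loop, sketched with the `openai` SDK; the timeout value and the cancel-and-resubmit policy are illustrative choices, not anything a provider prescribes:

```python
import time
from openai import OpenAI

client = OpenAI()

def run_batch_with_timeout(input_file_id: str, max_wait_s: int = 24 * 3600,
                           poll_interval_s: int = 300) -> str:
    """Submit a batch and poll until done; cancel and resubmit once on timeout."""
    for attempt in range(2):  # one retry after a timed-out or failed first attempt
        batch = client.batches.create(
            input_file_id=input_file_id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        deadline = time.time() + max_wait_s
        while time.time() < deadline:
            batch = client.batches.retrieve(batch.id)
            if batch.status == "completed":
                return batch.output_file_id
            if batch.status in ("failed", "expired", "cancelled"):
                break  # fall through and resubmit
            time.sleep(poll_interval_s)
        else:
            # Deadline passed while still in progress: cancel before resubmitting.
            client.batches.cancel(batch.id)
    raise RuntimeError("batch did not complete within the retry budget")
```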
In April 2026 Google restructured Gemini's pricing into five tiers (Standard, Flex, Priority, Batch, and Caching), giving developers a spectrum rather than a binary [27][28].
Flex is important because it removes the main operational tax of batch: file uploads, polling, result parsing. You call the same synchronous endpoint with a flex parameter and accept a 1–15 minute latency window in exchange for the same 50% discount. A scoop summary phrases it cleanly: "unlike the Batch tier, Flex operates strictly on a synchronous processing model (like the standard API), allowing you to handle background jobs cheaply without needing to overhaul your entire code architecture" [30].
OpenAI also ships a Flex processing tier for its reasoning models with a similar promise of cheaper non-realtime inference, while Anthropic's Priority Tier serves the inverse end of the spectrum: paying more for guaranteed capacity rather than less for deferred work.
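On OpenAI's side the Flex opt-in really is one parameter. A minimal sketch; the model name is a placeholder, since Flex is limited to specific (reasoning) models:

```python
from openai import OpenAI

client = OpenAI()

# Same synchronous endpoint; service_tier="flex" opts into cheaper, slower
# processing. Flex requests can take longer or be rejected under load, so
# pair the call with a generous timeout.
resp = client.chat.completions.create(
    model="o4-mini",  # placeholder; check current Flex-eligible models
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
    service_tier="flex",
    timeout=900.0,
)
print(resp.choices[0].message.content)
```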
Batch APIs are not unbounded. Per-batch payload and request-count limits shape how you architect jobs: Anthropic caps a batch at 100,000 requests or 256 MB [31], while OpenAI caps a batch input file at 50,000 requests and 200 MB [5][32]. Larger corpora get sharded across multiple batch jobs.
Every provider supports multi-turn conversations, tool use, system prompts, and vision inputs inside batch requests — batch does not mean feature-reduced. Anthropic explicitly notes "each request is a standard Messages API call, so it can include system messages, multi-turn conversations, tool use definitions, and vision inputs" [33].
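A minimal sketch with the `anthropic` Python SDK showing that each batch item is just a normal Messages request; the model name and prompts are placeholders:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Each entry pairs a custom_id with ordinary Messages API params,
# including a system prompt and multi-turn history.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "ticket-001",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder model name
                "max_tokens": 256,
                "system": "You label support tickets.",
                "messages": [
                    {"role": "user", "content": "My invoice is wrong."},
                    {"role": "assistant", "content": "Category:"},
                ],
            },
        },
    ]
)
print(batch.id, batch.processing_status)
```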
The canonical batch use cases cluster around four patterns.
Categorizing millions of support tickets, tagging product catalogs, extracting entities from archival documents, moderating user-generated content. These are the textbook batch workloads — high volume, no user waiting, cost-dominated. Tianpan's 2026 writeup frames the problem starkly: teams "obsess over time-to-first-token... then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time" [4].
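What makes this workable at millions of tickets is the custom_id round-trip: tag each request with your own row ID, then join results back by that ID. A sketch of the result-parsing half, assuming an OpenAI batch whose output file follows the layout in the Batch guide [5]:

```python
import json
from openai import OpenAI

client = OpenAI()

def collect_labels(batch_id: str) -> dict[str, str]:
    """Map each request's custom_id (e.g. a ticket ID) to the model's label."""
    batch = client.batches.retrieve(batch_id)
    raw = client.files.content(batch.output_file_id).text
    labels = {}
    for line in raw.splitlines():
        row = json.loads(line)  # {"custom_id": ..., "response": {"body": ...}, ...}
        body = row["response"]["body"]
        labels[row["custom_id"]] = body["choices"][0]["message"]["content"]
    return labels
```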
Running an eval harness against 10,000 prompts to validate a prompt change or model upgrade is embarrassingly parallel and latency-insensitive. A tutorial on OpenAI's Batch API for evaluation notes "guaranteed completion: all requests completed within 24 hours (usually much faster)" and halved cost, making it feasible to sweep over prompt variants that would be unaffordable in real time [34].
Refreshing a vector index over a full document corpus is perhaps the highest-ROI batch workload: purely input-token-bound, no output-generation cost, and vector DBs don't care when the vectors land. A Medium tutorial on batch embedding notes "batch jobs are half the price of individual API calls, which is beneficial if you need to embed a large amount of text" [35]. With Gemini's batch embeddings at $0.075/1M [20] and OpenAI's text-embedding-3-small at $0.01 per 1M under batch (half the $0.02/1M standard rate) [21], a 100-billion-token embedding run on the cheapest models costs on the order of $1,000 rather than $2,000, small change at current infra budgets.
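A sketch of how an embeddings backfill is expressed: the batch input file targets OpenAI's `/v1/embeddings` endpoint, which the Batch API accepts alongside chat completions [5]. The corpus and model name are placeholders.

```python
import json

corpus = {"doc-1": "First document text...", "doc-2": "Second document text..."}

# One JSONL line per document; custom_id carries the document ID back out.
with open("embed_batch.jsonl", "w") as f:
    for doc_id, text in corpus.items():
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": text},
        }) + "\n")
# Upload and submit exactly as for chat, but with endpoint="/v1/embeddings".
```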
Generating training datasets (instruction pairs, chain-of-thought traces, paraphrase variants) is a pure batch problem. There is no user, no SLA, and the output feeds downstream training — jobs routinely run for hours regardless of API latency.
The biggest question for cost-sensitive teams: can batch (-50%) stack with prompt caching (-90% on cached reads)?
The short answer is that the discounts interact differently per provider, and the exact treatment of cached tokens inside batch jobs is worth verifying against current docs before budgeting for it [36][37][38]. An industry analysis argues the combined effect of prompt caching + batch + model routing can cut Claude API bills 50 to 95 percent [39][40], with independent academic evaluation showing prompt caching alone reduces costs 41–80% and improves time-to-first-token 13–31% [41]. Stacking compounds these gains.
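Mechanically, stacking on Anthropic means adding `cache_control` breakpoints inside batch request params, as sketched below. Cache hits within a batch are best-effort, so treat the combined discount as an upper bound; model name and context are placeholders.

```python
from anthropic import Anthropic

client = Anthropic()

LONG_SHARED_CONTEXT = "...tens of thousands of tokens of shared instructions..."

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"row-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder model name
                "max_tokens": 64,
                "system": [
                    {
                        "type": "text",
                        "text": LONG_SHARED_CONTEXT,
                        # Marks the shared prefix as cacheable; reads that hit
                        # the cache bill at the reduced caching rate, and the
                        # batch discount applies on top.
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [{"role": "user", "content": f"Classify item {i}."}],
            },
        }
        for i in range(100)
    ]
)
```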
A simple 2×2 on latency tolerance and cost sensitivity:

| | Latency-tolerant | Latency-sensitive |
|---|---|---|
| Cost-sensitive | Batch (50% off) | Prompt caching + smaller models |
| Cost-insensitive | Standard, or Flex for easy savings | Standard / Priority tier |
Rough rule of thumb: if you're spending >$5,000/month on any LLM provider and >30% of that spend is on non-interactive workloads, you are leaving roughly $750+/month on the table by not moving them to batch.
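That rule of thumb is just multiplication; a sketch to check against your own bill (the thresholds are this report's, not a provider's):

```python
def monthly_batch_savings(monthly_spend: float, offline_share: float) -> float:
    """Dollars/month saved by moving the offline share of spend to batch (50% off)."""
    return monthly_spend * offline_share * 0.5

print(monthly_batch_savings(5_000, 0.30))  # 750.0, the break-even example above
```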
Batch APIs are the most boring, most under-used, highest-ROI lever in the LLM cost optimization toolbox. A 50% discount is enormous, the latency tradeoff (minutes-to-hours) is acceptable for the majority of real production workloads, and the migration is usually a one-day effort: change the endpoint, wrap the submission in a polling loop, write results to the same sink you were using before. Combined with prompt caching (up to 90% on cached reads) and the newer middle-tier options like Gemini Flex and OpenAI Flex, the cost curve for non-interactive inference has dropped by an order of magnitude since 2023. For any team running nightly jobs, evaluation sweeps, or embedding backfills, the question in 2026 is not whether to use batch — it's why you aren't already.
[1] OpenAI API Pricing — https://openai.com/api/pricing/
[2] Batch processing — Anthropic docs — https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
[3] Generative AI on Vertex AI — Batch prediction — https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[4] Offline Processing and the Queue Design Nobody Talks About — https://tianpan.co/blog/2026-04-10-batch-llm-pipeline-blind-spot-offline-processing
[5] OpenAI Batch guide — https://platform.openai.com/docs/guides/batch
[6] Introducing the Message Batches API — Anthropic — https://www.anthropic.com/news/message-batches-api
[7] Google Launches New Gemini API Pricing Strategy with Tiered Service Options — https://www.kucoin.com/news/flash/google-launches-new-gemini-api-pricing-strategy-with-tiered-service-options
[8] Alphabet (GOOGL) Stock: Google Unveils Flexible Gemini API Pricing Options — https://blockonomi.com/alphabet-googl-stock-google-unveils-flexible-gemini-api-pricing-options/
[9] GPT-5 is here — OpenAI — https://openai.com/gpt-5/
[10] GPT-5 API pricing & specs — https://cloudprice.net/models/openai-gpt-5
[11] OpenAI API Pricing Calculator — https://redresscompliance.com/openai-api-pricing-calculator.html
[12] Batch API pricing for gpt-4o-2024-08-06 — https://community.openai.com/t/batch-api-pricing-for-gpt-4o-2024-08-06/918686
[13] Claude Opus 4.6 Pricing Guide 2026 — https://blog.laozhang.ai/en/posts/claude-opus-4-6-pricing-subscription-guide
[14] Anthropic Claude API Pricing Guide 2026 — https://curlscape.com/blog/anthropic-claude-api-pricing-guide-2026
[15] Claude API Pricing (April 2026) — https://pecollective.com/tools/anthropic-api-pricing/
[16] Google Gemini 2.5 Pro Pricing (2026) — https://langcopilot.com/llm-pricing/google/gemini-2.5-pro
[17] Gemini Developer API Pricing — https://ai.google.dev/gemini-api/docs/pricing
[18] Google Gemini 2.5 Flash Pricing (2026) — https://langcopilot.com/llm-pricing/google/gemini-2.5-flash
[19] Gemini API Pricing Guide 2025 — https://aifreeapi.com/en/posts/gemini-api-pricing-guide
[20] Gemini Batch API now supports Embeddings and OpenAI Compatibility — https://developers.googleblog.com/en/gemini-batch-api-now-supports-embeddings-and-openai-compatibility/
[21] Pricing discrepancy for embedding models — https://community.openai.com/t/pricing-discrepancy-for-embedding-models-between-pricing-page-and-model-docs/1346972
[22] Gpt-4o Batch Processing Jobs Response Time Increased Significantly — https://community.openai.com/t/gpt-4o-batch-processing-jobs-response-time-increased-significantly-causing-job-timeouts-failures/1110744
[23] Batch processing — Anthropic docs (message-batches) — https://docs.anthropic.com/en/docs/build-with-claude/message-batches
[24] Anthropic Batch API in Production — https://ryan.dotzlaw.com/articles/obsidiannotes/02anthropicbatch/
[25] Batch API Jobs Stuck in PROCESSING for 72+ Hours — https://discuss.ai.google.dev/t/batch-api-jobs-stuck-in-processing-for-72-hours/114081
[26] Batch API is taking longer than 24h (gemini-3.1-pro) — https://discuss.ai.google.dev/t/batch-api-is-taking-longer-than-24h-gemini-3-1-pro/129209
[27] Google Gemini API Introduces Flex, Priority, and Batch Tiers — https://g.wplaybook.com/google-gemini-api-new-inference-tiers-flex-batch/
[28] Google Unveils Flexible Gemini API Pricing With New Flex and Priority Tiers — https://computing.net/news/stocks/google-unveils-flexible-gemini-api-pricing-with-new-flex-and-priority-tiers/
[29] Gemini API optimization and inference — https://ai.google.dev/gemini-api/docs/optimization
[30] Alphabet (GOOGL) Stock: Google Unveils Flexible Gemini API Pricing Options — https://scoopsquare24.com/alphabet-googl-stock-google-unveils-flexible-gemini-api-pricing-options/
[31] Batch processing (100K request / 256 MB limits) — https://docs.anthropic.com/en/docs/build-with-claude/message-batches
[32] A practical guide to the OpenAI Batch API — https://www.eesel.ai/blog/openai-batch-api
[33] 10,000 Tasks, One Request, Half the Cost — https://www.activelogic.com/insights/10000-tasks-one-request-half-the-cost/
[34] Evaluation with OpenAI's Batch API — https://medium.com/@mmonishp147/evaluation-with-openais-batch-api-a-cost-effective-approach-8ea6a444cf4b
[35] Tutorial: Batch embedding with OpenAI API — https://medium.com/@mikehpg/tutorial-batch-embedding-with-openai-api-95da95c9778a
[36] Regarding the Issue of Half-Priced Prompt Caching — https://community.openai.com/t/regarding-the-issue-of-half-priced-prompt-caching/990681
[37] Gemini API Batch vs Context Caching — https://yingtu.ai/en/blog/gemini-api-batch-vs-caching
[38] Vertex AI batch prediction cached-token discount — https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[39] How to cut Claude API costs by up to 95 percent — https://amitkoth.com/reduce-claude-api-costs/
[40] AI API Cost Reduction: Prompt Caching and Routing — https://redresscompliance.com/ai-api-cost-reduction-prompt-caching-routing.html
[41] An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks — https://arxiv.org/html/2601.06007v2
[42] Critical Issues with Batch API: Detailed Report — https://community.openai.com/t/critical-issues-with-batch-api-detailed-report-and-observations/1245120
[43] The GPT-4o Batch API has been extremely slow — https://community.openai.com/t/the-gpt-4o-batch-api-has-been-extremely-slow/1111011