Batch APIs and Async Discounts: The 50%-Off Lane for LLM Workloads

1. Executive Summary

Every major frontier model provider — OpenAI, Anthropic, and Google — now ships a dedicated batch / asynchronous inference tier that trades real-time latency for a flat 50% discount on both input and output tokens [1][2][3]. What started as a quiet OpenAI feature in 2024 has become a standard pricing lane in 2025–2026: submit a job, wait minutes-to-hours, pay half.

For workloads where a response can wait — nightly summarization, evaluation harnesses, bulk classification, embeddings backfills, dataset labeling — batch APIs are often the single largest cost lever available, bigger than model-tier downgrades and competitive with prompt caching. One industry analysis estimates that in typical production LLM bills, 60%+ of spend goes to offline jobs (nightly summarization, enrichment, classification) that nobody reads in real time [4]. Moving that traffic to batch halves the bill with essentially no code change beyond a different endpoint.

This report covers the three major batch offerings, concrete per-million-token pricing, latency tradeoffs, stacking with prompt caching, and when batch is the right (and wrong) tool.

2. The 50% Rule: A Cross-Provider Convergence

All three providers landed on the same headline number: -50% versus the synchronous/standard tier, applied to both input and output tokens.

| Provider | Batch/Async Product | Discount | SLA |
|---|---|---|---|
| OpenAI | Batch API | 50% off standard [1][5] | 24 hours |
| Anthropic | Message Batches API | 50% off standard [2][6] | 24 hours (most < 1 hour) |
| Google | Gemini Batch API / Batch tier | 50% off standard [3][7] | Up to 24 hours |

The OpenAI pricing page lists the batch column explicitly as "Standard / Batch -50% / Data residency +10%", confirming the discount is a flat adjuster applied across essentially the entire model catalog [1]. Anthropic makes the same claim in its official product announcement: "Each batch is processed in less than 24 hours and costs 50% less than standard API calls" [6]. Google's Vertex AI documentation states the same: "Batch processing is offered at a 50% discounted rate compared to real-time inference" [3].

2.1 Why exactly 50%?

The symmetry is not a coincidence. Batch inference lets providers schedule work against spare GPU capacity during off-peak windows, smoothing utilization curves that would otherwise be dominated by daytime chat traffic. Google explicitly describes the Flex tier (a close cousin of Batch) as "leveraging underutilized computing resources during off-peak periods" in exchange for 1–15 minute latency [8]. Batch takes the same tradeoff further: looser SLA, same 50% off.

Providers compete heavily on headline per-token pricing for real-time tiers; competing on batch rates would signal capacity oversupply. 50% has become a Schelling point — large enough to change buying behavior, consistent across providers, easy to reason about.

3. Concrete Numbers: Batch Pricing by Model

Batch pricing is always derived: take the standard per-million rate and halve it. The absolute dollar figures below use publicly listed standard pricing from late-2025 / early-2026 snapshots.

3.1 OpenAI GPT-5 family

Standard GPT-5 pricing is $1.25 input / $10.00 output per million tokens [9][10]. Under Batch:

| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| GPT-5 | $1.25 / $10.00 | $0.625 / $5.00 [11] |
| GPT-5 mini | $0.25 / $2.00 [10] | $0.125 / $1.00 |
| GPT-5 nano | ~$0.05 / $0.40 | ~$0.025 / $0.20 |
| GPT-4o (legacy) | $2.50 / $10.00 | $1.25 / $5.00 [12] |

A Redress Compliance analysis confirms the math: "GPT-5 is OpenAI's most capable model at $1.25 per million input tokens (standard) or as low as $0.625 with batch API" [11].

Note: Historically there was one reported pricing discrepancy for gpt-4o-2024-08-06 where batch output tokens were listed at $7.50/M rather than the expected $5.00/M (a 25% rather than 50% discount); OpenAI corrected this after community reports [12]. In general, verify the live pricing page before committing production budget.

3.2 Anthropic Claude family

Anthropic's headline batch rates for Claude Opus / Sonnet / Haiku follow the same 50% rule [13][14]:

| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| Claude Opus 4.x | $15 / $75 (historical) → $5 / $25 (Opus 4.6) [13] | $2.50 / $12.50 |
| Claude Sonnet 4.x | $3 / $15 [14] | $1.50 / $7.50 |
| Claude Haiku 4.5 | $1 / $5 [14] | $0.50 / $2.50 [15] |

An independent pricing tracker records Haiku batch pricing precisely as "Batch API: $0.50/$2.50 per 1M tokens (50% off)" [15].

3.3 Google Gemini family

Gemini 2.5 Pro standard pricing is $1.25 input / $10.00 output per 1M tokens (≤200K prompts) and $2.50 / $15.00 for prompts >200K [16][17]. Gemini 2.5 Flash is $0.30 / $2.50 per 1M [18].

| Model | Standard I/O (per 1M) | Batch I/O (per 1M) |
|---|---|---|
| Gemini 2.5 Pro | $1.25 / $10.00 | $0.625 / $5.00 [19] |
| Gemini 2.5 Flash | $0.30 / $2.50 | $0.15 / $1.25 |
| Gemini 2.5 Flash-Lite | $0.10 / $0.40 | $0.05 / $0.20 |
| Embeddings (text-embedding) | $0.15 per 1M input | $0.075 per 1M input [20] |

Google announced the Batch-mode embedding discount explicitly: "leverage the model with the Batch API at much higher rate limits and at half the price — $0.075 per 1M input tokens" [20].

3.4 Dollar-weight comparison at 10B tokens

A realistic backfill of 10 billion input tokens through Gemini 2.5 Pro standard costs $12,500. The same job through Batch costs $6,250 — a $6,250 saving for code that waits until morning. Scale to a 50B-token knowledge-base embedding refresh on text-embedding-3-large at $0.13/1M standard, and batch flips that from $6,500 to ~$3,250 [21].
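The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch of the calculations above (prices are hard-coded snapshots from this section's tables, not live values; verify the pricing pages before budgeting):

```python
# Sanity-check the batch savings quoted above.
# Prices are per 1M input tokens, copied from the tables in this section.
PRICES_PER_M = {
    "gemini-2.5-pro-standard": 1.25,
    "text-embedding-3-large-standard": 0.13,
}
BATCH_DISCOUNT = 0.50  # flat 50% off input and output tokens

def job_cost(tokens: int, price_per_m: float, batch: bool = False) -> float:
    """Dollar cost of pushing `tokens` input tokens at `price_per_m` per 1M."""
    rate = price_per_m * (1 - BATCH_DISCOUNT) if batch else price_per_m
    return tokens / 1_000_000 * rate

# 10B-token backfill on Gemini 2.5 Pro: $12,500 standard vs $6,250 batch
print(job_cost(10_000_000_000, PRICES_PER_M["gemini-2.5-pro-standard"]))
print(job_cost(10_000_000_000, PRICES_PER_M["gemini-2.5-pro-standard"], batch=True))

# 50B-token embedding refresh on text-embedding-3-large: $6,500 vs $3,250
print(job_cost(50_000_000_000, PRICES_PER_M["text-embedding-3-large-standard"]))
print(job_cost(50_000_000_000, PRICES_PER_M["text-embedding-3-large-standard"], batch=True))
```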

4. Latency Tradeoffs: What Does "Async" Actually Cost?

The 50% discount is free only if the latency budget permits it. Real-world observed latencies diverge substantially from the advertised SLA.

4.1 Advertised vs observed

| Provider | Advertised SLA | Typical observed | Pathological cases |
|---|---|---|---|
| OpenAI Batch | 24 hours | Minutes to hours when load is light; ~2 hours common for nightly jobs [22] | 16-hour completions and 24-hour timeouts during high-load periods [22] |
| Anthropic Batches | 24 hours; "most < 1 hour" [2][23] | Usually < 1 hour for small batches | One production report documented 4+ hour completion times with no per-item progress and no cancellation support [24] |
| Gemini Batch | Up to 24 hours | Hours | Developer reports of jobs stuck > 72 hours in PROCESSING [25]; occasional > 24 h runs on gemini-3.1-pro [26] |

Key implication: treat "24 hours" as the actual SLA, not the long tail. Build pipelines that tolerate delays beyond 24 hours (retry loops, cancel-and-resubmit paths) rather than assuming the typical 30-minute turnaround.
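One way to encode that defensively, sketched here with the OpenAI Python SDK (the local deadline and the cancel-and-resubmit policy are assumptions to adapt, not provider guidance):

```python
import time
from openai import OpenAI

client = OpenAI()

def await_batch(batch_id: str, deadline_s: int = 30 * 3600, poll_s: int = 300) -> str:
    """Poll an already-submitted OpenAI batch; cancel it if it blows past our own
    deadline so the caller can resubmit or fail over to a synchronous tier."""
    start = time.time()
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            return batch.output_file_id
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"batch ended in status {batch.status!r}")
        if time.time() - start > deadline_s:
            client.batches.cancel(batch_id)
            raise TimeoutError("batch exceeded local deadline; resubmit or fall back")
        time.sleep(poll_s)
```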

4.2 Middle tiers closing the latency gap

In April 2026 Google restructured Gemini's pricing into five tiers — Standard, Flex, Priority, Batch, Caching — giving developers a spectrum rather than a binary [27][28]:

  • Priority: +75% to +100% over Standard, millisecond-to-second latency, for mission-critical real-time [27][29].
  • Standard: the reference rate.
  • Flex: -50%, 1–15 minute latency, synchronous endpoint (no job management) [28][30].
  • Batch: -50%, up to 24h, asynchronous job submission [27].
  • Caching: separate discount of up to 90% on cached prefix tokens.

Flex is important because it removes the main operational tax of batch: file uploads, polling, result parsing. You call the same synchronous endpoint with a flex parameter and accept a 1–15 minute latency window in exchange for the same 50% discount. One summary of the announcement phrases it cleanly: "unlike the Batch tier, Flex operates strictly on a synchronous processing model (like the standard API), allowing you to handle background jobs cheaply without needing to overhaul your entire code architecture" [30].

OpenAI also ships a Flex processing tier on reasoning models with a similar promise of cheaper non-realtime inference, while Anthropic's Priority Tier addresses the opposite end of the spectrum: paying more for guaranteed capacity rather than less for delay tolerance.
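On the OpenAI side, flex processing is exposed as a request parameter rather than a separate job API. A minimal sketch, assuming the chosen model supports flex (availability is limited to certain, mostly reasoning, models; the model name here is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Flex processing: same synchronous call shape, looser latency, lower price.
resp = client.chat.completions.create(
    model="o4-mini",  # placeholder; check which models offer flex
    messages=[{"role": "user", "content": "Summarize yesterday's error logs."}],
    service_tier="flex",
)
print(resp.choices[0].message.content)
```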

5. Hard Limits and Constraints

Batch APIs are not unbounded. Per-batch payload and request count limits shape how you architect jobs.

  • Anthropic Message Batches: up to 100,000 message requests or 256 MB per batch, whichever is hit first [31]. The original launch blog advertised 10,000 requests per batch [6], and current docs confirm the newer 100,000 ceiling.
  • OpenAI Batch: uses uploaded JSONL files; limits are enforced at the Files API and batch-job level (commonly cited as up to 200 MB per input file and tens of thousands of requests per batch, with enqueued-token quotas varying by model). The core contract: submit a file of requests, poll, get a response file within 24h [32].
  • Gemini Batch: supports embeddings as of late 2025 [20] and an OpenAI-compatible surface, making migration from OpenAI batch pipelines largely mechanical.

Every provider supports multi-turn conversations, tool use, system prompts, and vision inputs inside batch requests — batch does not mean feature-reduced. Anthropic explicitly notes "each request is a standard Messages API call, so it can include system messages, multi-turn conversations, tool use definitions, and vision inputs" [33].
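As a concrete illustration of the OpenAI contract described above, a minimal submission sketch (the model name and prompts are placeholders; each JSONL line carries an ordinary Chat Completions payload):

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request; the body is a standard Chat Completions payload.
requests = [
    {
        "custom_id": f"ticket-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",  # placeholder; use whatever model you batch
            "messages": [
                {"role": "system", "content": "Classify the support ticket."},
                {"role": "user", "content": text},
            ],
        },
    }
    for i, text in enumerate(["Ticket text 1", "Ticket text 2"])
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload the file, then create the batch job against it.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the advertised completion window
)
print(batch.id, batch.status)
```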

6. Use Cases: Where Batch Dominates

The canonical batch use cases cluster around three patterns.

6.1 Bulk data enrichment and classification

Categorizing millions of support tickets, tagging product catalogs, extracting entities from archival documents, moderating user-generated content. These are the textbook batch workloads — high volume, no user waiting, cost-dominated. Tianpan's 2026 writeup frames the problem starkly: teams "obsess over time-to-first-token... then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time" [4].

6.2 Evaluation and benchmarking

Running an eval harness against 10,000 prompts to validate a prompt change or model upgrade is embarrassingly parallel and latency-insensitive. A tutorial on OpenAI's Batch API for evaluation notes "guaranteed completion: all requests completed within 24 hours (usually much faster)" and halved cost, making it feasible to sweep over prompt variants that would be unaffordable in real time [34].

6.3 Embeddings backfills

Refreshing a vector index over a full document corpus is perhaps the highest-ROI batch workload: purely input-token-bound, no output generation cost, and vector DBs don't care when the vectors land. A Medium tutorial on batch embedding notes "batch jobs are half the price of individual API calls, which is beneficial if you need to embed a large amount of text" [35]. With Gemini's batch embeddings at $0.075/1M [20] and OpenAI's text-embedding-3-small at roughly $0.01 per 1M under batch (half the $0.02/1M list rate) [21], a 100-billion-token embedding run on text-embedding-3-small costs on the order of $1,000 rather than $2,000 — a modest line item at current infra budgets.
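Because the OpenAI Batch API also accepts the embeddings endpoint, the same JSONL pattern covers a backfill; a short sketch (corpus iteration, chunking, and model choice are placeholders):

```python
import json

# Each line targets /v1/embeddings instead of the chat endpoint; the rest of
# the batch lifecycle (upload, create, poll, download) is identical.
def embedding_lines(chunks: list[str], model: str = "text-embedding-3-small"):
    for i, chunk in enumerate(chunks):
        yield json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": chunk},
        })

with open("embed_batch.jsonl", "w") as f:
    for line in embedding_lines(["chunk one", "chunk two"]):
        f.write(line + "\n")
```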

6.4 Synthetic data generation and fine-tuning prep

Generating training datasets (instruction pairs, chain-of-thought traces, paraphrase variants) is a pure batch problem. There is no user, no SLA, and the output feeds downstream training — jobs routinely run for hours regardless of API latency.

6.5 When NOT to use batch

  • Anything user-facing with a response expectation — chat, agent steps in an interactive loop, search query augmentation.
  • Tight feedback loops during development — waiting 30 minutes for a 5-request debugging iteration destroys productivity. Flex or Priority tiers fit better.
  • Work with hard deadlines inside a 1-hour window — batch SLAs skew long-tailed; a 15-minute deadline is unsafe even when average latency is 5 minutes.
  • Sequential multi-step agent chains — each step's input depends on the previous step's output, so batching provides no parallelism gain; the 50% discount is real but rarely worth the 24h cumulative wait.

7. Stacking Batch with Prompt Caching

The biggest question for cost-sensitive teams: can batch (-50%) stack with prompt caching (-90% on cached reads)?

  • OpenAI: Batch applies a flat 50% on total tokens per completed request. An OpenAI staff answer on the community forum clarifies: "If your use case is asynchronous, consider using the Batch API. This option will incur only 50% of the cost for the total token count per successfully completed request, regardless of size" [36]. Cached input discount and batch discount generally do not compound multiplicatively on the same tokens on OpenAI; behavior depends on model and should be verified per-model.
  • Anthropic: Prompt caching is available inside batch requests (see anthropics/anthropic-sdk-typescript issue #553 for implementation details), but exact stacking depends on whether the cache prefix was written by a prior standard-tier call.
  • Google Gemini: A cost-optimization analysis reports that Batch (50%) and Context Caching (up to 90%) can combine, with the net effect being multiplicative on the cached portion [37]. The Vertex docs confirm "90% discount on cached tokens compared to standard input tokens" [38], making the optimal strategy — cache your system prompt + long context, then hit the batch endpoint — capable of reducing an input-heavy workload by more than 90% relative to uncached standard calls.

An industry analysis argues the combined effect of prompt caching + batch + model routing can cut Claude API bills 50 to 95 percent [39][40], with independent academic evaluation showing prompt caching alone reduces costs 41–80% and improves time-to-first-token 13–31% [41]. Stacking compounds these gains.
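For Anthropic, each request inside a batch is a standard Messages payload, so cache breakpoints ride along unchanged. A hedged sketch of the request shape (the model id is a placeholder; whether the cache write is billed at batch rates, and whether a prefix written by a standard-tier call is reused, should be verified against current docs):

```python
from anthropic import Anthropic

client = Anthropic()

LONG_SHARED_CONTEXT = "...tens of thousands of tokens of shared policy text..."

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder model id
                "max_tokens": 512,
                "system": [
                    {
                        "type": "text",
                        "text": LONG_SHARED_CONTEXT,
                        # Mark the shared prefix as cacheable so repeated
                        # requests in the batch can hit the prompt cache.
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [{"role": "user", "content": f"Classify document {i}."}],
            },
        }
        for i in range(3)
    ]
)
print(batch.id, batch.processing_status)
```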

8. Operational Gotchas

  1. No per-item progress on most batch APIs. Anthropic in particular exposes batch-level status only; for jobs with 50K+ requests this makes progress estimation hard [24].
  2. Cancellation is limited. On Anthropic you cannot cancel individual items mid-batch; you cancel the entire batch [24].
  3. Failures are opaque. OpenAI's dashboard usage counts have diverged from actual success/failure counts returned by the API in at least one documented incident (API said 885 total / 758 successful / 97 failed; dashboard said 1,838 total) [42]. Track success via the result file, not the console (a small tally sketch follows this list).
  4. Billing is per successful request. Failed requests typically do not incur charges, but partial completions (stream cut off, content-filter triggers) may still be billed.
  5. Capacity contention is real. OpenAI users reported nightly batch jobs that historically completed in 2 hours spiking to 16 hours or failing entirely around the release of new models — i.e., when new-model inference ate shared capacity [43].
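Gotchas 3 and 4 are worth automating. A small tally over the result files themselves, using the OpenAI SDK (OpenAI splits successes and errors into separate output files; field availability may vary by batch state):

```python
from openai import OpenAI

client = OpenAI()

def tally_results(batch_id: str) -> dict:
    """Count successes and failures from the batch's result files, not the console."""
    batch = client.batches.retrieve(batch_id)
    counts = {"succeeded": 0, "failed": 0}
    # Successful responses land in output_file_id; errored requests in error_file_id.
    if batch.output_file_id:
        counts["succeeded"] = len(client.files.content(batch.output_file_id).text.splitlines())
    if batch.error_file_id:
        counts["failed"] = len(client.files.content(batch.error_file_id).text.splitlines())
    return counts
```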

9. Decision Framework

A simple 2×2 on latency tolerance and cost sensitivity:

  • Latency < 1s AND cost-sensitive → Standard tier with aggressive prompt caching; downgrade model tier (e.g., Haiku/Flash/Mini variants) before touching batch.
  • Latency < 15 min AND cost-sensitive → Gemini Flex or OpenAI Flex (-50%, synchronous). Easiest migration path.
  • Latency < 24 hours AND cost-sensitive → Batch API on all three providers. Half the cost of standard, highest throughput ceiling.
  • Latency < 100ms AND reliability-critical → Priority / Standard tier; batch is the wrong tool.

Rough rule of thumb: if you're spending >$5,000/month on any LLM provider and >30% of that spend is on non-interactive workloads, you are leaving roughly $750+/month on the table by not moving them to batch.
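The rule of thumb reduces to one multiplication, and the 2×2 above maps cleanly onto a tier choice. A toy estimator (the tier mapping is a heuristic restating the list above, not provider guidance):

```python
def pick_tier(latency_budget_s: float, cost_sensitive: bool) -> str:
    """Toy mapping of the 2x2 decision framework onto a tier name."""
    if latency_budget_s < 1:
        return "standard + prompt caching" if cost_sensitive else "priority/standard"
    if latency_budget_s <= 15 * 60:
        return "flex (-50%, synchronous)"
    return "batch (-50%, async, 24h SLA)"

def monthly_batch_savings(monthly_spend: float, offline_share: float) -> float:
    """Estimated savings from moving the offline share of spend to batch (-50%)."""
    return monthly_spend * offline_share * 0.5

print(pick_tier(8 * 3600, cost_sensitive=True))   # -> batch
print(monthly_batch_savings(5_000, 0.30))          # -> 750.0
```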

10. Conclusion

Batch APIs are the most boring, most under-used, highest-ROI lever in the LLM cost optimization toolbox. A 50% discount is enormous, the latency tradeoff (minutes-to-hours) is acceptable for the majority of real production workloads, and the migration is usually a one-day effort: change the endpoint, wrap the submission in a polling loop, write results to the same sink you were using before. Combined with prompt caching (up to 90% on cached reads) and the newer middle-tier options like Gemini Flex and OpenAI Flex, the cost curve for non-interactive inference has dropped by an order of magnitude since 2023. For any team running nightly jobs, evaluation sweeps, or embedding backfills, the question in 2026 is not whether to use batch — it's why you aren't already.


References

[1] OpenAI API Pricing — https://openai.com/api/pricing/
[2] Batch processing — Anthropic docs — https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
[3] Generative AI on Vertex AI — Batch prediction — https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[4] Offline Processing and the Queue Design Nobody Talks About — https://tianpan.co/blog/2026-04-10-batch-llm-pipeline-blind-spot-offline-processing
[5] OpenAI Batch guide — https://platform.openai.com/docs/guides/batch
[6] Introducing the Message Batches API — Anthropic — https://www.anthropic.com/news/message-batches-api
[7] Google Launches New Gemini API Pricing Strategy with Tiered Service Options — https://www.kucoin.com/news/flash/google-launches-new-gemini-api-pricing-strategy-with-tiered-service-options
[8] Alphabet (GOOGL) Stock: Google Unveils Flexible Gemini API Pricing Options — https://blockonomi.com/alphabet-googl-stock-google-unveils-flexible-gemini-api-pricing-options/
[9] GPT-5 is here — OpenAI — https://openai.com/gpt-5/
[10] GPT-5 API pricing & specs — https://cloudprice.net/models/openai-gpt-5
[11] OpenAI API Pricing Calculator — https://redresscompliance.com/openai-api-pricing-calculator.html
[12] Batch API pricing for gpt-4o-2024-08-06 — https://community.openai.com/t/batch-api-pricing-for-gpt-4o-2024-08-06/918686
[13] Claude Opus 4.6 Pricing Guide 2026 — https://blog.laozhang.ai/en/posts/claude-opus-4-6-pricing-subscription-guide
[14] Anthropic Claude API Pricing Guide 2026 — https://curlscape.com/blog/anthropic-claude-api-pricing-guide-2026
[15] Claude API Pricing (April 2026) — https://pecollective.com/tools/anthropic-api-pricing/
[16] Google Gemini 2.5 Pro Pricing (2026) — https://langcopilot.com/llm-pricing/google/gemini-2.5-pro
[17] Gemini Developer API Pricing — https://ai.google.dev/gemini-api/docs/pricing
[18] Google Gemini 2.5 Flash Pricing (2026) — https://langcopilot.com/llm-pricing/google/gemini-2.5-flash
[19] Gemini API Pricing Guide 2025 — https://aifreeapi.com/en/posts/gemini-api-pricing-guide
[20] Gemini Batch API now supports Embeddings and OpenAI Compatibility — https://developers.googleblog.com/en/gemini-batch-api-now-supports-embeddings-and-openai-compatibility/
[21] Pricing discrepancy for embedding models — https://community.openai.com/t/pricing-discrepancy-for-embedding-models-between-pricing-page-and-model-docs/1346972
[22] Gpt-4o Batch Processing Jobs Response Time Increased Significantly — https://community.openai.com/t/gpt-4o-batch-processing-jobs-response-time-increased-significantly-causing-job-timeouts-failures/1110744
[23] Batch processing — Anthropic docs (message-batches) — https://docs.anthropic.com/en/docs/build-with-claude/message-batches
[24] Anthropic Batch API in Production — https://ryan.dotzlaw.com/articles/obsidiannotes/02anthropicbatch/
[25] Batch API Jobs Stuck in PROCESSING for 72+ Hours — https://discuss.ai.google.dev/t/batch-api-jobs-stuck-in-processing-for-72-hours/114081
[26] Batch API is taking longer than 24h (gemini-3.1-pro) — https://discuss.ai.google.dev/t/batch-api-is-taking-longer-than-24h-gemini-3-1-pro/129209
[27] Google Gemini API Introduces Flex, Priority, and Batch Tiers — https://g.wplaybook.com/google-gemini-api-new-inference-tiers-flex-batch/
[28] Google Unveils Flexible Gemini API Pricing With New Flex and Priority Tiers — https://computing.net/news/stocks/google-unveils-flexible-gemini-api-pricing-with-new-flex-and-priority-tiers/
[29] Gemini API optimization and inference — https://ai.google.dev/gemini-api/docs/optimization
[30] Alphabet (GOOGL) Stock: Google Unveils Flexible Gemini API Pricing Options — https://scoopsquare24.com/alphabet-googl-stock-google-unveils-flexible-gemini-api-pricing-options/
[31] Batch processing (100K request / 256 MB limits) — https://docs.anthropic.com/en/docs/build-with-claude/message-batches
[32] A practical guide to the OpenAI Batch API — https://www.eesel.ai/blog/openai-batch-api
[33] 10,000 Tasks, One Request, Half the Cost — https://www.activelogic.com/insights/10000-tasks-one-request-half-the-cost/
[34] Evaluation with OpenAI's Batch API — https://medium.com/@mmonishp147/evaluation-with-openais-batch-api-a-cost-effective-approach-8ea6a444cf4b
[35] Tutorial: Batch embedding with OpenAI API — https://medium.com/@mikehpg/tutorial-batch-embedding-with-openai-api-95da95c9778a
[36] Regarding the Issue of Half-Priced Prompt Caching — https://community.openai.com/t/regarding-the-issue-of-half-priced-prompt-caching/990681
[37] Gemini API Batch vs Context Caching — https://yingtu.ai/en/blog/gemini-api-batch-vs-caching
[38] Vertex AI batch prediction cached-token discount — https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini
[39] How to cut Claude API costs by up to 95 percent — https://amitkoth.com/reduce-claude-api-costs/
[40] AI API Cost Reduction: Prompt Caching and Routing — https://redresscompliance.com/ai-api-cost-reduction-prompt-caching-routing.html
[41] An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks — https://arxiv.org/html/2601.06007v2
[42] Critical Issues with Batch API: Detailed Report — https://community.openai.com/t/critical-issues-with-batch-api-detailed-report-and-observations/1245120
[43] The GPT-4o Batch API has been extremely slow — https://community.openai.com/t/the-gpt-4o-batch-api-has-been-extremely-slow/1111011
