
Model routing and cascading have emerged as the single most tractable lever for controlling LLM inference costs in production. The thesis is simple: a two-order-of-magnitude price gap separates frontier models from efficient ones — Claude Opus 4.6 at $5/$25 per million input/output tokens versus Claude Haiku 4.5 at $1/$5, GPT-5.2 Pro at $21/$168 versus GPT-5 nano at $0.05/$0.40, and DeepSeek V3.2 at $0.28/$0.42 [1][2]. Yet for 60–70% of production traffic — classification, extraction, short answers, and routine summarization — the cheap tier produces output that is functionally indistinguishable from the expensive one [3]. Routing simple queries to cheap models and escalating only the hard ones captures that gap. Reported production savings cluster in the 30–70% range for mixed workloads, with benchmark-level numbers reaching 85% (RouteLLM on MT Bench) and 98% (FrugalGPT on classification) under favorable conditions [4][5][6].
This document covers the academic foundations (FrugalGPT, RouteLLM), the production gateway ecosystem (Portkey, LiteLLM, OpenRouter, Martian, vLLM Semantic Router), concrete cost math with 2025–2026 pricing, and the failure modes that separate teams that capture the savings from teams that introduce silent quality regressions.
As of late 2025/early 2026, the pricing landscape has settled into a clear four-tier shape, with a roughly 100× spread between the cheapest and most expensive models [1][2][7]:
| Tier | Representative Models | Input $/M | Output $/M |
|---|---|---|---|
| Premium | Claude Opus 4.6, GPT-5.2 Pro, o3-pro | $5–$21 | $25–$168 |
| Mid | Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro | $2–$3 | $12–$15 |
| Budget | Claude Haiku 4.5, GPT-5 mini, Gemini 2.0 Flash | $0.25–$1 | $1.25–$5 |
| Ultra-budget | DeepSeek V3.2, Gemini 2.0 Flash-Lite, Mistral Nemo | $0.02–$0.28 | $0.04–$0.42 |
DeepSeek V3.2 notably halved its prices in late 2025 and offers a 90% cache-hit discount that can bring effective input cost to $0.028/M [1]. The practical implication: if even 30% of queries can be served by a $0.28/M model instead of a $15/M model, the cost of that slice drops by a factor of roughly 50 — and for typical chatbot, extraction, and classification workloads the shiftable fraction is far higher than 30%.
Chen, Zaharia, and Zou's FrugalGPT (Stanford, 2023; updated TMLR 2024) formalized the LLM cascade pattern and set the ceiling numbers the rest of the field benchmarks against [5][6]. The cascade strategy sends a query to a list of LLM APIs in ascending price order; each response is scored by a lightweight scorer model, and the pipeline stops as soon as the scorer's confidence exceeds a per-task threshold. Only the residual hard queries reach the most expensive model.
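A minimal sketch of that cascade loop, assuming hypothetical `call_model` and `score_answer` stand-ins (the paper trains a lightweight scorer per task); model names and thresholds are illustrative, not FrugalGPT's actual configuration:

```python
# Minimal cheap-first cascade in the spirit of FrugalGPT. `call_model` and
# `score_answer` are illustrative stubs: in practice the first wraps a
# provider API and the second is a small trained per-task scorer.

def call_model(model: str, query: str) -> str:
    # Stub: replace with a real chat-completion call.
    return f"[{model}] answer to: {query}"

def score_answer(query: str, answer: str) -> float:
    # Stub: FrugalGPT uses a lightweight trained scorer that returns a
    # correctness confidence in [0, 1].
    return 0.95

# Tiers in ascending price order, each with its acceptance threshold.
CASCADE = [
    ("deepseek-v3.2", 0.90),
    ("claude-haiku-4.5", 0.80),
    ("claude-sonnet-4.6", 0.0),  # final tier: always accepted
]

def cascade_answer(query: str) -> tuple[str, str]:
    """Try models cheapest-first; stop once the scorer clears the threshold."""
    for model, threshold in CASCADE:
        answer = call_model(model, query)
        if score_answer(query, answer) >= threshold:
            return model, answer
    return model, answer  # unreachable while the last threshold is 0.0
```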
Headline FrugalGPT results on the HEADLINES, OVERRULING, and COQA benchmarks [5][6]:
- Matches the accuracy of the best individual model (GPT-4 at time of publication) with up to 98% cost reduction.
- At matched cost, improves accuracy over GPT-4 by up to 4%.
The non-obvious finding: cascades can beat the frontier model outright, because a cheap-first cascade exploits model diversity. Different models fail on different queries; when a strong scorer accepts the cheap model's answer when it is correct and escalates only when it is wrong, the cascade's accuracy ceiling exceeds that of any single component.
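A toy probability check of that ceiling, assuming an oracle scorer and illustrative accuracies:

```python
# Toy numbers, assuming an oracle scorer that always recognizes a correct
# cheap answer. The cascade accepts the cheap model's answer when right and
# escalates otherwise, so an error requires BOTH models to miss.
p_cheap = 0.80                      # cheap model's standalone accuracy
p_frontier = 0.90                   # frontier model's standalone accuracy
p_frontier_on_residual = 0.70       # frontier accuracy on cheap-wrong queries

cascade_acc = p_cheap + (1 - p_cheap) * p_frontier_on_residual
print(cascade_acc)  # 0.94 -- above the frontier model's standalone 0.90
```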
FrugalGPT's limitation is task specificity: the scorer has to be trained (or carefully prompted) per task, so the gains attenuate on open-ended generation where "correctness" is hard to score. This is exactly the gap RouteLLM targets.
RouteLLM (UC Berkeley / LMSYS, ICLR 2025) reframes cascading as a pre-call classification problem: train a BERT-scale router (~110M parameters) on Chatbot Arena preference data to predict, before any LLM call, whether a cheap model will suffice [4][8][9]. It is the canonical open-source reference implementation — a drop-in replacement for OpenAI's client that routes between a strong and weak model based on a tunable cost threshold.
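A usage sketch following the quick-start in the RouteLLM repository; the strong/weak model names and the 0.11593 threshold mirror the README's example and should be recalibrated per workload:

```python
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."  # key for the strong-model provider

# Controller mimics the OpenAI client; routers=["mf"] loads the
# matrix-factorization router. Model names follow the repo's examples.
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The model string encodes router and cost threshold: lower thresholds push
# more traffic to the weak model. 0.11593 is the README's example calibration.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
```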
Four router architectures are provided: similarity-weighted ranking, matrix factorization, BERT classifier, and causal LLM classifier. Benchmark numbers against GPT-4 (strong) and Mixtral 8x7B (weak), while retaining 95% of GPT-4 performance [4][8][9]:
- MT Bench: over 85% cost reduction.
- MMLU: 45% cost reduction.
- GSM8K: 35% cost reduction.
The paper estimates average GPT-4 cost at $24.7/M tokens and Mixtral 8x7B at $0.24/M tokens — a ~100× ratio that makes even imperfect routing lucrative [8]. The router itself adds only 10–30ms latency and <0.4% extra cost [3][8].
At ICLR 2025, the matrix factorization router demonstrated 95% of GPT-4 quality while routing only 26% of queries to the expensive model; with data augmentation that dropped to 14%, a 75% cost reduction [10].
Academic benchmarks over-represent easy cases. Production deployments reveal a tighter but still substantial band of savings, typically 30–70% on mixed workloads [3][11][12][13].
Four categories of infrastructure have matured around this pattern in 2025–2026: unified API gateways (Portkey, LiteLLM), model marketplaces (OpenRouter), learned routing services (Martian), and inference-side semantic routers (vLLM Semantic Router) [14][15][16][17][18].
Across the gateway ecosystem, the 30–70% savings band comes from a stack of compounding techniques rather than a single trick: model routing compounds with response caching (cf. DeepSeek's 90% cache-hit discount above) and cross-provider failover [12][18][19].
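As one concrete example of the gateway layer, here is a sketch using LiteLLM's Router with a primary deployment and a cheaper fallback. The model identifiers are illustrative, and the exact parameter shapes are an assumption to verify against LiteLLM's current documentation:

```python
# Sketch of the gateway pattern with LiteLLM's Router: one logical model
# name fans out to a primary deployment, with a cheaper deployment as the
# failover target. Model IDs below are illustrative.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "chat-default",
            "litellm_params": {"model": "anthropic/claude-sonnet-4-6"},
        },
        {
            "model_name": "chat-cheap",
            "litellm_params": {"model": "anthropic/claude-haiku-4-5"},
        },
    ],
    # On failure of chat-default, retry on the cheaper deployment.
    fallbacks=[{"chat-default": ["chat-cheap"]}],
)

response = router.completion(
    model="chat-default",
    messages=[{"role": "user", "content": "Classify this support ticket."}],
)
```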
Take a hypothetical chatbot serving 10M requests/month at ~1K tokens per request, with a 3:1 input-heavy token split. That is 10B tokens/month: 7.5B input (7,500 MTok) and 2.5B output (2,500 MTok).
Baseline: all traffic on Claude Sonnet 4.6 ($3/$15 per MTok) [2][7]: 7,500 MTok × $3 + 2,500 MTok × $15 = $22,500 + $37,500 = $60,000/month.
Two-tier router: 70% Haiku 4.5 ($1/$5), 30% Sonnet 4.6: the Haiku slice costs 5,250 MTok × $1 + 1,750 MTok × $5 = $14,000 and the Sonnet slice 2,250 MTok × $3 + 750 MTok × $15 = $18,000, for a total of $32,000/month, a 47% saving.
Three-tier router: 50% DeepSeek V3.2 ($0.28/$0.42), 40% Haiku 4.5, 10% Sonnet 4.6: the DeepSeek slice costs 3,750 MTok × $0.28 + 1,250 MTok × $0.42 = $1,575, the Haiku slice $8,000, and the Sonnet slice $6,000, for a total of $15,575/month, a 74% saving (reproduced in the sketch below).
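A few lines of Python reproduce the blended-cost arithmetic and make other traffic mixes easy to test; prices follow the tier table above and the 3:1 split follows the workload assumption:

```python
# Reproduces the blended-cost arithmetic above. Prices are $/MTok.
PRICES = {  # (input $/MTok, output $/MTok), from the pricing table
    "deepseek-v3.2": (0.28, 0.42),
    "haiku-4.5": (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
}

TOTAL_TOKENS_M = 10_000            # 10M requests x ~1K tokens = 10,000 MTok
INPUT_M = TOTAL_TOKENS_M * 0.75    # 3:1 input-heavy split
OUTPUT_M = TOTAL_TOKENS_M * 0.25

def blended_cost(mix: dict[str, float]) -> float:
    """Monthly cost for a traffic mix, e.g. {"haiku-4.5": 0.7, "sonnet-4.6": 0.3}."""
    return sum(
        share * (INPUT_M * PRICES[m][0] + OUTPUT_M * PRICES[m][1])
        for m, share in mix.items()
    )

print(blended_cost({"sonnet-4.6": 1.0}))                    # 60000.0 (baseline)
print(blended_cost({"haiku-4.5": 0.7, "sonnet-4.6": 0.3}))  # 32000.0 (two-tier)
print(blended_cost({"deepseek-v3.2": 0.5, "haiku-4.5": 0.4,
                    "sonnet-4.6": 0.1}))                    # 15575.0 (three-tier)
```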
The cheap-first cascade variant where every query is tried on DeepSeek first, with a scorer escalating ~30% of responses, is even more favorable in theory — but only works when the scorer is reliable enough to avoid double-billing (paying for cheap + expensive on the same query). In practice, production teams prefer pre-call routing because the cost of a mis-scored cascade (paying both tiers) can exceed the savings.
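A back-of-envelope sketch of that trade-off, with illustrative per-query costs (not taken from any benchmark):

```python
# Cheap-first cascading vs pre-call routing. Illustrative per-query costs:
# a Haiku-class cheap tier at $0.002 and a Sonnet-class tier at $0.006.

def cascade_cost(cheap: float, expensive: float, p_escalate: float) -> float:
    """Every query pays the cheap tier; escalated ones also pay the expensive tier."""
    return cheap + p_escalate * expensive

def precall_cost(cheap: float, expensive: float, p_route_up: float) -> float:
    """Pre-call routing pays exactly one tier per query."""
    return (1 - p_route_up) * cheap + p_route_up * expensive

cheap, expensive = 0.002, 0.006   # all-expensive baseline: $0.006/query

print(cascade_cost(cheap, expensive, 0.30))  # 0.0038
print(precall_cost(cheap, expensive, 0.30))  # 0.0032
# A noisy scorer escalating 70% makes the cascade COSTLIER than no routing:
print(cascade_cost(cheap, expensive, 0.70))  # 0.0062  (> 0.006 baseline)
print(precall_cost(cheap, expensive, 0.70))  # 0.0048
```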
The failure mode to avoid is what a 2026 Tianpan post calls "optimizing cost without tracking quality" — the dashboard shows "95% quality maintained" while contract analysis accuracy drops from 94% to 79% because the aggregate metric averages over query types where quality matters differently [10][3]. The fix is quality-aware routing with per-task evaluation, not just a single quality threshold.
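A minimal sketch of what per-task evaluation can look like in practice; task names, quality floors, and the pass/fail eval signal are all illustrative:

```python
# Per-task quality tracking: each task type is evaluated against its own
# floor instead of one aggregate threshold, so a contract-analysis regression
# cannot hide behind healthy classification numbers.
from collections import defaultdict

QUALITY_FLOORS = {"classification": 0.95, "extraction": 0.93, "contract_analysis": 0.92}

results: dict[str, list[float]] = defaultdict(list)

def record_eval(task: str, passed: bool) -> None:
    """Log a pass/fail quality evaluation for one routed request."""
    results[task].append(1.0 if passed else 0.0)

def regressed_tasks() -> list[str]:
    """Tasks whose accuracy has fallen below their own floor; alert on these
    even if the aggregate across all tasks still looks healthy."""
    return [
        task for task, floor in QUALITY_FLOORS.items()
        if results[task] and sum(results[task]) / len(results[task]) < floor
    ]
```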
The cheap-first / expensive-fallback pattern is now the default architecture for serious production LLM deployments. The open-source tooling (RouteLLM, LiteLLM, vLLM Semantic Router) is mature, the academic foundations (FrugalGPT's 50–98% cost savings with accuracy gains, RouteLLM's 85% on MT Bench) are solid, and production case studies consistently land in the 30–70% savings band for mixed workloads [3][4][5][11][12]. With 100× price spreads between frontier and efficient models still persisting in 2026 — and DeepSeek, Gemini Flash-Lite, and Mistral Nemo pushing the floor even lower — the arbitrage opportunity is not closing. What separates the teams that capture it from the teams that introduce quality regressions is not the routing logic itself; it's the monitoring infrastructure, quality-aware thresholds, and per-task evaluation that surround it.
[1] IntuitionLabs, "LLM API Pricing Comparison (2025): OpenAI, Gemini, Claude," Oct 31, 2025. https://intuitionlabs.ai/articles/llm-api-pricing-comparison-2025
[2] AI Cost Check, "AI API Pricing Guide 2026: Cheapest Models, Best Defaults, and Long-Context Picks," Feb 10, 2026. https://aicostcheck.com/blog/ai-api-pricing-guide-2026
[3] NeuralRouting.io, "What Is an LLM Router? The Engineering Guide," Apr 10, 2026. https://neuralrouting.io/blog/what-is-an-llm-router
[4] LMSYS Org, "RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing," Jul 1, 2024. https://lmsys.org/blog/2024-07-01-routellm/
[5] Chen, Zaharia, Zou, "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance," arXiv:2305.05176 / TMLR 2024. http://arxiv.org/pdf/2305.05176
[6] Stanford FutureData, "FrugalGPT GitHub repository," 2023–2024. https://github.com/stanford-futuredata/FrugalGPT
[7] APIScout, "LLM API Pricing 2026: GPT-5 vs Claude vs Gemini," Mar 16, 2026. https://apiscout.dev/blog/llm-api-pricing-comparison-2026
[8] Ong et al., "RouteLLM: Learning to Route LLMs with Preference Data," arXiv:2406.18665 (ICLR 2025). https://arxiv.org/pdf/2406.18665
[9] OpenReview, "ROUTENLP: Closed-Loop Cost-Aware Routing" (anonymous submission). https://openreview.net/pdf?id=H9KBJXHoA8
[10] Tianpan, "Quality-Aware Model Routing: Why Optimizing for Cost Alone Fails," Apr 14, 2026. https://tianpan.co/blog/2026-04-14-quality-aware-model-routing
[11] Particula Tech, "LLM Model Routing: Cheap First, Expensive Only When Needed," Feb 26, 2026. https://particula.tech/blog/llm-model-routing-cheap-first-reduce-api-costs
[12] TokenMix, "How to Use Multiple AI Models: Routing, Failover, and the Unified API Approach (2026)," Apr 13, 2026. https://tokenmix.ai/blog/how-to-use-multiple-ai-models
[13] Markaicode, "The LLM Router Pattern: Dynamically Switching Models by Task Complexity and Cost," Mar 2, 2026. https://markaicode.com/llm-router-pattern-model-switching/
[14] Portkey, "Build AI agents with Portkey's MCP client — delivery platform case study." https://portkey.ai/case-studies/leading-delivery-platform
[15] Martian, "Routing for AI Agents — financial services case study." https://www.withmartian.com/solutions/routing-for-ai-agents
[16] Wang et al., "When to Reason: Semantic Router for vLLM," OpenReview/HuggingFace, Sept 2025. https://huggingface.co/papers/2510.08731
[17] Red Hat Developer, "vLLM Semantic Router: Improving efficiency in AI reasoning," Sept 11, 2025. https://developers.redhat.com/articles/2025/09/11/vllm-semantic-router-improving-efficiency-ai-reasoning
[18] Maxim AI, "Top 5 LLM Gateways in 2025," Dec 4, 2025. https://www.getmaxim.ai/articles/top-5-llm-gateways-in-2025-the-definitive-guide-for-production-ai-applications/
[19] Portkey, "LLMs in Prod 2025: Insights from 2 Trillion+ Tokens," Jan 21, 2025. https://portkey.ai/blog/report