
A numbers-heavy decision guide covering LoRA/QLoRA training costs, hosted fine-tune pricing, per-token economics, and concrete break-even analysis.
Prompt engineering has zero upfront cost but pays a per-token tax on every request (more context = more dollars). Fine-tuning inverts this: you pay a one-time training bill, then enjoy shorter prompts and often a smaller base model for the same quality. The economic question is simply: at what request volume does the training cost amortize below the prompt-engineering cost?
The standard formula most practitioners use [1]:
Break-even N = T / (P_base - P_tuned)
where T is one-time training cost, P_base is per-call inference on a frontier model with your current (long) prompt, and P_tuned is per-call inference on the fine-tuned model with a shortened prompt. For a 3,000-token prompt shrunk to 500 tokens on a popular model, break-even typically lands between 100,000 and 1 million requests [1].
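The arithmetic is easy to sanity-check in code. A minimal sketch with illustrative inputs: the $500 training spend and the per-1M-token rates below are assumptions for the example, not quotes from any price sheet.

```python
def per_call_cost(tokens_in: int, tokens_out: int,
                  price_in: float, price_out: float) -> float:
    """Cost of one request, with prices in $ per 1M tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

def break_even_requests(training_cost: float,
                        cost_base: float, cost_tuned: float) -> float:
    """N = T / (P_base - P_tuned); only meaningful when tuned calls are cheaper."""
    saving = cost_base - cost_tuned
    if saving <= 0:
        raise ValueError("fine-tuned calls must be cheaper per request")
    return training_cost / saving

# Illustrative: 3,000-token prompt shrunk to 500 tokens, 200 output tokens
# either way; frontier model at $2.50/$10.00 per 1M, tuned smaller model
# at $0.30/$1.20 per 1M, $500 total training spend.
p_base = per_call_cost(3_000, 200, 2.50, 10.00)   # $0.00950 per request
p_tuned = per_call_cost(500, 200, 0.30, 1.20)     # $0.00039 per request
n = break_even_requests(500.0, p_base, p_tuned)
print(f"break-even ≈ {n:,.0f} requests")          # ≈ 55k at these assumptions
```

With a more modest prompt reduction or a pricier training phase, the same function lands in the 100k–1M range quoted above.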
LoRA (Low-Rank Adaptation) trains tiny adapter layers (rank 8–64) instead of all model weights, giving a ~3× memory reduction versus full fine-tuning. QLoRA adds 4-bit quantization of the frozen base, stacking another ~2–3× reduction [2][3].
VRAM requirements (8B class model) [2][3][4]:
| Method | Llama 3 8B | Llama 3 70B | Llama 3 405B |
|---|---|---|---|
| Full fine-tune (FP16) | ~60 GB | ~500 GB (multi-GPU) | well over 1 TB (multi-node) |
| LoRA (FP16) | ~18 GB | ~160 GB | ~420 GB |
| QLoRA (4-bit) | ~6–8 GB | ~40 GB | ~200–240 GB |
The practical consequence: a single RTX 4090 (24 GB) now fine-tunes 8B models, and a single A100 80 GB handles 70B QLoRA — tasks that required datacenter clusters two years ago [3][4].
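The table's single-GPU figures can be roughly reproduced with a bytes-per-parameter model. The per-method byte counts and the flat overhead below are assumptions for a sanity check, not a sizing tool:

```python
def vram_estimate_gb(params_b: float, method: str) -> float:
    """Back-of-envelope VRAM estimate (GB) for a params_b-billion-param model.

    Assumed bytes per parameter (rough):
      "full":  8.0  -- FP16 weights + FP16 grads + FP16 Adam moments
      "lora":  2.0  -- frozen FP16 base; adapter weights are negligible
      "qlora": 0.5  -- 4-bit frozen base
    plus a flat ~3 GB for activations, adapter optimizer state,
    and CUDA/framework overhead.
    """
    bytes_per_param = {"full": 8.0, "lora": 2.0, "qlora": 0.5}[method]
    return params_b * bytes_per_param + 3.0

for size in (8, 70):
    print(f"{size}B:", {m: round(vram_estimate_gb(size, m))
                        for m in ("full", "lora", "qlora")})
```

This gives roughly 67/19/7 GB for 8B and 563/143/38 GB for 70B, within ~20% of the table. The real numbers move with batch size and sequence length, which this estimate ignores.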
QLoRA reaches roughly 90–92% of full fine-tune quality at rank 16 and within 0.5% of full fine-tuning at rank 64 on domain benchmarks [2][5]. That small quality gap is the price you pay for a 5–10× cost reduction.
GPU rental rates (Apr 2026) across major providers [3][4][6]:
| GPU | Spot/low-end | Mainstream | Premium |
|---|---|---|---|
| RTX 4090 (24 GB) | $0.44/hr (vast.ai) | $0.50–1.00/hr | — |
| A100 40 GB | $0.78/hr (Thunder) | $1.29/hr (Lambda) | $1.48–2.50/hr (AWS/Vultr) |
| A100 80 GB | $0.44/hr (vast.ai) | $1.19/hr (RunPod) | $2.40/hr (AWS) |
| H100 80 GB | $1.99/hr (RunPod) | $3.00/hr | $4.00/hr+ |
End-to-end training cost per run (10k–15k examples, 3 epochs) [3][4][6]:
| Model | GPU | Wall time | All-in cost |
|---|---|---|---|
| Llama 3 8B QLoRA | RTX 4090 | 3–5 hr | $2–5 |
| Llama 3 8B full FT | A100 80 GB | 4–8 hr | $6–9 |
| Llama 3 70B QLoRA | A100 80 GB | 12–20 hr | $15–60 |
| Llama 3 70B LoRA | A100 80 GB | 8–24 hr | $25–50 |
| Llama 3 70B full FT | 4× H100 | ~10 hr | $200–400 |
| Llama 3 405B LoRA | 8× H100 | ~12 hr | $500–1,000 |
A widely cited hands-on report [6]: Llama 3.1 8B QLoRA on 15k examples at batch size 4, 3 epochs = 4.5 hr at $0.52/hr on a marketplace A100 80 GB = $2.34 in GPU + ~$0.80 for eval/setup ≈ $3.14 all-in.
Practitioners consistently budget 3–5 training iterations before a production-worthy model emerges; failed experiments are typically 30–50 % of total training spend [7]. A "$5 fine-tune" is really a $15–25 project once you include dataset debugging, hyperparameter sweeps, and eval runs.
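That multiplier is easy to bake into a budget. A sketch assuming the 3–5 iteration rule above plus a flat fraction for eval and setup (both knobs are assumptions, not measurements):

```python
def project_cost(per_run: float, iterations: int = 4,
                 eval_overhead: float = 0.25) -> float:
    """Total training spend: `iterations` full runs, plus a flat
    fraction for dataset debugging, sweeps, and eval passes."""
    return per_run * iterations * (1 + eval_overhead)

# Midpoint per-run costs from the table above
print(project_cost(3.5))    # 8B QLoRA: the "$3.50 fine-tune" as a project
print(project_cost(37.5))   # 70B QLoRA
```

With these assumptions the $2–5 QLoRA run becomes a $10–25 project, consistent with the rule of thumb above.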
Unsloth's custom Triton kernels deliver 2× training speedup and ~60 % memory savings vs. vanilla HuggingFace PEFT, effectively halving GPU-hour bills again [5]. This moves the Llama 3 8B QLoRA job from ~4.5 hr to ~2.5 hr on the same hardware.
When you don't want to manage GPUs, the major platforms charge per training token and often a premium on inference.
| Model | Training ($/1M tokens) | Inference input ($/1M) | Inference output ($/1M) |
|---|---|---|---|
| GPT-4o (fine-tuned) | $25.00 | $3.75 | $15.00 |
| GPT-4o mini (fine-tuned) | $3.00 | $0.30 | $1.20 |
| GPT-4 (legacy FT) | $90.00 | $45.00 | $90.00 |
Fine-tuned GPT-4o inference is 1.5× the base GPT-4o price ($2.50/$10.00 per 1M for the base model) [9]. Prompt caching on fine-tuned GPT-4o cuts cached input to $1.875/1M — a 50% discount on repeated context [8].
OpenAI also offers Reinforcement Fine-Tuning (RFT) on o4-mini at $100/hr of wall-clock training time, plus separate token charges for model graders [10].
On Google's Vertex AI, supervised fine-tuning is charged per million training tokens (= dataset tokens × epochs) [11][14]. Inference then runs at the same price as the base model — a crucial difference from OpenAI.
| Model | Supervised FT ($/1M training tokens) |
|---|---|
| Gemini 2.5 Pro | $25.00 |
| Gemini 2.5 Flash | $5.00 |
| Gemini 2.5 Flash Lite | $1.50 |
| Gemma 3 27B IT | $6.83 |
| Gemma 3 4B IT | $1.14 |
| Llama 3.3 70B | $6.72 |
| Llama 3.1 8B | $0.67 |
| Qwen 3 32B | $6.57 |
Together AI prices fine-tuning per token, split by model size and method [12]:
| Model size | LoRA ($/1M tokens) | Full FT ($/1M tokens) |
|---|---|---|
| ≤16B | $0.48 | $0.54 |
| 17B–69B | $1.50 | $1.65 |
| 70B–100B | $2.90 | $3.20 |
Together AI's Serverless Multi-LoRA lets you run custom adapters at base-model token prices — you don't pay for an idle dedicated endpoint, a major structural advantage over OpenAI's pricing [13].
Training a domain Q&A model on 50M tokens × 3 epochs means 150M billable training tokens: $3,750 on Gemini 2.5 Pro, $750 on Gemini 2.5 Flash, ~$100 on Llama 3.1 8B via Vertex, or $72 as a ≤16B LoRA on Together AI.
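The same bill across providers, using the rates quoted in the tables above (a sketch; real platforms may add minimum charges or round token counts differently):

```python
# Hosted fine-tune training bill = dataset tokens × epochs × rate.
RATES_PER_M = {  # $/1M training tokens, from the tables above
    "Gemini 2.5 Pro":         25.00,
    "Gemini 2.5 Flash":        5.00,
    "Llama 3.1 8B (Vertex)":   0.67,
    "Together LoRA (<=16B)":   0.48,
    "GPT-4o mini (OpenAI)":    3.00,
}

def training_bill(dataset_tokens: int, epochs: int, rate_per_m: float) -> float:
    """Dollar cost of one supervised fine-tuning job."""
    return dataset_tokens * epochs * rate_per_m / 1_000_000

for name, rate in RATES_PER_M.items():
    print(f"{name:24s} ${training_bill(50_000_000, 3, rate):>9,.2f}")
```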
Using uatgpt.com's framework on GPT-4o-mini with 10k requests/day (≈300k/month) [15]:
| Task | Few-shot prompt tokens | Fine-tuned prompt tokens | Monthly savings | Breakeven |
|---|---|---|---|---|
| Simple classification | 800 in → 50 out | 200 in → 50 out | $54/mo | 1–2 months |
| Entity extraction | 1,200 → 200 | 400 → 200 | $72/mo | ~1 month |
| Content generation | 2,000 → 1,000 | 500 → 1,000 | $135/mo | ~1 month |
| Complex analysis | 3,000 → 2,000 | 1,000 → 2,000 | $180/mo | ~1 month |
At 10,000 requests/day, fine-tuning GPT-4o-mini pays for itself in 1–2 months. At 100 requests/day, breakeven stretches to 6–18 months — prompting wins [15].
A second validated case [16]:
GPT-4 + RAG at 50k queries/month ≈ $1,500/mo.
Fine-tuned GPT-3.5 at 50k queries/month ≈ $250/mo + $12,000 training.
Break-even: ~9 months at stable volume.
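The ~9-month figure is just the training bill divided by the monthly delta:

```python
def break_even_months(training_cost: float,
                      monthly_base: float, monthly_tuned: float) -> float:
    """Months until the one-time training cost is recovered."""
    return training_cost / (monthly_base - monthly_tuned)

# $12,000 training; $1,500/mo (GPT-4 + RAG) vs $250/mo (fine-tuned GPT-3.5)
print(break_even_months(12_000, 1_500, 250))  # 9.6 months
```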
Empirically, the crossover between prompting a frontier model and a fine-tuned smaller model typically occurs at 500,000–1,000,000 requests/month for well-defined tasks. Below that volume, training + maintenance overhead rarely pays off.
These thresholds reflect aggregated 2025–2026 practitioner guidance [1][7][15][16][17].
A notable counter-example: Microsoft's MedPrompt framework beat fine-tuned Med-PaLM 2 by up to 12 absolute percentage points on medical benchmarks using sophisticated prompting on a general model [7]. Prompt engineering has more headroom than most teams explore.
Consider a production classifier processing 300k requests/day (9M/month), current GPT-4o prompt is 1,500 tokens in → 100 tokens out with 5 few-shot examples.
Prompting baseline (GPT-4o at $2.50/$10.00 per 1M) [9]: 13.5B input and 900M output tokens per month → $33,750 + $9,000 ≈ $42,750/mo.
Fine-tuned GPT-4o-mini (prompt shrinks to 300 tokens, model downshifts; fine-tuned rates $0.30/$1.20 per 1M) [8]: 2.7B input and 900M output tokens → $810 + $1,080 ≈ $1,890/mo.
Self-hosted fine-tuned Llama 3 8B on a cloud endpoint (~$0.20/$0.60 per 1M commodity serving): $540 + $540 ≈ $1,080/mo.
At this volume, the cost question is decided; the remaining trade-offs are whether your team can operate the self-hosted stack and whether quality parity is real after evals.
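The three scenarios reduce to one helper. Token counts and per-1M rates are as stated above; the self-hosted rate is an assumed commodity figure, not a quote:

```python
def monthly_cost(requests: int, tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    """Monthly serving cost; prices are $ per 1M tokens."""
    return requests * (tokens_in * price_in + tokens_out * price_out) / 1_000_000

REQUESTS = 9_000_000  # 300k requests/day

scenarios = {  # name: (tokens_in, tokens_out, $/1M in, $/1M out)
    "GPT-4o, 1,500-token prompt": (1_500, 100, 2.50, 10.00),
    "fine-tuned GPT-4o-mini":     (  300, 100, 0.30,  1.20),
    "self-hosted Llama 3 8B":     (  300, 100, 0.20,  0.60),
}
for name, args in scenarios.items():
    print(f"{name:28s} ${monthly_cost(REQUESTS, *args):>9,.0f}/mo")
```

That works out to roughly $42,750, $1,890, and $1,080 per month respectively: a 20–40× spread that dwarfs any plausible training bill.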
Dataset labeling — 10k high-quality labeled examples at $0.50–$2.00 each is $5k–$20k of human effort, typically the largest line item. Synthetic data generation via frontier models can cut this to $200–$1,000 but introduces quality risk [7].
Evaluation infrastructure — offline eval harness, golden sets, drift monitors. Budget 1–2 engineer-weeks initially plus ongoing maintenance.
Model drift on base updates — when OpenAI updates GPT-4o or Google ships Gemini 2.6, your fine-tuned variant may underperform the new base. Plan to re-train every 6–12 months.
Dedicated endpoint costs — On Together AI, hosting a fine-tuned model on a dedicated endpoint is charged per minute even when idle [13]. Serverless Multi-LoRA sidesteps this, but not every model is supported.
Lock-in — Fine-tuned weights are model-specific. Switching from GPT-4o to Gemini means re-training. Prompts are largely portable [15].
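Folding the hidden line items into a first-year total makes the labeling cost's dominance obvious. A sketch with illustrative inputs (every number below is an assumption drawn from the ranges above):

```python
def first_year_tco(labeling: float, training_project: float,
                   retrains: int, serving_monthly: float) -> float:
    """First-year total cost of ownership for a fine-tuned model:
    one-time labeling, the initial training project, `retrains`
    repeat projects for base-model updates, and 12 months of serving."""
    return labeling + training_project * (1 + retrains) + serving_monthly * 12

# e.g. $8k labeling, $20 training project, one mid-year retrain,
# $1,890/mo assumed serving bill
print(f"${first_year_tco(8_000, 20, 1, 1_890):,.0f}")
```

At these assumptions serving dominates at high volume and labeling dominates everything else; the GPU bill itself is a rounding error.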
The 2026 economic reality: QLoRA at $3–$60 per training run and $3–$25 per million training tokens on hosted platforms has driven the fine-tuning decision down-market by roughly 10× compared with 2023. The break-even volume that once sat at 5M requests/month now sits near 300k–500k for most narrow tasks. Prompt engineering still wins on flexibility and iteration speed; fine-tuning wins when your task is boring enough to be baked into weights.
[1] FastTool — Fine-Tuning vs Prompting: When Each One Wins (2026) — https://fasttool.app/blog/cluster-llm-fine-tuning-vs-prompting-guide
[2] Antonio Brundo — Fine-Tuning LLMs: LoRA vs QLoRA Production Guide — https://antoniobrundo.org/knowledge/fine-tuning-lora-guide.html
[3] DeployBase — Fine-Tune LLM with LoRA: GPU Requirements & Costs (Apr 2025) — https://deploybase.ai/articles/fine-tune-llm-with-lora-gpu-requirements-costs
[4] BestGPUCloud — Best GPU for LLaMA 3 Fine-Tuning in 2026 — https://www.bestgpucloud.com/en/blog/best-gpu-llama-3-fine-tuning-2026
[5] YoungJu.dev — Complete Guide to LLM Fine-tuning with Unsloth 2025 — https://www.youngju.dev/blog/culture/2026-03-25-unsloth-llm-finetuning-qlora-optimization-guide-2025.en
[6] Alexa V. (Medium) — How to Fine-Tune LLMs for Under $20 (Feb 2026) — https://medium.com/@velinxs/how-to-fine-tune-llms-for-under-20-step-by-step-c187a3059ca2
[7] Tian Pan — Fine-Tuning Economics: The Real Cost Calculation Before You Commit (Apr 2026) — https://tianpan.co/blog/2026-04-09-fine-tuning-economics-lora-peft-vs-prompt-engineering
[8] LangCopilot — OpenAI GPT-4o (fine-tuned) Pricing — https://langcopilot.com/llm-pricing/openai/gpt-4o-finetuned
[9] StackCompare — OpenAI API Pricing Guide 2026 (Apr 2026) — https://stackcompare.net/openai-api-pricing-guide-2026-every-model-cost-and-tip-explained/
[10] OpenAI Help Center — Billing guide for the Reinforcement Fine Tuning API — https://help.openai.com/en/articles/11323177-billing-guide-for-the-reinforcement-fine-tuning-api
[11] Google Cloud — Vertex AI Generative AI Pricing — https://cloud.google.com/vertex-ai/generative-ai/pricing
[12] eesel AI — A complete guide to Together AI pricing in 2025 — https://www.eesel.ai/blog/together-ai-pricing
[13] Together AI — Announcing Serverless Multi-LoRA — https://www.together.ai/blog/serverless-multi-lora-fine-tune-and-deploy-hundreds-of-adapters-for-model-customization-at-scale
[14] Google — Gemini Developer API pricing — https://ai.google.dev/gemini-api/docs/pricing
[15] uatgpt.com — Fine-Tuning vs Prompt Engineering — The Decision Framework with Cost Breakpoints (Apr 2026) — https://uatgpt.com/ai-development-workflows/fine-tuning-vs-prompting/
[16] Pullflow — Fine-Tuning vs Prompt Engineering: Which One Actually Saves You Money? (Jul 2025) — https://pullflow.com/blog/finetuning-vs-prompt-engineering/
[17] Viqus — Fine-Tuning vs. Prompting in 2026: When Each Actually Makes Sense (Jan 2026) — https://viqus.ai/blog/fine-tuning-vs-prompting-2026-guide