
Tokens are the fundamental unit of measurement in large language models. Every character of text you send, every word the model generates, every image you attach — all of it is metered in tokens. Understanding how tokens are produced, counted, and billed is essential for anyone building on LLM APIs in 2026.
LLMs do not process raw text. Before any inference begins, a tokenizer splits input into smaller units called tokens and maps each to a unique integer ID. A token roughly corresponds to 3–4 characters of English text, or about 0.75 words. Code, non-Latin scripts, and emoji consume significantly more tokens per character [1][2].
BPE is the dominant tokenization algorithm in modern LLMs. The algorithm works in four steps [1][3]:
1. Start from a base vocabulary of individual characters (or bytes), so every string is representable.
2. Count the frequency of every adjacent pair of symbols in the training corpus.
3. Merge the most frequent pair into a new single token (e.g., t + h → th) and add it to the vocabulary.
4. Repeat until the target vocabulary size is reached.

The result: common English words like "hello" become a single token, while rare or invented words decompose into subword pieces (e.g., "tokenization" → "token" + "ization"). This subword approach eliminates the "unknown token" problem of older word-level tokenizers while keeping sequences compact [1][3].
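The merge loop can be sketched in a few lines. This is a toy illustration of the training procedure, not a production tokenizer (real implementations work on bytes, record merge ranks, and apply a regex pre-split):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, e.g. ('t', 'h')
        merges.append(best)
        # Rewrite the corpus with the pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["the", "the", "then", "there", "this"], num_merges=3)
print(merges)  # first merge: ('t', 'h'); second: ('th', 'e')
```

On this tiny corpus the first learned merge is `t + h → th`, exactly the example above, because that pair appears in every word.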
OpenAI's tiktoken is the standard tokenizer library for GPT-family models. It is implemented in Rust with a thin Python wrapper for speed. Key encodings include cl100k_base (GPT-4, GPT-3.5-Turbo) and o200k_base (GPT-4o). tiktoken operates on UTF-8 bytes after a regex pre-split that prevents merges from crossing word boundaries — this is why "don't" splits as ["don", "'t"] rather than merging the apostrophe with adjacent letters [1][4].
SentencePiece, used by models like LLaMA, T5, and Gemma, takes a different approach: it operates directly on raw Unicode text without assuming whitespace-delimited words. This makes it far more suitable for multilingual and noisy corpora (Japanese, Chinese, Thai, etc.) where spaces do not separate words. SentencePiece supports both BPE and Unigram algorithms. The Unigram variant starts with a large vocabulary and iteratively removes tokens that contribute least to overall likelihood — a probabilistic approach compared to BPE's greedy frequency-based merging [2][3][5].
The practical difference: tiktoken encodes UTF-8 bytes then merges; SentencePiece merges at the code-point level with a byte fallback for rare characters [3].
Every API call has two billable components: input tokens (everything you send, including the system prompt and conversation history) and output tokens (everything the model generates) [6][7][8].
Output tokens are universally more expensive because generation is sequential and autoregressive (each token depends on all previous tokens), while input can be processed in parallel. Across all major providers in 2026, output tokens cost 3–5× more than input tokens [7][8]:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | ~$0.15 | ~$0.60 | 128K |
| Claude 3.5/4 Sonnet | $3.00 | $15.00 | 200K |
| Gemini 2.0 Flash | ~$0.15 | ~$0.60 | 1M |
| GPT-4.6 (2026) | $5.00 | $25.00 | 128K |
Prices as of mid-2026; confirm on provider pricing pages. [6][9][10]
This asymmetry means that verbose output is the most expensive part of any API call. A system generating long responses burns money on the costliest token class [8].
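The asymmetry is easiest to see in a cost function. A minimal sketch using the example prices from the table above (list prices only; confirm against provider pricing pages):

```python
# Per-1M-token prices taken from the pricing table above (mid-2026 list
# prices; verify before relying on them for budgeting).
PRICES = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":   {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call; tokens are billed per million."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The same 10K-in / 2K-out call across models:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

Note that for GPT-4o, the 2K output tokens cost almost as much as the 10K input tokens: generation dominates the bill even when it is a fraction of the traffic.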
Not all tokens in a request are visible in the user's message. The system prompt, the full conversation history, tool/function definitions, and the structural tokens that frame each message are all counted as input. All of these accumulate silently: a "simple" chatbot request that looks like 50 tokens of user text may actually consume 5,000+ tokens once system prompts, history, and tool definitions are included.
Prompt caching is one of the most impactful cost optimizations available in 2026. When the beginning of a prompt matches a recently seen prefix, providers can reuse the cached computation and charge a reduced rate on the cached portion of the input [11][12].
The key insight: structure your prompts so that the static prefix (system prompt, tool definitions, reference documents) comes first and the variable part (user query) comes last. This maximizes cache hit rates. For agentic workloads that make many calls with the same system prompt, caching alone can cut costs by 50–90% [12].
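A sketch of cache-friendly prompt assembly and the resulting savings. The 50% discount here is an illustrative assumption; actual cached-token discounts vary by provider:

```python
def assemble_prompt(system_prompt: str, tools_schema: str, user_query: str) -> str:
    """Cache-friendly ordering: static prefix first, variable suffix last.

    Providers match cached prefixes byte-for-byte, so anything that changes
    per request must come after everything that stays constant."""
    return "\n\n".join([system_prompt, tools_schema, user_query])

def input_cost(prefix_tokens: int, suffix_tokens: int, price_per_m: float,
               cache_discount: float = 0.5):
    """Input cost without vs. with a cache hit on the static prefix.
    cache_discount=0.5 is illustrative, not any provider's actual rate."""
    full = (prefix_tokens + suffix_tokens) * price_per_m / 1e6
    hit = (prefix_tokens * (1 - cache_discount) + suffix_tokens) * price_per_m / 1e6
    return full, hit

# 50K tokens of static system prompt + tools, 500 tokens of user query:
full, hit = input_cost(prefix_tokens=50_000, suffix_tokens=500, price_per_m=2.50)
print(f"uncached ${full:.4f} vs cached ${hit:.4f}")
```

The larger the static prefix relative to the variable suffix, the closer the savings approach the full cache discount, which is why agentic workloads benefit most.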
Images and audio are converted to token equivalents before processing; for images this typically means a fixed base cost plus a per-tile cost that grows with resolution.
Multimodal inputs are particularly expensive because they inflate the input token count dramatically. A single high-res image attached to a short text prompt can double or triple the total input tokens.
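A rough estimator makes the inflation concrete. This sketch uses tile-based accounting modeled on OpenAI's published high-detail image formula (85 base tokens plus 170 per 512px tile), simplified by skipping the provider's resize step; treat the constants as assumptions, not a billing reference:

```python
import math

# Illustrative constants modeled on OpenAI's high-detail image accounting;
# actual values differ by provider and model.
BASE_TOKENS = 85
TOKENS_PER_TILE = 170

def image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one image via 512x512 tiling."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

print(image_tokens(1024, 1024))  # 4 tiles -> 765 tokens
```

At roughly 765 tokens, one such image outweighs a multi-paragraph text prompt on its own.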
Every model has a hard context window — the maximum total tokens (input + output) it can handle in a single request (see the context windows in the pricing table above) [1][3].
The context window is shared between input and output. If you send 120K tokens of input to GPT-4o, you have only ~8K tokens left for the model's response. Engineers who do not count tokens discover this limit in production during incidents [1].
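Budgeting against the shared window is a one-liner worth making explicit. A minimal sketch (the 128K/16K figures are illustrative defaults, not any model's exact limits):

```python
def output_budget(input_tokens: int, context_window: int, max_output: int) -> int:
    """Tokens left for the response after the prompt is accounted for."""
    remaining = context_window - input_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return min(remaining, max_output)

# 120K tokens of input into a 128K window leaves only 8K for the response.
print(output_budget(input_tokens=120_000, context_window=128_000,
                    max_output=16_384))  # 8000
```

Running this check before dispatching a request turns a production incident into a handled error.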
Practical accounting: A typical 250-word passage tokenizes to roughly 240–280 tokens after special tokens are added. A 50-page PDF might consume 30,000–50,000 tokens. A full codebase dump can easily exceed 100K tokens. The 2026 pricing spread ranges from $0.15/1M tokens (Gemini Flash input) to $75/1M tokens (frontier model output) — a 500× range that makes token accounting a first-class engineering concern [7][10].
[1] Selva Prabhakaran, "How LLM Tokenization Works: Build a BPE Tokenizer" — https://machinelearningplus.com/gen-ai/build-bpe-tokenizer/
[2] DigitalOcean, "LLM Tokenizers Simplified: BPE, SentencePiece, and More" — https://www.digitalocean.com/community/conceptual-articles/llm-tokenizers-bpe-sentencepiece-custom-vs-pretrained
[3] fast.ai / Andrej Karpathy, "Let's Build the GPT Tokenizer" — https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers
[4] machinelearningplus, "tiktoken vs HuggingFace Tokenizers: Benchmark Guide" — https://machinelearningplus.com/gen-ai/tiktoken-vs-huggingface-tokenizers/
[5] MyEngineeringPath, "Tokenization Guide — BPE, SentencePiece & Token Counting (2026)" — https://myengineeringpath.dev/genai-engineer/tokenization/
[6] mem0.ai, "Claude, Gemini & OpenAI Compared" — https://mem0.ai/blog/llm-api-cost-breakdown-claude-gemini-openai-compared
[7] wring.co, "LLM Inference Cost Optimization" — https://wring.co/blog/llm-inference-cost-optimization
[8] ATXP, "How LLM Token Pricing Works: A Developer's Guide (2026)" — https://atxp.ai/blog/how-llm-token-pricing-works
[9] LangCopilot, "Official GPT-4o Pricing (2026)" — https://langcopilot.com/llm-pricing/openai/gpt-4o
[10] Vibe Coder Blog, "AI API Costs: OpenAI, Anthropic, Google" — https://blog.vibecoder.me/ai-api-costs-openai-anthropic-google-budget
[11] OpenAI, "Prompt Caching in the API" — https://openai.com/index/api-prompt-caching/
[12] AI Superior, "LLM Cost Optimization Strategies 2026" — https://aisuperior.com/llm-cost-optimization-strategies-2026/