
Large Language Models (LLMs) such as GPT-4o, Claude, Gemini, Llama 4, DeepSeek-V4, and Qwen all share a common lineage: the Transformer architecture introduced in 2017's Attention Is All You Need. What has changed dramatically since then — and especially through 2025–2026 — is not the core skeleton but everything around it: how text is tokenized, how attention is computed, how models are scaled through sparse Mixture-of-Experts (MoE), how they are aligned with human intent, and how inference is squeezed for every last FLOP using KV caches, speculative decoding, PagedAttention, and test-time compute scaling [1][2][5].
This document walks the stack from the bottom up: tokenization → embeddings → attention → the Transformer block → training (pretraining, SFT, RLHF/DPO) → inference optimization → the 2026 frontier (MoE at scale, reasoning models, linear and hybrid attention).
```mermaid
flowchart LR
    A[Raw Text] --> B[Tokenizer<br/>BPE / Byte-level]
    B --> C[Token Embeddings + RoPE]
    C --> D[Transformer Block × N<br/>Attention + MoE FFN]
    D --> E[LM Head / Softmax]
    E --> F[Next-Token Distribution]
    F -->|autoregressive| C
```
Before a model sees anything, raw UTF-8 text is chopped into tokens. Nearly every frontier model in 2026 uses some variant of Byte-Pair Encoding (BPE) — iteratively merging the most frequent adjacent byte pairs until a vocabulary of ~100K–200K tokens is built. GPT-4o uses ~200K tokens; Llama 3/4 uses a 128K SentencePiece-BPE vocabulary; DeepSeek-V3/V4 uses ~129K [2][11].
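As a toy illustration of that merge loop, here is a minimal character-level BPE trainer sketch (real tokenizers work on raw bytes and are far more optimized; the training words are made up):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in corpus]          # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:                        # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(bpe_train(["low", "lower", "lowest", "newest", "widest"], 5))
```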
Tokenization matters far more than it might seem: the fastokens Rust BPE implementation delivers a 9.1× average speedup over HuggingFace tokenizers and up to 40% faster time-to-first-token (TTFT) on long-context agentic workloads, because tokenization had quietly become a CPU bottleneck on H100-class inference [11].

Each token ID is looked up in an embedding matrix of shape [vocab_size, d_model] (e.g., 128K × 8192 in Llama-3.1-70B). Because self-attention is permutation-invariant, the model also needs positional information. The 2017 sinusoidal encodings are long gone; in 2026 the dominant choice is Rotary Position Embedding (RoPE), already visible in the pipeline diagram above: query and key vectors are rotated by position-dependent angles, so attention scores depend on relative offsets between tokens.
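RoPE itself is only a few lines. The sketch below uses the common "rotate-half" formulation on a single head; the head dimension and base frequency of 10000 are typical defaults, not anything model-specific:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape [seq_len, d_head].
    Pairs of dimensions are rotated by position-dependent angles, so the
    dot product of a rotated query and key depends only on their offset."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)      # 16 positions, head dim 64
q_rot = rope(q)              # same shape, now position-aware
```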
Self-attention lets each token attend to every other token in the sequence. For query Q, key K, value V matrices of shape [seq_len, d_head]:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V$$
where M is the causal mask (−∞ on future positions). Cost is O(n²) in sequence length — the central pain point of the entire field.
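Implemented naively, the formula is a handful of tensor operations; the following sketch builds the causal mask M explicitly (shapes are illustrative):

```python
import math
import torch

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask, mirroring the formula above.
    q, k, v: [seq_len, d_head]."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)                 # [n, n]
    n = scores.shape[-1]
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # future positions
    scores = scores.masked_fill(mask, float("-inf"))                   # M: -inf above diagonal
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 64)
out = causal_attention(q, k, v)    # [8, 64]
```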
The 2026 lineage of attention variants:
| Variant | Models | Idea |
|---|---|---|
| Multi-Head Attention (MHA) | GPT-2, GPT-3 | h separate Q/K/V heads |
| Multi-Query Attention (MQA) | PaLM | 1 shared K/V head |
| Grouped-Query Attention (GQA) | Llama 3/4, Mistral, Qwen | g groups of shared K/V — sweet spot for quality vs KV-cache size |
| Multi-Head Latent Attention (MLA) | DeepSeek-V2/V3/V4 | Compress K/V into a low-rank latent vector; 8–16× KV memory reduction with near-MHA quality [6][7] |
| Lightning / Linear Attention | MiniMax, RWKV, Mamba-hybrids | Kernelize softmax, O(n) cost; hybrids (e.g., Jamba, MiniMax-01) interleave linear and softmax layers [9] |
DeepSeek-V3's MLA is one of the most impactful architectural tweaks of 2025: it's what lets a 671B-parameter model serve long contexts at GPT-4-class quality on a fraction of the memory [6][7].
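To make the GQA row in the table concrete, here is a minimal sketch in which each group of query heads shares one K/V head; the head counts are invented for illustration, and real implementations fold this broadcast into fused kernels:

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    """Grouped-query attention sketch: q has more heads than k/v, and each group
    of query heads shares one K/V head, shrinking the KV cache.
    q: [n_q_heads, seq, d]; k, v: [n_kv_heads, seq, d]."""
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv
    k = k.repeat_interleave(group, dim=0)   # broadcast each shared K head to its group
    v = v.repeat_interleave(group, dim=0)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(32, 128, 64)   # 32 query heads
k = torch.randn(8, 128, 64)    # 8 shared K/V heads -> 4x smaller KV cache
v = torch.randn(8, 128, 64)
out = gqa(q, k, v)             # [32, 128, 64]
```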
FlashAttention (Dao et al., with FlashAttention-3 in 2024 and FlashAttention-4 refinements through 2026) doesn't change the math — it changes the I/O pattern. By tiling Q/K/V into SRAM-sized blocks and fusing the softmax, it eliminates the materialization of the full n×n attention matrix in HBM, giving 2–4× wall-clock speedups and enabling much longer contexts on the same hardware [13][14].
A modern decoder-only block looks like:
```mermaid
flowchart TD
    X[Input Hidden State] --> N1[RMSNorm]
    N1 --> Att[GQA / MLA Attention<br/>with RoPE + KV Cache]
    Att --> R1[+ Residual]
    X --> R1
    R1 --> N2[RMSNorm]
    N2 --> FFN[FFN: SwiGLU<br/>or MoE Router → Experts]
    FFN --> R2[+ Residual]
    R1 --> R2
    R2 --> Y[Output Hidden State]
```
Key 2026 defaults: RMSNorm (cheaper than LayerNorm), pre-norm placement, SwiGLU activation in the FFN, and — increasingly — a sparse MoE FFN instead of a dense one.
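Putting those defaults together, a simplified pre-norm block might look like the sketch below. It is not any particular model's code: nn.MultiheadAttention stands in for GQA/MLA, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square only (no mean subtraction, no bias),
    which is why RMSNorm is cheaper than LayerNorm."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU FFN: down( silu(gate(x)) * up(x) )."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DecoderBlock(nn.Module):
    """Pre-norm block matching the diagram: norm -> attention -> residual,
    then norm -> FFN -> residual."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):                       # x: [batch, seq, d_model]
        n = x.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))

block = DecoderBlock(d_model=256, n_heads=8, d_ff=1024)
y = block(torch.randn(2, 16, 256))              # [2, 16, 256]
```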
MoE replaces the dense FFN with N parallel "expert" FFNs plus a lightweight router that picks the top-k (usually k=1 or k=2) experts per token. Only the selected experts run, so active parameters ≪ total parameters.
Concrete 2026 numbers: DeepSeek-V3, for example, carries 671B total parameters but activates only ~37B per token, so a forward pass costs roughly what a much smaller dense model would.
The DeepSeekMoE recipe introduced two refinements that became standard: fine-grained expert segmentation (many small experts rather than a few large ones, so routing can specialize) and shared experts that every token always passes through to hold common knowledge.
Routing is typically token-choice top-k with an auxiliary-loss-free load-balancing trick (DeepSeek-V3) that avoids the quality hit of traditional aux losses. Training MoE at scale requires expert parallelism (experts sharded across GPUs) + all-to-all communication, which is why MoE infrastructure became a first-class topic in 2026 [6].
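A minimal token-choice top-k router, omitting load balancing and expert parallelism, could look like this (all sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a token-choice top-k MoE FFN: a linear router scores each token
    against every expert, the top-k experts run, and their outputs are combined
    weighted by the renormalized router probabilities."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)           # [tokens, n_experts]
        topk_p, topk_i = probs.topk(self.k, dim=-1)         # [tokens, k]
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                sel = topk_i[:, slot] == e                  # tokens routed to expert e
                if sel.any():
                    out[sel] += topk_p[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

moe = TopKMoE(d_model=256, d_ff=512, n_experts=8, k=2)
y = moe(torch.randn(32, 256))   # only 2 of 8 experts run per token
```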
In 2026, MoE is practically the default choice for any serious frontier model [6].
The objective is next-token prediction — cross-entropy loss over ~15–30 trillion tokens. 2026 recipes (Llama 3.1, DeepSeek-V3, phi-4, Step Law) have formalized several principles around compute-optimal data and model scaling and hyperparameter scaling laws [3][4].
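The objective itself is compact. A sketch of the shifted next-token cross-entropy, assuming logits from an LM head and a batch of token ids:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard pretraining objective: predict token t+1 from tokens <= t.
    logits: [batch, seq, vocab] from the LM head; tokens: [batch, seq] ids."""
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))   # position t scores...
    target = tokens[:, 1:].reshape(-1)                      # ...the token that follows it
    return F.cross_entropy(pred, target)

logits = torch.randn(2, 16, 1000)               # toy batch, seq 16, vocab 1000
tokens = torch.randint(0, 1000, (2, 16))
loss = next_token_loss(logits, tokens)          # scalar to backprop through
```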
A raw pretrained model is a fluent autocomplete, not an assistant. Alignment turns it into one, typically in three stages [8][15]: supervised fine-tuning (SFT) on demonstration data, reward modeling on human preference pairs, and RL-based optimization against that reward (RLHF with PPO), with Direct Preference Optimization (DPO) as the increasingly common shortcut that skips the explicit reward model.
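For reference, the DPO objective trains the policy $\pi_\theta$ directly on preference pairs (chosen $y_w$, rejected $y_l$) against a frozen reference model, with no separate reward model:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$

where $\sigma$ is the logistic sigmoid, $\pi_{\text{ref}}$ is the frozen SFT reference, and $\beta$ controls how far the policy may drift from it.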
OpenAI's o1 (2024) and DeepSeek-R1 (early 2025) changed the narrative: instead of only scaling pretraining, let the model think longer at inference. R1 matched o1 quality at ~70% lower cost by running large-scale RL with verifiable rewards directly on a base model, producing long chain-of-thought traces [16]. Through 2026 a full "test-time compute" scaling law emerged — doubling inference compute via longer reasoning chains, best-of-N sampling, or tree search yields predictable quality gains on math, coding, and scientific reasoning [16]. ByteDance's P1 winning physics-olympiad gold and ThreadWeaver achieving 1.5× speedup on reasoning workflows are 2025–2026 landmarks of this paradigm [16].
Serving an LLM has two phases:
```mermaid
flowchart LR
    P[Prompt] --> PF[Prefill<br/>Compute-bound<br/>Parallel over all tokens]
    PF --> KV[(KV Cache<br/>grows with seq len)]
    KV --> DEC[Decode<br/>Memory-bound<br/>One token at a time]
    DEC -->|append K,V| KV
    DEC --> OUT[Streamed tokens]
```
Autoregressive decoding would recompute attention over all past tokens every step — quadratic waste. The KV cache stores each layer's K and V tensors so only the new token's Q attends against them. This turns per-step cost from O(n²) to O(n) — but the cache itself grows linearly with sequence length × batch size and often exceeds GPU HBM, becoming the bottleneck for long-context serving [17][18][19].
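A single-head, unbatched sketch of one cached decode step (shapes are illustrative):

```python
import math
import torch

def decode_step(q_new, k_new, v_new, kv_cache):
    """One decode step with a KV cache. q_new/k_new/v_new: [1, d_head] for the
    newly generated token; kv_cache: (K, V), each [past_len, d_head]. Only the
    new token's K/V are computed; its query attends over the whole cache,
    so the step is O(n) instead of O(n^2)."""
    K, V = kv_cache
    K = torch.cat([K, k_new], dim=0)                      # append new key
    V = torch.cat([V, v_new], dim=0)                      # append new value
    scores = q_new @ K.T / math.sqrt(q_new.shape[-1])     # [1, past_len + 1]
    out = torch.softmax(scores, dim=-1) @ V               # [1, d_head]
    return out, (K, V)

d = 64
cache = (torch.zeros(0, d), torch.zeros(0, d))
for _ in range(8):                                        # toy decode loop
    q = k = v = torch.randn(1, d)
    out, cache = decode_step(q, k, v, cache)
print(cache[0].shape)                                     # torch.Size([8, 64])
```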
KV-cache management is now a research subfield of its own. 2025–2026 techniques include KV quantization, token eviction and cache compression, speculative and disaggregated/offloaded caches (e.g., SpeCache, FPGA-backed datacenter caches), and paged allocation of cache blocks [17][18][19].
Static batching pads every request to the longest sequence and waits for all to finish — average GPU utilization on production workloads was historically ~40% [20][21]. Continuous (iteration-level) batching (Orca/vLLM) lets new requests join mid-flight and finished ones leave immediately. Chunked prefill splits long prompt prefills into small pieces that interleave with decode steps, preventing a single long prompt from stalling everyone else. Together on an H100, these deliver 3–5× higher throughput than a naive PyTorch loop [20][21].
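The scheduling idea behind continuous batching fits in a few lines. The sketch below is purely illustrative: Request and step_fn are invented stand-ins for real sequences and a real batched forward pass.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int          # tokens still to generate
    done: bool = False

def step_fn(batch):
    """Stand-in for one batched decode step: each active request emits one token."""
    for r in batch:
        r.remaining -= 1
        r.done = r.remaining <= 0

def continuous_batching(incoming: deque, max_batch: int) -> int:
    active, steps = [], 0
    while incoming or active:
        while incoming and len(active) < max_batch:   # admit new requests mid-flight
            active.append(incoming.popleft())
        step_fn(active)                               # one forward pass per iteration
        active = [r for r in active if not r.done]    # finished requests leave at once
        steps += 1
    return steps

queue = deque(Request(remaining=n) for n in (3, 10, 4, 7))
print(continuous_batching(queue, max_batch=2))        # total decode iterations
```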
Decode is memory-bound — the GPU computes one token per forward pass regardless of FLOP budget. Speculative decoding exploits the slack: a small draft model proposes k tokens cheaply, then the large target model verifies all k in parallel with one forward pass. Accepted tokens are free; rejected ones fall back to target sampling. Mathematically lossless (same output distribution), typical speedups 2–3×, stacking to 10× with KV-cache tricks [22][23].
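The accept/resample rule is the heart of the method. A sketch, assuming the draft and target token distributions have already been computed for the k proposed positions:

```python
import torch

def speculative_accept(draft_tokens, draft_probs, target_probs):
    """Acceptance step of speculative decoding (single sequence sketch).
    draft_tokens: [k] proposed ids; draft_probs/target_probs: [k, vocab]
    distributions from the draft and target models at those positions.
    Each token is accepted with prob min(1, p_target/p_draft); on the first
    rejection we resample from the residual distribution, which keeps the
    output distribution identical to the target model's."""
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
            accepted.append(tok)                      # accepted draft token
        else:
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            residual = residual / residual.sum()
            accepted.append(int(torch.multinomial(residual, 1)))
            break                                     # stop at the first rejection
    return accepted
```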
2026 variants refine the drafting side, for example hierarchical speculative decoding for long sequences and drafters built into the target model itself [22][23].
INT8, INT4 (GPTQ, AWQ), FP8, and FP4 on Blackwell/B200 hardware are now routine. DeepSeek-V3 was natively trained in FP8. 2026 inference stacks combine FP8 weights + KV cache + FP4 activations for 4–6× throughput on the same silicon, with sub-1% quality loss on well-calibrated models [3].
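As a flavor of what weight quantization does, here is a symmetric per-channel INT8 sketch; production schemes such as GPTQ and AWQ add calibration data and activation-aware scaling, and FP8/FP4 use hardware-specific formats:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-output-channel INT8 weight quantization: one scale per row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)                          # 4x smaller than FP32 storage
err = (dequantize(q, s) - w).abs().mean()        # small reconstruction error
```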
[1] Starmorph, The Complete Technical Guide to Transformers, Training, and Inference (2026) — https://blog.starmorph.com/blog/how-llms-work-complete-technical-guide
[2] Starmorph, From Attention Mechanism to GPT-4o, Claude, and Open-Source LLMs (2026) — https://blog.starmorph.com/blog/intro-to-transformers
[3] youngju.dev, LLM Pretraining & Scaling Laws: From Chinchilla to Flash Attention and MoE (2026) — https://www.youngju.dev/blog/llm/2026-03-17-llm-pretraining-scaling-laws-guide.en
[4] arXiv 2503.04715, Step Law – Optimal Hyperparameter Scaling Law in LLM Pre-training — https://arxiv.org/html/2503.04715
[5] Stanford, Next Generation LLM Architecture — http://web.stanford.edu/~jksun/blog/llm-architecture.html
[6] Introl, Mixture of Experts Infrastructure: Scaling Sparse Models for Production AI — https://introl.com/uk/blog/mixture-of-experts-moe-infrastructure-scaling-sparse-models-guide
[7] The NextGen Tech Insider, DeepSeek Launches V4 Mixture-of-Experts AI Model Family — https://www.thenextgentechinsider.com/pulse/deepseek-launches-v4-mixture-of-experts-ai-model-family
[8] PremAI, Which LLM Alignment Method? RLHF vs DPO vs KTO Tradeoffs Explained — https://blog.premai.io/which-llm-alignment-method-rlhf-vs-dpo-vs-kto-tradeoffs-explained/
[9] GetMaxim, The Attention Arms Race: How Modern Open-Source LLMs Are Reinventing the Transformer's Core — https://www.getmaxim.ai/blog/the-attention-arms-race-how-modern-open-source-llms-are-reinventing-the-transformers-core/
[10] BuildFastWithAI, Attention Mechanism in LLMs Explained (2026) — https://www.buildfastwithai.com/blogs/attention-mechanism-llm-explained
[11] Crusoe, Reducing TTFT by CPUMaxxing Tokenization — https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization
[12] EmergentMind, Byte-Level Tokenization / Byte Language Models — https://www.emergentmind.com/topics/byte-level-tokenization
[13] Reintech, Flash Attention & PagedAttention Guide — https://reintech.io/blog/llm-inference-optimization-flash-attention-pagedattention
[14] BuildFastWithAI, What Is Mixture of Experts (MoE)? How It Works (2026) — https://www.buildfastwithai.com/blogs/mixture-of-experts-moe-explained
[15] Tianpan, The Alignment Method Decision Matrix for Narrow Domain Applications (April 2026) — https://tianpan.co/blog/2026-04-16-sft-rlhf-dpo-alignment-method-decision-matrix
[16] Introl, Inference-Time Scaling: Research and Reasoning Models (Dec 2025) — https://introl.com/blog/inference-time-scaling-research-reasoning-models-december-2025
[17] arXiv 2512.11920, A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving — https://arxiv.org/html/2512.11920
[18] arXiv 2503.16163, SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs — https://arxiv.org/html/2503.16163v1
[19] arXiv 2603.20397, KV Cache Optimization Strategies for Scalable and Efficient LLM Inference — https://arxiv.org/html/2603.20397v1
[20] Spheron, LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill on H100 (2026) — https://www.spheron.network/blog/llm-serving-optimization-continuous-batching-paged-attention/
[21] RunPod, PagedAttention, Continuous Batching, and Deploying High-Throughput LLM Inference in Production — https://www.runpod.io/articles/guides/vllm-pagedattention-continuous-batching
[22] Substack/BoringBot, KV Caching and Speculative Decoding — https://boringbot.substack.com/p/kv-caching-and-speculative-decoding
[23] arXiv 2404.11912, Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding — https://arxiv.org/html/2404.11912v3
Content rephrased and synthesized from the referenced sources for compliance with licensing restrictions.