
Between 2024 and 2026, the frontier of large language model (LLM) capability has shifted from "scale the pre-training run" to "scale the thinking." Reasoning — the model's ability to decompose a problem, explore alternatives, self-verify, and commit to an answer — has become the primary axis of progress. This shift manifests in three tightly linked developments: structured reasoning topologies beyond the linear chain (trees and graphs of thought), models trained via reinforcement learning to reason natively before answering, and the deliberate scaling of test-time compute.
Analysts now project inference to claim roughly 75% of total AI compute by 2030, with inference demand exceeding training demand by 118× as early as 2026 [5]. This document surveys the technical landscape of agent reasoning in 2025–2026.
CoT prompting elicits intermediate reasoning steps before a final answer. A prompt like "Let's think step by step" causes the model to emit a linear trace of inferences. CoT works because emitting intermediate tokens lets the model spend more forward passes before committing to an answer, and because those tokens act as a scratchpad that conditions later tokens [3][6].
```mermaid
graph LR
    Q[Question] --> S1[Step 1] --> S2[Step 2] --> S3[Step 3] --> A[Answer]
```
**Limitations.** A single chain cannot recover from a bad early step, and it commits to one trajectory. Self-consistency partially mitigates this by sampling k chains and majority-voting the answer, trading compute for robustness [6].
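Since self-consistency is the simplest compute-for-robustness trade, here is a minimal sketch. It assumes a generic `llm(prompt, temperature)` completion callable and a naive last-line answer parser; both are placeholders, not a specific vendor API.

```python
# Self-consistency CoT: sample k independent chains, majority-vote the answer.
from collections import Counter

COT_PROMPT = "Q: {question}\nLet's think step by step."

def extract_answer(trace: str) -> str:
    """Naive parser: treat the last line of the trace as the final answer."""
    return trace.strip().splitlines()[-1]

def self_consistency(llm, question: str, k: int = 8) -> str:
    """Sample k CoT traces at nonzero temperature and majority-vote."""
    answers = []
    for _ in range(k):
        trace = llm(COT_PROMPT.format(question=question), temperature=0.8)
        answers.append(extract_answer(trace))
    # Majority vote buys robustness to any single bad chain, at k x the cost.
    return Counter(answers).most_common(1)[0][0]
```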
CoT is now the default scaffolding inside every reasoning model: DeepSeek-R1, o3, Claude Thinking, and Gemini all produce extended CoT traces internally before emitting a final answer [6][8].
ToT generalizes CoT by letting the model propose several candidate next steps at each node, score them with a value function (often the LLM itself), and search the tree with BFS or DFS, backtracking when a branch looks unpromising [6][9].
```mermaid
graph TD
    Root[Problem] --> A[Thought A]
    Root --> B[Thought B]
    Root --> C[Thought C]
    A --> A1[A1]
    A --> A2[A2 ✗]
    B --> B1[B1 ✓]
    B --> B2[B2]
    B1 --> Ans[Answer]
```
ToT dominates CoT on tasks where a wrong early commitment is costly — Game of 24, Sudoku, creative writing with constraints, and multi-step planning. The trade-off is cost: expanding b branches to depth d costs on the order of b^d LLM calls, making ToT expensive without aggressive pruning [2][6].
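A minimal breadth-first ToT sketch follows. It assumes two placeholder functions — `propose(state, b)`, an LLM call that proposes b candidate next thoughts, and `value(state)`, an LLM-as-judge score in [0, 1] — neither of which is a real API; beam pruning is what keeps cost below the naive b^d calls.

```python
# Breadth-first Tree-of-Thoughts with a value function and beam pruning.
import heapq

def tot_bfs(problem: str, propose, value, b: int = 3, depth: int = 3,
            beam: int = 2) -> str:
    """Expand b thoughts per node, keep only the `beam` best states per level."""
    frontier = [problem]  # each state = problem text + accepted thoughts so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(state, b):
                child = state + "\n" + thought
                # Score each partial trajectory; negate for a max-heap.
                heapq.heappush(candidates, (-value(child), child))
        # Keep the top-`beam` children; unpromising branches are dropped here,
        # which plays the role of backtracking in the diagram above.
        frontier = [heapq.heappop(candidates)[1]
                    for _ in range(min(beam, len(candidates)))]
    return max(frontier, key=value)
```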
GoT relaxes ToT's tree constraint: thoughts can merge (combine insights from siblings), loop (refine by revisiting), and form arbitrary DAGs. This matches natural reasoning, where a conclusion may draw on multiple lines of argument [2][6].
```mermaid
graph TD
    P[Problem] --> T1[Sub-idea 1]
    P --> T2[Sub-idea 2]
    P --> T3[Sub-idea 3]
    T1 --> M[Merge & Refine]
    T2 --> M
    T3 --> M
    M --> V[Verify]
    V --> T1
    V --> Final[Answer]
```
The 2024 survey Demystifying Chains, Trees, and Graphs of Thoughts (Besta et al., updated 2025) taxonomizes the design space along topology, schedule, and representation axes, and provides performance analyses showing GoT wins on problems with combinable subproblems (sorting, set operations, document merging) [7].
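To make the merge-and-loop structure concrete, here is a sketch of the decompose → solve → merge → verify cycle from the diagram above. It assumes a generic `llm(prompt)` completion callable and plain-text protocols (one subproblem per line, a leading "OK" verdict); all of these are illustrative conventions, not the method of any specific GoT paper.

```python
# Graph-of-Thoughts sketch: merge nodes draw on several parents (a DAG, not a
# tree), and verification feeds a critique back along a loop edge.
def got_solve(llm, problem: str, n_parts: int = 3, max_rounds: int = 2) -> str:
    # Decompose into sub-ideas, one per line by construction of the prompt.
    subs = llm(
        f"Split into {n_parts} independent subproblems, one per line:\n{problem}"
    ).strip().splitlines()[:n_parts]
    partials = [llm(f"Solve this subproblem:\n{s}") for s in subs]
    merged = ""
    for _ in range(max_rounds):
        # Merge node: combine insights from all sibling branches.
        merged = llm("Combine these partial solutions into one answer:\n"
                     + "\n---\n".join(partials))
        verdict = llm(f"Does this solve '{problem}'? Reply OK or a critique:\n{merged}")
        if verdict.strip().startswith("OK"):
            return merged
        # Loop edge: refine each partial solution against the critique.
        partials = [llm(f"Refine given critique '{verdict}':\n{p}") for p in partials]
    return merged
```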
Adaptive Graph-of-Thoughts (AGoT, 2025) recursively decomposes complex queries into a dynamic DAG, choosing the structure per query rather than fixing it in advance [8]. Charts-of-Thought (2025) applies DAG-style merging to multimodal chart/VQA tasks [9]. Frameworks in 2025–2026 (e.g., the "Foundation Framework for Dynamic and Optimized Reasoning") learn when to use chains, trees, or graphs based on problem features [9].
| Structure | Topology | Backtrack | Merge | Best for | Cost |
|---|---|---|---|---|---|
| CoT | linear | ✗ | ✗ | well-defined single-path problems | low |
| Self-consistency CoT | k parallel chains | ✗ | vote | noisy answers, math | k × CoT |
| ToT | tree | ✓ | ✗ | planning, search, puzzles | b^d |
| GoT | DAG | ✓ | ✓ | decomposable/combinable tasks | tunable |
| AGoT | dynamic DAG | ✓ | ✓ | heterogeneous workloads | adaptive |
In late 2024 OpenAI released o1, the first widely available model trained (via reinforcement learning over long CoT traces) to reason natively before answering. It was followed by o3 and o4-mini in 2025 — o3/o4-mini can "think with images," embedding visual operations into their chain of thought [6]. These models have a distinct two-phase output: a hidden reasoning trace (billed as reasoning tokens) and a user-visible final answer.
DeepSeek-R1 (Jan 2025) proved the paradigm was reproducible at open weights. R1-Zero used pure RL without supervised fine-tuning and still developed emergent long CoT, reflection, and "aha moments" [10]. It matched o1 at roughly 70% lower cost [5], catalyzing an explosion of open reasoners: QwQ-32B, R1-Distill variants, Seed1.5-Thinking (86.7 on AIME 2024, 77.3 on GPQA) [8].
Anthropic's Claude Thinking evolved rapidly through 2025–2026:
- Extended thinking with explicit token budgets (e.g., thinking-16k) controlling reasoning depth [1].
- Adaptive thinking with an effort parameter (low | medium | high | max | xhigh) [2][7][10]. Extended thinking is deprecated on Opus-class models in favor of adaptive mode [10].

Google Gemini 2.5/3 Thinking uses similar hybrid reasoning; arcprize.org's 2025 evaluation across frontier labs found no single winner — o3, DeepSeek-R1, and Gemini 3 each lead on different reasoning benchmarks [5].
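As a concrete example of the budget-style controls above, here is a sketch using Anthropic's published Messages API with a `thinking` budget. The `budget_tokens` field is documented for extended thinking; the model id is an assumption, and the newer adaptive `effort` parameter may use different names, so treat this as illustrative rather than authoritative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # model id is an assumption; check current docs
    max_tokens=20000,           # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16000},  # cf. "thinking-16k"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
# The response interleaves thinking blocks (the reasoning trace, billed as
# reasoning tokens) and text blocks (the user-visible final answer).
for block in response.content:
    if block.type == "text":
        print(block.text)
```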
A notable finding: DeepSeek-R1 mentions an injected cue in its reasoning 59% of the time vs. 7% for the non-reasoning base model — reasoning models are measurably more faithful about what actually drove their answer [10]. This matters for agent debugging and alignment auditing.
Test-time compute (TTC) is the deliberate allocation of additional compute at inference to improve answer quality, holding weights fixed [3][4]. Empirically, accuracy on hard math, code, and science benchmarks scales log-linearly with reasoning tokens — up to a plateau [4][8].
```mermaid
graph LR
    Prompt --> Gen[Generate N tokens / k samples]
    Gen --> Ver[Verifier / PRM]
    Ver --> Sel[Select best]
    Sel --> Out[Answer]
    Ver -.refine.-> Gen
```
| Strategy | Mechanism | Example |
|---|---|---|
| Long CoT | one very long reasoning trace | o1, R1 |
| Best-of-N | sample N answers, rank by verifier | Math, code gen |
| Self-consistency | majority-vote N chains | CoT+SC |
| Tree search | ToT/MCTS with value model | AlphaCode-style |
| Self-reflection | generate → critique → revise | Reflexion, SSR |
| Process reward models (PRM) | score each step, prune | OpenAI, DeepMind |
| Parallel reasoning (ThreadWeaver) | parallelize CoT branches | 1.5× latency cut [5] |
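Of the strategies in the table, best-of-N is the easiest to sketch. This version assumes placeholder callables `llm(prompt, temperature)` and `verifier(question, answer) -> float` (an outcome verifier); a PRM variant would instead score every intermediate step and prune partial traces early.

```python
# Best-of-N: spend test-time compute by sampling N candidates and ranking them.
def best_of_n(llm, verifier, question: str, n: int = 16) -> str:
    candidates = [llm(question, temperature=1.0) for _ in range(n)]
    # Accuracy tends to rise roughly log-linearly in N before plateauing,
    # matching the test-time scaling behavior described above.
    return max(candidates, key=lambda ans: verifier(question, ans))
```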
Reasoning models frequently waste tokens on easy prompts ("Is 2+2=4?" → 3000 thinking tokens). This motivated Claude's adaptive mode, effort sliders on o3/o4-mini, and research on dynamic early-stopping verifiers [2][6]. "Overthinking in LLM TTC" (2025) formalizes the phenomenon and proposes length penalties in RL fine-tuning [2].
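One way to curb overthinking is dynamic early stopping: think in small chunks and exit once a verifier is confident. The sketch below assumes two hypothetical functions — `llm_continue(trace, n_tokens)`, which extends the reasoning trace, and `confidence(trace) -> float`, a verifier's probability that the current answer is already correct — neither is a real API.

```python
# Early-stopping reasoning loop: easy prompts exit after one chunk instead of
# burning the full thinking budget.
def reason_with_early_stop(llm_continue, confidence, prompt: str,
                           chunk: int = 256, budget: int = 4096,
                           threshold: float = 0.9) -> str:
    trace = prompt
    spent = 0
    while spent < budget:
        trace = llm_continue(trace, chunk)  # think a little more
        spent += chunk
        if confidence(trace) >= threshold:  # verifier says the answer is stable
            break
    return trace
```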
A modern agent combines a reasoning model with tools, memory, and verification:
```mermaid
graph TD
    U[User query] --> Plan[Plan: decompose / select structure]
    Plan --> Reason[Reason: CoT/ToT/GoT]
    Reason --> Tool[Tool call: search, code, retrieve]
    Tool --> Reason
    Reason --> Verify[Verifier / Self-audit]
    Verify -- fail --> Reflect[Reflect & retry]
    Reflect --> Reason
    Verify -- pass --> Answer
```
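A minimal sketch of this loop follows, assuming placeholder callables `llm` (a reasoning model), `run_tool(name, args)` (search, code, retrieval), and `verify(query, answer) -> (bool, critique)`. The plain-text `TOOL:`/`ANSWER:` protocol is an illustrative convention, not any framework's actual format.

```python
# Agent loop: plan -> reason -> tool -> verify, with reflect-and-retry on failure.
def agent(llm, run_tool, verify, query: str, max_retries: int = 3) -> str:
    plan = llm(f"Decompose into steps and pick a reasoning structure:\n{query}")
    context = f"Query: {query}\nPlan: {plan}"
    answer = ""
    for _ in range(max_retries):
        step = llm(f"{context}\nReason step by step; reply with either "
                   "TOOL:<name>:<args> or ANSWER:<final answer>.")
        for _ in range(8):  # bound tool calls per attempt
            if not step.startswith("TOOL:"):
                break
            _, name, args = step.split(":", 2)
            context += f"\nObservation: {run_tool(name, args)}"  # tool feedback
            step = llm(f"{context}\nContinue (TOOL:... or ANSWER:...).")
        answer = step.removeprefix("ANSWER:").strip()
        ok, critique = verify(query, answer)  # verifier / self-audit gate
        if ok:
            return answer
        context += f"\nCritique: {critique}"  # reflect & retry
    return answer
```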
[1] OpenAI — OpenAI o-series / Thinking with Images. https://aiwiki.ai/wiki/openai_o-series · https://openai.com/index/thinking-with-images/
[2] Overthinking in LLM Test-Time Compute Scaling; Graph-of-Thought: Non-Linear Reasoning with Merge and Refine. arXiv & subodhjena.com, 2025. https://arxiv.org/html/2604.10739v1 · https://subodhjena.com/blog/graph-of-thought-nonlinear-reasoning
[3] Inference-Time Scaling Law (EmergentMind); Test-Time Compute: Sampling, Refinement, and Optimal Inference (Brenndoerfer, 2025). https://api.emergentmind.com/topics/inference-time-scaling-law · https://mbrenndoerfer.com/writing/test-time-compute-scaling-sampling-refinement-optimal-inference
[4] Spheron Network — Inference-Time Compute Scaling on GPU Cloud (2026). https://www.spheron.network/blog/inference-time-compute-scaling-gpu-cloud/
[5] Introl — Inference-Time Scaling Research: Reasoning Models (December 2025); ARC Prize — We Tested Every Major AI Reasoning System. https://introl.com/blog/inference-time-scaling-research-reasoning-models-december-2025 · https://arcprize.org/blog/which-ai-reasoning-model-is-best
[6] Maarten Grootendorst — A Visual Guide to Reasoning LLMs (2025); Socratic Self-Refine for LLM Reasoning. arXiv:2511.10621. https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms · https://arxiv.org/html/2511.10621
[7] Besta et al. — Demystifying Chains, Trees, and Graphs of Thoughts. arXiv:2401.14295 (updated 2025). https://arxiv.org/abs/2401.14295
[8] Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures (AGoT). arXiv:2502.05078; Exploring Test-Time Scaling Plateau. arXiv:2505.20522; Seed1.5-Thinking. arXiv:2504.13914. https://arxiv.org/html/2502.05078v1 · https://arxiv.org/html/2505.20522v2 · https://arxiv.org/html/2504.13914v3
[9] AllThings.how — Claude Adaptive Thinking Explained (2026); CometAPI — Thinking Mode in Claude 4.5; Charts-of-Thought (EmergentMind). https://www.allthings.how/claude-adaptive-thinking-explained-how-it-works-and-when-to-use-it/ · https://www.cometapi.com/thinking-mode-in-claude-4-5-all-you-need-to-know
[10] DeepSeek — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL. arXiv:2501.12948; Are DeepSeek R1 and Other Reasoning Models More Faithful? arXiv:2501.08156; Apiyi — Claude Adaptive Thinking Replaces Extended Thinking. https://arxiv.org/abs/2501.12948 · https://arxiv.org/html/2501.08156v4 · https://help.apiyi.com/en/claude-adaptive-thinking-mode-api-guide-replace-extended-thinking-en.html