
Agent Reasoning: Chain-of-Thought, Tree/Graph-of-Thoughts, Reasoning Models, and Test-Time Compute

Overview

Between 2024 and 2026, the frontier of large language model (LLM) capability has shifted from "scale the pre-training run" to "scale the thinking." Reasoning — the model's ability to decompose a problem, explore alternatives, self-verify, and commit to an answer — has become the primary axis of progress. This shift manifests in three tightly linked developments:

  1. Prompt-level reasoning structures: Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and Graph-of-Thoughts (GoT), plus adaptive hybrids that unify them.
  2. Native reasoning models: OpenAI's o-series (o1 → o3 → o4-mini), Anthropic's Claude Extended/Adaptive Thinking (Claude 4, 4.5, 4.6), DeepSeek-R1, Google's Gemini 2.5/3 Thinking, and open reasoners like QwQ-32B and Seed1.5-Thinking.
  3. Test-time compute (TTC) scaling: a new scaling law where accuracy rises with the number of reasoning tokens or parallel samples spent at inference, independent of model size [1][4][10].

Analysts now project inference to claim ~75% of total AI compute by 2030, with inference demand exceeding training demand by 118× by 2026 [5]. This document surveys the technical landscape of agent reasoning in 2025–2026.


1. Chain-of-Thought (CoT) — The Linear Baseline

CoT prompting elicits intermediate reasoning steps before a final answer. A prompt like "Let's think step by step" causes the model to emit a linear trace of inferences. CoT works because it lets the model spend additional forward passes before committing to an answer, and because the intermediate tokens act as a scratchpad that conditions later tokens [3][6].

graph LR
    Q[Question] --> S1[Step 1] --> S2[Step 2] --> S3[Step 3] --> A[Answer]

Limitations. A single chain cannot recover from a bad early step, and it commits to one trajectory. Self-consistency partially mitigates this by sampling k chains and majority-voting the answer, trading compute for robustness [6].
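Self-consistency is easy to sketch. In the snippet below, `sample_chain` is a stand-in for sampling one CoT completion from an LLM at temperature > 0 and parsing out its final answer; the simulated solver and its 80% per-chain accuracy are assumptions for the demo, not properties of any real model.

```python
import random
from collections import Counter

def sample_chain(question: str, rng: random.Random) -> str:
    """Stand-in for one sampled chain-of-thought; a real system would call
    an LLM at temperature > 0 and parse the final answer from the trace."""
    # Simulate a noisy solver: correct answer "9" with probability 0.8.
    return "9" if rng.random() < 0.8 else rng.choice(["8", "10"])

def self_consistency(question: str, k: int = 25, seed: int = 0) -> str:
    """Sample k independent chains and majority-vote the final answers [6]."""
    rng = random.Random(seed)
    answers = [sample_chain(question, rng) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 3 * 3?"))  # majority vote over 25 noisy chains
```

Even with individually unreliable chains, the vote concentrates on the right answer; the cost is exactly k times one CoT call, matching the trade-off described above.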

CoT is now the default scaffolding inside every reasoning model: DeepSeek-R1, o3, Claude Thinking, and Gemini all produce extended CoT traces internally before emitting a final answer [6][8].


2. Tree-of-Thoughts (ToT) — Branching and Backtracking

ToT generalizes CoT by letting the model propose several candidate next steps at each node, score them with a value function (often the LLM itself), and search the tree with BFS or DFS, backtracking when a branch looks unpromising [6][9].

graph TD
    Root[Problem] --> A[Thought A]
    Root --> B[Thought B]
    Root --> C[Thought C]
    A --> A1[A1]
    A --> A2[A2 ✗]
    B --> B1[B1 ✓]
    B --> B2[B2]
    B1 --> Ans[Answer]

ToT dominates CoT on tasks where a wrong early commitment is costly — Game of 24, Sudoku, creative writing with constraints, and multi-step planning. The trade-off is cost: expanding b branches to depth d means on the order of b^d LLM calls, making ToT expensive without aggressive pruning [2][6].
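The search loop itself is simple even though each node expansion is an expensive LLM call. Below is a minimal breadth-first ToT sketch: `propose` and `score` stand in for the LLM proposer and value function (replaced here by toy digit-string heuristics, an assumption for illustration), and pruning the frontier to a beam plays the role of backtracking from unpromising branches.

```python
# A minimal breadth-first Tree-of-Thoughts sketch [6][9].

def propose(state: str, branching: int = 3) -> list[str]:
    """Candidate next thoughts: extend the state with one more digit."""
    return [state + d for d in "123"[:branching]]

def score(state: str, target: str) -> float:
    """Toy value function: fraction of the target prefix already matched."""
    return sum(a == b for a, b in zip(state, target)) / len(target)

def tree_of_thoughts(target: str, depth: int, beam: int = 2) -> str:
    frontier = [""]
    for _ in range(depth):
        # Expand every surviving branch, then prune to the best `beam`.
        candidates = [c for s in frontier for c in propose(s)]
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam]   # pruning = implicit backtracking
    return frontier[0]

print(tree_of_thoughts("3123", depth=4))
```

With branching b and beam width w, this costs b × w calls per level rather than the full b^d — the aggressive pruning the text calls for.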


3. Graph-of-Thoughts (GoT) — Non-Linear Reasoning with Merging

GoT relaxes ToT's tree constraint: thoughts can merge (combine insights from siblings), loop (refine by revisiting), and form arbitrary DAGs. This matches natural reasoning, where a conclusion may draw on multiple lines of argument [2][6].

graph TD
    P[Problem] --> T1[Sub-idea 1]
    P --> T2[Sub-idea 2]
    P --> T3[Sub-idea 3]
    T1 --> M[Merge & Refine]
    T2 --> M
    T3 --> M
    M --> V[Verify]
    V --> T1
    V --> Final[Answer]

The 2024 survey Demystifying Chains, Trees, and Graphs of Thoughts (Besta et al., updated 2025) taxonomizes the design space along topology, schedule, and representation axes, and provides performance analyses showing GoT wins on problems with combinable subproblems (sorting, set operations, document merging) [7].

Adaptive Graph-of-Thoughts (AGoT, 2025) recursively decomposes complex queries into a dynamic DAG, choosing the structure per query rather than fixing it in advance [8]. Charts-of-Thought (2025) applies DAG-style merging to multimodal chart/VQA tasks [9]. Frameworks in 2025–2026 (e.g., the "Foundation Framework for Dynamic and Optimized Reasoning") learn when to use chains, trees, or graphs based on problem features [9].
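For the combinable workloads where GoT wins, the decompose/merge topology can be sketched directly. Sorting — one of the tasks analyzed in [7] — serves as the example below; `solve_chunk` and `merge` stand in for LLM-backed thought nodes, and the fan-in of several sibling thoughts into one aggregation node is exactly what a tree cannot express.

```python
# A minimal Graph-of-Thoughts sketch for a combinable task (sorting) [7].

def solve_chunk(chunk: list[int]) -> list[int]:
    """Sub-thought node: solve one piece of the decomposed problem."""
    return sorted(chunk)  # placeholder for an LLM-solved subproblem

def merge(a: list[int], b: list[int]) -> list[int]:
    """Aggregation node: combine two solved sub-thoughts."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def graph_of_thoughts(xs: list[int], chunks: int = 3) -> list[int]:
    n = max(1, len(xs) // chunks)
    sub = [solve_chunk(xs[i:i + n]) for i in range(0, len(xs), n)]
    merged = sub[0]
    for s in sub[1:]:              # edges from every sibling into one node
        merged = merge(merged, s)
    assert merged == sorted(xs)    # verify node: check before committing
    return merged

print(graph_of_thoughts([9, 4, 7, 1, 8, 2, 6, 3, 5]))
```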

Comparison

Structure            | Topology          | Backtrack | Merge | Best for                          | Cost
---------------------|-------------------|-----------|-------|-----------------------------------|---------
CoT                  | linear            | no        | no    | well-defined single-path problems | low
Self-consistency CoT | k parallel chains | no        | vote  | noisy answers, math               | k × CoT
ToT                  | tree              | yes       | no    | planning, search, puzzles         | b^d
GoT                  | DAG               | yes       | yes   | decomposable/combinable tasks     | tunable
AGoT                 | dynamic DAG       | yes       | yes   | heterogeneous workloads           | adaptive

4. Reasoning Models: From Prompting to Native Thinking

In late 2024 OpenAI released o1, the first widely available model trained (via reinforcement learning over long CoT traces) to reason natively before answering. It was followed by o3 and o4-mini in 2025 — o3/o4-mini can "think with images," embedding visual operations into their chain of thought [6]. These models have a distinct two-phase output: a hidden reasoning trace (billed as reasoning tokens) and a user-visible final answer.

DeepSeek-R1 (Jan 2025) proved the paradigm was reproducible at open weights. R1-Zero used pure RL without supervised fine-tuning and still developed emergent long CoT, reflection, and "aha moments" [10]. It matched o1 at roughly 70% lower cost [5], catalyzing an explosion of open reasoners: QwQ-32B, R1-Distill variants, Seed1.5-Thinking (86.7 on AIME 2024, 77.3 on GPQA) [8].

Anthropic's Claude Thinking evolved rapidly through 2025–2026:

  • Claude 4 Extended Thinking (Apr 2025): explicit token budget (e.g., thinking-16k) controlling reasoning depth [1].
  • Claude 4.5 Sonnet / Haiku (Sept/Oct 2025): Haiku became the first "small" model with extended thinking; Sonnet 4.5 sustains autonomous work for 30+ hours [8][9].
  • Claude 4.6 Adaptive Thinking (late 2025 / early 2026): replaces fixed-budget extended thinking. The model itself decides whether to think and how long, controlled only by an effort parameter (low | medium | high | max | xhigh) [2][7][10]. Extended thinking is deprecated on Opus-class models in favor of adaptive mode [10].
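Concretely, the control surface changed shape between these releases. The fragment below contrasts the two request styles using the parameter names described above (`budget_tokens`, `effort`); the exact payload layout is an assumption for illustration, not the official SDK surface.

```python
# Request-shape sketch for the two thinking modes described in [1][10].
# Parameter names follow the cited docs; the surrounding layout is assumed.

fixed_budget = {   # Claude 4-era extended thinking: explicit token budget
    "thinking": {"type": "enabled", "budget_tokens": 16000},
}

adaptive = {       # Claude 4.6 adaptive thinking: the model decides depth
    "effort": "high",   # one of: low | medium | high | max | xhigh
}
```

The shift is from the caller budgeting reasoning per request to the model allocating it per query, with `effort` as the only knob.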

Google Gemini 2.5/3 Thinking uses similar hybrid reasoning; arcprize.org's 2025 evaluation across frontier labs found no single winner — o3, DeepSeek-R1, and Gemini 3 each lead on different reasoning benchmarks [5].

Faithfulness and behavior

A notable finding: DeepSeek-R1 mentions an injected cue in its reasoning 59% of the time vs. 7% for the non-reasoning base model — reasoning models are measurably more faithful about what actually drove their answer [10]. This matters for agent debugging and alignment auditing.


5. Test-Time Compute (TTC): The New Scaling Law

TTC is the deliberate allocation of additional compute at inference to improve answer quality, holding weights fixed [3][4]. Empirically, accuracy on hard math, code, and science benchmarks scales log-linearly with reasoning tokens — up to a plateau [4][8].

graph LR
    Prompt --> Gen[Generate N tokens / k samples]
    Gen --> Ver[Verifier / PRM]
    Ver --> Sel[Select best]
    Sel --> Out[Answer]
    Ver -.refine.-> Gen

TTC strategies

Strategy                          | Mechanism                          | Example
----------------------------------|------------------------------------|----------------------
Long CoT                          | one very long reasoning trace      | o1, R1
Best-of-N                         | sample N answers, rank by verifier | math, code gen
Self-consistency                  | majority-vote N chains             | CoT+SC
Tree search                       | ToT/MCTS with value model          | AlphaCode-style
Self-reflection                   | generate → critique → revise       | Reflexion, SSR
Process reward models (PRM)       | score each step, prune             | OpenAI, DeepMind
Parallel reasoning (ThreadWeaver) | parallelize CoT branches           | 1.5× latency cut [5]
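Best-of-N is the simplest of these strategies to make concrete. In the sketch below, `generate` and `verify` are stand-ins for an LLM sampler and a verifier (here an executable arithmetic check); the candidate pool is simulated, an assumption for the demo.

```python
import random

# Best-of-N sketch: sample N candidates, keep the one the verifier
# scores highest [3][4].

def generate(prompt: str, rng: random.Random) -> str:
    # Simulated sampler: candidate answers of varying quality.
    return rng.choice(["4", "5", "22", "four"])

def verify(prompt: str, answer: str) -> float:
    # Simulated verifier: an executable check (literal arithmetic here).
    try:
        return 1.0 if int(answer) == eval(prompt) else 0.0
    except (ValueError, SyntaxError):
        return 0.0

def best_of_n(prompt: str, n: int = 32, seed: int = 0) -> str:
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda a: verify(prompt, a))

print(best_of_n("2 + 2"))
```

Note that the whole strategy lives or dies on `verify` — with a flat or noisy scorer, the `max` degenerates to picking an arbitrary sample, which is the "no gain" failure mode the practical guidance below warns about.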

Scaling-law highlights (2025–2026)

  • Spheron's 2026 analysis: "The model doesn't change. The weights stay fixed. What changes is how many tokens the model generates while working through the problem" — accuracy on AIME and GPQA climbs steadily with token budget [4].
  • The test-time scaling plateau paper shows returns diminish after a task-dependent ceiling; beyond that, additional tokens produce "overthinking" and even regressions [8].
  • P1 (Dec 2025) became the first open-source model to win physics olympiad gold by combining RL training with test-time agent loops [5].
  • Compute-optimal TTC papers show pairing a smaller model with a strong verifier can beat a larger model at equal compute — a foundational result for cheap reasoning [3][5].

Overthinking and its costs

Reasoning models frequently waste tokens on easy prompts ("Is 2+2=4?" → 3000 thinking tokens). This motivated Claude's adaptive mode, effort sliders on o3/o4-mini, and research on dynamic early-stopping verifiers [2][6]. "Overthinking in LLM TTC" (2025) formalizes the phenomenon and proposes length penalties in RL fine-tuning [2].
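One illustrative mitigation is a stability-based stopping rule: keep reasoning in chunks, but halt as soon as the running answer stops changing rather than exhausting the token budget. Everything below (`reason_chunk`, the patience heuristic) is a toy stand-in for a learned stopping verifier, not any published method's exact rule.

```python
# Dynamic early-stopping sketch for overthinking [2][6].

def reason_chunk(question: str, step: int) -> str:
    # Simulated trace: the model's answer stabilizes after a few steps.
    return "42" if step >= 2 else f"draft-{step}"

def answer_with_early_stop(question: str, max_steps: int = 20,
                           patience: int = 2) -> tuple[str, int]:
    history: list[str] = []
    for step in range(max_steps):
        history.append(reason_chunk(question, step))
        # Stop once the last `patience` + 1 answers agree — a cheap proxy
        # for a learned stopping verifier.
        if len(history) > patience and len(set(history[-patience - 1:])) == 1:
            return history[-1], step + 1
    return history[-1], max_steps

answer, steps_used = answer_with_early_stop("life, the universe, everything")
print(answer, steps_used)  # stops well before the 20-step budget
```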


6. Agent Reasoning: Putting It All Together

A modern agent combines a reasoning model with tools, memory, and verification:

graph TD
    U[User query] --> Plan[Plan: decompose / select structure]
    Plan --> Reason[Reason: CoT/ToT/GoT]
    Reason --> Tool[Tool call: search, code, retrieve]
    Tool --> Reason
    Reason --> Verify[Verifier / Self-audit]
    Verify -- fail --> Reflect[Reflect & retry]
    Reflect --> Reason
    Verify -- pass --> Answer
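The loop above can be sketched in a few lines. The reasoner, tool, and verifier below are stubs (assumptions for the demo); the part being illustrated is the control flow, with verifier failures feeding back as reflection before a retry.

```python
# Minimal plan -> reason -> tool -> verify -> reflect agent loop.

def reason(task, feedback):
    # Stub reasoner: improves once it has verifier feedback.
    return "sqrt(144) = 12" if feedback else "sqrt(144) = 14"

def tool(claim):
    # Tool call: ground the claim with executable math.
    return 144 ** 0.5

def verify(claim, grounded):
    return claim.endswith(str(int(grounded)))

def agent(task, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        claim = reason(task, feedback)
        grounded = tool(claim)
        if verify(claim, grounded):
            return claim                   # verifier pass -> answer
        feedback = f"expected {grounded}"  # verifier fail -> reflect & retry
    raise RuntimeError("verification failed after retries")

print(agent("compute sqrt(144)"))
```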

2026 research directions

  • Verify-before-commit: internal self-auditing so agents detect coherent-but-wrong reasoning before executing side-effects [4].
  • Dual-process speculation (DualSpec, Mar 2026): treats web-search as System-2, page-visit as System-1, and overlaps them to cut deep-research agent latency [5].
  • Meta-RL with self-reflection (MR-Search): agents learn how to search at test time, improving exploration in-context [3].
  • Socratic Self-Refine (SSR, late 2025): decomposes answers into (sub-question, sub-answer) pairs, re-solves each to estimate per-step confidence, and refines unreliable steps [6].
  • Evidence-centric CoT feedback: persisting reasoning across similar queries so agents stop "reasoning from scratch every time," reducing variance [2].
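Of these, the SSR mechanic is compact enough to sketch. Below, a draft solution is split into (sub-question, sub-answer) pairs and each step is re-solved independently; plain arithmetic stands in for the LLM re-solver (an assumption for the demo), and disagreement flags a low-confidence step to refine.

```python
# Socratic Self-Refine sketch [6]: re-solve each step, refine disagreements.

def resolve(sub_question: str) -> str:
    # Independent re-solve of one step (stand-in for an LLM call).
    return str(eval(sub_question))

def socratic_self_refine(steps):
    refined = []
    for sub_q, sub_a in steps:
        check = resolve(sub_q)
        # Low per-step confidence when the re-solve disagrees -> refine.
        refined.append((sub_q, check if check != sub_a else sub_a))
    return refined

draft = [("2 + 3", "5"), ("5 * 4", "24")]  # second step is wrong
print(socratic_self_refine(draft))
```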

Practical guidance

  • Default to CoT/self-consistency for bounded tasks; escalate to ToT/GoT only when a single path is insufficient.
  • Prefer native reasoning models (o3, Claude 4.6 adaptive, R1) over hand-crafted ToT prompts — RL-trained reasoning is usually cheaper and stronger.
  • Budget reasoning: use adaptive/effort controls; cap with length penalties to prevent overthinking.
  • Always pair generators with verifiers (LLM-as-judge, unit tests, executable checks) — best-of-N without a good verifier ≈ no gain.
  • Treat the reasoning trace as debugging data, not user-facing text.

Key Takeaways

  1. CoT → ToT → GoT → adaptive DAGs: reasoning scaffolds are converging on dynamic, per-query structures rather than fixed templates [7][8].
  2. Native reasoning models win: o1/o3/o4-mini, Claude Thinking, DeepSeek-R1, and Gemini 3 Thinking have made test-time reasoning a built-in model capability, not a prompt trick [1][5][10].
  3. TTC is the dominant 2025–2026 scaling lever: accuracy scales with inference tokens up to a plateau; inference compute will dwarf training compute by 2030 [3][5].
  4. Adaptive effort beats fixed budgets: Claude 4.6's adaptive thinking and o-series effort levels address the overthinking tax that hurt early reasoning models [10][2].
  5. Verification is the rate-limiter: the next leap comes from better process reward models, self-auditing, and tool-grounded verifiers, not longer chains [4][6].
  6. Open weights keep pace: DeepSeek-R1, QwQ-32B, Seed1.5-Thinking, and P1 show the reasoning paradigm is no longer a closed-lab advantage [5][8].

References

[1] OpenAI — OpenAI o-series / Thinking with Images. https://aiwiki.ai/wiki/openai_o-series · https://openai.com/index/thinking-with-images/
[2] Overthinking in LLM Test-Time Compute Scaling, arXiv, 2025; Graph-of-Thought: Non-Linear Reasoning with Merge and Refine, subodhjena.com, 2025. https://arxiv.org/html/2604.10739v1 · https://subodhjena.com/blog/graph-of-thought-nonlinear-reasoning
[3] Inference-Time Scaling Law (EmergentMind); Test-Time Compute: Sampling, Refinement, and Optimal Inference (Brenndoerfer, 2025). https://api.emergentmind.com/topics/inference-time-scaling-law · https://mbrenndoerfer.com/writing/test-time-compute-scaling-sampling-refinement-optimal-inference
[4] Spheron Network — Inference-Time Compute Scaling on GPU Cloud (2026). https://www.spheron.network/blog/inference-time-compute-scaling-gpu-cloud/
[5] Introl — Inference-Time Scaling Research: Reasoning Models (December 2025); ARC Prize — We Tested Every Major AI Reasoning System. https://introl.com/blog/inference-time-scaling-research-reasoning-models-december-2025 · https://arcprize.org/blog/which-ai-reasoning-model-is-best
[6] Maarten Grootendorst — A Visual Guide to Reasoning LLMs (2025); Socratic Self-Refine for LLM Reasoning, arXiv:2511.10621. https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms · https://arxiv.org/html/2511.10621
[7] Besta et al. — Demystifying Chains, Trees, and Graphs of Thoughts, arXiv:2401.14295 (updated 2025). https://arxiv.org/abs/2401.14295
[8] Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures (AGoT), arXiv:2502.05078; Exploring the Test-Time Scaling Plateau, arXiv:2505.20522; Seed1.5-Thinking, arXiv:2504.13914. https://arxiv.org/html/2502.05078v1 · https://arxiv.org/html/2505.20522v2 · https://arxiv.org/html/2504.13914v3
[9] AllThings.how — Claude Adaptive Thinking Explained (2026); CometAPI — Thinking Mode in Claude 4.5; Charts-of-Thought (EmergentMind). https://www.allthings.how/claude-adaptive-thinking-explained-how-it-works-and-when-to-use-it/ · https://www.cometapi.com/thinking-mode-in-claude-4-5-all-you-need-to-know
[10] DeepSeek — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL, arXiv:2501.12948; Are DeepSeek R1 and Other Reasoning Models More Faithful?, arXiv:2501.08156; Apiyi — Claude Adaptive Thinking Replaces Extended Thinking. https://arxiv.org/abs/2501.12948 · https://arxiv.org/html/2501.08156v4 · https://help.apiyi.com/en/claude-adaptive-thinking-mode-api-guide-replace-extended-thinking-en.html
