
How Agents Navigate and Plan

Overview

Planning and reasoning are the capabilities that separate a chatbot from a genuine AI agent. They give the system the ability to decompose ambiguous goals into concrete steps, evaluate trade-offs between approaches, and recover gracefully when reality diverges from the plan [1]. In 2025–2026, the field has converged on a small family of reasoning patterns — Chain-of-Thought, ReAct, Plan-and-Execute, Tree-of-Thoughts, and Reflexion — while a new generation of reasoning models (OpenAI o3, Claude with extended thinking, Gemini 2.5 Deep Think, DeepSeek R1) is beginning to internalize many of these patterns directly into model weights [2][3].

This document surveys the core planning and reasoning architectures that power modern AI agents, traces their evolution through 2026, and examines how reasoning models are reshaping the landscape.


1. Chain-of-Thought: The Foundation

Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022) and extended by Kojima et al. (2022) with zero-shot CoT ("Let's think step by step"), is the foundation all other reasoning patterns build on [4]. CoT elicits step-by-step reasoning from a language model, giving agents the ability to reason within a single step before committing to an action.

How it works: The model generates intermediate reasoning steps — breaking a problem into sub-problems, working through each, and synthesizing a final answer — all within a single forward pass. No external tools are involved.

Strengths: Simple, cheap, and effective for problems where the model's parametric knowledge is sufficient. Zero-shot CoT requires no examples.

Limitations: CoT keeps reasoning entirely inside the model. It relies on the model's training data and cannot ground its reasoning in external observations. This makes it prone to hallucination on factual questions and unable to interact with the world.

┌─────────────┐
│  Question   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Step 1...  │
│  Step 2...  │  ← Internal reasoning only
│  Step 3...  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Answer    │
└─────────────┘

CoT is the inner monologue of every agent pattern that follows. ReAct extends it with actions; Tree-of-Thoughts branches it; Reflexion critiques it.
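As a minimal sketch, the two-stage zero-shot CoT recipe from Kojima et al. (2022) — first elicit the reasoning, then extract the answer — can be written as below. The model call is stubbed out; `fake_llm` and the prompt wording are illustrative, not any specific API:

```python
def zero_shot_cot(question: str, llm) -> str:
    """Two-stage zero-shot CoT:
    1) elicit a reasoning trace with the trigger phrase,
    2) extract the final answer from that trace."""
    reasoning = llm(f"Q: {question}\nA: Let's think step by step.")
    answer = llm(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()

# Stub standing in for a real model call (an assumption for this demo).
def fake_llm(prompt: str) -> str:
    if "Therefore, the answer is" in prompt:
        return " $14"
    return "7 pens at $2 each is 7 * 2 = 14 dollars."

print(zero_shot_cot("A shop sells pens at $2 each. What do 7 pens cost?", fake_llm))
# prints: $14
```

Note that the whole computation happens inside the two model calls — there is no tool use, which is exactly the limitation the patterns below address.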


2. ReAct: The Dominant Paradigm

ReAct (Reason + Act), introduced by Yao et al. at ICLR 2023, is the most important single-agent pattern and the default architecture for production agents in 2026 [5][6]. It interleaves chain-of-thought reasoning with tool-use actions in a single generation loop.

The Core Loop

graph TD
    A[Task / Question] --> B[Thought]
    B --> C[Action: tool call]
    C --> D[Observation: tool result]
    D --> B
    D --> E{Done?}
    E -->|No| B
    E -->|Yes| F[Final Answer]

Each iteration follows three steps:

  1. Thought — The agent reasons about the current state: what it knows, what it needs, and what to do next.
  2. Action — The agent calls an external tool (search, calculator, database, API) with specific inputs.
  3. Observation — The tool's output is fed back as context for the next thought.

This cycle repeats until the agent produces a final answer or hits a maximum iteration limit [7].
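The loop above can be sketched in a few lines of Python. The LLM and tools are stubbed; the dict shape returned by `llm` is an assumption for this sketch, not a real model API:

```python
from typing import Callable

def react_loop(task: str, llm: Callable, tools: dict, max_iters: int = 10) -> str:
    """Minimal ReAct loop: Thought -> Action -> Observation, repeated
    until the model emits a final answer or the iteration cap is hit."""
    transcript = [f"Task: {task}"]
    for _ in range(max_iters):
        step = llm("\n".join(transcript))          # Thought + proposed action
        transcript.append(f"Thought: {step['thought']}")
        if step.get("final_answer") is not None:   # model decided it's done
            return step["final_answer"]
        observation = tools[step["action"]](step["action_input"])
        transcript.append(f"Action: {step['action']}({step['action_input']})")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("ReAct loop hit max_iters without an answer")

# Scripted two-step trajectory standing in for a real model.
calls = iter([
    {"thought": "I need the population.", "action": "search",
     "action_input": "population of Reykjavik", "final_answer": None},
    {"thought": "I have what I need.", "final_answer": "~140,000"},
])
answer = react_loop(
    "What is the population of Reykjavik?",
    llm=lambda prompt: next(calls),
    tools={"search": lambda q: "Reykjavik population: about 140,000"},
)
```

Each observation is appended to the transcript, so every subsequent thought is grounded in what the tools actually returned.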

Why ReAct Works

The key insight is that generating reasoning traces alongside actions dramatically reduces hallucination — the model grounds each reasoning step in observed evidence rather than relying solely on parametric memory [8]. On benchmarks like HotPotQA, FEVER, and WebShop, ReAct outperformed standalone CoT by anchoring every inference to real data [4].

ReAct in 2026: Evolved Form

Most production agents now use native function calling (GPT-4o, Claude 3.x/4.x, Gemini 2.x) which is functionally equivalent to ReAct but more reliable than text-parsed Thought/Action traces [5]. The ReAct mental model — reason, act, observe, repeat — remains the dominant paradigm, but the implementation has shifted from string parsing to structured tool-call APIs.

Production safeguards are essential [6]:

  • Max iterations (typically 10 steps) to prevent infinite loops
  • Context-length budgets to avoid overflow on long trajectories
  • Cost controls — a 6-step ReAct loop costs roughly $0.15 per run at scale [9]
  • Human escalation — if the agent cannot solve a problem in N steps, escalate rather than loop
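One way to enforce these safeguards is a small guard object checked once per loop iteration; the thresholds below are illustrative defaults, not recommendations from any framework:

```python
from dataclasses import dataclass
import time

@dataclass
class RunGuard:
    """Per-run safety limits for an agent loop (illustrative values)."""
    max_iters: int = 10
    max_cost_usd: float = 0.50
    max_seconds: float = 60.0

    def __post_init__(self):
        self.iters = 0
        self.cost = 0.0
        self.start = time.monotonic()

    def check(self, step_cost_usd: float) -> None:
        """Call once per iteration; raises to trigger human escalation."""
        self.iters += 1
        self.cost += step_cost_usd
        if self.iters > self.max_iters:
            raise RuntimeError("escalate: max iterations exceeded")
        if self.cost > self.max_cost_usd:
            raise RuntimeError("escalate: cost budget exceeded")
        if time.monotonic() - self.start > self.max_seconds:
            raise RuntimeError("escalate: time budget exceeded")
```

Raising rather than silently stopping makes the escalation path explicit: the caller decides whether to retry, hand off to a human, or fail the request.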

LangGraph (v1.0, GA October 2025) is the production standard for ReAct agents, implementing the pattern as a stateful graph where nodes handle tool calls and edges route based on the model's next action [5].


3. Plan-and-Execute: Seeing the Big Picture

Plan-and-Execute addresses ReAct's fundamental limitation: the agent never sees the big picture [10]. Instead of deciding one step at a time, a powerful model analyzes the full task and generates a plan — a DAG of subtasks with dependencies. A simpler, cheaper model then executes each step. If a step fails, a replanner revises the remaining steps.

graph TD
    A[Complex Task] --> B[Planner: strong model]
    B --> C[Step 1]
    B --> D[Step 2]
    B --> E[Step 3]
    C --> F[Executor: cheap model]
    D --> F
    E --> F
    F --> G{All steps pass?}
    G -->|No| H[Replanner]
    H --> B
    G -->|Yes| I[Final Result]
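The plan/execute/replan cycle can be sketched as below. The planner, executor, and replanner are all stand-ins (in practice, a strong model, a cheap model, and a replanning prompt respectively); this simplified version treats the plan as an ordered list rather than a full DAG:

```python
def plan_and_execute(task, planner, executor, replanner, max_replans=2):
    """Planner drafts the whole step list up front; a cheaper executor
    runs each step; on failure the replanner revises the remaining steps."""
    steps = list(planner(task))
    results, replans = [], 0
    while steps:
        step = steps.pop(0)
        ok, output = executor(step, results)
        if ok:
            results.append(output)
            continue
        if replans >= max_replans:
            raise RuntimeError("escalate: replanning budget exhausted")
        steps = replanner(task, step, results)   # revised remaining steps
        replans += 1
    return results

results = plan_and_execute(
    "build report",
    planner=lambda t: ["fetch", "summarize"],
    executor=lambda step, prior: (True, step + " done"),
    replanner=lambda t, s, r: [],
)
```

Because `planner(task)` returns the full plan before any execution starts, the plan can be shown to a human for review — the inspectability advantage noted below.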

Key Advantages Over ReAct

| Dimension | ReAct | Plan-and-Execute |
|---|---|---|
| Planning horizon | One step at a time | Full task upfront |
| LLM calls | One per step (expensive) | Fewer total calls |
| Inspectability | Emergent trajectory | Explicit plan before execution |
| Best for | Exploratory tasks | Sequential dependencies, pipelines |

Plan-and-Execute is better suited for tasks with clear sequential dependencies — data pipelines, deployment workflows, multi-step form processing — where the structure of the work is known in advance [6]. The plan is inspectable before execution starts, enabling human review.


4. ReWOO: Maximum Token Efficiency

ReWOO (Reasoning Without Observation), introduced by Xu et al. (2023), is the most token-efficient reasoning pattern for multi-step tasks [4][10]. It separates planning from execution entirely:

  1. Plan — Generate all steps at once, using placeholders for tool outputs (e.g., #E1, #E2)
  2. Execute — Run all tools in parallel, filling in placeholders
  3. Synthesize — One final LLM call combines all results into an answer

Only 2 LLM calls total. On HotPotQA, ReWOO achieved 42.4% accuracy using ~2,000 tokens vs. ReAct's 40.8% at ~10,000 tokens — a 5× token efficiency gain [4].

The catch: ReWOO breaks if a tool returns something unexpected that would have changed the plan. It assumes the plan is correct upfront, with no opportunity for mid-course correction. This makes it ideal for well-understood, predictable workflows but fragile for exploratory tasks.
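The placeholder mechanism is the heart of ReWOO and can be sketched as below. The planner and solver are stubs standing in for the two LLM calls; independent steps could run in parallel, though this sketch runs them in order for simplicity:

```python
def rewoo(question, planner, tools, solver):
    """ReWOO sketch: one planning call emits every step up front, using
    #E1-style placeholders for tool outputs; tools then run with no
    further LLM calls; one solver call synthesizes the final answer."""
    plan = planner(question)   # e.g. [("E1", "search", "..."), ...]
    evidence = {}
    for label, tool_name, arg in plan:
        for ref, value in evidence.items():
            arg = arg.replace(f"#{ref}", value)   # fill in earlier evidence
        evidence[label] = tools[tool_name](arg)
    return solver(question, evidence)             # 2nd and final LLM call

facts = {"capital of France": "Paris", "population of Paris": "2.1M"}
answer = rewoo(
    "How many people live in the capital of France?",
    planner=lambda q: [("E1", "search", "capital of France"),
                       ("E2", "search", "population of #E1")],
    tools={"search": lambda q: facts[q]},
    solver=lambda q, ev: ev["E2"],
)
```

The fragility is visible in the code: if the `E1` lookup returned something the planner did not anticipate, the `E2` query would be malformed and there is no step where the model could notice and correct course.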


5. Tree-of-Thoughts: Branching Exploration

Tree-of-Thoughts (ToT), introduced by Yao et al. (2023), generalizes CoT by allowing the model to explore multiple reasoning paths simultaneously, evaluate intermediate states, and backtrack to more promising branches [8]. Where CoT commits to a single linear trace, ToT maintains a tree of partial solutions.

graph TD
    A[Problem] --> B1[Branch 1]
    A --> B2[Branch 2]
    A --> B3[Branch 3]
    B1 --> C1[Evaluate: 0.3]
    B2 --> C2[Evaluate: 0.8]
    B3 --> C3[Evaluate: 0.5]
    C2 --> D1[Expand Branch 2a]
    C2 --> D2[Expand Branch 2b]
    C3 --> D3[Expand Branch 3a]
    D1 --> E[Best Path → Execute with ReAct]

Two-Phase Architecture

  1. Phase 1 — Tree Search: Generate multiple candidate approaches, evaluate each with a scoring function, prune unpromising branches, expand promising ones using breadth-first or depth-first search.
  2. Phase 2 — Execution: Run a ReAct-style loop guided by the best path, calling real tools [7].
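Phase 1 can be sketched as a breadth-first beam search over partial reasoning paths. The `propose` and `score` callables are stand-ins for the LLM calls that generate and evaluate thoughts in a real ToT agent:

```python
def tree_of_thoughts(problem, propose, score, beam_width=3, depth=4):
    """Phase 1 of ToT as breadth-first beam search: propose candidate
    next thoughts for each partial path, score every candidate, and
    keep only the most promising branches."""
    frontier = [[problem]]                        # each path = list of thoughts
    for _ in range(depth):
        candidates = [path + [t] for path in frontier for t in propose(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)  # evaluate partial solutions
        frontier = candidates[:beam_width]        # prune unpromising branches
    return max(frontier, key=score)               # best path -> Phase 2 (ReAct)

# Toy run: binary "thoughts", score = count of 1s, so the search
# should converge on the all-ones path.
best = tree_of_thoughts("", propose=lambda path: ["0", "1"],
                        score=lambda path: path.count("1"),
                        beam_width=2, depth=2)
```

The cost structure is also visible here: every element of `candidates` at every level is an LLM call, plus a scoring call, which is where the branching-factor blow-up discussed below comes from.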

When to Use ToT

ToT outperforms CoT on problems requiring hypothesis exploration: GPT-4 + ToT solved 74% of Game of 24 tasks vs. 4% with standard CoT [4]. It excels at:

  • Mathematical proof construction
  • Code generation where multiple approaches must be evaluated
  • Complex multi-constraint optimization
  • Game-playing agents

The Cost Problem

A typical ToT run with branching factor 3 and depth 4 requires 3⁴ = 81 LLM calls at minimum, plus evaluation calls [8]. This makes ToT economically viable only for high-value, low-frequency tasks where accuracy outweighs speed and cost.


6. Reflexion and Self-Correction

Reflexion, introduced by Shinn et al. (NeurIPS 2023), adds something the other patterns lack: the ability to learn from failure within a single session [10][11].

The Reflexion Loop

graph TD
    A[Task] --> B[Attempt]
    B --> C[Evaluator: pass/fail]
    C -->|Pass| D[Return Result]
    C -->|Fail| E[Self-Reflection]
    E --> F[Episodic Memory]
    F --> B

  1. Attempt — The agent produces a result (code, answer, plan).
  2. Evaluate — An evaluator scores it (run unit tests, validate against a schema, check expected output).
  3. Reflect — If the score is below threshold, a self-reflection module generates a natural-language analysis of what went wrong: "My SQL query didn't account for NULL values" or "My code failed because it didn't handle empty lists."
  4. Retry — The reflection is stored in episodic memory and fed back on the next attempt.

This mimics reinforcement learning — the agent improves at a specific task over multiple tries without any weight updates [11]. Reflexion achieved 91% pass@1 on HumanEval coding benchmarks, surpassing GPT-4's prior state-of-the-art of 80% [4].
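The loop above fits in a few lines. The actor, evaluator, and reflector are stubs standing in for the LLM, the test harness, and the self-reflection prompt:

```python
def reflexion(task, actor, evaluator, reflector, max_trials=3):
    """Reflexion sketch: attempt, evaluate, reflect in natural language
    on failure, then retry with the reflections in episodic memory.
    No weights are updated; the 'learning' lives in the memory list."""
    memory = []                                   # episodic memory
    for _ in range(max_trials):
        attempt = actor(task, memory)             # conditioned on past lessons
        if evaluator(attempt):                    # e.g. run the unit tests
            return attempt
        memory.append(reflector(task, attempt))   # "what went wrong"
    raise RuntimeError(f"unsolved after {max_trials} trials: {memory}")

result = reflexion(
    "sort a list",
    actor=lambda task, mem: "handles empty" if mem else "crashes on []",
    evaluator=lambda a: a == "handles empty",
    reflector=lambda task, a: "did not handle empty lists",
)
```

The evaluator is the crux: Reflexion only works when there is a cheap, reliable pass/fail signal (unit tests, schema validation) to drive the retry.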

Agent-R: Scaling Self-Correction with MCTS

Agent-R (2025) advances self-correction by using Monte Carlo Tree Search to construct training samples that recover correct trajectories from erroneous ones [12]. Rather than waiting until the end of a rollout to revise errors, Agent-R identifies the first error step within a failed trajectory and splices it with an adjacent correct path from the search tree. This enables timely, mid-trajectory correction and has shown +5.59% improvement over baseline methods across interactive environments [12].

A related approach, SWE-Search, extends MCTS with a hybrid value function that combines numerical evaluation with qualitative natural-language assessment, enabling software engineering agents to iteratively refine their debugging strategies [13].


7. Hybrid Patterns: Combining Strengths

In practice, the most effective agents combine multiple patterns [10]:

ReAct + Reflexion (Most Common Hybrid)

Run a ReAct loop for step-by-step adaptation. If the final result fails validation, enter a Reflexion retry cycle with episodic memory. This gives you per-step grounding for the common case and self-correction for the hard ones.

Plan-Execute-Reflect

  1. Plan — Generate a numbered list of steps
  2. Execute — Work through each step sequentially, using tools
  3. Reflect — Evaluate execution against the original plan
  4. Refine — If reflection identifies gaps, generate a revised plan and re-execute [7]
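The four steps above compose into a simple outer loop; all three callables here are stand-ins, and the feedback channel from reflection back into planning is the part that distinguishes this from plain Plan-and-Execute:

```python
def plan_execute_reflect(task, planner, executor, reflector, max_rounds=2):
    """Plan -> Execute -> Reflect -> Refine: evaluate execution against
    the plan and re-plan while the reflector still reports gaps."""
    plan = planner(task, feedback=None)
    results = []
    for _ in range(max_rounds):
        results = [executor(step) for step in plan]   # execute sequentially
        gaps = reflector(task, plan, results)         # compare vs. the plan
        if not gaps:
            break
        plan = planner(task, feedback=gaps)           # revised plan
    return results

out = plan_execute_reflect(
    "write report",
    planner=lambda t, feedback: ["draft"] if feedback is None else ["draft", "cite"],
    executor=lambda step: step.upper(),
    reflector=lambda t, plan, res: None if "cite" in plan else "missing citations",
)
```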

ReAcTree: Hierarchical Task Decomposition

ReAcTree (2025) combines ReAct with hierarchical planning by decomposing a complex goal into manageable subgoals within a dynamically constructed agent tree [14]. Each node in the tree is a sub-agent running its own ReAct loop, with control flow managing dependencies between subtasks. This addresses ReAct's weakness on long-horizon tasks (100+ steps) where context grows linearly and the agent loses coherence.

Pattern Selection Guide

| Pattern | LLM Calls | Best For | Weakness |
|---|---|---|---|
| CoT | 1 | Simple reasoning, no tools needed | No grounding, hallucination risk |
| ReAct | 1 per step | Exploratory tasks, general-purpose | Expensive on long chains, no big picture |
| Plan-and-Execute | 2+ per plan | Sequential workflows, pipelines | Rigid initial plan |
| ReWOO | 2 total | Predictable multi-step tasks | Breaks on unexpected tool outputs |
| Tree-of-Thoughts | 10–100× CoT | Hard puzzles, multi-constraint optimization | Very expensive |
| Reflexion | 3+ per retry | Tasks with clear pass/fail criteria | Slow, needs evaluator |

8. 2026 Reasoning Models: Internalizing the Loop

The most significant shift in 2025–2026 is the emergence of reasoning models — LLMs trained via reinforcement learning to "think" before responding, spending additional compute at inference time to explore solution strategies, verify answers, and self-correct [2][3][15].

The Major Players

| Model | Approach | Key Innovation |
|---|---|---|
| OpenAI o3 / o4-mini | Private chain-of-thought; RL-trained on verifiable rewards | Highest accuracy on math/science; reasoning effort parameter |
| Claude 4.x (Extended Thinking) | Visible thinking tokens in separate block; adaptive effort levels | Interleaved thinking between tool calls; transparent reasoning |
| Gemini 2.5 Pro (Deep Think) | Parallel hypothesis generation and evaluation | Multimodal reasoning; 1M+ token context |
| DeepSeek R1 | Open-weight; visible reasoning chain | Cost-effective; best with explicit reasoning prompts |

How Reasoning Models Change Agent Architecture

Reasoning models fundamentally alter the planning landscape [5][16]:

  1. Fewer external loop iterations — o3-class models often solve what used to be a 6-step ReAct trace in 1–2 tool calls plus a long internal chain of thought.
  2. Better tool selection — models are RL-trained to choose the right tools, so hand-written Thought: prompts no longer help and can sometimes hurt.
  3. Internal backtracking and self-correction — Reflexion-style retry wrappers are often redundant for reasoning models that self-correct within their thinking phase.
  4. Adaptive reasoning allocation — Claude 4.6's effort levels and o3's reasoning effort parameter let developers control how hard the model thinks, trading cost for accuracy [16].

Interleaved Thinking for Agents

Claude 4.6 introduced interleaved thinking for agentic workflows: when using tools, the model can think between tool calls, not just before the first response [16]. This is critical for multi-step tasks where each tool result changes what the model should do next — effectively implementing the ReAct pattern inside the model's native reasoning.

The Long-Context Challenge

Gemini 2.5 Pro supports 1M+ token context, but Google's own research reveals a critical finding: as agent context grows significantly beyond 100k tokens, the model tends to favor repeating actions from its history rather than synthesizing novel plans [17]. This highlights an important distinction between long-context for retrieval and long-context for multi-step generative planning — and remains an active research frontier.


9. Observability and Production Considerations

Observability is non-negotiable for agent systems [11]. Unlike a standard API call where you can inspect input and output, an agent's failure can be buried six steps deep in a chain of actions that individually looked reasonable.

Minimum viable observability:

  • Log every Thought, Action, and Observation with timestamps and token counts
  • Track cost per trajectory
  • Set maximum execution time and iteration limits
  • Implement human escalation for unresolvable tasks
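A minimal version of such structured trajectory logging — the record schema here is illustrative, not the format of any particular tracing platform:

```python
import time

def log_step(trace, kind, content, tokens, cost_usd):
    """Append one structured record per Thought / Action / Observation."""
    trace.append({
        "ts": time.time(),       # timestamp
        "kind": kind,            # "thought" | "action" | "observation"
        "content": content,
        "tokens": tokens,        # token count for this step
        "cost_usd": cost_usd,
    })

def trajectory_cost(trace):
    """Total spend for one trajectory (per-run cost tracking)."""
    return sum(r["cost_usd"] for r in trace)

trace = []
log_step(trace, "thought", "Need the current price", tokens=42, cost_usd=0.0004)
log_step(trace, "action", "search('AAPL price')", tokens=12, cost_usd=0.0001)
```

Keeping the records structured (rather than free text) is what makes the trajectory queryable after the fact — "show me every run where cost exceeded the budget" becomes a filter, not a grep.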

Tools: LangSmith, Weights & Biases Weave, and similar platforms provide structured tracing for agent trajectories [11].

Cost awareness: The pattern you choose has dramatic cost implications. A ReWOO run uses ~2,000 tokens; the same task via ReAct uses ~10,000; via Tree-of-Thoughts, potentially 100,000+ [4][10]. Matching the pattern to the task's value and complexity is a core architectural decision.


Key Takeaways

  1. ReAct remains the default — The Thought → Action → Observation loop is the dominant agent paradigm in 2026, now implemented via native function calling rather than text parsing [5][6].

  2. Planning patterns are complementary, not competing — Use ReAct for exploration, Plan-and-Execute for structured workflows, ReWOO for predictable pipelines, ToT for hard optimization, and Reflexion for tasks with clear success criteria [10].

  3. Reasoning models are absorbing the scaffolding — o3, Claude extended thinking, and Gemini Deep Think internalize multi-step reasoning, self-correction, and backtracking that previously required external orchestration [2][5][16].

  4. Self-correction is the frontier — From Reflexion's episodic memory to Agent-R's MCTS-based trajectory repair, teaching agents to identify and recover from errors mid-execution is the most active research area [12][13].

  5. Cost and latency drive pattern selection — The "best" pattern depends on the task's value. A $0.001 ReWOO call and a $5.00 ToT exploration serve fundamentally different use cases [4][8].

  6. Long-horizon planning remains unsolved — Even with 1M+ token contexts, agents struggle to plan coherently over hundreds of steps. Hierarchical approaches like ReAcTree and explicit state-machine graphs (LangGraph) are the current best answers [14][17].

  7. Observability is non-negotiable — Every production agent needs structured logging of its full reasoning trajectory, cost tracking, and human escalation paths [11].


References

[1] Grizzly Peak Software, "Planning and Reasoning in AI Agents," 2026. https://grizzlypeaksoftware.com/library/planning-and-reasoning-in-ai-agents-a140vad2

[2] Zylos Research, "AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 and the Test-Time Compute Revolution," January 2026. https://zylos.ai/research/2026-01-24-ai-reasoning-models

[3] AI Magicx, "AI Reasoning Models Explained: When to Use o3, Gemini 2.5, and DeepSeek R1 (2026 Guide)," March 2026. https://www.aimagicx.com/blog/ai-reasoning-models-o3-gemini-deepseek-guide-2026

[4] Cowork.ink, "AI Agent Reasoning: ReAct, CoT & Planning Patterns (2026)." https://cowork.ink/blog/ai-agent-reasoning/

[5] Cowork.ink, "The ReAct Pattern Explained: AI Agent Reasoning in 2026," March 2026. https://cowork.ink/blog/react-pattern-ai-agents/

[6] L. Fryer, "The Complete Guide to AI Agent Architectures: ReAct, CoT, and Tool Use," April 2026. https://dev.to/lukefryer4/the-complete-guide-to-ai-agent-architectures-react-cot-and-tool-use-4ab7

[7] Reactive Agents Documentation, "Reasoning." https://docs.reactiveagents.dev/guides/reasoning/

[8] M. S. Hossain, "Agentic AI Design Patterns: ReAct, Chain of Thought & Self-Reflection in Production (2026)," March 2026. https://mdsanwarhossain.me/blog-agentic-ai-design-patterns.html

[9] "The 7 Agentic AI Design Patterns Every Developer Should Know," April 2026. https://dev.to/emperorakashi20/the-7-agentic-ai-design-patterns-every-developer-should-know-react-reflection-tool-use-and-more-3bba

[10] P. Perrone, "ReAct vs Plan-and-Execute vs ReWOO vs Reflexion," The AI Engineer, April 2026. https://theaiengineer.substack.com/p/the-4-single-agent-patterns

[11] Endless.sbs, "How AI Agents Work: Memory, Tools & Planning Explained," February 2026. https://endless.sbs/How%20AI%20Agents%20Actually%20Work:%20Memory,%20Tools,%20Planning%20&%20Real-World%20Systems%20%282026%29

[12] Agent-R: "Training Language Model Agents to Reflect via Iterative Self-Training," arXiv, 2025. https://arxiv.org/html/2501.11425v2

[13] "Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement," arXiv, 2024. https://arxiv.org/html/2410.20285v1

[14] "ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning," arXiv, 2025. https://arxiv.org/abs/2511.02424

[15] AI Log, "AI Reasoning Models 2026: o3 vs Claude vs Gemini vs R1," February 2026. https://ailog.page/ai-reasoning-models-explained-o3-vs-claude-vs-gemini-vs-deepseek-r1/

[16] SurePrompts, "Prompt Engineering for Reasoning Models: How to Get the Most From o3, Claude Thinking, and Gemini Deep Think (2026)," April 2026. https://sureprompts.com/blog/prompting-reasoning-models-guide

[17] Gemini Team, Google, "Gemini 2.5 Technical Report," October 2025. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf