
Planning and reasoning are the capabilities that separate a chatbot from a genuine AI agent. They give the system the ability to decompose ambiguous goals into concrete steps, evaluate trade-offs between approaches, and recover gracefully when reality diverges from the plan [1]. In 2025–2026, the field has converged on a small family of reasoning patterns — Chain-of-Thought, ReAct, Plan-and-Execute, Tree-of-Thoughts, and Reflexion — while a new generation of reasoning models (OpenAI o3, Claude with extended thinking, Gemini 2.5 Deep Think, DeepSeek R1) is beginning to internalize many of these patterns directly into model weights [2][3].
This document surveys the core planning and reasoning architectures that power modern AI agents, traces their evolution through 2026, and examines how reasoning models are reshaping the landscape.
Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022) and extended by Kojima et al. (2022) with zero-shot CoT ("Let's think step by step"), is the foundation all other reasoning patterns build on [4]. CoT elicits step-by-step reasoning from a language model, giving agents the ability to reason within a single step before committing to an action.
How it works: The model generates intermediate reasoning steps — breaking a problem into sub-problems, working through each, and synthesizing a final answer — all within a single forward pass. No external tools are involved.
Strengths: Simple, cheap, and effective for problems where the model's parametric knowledge is sufficient. Zero-shot CoT requires no examples.
Limitations: CoT keeps reasoning entirely inside the model. It relies on the model's training data and cannot ground its reasoning in external observations. This makes it prone to hallucination on factual questions and unable to interact with the world.
┌─────────────┐
│  Question   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Step 1...  │
│  Step 2...  │  ← Internal reasoning only
│  Step 3...  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Answer    │
└─────────────┘
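As a concrete (if minimal) sketch: zero-shot CoT is literally one appended trigger phrase plus an answer-extraction pass. The `complete` function below is a hypothetical stand-in for any chat-completion client.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to any LLM API."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Kojima et al. (2022): this single trigger phrase elicits
    # step-by-step reasoning with no few-shot examples.
    prompt = f"{question}\n\nLet's think step by step."
    reasoning = complete(prompt)
    # Second pass extracts a clean final answer from the reasoning trace.
    return complete(f"{prompt}\n{reasoning}\n\nTherefore, the answer is:")
```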
CoT is the inner monologue of every agent pattern that follows. ReAct extends it with actions; Tree-of-Thoughts branches it; Reflexion critiques it.
ReAct (Reason + Act), introduced by Yao et al. at ICLR 2023, is the most important single-agent pattern and the default architecture for production agents in 2026 [5][6]. It interleaves chain-of-thought reasoning with tool-use actions in a single generation loop.
graph TD
A[Task / Question] --> B[Thought]
B --> C[Action: tool call]
C --> D[Observation: tool result]
D --> E{Done?}
E -->|No| B
E -->|Yes| F[Final Answer]
Each iteration follows three steps:
1. Thought: the model reasons about the current state and decides what to do next.
2. Action: it emits a tool call (a search, an API request, code execution).
3. Observation: the tool's result is appended to the context for the next iteration.
This cycle repeats until the agent produces a final answer or hits a maximum iteration limit [7].
The key insight is that generating reasoning traces alongside actions dramatically reduces hallucination — the model grounds each reasoning step in observed evidence rather than relying solely on parametric memory [8]. On benchmarks like HotPotQA, FEVER, and WebShop, ReAct outperformed standalone CoT by anchoring every inference to real data [4].
Most production agents now use native function calling (GPT-4o, Claude 3.x/4.x, Gemini 2.x) which is functionally equivalent to ReAct but more reliable than text-parsed Thought/Action traces [5]. The ReAct mental model — reason, act, observe, repeat — remains the dominant paradigm, but the implementation has shifted from string parsing to structured tool-call APIs.
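In code, the loop is compact. A minimal sketch, assuming a hypothetical `chat` function that (like native function-calling APIs) returns either a structured tool call or a final-answer string, plus a toy tool registry:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Toy registry; production tools would be real search/API/code wrappers.
TOOLS = {"search": lambda query: f"results for {query!r}"}

def chat(messages: list[dict]) -> ToolCall | str:
    """Hypothetical model call mirroring native function-calling APIs:
    returns a ToolCall (the Action) or a final-answer string."""
    raise NotImplementedError

def react_loop(task: str, max_iters: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iters):                 # safeguard: hard iteration cap
        step = chat(messages)                  # Thought + Action in one call
        if isinstance(step, str):
            return step                        # model chose to answer
        observation = TOOLS[step.name](**step.args)
        messages.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("iteration limit hit; escalate to a human")
```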
Production safeguards are essential [6]:
- A hard cap on loop iterations, so a confused agent terminates instead of burning tokens.
- Per-run token and cost budgets.
- Timeouts and output validation on every tool call.
- A human escalation path when the agent stalls or keeps repeating itself.
LangGraph (v1.0, GA October 2025) is the production standard for ReAct agents, implementing the pattern as a stateful graph where nodes handle tool calls and edges route based on the model's next action [5].
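Usage is a few lines. Treat this as illustrative rather than definitive, since exact import paths and model identifiers vary across LangGraph versions:

```python
from langgraph.prebuilt import create_react_agent

def search(query: str) -> str:
    """Toy tool; the docstring tells the model when to call it."""
    return f"results for {query!r}"

# Model string is illustrative; any tool-calling chat model works here.
agent = create_react_agent("anthropic:claude-sonnet-4-5", tools=[search])
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What is LangGraph?"}]}
)
print(result["messages"][-1].content)
```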
Plan-and-Execute addresses ReAct's fundamental limitation: the agent never sees the big picture [10]. Instead of deciding one step at a time, a powerful model analyzes the full task and generates a plan — a DAG of subtasks with dependencies. A simpler, cheaper model then executes each step. If a step fails, a replanner revises the remaining steps.
graph TD
A[Complex Task] --> B[Planner: strong model]
B --> C[Step 1]
B --> D[Step 2]
B --> E[Step 3]
C --> F[Executor: cheap model]
D --> F
E --> F
F --> G{All steps pass?}
G -->|No| H[Replanner]
H --> B
G -->|Yes| I[Final Result]
| Dimension | ReAct | Plan-and-Execute |
|---|---|---|
| Planning horizon | One step at a time | Full task upfront |
| LLM calls | One per step (expensive) | Fewer total calls |
| Inspectability | Emergent trajectory | Explicit plan before execution |
| Best for | Exploratory tasks | Sequential dependencies, pipelines |
Plan-and-Execute is better suited for tasks with clear sequential dependencies — data pipelines, deployment workflows, multi-step form processing — where the structure of the work is known in advance [6]. The plan is inspectable before execution starts, enabling human review.
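A skeletal version of the pattern, flattening the DAG to an ordered list for brevity; `plan`, `execute_step`, and `replan` are hypothetical model calls (the first and last served by the strong model, the middle by the cheap one):

```python
def plan(task: str) -> list[str]:
    """Hypothetical strong-model call: decompose the task into ordered steps."""
    raise NotImplementedError

def execute_step(step: str, context: dict) -> tuple[bool, str]:
    """Hypothetical cheap-model call: run one step, return (ok, result)."""
    raise NotImplementedError

def replan(task: str, completed: dict, failed_step: str) -> list[str]:
    """Hypothetical strong-model call: revise the remaining steps."""
    raise NotImplementedError

def plan_and_execute(task: str) -> dict:
    steps, completed = plan(task), {}
    while steps:
        step = steps.pop(0)
        ok, result = execute_step(step, completed)
        if ok:
            completed[step] = result               # later steps see earlier results
        else:
            steps = replan(task, completed, step)  # revise before continuing
    return completed
```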
ReWOO (Reasoning Without Observation), introduced by Xu et al. (2023), is the most token-efficient reasoning pattern for multi-step tasks [4][10]. It separates planning from execution entirely:
1. Planner (one LLM call): writes the complete plan upfront, using evidence placeholders (#E1, #E2, ...) to stand in for tool results it has not yet seen.
2. Worker (no LLM calls): runs each tool call in order, substituting placeholders with actual outputs.
3. Solver (one LLM call): synthesizes the final answer from the plan and the filled-in evidence.

Only 2 LLM calls total. On HotPotQA, ReWOO achieved 42.4% accuracy using ~2,000 tokens vs. ReAct's 40.8% at ~10,000 tokens — a 5× token efficiency gain [4].
The catch: ReWOO breaks if a tool returns something unexpected that would have changed the plan. It assumes the plan is correct upfront, with no opportunity for mid-course correction. This makes it ideal for well-understood, predictable workflows but fragile for exploratory tasks.
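In sketch form, with hypothetical `plan` and `solve` functions as the only two LLM calls; the worker resolves `#E` placeholders by plain string substitution:

```python
# One plan step: (evidence_id, tool_name, argument_template).
# Templates may reference earlier evidence, e.g. ("E2", "search", "#E1 author").
Plan = list[tuple[str, str, str]]

TOOLS = {"search": lambda q: f"results for {q!r}"}  # toy tool registry

def plan(task: str) -> Plan:
    """Hypothetical LLM call #1: the full plan, placeholders included."""
    raise NotImplementedError

def solve(task: str, evidence: dict[str, str]) -> str:
    """Hypothetical LLM call #2: final answer from the filled-in evidence."""
    raise NotImplementedError

def rewoo(task: str) -> str:
    evidence: dict[str, str] = {}
    for eid, tool, template in plan(task):
        for key, value in evidence.items():    # fill #E1, #E2, ... from earlier steps
            template = template.replace(f"#{key}", value)
        evidence[eid] = TOOLS[tool](template)  # worker: zero LLM calls
    return solve(task, evidence)
```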
Tree-of-Thoughts (ToT), introduced by Yao et al. (2023), generalizes CoT by allowing the model to explore multiple reasoning paths simultaneously, evaluate intermediate states, and backtrack to more promising branches [8]. Where CoT commits to a single linear trace, ToT maintains a tree of partial solutions.
graph TD
A[Problem] --> B1[Branch 1]
A --> B2[Branch 2]
A --> B3[Branch 3]
B1 --> C1[Evaluate: 0.3]
B2 --> C2[Evaluate: 0.8]
B3 --> C3[Evaluate: 0.5]
C2 --> D1[Expand Branch 2a]
C2 --> D2[Expand Branch 2b]
C3 --> D3[Expand Branch 3a]
D1 --> E[Best Path → Execute with ReAct]
ToT outperforms CoT on problems requiring hypothesis exploration: GPT-4 + ToT solved 74% of Game of 24 tasks vs. 4% with standard CoT [4]. It excels at:
- Hard puzzles and games where early choices constrain everything downstream (Game of 24, mini crosswords).
- Multi-constraint optimization, where competing partial solutions must be compared before committing.
- Tasks where backtracking out of a dead end is cheaper than riding a single linear trace to failure.
A typical ToT run with branching factor 3 and depth 4 generates on the order of 3⁴ = 81 candidate states, each requiring its own LLM call, plus evaluation calls [8]. This makes ToT economically viable only for high-value, low-frequency tasks where accuracy outweighs speed and cost.
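A breadth-first sketch with hypothetical `propose` and `score` model calls; real implementations prune to a fixed beam, which is the only thing keeping the call count from growing exponentially:

```python
def propose(state: str, k: int = 3) -> list[str]:
    """Hypothetical LLM call: extend a partial solution in k different ways."""
    raise NotImplementedError

def score(state: str) -> float:
    """Hypothetical LLM call: rate how promising a partial solution is, 0..1."""
    raise NotImplementedError

def tree_of_thoughts(problem: str, depth: int = 4, beam: int = 3) -> str:
    frontier = [problem]
    for _ in range(depth):
        # Expand every surviving branch, score the candidates, then keep
        # only the `beam` most promising ones. Backtracking happens
        # implicitly when a branch falls out of the beam.
        candidates = [s for state in frontier for s in propose(state)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]   # best complete path; execute it with ReAct
```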
Reflexion, introduced by Shinn et al. (NeurIPS 2023), adds something the other patterns lack: the ability to learn from failure within a single session [10][11].
graph TD
A[Task] --> B[Attempt]
B --> C[Evaluator: pass/fail]
C -->|Pass| D[Return Result]
C -->|Fail| E[Self-Reflection]
E --> F[Episodic Memory]
F --> B
This mimics reinforcement learning — the agent improves at a specific task over multiple tries without any weight updates [11]. Reflexion achieved 91% pass@1 on HumanEval coding benchmarks, surpassing GPT-4's prior state-of-the-art of 80% [4].
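The loop in sketch form; `attempt`, `evaluate`, and `reflect` are hypothetical calls, and the episodic memory is simply a list of verbal lessons fed back into each retry:

```python
def attempt(task: str, lessons: list[str]) -> str:
    """Hypothetical LLM call: try the task, conditioned on past reflections."""
    raise NotImplementedError

def evaluate(task: str, result: str) -> bool:
    """Pass/fail signal: unit tests, a validator, or an LLM-as-judge."""
    raise NotImplementedError

def reflect(task: str, result: str) -> str:
    """Hypothetical LLM call: verbalize what went wrong and what to change."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 3) -> str:
    memory: list[str] = []                    # episodic memory, session-scoped
    for _ in range(max_trials):
        result = attempt(task, memory)
        if evaluate(task, result):
            return result                     # pass: return the result
        memory.append(reflect(task, result))  # fail: store the lesson, retry
    raise RuntimeError("all trials failed; escalate")
```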
Agent-R (2025) advances self-correction by using Monte Carlo Tree Search to construct training samples that recover correct trajectories from erroneous ones [12]. Rather than waiting until the end of a rollout to revise errors, Agent-R identifies the first error step within a failed trajectory and splices it with an adjacent correct path from the search tree. This enables timely, mid-trajectory correction and has shown +5.59% improvement over baseline methods across interactive environments [12].
A related approach, SWE-Search, extends MCTS with a hybrid value function that combines numerical evaluation with qualitative natural-language assessment, enabling software engineering agents to iteratively refine their debugging strategies [13].
In practice, the most effective agents combine multiple patterns [10]:
Run a ReAct loop for step-by-step adaptation. If the final result fails validation, enter a Reflexion retry cycle with episodic memory. This gives you per-step grounding for the common case and self-correction for the hard ones.
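Composing the two sketches above, the hybrid is a Reflexion wrapper around a ReAct inner loop (the `...` stubs stand in for the earlier sketches and hypothetical helpers):

```python
def react_loop(task: str) -> str: ...       # the ReAct sketch from earlier
def validate(result: str) -> bool: ...      # hypothetical pass/fail check
def reflect_on(result: str) -> str: ...     # hypothetical reflection call

def react_with_reflexion(task: str, max_trials: int = 3) -> str:
    lessons: list[str] = []                  # episodic memory across retries
    for _ in range(max_trials):
        prompt = task if not lessons else task + "\n\nLessons:\n" + "\n".join(lessons)
        result = react_loop(prompt)          # per-step grounding (common case)
        if validate(result):
            return result
        lessons.append(reflect_on(result))   # self-correction (hard case)
    raise RuntimeError("escalate to a human")
```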
ReAcTree (2025) combines ReAct with hierarchical planning by decomposing a complex goal into manageable subgoals within a dynamically constructed agent tree [14]. Each node in the tree is a sub-agent running its own ReAct loop, with control flow managing dependencies between subtasks. This addresses ReAct's weakness on long-horizon tasks (100+ steps) where context grows linearly and the agent loses coherence.
| Pattern | LLM Calls | Best For | Weakness |
|---|---|---|---|
| CoT | 1 | Simple reasoning, no tools needed | No grounding, hallucination risk |
| ReAct | 1 per step | Exploratory tasks, general-purpose | Expensive on long chains, no big picture |
| Plan-and-Execute | 2+ per plan | Sequential workflows, pipelines | Rigid initial plan |
| ReWOO | 2 total | Predictable multi-step tasks | Breaks on unexpected tool outputs |
| Tree-of-Thoughts | 10–100× CoT | Hard puzzles, multi-constraint optimization | Very expensive |
| Reflexion | 3+ per retry | Tasks with clear pass/fail criteria | Slow, needs evaluator |
The most significant shift in 2025–2026 is the emergence of reasoning models — LLMs trained via reinforcement learning to "think" before responding, spending additional compute at inference time to explore solution strategies, verify answers, and self-correct [2][3][15].
| Model | Approach | Key Innovation |
|---|---|---|
| OpenAI o3 / o4-mini | Private chain-of-thought; RL-trained on verifiable rewards | Highest accuracy on math/science; reasoning effort parameter |
| Claude 4.x (Extended Thinking) | Visible thinking tokens in separate block; adaptive effort levels | Interleaved thinking between tool calls; transparent reasoning |
| Gemini 2.5 Pro (Deep Think) | Parallel hypothesis generation and evaluation | Multimodal reasoning; 1M+ token context |
| DeepSeek R1 | Open-weight; visible reasoning chain | Cost-effective; best with explicit reasoning prompts |
Reasoning models fundamentally alter the planning landscape [5][16]. The clearest shift is in prompting: explicit `Thought:` scaffolding no longer helps and sometimes hurts, because the model already deliberates internally before answering.

Claude 4.6 introduced interleaved thinking for agentic workflows: when using tools, the model can think between tool calls, not just before the first response [16]. This is critical for multi-step tasks where each tool result changes what the model should do next — effectively implementing the ReAct pattern inside the model's native reasoning.
Gemini 2.5 Pro supports 1M+ token context, but Google's own research reveals a critical finding: as agent context grows significantly beyond 100k tokens, the model tends to favor repeating actions from its history rather than synthesizing novel plans [17]. This highlights an important distinction between long-context for retrieval and long-context for multi-step generative planning — and remains an active research frontier.
Observability is non-negotiable for agent systems [11]. Unlike a standard API call where you can inspect input and output, an agent's failure can be buried six steps deep in a chain of actions that individually looked reasonable.
Minimum viable observability:
- A structured trace of every Thought → Action → Observation triple, so a failure can be replayed and attributed to the step that caused it.
- Per-run token counts and cost, tagged by pattern and model.
- Alerts on iteration-limit hits, repeated actions, and tool errors.
Tools: LangSmith, Weights & Biases Weave, and similar platforms provide structured tracing for agent trajectories [11].
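A rough illustration of the minimum: one structured log line per loop iteration. The field names here are made up for the example, not any particular platform's schema:

```python
import json
import time

def log_step(run_id: str, step: int, thought: str,
             action: str, observation: str, tokens: int) -> None:
    # One JSON line per Thought/Action/Observation triple, so a failure
    # six steps deep can be replayed and attributed to its cause.
    print(json.dumps({
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "thought": thought,
        "action": action,
        "observation": observation[:500],  # truncate bulky tool output
        "tokens": tokens,                  # feeds per-run cost tracking
    }))
```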
Cost awareness: The pattern you choose has dramatic cost implications. A ReWOO run uses ~2,000 tokens; the same task via ReAct uses ~10,000; via Tree-of-Thoughts, potentially 100,000+ [4][10]. Matching the pattern to the task's value and complexity is a core architectural decision.
ReAct remains the default — The Thought → Action → Observation loop is the dominant agent paradigm in 2026, now implemented via native function calling rather than text parsing [5][6].
Planning patterns are complementary, not competing — Use ReAct for exploration, Plan-and-Execute for structured workflows, ReWOO for predictable pipelines, ToT for hard optimization, and Reflexion for tasks with clear success criteria [10].
Reasoning models are absorbing the scaffolding — o3, Claude extended thinking, and Gemini Deep Think internalize multi-step reasoning, self-correction, and backtracking that previously required external orchestration [2][5][16].
Self-correction is the frontier — From Reflexion's episodic memory to Agent-R's MCTS-based trajectory repair, teaching agents to identify and recover from errors mid-execution is the most active research area [12][13].
Cost and latency drive pattern selection — The "best" pattern depends on the task's value. A $0.001 ReWOO call and a $5.00 ToT exploration serve fundamentally different use cases [4][8].
Long-horizon planning remains unsolved — Even with 1M+ token contexts, agents struggle to plan coherently over hundreds of steps. Hierarchical approaches like ReAcTree and explicit state-machine graphs (LangGraph) are the current best answers [14][17].
Observability is non-negotiable — Every production agent needs structured logging of its full reasoning trajectory, cost tracking, and human escalation paths [11].
[1] Grizzly Peak Software, "Planning and Reasoning in AI Agents," 2026. https://grizzlypeaksoftware.com/library/planning-and-reasoning-in-ai-agents-a140vad2
[2] Zylos Research, "AI Reasoning Models 2026: From OpenAI o3 to DeepSeek-R1 and the Test-Time Compute Revolution," January 2026. https://zylos.ai/research/2026-01-24-ai-reasoning-models
[3] AI Magicx, "AI Reasoning Models Explained: When to Use o3, Gemini 2.5, and DeepSeek R1 (2026 Guide)," March 2026. https://www.aimagicx.com/blog/ai-reasoning-models-o3-gemini-deepseek-guide-2026
[4] Cowork.ink, "AI Agent Reasoning: ReAct, CoT & Planning Patterns (2026)." https://cowork.ink/blog/ai-agent-reasoning/
[5] Cowork.ink, "The ReAct Pattern Explained: AI Agent Reasoning in 2026," March 2026. https://cowork.ink/blog/react-pattern-ai-agents/
[6] L. Fryer, "The Complete Guide to AI Agent Architectures: ReAct, CoT, and Tool Use," April 2026. https://dev.to/lukefryer4/the-complete-guide-to-ai-agent-architectures-react-cot-and-tool-use-4ab7
[7] Reactive Agents Documentation, "Reasoning." https://docs.reactiveagents.dev/guides/reasoning/
[8] M. S. Hossain, "Agentic AI Design Patterns: ReAct, Chain of Thought & Self-Reflection in Production (2026)," March 2026. https://mdsanwarhossain.me/blog-agentic-ai-design-patterns.html
[9] "The 7 Agentic AI Design Patterns Every Developer Should Know," April 2026. https://dev.to/emperorakashi20/the-7-agentic-ai-design-patterns-every-developer-should-know-react-reflection-tool-use-and-more-3bba
[10] P. Perrone, "ReAct vs Plan-and-Execute vs ReWOO vs Reflexion," The AI Engineer, April 2026. https://theaiengineer.substack.com/p/the-4-single-agent-patterns
[11] Endless.sbs, "How AI Agents Work: Memory, Tools & Planning Explained," February 2026. https://endless.sbs/How%20AI%20Agents%20Actually%20Work:%20Memory,%20Tools,%20Planning%20&%20Real-World%20Systems%20%282026%29
[12] Agent-R: "Training Language Model Agents to Reflect via Iterative Self-Training," arXiv, 2025. https://arxiv.org/html/2501.11425v2
[13] "Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement," arXiv, 2024. https://arxiv.org/html/2410.20285v1
[14] "ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning," arXiv, 2025. https://arxiv.org/abs/2511.02424
[15] AI Log, "AI Reasoning Models 2026: o3 vs Claude vs Gemini vs R1," February 2026. https://ailog.page/ai-reasoning-models-explained-o3-vs-claude-vs-gemini-vs-deepseek-r1/
[16] SurePrompts, "Prompt Engineering for Reasoning Models: How to Get the Most From o3, Claude Thinking, and Gemini Deep Think (2026)," April 2026. https://sureprompts.com/blog/prompting-reasoning-models-guide
[17] Gemini Team, Google, "Gemini 2.5 Technical Report," October 2025. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf