Agent Memory Systems

Overview

AI agents without memory are stateless functions — every conversation starts from zero, every user is a stranger, every past interaction erased [1]. Agent memory is the engineering discipline that fixes this, and by 2026 it has become the critical differentiator between toy demos and production-grade AI systems [1]. This document surveys the taxonomy of memory types, the architectural patterns that implement them, the frameworks leading the space, and the context management techniques that make it all work within finite token budgets.

The core challenge is straightforward: LLMs can only "see" what fits in their context window [2]. Everything else — user preferences learned over months, facts from prior sessions, procedural knowledge about how to complete tasks — must be stored externally and retrieved on demand. How that storage, retrieval, and eviction works defines the quality ceiling of any agent system.

Memory Taxonomy

Agent memory systems draw heavily from cognitive science, decomposing memory into distinct subsystems that mirror human cognition [3][4].

graph TD
    AM[Agent Memory] --> STM[Short-Term / Working Memory]
    AM --> LTM[Long-Term Memory]
    
    STM --> CW[Context Window]
    STM --> CB[Conversation Buffer]
    
    LTM --> EP[Episodic Memory]
    LTM --> SM[Semantic Memory]
    LTM --> PM[Procedural Memory]
    
    EP --> VDB[(Vector Store)]
    EP --> TKG[(Temporal KG)]
    SM --> KG[(Knowledge Graph)]
    SM --> VDB
    PM --> TB[(Tool/Behavior Store)]

Short-Term / Working Memory

Short-term memory is the agent's active context — everything currently in the prompt [5]. For most LLM-based agents, this maps directly to the context window. GPT-4o provides 128K tokens; Claude offers 200K; Gemini extends to 1M+ [5]. This is the "RAM" of the agent: fast, always visible, but strictly bounded.

Working memory holds the current conversation turns, the system prompt, any injected memory blocks, and tool call results. When this buffer fills up, something must be evicted — and how that eviction happens (truncation, summarization, or intelligent compaction) determines whether the agent degrades gracefully or catastrophically forgets critical context [6].

In frameworks like Letta, working memory is structured into memory blocks — discrete, labeled sections of the context window that the agent can read and write to directly [2]. This is a significant advance over treating the context window as a monolithic text blob. Each block has a label, a character limit, and persistence — the agent's "core memory" is always visible, while other information lives in retrievable external stores.
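The block pattern is easy to see in miniature. Below is a minimal sketch, not Letta's actual implementation: the `MemoryBlock` class and `compile_context` helper are hypothetical names used only to illustrate labeled, size-limited, writable sections of the prompt.

```python
from dataclasses import dataclass

@dataclass
class MemoryBlock:
    label: str          # e.g. "persona" or "human"
    value: str          # current contents, always visible in context
    limit: int = 2000   # character budget for this block

    def replace(self, old: str, new: str) -> None:
        # In-place edit that rejects writes overflowing the block's budget
        updated = self.value.replace(old, new)
        if len(updated) > self.limit:
            raise ValueError(f"block '{self.label}' would exceed {self.limit} chars")
        self.value = updated

def compile_context(blocks: list[MemoryBlock]) -> str:
    # Render labeled blocks into the prompt, mirroring how a block-based
    # runtime keeps core memory pinned rather than treating context as a blob
    return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in blocks)

blocks = [
    MemoryBlock("persona", "I am a concise coding assistant."),
    MemoryBlock("human", "Name: Alice. Prefers Python."),
]
blocks[1].replace("Prefers Python", "Prefers Python; time zone UTC+2")
print(compile_context(blocks))
```

Because each block is bounded, the agent must decide what to keep in core memory and what to push to external stores, which is exactly the editorial pressure that makes self-managed memory work.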

Long-Term Memory

Long-term memory persists across sessions and conversations. It is implemented through external storage — vector databases, knowledge graphs, relational databases, or hybrid combinations — and accessed via retrieval mechanisms [7].

Episodic Memory

Episodic memory stores records of specific past experiences: what happened, when, and in what context [3][4]. It answers questions like "What did the user ask me about last Tuesday?" or "How did I solve that deployment issue three sessions ago?"

Implementation typically involves storing interaction traces with temporal metadata. Zep's Graphiti engine exemplifies sophisticated episodic memory through its bi-temporal model — every fact carries two independent time axes: the event time (when the fact was true in the real world) and the ingestion time (when the system learned about it) [8]. This enables point-in-time queries like "What did we know about Alice's employment as of January 2024?" even if that information was first ingested months later.

Microsoft Research's AARI initiative has identified episodic memory as a major open challenge, noting that AI research has almost exclusively developed semantic memory systems while episodic memory — the ability to recall specific past experiences — remains underexplored, especially for vision [9].

Semantic Memory

Semantic memory stores structured factual knowledge — generalized information such as facts, definitions, rules, and entity relationships [7]. Unlike episodic memory, which deals with specific events, semantic memory captures what is known independently of when it was learned.

The distinction between vector-based and graph-based semantic memory is concrete: vector memory retrieves semantically similar facts ("this user mentioned Python"), while graph memory retrieves facts connected through relationships ("this user works with Python, specifically for data pipelines, using pandas, at a company that uses dbt, and they're migrating from Spark") [10].
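The two retrieval styles can be contrasted in a toy sketch. The hand-picked 2-d embeddings and the `edges` dictionary below are illustrative stand-ins, not any framework's data model: vector memory ranks facts by similarity to a query embedding, while graph memory walks typed relationships outward from an entity.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Vector memory: nearest fact by embedding similarity (toy 2-d embeddings)
facts = {"user mentioned Python": [0.9, 0.1], "user likes hiking": [0.1, 0.9]}
query = [0.85, 0.2]
best = max(facts, key=lambda k: cosine(query, facts[k]))

# Graph memory: facts reachable by traversing typed relationships
edges = {
    "user": [("works_with", "Python")],
    "Python": [("used_for", "data pipelines")],
    "data pipelines": [("built_with", "pandas")],
}

def traverse(start, depth=2):
    # Breadth-first walk collecting (source, relation, target) triples
    seen, frontier, out = {start}, deque([(start, 0)]), []
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for rel, target in edges.get(node, []):
            out.append((node, rel, target))
            if target not in seen:
                seen.add(target)
                frontier.append((target, d + 1))
    return out

print(best)              # the closest fact by meaning
print(traverse("user"))  # connected facts by relationship, two hops out
```

The similarity search cannot reach "data pipelines" from "user" at all, while the graph walk surfaces it in two hops — which is why hybrid systems run both.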

Procedural Memory

Procedural memory — how to do things — is the newest addition to the taxonomy. Mem0's v1.0.0 API introduced explicit support for procedural memory alongside episodic and semantic types [10]. This captures learned behaviors, tool usage patterns, and workflow preferences that enable agents to improve their execution strategies over time.

Storage Backends: Vector Stores and Knowledge Graphs

Vector Databases

Vector databases serve as the backbone for similarity-based retrieval [7]. The agent embeds a query, searches for the closest stored embeddings, and retrieves the top-k results. As of early 2026, Mem0 alone supports 19 vector store backends — Qdrant, Chroma, Milvus, pgvector, Redis, and more — reflecting a market where developers have not converged on a single solution [10].

The simplest retrieval strategy is pure vector similarity search. More sophisticated systems run multi-strategy retrieval in parallel: semantic search, BM25 keyword matching, graph traversal, and temporal filtering, then rerank the combined results [11]. The HINDSIGHT architecture demonstrates this with four parallel retrieval channels fused via cross-encoder reranking [11].
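The fusion step can be illustrated with Reciprocal Rank Fusion, a simple, well-known substitute for the cross-encoder reranking HINDSIGHT actually uses: each channel returns a ranked list, and documents are scored by the sum of reciprocal ranks. The channel contents below are made up for illustration.

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: fuse several ranked lists into one score per doc.
    # A doc ranked highly by multiple channels accumulates the largest score.
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["m3", "m1", "m7"]   # vector similarity channel
keyword  = ["m1", "m4"]         # BM25 channel
graph    = ["m7", "m1"]         # graph traversal channel
temporal = ["m9", "m3"]         # recency/temporal filter channel

print(rrf([semantic, keyword, graph, temporal])[:3])  # → ['m1', 'm3', 'm7']
```

Note that "m1" wins despite never being ranked first by any single channel — agreement across strategies is the signal that single-strategy retrieval cannot produce.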

Knowledge Graphs

Knowledge graphs capture entity relationships that pure vector similarity misses. Zep/Graphiti builds a temporally-aware knowledge graph that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships [8]. Cognee focuses on graph + vector hybrid architectures where relationships dominate the retrieval needs [12].

Unified Substrates

MemoriesDB proposes treating memories as a first-class data type that is simultaneously a temporal event, a semantic vector, and a relational node [13]. This "triality" — time, meaning, and connection — is implemented on standard PostgreSQL with pgvector, combining time-series, vector, and graph capabilities in a single append-only schema.

graph LR
    subgraph "Unified Memory Record"
        T[Temporal Event<br/>when it happened]
        S[Semantic Vector<br/>what it means]
        R[Relational Node<br/>how it connects]
    end
    T --- S --- R --- T
    
    subgraph "Query Modes"
        TQ[Time-bounded recall]
        SQ[Semantic similarity]
        GQ[Graph traversal]
    end
    
    T --> TQ
    S --> SQ
    R --> GQ

Context Compaction and Management

As agents run longer, they accumulate state — architectural decisions, user preferences, tool outputs, reasoning traces — that eventually exceeds the context window [6][14]. Context compaction is the set of techniques for managing this growth.

Compaction Strategies

Truncation is the simplest approach: drop the oldest messages when the window fills. It is fast and deterministic but loses potentially critical early context [6].

Summarization uses the LLM itself to compress older interactions into condensed summaries. OpenSearch agents, for example, intelligently summarize older interactions while preserving key information [15]. The trade-off is latency and cost — every summarization step requires an LLM call.

Sliding windows maintain a fixed-size buffer of recent messages, evicting the oldest as new ones arrive. This works well for conversational agents where recency matters most [6].
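The three strategies above can be sketched in a few lines. This is a minimal illustration, with token counting approximated by character length and `summarize` as a stand-in for the LLM call a real system would make.

```python
def truncate(messages, budget, count_tokens=len):
    # Truncation: drop the oldest messages until the rest fits the budget
    total = sum(count_tokens(m) for m in messages)
    while messages and total > budget:
        total -= count_tokens(messages[0])
        messages = messages[1:]
    return messages

def sliding_window(messages, window=4):
    # Sliding window: keep only the most recent turns
    return messages[-window:]

def compact(messages, budget, summarize, count_tokens=len):
    # Summarization: compress everything but the recent tail into one
    # condensed turn, then truncate as a safety net
    keep = sliding_window(messages, window=4)
    older = messages[: len(messages) - len(keep)]
    if older:
        keep = [summarize(older)] + keep
    return truncate(keep, budget, count_tokens)

msgs = [f"turn {i}: " + "x" * 40 for i in range(10)]
print(compact(msgs, budget=500, summarize=lambda ms: f"[summary of {len(ms)} turns]"))
```

The composition matters: summarization preserves a trace of evicted context that pure truncation would silently discard, at the cost of one extra model call per compaction.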

Semantic lossless compression, as implemented by SimpleMem, distills unstructured interactions into compact, multi-view indexed memory units through a three-stage pipeline: structured compression, online semantic synthesis (intra-session deduplication), and intent-aware retrieval planning [16]. This achieves a 26.4% F1 improvement over baselines while reducing token consumption by up to 30x.

DAG-based state management represents accumulated state as a directed acyclic graph, enabling structurally lossless trimming that preserves dependency relationships between decisions even as raw conversation history is evicted [14].

The Letta/MemGPT Approach

Letta (formerly MemGPT) treats context management as an operating system problem [2][17]. The agent's memory is organized into three tiers inspired by computer architecture:

| Tier | Analogy | Behavior |
| --- | --- | --- |
| Core Memory | RAM | Small block pinned in the context window. Agent reads/writes directly. Always visible. |
| Recall Memory | Disk cache | Searchable conversation history stored outside context. Retrieved on demand. |
| Archival Memory | Cold storage | Long-term storage queried via explicit tool calls. Unbounded capacity. |

The key innovation is that agents self-edit their memory — the LLM decides what is worth remembering by calling memory functions (core_memory_append, core_memory_replace, archival_memory_search) during its reasoning loop [17]. When the context window fills, Letta compacts by evicting older messages to recall storage while preserving core memory blocks. All evicted messages remain accessible via the API and retrieval tools — nothing is permanently lost [2].
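A toy dispatcher makes the pattern concrete. Only the tool names come from the description above; the function bodies are hypothetical simplifications (a dict for core memory, a list with substring search standing in for embedding-based archival retrieval), not Letta's implementation.

```python
# Simplified in-memory stores standing in for the runtime's persistence layer
core_memory = {"human": "Name: Alice."}
archival = []

def core_memory_append(label, content):
    core_memory[label] = core_memory[label] + " " + content

def archival_memory_insert(content):
    archival.append(content)

def archival_memory_search(query):
    # Toy substring match in place of real embedding-based retrieval
    return [m for m in archival if query.lower() in m.lower()]

TOOLS = {
    "core_memory_append": core_memory_append,
    "archival_memory_insert": archival_memory_insert,
    "archival_memory_search": archival_memory_search,
}

def dispatch(tool_call):
    # In a real runtime the LLM emits these calls during its reasoning loop;
    # here we invoke them by hand to show the mechanics
    return TOOLS[tool_call["name"]](**tool_call["args"])

dispatch({"name": "core_memory_append",
          "args": {"label": "human", "content": "Prefers dark mode."}})
dispatch({"name": "archival_memory_insert",
          "args": {"content": "Alice deployed v2 on 2026-01-10"}})
print(core_memory["human"])
print(dispatch({"name": "archival_memory_search", "args": {"query": "deployed"}}))
```

The crucial point is that memory writes are ordinary tool calls: the model, not a fixed pipeline, decides when core memory changes and what is worth archiving.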

flowchart TB
    subgraph "Context Window (Compiled at Inference)"
        SP[System Prompt]
        MB[Memory Blocks<br/>- persona<br/>- human<br/>- custom]
        MSG[Recent Messages]
        TC[Tool Calls & Results]
    end
    
    subgraph "External Storage"
        RM[Recall Memory<br/>Full conversation history]
        AR[Archival Memory<br/>Long-term knowledge]
        FS[Filesystem<br/>Files & documents]
    end
    
    MB <-->|"core_memory_append<br/>core_memory_replace"| MB
    MSG -->|compaction/eviction| RM
    AR <-->|"archival_memory_search<br/>archival_memory_insert"| TC
    FS <-->|"search_files"| TC

2026 Framework Landscape

The agent memory space has consolidated around several distinct architectural philosophies [11][12][18].

Mem0: The Drop-In Memory Layer

Mem0 is a memory API that bolts onto any existing agent framework [10][18]. It extracts semantic memories from conversation text and stores them as discrete facts in a hybrid vector + optional graph store. The graph variant (Mem0g) adds entity relationship tracking, improving accuracy from 66.9% to 68.4% on the LOCOMO benchmark at the cost of higher latency (2.59s vs 1.44s p95) [18]. As of 2026, Mem0 integrates with 21 frameworks across Python and TypeScript [10].

Zep: Temporal Knowledge Graphs

Zep is built around Graphiti, a temporally-aware knowledge graph engine [8]. Its three-layer graph structure — episodic (raw events), semantic (entities and relationships), and community (clusters) — with bi-temporal modeling gives it a 15-point lead on LongMemEval's temporal reasoning tasks [18]. Zep achieves 94.8% on the DMR benchmark vs MemGPT's 93.4%, with up to 18.5% accuracy improvements on LongMemEval while reducing latency by 90% [8].

Letta: The Agent Runtime

Letta is not a memory layer — it is a full agent runtime where memory management is a first-class concern [2][17]. Agents run inside Letta, which manages the agent loop, tool execution, state persistence, and memory across the three-tier hierarchy. The MemGPT research paper's insight — that the model should manage its own context like an OS manages virtual memory — is now a production platform serving over 1 million stateful agents [12].

Research Frontiers

HINDSIGHT unifies long-term factual recall with preference-conditioned reasoning through four parallel retrieval strategies (semantic, BM25, graph traversal, temporal) with cross-encoder reranking [11].

BMAM (Brain-inspired Multi-Agent Memory) decomposes memory into episodic, semantic, salience-aware, and control-oriented subsystems with asynchronous consolidation inspired by complementary learning theory [4].

MEMORA introduces "primary abstractions" and "cue anchors" that balance abstraction and specificity, enabling scalable retrieval without fragmenting knowledge [19].

MemVerse extends memory to multimodal agents, maintaining short-term context while transforming raw multimodal experiences into hierarchical knowledge graphs with periodic distillation into parametric memory [20].

Framework Comparison

| Framework | Architecture | Memory Model | Best For |
| --- | --- | --- | --- |
| Mem0 | Drop-in API | Vector + optional graph | Fast personalization, minimal infrastructure |
| Zep | Temporal KG (Graphiti) | Bi-temporal knowledge graph | Temporal reasoning, evolving facts |
| Letta | Agent runtime | Core/Recall/Archival tiers | Fully stateful agents, self-editing memory |
| Cognee | Graph + vector hybrid | Knowledge graph-heavy | Relationship-dominated reasoning |
| LangGraph | Orchestration framework | Flat key-value + vector | Workflow control with secondary memory |

Key Takeaways

  1. Memory is not context. Context windows are working memory. RAG is retrieval. Neither is long-term memory. Production systems require five components: persistence, structure, retrieval, writeback, and forgetting [21].

  2. The tiered model has won. Whether it is Letta's Core/Recall/Archival, Taskade's five-component model, or BMAM's brain-inspired subsystems, successful agents use layered memory with different access patterns and retention policies at each tier [2][4][21].

  3. Temporal awareness is table stakes. Knowing what happened is insufficient — agents must track when facts were true and when they stopped being true. Zep's bi-temporal model demonstrates the accuracy gains this enables [8][18].

  4. Multi-strategy retrieval outperforms single-strategy. Pure vector similarity misses what keyword search catches, and neither captures graph relationships. The best systems fuse semantic, BM25, graph, and temporal retrieval with learned reranking [10][11].

  5. Forgetting is as important as remembering. Without effective memory garbage collection, agents suffer from context poisoning and increased reasoning costs [22]. Intelligent decay mechanisms that prune based on recency, relevance, and utility are essential for long-running agents.

  6. Self-editing memory is the frontier. Agents that curate their own memory — deciding what to store, update, and forget — outperform systems with static extraction pipelines. Letta's approach of exposing memory operations as tool calls the agent invokes during reasoning represents the current state of the art [2][17].

References