Memory for Agents

Ask most engineers where agent memory lives and they'll say "the context window." Ask them what happens when the context fills up and they'll say "we summarize it." Ask them what happens when the user comes back three days later and they'll go quiet.

Memory in agent systems is treated as an afterthought. A vector store bolted on, a conversation history array passed along, a summarization call when things get too long. It works well enough in demos. It falls apart at scale, across sessions, across users, across time.

The field is starting to wake up to this. The most interesting recent thinking — including a prior art disclosure on what's being called the Personal Small Model (PSM) — argues that the entire industry has been solving the wrong problem. We've been treating memory as a storage problem. It's actually a cognitive skill problem.

This issue is a full map of agent memory — every type, every tradeoff, and where the frontier is heading.

Why Memory Is Hard for Agents Specifically

Standard software has solved memory. Databases are good. Caches are fast. File systems are reliable.

The problem with agents is that memory isn't just stored — it has to be reasoned over, selectively retrieved, weighted by relevance, and injected into a context window that has a hard token limit. And the relevance of any piece of memory is not static — it changes based on the current task, the current user intent, and the time elapsed since the memory was formed.

That's not a database problem. That's a cognition problem.

And unlike humans, agents have no native mechanism to form memories. Every piece of information the agent encounters lives only in the context window — a volatile, session-scoped scratch pad that evaporates the moment the conversation ends.

Unless you build something to catch it.

The Memory Taxonomy

Let me walk through every meaningful type of agent memory, how it works, and when you need it.

1. Sensory / In-Context Memory

This is the raw context window: the conversation so far, the tool results, the current task state. Every agent has this by default. It's fast, needs no retrieval step and no infrastructure, and is the only memory the model can reason over natively.

The problem: it's bounded. GPT-4o has a 128K-token context window. Claude has 200K. That sounds large until you're running a long research task, passing in tool outputs, and maintaining conversation history simultaneously. And critically, it's ephemeral. Session ends, memory gone.

Use it for: current task context, tool results, immediate conversation history.
Limitation: no persistence, no cross-session continuity, cost scales with length.

2. Working Memory (Managed Context)

Working memory is what you do when the raw context gets too long — you manage it. Summarize old turns. Drop low-relevance content. Keep only what's needed for the current step.

Most frameworks handle this crudely: "if token count exceeds N, summarize the oldest M turns." Better implementations are selective — they score each piece of context for relevance to the current task before deciding what to keep, compress, or discard.

Summarization is the core technique here. When implemented well, it's not just truncation — it's semantic compression. The agent (or a dedicated summarization call) distills a long exchange into the essential facts, decisions made, and open threads. The summary replaces the raw history in the context window.

The risk: bad summarization loses critical context. A user who mentioned three turns ago that they're working under a regulatory constraint doesn't want the agent to forget that because it got summarized away.

Use it for: long sessions, cost management, keeping reasoning focused.
Implementation: LLM-based summarization over sliding windows, with entity extraction to preserve key facts.
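Here's a minimal sketch of threshold-triggered compaction that protects key facts from being summarized away. Everything in it is an assumption for illustration: the `Turn` type, the `pinned` flag, the whitespace token counter, and the `summarize` placeholder, which stands in for a real LLM summarization call.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag: facts that must never be summarized away


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def summarize(turns: list[Turn]) -> str:
    # Placeholder for an LLM summarization call: keeps each turn's first sentence.
    return " ".join(t.text.split(".")[0].strip() + "." for t in turns)


def compact(history: list[Turn], max_tokens: int, keep_recent: int = 2) -> list[Turn]:
    """Summarize older unpinned turns once the history exceeds max_tokens."""
    total = sum(count_tokens(t.text) for t in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    pinned = [t for t in old if t.pinned]
    to_summarize = [t for t in old if not t.pinned]
    if not to_summarize:
        return history
    summary = Turn("system", "Summary of earlier turns: " + summarize(to_summarize))
    # Order: compressed history, then protected facts, then the live turns.
    return [summary] + pinned + recent
```

Pinning the regulatory constraint means it survives verbatim no matter how aggressively the rest of the history gets compressed.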

3. Episodic Memory

Episodic memory is the record of what happened — specific events, interactions, decisions, outcomes — stored outside the context window and retrievable across sessions.

In biological terms this is the hippocampus: fast capture of specific experiences, tagged with context (when, where, what was the outcome).

In agent terms: a structured store of past interactions. "User asked me to draft a proposal on Tuesday. I drafted it. They revised it twice. Final version was approved." That's an episode. Stored with metadata — timestamp, task type, outcome, user feedback.

The value is continuity. When the user comes back next week and says "can you update that proposal?" — the agent that has episodic memory knows what proposal they mean. The agent without it asks them to paste it in again.

Use it for: cross-session task continuity, personalization, user preference tracking.
Implementation: structured database (Postgres or Firestore) with session IDs, timestamps, outcome tags. Retrieve by recency + semantic similarity to current task.
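A rough sketch of an episodic store, under a few loud assumptions: an in-memory SQLite table stands in for Postgres or Firestore, word overlap stands in for embedding similarity, and the schema, the 0.7/0.3 weighting, and the 7-day half-life are all made up for illustration.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # a real deployment would use Postgres or Firestore
    conn.execute(
        """CREATE TABLE episodes (
               id INTEGER PRIMARY KEY,
               user_id TEXT, session_id TEXT, ts REAL,
               task_type TEXT, summary TEXT, outcome TEXT)"""
    )
    return conn


def record(conn, user_id, session_id, ts, task_type, summary, outcome):
    conn.execute(
        "INSERT INTO episodes (user_id, session_id, ts, task_type, summary, outcome) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, session_id, ts, task_type, summary, outcome),
    )


def overlap(query: str, text: str) -> float:
    # Jaccard word overlap; a stand-in for embedding similarity.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def recall(conn, user_id, query, now, k=3, half_life_days=7.0):
    """Rank one user's episodes by similarity to the query plus recency."""
    rows = conn.execute(
        "SELECT ts, summary, outcome FROM episodes WHERE user_id = ?", (user_id,)
    ).fetchall()
    scored = []
    for ts, summary, outcome in rows:
        age_days = (now - ts) / 86400
        recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
        scored.append((0.7 * overlap(query, summary) + 0.3 * recency, summary, outcome))
    scored.sort(reverse=True)
    return [(summary, outcome) for _, summary, outcome in scored[:k]]
```

With this in place, "can you update that proposal?" becomes a `recall` call scoped to the user, rather than a request to paste the document again.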

4. Semantic Memory

Semantic memory is generalized knowledge — facts about the world, or facts about the user, extracted from episodic events and consolidated into stable beliefs.

Where episodic memory stores "on Tuesday the user preferred formal tone in their proposal," semantic memory stores "this user prefers formal tone in written communication." The specific episode is compressed into a general pattern.

This is the slow, consolidation-based memory — the neocortex analogue. It doesn't form instantly. It emerges from patterns across multiple episodes.

The critical engineering challenge: semantic memory has to handle conflicts. What happens when new behavior contradicts an established belief? If the user who "always prefers formal tone" suddenly asks for a casual email — do you update the belief? Flag it? Create a conditional belief ("formal for client-facing, casual for internal")?

This is not a simple upsert. It requires conflict detection, confidence scoring, and in some cases, explicit user confirmation.

Use it for: persistent user preferences, long-term knowledge about the domain, agent beliefs that should survive across months.
Implementation: vector store for semantic similarity retrieval + structured store for explicit facts. Beliefs should carry confidence scores and timestamps.
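To make "not a simple upsert" concrete, here's a small sketch of a belief store with confidence scores, timestamps, and conflict handling. The `Belief` and `BeliefStore` names, the 0.25 step, and the 0.5 confirmation threshold are illustrative assumptions, not a standard scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    key: str              # e.g. "tone"
    value: str            # e.g. "formal"
    confidence: float     # 0.0 to 1.0
    updated_at: float
    context: str = "any"  # scope in which the belief holds


@dataclass
class BeliefStore:
    beliefs: dict = field(default_factory=dict)  # (key, context) -> Belief

    def observe(self, key, value, ts, context="any", step=0.25):
        """Return 'created', 'reinforced', 'weakened', or 'needs_confirmation'."""
        current = self.beliefs.get((key, context))
        if current is None:
            self.beliefs[(key, context)] = Belief(key, value, 0.5, ts, context)
            return "created"
        current.updated_at = ts
        if current.value == value:
            current.confidence = min(1.0, current.confidence + step)
            return "reinforced"
        # Contradiction: lower confidence instead of overwriting the belief.
        current.confidence = max(0.0, current.confidence - step)
        if current.confidence < 0.5:
            return "needs_confirmation"  # caller should ask the user or split by context
        return "weakened"
```

One contradicting observation only weakens the belief. A second one pushes it below the threshold and signals that the agent should confirm with the user or record a conditional belief under a narrower `context` such as "internal".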

5. Procedural Memory

Procedural memory is knowledge of how to do things — not facts, but processes. In agents, this manifests as learned tool usage patterns, successful workflow templates, and task-specific strategies.

"For financial research tasks, always check the SEC filing before the news article." That's procedural memory. The agent that has it is faster and more accurate. The agent without it re-discovers the same optimal workflow every time.

Most current agent frameworks have no native support for procedural memory. Engineers implement it crudely as few-shot examples in the system prompt — which works, but doesn't evolve based on outcomes.

Use it for: optimizing repeated task patterns, multi-step workflows, domain-specific best practices.
Implementation: a retrievable library of successful task traces, tagged by task type. Retrieved at task start and injected as few-shot context.
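A minimal sketch of such a library, assuming traces are plain lists of step descriptions and that success rate is a good enough ranking signal. `Trace`, `ProceduralLibrary`, and the prompt format are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    task_type: str
    steps: list[str]
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


class ProceduralLibrary:
    def __init__(self):
        self.traces: list[Trace] = []

    def add(self, trace: Trace) -> None:
        self.traces.append(trace)

    def record_outcome(self, trace: Trace, success: bool) -> None:
        # Outcome feedback is what lets the library evolve, unlike static few-shot examples.
        trace.attempts += 1
        trace.successes += int(success)

    def few_shot(self, task_type: str, k: int = 1) -> str:
        """Render the k best-performing traces for a task type as prompt text."""
        matches = [t for t in self.traces if t.task_type == task_type]
        matches.sort(key=lambda t: t.success_rate, reverse=True)
        blocks = []
        for t in matches[:k]:
            steps = "\n".join(f"{i}. {s}" for i, s in enumerate(t.steps, 1))
            blocks.append(f"Known-good workflow for {task_type}:\n{steps}")
        return "\n\n".join(blocks)
```

The string returned by `few_shot` is what gets injected at task start. Because outcomes feed back into the ranking, a workflow that stops working gradually loses its slot.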

6. Embedding-Based Retrieval (Vector Memory)

This is the most widely deployed form of agent memory right now, and it's often mistaken for the whole memory solution rather than one component of a broader system.

The mechanism: information (documents, past interactions, knowledge base entries) is converted to vector embeddings and stored in a vector database. At retrieval time, the current query is embedded and the nearest neighbors are returned. The retrieved chunks are injected into the context window.

This is RAG (Retrieval Augmented Generation) applied to agent memory. It works well for knowledge retrieval — "what does our internal policy say about X?" It works less well for episodic continuity — "what did we decide last Tuesday?" — because semantic similarity doesn't reliably surface temporally relevant memories.

Use it for: knowledge base access, document retrieval, finding relevant past interactions at scale.
Limitations: retrieval quality degrades with chunk quality; temporal relevance requires hybrid retrieval (vector similarity + recency weighting); and chunks that sit close in embedding space but are irrelevant to the actual question are a real failure mode.

The hybrid approach — combining vector similarity with structured metadata filters (date range, user ID, task type, outcome) — is significantly more reliable than pure vector retrieval for production agent systems.
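As a sketch of what hybrid retrieval can look like: metadata filters narrow the candidates first, then a blend of cosine similarity and recency ranks what's left. The toy two-dimensional vectors, the `Memory` fields, and the 0.3 recency weight are assumptions for illustration; a real system would use an embedding model and a vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]  # toy vectors; a real system would call an embedding model
    user_id: str
    task_type: str
    ts: float               # Unix seconds


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(memories, query_vec, now, *, user_id, task_type=None,
                  since=None, k=3, recency_weight=0.3, half_life_days=7.0):
    scored = []
    for m in memories:
        # Structured filters run first, so similarity only ranks eligible memories.
        if m.user_id != user_id:
            continue
        if task_type is not None and m.task_type != task_type:
            continue
        if since is not None and m.ts < since:
            continue
        recency = 0.5 ** ((now - m.ts) / 86400 / half_life_days)
        score = (1 - recency_weight) * cosine(query_vec, m.embedding) + recency_weight * recency
        scored.append((score, m.text))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```

Two memories with identical embeddings now rank differently by age, which is what "what did we decide last Tuesday?" needs. The user filter also keeps one user's memories out of another's results.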

7. Associative / Graph Memory

Beyond vector similarity, some advanced memory architectures represent memories as nodes in a knowledge graph — with typed relationships between entities. Not just "the user mentioned Project X" but "Project X is owned by User A, depends on System B, was approved by C, has deadline D."

Graph memory enables multi-hop reasoning: "what are all the systems that would be affected if we change Project X's deadline?" A flat vector store can't answer that. A knowledge graph can.

Use it for: complex domain knowledge with rich relationships, multi-entity reasoning, organizational memory.
Implementation: Neo4j or a graph layer on top of Postgres. Higher complexity, higher payoff for the right use cases.
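A tiny in-memory sketch of the multi-hop question above, assuming memories are stored as (source, relation, target) triples and that impact follows `depends_on` edges. `GraphMemory` and the relation name are hypothetical; Neo4j or a Postgres graph layer would replace the dictionaries in practice.

```python
from collections import defaultdict, deque


class GraphMemory:
    def __init__(self):
        self.triples = set()
        # relation -> target -> set of sources, so "who depends on X?" is a direct lookup
        self.incoming = defaultdict(lambda: defaultdict(set))

    def add(self, source: str, relation: str, target: str) -> None:
        self.triples.add((source, relation, target))
        self.incoming[relation][target].add(source)

    def affected_by(self, entity: str, relation: str = "depends_on") -> list[str]:
        """Everything that transitively depends on `entity`, in breadth-first order."""
        seen, queue, order = {entity}, deque([entity]), []
        while queue:
            node = queue.popleft()
            for dependant in sorted(self.incoming[relation][node]):
                if dependant not in seen:  # guards against cycles
                    seen.add(dependant)
                    order.append(dependant)
                    queue.append(dependant)
        return order
```

Calling `affected_by("Project X")` walks direct dependants first, then their dependants. That second hop is the part a flat nearest-neighbor lookup has no way to express.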

8. The PSM Insight: Memory as a Learned Cognitive Skill

The most interesting recent thinking in this space challenges the entire architecture described above.

The argument, from the Personal Small Model (PSM) framework: every current memory system treats memory as a storage and retrieval problem. Build a database, implement retrieval, inject results into context. The LLM consumes the fragments. The cycle repeats.

This is architecturally wrong — because it loads memory operations onto the same model doing task reasoning. Every memory decision (what to remember, when to consolidate, how strongly to weight a retrieved memory) competes with the agent's primary reasoning budget.

The PSM proposal: train a separate small model (1–3B parameters) whose sole job is memory operations — relevance gating, consolidation, recall weighting, interference detection, decay scheduling. Not a database. A learned cognitive skill.

The architecture separates two things that current systems conflate:

PSM weights  →  the skill of memory (shared, trained once)
Memory store →  the content of memory (per-user, personal, isolated)

The PSM never stores user content in its weights. It learns how to remember, operating on per-user memory stores it curates. The proposal argues that this avoids catastrophic forgetting and privacy leakage between users, and yields a memory system that improves through reinforcement: memories that helped the LLM perform better get strengthened; irrelevant retrievals get decayed.

It also proposes sleep-time consolidation — an async background process that runs after sessions end, reorganizing episodic memories into semantic patterns, applying decay schedules, and pruning what's been superseded. The user's next session starts with a cleaner, more consolidated memory store — at no additional retrieval cost.
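To give a feel for what a consolidation pass does, here's a rule-based sketch. To be clear, this is not the PSM's method: the proposal describes a trained model making these decisions, while this uses hand-written rules with made-up thresholds (`min_support`, `half_life_days`, `prune_below`) and a hypothetical `Episode` type.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    key: str               # what was observed, e.g. "tone"
    value: str             # the observed value, e.g. "formal"
    ts: float              # Unix seconds
    strength: float = 1.0


def consolidate(episodes, now, min_support=3, half_life_days=30.0, prune_below=0.1):
    """Return (patterns, surviving_episodes) after one background pass."""
    # 1. Decay: strength is recomputed from age and halves every half_life_days.
    for e in episodes:
        age_days = (now - e.ts) / 86400
        e.strength = 0.5 ** (age_days / half_life_days)
    # 2. Promote: a (key, value) pair seen in enough episodes becomes a semantic pattern.
    counts = Counter((e.key, e.value) for e in episodes)
    patterns = {}
    for (key, value), n in counts.most_common():
        if n >= min_support and key not in patterns:
            patterns[key] = value
    # 3. Prune: drop faded episodes and ones already captured by a pattern.
    survivors = [
        e for e in episodes
        if e.strength >= prune_below and patterns.get(e.key) != e.value
    ]
    return patterns, survivors
```

Run after a session ends, this leaves the next session with a handful of patterns plus only the episodes that still carry information the patterns don't, such as the one casual-tone exception.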

This isn't a production system yet. But it's the right direction. The field has been building better filing cabinets. The PSM is arguing for a different cognitive architecture entirely.

The Memory Stack You Actually Need in Production

Practically speaking, here's the layered architecture for a production agent memory system:

Layer 1: Context Window
  └── Raw session state, tool outputs, immediate history

Layer 2: Working Memory Manager
  └── Summarization, context pruning, entity extraction

Layer 3: Episodic Store (Postgres / Firestore)
  └── Structured past interactions, timestamped, outcome-tagged

Layer 4: Semantic Store (Vector DB + Structured DB)
  └── User preferences, domain knowledge, consolidated beliefs
  └── Hybrid retrieval: semantic similarity + metadata filters

Layer 5: Procedural Library
  └── Successful task traces by task type, injected as few-shot

Layer 6: Consolidation Process (async, background)
  └── Episodes → semantic patterns, decay, conflict resolution

Most production agent systems today implement layers 1-2, partially implement layer 4, and skip 3, 5, and 6 entirely. That's why their agents feel stateless, expensive, and forgetful.
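One way the layers can meet at request time is a simple budgeted assembly step: each store contributes candidate snippets, and they're packed into the prompt in priority order until the token budget runs out. This is a sketch under assumptions: the section names, the whitespace token counter, and the skip-if-too-big policy are all illustrative choices.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def assemble_context(sections: list[tuple[str, list[str]]], budget: int) -> str:
    """Pack sections in priority order, skipping items that no longer fit."""
    parts, used = [], 0
    for title, items in sections:
        kept = []
        for item in items:
            cost = count_tokens(item)
            if used + cost > budget:
                continue  # too big for the remaining budget; try the next item
            kept.append(item)
            used += cost
        if kept:
            parts.append(f"[{title}]\n" + "\n".join(kept))
    return "\n\n".join(parts)
```

Ordering the sections is the real design decision. Putting consolidated beliefs ahead of raw episodes means the cheapest, most general memory gets in first when the budget is tight.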

⚡ Santosh's Take

I've thought about this problem from two angles — building conversational AI at AWS where session continuity was a core product requirement, and now designing agent systems where memory isn't a feature but a compliance and reliability concern.

The honest state of the industry is that most teams implement a vector store, call it a memory system, and move on. It's not enough. Memory is what separates an agent that feels like a tool from an agent that feels like a system you can actually trust with ongoing work.

The PSM framing resonates with me precisely because it reframes the problem correctly — memory isn't a retrieval problem, it's an architecture problem. We need dedicated cognitive infrastructure for it, not a smarter database query. The teams that build that infrastructure in the next 18 months will have a meaningful moat over those bolting on a vector store after the fact.

Start with episodic memory. It's underbuilt, high value, and not complicated to implement. The vector store can wait.

👀 Also Watching

mem0 and Zep — the current leading OSS memory layers. Worth knowing their tradeoffs before you build your own.
LangMem (LangChain) — their emerging memory abstraction layer, actively evolving.
The PSM paper on Zenodo — public domain, worth a read if you want to understand where the frontier is heading.

Until next time,

Learn to use AI. Use AI to learn.

If someone forwarded this to you, subscribe at whattheagent.com. If this was useful, forward it to one engineer who needs it.

Santosh Ameti's Agentic AI Newsletter