Everyone's building agents.

I see demos every week. Slick videos of an agent browsing the web, summarizing documents, writing code, sending emails — all autonomously, all seemingly flawlessly. The frameworks have gotten good enough that a competent engineer can have something impressive running in an afternoon.

That part is genuinely easy now.

What nobody's making videos about is what happens three weeks later, when that agent is running in production, handling real users, real data, and real failure modes. When it's 2am and the agent is stuck in a retry loop burning $400/hour in API costs. When two agent instances wrote conflicting state to the same database. When a user's session got corrupted mid-task and now nobody knows what the agent did or didn't do.

That's the gap between an agent and a production agent system. And it's enormous.

This issue is about closing that gap.

Why Production Is a Different Problem

In a demo, the agent runs once, on a clean input, with no concurrent users, no cost pressure, no regulatory scrutiny, and a developer watching every step ready to restart it.

In production, none of those things are true.

You have concurrent sessions. State that must persist across failures. Users who will do unexpected things. Costs that compound with every unnecessary LLM call. Downstream systems that don't tolerate duplicate actions. Compliance requirements that demand auditability of every decision the agent made.

The frameworks are built for the demo case. You have to engineer for the production case.

Here's how.

1. State Management: The Root of Most Production Failures

An agent that can't reliably manage state is not a production agent — it's a liability.

State in an agentic system is more complex than in a standard application because the agent's reasoning context is itself state. The conversation history, tool call results, intermediate reasoning steps, current task position — all of this needs to be externalized, versioned, and recoverable.

What breaks: Agents that keep state in memory. The moment a pod restarts, a timeout occurs, or you scale horizontally, that state is gone. The agent either starts over (wasting work and money) or errors out (breaking the user experience).

Best practices:

  • Externalize everything. Use a persistent checkpoint store — Redis, Postgres, or a purpose-built solution like LangGraph's checkpointing layer. Every meaningful step in the agent loop should write a checkpoint. If the agent dies mid-task, it should be able to resume from the last good state, not restart from scratch.

  • Version your state schema. Agents evolve. The structure of your state object will change. If you don't version it, a deployment will break all in-flight sessions. Treat agent state like a database migration — plan for backward compatibility.

  • Separate short-term and long-term memory explicitly. Short-term is the current task context (fits in the context window). Long-term is user preferences, past interactions, accumulated knowledge (lives in a vector store or structured DB). Conflating them creates both performance and cost problems.

  • Make state transitions atomic. If your agent reads state, acts, then writes new state, those are three operations that can fail independently. Wrap them in transactions where possible. At minimum, detect and handle partial writes.
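To make the checkpointing, versioning, and atomicity points concrete, here is a minimal sketch using SQLite from the Python standard library. The table layout, the `CheckpointStore` class, the `SCHEMA_VERSION` constant, and the `migrate` function are all hypothetical names chosen for illustration, not part of any framework. The `save` method only writes if the stored step still matches the step the agent read, so a stale or duplicate writer gets `False` back instead of silently clobbering state.

```python
import json
import sqlite3

SCHEMA_VERSION = 2  # hypothetical current version of the state schema


def migrate(state: dict, version: int) -> dict:
    """Upgrade an older checkpoint to the current schema (illustrative migration)."""
    if version < 2:
        # Assumed change for this sketch: v2 added a 'pending_tool_calls' list.
        state = {**state, "pending_tool_calls": []}
    return state


class CheckpointStore:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS checkpoints ("
                "session_id TEXT PRIMARY KEY, step INTEGER NOT NULL, "
                "schema_version INTEGER NOT NULL, state TEXT NOT NULL)"
            )

    def load(self, session_id: str):
        """Return (step, state); a session with no checkpoint yet is (0, {})."""
        row = self.conn.execute(
            "SELECT step, schema_version, state FROM checkpoints "
            "WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        if row is None:
            return 0, {}
        step, version, raw = row
        return step, migrate(json.loads(raw), version)

    def save(self, session_id: str, expected_step: int, state: dict) -> bool:
        """Advance to expected_step + 1 only if nobody else wrote in between."""
        payload = json.dumps(state)
        with self.conn:  # commits on success, rolls back on exception
            if expected_step == 0:
                cur = self.conn.execute(
                    "INSERT OR IGNORE INTO checkpoints "
                    "(session_id, step, schema_version, state) VALUES (?, 1, ?, ?)",
                    (session_id, SCHEMA_VERSION, payload),
                )
            else:
                cur = self.conn.execute(
                    "UPDATE checkpoints SET step = step + 1, "
                    "schema_version = ?, state = ? "
                    "WHERE session_id = ? AND step = ?",
                    (SCHEMA_VERSION, payload, session_id, expected_step),
                )
        return cur.rowcount == 1  # False means a conflicting write won the race
```

A caller would `load`, run one agent step, then `save` with the step it loaded. A `False` result is the signal to reload and reconcile rather than continue. The same compare-and-set idea carries over to Postgres or Redis; SQLite is used here only to keep the sketch self-contained.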

2. Isolation: When Agents Share Infrastructure

In a single-user demo, isolation is irrelevant. In a multi-tenant production system, it's the difference between a contained failure and a catastrophic one.

What breaks: Agent A's runaway tool call consumes the shared rate limit budget, starving Agent B. A buggy agent leaks context from one user's session into another's. One agent's database writes corrupt state that another agent is mid-read on.

Best practices:

  • Tenant-level rate limiting. Every agent session should have its own token budget and API call budget, enforced before the LLM call is made — not discovered after the bill arrives. Build a rate limit layer that's aware of session identity, not just aggregate traffic.

  • Namespace all external resources by session. Vector store collections, database rows, file system paths, message queue topics: everything the agent touches should be keyed by a session or tenant ID. This sounds obvious until you're debugging a production incident and realize two users' contexts are bleeding together.

  • Process isolation for high-risk operations. If your agent executes code, runs shell commands, or accesses sensitive data, run those operations in isolated containers or sandboxed environments. The agent's reasoning layer should be separated from its execution layer. A compromised or misbehaving agent should not be able to affect the host system or other sessions.

  • Circuit breakers per agent instance. If a specific agent session is behaving anomalously — excessive tool calls, circular reasoning loops, escalating costs — isolate and terminate that session without affecting others. This requires per-session observability, which we'll get to.
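As a rough illustration of per-session budgets enforced before the call, plus a per-session circuit breaker, consider the sketch below. `SessionBudget`, `BudgetRegistry`, and `BudgetExceeded` are hypothetical names, the limits are placeholders, and the token count passed to `reserve` is assumed to be an estimate you compute before calling the model. State is held in memory only to keep the example short; a real multi-instance deployment would need a shared store.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a call is made when it would exceed the session's budget."""


@dataclass
class SessionBudget:
    max_tokens: int
    max_calls: int
    tokens_used: int = 0
    calls_made: int = 0
    tripped: bool = False  # once True, this session stays blocked

    def reserve(self, estimated_tokens: int) -> None:
        """Call this BEFORE the LLM or tool call; raises instead of overspending."""
        if self.tripped:
            raise BudgetExceeded("session circuit breaker is open")
        over_calls = self.calls_made + 1 > self.max_calls
        over_tokens = self.tokens_used + estimated_tokens > self.max_tokens
        if over_calls or over_tokens:
            self.tripped = True  # isolate this session only
            raise BudgetExceeded("session budget exhausted")
        self.calls_made += 1
        self.tokens_used += estimated_tokens


class BudgetRegistry:
    """Hands out one independent budget per (tenant, session) pair."""

    def __init__(self, max_tokens: int, max_calls: int) -> None:
        self.max_tokens = max_tokens
        self.max_calls = max_calls
        self._budgets = {}

    def for_session(self, tenant_id: str, session_id: str) -> SessionBudget:
        key = (tenant_id, session_id)  # namespaced by tenant and session
        if key not in self._budgets:
            self._budgets[key] = SessionBudget(self.max_tokens, self.max_calls)
        return self._budgets[key]
```

Because each session owns its budget object, a runaway session trips its own breaker and every other session keeps working, which is the containment property described above.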

3. Resilience: Designing for Failure, Not Against It

Agents fail. Tools time out. Models return malformed output. External APIs go down. The question is not whether your agent will encounter failures — it's whether those failures are handled gracefully or catastrophically.

What breaks: Naive retry logic that retries everything, including non-idempotent operations. No differentiation between transient failures (retry) and permanent failures (abort and surface to user). Agents that swallow errors and hallucinate a successful result.

Best practices:

  • Classify your failures before you write retry logic. Transient (network blip, rate limit, temporary API unavailability) → retry with exponential backoff and jitter. Semantic (model returned something that doesn't parse, tool returned unexpected schema) → retry with adjusted prompt or fallback tool. Permanent (invalid credentials, resource doesn't exist, policy violation) → abort immediately, log, surface to user. Treating all three the same is how you get infinite loops and compounding costs.

  • Make tool calls idempotent where possible. If the agent sends an email, creates a ticket, or writes a database record, and then fails after the tool call but before writing the checkpoint — it will retry and do it again. Design tools with idempotency keys. Check before you act. "Did I already do this?" should be a first-class question in your tool implementations.

  • Set hard limits, not just soft ones. Maximum steps per session. Maximum tokens per task. Maximum wall-clock time. Maximum cost per run. These are circuit breakers for your entire system. An agent in a reasoning loop will happily run forever if you let it. Don't let it.

  • Build a human escalation path. For any task where the agent's uncertainty exceeds a threshold, or where an irreversible action is about to be taken, design an explicit handoff to a human. This is not a failure mode; it's a feature. The best production agent systems are human-in-the-loop by design, not as an afterthought.
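The failure classification and hard attempt limit can be sketched as a small retry wrapper. The three exception classes and the `call_with_retries` function are hypothetical; in practice you would map your SDK's real errors onto these categories. Backoff here uses the "full jitter" variant (sleep a random amount between zero and the exponential cap), which is one common choice rather than the only one.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Network blip, rate limit, temporary unavailability."""


class SemanticError(Exception):
    """Model output did not parse, or a tool returned an unexpected schema."""


class PermanentError(Exception):
    """Invalid credentials, missing resource, policy violation."""


def call_with_retries(
    operation: Callable[[], T],
    repair: Optional[Callable[[], None]] = None,
    max_attempts: int = 4,          # hard limit, never unbounded
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            raise  # abort immediately; retrying cannot help
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
        except SemanticError:
            if attempt == max_attempts:
                raise
            if repair is not None:
                repair()  # e.g. tighten the prompt or switch to a fallback tool
    raise RuntimeError("max_attempts must be at least 1")
```

The `operation` passed in should itself be idempotent (for example, by carrying an idempotency key), since a transient failure can occur after the side effect has already happened.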

4. Fault Tolerance: Surviving the Unsurvivable

Resilience is about handling expected failure modes. Fault tolerance is about surviving the ones you didn't design for.

What breaks: Everything, eventually. The question is how much data you lose and how long recovery takes.

Best practices:

  • Implement dead letter queues for agent tasks. When a task fails past all retry attempts, it should land in a dead letter queue — not disappear silently. This gives you a recovery path, an audit trail, and a dataset for improving your agent over time.

  • Design for exactly-once effects, accept at-least-once reality. Most message systems guarantee only at-least-once delivery, which means your agent will sometimes process the same task twice. Your state management and tool implementations need to handle this gracefully. Idempotency at the tool layer and state-based deduplication are your friends here.

  • Build recovery runbooks before you need them. What does an on-call engineer do when an agent is stuck? How do they inspect its current state? How do they safely terminate a runaway session? How do they replay a failed task from a checkpoint? These runbooks should exist and be tested before you go to production — not written during an incident.

  • Chaos test your agent systems. Kill pods mid-task. Inject tool failures. Introduce network latency. Drop messages. If your agent system can't recover cleanly from these, it's not production-ready. This is the same discipline you'd apply to any distributed system — agents are not exempt because they're "AI."
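Here is a small in-memory sketch of how deduplication and a dead letter queue fit together on the consumer side. `Task`, `Worker`, and the status strings are hypothetical, and the `completed` set and `dead_letters` list stand in for what would be durable storage and a real queue in production. The sketch also assumes the handler's side effect and the "completed" record are made durable together, which is the hard part in practice.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    task_id: str          # doubles as the idempotency key
    payload: dict
    attempts: int = 0


@dataclass
class Worker:
    handler: Callable[[Task], None]
    max_attempts: int = 3
    completed: set = field(default_factory=set)       # stand-in for a durable store
    dead_letters: list = field(default_factory=list)  # stand-in for a real DLQ

    def process(self, task: Task) -> str:
        # At-least-once delivery: the same task may arrive again after success.
        if task.task_id in self.completed:
            return "duplicate"
        task.attempts += 1
        try:
            self.handler(task)
        except Exception as exc:
            if task.attempts >= self.max_attempts:
                # Keep the task and the reason; never drop it silently.
                self.dead_letters.append((task, repr(exc)))
                return "dead-lettered"
            return "retry"
        self.completed.add(task.task_id)
        return "done"
```

Redelivering a finished task returns `"duplicate"` without re-running the handler, and a task that keeps failing ends up in `dead_letters` with its last error, ready for inspection or replay.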

5. Observability: You Can't Fix What You Can't See

This is the most underbuilt part of the agentic stack, across the industry, right now.

Standard application monitoring tells you that something failed. Agent observability needs to tell you why the agent made the decision that led to the failure — which is a fundamentally different and harder problem.

What you need:

  • Step-level tracing. Every action the agent takes — every tool call, every reasoning step, every model invocation — should produce a trace event with timing, inputs, outputs, and cost. Not just aggregate metrics. Step-level traces.

  • Decision logging. When the agent chooses between actions, log the options it considered and the one it picked. This is what lets you reconstruct why an agent behaved unexpectedly. Without this, debugging a production agent is archaeology.

  • Cost attribution per session. You need to know exactly how much each agent run cost — broken down by model calls, token counts, and tool invocations. Aggregate cost monitoring tells you you're over budget. Session-level cost attribution tells you which sessions are burning money and why.

  • Anomaly detection on agent behavior. Define what "normal" looks like for your agent — average steps to completion, typical tool call distribution, usual token consumption. Alert when sessions deviate significantly. A session taking 10x the normal number of steps is not a user problem — it's a signal your agent is stuck.
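A minimal sketch of step-level tracing with per-session cost attribution and a crude step-count anomaly check might look like the following. `TraceEvent`, `Tracer`, and the `factor` threshold are hypothetical, costs are whatever your code assigns to the event inside the `with` block, and events are kept in a list where a real system would export them to a tracing backend.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEvent:
    session_id: str
    kind: str                 # e.g. "model", "tool", "reasoning"
    name: str
    duration_s: float = 0.0
    cost_usd: float = 0.0
    error: Optional[str] = None


class Tracer:
    def __init__(self) -> None:
        self.events = []

    @contextmanager
    def step(self, session_id: str, kind: str, name: str):
        """Record one agent step, including timing and any raised error."""
        event = TraceEvent(session_id, kind, name)
        start = time.perf_counter()
        try:
            yield event  # the caller fills in cost_usd (and anything else)
        except Exception as exc:
            event.error = repr(exc)
            raise
        finally:
            event.duration_s = time.perf_counter() - start
            self.events.append(event)

    def cost_by_session(self) -> dict:
        totals = defaultdict(float)
        for event in self.events:
            totals[event.session_id] += event.cost_usd
        return dict(totals)

    def anomalous_sessions(self, baseline_steps: int, factor: float = 10.0) -> list:
        """Sessions whose step count is at least factor x the expected baseline."""
        steps = Counter(event.session_id for event in self.events)
        return [sid for sid, n in steps.items() if n >= factor * baseline_steps]
```

Usage is a `with tracer.step(session_id, "tool", "search") as event:` block around each action, setting `event.cost_usd` once the cost is known. The same events can feed the decision log if you also attach the options the agent considered.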

⚡ Santosh's Take

I've deployed agent systems at AWS scale and in the risk-sensitive environment of a top hedge fund. Here's the honest truth: the engineering discipline required to run agents reliably in production is not AI engineering. It's distributed systems engineering, applied to a new class of workload.

The teams that are succeeding aren't the ones with the best prompts. They're the ones that treated their agent orchestration layer with the same rigor they'd apply to a payment processing system or a trading engine. Checkpointing, idempotency, circuit breakers, dead letter queues, chaos testing — none of this is new. What's new is that you now have to apply it to a system whose intermediate steps are probabilistic rather than deterministic.

That's the real challenge. And it's why "we have a working demo" and "we're production-ready" are separated by months of unglamorous infrastructure work.

The demo is the beginning. The checkpoint store is the product.

Until next time,

Learn to use AI. Use AI to learn.

If someone forwarded this to you, subscribe at whattheagent.com. If this was useful, forward it to one engineer who needs it.
