There's a moment every team hits.

The prototype works. The agent does the thing. Someone runs it locally, it handles the task cleanly, everyone's excited. The decision is made to go to production.

Six weeks later, a senior engineer is staring at a dashboard showing $40,000 in monthly LLM API costs, a p99 latency of 47 seconds, three ongoing incidents of agents stuck in retry loops, and a Slack message from legal asking why the agent accessed data it wasn't supposed to.

This is not a story about bad engineering. This is a story about a category mismatch. Running one agent locally and running ten thousand agents simultaneously in production are not the same problem with more instances. They are fundamentally different engineering problems that require fundamentally different infrastructure.

This issue is about that gap — what changes, what breaks, and what you have to build.

What "Running Locally" Actually Means

When you run an agent locally, you control everything implicitly:

  • One instance. No concurrency, no resource contention.

  • Your machine. No network hops, no shared infrastructure.

  • Your data. No multi-tenancy, no isolation concerns.

  • You watching. You're the observability system. You see the output in real time.

  • Your API key. Rate limits are loose at low volume. Costs are negligible.

  • Clean inputs. You're testing with inputs you chose. Not inputs real users will throw at it.

  • Manual recovery. Something goes wrong, you restart the script. No SLA.

These implicit controls create a false sense of reliability. The agent isn't robust — it just hasn't been stressed.

What Changes at Scale

Here's every assumption that breaks when you go from one agent to ten thousand:

1. Compute: From Laptop to Distributed System

Locally, your agent runs in one process on one machine. Memory is the machine's RAM. Execution is sequential or trivially parallel.

At scale, you're running a distributed system. Agents run across a fleet of workers. Work has to be distributed, load-balanced, and recovered when workers fail. State can't live in process memory — it has to live in shared external storage.

What you need:

  • A task queue (Celery + Redis, AWS SQS, Cloud Tasks): agents don't run inline, they pull work from a queue. This gives you a natural place to apply back-pressure, retry logic, and dead-letter handling, most of which the queue provides once configured.

  • Horizontal scaling — workers scale based on queue depth, not fixed instance count.

  • Stateless workers — every worker must be able to pick up any task. No local state. Everything in the external store.

  • Resource limits per worker — CPU, memory, and time limits enforced at the infrastructure layer, not just the application layer.
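
The queue-plus-stateless-worker pattern above can be sketched with Python's standard library standing in for real infrastructure. Everything named here (`TASKS`, `STATE_STORE`, `DEAD_LETTERS`, `run_step`, `MAX_ATTEMPTS`) is a hypothetical placeholder: in production the queue would be a broker such as SQS or Celery, and the store would be Redis or Postgres.

```python
import queue

# Stand-ins for external infrastructure (assumptions for illustration only).
TASKS: "queue.Queue[dict]" = queue.Queue(maxsize=100)  # bounded, so producers feel back-pressure
DEAD_LETTERS: list = []
STATE_STORE: dict = {}
MAX_ATTEMPTS = 3


def run_step(task: dict) -> str:
    """Hypothetical agent step; fails when the payload asks it to."""
    if task["payload"] == "boom":
        raise RuntimeError("step failed")
    return task["payload"].upper()


def worker() -> None:
    """Drain the queue. The worker keeps no state of its own between tasks."""
    while True:
        try:
            task = TASKS.get_nowait()
        except queue.Empty:
            return
        try:
            result = run_step(task)
            STATE_STORE[task["id"]] = {"status": "done", "result": result}
        except Exception as exc:
            task["attempts"] += 1
            if task["attempts"] >= MAX_ATTEMPTS:
                # Give up and park the task for human review.
                DEAD_LETTERS.append(task)
                STATE_STORE[task["id"]] = {"status": "dead", "error": str(exc)}
            else:
                # Requeue; any worker may pick it up next time.
                TASKS.put_nowait(task)


TASKS.put_nowait({"id": "t1", "payload": "summarize", "attempts": 0})
TASKS.put_nowait({"id": "t2", "payload": "boom", "attempts": 0})
worker()
```

Because the worker reads and writes only the queue and the store, running ten copies of `worker()` on ten machines would need no code change, only a real broker behind `TASKS`.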

2. LLM API: Rate Limits Become a First-Class Problem

Locally, you hit rate limits occasionally and restart. At scale, rate limits are an architectural constraint you design around before you write a single line of agent code.

The math is unforgiving: if each agent run makes 20 LLM calls, each call averages 2,000 tokens, and you're running 500 concurrent agents, every wave of runs consumes 20 × 2,000 × 500 = 20 million tokens. If a run finishes in about a minute, that's 20 million tokens per minute. Most API tiers don't support that without enterprise agreements and careful traffic shaping.
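
The arithmetic is worth writing down, because the 20 million figure is a total for one wave of 500 runs and only becomes a per-minute rate once you assume how long a run takes. A small sketch, where the run duration is an assumption the text does not state:

```python
calls_per_run = 20
tokens_per_call = 2_000
concurrent_agents = 500

tokens_per_run = calls_per_run * tokens_per_call        # 40,000 tokens
tokens_per_wave = tokens_per_run * concurrent_agents    # 20,000,000 tokens


def tokens_per_minute(run_duration_seconds: float) -> float:
    """Sustained rate if each of the 500 slots restarts as soon as its run ends."""
    return tokens_per_wave * 60 / run_duration_seconds
```

With one-minute runs, `tokens_per_minute(60)` gives 20 million tokens per minute; with two-minute runs the same fleet sustains half that.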

What you need:

  • A token budget layer — every agent session is allocated a token budget at creation. The agent cannot exceed it. Period. This is enforced before the API call, not discovered on the bill.

  • Rate limit-aware scheduling — the task queue respects API rate limits. If you're at 80% of your RPM limit, new tasks are held, not dropped.

  • Multi-model routing — different tasks route to different models based on complexity and cost. Simple classification goes to a cheap model. Complex reasoning goes to a frontier model. This alone can cut LLM costs by 60%.

  • Caching: identical prompts return cached responses, and identical prompts are surprisingly common in production workflows. A semantic cache (using vector similarity to match near-identical prompts) extends this further.

  • Fallback chains — if the primary model is rate-limited or unavailable, route to a fallback. The agent shouldn't fail because one provider is having a bad hour.
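
Routing, caching, and fallback can live in one thin layer in front of the model clients. The sketch below is one possible shape, not a reference design: the model names, the `route` rules, the `RateLimited` exception, and the client callables are all hypothetical, and the cache is exact-match only.

```python
from typing import Callable, Optional

ModelFn = Callable[[str], str]  # hypothetical client: prompt in, text out


class RateLimited(Exception):
    """Raised by a client when the provider rejects the call."""


def route(task_type: str) -> list:
    """Return an ordered chain: preferred model first, fallbacks after."""
    if task_type == "classification":
        return ["cheap-model", "frontier-model"]
    return ["frontier-model", "backup-frontier-model"]


def call_with_fallback(prompt: str, task_type: str, clients: dict, cache: dict) -> tuple:
    """Return (model_used, response), trying the cache before any provider."""
    key = (task_type, prompt)
    if key in cache:
        return "cache", cache[key]
    last_error: Optional[Exception] = None
    for model in route(task_type):
        try:
            response = clients[model](prompt)
        except RateLimited as exc:
            last_error = exc
            continue  # try the next model in the chain
        cache[key] = response
        return model, response
    raise RuntimeError("all models in the chain failed") from last_error


def limited(prompt: str) -> str:
    raise RateLimited("429")


clients = {
    "cheap-model": limited,  # simulate a provider having a bad hour
    "frontier-model": lambda p: f"answer to: {p}",
    "backup-frontier-model": lambda p: f"backup answer to: {p}",
}
cache: dict = {}
first = call_with_fallback("is this spam?", "classification", clients, cache)
second = call_with_fallback("is this spam?", "classification", clients, cache)
```

Here the first call falls through from the rate-limited cheap model to the frontier model, and the second identical call is served from the cache without touching any provider.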

3. Latency: From "Fast Enough" to a Hard SLA

Locally, 8 seconds per LLM call is fine. You're watching the output in a terminal. At scale, 8 seconds per call × 15 calls = 2 minutes per agent run — which may violate your SLA, frustrate users, and cascade into queue buildup.

What you need:

  • Async everywhere. Agents should not block synchronously on LLM calls. Every tool call, every model invocation is async, with proper futures/callbacks.

  • Streaming responses. For user-facing agents, stream the model output rather than waiting for the full response. Perceived latency drops dramatically.

  • Step-level timeouts. Each individual step has a timeout, not just the overall task. A single stuck tool call shouldn't block the entire agent run.

  • Latency budgeting. Know your expected latency breakdown: orchestration overhead, LLM latency, tool execution time. Monitor each independently. Regressions hide in the aggregate.

  • Pre-warming. For latency-sensitive workflows, keep a pool of initialized agent contexts warm rather than cold-starting every session.
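
Step-level timeouts and async execution combine naturally in `asyncio`. This is a minimal sketch: `run_step` is a hypothetical stand-in for an LLM or tool call, and the durations and timeouts are arbitrary values chosen for illustration.

```python
import asyncio


async def run_step(name: str, duration: float) -> str:
    """Hypothetical step; the sleep stands in for an LLM or tool call."""
    await asyncio.sleep(duration)
    return f"{name}: ok"


async def run_with_timeout(name: str, duration: float, timeout: float) -> str:
    """Bound a single step so one stuck call cannot stall the whole run."""
    try:
        return await asyncio.wait_for(run_step(name, duration), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def run_agent() -> list:
    # Independent steps run concurrently; each carries its own timeout.
    return await asyncio.gather(
        run_with_timeout("search", 0.01, timeout=0.5),
        run_with_timeout("stuck_tool", 5.0, timeout=0.05),
        run_with_timeout("summarize", 0.01, timeout=0.5),
    )


results = asyncio.run(run_agent())
```

The stuck tool is cancelled after 50 ms while the other two steps complete normally, so the run as a whole finishes in roughly the time of its slowest allowed step rather than its slowest actual one.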

4. State Management: From In-Process to Distributed

Locally, state lives in a Python dict in memory. The agent is one process. There's no concurrency issue.

At 10,000 concurrent agents, state is a distributed systems problem with all the classic failure modes: race conditions, partial writes, stale reads, split-brain scenarios.

What you need:

  • External state store — Redis for hot state (active agent sessions), Postgres for durable state (completed steps, checkpoints, audit trail).

  • Optimistic locking — when multiple processes might update the same agent state, use version-based optimistic locking. Detect conflicts, retry safely.

  • Checkpoint-first architecture — before executing any step, checkpoint the intent. After completing, checkpoint the outcome. If the worker dies between those two checkpoints, the next worker knows exactly where to resume.

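  • Session affinity for short tasks: where possible, route all steps of one agent session to the same worker. This reduces state synchronization overhead and simplifies debugging. Treat it as an optimization only; any worker must still be able to resume the session from the external store.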
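
Optimistic locking and checkpoint-first execution fit together in a few lines. The sketch below uses an in-memory `StateStore` as a stand-in for Redis or Postgres; the class, the `"intent"` and `"done"` status strings, and the retry count of 5 are assumptions made for illustration.

```python
import copy


class VersionConflict(Exception):
    """Another writer updated the session since it was read."""


class StateStore:
    """In-memory stand-in for an external store (an assumption for this sketch)."""

    def __init__(self) -> None:
        self._rows = {}  # session_id -> (version, state)

    def read(self, session_id: str) -> tuple:
        version, state = self._rows.get(session_id, (0, {"steps": {}}))
        return version, copy.deepcopy(state)

    def write(self, session_id: str, expected_version: int, state: dict) -> int:
        current_version, _ = self._rows.get(session_id, (0, {}))
        if current_version != expected_version:
            raise VersionConflict(session_id)
        self._rows[session_id] = (expected_version + 1, copy.deepcopy(state))
        return expected_version + 1


def checkpoint(store: StateStore, session_id: str, step: str, status: str) -> None:
    """Record a step's status, re-reading and retrying on a version conflict."""
    for _ in range(5):
        version, state = store.read(session_id)
        state["steps"][step] = status
        try:
            store.write(session_id, version, state)
            return
        except VersionConflict:
            continue
    raise RuntimeError("could not checkpoint after 5 attempts")


def resume_point(store: StateStore, session_id: str) -> list:
    """Steps with a recorded intent but no recorded outcome."""
    _, state = store.read(session_id)
    return [s for s, status in state["steps"].items() if status == "intent"]


store = StateStore()
checkpoint(store, "s1", "fetch_invoice", "intent")
checkpoint(store, "s1", "fetch_invoice", "done")
checkpoint(store, "s1", "send_email", "intent")  # imagine the worker dies here
```

After the simulated crash, `resume_point(store, "s1")` returns `["send_email"]`. Whether the next worker re-runs that step or first verifies that it happened depends on whether the step is idempotent; the checkpoint only tells it where to look.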

5. Multi-Tenancy: Isolation at Every Layer

Locally, there's one user — you. There's no isolation concern.

At scale, you have thousands of users whose agent sessions must be completely isolated from each other. Data, compute, rate limits, errors — none of it should cross tenant boundaries.

What you need:

  • Tenant-scoped resource namespacing — every database row, every vector store collection, every file path, every queue message carries a tenant ID. This is foundational. Retrofit it and you'll miss cases.

  • Per-tenant rate limiting — one user's runaway agent shouldn't consume the API budget for everyone else. Enforce per-tenant token budgets at the gateway, not just globally.

  • Compute isolation for high-risk tenants — enterprise customers or high-compliance workloads get dedicated worker pools. No resource sharing with other tenants.

  • Data access enforcement: the agent's tool calls must be scoped to data the tenant is authorized to access. This isn't optional for any regulated use case. Implement it as a middleware layer in your tool execution path, not as agent prompt instructions. Prompt instructions can be worked around by prompt injection or model error. A check enforced in code runs no matter what the model outputs.
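
Enforcing data access in the tool execution path can be as simple as a wrapper around each tool. In this sketch the ownership table, the `tenant_scoped` decorator, and the `get_invoice` tool are all hypothetical; the important assumption is that `tenant_id` comes from the authenticated session, never from model output.

```python
from typing import Callable

# Hypothetical data: which tenant owns which record.
RECORD_OWNERS = {"inv-1": "tenant-a", "inv-2": "tenant-b"}
RECORDS = {"inv-1": {"total": 120}, "inv-2": {"total": 999}}


class AccessDenied(Exception):
    pass


def tenant_scoped(tool: Callable[[str], dict]) -> Callable[[str, str], dict]:
    """Wrap a tool so the ownership check runs in code, outside the model's control."""

    def guarded(tenant_id: str, record_id: str) -> dict:
        # tenant_id must be supplied by the session layer, not by the agent.
        if RECORD_OWNERS.get(record_id) != tenant_id:
            raise AccessDenied(f"{tenant_id} may not read {record_id}")
        return tool(record_id)

    return guarded


@tenant_scoped
def get_invoice(record_id: str) -> dict:
    return RECORDS[record_id]
```

A call such as `get_invoice("tenant-a", "inv-2")` raises `AccessDenied` regardless of what the prompt said, which is the property prompt-level instructions cannot give you.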

6. Observability: From Terminal Output to Production Telemetry

Locally, you watch the logs scroll by. Something goes wrong, you see it immediately.

At scale, you have 10,000 agent sessions running simultaneously. You cannot watch them. You need systems that surface what matters and let you dig into anything.

The three layers of agent observability:

Metrics — aggregate, real-time:

  • Sessions active, sessions completed, sessions failed

  • p50/p95/p99 latency per session and per step

  • Token consumption by model, by task type, by tenant

  • Tool call success rate and error distribution

  • Cost per session, per tenant, per day

Traces — per-session, step-level:

  • Every LLM call with prompt, response, latency, token count

  • Every tool call with inputs, outputs, duration, success/failure

  • Every state transition with before/after snapshots

  • Decision points — what options the agent considered, what it chose

Alerts — anomaly detection:

  • Session exceeding 2× expected token budget → alert

  • Tool error rate spike → alert

  • p99 latency crossing SLA threshold → alert

  • A single tenant consuming >20% of global API budget → alert

  • Agents stuck in loop patterns (same tool called >N times) → alert + auto-terminate
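
The loop-pattern alert can be approximated with a per-session counter. This sketch is slightly stricter than "same tool called more than N times": it counts repeats of the same tool with the same arguments, which is a design choice of the example, and it assumes arguments are flat with hashable values. `LoopGuard` and the threshold of 3 are hypothetical.

```python
from collections import Counter


class LoopDetected(Exception):
    pass


class LoopGuard:
    """Counts identical tool calls within one session."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.calls: Counter = Counter()

    def record(self, tool: str, args: dict) -> None:
        # Sort the items so argument order does not change the key.
        key = (tool, tuple(sorted(args.items())))
        self.calls[key] += 1
        if self.calls[key] > self.max_repeats:
            raise LoopDetected(f"{tool} repeated {self.calls[key]} times")


guard = LoopGuard(max_repeats=3)
terminated_at = None
for attempt in range(1, 11):
    try:
        guard.record("search", {"q": "refund policy"})
    except LoopDetected:
        terminated_at = attempt  # auto-terminate and raise an alert here
        break
```

The session is stopped on the fourth identical call instead of running all ten, and a different query to the same tool would start its own count.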

The tooling landscape: OpenTelemetry for trace instrumentation, Langfuse or Arize Phoenix for LLM-specific observability, Grafana for dashboards, PagerDuty for alerting. The agent-specific observability layer is the piece most teams underinvest in — and regret first.

7. Cost: From Negligible to Line Item

Locally, you're spending a few dollars a day in API costs. It doesn't matter.

At scale, LLM API costs are a material business expense — and unlike traditional infrastructure, they scale with usage patterns, not just instance count. An agent that takes one unnecessary reasoning step per run, multiplied by 500,000 runs per month, is a meaningful cost problem.

Cost control at scale:

  • Token budgets per session — hard limits enforced before the call.

  • Model tiering — route by task complexity. Don't use a frontier model for tasks a cheaper model handles well.

  • Prompt caching — Anthropic and OpenAI both offer prompt caching for repeated system prompts. At scale this is significant.

  • Batch processing — for non-latency-sensitive workloads, use batch APIs (significantly cheaper). Don't use the realtime API for everything.

  • Cost attribution — know the cost of every task type, broken down by model and step. You can't optimize what you can't measure.

  • Auto-termination — sessions that exceed cost thresholds are terminated automatically and flagged for review. Not as a punitive measure — as a debugging signal. Runaway cost is almost always a symptom of a stuck agent.
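
A per-session budget that is checked before each call, with cost attributed by step and model, might look like the sketch below. The prices in `PRICE_PER_1K` are invented for illustration and do not reflect any provider's pricing, and `SessionBudget` is a hypothetical class. A real system would also reconcile the estimate against the actual token count returned by the API.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICE_PER_1K = {"cheap-model": 0.001, "frontier-model": 0.01}


class BudgetExceeded(Exception):
    pass


class SessionBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0
        self.cost_by_step = defaultdict(float)  # (step, model) -> USD

    def reserve(self, step: str, model: str, estimated_tokens: int) -> None:
        """Call before the API request; refuses work that would overrun the budget."""
        remaining = self.max_tokens - self.used
        if estimated_tokens > remaining:
            raise BudgetExceeded(f"{step} needs {estimated_tokens}, {remaining} left")
        self.used += estimated_tokens
        self.cost_by_step[(step, model)] += estimated_tokens / 1000 * PRICE_PER_1K[model]


budget = SessionBudget(max_tokens=5_000)
budget.reserve("plan", "frontier-model", 2_000)
budget.reserve("classify", "cheap-model", 1_000)
try:
    budget.reserve("draft", "frontier-model", 3_000)
    terminated = False
except BudgetExceeded:
    terminated = True  # terminate the session and flag it for review
```

The third call is refused before any tokens are spent, and `budget.cost_by_step` shows where the session's money went, which is the cost attribution the list above asks for.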

8. Failure Modes That Only Exist at Scale

Some failure modes are invisible locally and only emerge at scale:

The thundering herd. A spike in incoming tasks overwhelms the task queue. Workers back up. Latency climbs. Users retry (making it worse). The system falls over. Solution: back-pressure in the queue, rate-limited task ingestion, circuit breakers at the API gateway.

The cost explosion. A bug causes agents to loop. Each loop burns tokens. At 10,000 concurrent sessions, a looping bug can generate a five-figure API bill in hours. Solution: per-session token budgets with hard cutoffs, anomaly alerts on burn rate.

The state corruption cascade. A bad deployment writes malformed state for all active sessions. Now thousands of sessions are in an unrecoverable state simultaneously. Solution: schema versioning, canary deployments, state migration procedures.

The noisy neighbor. One tenant's agents consume so much API capacity that other tenants experience degraded performance. Solution: per-tenant rate limiting, dedicated pools for high-volume tenants.

The silent failure. An agent completes but produced wrong output — and nothing flagged it because the success metric was task completion, not output quality. Solution: output quality evaluation as part of the pipeline, not just success/failure tracking.
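
Of the mitigations named above, the circuit breaker is the one teams most often hand-roll. A minimal sketch follows, assuming a simple consecutive-failure threshold and a fixed cool-down; `CircuitBreaker`, its parameters, and the injectable `clock` are hypothetical, and production libraries offer richer policies.

```python
import time


class CircuitOpen(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("shedding load")  # fail fast, do not hit upstream
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


now = [0.0]  # fake clock for the demonstration
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise TimeoutError("upstream overloaded")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpen:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")
now[0] = 11.0  # cool-down has elapsed
outcomes.append(breaker.call(lambda: "ok"))
```

After two failures the breaker rejects the third call without contacting the upstream at all, which is what stops user retries from amplifying an overload; once the cool-down passes, a successful trial call closes it again.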

The Maturity Model

Here's where most teams are and where they need to get:

| Maturity | Characteristics |
|---|---|
| Level 1: Local | In-process state, synchronous calls, no observability, manual recovery |
| Level 2: Basic Prod | External state store, async execution, basic logging, manual scaling |
| Level 3: Scaled | Task queues, per-tenant isolation, token budgets, trace-level observability |
| Level 4: Production-grade | Multi-model routing, cost attribution, anomaly detection, chaos-tested, full audit trail |
| Level 5: Platform | Self-healing, auto-scaling, predictive cost management, SLA-backed, compliance-ready |

Most teams launch at Level 2 and discover they needed Level 3 on day one of real traffic. Level 4 is where you need to be before any regulated or enterprise deployment.

⚡ Santosh's Take

I've operated at both ends of this spectrum — running prototype agents in a Jupyter notebook and designing agent infrastructure that handles financial workloads where a misfire has real consequences. The gap between them is not engineering complexity. It's engineering discipline.

The patterns required to run agents reliably at scale — task queues, stateless workers, checkpoint-first state management, per-tenant isolation, token budget enforcement — are not new. They're the same patterns that made distributed systems reliable a decade ago. What's new is that you have to apply them to a workload whose individual steps are probabilistic, expensive per call, and potentially irreversible.

My advice: don't wait until you have scale to build for scale. The architectural decisions you make at 10 users are the ones you're living with at 10,000. Build the token budget layer now. Externalize your state now. Add step-level tracing now. These are not premature optimizations — they're the foundation that makes everything else possible.

The teams treating agent infrastructure as a distributed systems problem will win. The teams treating it as a scripting problem will rebuild from scratch.

👀 Also Watching

  • Ray and Anyscale — the most mature distributed compute layer for running agent workloads at scale. Worth evaluating seriously.

  • Langfuse — open source LLM observability. The best OSS option for step-level agent tracing right now.

  • AWS Bedrock Agents — if you're on AWS, the managed agent runtime handles several of these problems out of the box. Know what it gives you and what it doesn't.

Until next time,

Learn to use AI. Use AI to learn.

If someone forwarded this to you, subscribe at whattheagent.com. If this was useful, forward it to one engineer who needs it.
