Ask any engineer about AI infrastructure and they'll talk about GPUs. H100s. A100s. Reserved instances. Spot capacity. The GPU is the star of the AI infrastructure story — expensive, scarce, and universally treated as the constraint that everything else is sized around.

For pure inference workloads, that framing is correct.

For agentic AI, it's dangerously incomplete.

The more agent complexity you add — tools, memory, multi-agent coordination, long-running sessions, RAG pipelines, conversation storage, context management, observability, MCP servers — the more your system spends its time doing work that has nothing to do with the GPU. The model forward pass, the only step that actually needs a GPU, becomes a progressively smaller fraction of total compute time as agent sophistication grows.

What fills the rest of that time?

CPU work. An enormous amount of it. And most teams don't see it coming until it's already a problem.

The anatomy of a single agent step

To understand why CPUs matter so much, you have to look at what actually happens during an agent run — not just the LLM call, but everything around it.

A single agent step looks like this:

1.  Receive task or trigger
2.  Load agent state and conversation history (deserialize from store)
3.  Evaluate context window budget — can we fit full history?
4.  If not: run summarization pass on older turns (LLM call or extractive)
5.  Retrieve relevant memory (embed query → search → rank results)
6.  Assemble and pack context window
7.  Tokenize and format prompt
8.  → LLM inference (GPU) ←
9.  Parse model output (JSON extraction, validation)
10. Route to tool based on model decision
11. Execute tool call (API, DB, code, file I/O)
12. Process tool response (parse, validate, format)
13. Update agent state, write checkpoint (serialize to store)
14. Store conversation turn to persistent conversation store
15. Evaluate loop condition (continue, terminate, escalate)
16. Emit trace events to observability layer
17. Repeat

Count the GPU steps: one. Count the CPU steps: fifteen.
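To make the shape of that loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any framework's API: `call_model` and `run_tool` are hypothetical stubs standing in for the inference call and a tool, and the step numbers in the comments refer to the list above.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for step 8, the only GPU-bound call."""
    # A real system would call an inference endpoint here.
    if "result=" in prompt:
        return json.dumps({"action": "finish", "answer": "done"})
    return json.dumps({"action": "tool", "name": "lookup", "args": {"q": "x"}})

def run_tool(name: str, args: dict) -> str:
    """Hypothetical stand-in for step 11."""
    return f"result={name}:{args['q']}"

def run_agent(task: str, max_steps: int = 5) -> dict:
    state = {"history": [task], "steps": 0}           # step 2: load state
    while state["steps"] < max_steps:                 # step 15: loop condition
        prompt = "\n".join(state["history"])          # steps 3-7: assemble prompt
        raw = call_model(prompt)                      # step 8: GPU
        decision = json.loads(raw)                    # step 9: parse output
        state["steps"] += 1
        if decision["action"] == "finish":
            state["answer"] = decision["answer"]
            break
        observation = run_tool(decision["name"], decision["args"])  # steps 10-12
        state["history"].append(observation)          # steps 13-14: update state
    return state
```

Everything in `run_agent` except the single `call_model` line executes on CPU, which is the point of the count above.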

This ratio worsens as agent complexity increases. Add more tools, richer memory, longer conversation histories, more rigorous observability: the GPU slice stays roughly constant while every surrounding step grows. At sufficient scale, the GPU sits idle a meaningful fraction of the time, waiting for CPU work upstream or downstream to complete. Let's go through each layer of an agentic AI system in turn.

1. Orchestration and control flow

The orchestrator — the component that decides what the agent does next — runs entirely on CPU. In a simple single-agent system this is lightweight. In a multi-agent system it compounds fast.

A planner agent coordinating five specialist agents needs to track each sub-agent's state, sequence execution, manage output dependencies, handle failures and retries, and synthesize results. Every one of those operations is CPU work — branching logic, state evaluation, conditional routing, priority queuing.

Frameworks like LangGraph, CrewAI, and AutoGen add their own abstraction overhead on top of your application logic. Graph traversal, edge evaluation, node state management: all CPU, all before a single token is generated. As agent graphs grow in complexity, orchestration overhead grows faster than linearly. A coordinator managing ten concurrent specialist agents isn't doing ten times the work of one; it's doing the work of ten agents plus the cross-agent coordination between them.
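Sequencing sub-agents around their output dependencies is, at its core, a topological sort. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to group agents into waves that can run concurrently. The agent names and the `plan` structure are hypothetical, and real frameworks layer state tracking and retries on top of something like this.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each agent maps to the set of agents whose output it needs.
plan = {
    "research": set(),
    "extract": {"research"},
    "analyze": {"extract"},
    "fact_check": {"extract"},
    "write": {"analyze", "fact_check"},
}

def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group agents into waves; agents in one wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                        # raises CycleError on circular plans
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every agent whose inputs are done
        waves.append(ready)
        sorter.done(*ready)                 # mark finished, unblocking dependents
    return waves
```

For the plan above, `execution_waves(plan)` yields four waves, with `analyze` and `fact_check` sharing the third. The bookkeeping is small per step, but it runs on every step of every session.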

2. Conversation storage and context window management

This is the layer most infrastructure discussions skip entirely — and it's one of the most CPU-intensive in a production agent system.

Agents that persist across sessions need to store their conversation history somewhere durable. That means every turn is written to a conversation store — serialized, tagged with session metadata, tenant ID, timestamp, and message role, and committed to a database. At scale, this is a high-throughput write workload running continuously across all active sessions.
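As a rough sketch of that write path, the following uses `sqlite3` and `json` from the standard library. The table layout and the function names (`open_store`, `store_turn`, `load_history`) are assumptions made for illustration; a production store would more likely be a networked database, but the serialize, tag, and commit work is the same in kind.

```python
import json
import sqlite3
import time

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        "session_id TEXT, tenant_id TEXT, ts REAL, role TEXT, body TEXT)"
    )
    return conn

def store_turn(conn, session_id, tenant_id, role, content, ts=None):
    """Serialize one turn, tag it with metadata, and commit it."""
    body = json.dumps({"content": content})           # serialization cost
    conn.execute(
        "INSERT INTO turns VALUES (?, ?, ?, ?, ?)",
        (session_id, tenant_id, ts if ts is not None else time.time(), role, body),
    )
    conn.commit()

def load_history(conn, session_id):
    """Deserialize a session's turns in chronological order."""
    rows = conn.execute(
        "SELECT role, body FROM turns WHERE session_id = ? ORDER BY ts",
        (session_id,),
    )
    return [(role, json.loads(body)["content"]) for role, body in rows]
```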

But storage is only the beginning. The harder problem is what you do when the conversation history grows too long to fit in the context window.

This is where summarization becomes a first-class CPU concern.

Every agent step that involves a session with substantial history requires a context window evaluation: how many tokens does the full history consume? Do we have room for the current task, the retrieved memories, and the system prompt alongside it? If not — and in long-running sessions, the answer is frequently "not" — something has to give.

The naive approach is truncation: drop the oldest turns. This is fast and cheap, but it loses critical context. A constraint established in turn 3 of a 200-turn session gets silently discarded, and the agent proceeds without it.

The better approach is summarization: periodically compress older conversation history into a structured summary that retains the essential information at a fraction of the token cost. The agent carries a rolling summary of prior context plus the most recent turns verbatim — giving it both long-term continuity and short-term precision.
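One way to picture the rolling-summary layout is the sketch below. It assumes a crude whitespace token count (a real system would use the model's tokenizer), and `build_context` is a hypothetical helper. It keeps the newest turns verbatim within the budget and returns the older turns that should be folded into the summary.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace token count; an assumption, not a real tokenizer."""
    return len(text.split())

def build_context(summary: str, turns: list[str], budget: int) -> tuple[list[str], list[str]]:
    """Keep the newest turns verbatim within budget; return (context, overflow).

    Overflow holds the older turns that no longer fit and should be folded
    into the rolling summary by whichever summarizer the system uses.
    """
    remaining = budget - count_tokens(summary)
    kept: list[str] = []
    for turn in reversed(turns):               # walk newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()                             # restore chronological order
    overflow = turns[: len(turns) - len(kept)]
    context = ([summary] if summary else []) + kept
    return context, overflow
```

The check itself is cheap; the cost comes from running it, and the summarizer it triggers, on every step of every long session.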

Summarization has its own cost profile:

  • Extractive summarization (rule-based, key sentence selection) is CPU-only and fast, but loses nuance

  • Abstractive summarization (LLM-based) produces better summaries but adds another model call — which means another GPU hop, plus all the CPU work around it

  • Hybrid approaches (extract candidates on CPU, compress with a small model) balance quality and cost but add architectural complexity
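For a sense of what the extractive option involves, here is a minimal frequency-based sentence selector. It is one simple technique among many, chosen for illustration; the scoring rule (average frequency of words longer than three characters) is an assumption, not a recommendation.

```python
import re
from collections import Counter

def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences by word frequency, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)    # skip short filler words

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])           # restore source order
    return " ".join(sentences[i] for i in chosen)
```

This is pure CPU work: regex splitting, counting, and sorting over the whole history each time it runs.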

At scale, the decision of when to summarize, how aggressively, and what to preserve is not a one-time prompt engineering decision — it's an ongoing CPU workload running across every active long-running session simultaneously. If you have ten thousand active sessions and 30% of them need a summarization pass at any given step, that's three thousand summarization operations happening concurrently. Size for it.

There's also the memory interaction to consider. Last week we covered how episodic and semantic memory are generated from conversation history — consolidating specific interactions into durable patterns over time. That consolidation process runs asynchronously on CPU, typically after sessions end. It ingests conversation logs, extracts meaningful events, updates the memory store, and resolves conflicts with previously stored beliefs. This is a background CPU workload that scales with your conversation volume, not your concurrent session count — meaning it runs continuously regardless of how many agents are active right now.

3. Tool execution is entirely CPU-bound

Every tool your agent invokes runs on CPU. The GPU is not involved.

What makes this expensive at scale isn't any single tool call — it's the volume and the work surrounding each one:

  • Parsing the model's tool selection output (JSON extraction, schema validation)

  • Constructing the request (parameter assembly, auth header injection, serialization)

  • Executing the call and waiting on I/O

  • Parsing the response (deserialization, schema mapping, error detection)

  • Formatting the result back into context (truncation, summarization if oversized, structured injection)

  • Evaluating whether the result is sufficient or another call is needed
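The steps in this list can be sketched as a single handler. The `TOOLS` registry, its one `add` tool, and the `handle_tool_call` function are hypothetical, and the validation here is a bare argument-name check rather than a full schema validator.

```python
import json

# Hypothetical tool registry: name -> (callable, required argument names).
TOOLS = {
    "add": (lambda a, b: a + b, {"a", "b"}),
}

def handle_tool_call(model_output: str, max_chars: int = 200) -> dict:
    """Parse, validate, execute, and format one tool call for the context."""
    try:
        call = json.loads(model_output)                # JSON extraction
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}
    name, args = call.get("name"), call.get("args", {})
    if not isinstance(name, str) or name not in TOOLS: # routing
        return {"ok": False, "error": f"unknown tool: {name}"}
    func, required = TOOLS[name]
    if not isinstance(args, dict) or set(args) != required:  # argument check
        return {"ok": False, "error": f"expected args {sorted(required)}"}
    try:
        result = func(**args)                          # execution
    except Exception as exc:                           # error detection
        return {"ok": False, "error": str(exc)}
    text = json.dumps(result)
    return {"ok": True, "result": text[:max_chars]}    # truncate if oversized
```

Only the `func(**args)` line is the tool itself. Every other line is the surrounding work the list describes.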

At one agent doing two tool calls per step, negligible. At a thousand concurrent agents doing four tool calls per step, it's a significant CPU and memory bandwidth workload — running continuously, with all the concurrency management that implies.

4. MCP servers and gateway overhead

Model Context Protocol is becoming the standard agent-to-tool interface — and every component in the governed MCP stack consumes CPU that's easy to overlook.

Each MCP server is a running process handling concurrent requests from agent sessions. At low volume, invisible. At scale, a real line item. Add a governed MCP gateway — the auth validation, policy enforcement, rate limiting, and audit logging layer covered in a previous issue — and you've added a proxy tier that processes every single tool invocation before it reaches the server.

Consider what this fleet looks like under load: the gateway handling 50,000 tool invocations per minute is a high-throughput service, not a sidecar. It needs to be provisioned, monitored, and scaled like one. The registry running continuous health checks across your tool server fleet adds steady-state CPU consumption that runs regardless of agent activity.
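At 50,000 invocations per minute, roughly 833 per second, even a cheap per-call check runs constantly. As one example of that per-call work, here is a minimal token-bucket rate limiter of the kind a gateway might apply per caller. The class and its parameters are illustrative assumptions and are not part of the MCP specification or any particular gateway.

```python
import time

class TokenBucket:
    """Rate limiter a gateway might apply to each caller's tool invocations."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock                 # injectable so tests can fake time
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A real gateway would run a check like this alongside auth validation, policy evaluation, and audit logging on every request, which is why it needs its own capacity planning.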

If your MCP infrastructure is under-provisioned, the symptom is tool call latency that compounds through your entire agent loop. Each step takes slightly longer. Timeouts increase. Retries multiply. The bottleneck is invisible without instrumentation at the right layer.

5. RAG pipeline overhead

In a RAG-heavy agentic system — which covers most serious production deployments — the entire retrieval pipeline runs on CPU:

  • Embedding generation coordination (batching queries, dispatching to embedding model)

  • Query preprocessing (cleaning, expansion, compound query splitting)

  • Vector similarity search orchestration and result handling

  • Re-ranking (cross-encoder re-ranking is CPU-intensive and meaningfully improves precision)

  • Deduplication across retrieved chunks

  • Context window packing (which chunks fit, in what order, within the token budget)
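The last two items, deduplication and context packing, can be sketched together. The example assumes chunks arrive as `(text, score)` pairs from a retriever or re-ranker and that tokens are approximated by a whitespace split; `pack_chunks` is a hypothetical helper using a simple greedy strategy.

```python
def pack_chunks(chunks: list[tuple[str, float]], budget: int) -> list[str]:
    """Deduplicate retrieved chunks, then greedily pack the best ones.

    chunks: (text, relevance score) pairs; budget: max tokens for retrieval.
    Token counts use a whitespace split, which is an assumption.
    """
    best: dict[str, tuple[str, float]] = {}
    for text, score in chunks:
        key = " ".join(text.lower().split())           # normalize for dedup
        if key not in best or score > best[key][1]:
            best[key] = (text, score)                  # keep the best-scored copy

    packed, used = [], 0
    for text, _score in sorted(best.values(), key=lambda c: c[1], reverse=True):
        cost = len(text.split())
        if used + cost <= budget:                      # skip chunks that overflow
            packed.append(text)
            used += cost
    return packed
```

Greedy packing by score is not optimal, but it is predictable and cheap, which matters when it runs on every retrieval.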

At a thousand concurrent sessions each doing one retrieval operation per reasoning step, this pipeline runs hundreds to thousands of times per second, depending on how long each step takes. The re-ranking step alone, frequently skipped in prototype systems and essential in production ones, adds meaningful CPU cost per retrieval operation. At scale, plan for it explicitly.

6. Observability has its own CPU tax

Proper agent observability — step-level tracing, decision logging, cost attribution, anomaly detection — doesn't come free. The consumption is proportional to the richness of what you capture.

Every trace event needs to be serialized, tagged, buffered, and flushed to your observability backend. At minimal tracing, the overhead is negligible. At step-level tracing with full input/output capture and in-process anomaly detection, the overhead on a busy agent worker handling hundreds of concurrent sessions can consume 10-15% of available CPU.

The solution isn't to reduce observability — you need that data, both for debugging and for compliance. The solution is to instrument efficiently: async writes, batched flushes, sampling strategies for high-volume events, and out-of-process aggregation so your tracing pipeline doesn't compete with your agent execution for the same CPU resources.
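A minimal sketch of two of those techniques, batched flushes and sampling, follows. It is synchronous for brevity and leaves out async writes. `TraceBuffer` and its `sink` callable are hypothetical; in a real deployment the sink would hand batches to an out-of-process exporter.

```python
import json
import random

class TraceBuffer:
    """Buffers trace events and flushes them in batches.

    `sink` is any callable that accepts a list of serialized events; in a real
    deployment it would hand off to an out-of-process exporter (an assumption).
    """

    def __init__(self, sink, batch_size: int = 100, sample_rate: float = 1.0, rng=random.random):
        self.sink = sink
        self.batch_size = batch_size
        self.sample_rate = sample_rate
        self.rng = rng
        self.pending: list[str] = []

    def emit(self, event: dict, always: bool = False) -> None:
        # Sample high-volume events; `always` bypasses sampling for events
        # that must never be dropped, such as errors.
        if not always and self.rng() >= self.sample_rate:
            return
        self.pending.append(json.dumps(event, sort_keys=True))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            batch, self.pending = self.pending, []
            self.sink(batch)                           # one sink call per batch
```

Serialization still happens in-process here, so the sampling rate directly controls how much CPU the tracing path takes from the agent worker.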

7. Long-running sessions amplify everything

Short-lived sessions — complete a task in ten steps, terminate, clean up — have bounded CPU cost. Long-running sessions are a different category.

An agent session running for hours or days accumulates state continuously. Checkpointing happens repeatedly, not once. Memory consolidation runs periodically. Context window management is an active ongoing operation. Conversation summarization triggers multiple times as history grows. Heartbeat processes keep the session alive across infrastructure events. TTL management, stale entry eviction, and cache invalidation run in background loops.

The CPU cost of long-running sessions isn't determined by session count alone; it also grows with session duration. A system that handles ten thousand short sessions efficiently may struggle with one thousand long-running ones. Design and size for the session profile you actually expect in production, not the one that's easiest to test.

8. Triggers, schedules, and event-driven infrastructure

Production agents don't just respond to user requests. They respond to events.

A scheduled agent running nightly analysis. A trigger-based agent that fires when a document is updated. A continuous monitoring agent polling a data source on an interval. A webhook-driven agent activating when an external system sends a notification.

Every event source adds infrastructure that runs on CPU continuously:

  • Scheduler processes managing cron-based tasks, execution windows, and missed run handling

  • Event listener processes monitoring queues, webhooks, and polling targets

  • Trigger evaluation logic deciding whether an incoming event warrants agent activation

  • Task instantiation overhead for spinning up agent context on each trigger

  • Deduplication logic preventing the same event from spawning multiple instances
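As an illustration of the deduplication item, here is a small in-memory sketch keyed on an event ID with a time-to-live. `EventDeduplicator` is hypothetical and single-process; a multi-worker deployment would need a shared store, and the full-table eviction scan is kept for clarity rather than efficiency.

```python
import time

class EventDeduplicator:
    """Remembers recently seen event IDs so one event spawns one agent run."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                  # injectable so tests can fake time
        self.seen: dict[str, float] = {}    # event ID -> expiry time

    def should_trigger(self, event_id: str) -> bool:
        now = self.clock()
        # Evict expired IDs so the table does not grow without bound.
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now + self.ttl
        return True
```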

This infrastructure runs always-on regardless of whether any agent work is happening. It's the steady-state cost of being reactive — and it's frequently invisible in load tests that focus on peak request handling rather than the background cost of maintaining event infrastructure continuously.

9. Concurrency mechanics

Agentic workloads are inherently concurrent. Multiple agents firing tool calls simultaneously. Async callbacks returning in unpredictable order. Retry logic spinning up new requests while old ones are still in flight.

The CPU work here is in the concurrency machinery itself — event loop scheduling, thread synchronization, lock contention resolution, async callback management, connection pool management for database and API clients.

Python, which powers most LLM application frameworks, has well-known concurrency constraints from the Global Interpreter Lock. For I/O-bound agent workloads, async Python handles this reasonably. For CPU-bound orchestration — complex state evaluation, heavy JSON processing, synchronous computation in the agent loop — the GIL becomes a real constraint at scale. Teams that hit this wall move CPU-intensive orchestration to worker processes (Celery, Ray) or rewrite critical-path components in Go or Rust.
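A minimal sketch of the worker-process pattern, using only `asyncio` and `concurrent.futures` from the standard library rather than Celery or Ray. `summarize_payload` is a hypothetical stand-in for CPU-bound work in the agent loop, and the executor is passed in so the caller chooses the pool type.

```python
import asyncio
import concurrent.futures
import json

def summarize_payload(raw: str) -> int:
    """CPU-bound stand-in: parse a JSON payload and count its items."""
    return len(json.loads(raw)["items"])

async def process_payloads(
    payloads: list[str], executor: concurrent.futures.Executor
) -> list[int]:
    """Run CPU-bound parsing off the event loop so I/O tasks keep moving."""
    loop = asyncio.get_running_loop()
    jobs = [loop.run_in_executor(executor, summarize_payload, p) for p in payloads]
    return await asyncio.gather(*jobs)      # results come back in input order
```

Passing a `concurrent.futures.ProcessPoolExecutor` runs the work in separate interpreters, each with its own GIL, provided the function is defined at module level so it can be pickled. A `ThreadPoolExecutor` keeps the event loop responsive but does not add parallelism for pure-Python computation.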

The full picture

Pulling it all together:

| Layer | CPU Cost | Scales With |
| --- | --- | --- |
| Orchestration | Medium | Agent complexity × session count |
| Conversation storage | Medium | Turn volume × active sessions |
| Context summarization | High | Long-running session count |
| Memory consolidation | Medium | Conversation volume (async) |
| Tool execution | High | Tool calls/step × concurrent sessions |
| MCP gateway | Medium | Total tool invocations/sec |
| RAG pipeline | High | Retrieval ops/step × sessions |
| Observability | Low–Medium | Trace richness × event volume |
| Long-running state mgmt | Medium | Session count × duration |
| Trigger/event infra | Low–Medium | Event source count (always-on) |
| Concurrency management | Medium | Concurrent session count |

The GPU row is absent because the GPU handles inference. Everything else — every row in this table — is CPU.

The practical implication: if you size agentic infrastructure based on LLM inference requirements alone, you will systematically under-provision CPU and over-provision GPU. The symptoms arrive gradually (slightly elevated latency, occasional timeouts, cost creep) and then turn acute once session volume crosses your CPU headroom threshold.

⚡ The practitioner take

The GPU scarcity narrative has shaped how teams budget, how vendors price, and how engineers think about bottlenecks for three years. For inference at scale, it's broadly correct.

For agentic systems, it creates a dangerous blind spot. The pattern I've seen repeatedly: a team builds an agent system that works beautifully under moderate load. They scale up users. Latency climbs. They assume the LLM API is the bottleneck. They optimize prompts, reduce tokens, add caching, upgrade their GPU tier. Latency stays elevated. Eventually someone profiles the full request path and discovers the orchestration layer, the conversation summarization pipeline, the tool executors, and the observability stack are saturating CPU long before the GPU is under any meaningful pressure.

Profile end to end from the start. Measure CPU consumption per agent step, per tool type, per concurrent session count. Understand the ratio of CPU to GPU work in your specific loop — in most production agent systems, the model call is 20-40% of wall clock time per step. The other 60-80% is CPU. The conversation and memory layer alone — summarization, consolidation, context packing — can account for 20-30% of that remainder in memory-rich deployments.
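One lightweight way to start measuring that ratio is a per-stage timer wrapped around each part of the step. The sketch below is a hypothetical helper, not a replacement for a real profiler, and the stage names are whatever you choose. It measures wall-clock time by default; passing `time.process_time` as the clock measures process CPU time instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepProfiler:
    """Accumulates elapsed time per named stage of an agent step."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            self.totals[name] += self.clock() - start

    def share(self, name: str) -> float:
        """Fraction of total measured time spent in one stage."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0
```

In use, each stage of the loop goes inside a `with profiler.stage("model_call"):` block (or `"retrieval"`, `"tool_exec"`, and so on), and `profiler.share("model_call")` then reports how much of the step the model actually accounts for in your system.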

Size accordingly. The GPU gets the credit. The CPU runs the show.

Until next time,

Learn to use AI. Use AI to learn.


👀 Also Watching

  • AWS Graviton chip optimizations for the agentic AI era

  • Ray's distributed object store design — the most thoughtful public treatment of CPU-efficient task scheduling for agentic workloads

  • py-spy and Austin: Python sampling profilers that show you where your agent loop's CPU time is going. Most teams don't look until it's already a problem.

  • ARM-based compute for agent orchestration — AWS Graviton and Azure Cobalt offer strong multi-core CPU performance at meaningfully lower cost than x86 for orchestration-heavy workloads
