Every mature technology wave produces the same arc.

First, developers build things directly. Raw HTTP calls, hardcoded credentials, no abstraction layer. It works fine at small scale. Then scale arrives, and with it: security incidents, cost overruns, inconsistent behavior across teams, zero visibility into what's actually happening.

Then someone builds a gateway.

REST APIs went through this arc. GraphQL went through it. Internal microservices went through it. Every time, the lesson was the same — direct access doesn't scale, and the gateway isn't overhead, it's infrastructure.

Model Context Protocol is going through this arc right now, and most teams haven't realized it yet.

MCP is fast becoming the standard interface through which agents access tools, data, and services. Every major platform is building MCP servers. Anthropic, OpenAI, Google, enterprise software vendors — the ecosystem is expanding rapidly. Agents that could only call a few hardcoded tools can now, in theory, discover and invoke thousands of capabilities through a standardized protocol.

In theory.

In practice, most teams are connecting agents directly to MCP servers the same way developers in 2005 were making raw database calls from application code. It works until it doesn't — and when it doesn't, the failure is usually invisible until it's expensive.

This issue is about what governed MCP infrastructure looks like, why you need it, and how to build it.

What MCP Is and Why It Changes the Problem

For readers who need the quick grounding: MCP is an open protocol that standardizes how agents (clients) communicate with tools and data sources (servers). Instead of every agent hardcoding integrations with every tool — custom APIs, bespoke authentication, one-off implementations — MCP provides a common interface. The agent speaks MCP. The tool speaks MCP. They interoperate regardless of who built either.

Think of it as USB-C for agents and tools. Standardized connector, works with anything that implements the spec.
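To make the "common interface" concrete: MCP messages are framed as JSON-RPC 2.0, and tools are listed and invoked with the `tools/list` and `tools/call` methods. Here is a minimal sketch of those two requests. The tool name `web_search` and its arguments are illustrative assumptions, and transport and session initialization are left out.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # each request in a session gets its own id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope, the framing MCP uses."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Ask a server which tools it exposes.
list_req = jsonrpc_request("tools/list")

# Invoke one tool by name (hypothetical tool and arguments).
call_req = jsonrpc_request(
    "tools/call",
    {"name": "web_search", "arguments": {"query": "MCP gateway", "max_results": 5}},
)

print(json.dumps(call_req, indent=2))
```

Every request has the same shape regardless of which tool sits behind it. That uniformity is what lets a single gateway inspect and govern all of them.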

This is genuinely powerful. It dramatically lowers the cost of adding capabilities to an agent. It enables tool reuse across agent systems. It creates the conditions for an ecosystem of composable agent capabilities.

It also creates a new attack surface, a new cost vector, a new compliance blind spot, and a new operational dependency — all simultaneously.

The same properties that make MCP powerful (standardized access, broad tool discovery, dynamic invocation) make ungoverned MCP dangerous at scale. When any agent can invoke any registered tool, the blast radius of a misbehaving agent expands dramatically. When tool invocations happen outside any centralized visibility layer, you lose the ability to audit, attribute, or control what your agents are doing.

Governed MCP infrastructure is how you get the power without the risk.

The Four Problems Ungoverned MCP Creates

Before the architecture, let's be precise about what breaks.

Problem 1: No discovery governance. Agents discover available tools by asking a server or shared registry what it exposes. With no governance layer in that path, any agent can discover and attempt to invoke any tool. In a multi-team enterprise environment, this is an access control problem waiting to happen. The research agent shouldn't be able to discover, let alone invoke, the trading execution tool.

Problem 2: No invocation control. Direct MCP connections give you no interception point. There's nowhere to enforce rate limits, token budgets, or policy rules before a tool is invoked. An agent in a loop will happily invoke the same tool hundreds of times. An agent with misconfigured scope will invoke tools it was never intended to use.

Problem 3: No observability. When agents connect directly to MCP servers, tool invocations are invisible to any centralized monitoring layer. You know the agent ran. You don't know which tools it called, how many times, with what inputs, at what cost, or whether the outputs were used. In a regulated environment, this is a compliance problem. In any environment, it's an operational problem.

Problem 4: No tool lifecycle management. Tools change. Schemas evolve. Servers go down. Without a registry with lifecycle management, agents that depend on a specific tool version break silently when that tool changes — and there's no centralized place to manage deprecations, migrations, or fallbacks.

These four problems compound each other. Ungoverned MCP infrastructure isn't just a security risk — it's an operational liability that gets more expensive with every tool and every agent you add.

The Architecture: MCP Gateway with Registry, Discovery, and Governance

Here's the full governed MCP infrastructure stack, layer by layer.

┌─────────────────────────────────────────────────┐
│                  Agent Layer                    │
│         (LangGraph / CrewAI / Custom)           │
└─────────────────┬───────────────────────────────┘
                  │ MCP protocol
┌─────────────────▼───────────────────────────────┐
│              MCP Gateway (Proxy)                │
│  ┌──────────┐ ┌──────────┐ ┌─────────────────┐ │
│  │  AuthN/Z │ │Rate Limit│ │ Policy Enforcer │ │
│  └──────────┘ └──────────┘ └─────────────────┘ │
│  ┌──────────────────────────────────────────┐   │
│  │         Audit Logger / Tracer            │   │
│  └──────────────────────────────────────────┘   │
└─────────────────┬───────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────┐
│              Tool Registry                      │
│  ┌──────────┐ ┌──────────┐ ┌─────────────────┐ │
│  │ Catalog  │ │ Versions │ │  Health Status  │ │
│  └──────────┘ └──────────┘ └─────────────────┘ │
│  ┌──────────┐ ┌──────────┐ ┌─────────────────┐ │
│  │ Schemas  │ │  Owners  │ │  Access Policy  │ │
│  └──────────┘ └──────────┘ └─────────────────┘ │
└─────────────────┬───────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────┐
│            MCP Server Fleet                     │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌──────────┐ │
│  │  File  │ │  Web   │ │   DB   │ │ Internal │ │
│  │ System │ │ Search │ │ Query  │ │   APIs   │ │
│  └────────┘ └────────┘ └────────┘ └──────────┘ │
└─────────────────────────────────────────────────┘

Every agent talks to the gateway. The gateway consults the registry to learn which servers exist and who may call them, then forwards approved requests to the right server. No agent ever talks to a server directly.

Let's go through each layer.

Layer 1: The MCP Gateway (The Proxy)

The gateway is the single entry point for all agent-to-tool communication. Its job is to intercept every MCP request and enforce policy before the request reaches any tool server.

Authentication and Authorization

Every agent must present a verifiable identity when connecting to the gateway. Not a shared API key. Not a hardcoded credential. A scoped identity — an agent identity token that carries:

  • Which agent this is (agent ID, version)

  • Which tenant/user it's operating on behalf of

  • Which tools it's authorized to invoke (explicit allowlist, not default-allow)

  • What scope constraints apply (read-only, specific data namespaces, time-bounded)

The gateway validates this token on every request. Not just on connection establishment — on every tool invocation. A token that grants read access to the document store does not grant write access. A token scoped to tenant A's data cannot be used to access tenant B's tools.

Authorization is enforced at the gateway, not left to prompt instructions. This is the critical distinction. An agent prompted to "only access documents the user owns" can be jailbroken or misconfigured into accessing others. A gateway check runs outside the model, so no prompt can talk it out of denying the request.
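A per-invocation check along these lines might look like the sketch below. The `AgentToken` fields, scope strings, and reason codes are all hypothetical. It also assumes the token's signature has already been verified upstream, which a real gateway must do first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AgentToken:
    agent_id: str
    tenant_id: str
    allowed_tools: frozenset   # explicit allowlist; anything absent is denied
    scopes: frozenset          # e.g. {"docs:read"}
    expires_at: datetime       # time-bounded by construction


def authorize(token, tool, required_scope, resource_tenant, now=None):
    """Return (allowed, reason). Meant to run on every tool invocation."""
    now = now or datetime.now(timezone.utc)
    if now >= token.expires_at:
        return False, "token_expired"
    if tool not in token.allowed_tools:
        return False, "tool_not_allowlisted"
    if required_scope not in token.scopes:
        return False, "missing_scope"
    if resource_tenant != token.tenant_id:
        return False, "tenant_mismatch"
    return True, "ok"


token = AgentToken(
    agent_id="agent_research_v2",
    tenant_id="tenant_A",
    allowed_tools=frozenset({"doc_store"}),
    scopes=frozenset({"docs:read"}),
    expires_at=datetime.now(timezone.utc) + timedelta(minutes=15),
)

print(authorize(token, "doc_store", "docs:read", "tenant_A"))
print(authorize(token, "doc_store", "docs:write", "tenant_A"))
print(authorize(token, "doc_store", "docs:read", "tenant_B"))
```

Read access passes. The write attempt and the cross-tenant attempt are both denied with a reason code that can go straight into the audit record.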

Rate Limiting and Budget Enforcement

The gateway tracks and enforces resource consumption in real time:

  • Tool invocation rate per agent session (calls/minute)

  • Cumulative invocations per session (total call budget)

  • Data volume per invocation (response size limits)

  • Concurrent session limits per agent type

When a session hits a budget threshold, the gateway rejects further requests and returns a structured error the agent can handle gracefully. The agent isn't left burning resources — it's told clearly that it's hit a limit and given the option to escalate or terminate.
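As a rough illustration, here is one way to combine a sliding-window rate limit with a total call budget and return a structured error instead of silently dropping the request. The class name, error fields, and thresholds are assumptions for the sketch, not part of any MCP specification.

```python
import time
from collections import deque


class SessionLimiter:
    """Per-session limiter: calls per minute plus a cumulative call budget."""

    def __init__(self, calls_per_minute, total_budget, clock=time.monotonic):
        self.calls_per_minute = calls_per_minute
        self.total_budget = total_budget
        self.clock = clock
        self.recent = deque()  # timestamps of calls within the last 60 seconds
        self.total = 0

    def check(self):
        """Return None if the call may proceed, else a structured error dict."""
        now = self.clock()
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if self.total >= self.total_budget:
            return {
                "error": "budget_exhausted",
                "retryable": False,
                "limit": self.total_budget,
                "suggested_action": "escalate_or_terminate",
            }
        if len(self.recent) >= self.calls_per_minute:
            return {
                "error": "rate_limited",
                "retryable": True,
                "retry_after_s": round(60 - (now - self.recent[0]), 1),
            }
        self.recent.append(now)
        self.total += 1
        return None


limiter = SessionLimiter(calls_per_minute=30, total_budget=200)
print(limiter.check())  # first call in a fresh session is allowed
```

The `retryable` flag matters here. A rate-limited agent can back off and retry, while an agent that has exhausted its budget should stop and escalate.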

Policy Enforcement

Beyond auth and rate limits, the gateway enforces behavioral policies:

  • Tool pairing rules — certain tools can only be invoked after certain preconditions (e.g., a write tool can only be called after a corresponding read and confirmation step)

  • Data classification enforcement — tools that return sensitive data require elevated agent permissions, regardless of whether the tool server itself enforces this

  • Irreversibility flags — tools marked as irreversible (send email, execute trade, delete record) trigger a confirmation checkpoint before invocation, even if the agent didn't request one

  • Prohibited patterns — sequences of tool calls that match known attack or misuse patterns are blocked and flagged
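Three of these rules can be sketched as a small decision function: tool pairing, irreversibility flags, and prohibited sequences. The tool names, rule tables, and return strings below are illustrative assumptions. A production gateway would load these rules from the registry rather than hardcode them.

```python
from dataclasses import dataclass, field

# Tools that must pass a confirmation checkpoint before running.
IRREVERSIBLE = {"send_email", "execute_trade", "delete_record"}
# Write tool -> read tool that must appear earlier in the session.
REQUIRES_PRIOR = {"doc_write": "doc_read"}
# (previous call, next call) pairs treated as misuse, purely illustrative.
PROHIBITED_SEQUENCES = [("db_query", "send_email")]


@dataclass
class Session:
    history: list = field(default_factory=list)   # tools invoked so far, in order
    confirmed: set = field(default_factory=set)   # tools a human has approved


def evaluate(session, tool):
    """Return 'allow', 'needs_confirmation', or 'deny:<reason>'."""
    prior = REQUIRES_PRIOR.get(tool)
    if prior and prior not in session.history:
        return f"deny:requires_prior_{prior}"
    for first, second in PROHIBITED_SEQUENCES:
        if tool == second and session.history and session.history[-1] == first:
            return "deny:prohibited_sequence"
    if tool in IRREVERSIBLE and tool not in session.confirmed:
        return "needs_confirmation"
    return "allow"


def record(session, tool):
    """Append a completed invocation to the session history."""
    session.history.append(tool)


s = Session()
print(evaluate(s, "doc_write"))   # blocked until doc_read has happened
record(s, "doc_read")
print(evaluate(s, "doc_write"))
print(evaluate(s, "send_email"))  # irreversible, so a checkpoint is required
```

Note that the confirmation checkpoint fires whether or not the agent asked for one, which is the point of enforcing it at the gateway.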

Audit Logging

Every tool invocation passing through the gateway generates an immutable audit record:

json

{
  "event_id": "evt_01J...",
  "timestamp": "2026-04-24T09:14:32.411Z",
  "agent_id": "agent_research_v2",
  "session_id": "sess_8f3a...",
  "tenant_id": "tenant_XYZ",
  "tool": "web_search",
  "tool_version": "2.1.0",
  "inputs_hash": "sha256:...",
  "outputs_hash": "sha256:...",
  "latency_ms": 342,
  "tokens_consumed": 0,
  "policy_decisions": ["rate_limit:pass", "auth:pass", "scope:pass"],
  "outcome": "success"
}

Note: inputs and outputs are hashed, not stored verbatim in the audit log. Full content is stored separately with appropriate access controls. The audit record proves what happened without exposing sensitive content to everyone with log access.
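A record like this could be assembled roughly as follows. The helper names are hypothetical, and the sketch only builds the record. Immutability itself comes from where you write it (append-only or write-once storage), which is outside this example.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def content_hash(payload) -> str:
    """Hash a JSON-serializable payload in canonical form (sorted keys)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(*, agent_id, session_id, tenant_id, tool, tool_version,
                 inputs, outputs, latency_ms, policy_decisions, outcome="success"):
    """Build one audit entry; raw inputs and outputs never enter the record."""
    return {
        "event_id": "evt_" + uuid.uuid4().hex,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "agent_id": agent_id,
        "session_id": session_id,
        "tenant_id": tenant_id,
        "tool": tool,
        "tool_version": tool_version,
        "inputs_hash": content_hash(inputs),
        "outputs_hash": content_hash(outputs),
        "latency_ms": latency_ms,
        "policy_decisions": list(policy_decisions),
        "outcome": outcome,
    }


rec = audit_record(
    agent_id="agent_research_v2",
    session_id="sess_demo",
    tenant_id="tenant_XYZ",
    tool="web_search",
    tool_version="2.1.0",
    inputs={"query": "acme corp earnings"},
    outputs={"results": []},
    latency_ms=342,
    policy_decisions=["rate_limit:pass", "auth:pass", "scope:pass"],
)
print(json.dumps(rec, indent=2))
```

Canonicalizing before hashing means the same logical input always yields the same hash, so you can later prove that separately stored content matches what the gateway saw.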

Layer 2: The Tool Registry

The registry is the source of truth for every tool available in your agent ecosystem. It's not a static config file — it's a living catalog with lifecycle management.

Tool Catalog

Every registered tool has a complete entry:

yaml

tool_id: web_search
display_name: "Web Search"
owner_team: platform-infra
description: "Search the web and return structured results"
server_endpoint: mcp://tools-internal/web-search
current_version: "2.1.0"
status: stable          # stable | beta | deprecated | sunset | suspended
input_schema:
  query: {type: string, required: true}
  max_results: {type: integer, default: 5, max: 20}
  safe_search: {type: boolean, default: true}
output_schema:
  results: array[{title, url, snippet}]
access_policy:
  default: deny
  allowed_agent_types: [research, summarization]
  required_scopes: [web_access]
  excluded_tenants: []
cost_profile:
  avg_latency_ms: 280
  external_api_cost: true
  cost_per_call_cents: 0.2
sla:
  availability_target: 99.5%
  max_response_ms: 2000

This is not overhead — it's the information your gateway needs to make real-time policy decisions. A gateway without a registry is guessing. A gateway with a complete registry is enforcing known policy against known tools.

Versioning and Schema Management

Tools evolve. The registry manages this:

  • Semantic versioning — major versions may break schema compatibility, minor versions are backward compatible, patches are transparent

  • Multi-version support — the registry can route to multiple active versions simultaneously, enabling gradual migration

  • Schema validation — the gateway validates every invocation against the registered input schema before forwarding to the server. Malformed tool calls are rejected at the gateway, not silently mishandled by the server

  • Deprecation workflow — tools don't disappear; they move through beta → stable → deprecated → sunset with defined timelines. Agents depending on deprecated tools get warnings in their audit logs before sunset
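Two of the items above fit in a few lines: schema validation at the gateway and a semantic-version compatibility check. The schema format here is a simplified stand-in modeled on the `web_search` entry, not JSON Schema, and the function names are assumptions.

```python
INPUT_SCHEMA = {
    "query": {"type": str, "required": True},
    "max_results": {"type": int, "default": 5, "max": 20},
    "safe_search": {"type": bool, "default": True},
}


def validate(args, schema=INPUT_SCHEMA):
    """Return (clean_args, errors); defaults are filled in for omitted fields."""
    errors, clean = [], {}
    for name in args:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    for name, rule in schema.items():
        if name not in args:
            if rule.get("required"):
                errors.append(f"missing required field: {name}")
            elif "default" in rule:
                clean[name] = rule["default"]
            continue
        value = args[name]
        # Exact type match: bool is a subclass of int, so isinstance would let
        # True slip through as an integer.
        if type(value) is not rule["type"]:
            errors.append(f"wrong type for {name}")
            continue
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name} exceeds max of {rule['max']}")
            continue
        clean[name] = value
    return clean, errors


def is_compatible(requested: str, available: str) -> bool:
    """Same major version, and available is at least the requested version."""
    req = tuple(int(p) for p in requested.split("."))
    avail = tuple(int(p) for p in available.split("."))
    return req[0] == avail[0] and avail >= req


print(validate({"query": "mcp gateway"}))
print(validate({"max_results": 50}))
print(is_compatible("2.0.0", "2.1.0"), is_compatible("2.1.0", "3.0.0"))
```

A call that fails validation never reaches the server, and the error list gives the agent something specific to correct.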

Health and Availability

The registry maintains real-time health status for every registered server:

  • Active health checks per server (configurable interval)

  • Circuit breaker state per tool (closed / open / half-open)

  • Degraded mode routing — if the primary server is unhealthy, route to a registered fallback if one exists

  • Planned maintenance windows — agents can query tool availability before starting a long task, not discover unavailability mid-execution
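The circuit breaker states and fallback routing can be sketched like this. The thresholds, the class shape, and the `pick_server` helper are assumptions. The half-open handling is also simplified: it allows probes until a success or failure settles the state, rather than strictly one probe.

```python
import time


class CircuitBreaker:
    """Tracks closed / open / half-open state for one tool server."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_after_s:
                self.state = "half-open"  # cool-down elapsed, allow a probe
                return True
            return False
        return True  # closed or half-open

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


def pick_server(primary, fallback, breakers):
    """Degraded-mode routing: prefer primary, then fallback, else None."""
    if breakers[primary].allow_request():
        return primary
    if fallback is not None and breakers[fallback].allow_request():
        return fallback
    return None


breakers = {"search-a": CircuitBreaker(failure_threshold=1), "search-b": CircuitBreaker()}
breakers["search-a"].record_failure()
print(pick_server("search-a", "search-b", breakers))  # routes to the fallback
```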

Access Policy Management

Tool access policies live in the registry, not hardcoded in agent configurations. This means:

  • Policy changes apply immediately without agent redeployment

  • Access can be granted or revoked per agent type, per tenant, per scope — all in one place

  • Audit queries can answer "which agents had access to this tool on this date?" retroactively
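Answering that retroactive question requires keeping policy history, not just current state. One possible shape, assuming grants are stored with effective dates (the `Grant` record and its fields are hypothetical):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Grant:
    tool_id: str
    agent_type: str
    granted_on: date
    revoked_on: Optional[date] = None  # None means the grant is still active


def agents_with_access(grants, tool_id, on):
    """Agent types whose grant for tool_id was in effect on the given date."""
    return sorted({
        g.agent_type
        for g in grants
        if g.tool_id == tool_id
        and g.granted_on <= on
        and (g.revoked_on is None or on < g.revoked_on)
    })


GRANTS = [
    Grant("web_search", "research", date(2026, 1, 10)),
    Grant("web_search", "summarization", date(2026, 2, 1), revoked_on=date(2026, 3, 15)),
    Grant("execute_trade", "trading", date(2026, 1, 5)),
]

print(agents_with_access(GRANTS, "web_search", date(2026, 3, 1)))
print(agents_with_access(GRANTS, "web_search", date(2026, 4, 1)))
```

Revoking access closes a grant instead of deleting it, so the history stays queryable.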

Layer 3: Governed Discovery

Discovery is how an agent learns what tools are available. Ungoverned discovery means the agent sees everything. Governed discovery means the agent sees exactly what it's authorized to use — and nothing else.

Scoped Discovery Responses

When an agent queries the gateway for available tools, the response is filtered by:

  1. The agent's identity and authorized tool list

  2. Current tool health (unhealthy tools not surfaced for new tasks)

  3. Tenant-specific tool restrictions

  4. Regulatory or environment constraints (production agents don't discover staging tools)

The agent receives a tailored catalog, not the global registry. A research agent sees search, summarization, and document tools. It doesn't see execution tools. It has no knowledge that execution tools exist — not because you trust the agent, but because information minimization is a security principle.
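The four filters can be expressed as a single pass over the registry. This is a sketch under assumed field names (`ToolEntry`, `environment`, `healthy`) that mirror the catalog entry shown earlier, not a reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEntry:
    tool_id: str
    allowed_agent_types: set
    required_scopes: set
    environment: str = "production"
    healthy: bool = True
    excluded_tenants: set = field(default_factory=set)


def discover(registry, agent_type, scopes, tenant_id, environment):
    """Return only the tool ids this agent is allowed to know about."""
    visible = []
    for tool in registry:
        if agent_type not in tool.allowed_agent_types:
            continue  # default deny: not on the allowlist, not visible
        if not tool.required_scopes <= scopes:
            continue  # agent lacks a required scope
        if tenant_id in tool.excluded_tenants:
            continue  # tenant-specific restriction
        if tool.environment != environment:
            continue  # production agents never see staging tools
        if not tool.healthy:
            continue  # do not offer tools that are currently down
        visible.append(tool.tool_id)
    return visible


REGISTRY = [
    ToolEntry("web_search", {"research", "summarization"}, {"web_access"}),
    ToolEntry("doc_summarize", {"research"}, set()),
    ToolEntry("execute_trade", {"trading"}, {"trade_exec"}),
    ToolEntry("web_search_next", {"research"}, {"web_access"}, environment="staging"),
    ToolEntry("news_search", {"research"}, {"web_access"}, healthy=False),
]

print(discover(REGISTRY, "research", {"web_access"}, "tenant_A", "production"))
```

The research agent's catalog contains no trace of `execute_trade`. It is filtered out before the response is built, not merely marked as forbidden.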

Semantic Discovery

Beyond filtered listing, governed discovery enables semantic tool search — an agent that doesn't know exactly which tool it needs can describe its intent and receive ranked tool recommendations:

Agent: "I need to find recent news articles about a company"
Registry: [
  { tool: "news_search", relevance: 0.97, note: "News-specific, structured output" },
  { tool: "web_search", relevance: 0.94, note: "General web search" },
  { tool: "sec_filings_search", relevance: 0.61, note: "Regulatory filings only" }
]

This requires the registry to maintain semantic embeddings of tool descriptions — a small investment that dramatically improves tool selection quality in complex agent systems.
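The ranking mechanics look roughly like this. To stay self-contained, the sketch substitutes a bag-of-words vector for a real embedding model, so the scores will not match the illustrative numbers above. The tool descriptions are also assumptions. Swapping `embed` for a call to an actual embedding model is the only change the approach needs.

```python
import math
import re
from collections import Counter

# Hypothetical tool descriptions as they might appear in the registry.
TOOLS = {
    "web_search": "general web search returning structured results",
    "news_search": "search recent news articles about a company or topic",
    "sec_filings_search": "search regulatory filings submitted by a company",
}


def embed(text):
    """Toy stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_tools(intent, tools=TOOLS):
    """Return (tool_id, score) pairs, most relevant first."""
    query = embed(intent)
    scored = [(tool_id, cosine(query, embed(desc))) for tool_id, desc in tools.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for tool_id, score in rank_tools("I need to find recent news articles about a company"):
    print(f"{tool_id}: {score:.2f}")
```

In a governed setup, ranking runs over the agent's scoped catalog, not the global registry, so semantic search cannot leak tools the agent is not authorized to see.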

Capability Negotiation

For agent systems that need to compose multi-step workflows dynamically, the registry supports capability queries:

  • "Find all tools that can accept a URL and return structured text"

  • "Find all tools authorized for this tenant that support batch operation"

  • "Find tools that can write to the document store with confirmation checkpointing"

This enables agents to build workflows from available capabilities rather than hardcoded tool names — a meaningful step toward more adaptive agent behavior.
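If the registry stores capability metadata alongside each tool, queries like the three above reduce to structured filters. The `Capability` fields, tags, and tool names below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    tool_id: str
    accepts: frozenset                 # input kinds, e.g. {"url"}
    returns: str                       # output kind, e.g. "structured_text"
    features: frozenset = frozenset()  # e.g. {"batch"}
    tenants: frozenset = frozenset()   # tenants authorized to use the tool


CATALOG = [
    Capability("page_reader", frozenset({"url"}), "structured_text",
               frozenset({"batch"}), frozenset({"tenant_A", "tenant_B"})),
    Capability("pdf_extract", frozenset({"url", "file"}), "structured_text",
               frozenset(), frozenset({"tenant_A"})),
    Capability("doc_write", frozenset({"document"}), "write_receipt",
               frozenset({"confirmation_checkpoint"}), frozenset({"tenant_A"})),
]


def find_tools(catalog, accepts=None, returns=None, feature=None, tenant=None):
    """Return ids of tools matching every criterion that was supplied."""
    matches = []
    for cap in catalog:
        if accepts is not None and accepts not in cap.accepts:
            continue
        if returns is not None and cap.returns != returns:
            continue
        if feature is not None and feature not in cap.features:
            continue
        if tenant is not None and tenant not in cap.tenants:
            continue
        matches.append(cap.tool_id)
    return matches


print(find_tools(CATALOG, accepts="url", returns="structured_text"))
print(find_tools(CATALOG, tenant="tenant_B", feature="batch"))
print(find_tools(CATALOG, feature="confirmation_checkpoint"))
```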

Layer 4: Governance Operations

The gateway and registry are infrastructure. Governance operations are the human processes and tooling built on top of them.

The Tool Approval Workflow

New MCP servers don't join the registry by filing a pull request. They go through an onboarding process:

  1. Submission — tool owner submits a registration request with full schema, ownership info, cost profile, and access policy proposal

  2. Security review — security team validates authentication implementation, data handling, and scope constraints

  3. Schema review — platform team validates input/output schemas for consistency with registry standards

  4. Access policy review — relevant stakeholders (compliance, data governance) review and approve the proposed access policy

  5. Staging validation — tool is registered in staging environment and validated by representative agent workflows

  6. Production registration — tool enters the registry as beta status with monitoring

  7. Promotion to stable — after defined reliability and usage thresholds are met

This sounds heavyweight. For a two-person startup, it is. For any organization deploying agents in regulated or multi-team environments, it's the minimum viable process for not having a governance incident.

The Incident Response Playbook

When something goes wrong — and it will — the gateway gives you the tools to respond:

  • Isolate a tool — flip a tool to suspended status in the registry. All invocations immediately return a structured error. No agent redeployment required.

  • Isolate an agent — revoke an agent identity token. That agent can no longer invoke any tool through the gateway.

  • Isolate a tenant — suspend all tool access for a specific tenant without affecting others.

  • Replay an audit trail — reconstruct exactly what a specific agent session did, in order, with inputs and outputs. This is your forensic capability.

  • Cost kill switch — hard limit on total API spend per hour. Triggers automatic throttling and alerts when hit.
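Most of these controls are just state the gateway checks before forwarding a call. Here is a minimal sketch, assuming hypothetical names and a spend counter that something else resets each hour. Alerting and the hourly reset are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentControls:
    suspended_tools: set = field(default_factory=set)
    revoked_agents: set = field(default_factory=set)
    suspended_tenants: set = field(default_factory=set)
    hourly_spend_cap_cents: float = 5000.0
    spend_this_hour_cents: float = 0.0

    def admit(self, agent_id, tenant_id, tool, cost_cents):
        """Return None to admit the call, else a structured error dict."""
        if agent_id in self.revoked_agents:
            return {"error": "agent_revoked", "agent_id": agent_id}
        if tenant_id in self.suspended_tenants:
            return {"error": "tenant_suspended", "tenant_id": tenant_id}
        if tool in self.suspended_tools:
            return {"error": "tool_suspended", "tool": tool}
        if self.spend_this_hour_cents + cost_cents > self.hourly_spend_cap_cents:
            return {"error": "spend_cap_reached", "cap_cents": self.hourly_spend_cap_cents}
        self.spend_this_hour_cents += cost_cents
        return None


controls = IncidentControls(hourly_spend_cap_cents=1.0)
print(controls.admit("agent_research_v2", "tenant_A", "web_search", 0.2))

# Incident: isolate one tool. No agent redeploy, the next call is rejected.
controls.suspended_tools.add("web_search")
print(controls.admit("agent_research_v2", "tenant_A", "web_search", 0.2))
```

Because the checks run on every request, flipping a switch takes effect on the very next invocation.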

Compliance Reporting

For regulated industries — and if you're at a hedge fund, you know what this means — the audit log supports regulatory queries:

  • All tool invocations by agent type in date range

  • All data accesses for a specific user's data across all agents

  • Evidence that access controls were enforced for a specific session

  • Tool availability and change log for a given date

These reports shouldn't require custom queries against raw logs. They should be first-class capabilities of your governance tooling.
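As one example of a first-class report, here is "all tool invocations by agent type in date range" against a simple audit table. The sketch uses in-memory SQLite and a trimmed-down, assumed schema. The Postgres registry suggested later would use the same query shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit (
           event_id TEXT PRIMARY KEY, ts TEXT, agent_type TEXT,
           tenant_id TEXT, tool TEXT, outcome TEXT)"""
)
conn.executemany(
    "INSERT INTO audit VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("evt_1", "2026-04-24T09:14:32Z", "research", "tenant_A", "web_search", "success"),
        ("evt_2", "2026-04-24T09:15:01Z", "research", "tenant_A", "web_search", "success"),
        ("evt_3", "2026-04-24T10:02:10Z", "summarization", "tenant_B", "doc_read", "denied"),
        ("evt_4", "2026-04-25T08:00:00Z", "research", "tenant_A", "news_search", "success"),
    ],
)


def invocations_by_agent_type(conn, start, end):
    """Count invocations per (agent_type, tool) for start <= ts < end.

    Timestamps are ISO 8601 strings in UTC, so string comparison orders them
    chronologically.
    """
    rows = conn.execute(
        """SELECT agent_type, tool, COUNT(*) FROM audit
           WHERE ts >= ? AND ts < ?
           GROUP BY agent_type, tool
           ORDER BY agent_type, tool""",
        (start, end),
    )
    return rows.fetchall()


for row in invocations_by_agent_type(conn, "2026-04-24", "2026-04-25"):
    print(row)
```

Wrapping each regulatory question in a named, parameterized function like this is what turns raw logs into a report an auditor can rerun.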

Building It: Practical Starting Points

You don't build all of this on day one. Here's the pragmatic sequencing:

Week 1–2: The Thin Gateway. Implement a simple proxy that intercepts all MCP traffic, validates agent identity tokens, and writes structured audit logs. No policy enforcement yet, just visibility. You'll immediately discover things you didn't know were happening.

Week 3–4: The Basic Registry. Formalize your tool catalog in a structured format. Add health checks. Define ownership. This forces clarity about what you actually have.

Month 2: Access Control. Implement scoped discovery and authorization enforcement at the gateway. This is the highest-value governance capability, so build it before rate limiting or policy rules.

Month 3: Rate Limiting and Budget Enforcement. Add per-session token budgets and invocation rate limits. By now you'll have enough production data to know where the real limits should be.

Month 4+: Policy Rules and Lifecycle Management. Implement the advanced policy rules, deprecation workflows, and semantic discovery. These are refinements on top of a working foundation.

OSS options to evaluate rather than build from scratch: Kong or Envoy as the proxy foundation (with MCP-aware plugins being an emerging category), a simple Postgres schema for the registry, OpenTelemetry for instrumentation. The space is early — there's no dominant governed MCP infrastructure solution yet. That's both a challenge and an opportunity.

⚡ Santosh's Take

The MCP ecosystem is moving at the speed of the broader agentic AI wave — fast, exciting, and slightly ahead of the governance thinking required to deploy it responsibly. I've seen this pattern before. At AWS, the period between "developers can now easily call external services" and "enterprises can deploy this in regulated workloads" was defined entirely by the infrastructure built in between — IAM, VPC endpoints, CloudTrail, Service Control Policies. The capability was always there. The governance layer is what made it enterprise-deployable.

MCP is at the same inflection point. The protocol is solid. The tool ecosystem is growing fast. What's missing is the governed infrastructure layer that lets enterprises say with confidence: "Our agents can only access what they're authorized to access, every invocation is auditable, and we can respond to an incident in minutes, not days."

If you're building agent infrastructure for any organization where governance matters — financial services, healthcare, legal, enterprise software — building this layer is not optional. The question is whether you build it proactively or reactively after the first incident.

Build it proactively. The forensic audit trail alone is worth the investment.

Until next time,

Learn to use AI. Use AI to learn.

If someone forwarded this to you, subscribe at whattheagent.com. If this was useful, forward it to one engineer who needs it.
