What infrastructure do I need to run AI agents in production?
You need six layers: compute, containers, networking, monitoring, model infrastructure, and security. None of them require GPU clusters for orchestration-layer work. ClawRevOps has deployed this stack across 400+ production builds for companies doing $5M to $50M in revenue. What follows is the practical checklist, not a vendor whitepaper.
Most "AI infrastructure" guides describe training infrastructure: GPU clusters, distributed compute, model serving at scale. That is not what you need for deploying AI agents that run business operations. Agent orchestration is CPU-bound, not GPU-bound. The models run on provider APIs (Anthropic, OpenAI). Your infrastructure handles orchestration, memory, integrations, monitoring, and security. The compute requirements are modest. The reliability requirements are not.
How should I size compute for AI agents?
Start with a dedicated VPS, not serverless. Agent systems maintain state, run persistent processes, and need predictable performance under load. Serverless platforms charge per invocation and introduce cold-start latency that kills agent responsiveness.
For a single-department deployment (one to three agents), a VPS with 4 vCPU, 8GB RAM, and 80GB SSD handles the workload comfortably. For multi-department deployments running agents across marketing, sales, finance, and ops simultaneously, size up to 8 vCPU, 16GB RAM, and 160GB SSD. These are not large machines. Hetzner, DigitalOcean, or Vultr provide this capacity for $20 to $100 per month.
GPU is not required for agent orchestration. The reasoning happens on model provider APIs. Your VPS handles task scheduling, memory retrieval, API calls to business tools, log aggregation, and health monitoring. The Jarvis multi-venture deployment runs 5 businesses across 138+ integrations on this class of infrastructure. TelexPH runs 5 specialized agents with 30 custom API tools serving a 300+ employee BPO. Neither requires GPU compute on the orchestration side.
Avoid running agents on the same VPS as your application database or web server. Agent workloads spike unpredictably during parallel task execution. Isolate agent infrastructure so a burst of activity does not degrade customer-facing services.
What are the containerization best practices for AI agents?
Every agent deployment should be containerized with Docker. This is not optional for production. Containers provide process isolation, reproducible environments, clean dependency management, and the ability to restart individual agents without affecting the rest of the system.
The pattern from 400+ ClawRevOps builds:
One container per agent function. A Marketing Claw, Finance Claw, and Ops Claw each run in their own container. If the marketing agent crashes or enters a bad state, you restart that container. The finance and ops agents keep running.
Shared network, isolated filesystems. Agents communicate through a Docker network for inter-agent coordination, but each container has its own filesystem. This prevents one agent's logs, temp files, or memory artifacts from interfering with another.
Volume mounts for persistent data. Agent memory, knowledge bases, and audit logs must survive container restarts. Mount these as Docker volumes, not container-internal storage. The Pest Control build maintains a 39-file knowledge base that persists across agent restarts and updates. Losing that knowledge base to a container restart would be a production incident.
Health check endpoints per container. Each container exposes a health check endpoint that the monitoring layer polls. If an agent fails to respond within the health check window, the container gets restarted automatically. More on monitoring below.
Pinned image versions. Never use :latest tags in production. Pin every base image and dependency to a specific version. An upstream update that changes behavior will break your agents in ways that are difficult to diagnose because the failure mode is semantic, not syntactic. The agent still runs. It just makes different decisions.
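The five container patterns above can be sketched in a single Compose file. This is an illustrative fragment, not a ClawRevOps artifact: the registry, image names, versions, volume paths, and health endpoint are placeholders for your own.

```yaml
# Hypothetical docker-compose.yml illustrating the patterns above.
services:
  marketing-claw:
    image: registry.example.com/marketing-claw:1.4.2   # pinned version, never :latest
    networks: [agents]                                  # shared network for coordination
    volumes:
      - marketing-memory:/data/memory                   # persistent memory survives restarts
    healthcheck:                                        # polled by the monitoring layer
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 2
    restart: unless-stopped
  finance-claw:
    image: registry.example.com/finance-claw:2.0.1      # restarts independently of marketing
    networks: [agents]
    volumes:
      - finance-memory:/data/memory
    restart: unless-stopped
networks:
  agents: {}            # inter-agent Docker network; filesystems stay isolated
volumes:
  marketing-memory:     # named volumes, not container-internal storage
  finance-memory:
```

Each service gets its own volume and its own restart policy, so restarting one agent never touches another's state.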
How should I handle networking for AI agents?
No public ports. Period. AI agents make outbound API calls to model providers and business tools. They do not need to accept inbound connections from the public internet. Every publicly exposed port is an attack surface.
Tailscale for private networking. ClawRevOps deploys Tailscale across all builds for agent-to-agent communication and administrative access. Tailscale creates a private mesh network using WireGuard encryption. Your agents communicate over this mesh. Your team accesses agent dashboards and logs through it. Nothing is exposed to the public internet.
Outbound-only firewall rules. Configure UFW (Uncomplicated Firewall) to allow outbound HTTPS traffic to model provider APIs, business tool APIs, and monitoring endpoints. Block all inbound traffic except Tailscale. The Pest Control build runs this exact configuration: 413 API operations flow outbound through the firewall, zero inbound ports exposed.
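The outbound-only policy reduces to a short UFW ruleset. A sketch, assuming Tailscale's default interface name of tailscale0; verify the interface on your host before enabling.

```shell
# Default-deny inbound, allow all outbound (model APIs, business tools, monitoring)
ufw default deny incoming
ufw default allow outgoing

# The only inbound path: the Tailscale mesh interface
ufw allow in on tailscale0

ufw enable
```

Run these over an existing Tailscale session, not plain SSH, or you will lock yourself out when the deny rule takes effect.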
DNS resolution for business tools. Agents integrate with SaaS platforms (CRM, email, accounting, project management) via their APIs. Ensure your VPS resolves DNS reliably and that your outbound rules permit traffic to these platforms. Some platforms rotate IPs, so domain-based rules are safer than IP-based rules.
Webhook ingestion. If your agents need to receive webhooks from external services (Stripe events, CRM updates, form submissions), route them through a reverse proxy with authentication, not directly to agent containers. A lightweight Caddy or nginx instance that validates webhook signatures before forwarding to the agent network keeps the attack surface minimal.
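Signature validation is the core of that webhook gate. A generic HMAC-SHA256 sketch; real providers (Stripe, GitHub, and others) each document their own signature header format, so adapt the parsing to the service you integrate.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it in
    constant time to the signature the sender attached. Reject before the
    payload ever reaches the agent network."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Constant-time comparison (hmac.compare_digest) matters here: a naive string comparison leaks timing information an attacker can use to forge signatures byte by byte.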
What does agent monitoring look like in production?
Monitoring is the difference between a production system and a demo. Every ClawRevOps build runs monitoring on a 30-minute heartbeat cycle. The agent checks in every 30 minutes. If it misses a heartbeat, alerting fires.
Heartbeat cycles. Each agent sends a heartbeat to the monitoring system at a fixed interval. The heartbeat includes: agent status, last task completed, current task queue depth, memory utilization, and any error counts since the last heartbeat. A missed heartbeat triggers an alert. Two consecutive missed heartbeats trigger an automatic container restart.
Task-level logging. Every agent task is logged with: start time, end time, model used, token count, input summary, output summary, and success/failure status. This is your audit trail. When a client asks "why did the agent send that email?" you can trace the exact reasoning chain, model call, and decision point.
Error classification. Not all errors are equal. Transient errors (API rate limits, temporary network issues) are retried automatically. Persistent errors (authentication failures, schema changes in business tool APIs) are escalated to humans. The monitoring system classifies errors and routes them appropriately. An alert that fires 50 times for a rate limit is noise. An alert that fires once for an auth failure is signal.
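The transient-versus-persistent split can be expressed as a small routing function. The error names here are illustrative stand-ins for whatever your client libraries actually raise.

```python
# Hypothetical error taxonomy; map your SDK's exceptions into these buckets.
TRANSIENT = {"rate_limited", "timeout", "connection_reset"}


def route_error(kind: str, attempt: int, max_retries: int = 3) -> str:
    """Retry transient failures up to a cap; escalate everything else.
    Unknown errors escalate too: silence is worse than a false page."""
    if kind in TRANSIENT:
        return "retry" if attempt < max_retries else "escalate"
    return "escalate"
```

Pair the retry branch with exponential backoff so fifty rate-limit retries never become fifty alerts.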
Performance baselines. After the first two weeks of deployment, establish baselines for task completion time, token consumption, and error rates. Deviations from baseline are more informative than absolute thresholds. If your marketing agent normally processes content in 45 seconds and suddenly takes 3 minutes, something changed, even if 3 minutes is not "slow" in absolute terms.
Reporting delivery. Every ClawRevOps build delivers monitoring summaries to Discord or Slack. The Jarvis build pushes real-time reporting across 5 businesses. Operators see agent health alongside business metrics in the channels they already monitor. No separate dashboard to check.
How should I architect the model infrastructure layer?
Not every agent task needs the most powerful model. Tiered model architecture is the single highest-impact optimization for cost and performance. The Jarvis deployment achieves 70-90% token cost reduction through intelligent tiering.
Tier 1: Complex reasoning. Claude Opus or equivalent for strategic decisions, multi-step analysis, nuanced communication, and financial interpretation. This tier handles 10-15% of total agent tasks. These are the decisions where model quality directly affects business outcomes.
Tier 2: Parallel execution. Claude Sonnet or equivalent for data processing, report generation, routine workflows, and standard decision-making. This is the bulk of agent work: 60-70% of tasks. Sonnet-class models handle these with strong quality at a fraction of Tier 1 pricing.
Tier 3: Monitoring and triage. Claude Haiku or equivalent for heartbeat processing, log scanning, simple classification, and alert routing. These high-frequency, low-complexity tasks run continuously and represent 20-30% of activity.
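The three tiers reduce to a routing table keyed by task classification. The model identifiers below are placeholders, not real API model names; substitute the current Anthropic model IDs for your account.

```python
# Illustrative tier map; task-type strings are assumptions for this sketch.
TIERS = {
    "complex_reasoning": "opus-class-model",     # Tier 1: ~10-15% of tasks
    "parallel_execution": "sonnet-class-model",  # Tier 2: ~60-70% of tasks
    "monitoring_triage": "haiku-class-model",    # Tier 3: ~20-30% of tasks
}


def pick_model(task_type: str) -> str:
    """Route a classified task to its tier; unknown tasks default to the
    mid tier rather than the most expensive one."""
    return TIERS.get(task_type, TIERS["parallel_execution"])
```

The classification step itself is a Tier 3 job: a cheap model deciding which model handles the real work.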
Caching layer. Many agent tasks produce outputs that are reusable. Competitive analysis does not change hourly. Content pillar definitions do not change daily. Cache model responses for deterministic queries and set appropriate TTLs. This reduces redundant API calls without sacrificing freshness.
Failover routing. If your primary model provider has an outage, your agents should not stop working. Configure failover from Anthropic to OpenAI (or vice versa) for Tier 2 and Tier 3 tasks. Tier 1 tasks can queue for the preferred provider with a timeout before failing over, since reasoning quality matters most there.
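The failover policy splits by tier: lower tiers switch providers immediately, Tier 1 requeues for the preferred provider first. A sketch where `primary` and `fallback` stand in for provider clients; a real implementation wraps actual SDK calls and their specific exception types rather than the bare ConnectionError used here.

```python
class PendingRetry(Exception):
    """Signal the scheduler to requeue a Tier 1 task before failing over."""

    def __init__(self, task: str, retry_after: float) -> None:
        super().__init__(task)
        self.task = task
        self.retry_after = retry_after


def complete(task: str, tier: int, primary, fallback,
             queue_timeout: float = 300.0) -> str:
    """Run a task on the primary provider, applying the tiered failover
    policy when the provider is unreachable."""
    try:
        return primary(task)
    except ConnectionError:
        if tier == 1:
            # Reasoning quality matters most: wait for the preferred
            # provider, fail over only after the timeout elapses.
            raise PendingRetry(task, retry_after=queue_timeout)
        return fallback(task)   # Tiers 2 and 3 switch providers immediately
```

The scheduler catches PendingRetry, sleeps for retry_after, retries the primary once, and only then routes the task through the fallback.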
What does the memory and storage infrastructure look like?
AI agents without persistent memory are chatbots with extra steps. Production agents need memory that survives restarts, scales across months of operation, and supports fast retrieval.
Persistent storage. Agent memory (conversation history, learned patterns, codified rules, client context) lives on Docker volumes backed by SSD storage. The Jarvis build maintains persistent memory across 138+ integrations. Every pattern the agent learns, every rule it codifies, every client preference it records persists indefinitely.
Backup strategy. Daily automated backups of all agent memory volumes. Store backups in a separate location from the primary VPS. A VPS failure that takes out both the agent and its backup is not a backup strategy. The recovery target is full agent restoration within 1 hour, including memory state.
Hybrid search. Agents need to retrieve relevant context from potentially large knowledge bases. Keyword search alone misses semantic connections. Vector search alone misses exact matches. Hybrid search (combining both) gives agents the best retrieval accuracy. The Pest Control build's 39-file knowledge base uses this pattern to surface relevant operational procedures regardless of how the query is phrased.
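The hybrid idea is a weighted blend of an exact-match score and a semantic score. A toy sketch using keyword overlap and cosine similarity over precomputed vectors; production builds would typically use BM25 plus a vector index, and the embeddings here are assumed inputs.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def hybrid_score(query: str, doc: str, q_vec: list[float],
                 d_vec: list[float], alpha: float = 0.5) -> float:
    """Blend exact-match recall with semantic similarity; alpha tunes
    the balance between the two retrieval modes."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)
```

Rank every knowledge-base document by hybrid_score and take the top k as agent context: exact product names still hit via keywords, and paraphrased queries still hit via vectors.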
How do I secure production AI agents?
Security for AI agents follows defense-in-depth principles with agent-specific additions.
fail2ban on SSH and any exposed endpoints. Automated ban after 3 failed attempts. This is baseline server security and applies to the VPS regardless of what runs on it.
UFW firewall with default-deny. Only allow outbound HTTPS and Tailscale traffic. Everything else is blocked. The Pest Control build runs the full security stack: Docker isolation, Tailscale networking, fail2ban, and UFW. Zero publicly exposed ports.
Audit trails for every agent action. Every API call, every decision, every output is logged with timestamp, model used, and reasoning context. This is not just for debugging. It is for accountability. When an agent sends an email to a client, you need to know what triggered it, what model reasoned about it, and what data it used.
Secrets management. API keys for model providers and business tools must not live in configuration files or environment variables visible to other containers. Use Docker secrets or a dedicated secrets manager. Rotate keys quarterly and immediately after any team member departure.
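With Docker secrets, each key arrives as a file mounted under /run/secrets rather than an environment variable. A sketch of the startup read; the secret name is a placeholder for whatever your compose file declares.

```python
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Read a Docker-mounted secret at startup. File-based secrets never
    appear in `docker inspect` output or the environment of other
    containers, unlike env vars."""
    return Path(secrets_dir, name).read_text().strip()
```

Read once at startup, hold in memory, and never log the value; quarterly rotation then means replacing the secret file and restarting the container.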
Model output validation. Agents that take real-world actions (sending emails, updating CRMs, processing payments) must validate outputs before execution: check recipients against allowlists, cap transaction amounts, and verify payloads against the target API's schema. Build validation rules into the agent execution pipeline, not as an afterthought.
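The validate-before-execute pattern looks like a gate in front of the side effect. A sketch for the email case; the specific rules and field names are examples, not a fixed schema.

```python
def validate_email_action(action: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the action
    may execute. Rules here are illustrative business constraints."""
    problems = []
    to = action.get("to", "")
    body = action.get("body", "")
    if "@" not in to:
        problems.append("recipient is not an email address")
    if not body:
        problems.append("empty body")
    elif len(body) > 5000:
        problems.append("body exceeds length limit")
    return problems


def execute(action: dict, send) -> bool:
    """Dispatch only when validation passes; a failed check routes the
    action to a human instead of silently sending."""
    if validate_email_action(action):
        return False
    send(action)
    return True
```

The same gate shape applies to CRM updates and payments: the validator changes, the execute-only-if-clean pipeline does not.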
What is the first step for a CTO planning agent deployment?
Audit your current operations for the functions that are most repetitive, most documented, and most costly in human hours. Those are your first agent candidates. Then size the infrastructure using the guidelines above: one VPS for orchestration, Docker containers per agent function, Tailscale for networking, and a 30-minute heartbeat monitoring cycle.
The discovery call through the War Room is a 30-minute technical conversation that maps your infrastructure requirements to a specific deployment plan. No slides. No generic pitch. A conversation about your stack, your operations, and where agents create the most leverage.