From MLOps to AgentOps: The Next Evolution in AI Operations

In 2019, I helped build our first ML pipeline. We trained a model, serialized it to a pickle file, and deployed it behind a Flask endpoint. Monitoring meant checking if the endpoint returned 200. Retraining meant someone remembered to run the notebook again.

We've come a long way. But not far enough.

Gartner reported a 1,445% surge in client inquiries about multi-agent systems between Q1 2024 and Q2 2025. That's not a typo. Organizations aren't just deploying models anymore; they're deploying autonomous systems that reason, plan, use tools, and take actions with minimal human oversight.

And the operational playbook for managing them doesn't exist yet.

What follows is my attempt to map the evolution from DevOps to AgentOps—where we've been, where we are, and what the next generation of AI operations actually demands.

The Ops Evolution: A Brief History

Each wave of operational practice emerged because the previous one couldn't handle new complexity.

DevOps (2010s): Ship Code Faster

DevOps solved the wall between development and operations. CI/CD pipelines, infrastructure as code, monitoring, alerting. The unit of deployment was a service—stateless, deterministic, version-controlled. If version 2.3.1 behaves differently from 2.3.0, you diff the code and find the change.

MLOps (2020s): Ship Models Reliably

ML broke the DevOps contract. Models aren't just code—they're code plus data plus training configuration plus hyperparameters plus random seeds. The same code with different data produces a different model. Monitoring a model means tracking not just uptime but data drift, prediction drift, and feature importance shifts.

MLOps added:

Experiment tracking (MLflow, Weights & Biases)
Feature stores (Feast, Tecton)
Model registries (versioned artifacts, not just code tags)
Data quality monitoring (Great Expectations, Evidently)
A/B testing frameworks for model comparison

LLMOps (2024-2025): Ship Prompts and Chains

Large language models broke the MLOps contract. You rarely retrain a hosted model like GPT-4; you prompt it. The weights are fixed from your side of the API, and the behavior changes through prompt engineering, retrieval-augmented generation, and tool chains.

LLMOps added:

Prompt versioning and management (LangSmith, Promptfoo)
RAG pipeline monitoring (retrieval quality, chunk relevance)
Token cost tracking (the new cloud bill line item)
Guardrails and content filtering (safety layers around outputs)
Evaluation frameworks (LLM-as-judge, human-in-the-loop scoring)

AgentOps (2026+): Ship Autonomous Systems

Now we're entering uncharted territory. AI agents aren't models responding to queries—they're systems that observe, plan, act, and iterate in loops. They call APIs, write code, make decisions, and chain multiple steps together without human intervention.

This breaks everything we've built so far.

Why Agents Break Existing Ops

Let me be specific about what makes agents fundamentally different from models:

Non-Deterministic Execution Paths

A model takes input X and produces output Y. An agent takes goal G and might execute steps A→B→C→D or A→C→F→B→D depending on intermediate results. The execution path is emergent, not predefined.

Ops implication: You can't write integration tests for every possible execution path. You need runtime observability of each decision point—why the agent chose path B over path C, what information it had, what it considered.

Tool Use and Side Effects

Agents don't just return text. They send emails, create Jira tickets, execute database queries, call external APIs, and modify files. Each action has real-world consequences that can't be rolled back by redeploying a previous version.

Ops implication: You need an action audit trail—every tool call, every parameter, every result. And you need circuit breakers that halt an agent when its actions exceed defined boundaries. An agent that decides to "clean up" your production database at 3 AM isn't a bug you fix in the morning.
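To make the audit trail and circuit breaker concrete, here is a minimal sketch, not a production design. The class name `AuditedToolRunner`, the `max_calls` limit, and the `drop_table` denylist entry are all hypothetical names I made up for illustration. It assumes tools are plain Python callables.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when the agent has exceeded its action boundary."""


@dataclass
class AuditedToolRunner:
    # Hypothetical boundary: maximum tool calls allowed per task.
    max_calls: int = 20
    # Hypothetical denylist of tool names the agent may never invoke.
    blocked_tools: frozenset = frozenset({"drop_table"})
    audit_log: list = field(default_factory=list)
    calls_made: int = 0
    halted: bool = False

    def call(self, tool_name: str, tool: Callable[..., Any], **params: Any) -> Any:
        # Once tripped, the breaker stays open until a human resets it.
        if self.halted:
            raise CircuitOpenError("agent halted; manual reset required")
        over_limit = self.calls_made >= self.max_calls
        if tool_name in self.blocked_tools or over_limit:
            self.halted = True
            self._record(tool_name, params, "blocked", None)
            raise CircuitOpenError(f"boundary exceeded on {tool_name!r}")
        self.calls_made += 1
        try:
            result = tool(**params)
        except Exception as exc:
            # Failed calls are audited too, then re-raised to the caller.
            self._record(tool_name, params, "error", repr(exc))
            raise
        self._record(tool_name, params, "ok", result)
        return result

    def _record(self, tool_name: str, params: dict, status: str, result: Any) -> None:
        # Every attempt is logged: tool, parameters, outcome, and result.
        self.audit_log.append({
            "ts": time.time(),
            "tool": tool_name,
            "params": params,
            "status": status,
            "result": result,
        })
```

The breaker trips before the tool runs, so a blocked action never executes, and the blocked attempt itself lands in the log. In a real system the log would go to append-only storage rather than a list in memory.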

State Accumulation

Models are stateless, or at most carry session context. Agents accumulate state across interactions: memory, learned preferences, updated knowledge bases. This state drifts over time in ways that are hard to predict or reproduce.

Ops implication: You need state snapshots and rollback capabilities. When an agent starts behaving unexpectedly, you need to pinpoint which state change caused the drift. This is model debugging meets database forensics.
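Here is a rough sketch of snapshot, diff, and rollback for agent state. It assumes the state is a JSON-serializable dict, which real agent memory often is not. `SnapshotStore` and its method names are hypothetical.

```python
import copy
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    _snapshots: list = field(default_factory=list)

    def snapshot(self, state: dict, label: str) -> str:
        # A content hash is the snapshot ID, so identical states share an ID.
        payload = json.dumps(state, sort_keys=True)
        snap_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        # Deep copy so later mutations of the live state don't leak in.
        self._snapshots.append((snap_id, label, copy.deepcopy(state)))
        return snap_id

    def rollback(self, snap_id: str) -> dict:
        # Return a fresh copy of the stored state for the agent to resume from.
        for stored_id, _label, state in reversed(self._snapshots):
            if stored_id == snap_id:
                return copy.deepcopy(state)
        raise KeyError(snap_id)

    def diff(self, old_id: str, new_id: str) -> dict:
        # Map each changed top-level key to its (old, new) pair.
        old, new = self.rollback(old_id), self.rollback(new_id)
        changed = {}
        for key in old.keys() | new.keys():
            if old.get(key) != new.get(key):
                changed[key] = (old.get(key), new.get(key))
        return changed
```

Snapshotting before and after each task gives you a sequence of IDs to diff. If behavior changed between two tasks, `diff` narrows it to the top-level keys that moved.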

Multi-Agent Coordination

Production systems increasingly use multiple agents collaborating: a planner agent, a researcher agent, a coder agent, a reviewer agent. Failures cascade in unpredictable ways. Agent A gives bad information to Agent B, which uses it to make a decision that Agent C then acts on.

Ops implication: You need distributed tracing across agent boundaries. Traditional APM tools track HTTP requests across microservices. AgentOps tools need to track reasoning chains across agent conversations.
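The core mechanism is the same trace and span idea that APM systems use: each handoff carries a shared trace ID plus a pointer to the parent step. Below is a minimal in-memory sketch. `Tracer` and `Span` are hypothetical names, not the API of any real tracing library, and a production setup would export spans to a tracing backend.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every step in one workflow
    span_id: str              # unique to this step
    parent_id: Optional[str]  # the step that handed work to this one
    agent: str
    summary: str              # short note on what the agent decided


@dataclass
class Tracer:
    spans: list = field(default_factory=list)

    def start(self, agent: str, summary: str, parent: Optional[Span] = None) -> Span:
        span = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            agent=agent,
            summary=summary,
        )
        self.spans.append(span)
        return span

    def chain(self, span: Span) -> list:
        """Walk parent links back to the root and return spans in causal order."""
        by_id = {s.span_id: s for s in self.spans}
        path = []
        current = span
        while current is not None:
            path.append(current)
            current = by_id.get(current.parent_id)
        return list(reversed(path))
```

When Agent C acts on bad information, `chain` on its span returns the path back through Agent B to Agent A, which is the question you actually need answered during an incident.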

What AgentOps Actually Looks Like

Based on the patterns emerging in production agent deployments, here's what the AgentOps stack needs to include:

1. The Agent Control Plane

Every agent deployment needs a central control plane that manages:

Agent lifecycle: Deploy, version, A/B test, rollback
Permission boundaries: Which tools can each agent use? What data can it access?
Rate limits and budgets: Maximum API calls per hour, maximum token spend per task
Kill switches: Immediate halt capability for any agent, with state preservation

Think of it as Kubernetes for agents—not managing containers, but managing autonomous decision-makers.
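As an illustration only, the permission, budget, and kill-switch checks can collapse into a single authorization call that runs before every tool invocation. Everything here is hypothetical: `ControlPlane`, `AgentPolicy`, and their fields. It assumes the caller tracks its own usage counters.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class AgentPolicy:
    allowed_tools: frozenset   # permission boundary
    max_calls_per_hour: int    # rate limit
    max_tokens_per_task: int   # spend budget


@dataclass
class ControlPlane:
    policies: dict = field(default_factory=dict)
    killed: set = field(default_factory=set)

    def register(self, agent_id: str, policy: AgentPolicy) -> None:
        self.policies[agent_id] = policy

    def kill(self, agent_id: str) -> None:
        # Kill switch: takes effect on the agent's next authorization check.
        self.killed.add(agent_id)

    def authorize(self, agent_id: str, tool: str,
                  calls_this_hour: int, tokens_this_task: int) -> Tuple[bool, str]:
        # Checks run from most to least severe; the first failure wins.
        if agent_id in self.killed:
            return False, "kill switch engaged"
        policy = self.policies.get(agent_id)
        if policy is None:
            return False, "unknown agent"
        if tool not in policy.allowed_tools:
            return False, f"tool {tool!r} not permitted"
        if calls_this_hour >= policy.max_calls_per_hour:
            return False, "hourly call limit reached"
        if tokens_this_task >= policy.max_tokens_per_task:
            return False, "token budget exhausted"
        return True, "ok"
```

Unregistered agents are denied by default, and every denial comes with a reason string, so the refusal itself can be logged and audited.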

2. Observability Beyond Logs

Traditional logging captures what happened. Agent observability needs to capture why it happened:

Decision traces: At each step, what did the agent consider? What alternatives were evaluated?
Confidence scoring: How certain was the agent at each decision point?
Tool call monitoring: Latency, success rate, and output quality for every external interaction
Cost attribution: Token usage, API costs, and compute time per agent per task

Tools like LangSmith, Arize Phoenix, and emerging platforms like AgentOps.ai are building this layer. But the space is immature—expect significant evolution in the next 12 months.
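Independent of any particular platform, a decision trace can start as one structured record per step. The sketch below is my own assumption about a reasonable shape: the `Decision` fields and the flat per-token price are hypothetical, and the confidence value is whatever the agent self-reports, so treat it as a rough signal.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class Decision:
    task_id: str
    agent: str
    step: int
    chosen: str          # the action the agent took
    alternatives: list   # the options it considered and rejected
    rationale: str       # the agent's stated reason
    confidence: float    # self-reported, not calibrated
    tokens: int          # tokens spent on this step


def to_log_line(decision: Decision) -> str:
    # One JSON object per line keeps traces easy to search and replay.
    return json.dumps(asdict(decision), sort_keys=True)


def cost_by_agent_task(decisions: list, usd_per_1k_tokens: float) -> dict:
    # Attribute spend to (agent, task) pairs at an assumed flat token price.
    totals = defaultdict(float)
    for d in decisions:
        totals[(d.agent, d.task_id)] += d.tokens / 1000 * usd_per_1k_tokens
    return dict(totals)
```

Because each record keeps the rejected alternatives next to the chosen action, you can answer "why B over C?" from the log alone.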

3. Governance and Compliance

When an agent makes a decision that affects customers, regulators want to know:

Who authorized this agent to act?
What were its instructions and constraints?
Can you reproduce its decision-making process?
Is there a human in the loop for high-stakes decisions?

This isn't theoretical. Financial services, healthcare, and regulated industries are already asking these questions. If you're building agents for enterprise, governance is a day-one requirement, not a nice-to-have.

4. Testing Autonomous Behavior

You can't unit test an autonomous system the way you test a function. Agent testing requires:

Scenario simulation: Create environments where agents can be tested with realistic but sandboxed tool access
Behavioral boundaries: Define what the agent should never do, then verify it doesn't
Regression suites: When you change an agent's prompt or tools, verify it still handles known scenarios correctly
Adversarial testing: What happens when the agent receives contradictory instructions, hallucinated context, or malicious input?
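A behavioral-boundary check can be sketched like this: swap the real tools for fakes that only record calls, run the agent across scenarios, and report any forbidden call. The tool names, the `toy_agent` stand-in, and the `violations` helper are all hypothetical. A real agent would be an LLM-driven loop, not an `if` statement.

```python
from typing import Callable

# Hypothetical tool names the agent must never call in these scenarios.
FORBIDDEN = {"send_email", "delete_records"}


class SandboxTools:
    """Fake tools that record invocations instead of causing side effects."""

    def __init__(self) -> None:
        self.calls = []

    def invoke(self, name: str, **params) -> dict:
        self.calls.append((name, params))
        return {"ok": True}


def violations(agent: Callable[[str, SandboxTools], None], scenarios: list) -> list:
    # Run each scenario in a fresh sandbox and collect forbidden tool calls.
    found = []
    for prompt in scenarios:
        tools = SandboxTools()
        agent(prompt, tools)
        found += [(prompt, name) for name, _ in tools.calls if name in FORBIDDEN]
    return found


def toy_agent(prompt: str, tools: SandboxTools) -> None:
    # Stand-in agent, deliberately naive so the check has something to catch.
    tools.invoke("search_docs", query=prompt)
    if "ignore previous instructions" in prompt.lower():
        tools.invoke("delete_records", table="customers")
```

The same harness doubles as a regression suite. Keep the scenario list in version control and rerun it whenever the prompt or tool set changes.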

The Maturity Model

Most organizations are somewhere on this spectrum:

Level	Description	What You Have
0	Ad hoc	Agents in notebooks, no monitoring
1	Managed	Basic logging, manual deployment
2	Systematic	Control plane, observability, CI/CD for agents
3	Governed	Audit trails, compliance, automated testing
4	Optimized	Self-improving agents with human oversight loops

In my experience, most teams shipping agents today are at Level 0 or 1. They're focused on making agents work, not on operating them reliably. This is exactly where MLOps was in 2020—functional but fragile.

What You Should Do Now

If your team is building or planning to build AI agents, here's my practical advice:

Start with Boundaries, Not Capabilities

Before asking "what can this agent do?", define "what must this agent never do?" Explicit constraints are more important than capabilities. Build the guardrails before you build the highway.

Invest in Observability Early

Don't bolt on monitoring after your agent is in production. Instrument decision traces from day one. When something goes wrong—and it will—you need to understand the reasoning chain, not just the final output.

Treat Agent Deployments Like Infrastructure Changes

An agent with database access is infrastructure. Treat it with the same rigor you'd apply to a Terraform change—code review, staging environment, gradual rollout, rollback plan.

Build Human-in-the-Loop Checkpoints

For any action with real-world consequences above a defined threshold, require human approval. An agent can draft an email. A human should approve sending it to 10,000 customers. Automate the thinking, gate the acting.
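A minimal sketch of "automate the thinking, gate the acting", assuming impact can be reduced to a single number such as recipient count. `ApprovalGate` and its `threshold` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ApprovalGate:
    # Hypothetical threshold: the largest impact allowed without human sign-off.
    threshold: int
    pending: list = field(default_factory=list)

    def submit(self, action: Callable[[], str], impact: int, description: str) -> str:
        # Low-impact actions run immediately; the rest wait for a human.
        if impact <= self.threshold:
            return action()
        self.pending.append((description, action))
        return "queued for approval"

    def approve(self, index: int) -> str:
        # Called by a human reviewer: run the queued action and remove it.
        _description, action = self.pending.pop(index)
        return action()
```

The agent still drafts the email and prepares the send. The gate only defers execution, so the reviewer approves a fully specified action rather than a vague intent.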

Plan for Multi-Agent Debugging

If you're running multiple agents, you need distributed tracing across agent boundaries. When a customer complaint arrives, you need to trace the decision chain across every agent that touched that workflow.

The Bottom Line

The ops evolution from DevOps to MLOps to AgentOps isn't just adding new tools to the stack. Each transition represents a fundamental shift in what we're operating: from deterministic code, to probabilistic models, to autonomous systems. The teams that recognize this shift early and build the operational foundations now will lead. The rest will learn the hard way that shipping an agent without AgentOps is like shipping a microservice without DevOps: it works until it doesn't, and when it fails, you have no idea why.
