In 2019, I helped build our first ML pipeline. We trained a model, serialized it to a pickle file, and deployed it behind a Flask endpoint. Monitoring meant checking if the endpoint returned 200. Retraining meant someone remembered to run the notebook again.
We've come a long way. But not far enough.
Gartner's latest data shows a 1,445% surge in enterprise interest around AI agent platforms since early 2025. That's not a typo. Organizations aren't just deploying models anymore—they're deploying autonomous systems that reason, plan, use tools, and take actions with minimal human oversight.
And the operational playbook for managing them doesn't exist yet.
What follows is my attempt to map the evolution from DevOps to AgentOps—where we've been, where we are, and what the next generation of AI operations actually demands.
The Ops Evolution: A Brief History
Each wave of operational practice emerged because the previous one couldn't handle new complexity.
DevOps (2010s): Ship Code Faster
DevOps solved the wall between development and operations. CI/CD pipelines, infrastructure as code, monitoring, alerting. The unit of deployment was a service—stateless, deterministic, version-controlled. If version 2.3.1 behaves differently from 2.3.0, you diff the code and find the change.
MLOps (2020s): Ship Models Reliably
ML broke the DevOps contract. Models aren't just code—they're code plus data plus training configuration plus hyperparameters plus random seeds. The same code with different data produces a different model. Monitoring a model means tracking not just uptime but data drift, prediction drift, and feature importance shifts.
MLOps added:
- Experiment tracking (MLflow, Weights & Biases)
- Feature stores (Feast, Tecton)
- Model registries (versioned artifacts, not just code tags)
- Data quality monitoring (Great Expectations, Evidently)
- A/B testing frameworks for model comparison
LLMOps (2024-2025): Ship Prompts and Chains
Large language models broke the MLOps contract. You don't retrain GPT-4—you prompt it. The "model" is fixed; the behavior changes through prompt engineering, retrieval-augmented generation, and tool chains.
LLMOps added:
- Prompt versioning and management (LangSmith, Promptfoo)
- RAG pipeline monitoring (retrieval quality, chunk relevance)
- Token cost tracking (the new cloud bill line item)
- Guardrails and content filtering (safety layers around outputs)
- Evaluation frameworks (LLM-as-judge, human-in-the-loop scoring)
AgentOps (2026+): Ship Autonomous Systems
Now we're entering uncharted territory. AI agents aren't models responding to queries—they're systems that observe, plan, act, and iterate in loops. They call APIs, write code, make decisions, and chain multiple steps together without human intervention.
This breaks everything we've built so far.
Why Agents Break Existing Ops
Let me be specific about what makes agents fundamentally different from models:
Non-Deterministic Execution Paths
A model takes input X and produces output Y. An agent takes goal G and might execute steps A→B→C→D or A→C→F→B→D depending on intermediate results. The execution path is emergent, not predefined.
Ops implication: You can't write integration tests for every possible execution path. You need runtime observability of each decision point—why the agent chose path B over path C, what information it had, what it considered.
Tool Use and Side Effects
Agents don't just return text. They send emails, create Jira tickets, execute database queries, call external APIs, and modify files. Each action has real-world consequences that can't be rolled back by redeploying a previous version.
Ops implication: You need an action audit trail—every tool call, every parameter, every result. And you need circuit breakers that halt an agent when its actions exceed defined boundaries. An agent that decides to "clean up" your production database at 3 AM isn't a bug you fix in the morning.
State Accumulation
Models are stateless (or have session context). Agents accumulate state across interactions—memory, learned preferences, updated knowledge bases. This state drifts over time in ways that are hard to predict or reproduce.
Ops implication: You need state snapshots and rollback capabilities. When an agent starts behaving unexpectedly, you need to pinpoint which state change caused the drift. This is model debugging meets database forensics.
Multi-Agent Coordination
Production systems increasingly use multiple agents collaborating—a planner agent, a researcher agent, a coder agent, a reviewer agent. Failures cascade in unpredictable ways. Agent A gives bad information to Agent B, who uses it to make a decision that Agent C acts on.
Ops implication: You need distributed tracing across agent boundaries. Traditional APM tools track HTTP requests across microservices. AgentOps tools need to track reasoning chains across agent conversations.
What AgentOps Actually Looks Like
Based on the patterns emerging in production agent deployments, here's what the AgentOps stack needs to include:
1. The Agent Control Plane
Every agent deployment needs a central control plane that manages:
- Agent lifecycle: Deploy, version, A/B test, rollback
- Permission boundaries: Which tools can each agent use? What data can it access?
- Rate limits and budgets: Maximum API calls per hour, maximum token spend per task
- Kill switches: Immediate halt capability for any agent, with state preservation
Think of it as Kubernetes for agents—not managing containers, but managing autonomous decision-makers.
2. Observability Beyond Logs
Traditional logging captures what happened. Agent observability needs to capture why it happened:
- Decision traces: At each step, what did the agent consider? What alternatives were evaluated?
- Confidence scoring: How certain was the agent at each decision point?
- Tool call monitoring: Latency, success rate, and output quality for every external interaction
- Cost attribution: Token usage, API costs, and compute time per agent per task
Tools like LangSmith, Arize Phoenix, and emerging platforms like AgentOps.ai are building this layer. But the space is immature—expect significant evolution in the next 12 months.
3. Governance and Compliance
When an agent makes a decision that affects customers, regulators want to know:
- Who authorized this agent to act?
- What were its instructions and constraints?
- Can you reproduce its decision-making process?
- Is there a human in the loop for high-stakes decisions?
This isn't theoretical. Financial services, healthcare, and regulated industries are already asking these questions. If you're building agents for enterprise, governance is a day-one requirement, not a nice-to-have.
4. Testing Autonomous Behavior
You can't unit test an autonomous system the way you test a function. Agent testing requires:
- Scenario simulation: Create environments where agents can be tested with realistic but sandboxed tool access
- Behavioral boundaries: Define what the agent should never do, then verify it doesn't
- Regression suites: When you change an agent's prompt or tools, verify it still handles known scenarios correctly
- Adversarial testing: What happens when the agent receives contradictory instructions, hallucinated context, or malicious input?
The Maturity Model
Most organizations are somewhere on this spectrum:
| Level | Description | What You Have |
|---|---|---|
| 0 | Ad hoc | Agents in notebooks, no monitoring |
| 1 | Managed | Basic logging, manual deployment |
| 2 | Systematic | Control plane, observability, CI/CD for agents |
| 3 | Governed | Audit trails, compliance, automated testing |
| 4 | Optimized | Self-improving agents with human oversight loops |
In my experience, most teams shipping agents today are at Level 0 or 1. They're focused on making agents work, not on operating them reliably. This is exactly where MLOps was in 2020—functional but fragile.
What You Should Do Now
If your team is building or planning to build AI agents, here's my practical advice:
Start with Boundaries, Not Capabilities
Before asking "what can this agent do?", define "what must this agent never do?" Explicit constraints are more important than capabilities. Build the guardrails before you build the highway.
Invest in Observability Early
Don't bolt on monitoring after your agent is in production. Instrument decision traces from day one. When something goes wrong—and it will—you need to understand the reasoning chain, not just the final output.
Treat Agent Deployments Like Infrastructure Changes
An agent with database access is infrastructure. Treat it with the same rigor you'd apply to a Terraform change—code review, staging environment, gradual rollout, rollback plan.
Build Human-in-the-Loop Checkpoints
For any action with real-world consequences above a defined threshold, require human approval. An agent can draft an email. A human should approve sending it to 10,000 customers. Automate the thinking, gate the acting.
Plan for Multi-Agent Debugging
If you're running multiple agents, you need distributed tracing across agent boundaries. When a customer complaint arrives, you need to trace the decision chain across every agent that touched that workflow.
The Bottom Line: The ops evolution from DevOps to MLOps to AgentOps isn't just adding new tools to the stack. Each transition represents a fundamental shift in what we're operating—from deterministic code, to probabilistic models, to autonomous systems. The teams that recognize this shift early and build the operational foundations now will lead. The rest will learn the hard way that shipping an agent without AgentOps is like shipping a microservice without DevOps—it works until it doesn't, and when it fails, you have no idea why.