Multi-Agent Chatbot: Architecture, Routing & Evaluation

Agentic · LLM    Intent-based routing to specialized agents

One model trying to do everything does most of it adequately and none of it well. For genuinely complex queries the better answer isn't a bigger prompt — it's a system that works out what's being asked and routes it to the right specialist. Designing that, and proving it beat the simple baseline, was the work.

The problem & constraints

A single monolithic assistant degrades as scope grows: the prompt balloons, behaviours interfere, cost climbs because every query hits the biggest model, and quality across genuinely different task types (lookup vs. reasoning vs. action) is uneven. The brief was an assistant that stayed reliable, debuggable, and affordable as the range of requests widened — the point at which monoliths usually fall over.

Architecture & why this stack

[Architecture diagram: request flow, AKS, EU region. Client: user query in, validated response out. Edge: API Management (auth, rate-limit). Orchestration: intent router (classify + route), orchestrator (route + combine). Agents & models: lookup agent (small, fast model), reasoning agent (large model), action agent (scoped tools). Cross-cutting: observability & eval (Traces, Monitor, Drift, Langfuse); cost control (Cost Management, model-per-job); delivery & resilience (Azure DevOps, Registry, Backup).]

Route first, then specialise — the router picks the right agent, each bound to the cheapest Azure OpenAI model that does its job; cost, tracing and delivery are cross-cutting.

  • Router + specialised workers. An intent classifier identifies what the user wants, and an orchestrator routes the query to an agent built for that job; every agent has its own tools and prompt. Why: narrow agents are testable and reliable in a way a sprawling super-prompt never is, and you can improve one without destabilising the rest.
  • Model-per-job economics. Routing unlocks cost/latency control: small, fast models handle intent classification and simple turns; the expensive, capable models are reserved for queries that genuinely need them. You stop paying top-tier token prices for "hello."
  • Bounded state & tools. Agents share a summarised conversation state that is passed between them, and tool access is scoped per agent so each can do only what its job requires. That is least privilege, and it also contains the blast radius.
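
A minimal sketch of the route-then-specialise idea above, not the production code. The agent registry, the model tier labels ("small-fast", "large"), the tool names, and the keyword classifier are all hypothetical stand-ins; in the real system the classifier would be a small-model call, and the action agent's model tier here is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentSpec:
    name: str
    model: str                      # hypothetical tier label, not a real deployment name
    allowed_tools: FrozenSet[str]   # least privilege: the only tools this agent may call


# Hypothetical registry: one narrow agent per intent, each bound to a model tier.
AGENTS: Dict[str, AgentSpec] = {
    "lookup": AgentSpec("lookup", "small-fast", frozenset({"search_kb"})),
    "reasoning": AgentSpec("reasoning", "large", frozenset()),
    "action": AgentSpec("action", "small-fast", frozenset({"create_ticket"})),
}


def classify_intent(query: str) -> str:
    """Keyword stand-in for the intent classifier (assumption: the real one is a model)."""
    q = query.lower()
    if any(word in q for word in ("create", "cancel", "book")):
        return "action"
    if any(word in q for word in ("why", "compare", "explain")):
        return "reasoning"
    return "lookup"


def route(query: str) -> AgentSpec:
    """Pick the specialised agent, and with it the model tier, for this query."""
    return AGENTS[classify_intent(query)]


def authorise_tool(agent: AgentSpec, tool_name: str) -> None:
    """Refuse any tool outside the agent's scope before it is ever invoked."""
    if tool_name not in agent.allowed_tools:
        raise PermissionError(f"{agent.name} agent may not call {tool_name}")
```

Because the model tier hangs off the agent spec, the cost decision is made once at routing time rather than inside each prompt, and a misbehaving agent cannot reach tools it was never given.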

The LLM layer & guardrails

Each agent had a focused prompt contract and explicit output schemas where structure mattered, so downstream steps could rely on the shape of what came back. Guardrails were per-agent and system-wide: validation on tool inputs/outputs, a confidence threshold on the router with a safe clarifying-question fallback when intent was ambiguous, and hard cost/iteration caps so an agent loop could never run away. Multi-agent systems fail in cascades — agent A handing bad output to agent B — so containment was a first-class design goal, not an afterthought.
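
The two containment guardrails described here, a router confidence threshold with a clarifying-question fallback and hard cost/iteration caps, can be sketched as follows. The threshold and cap values, the "clarify" agent name, and the shape of the `step` callable are illustrative assumptions, not the values used in the project.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.75   # assumed value; tune against the labelled intent set
MAX_STEPS = 5                 # assumed iteration cap per agent loop
MAX_COST_USD = 0.50           # assumed spend cap per agent loop


@dataclass(frozen=True)
class RouteDecision:
    intent: str
    confidence: float


def choose_agent(decision: RouteDecision) -> str:
    """Fall back to a clarifying question when the router is unsure."""
    if decision.confidence < CONFIDENCE_THRESHOLD:
        return "clarify"
    return decision.intent


class BudgetExceeded(RuntimeError):
    """Raised when an agent loop hits its iteration or cost cap."""


def run_agent_loop(
    step: Callable[[], Tuple[bool, float, str]],
    max_steps: int = MAX_STEPS,
    max_cost_usd: float = MAX_COST_USD,
) -> str:
    """Run `step` until it reports done, stopping hard at either cap.

    `step` is a hypothetical callable returning (done, cost_usd, output).
    """
    spent = 0.0
    for _ in range(max_steps):
        done, cost_usd, output = step()
        spent += cost_usd
        if spent > max_cost_usd:
            raise BudgetExceeded(f"cost cap exceeded: {spent:.2f} USD")
        if done:
            return output
    raise BudgetExceeded(f"iteration cap of {max_steps} reached")
```

Raising on a cap, rather than returning a partial answer, keeps a runaway agent from handing half-finished output to the next agent in the chain.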

Measuring accuracy: evaluation & A/B testing

I evaluated the parts and the whole separately:

  • Routing accuracy: a labelled intent set scored as a classification problem (confusion matrix, per-intent precision/recall) — because a misroute dooms everything after it, the router is the highest-leverage thing to measure.
  • End-to-end resolution: task-success scored on a fixed evaluation set with an LLM-as-judge calibrated to human ratings; for agents that retrieved and grounded answers, I added RAGAS faithfulness/relevance checks.
  • Tracing & human eval with Langfuse: every turn was traced in Langfuse — which agent ran, its inputs, tokens, latency and cost — with eval scores and user feedback attached per trace. A recurring human review of sampled conversations calibrated the judges and surfaced misroutes the automated metrics underweighted.
  • A/B against the baseline: the multi-agent system run head-to-head against a single-strong-model baseline on the same queries, comparing resolution quality, latency, and cost — so "multi-agent" had to earn its complexity with evidence, not assertion.
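
As an illustration of scoring the router as a classification problem, here is a small dependency-free sketch that builds the confusion counts and per-intent precision/recall from a labelled set. The function name and the toy labels are hypothetical; a library such as scikit-learn would do the same job.

```python
from collections import Counter
from typing import Dict, List, Tuple


def routing_report(
    gold: List[str], predicted: List[str]
) -> Tuple[Counter, Dict[str, Dict[str, float]]]:
    """Return confusion counts keyed by (gold, predicted) and per-intent metrics."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    confusion = Counter(zip(gold, predicted))
    intents = sorted(set(gold) | set(predicted))
    report: Dict[str, Dict[str, float]] = {}
    for intent in intents:
        tp = confusion[(intent, intent)]
        # Routed to this intent but labelled as something else.
        fp = sum(confusion[(g, intent)] for g in intents if g != intent)
        # Labelled as this intent but routed elsewhere (a misroute).
        fn = sum(confusion[(intent, p)] for p in intents if p != intent)
        report[intent] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return confusion, report


# Toy labelled set (made up for illustration): one lookup query is misrouted.
gold = ["lookup", "lookup", "reasoning", "action"]
predicted = ["lookup", "reasoning", "reasoning", "action"]
confusion, report = routing_report(gold, predicted)
```

On the toy data the misroute shows up twice: lookup recall drops to 0.5 and reasoning precision drops to 0.5, which is the per-intent view an overall accuracy figure would hide.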

Testing, deployment & operations

The router had a dedicated test set that ran as a regression gate; each agent had its own evaluation suite; deterministic glue (tool wrappers, state handling) had ordinary unit/integration tests. In production I instrumented per-agent traces (which agent handled each turn, with what inputs and at what cost), because you cannot debug a multi-agent system you can't observe. The system was deployed as containerised services with CI/CD and monitored for routing drift and per-agent cost/latency.
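
A regression gate of this kind can be as simple as the sketch below, which fails the build when routing accuracy on the labelled set drops under a floor. The 0.95 floor and the function names are assumptions for illustration; the actual gate and threshold are not specified here.

```python
from typing import Callable, Iterable, Tuple


def routing_accuracy(
    router: Callable[[str], str], labelled: Iterable[Tuple[str, str]]
) -> float:
    """Fraction of (query, expected_intent) pairs the router gets right."""
    pairs = list(labelled)
    if not pairs:
        raise ValueError("labelled set is empty")
    correct = sum(1 for query, intent in pairs if router(query) == intent)
    return correct / len(pairs)


def regression_gate(
    router: Callable[[str], str],
    labelled: Iterable[Tuple[str, str]],
    min_accuracy: float = 0.95,  # assumed floor, not the project's real number
) -> float:
    """Raise AssertionError, failing CI, if accuracy falls below the floor."""
    accuracy = routing_accuracy(router, labelled)
    if accuracy < min_accuracy:
        raise AssertionError(
            f"routing accuracy {accuracy:.3f} is below the gate of {min_accuracy:.3f}"
        )
    return accuracy
```

Run from a test runner or a CI step, a raised AssertionError blocks the change, so a prompt or model tweak that degrades routing is caught before deployment.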

Leading the team

I designed the orchestration architecture and led the build. The clean agent boundaries doubled as a team structure: engineers could own individual agents in parallel against agreed interfaces, which kept us fast without stepping on each other. I ran the design reviews on the routing and containment logic, set the rule that every agent ships with its own eval, and made the A/B-versus-baseline a gate on the whole approach: if we couldn't prove the added complexity paid off, we wouldn't keep it. That guarded the team against engineering for its own sake.

The hardest part & what I learned

The hardest part was reliability at the seams: intent ambiguity and cascading errors between agents, where a single bad route quietly corrupts the rest of the conversation. The lesson: agentic systems win on architecture, not model size — decompose deliberately, measure the router hardest, contain failure, and make the system prove it beats the simple baseline. It's the same discipline I argue for in Building the Harness.