Enterprise RAG Chatbot — Architecture, Evaluation & Delivery

The easy version of this is a chatbot that sounds confident and quietly makes things up. The only useful version is one whose answers people can trust and verify — grounded in the organisation's own documents, current, cited, and built to survive an audit. That gap, between demo and production, was the whole project.

The problem & constraints

Staff lost real time hunting through scattered internal documents, and a plain LLM was worse than nothing here: it hallucinates, can't cite a source, and its knowledge freezes at training time. We needed an assistant that answered from internal content, stayed current, showed its sources, and did it under enterprise data-handling and EU residency constraints. "Mostly right and unverifiable" was a failure condition, not a launch.

Architecture & why this stack

Figure: production RAG on Azure. Retrieval (AKS, EU region) decides accuracy; the LLM writes only from cited context. Indexing runs offline; observability, secrets and delivery are cross-cutting.

A RAG system's accuracy is decided in retrieval, not generation, so most of the architecture went there:

Ingestion & chunking: document loaders normalised mixed formats; I chose structure-aware chunking (respecting headings/sections) over naive fixed windows, because clean chunk boundaries are the cheapest large win in retrieval quality.
Hybrid retrieval + reranking: dense vector search for semantics, combined with keyword/BM25 for exact terms and acronyms, then a cross-encoder reranker to order the final context. Why hybrid: pure vector search misses exact identifiers, pure keyword misses paraphrase; the rerank step is what turns "approximately relevant" into "actually relevant."
Vector store & embeddings: chosen for EU-region hosting and operational fit, not hype. The data residency requirement narrowed the option set, which is exactly how a hard constraint should drive the decision.
LLM choice: a capable model behind an EU data boundary (Azure OpenAI with EU residency), traded off on quality vs. cost vs. compliance — not the flashiest model, the one that met the constraints.
Freshness: live web-search integration for questions beyond the static corpus, with the same citation discipline so users can tell internal sources from external.
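
Two of the steps above, structure-aware chunking and the merge of dense and keyword results, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: it assumes Markdown-style "#" headings in the normalised documents, and it uses reciprocal rank fusion (RRF) as one common way to combine two rankings, since the write-up does not say which fusion method was used. The function names, the document ids and the constant k=60 are all hypothetical. The cross-encoder rerank would run afterwards on the fused shortlist and is not shown.

```python
import re
from collections import defaultdict


def chunk_by_headings(text: str) -> list[dict]:
    """Split text into one chunk per section, starting a new chunk at each heading.

    Assumption: headings are Markdown-style lines beginning with '#'.
    """
    chunks: list[dict] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if heading or any(ln.strip() for ln in body):
            chunks.append({"heading": heading, "text": "\n".join(body).strip()})

    for line in text.splitlines():
        if re.match(r"#{1,6}\s+\S", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Reciprocal rank fusion: each ranking adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


if __name__ == "__main__":
    doc = "# Leave policy\nStaff get 25 days.\n\n## Carry-over\nUp to 5 days.\n"
    print(chunk_by_headings(doc))

    dense_hits = ["a", "b", "c"]   # hypothetical ids from vector search
    keyword_hits = ["c", "a", "d"]  # hypothetical ids from BM25
    print(rrf_fuse([dense_hits, keyword_hits], top_n=3))
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores are not on a comparable scale.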

The LLM layer: grounding & guardrails

The generation prompt was strict: answer only from the retrieved, cited context, and when the context doesn't support an answer, say so rather than guess. That "refuse-or-cite" contract is the difference between a tool people trust and one they learn to distrust after the first confident hallucination. Guardrails were layered around it: retrieval-confidence thresholds (no context → no answer), citation enforcement on every claim, PII handling appropriate to the data, and prompt-injection mitigations on ingested documents.
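
The refuse-or-cite contract can be illustrated with a small post-generation check. This is a minimal sketch, not the production guardrail: the Chunk type, the "[chunk_id]" citation format, the 0.5 threshold and the refusal wording are all assumptions made for the example. It also only verifies that each sentence carries a citation to a chunk that was actually retrieved; whether the cited chunk supports the claim is a faithfulness question that needs a separate evaluation.

```python
import re
from dataclasses import dataclass

# Hypothetical refusal message; the real wording is not given in the write-up.
REFUSAL = "I can't answer that from the available documents."


@dataclass
class Chunk:
    chunk_id: str  # hypothetical identifier, cited in answers as [chunk_id]
    text: str
    score: float   # relevance score from the reranker, assumed to lie in [0, 1]


def select_context(chunks: list[Chunk], min_score: float = 0.5) -> list[Chunk]:
    """Keep only chunks that clear the retrieval-confidence threshold."""
    return [c for c in chunks if c.score >= min_score]


def enforce_citations(answer: str, context: list[Chunk]) -> str:
    """Return the answer only if every sentence cites a retrieved chunk; otherwise refuse.

    Assumption: citation markers sit inside the sentence, before its final punctuation.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not context or not sentences:
        return REFUSAL  # no context (or no answer text) means no answer
    allowed = {c.chunk_id for c in context}
    for sentence in sentences:
        cited = set(re.findall(r"\[([^\[\]]+)\]", sentence))
        # Refuse on an uncited sentence or a citation to something not retrieved.
        if not cited or not cited <= allowed:
            return REFUSAL
    return answer


if __name__ == "__main__":
    retrieved = [
        Chunk("hr-12", "Staff get 25 days of annual leave.", 0.82),
        Chunk("it-3", "VPN setup guide.", 0.20),
    ]
    context = select_context(retrieved)
    print(enforce_citations("Staff get 25 days of leave [hr-12].", context))
    print(enforce_citations("Staff get 30 days of leave.", context))
```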

Measuring accuracy: evaluation & A/B testing

We evaluated retrieval and generation separately, because they fail differently. The evaluation combined automated metrics, RAG-specific tooling, production tracing, and human review:

Retrieval: recall@k and MRR against a curated gold set of questions with known source passages — this tells you whether the right context even reached the model.
Generation, scored with RAGAS: I used RAGAS for the RAG-specific metrics — faithfulness (is every claim supported by the retrieved context?), answer relevance, and context precision/recall — so groundedness was a tracked number, not an opinion. The LLM-as-judge behind these was calibrated against human ratings before we trusted it.
Tracing & online eval with Langfuse: every request was traced in Langfuse (retrieved chunks, prompt, tokens, latency, cost), with eval scores and thumbs-up/down user feedback attached to each trace — turning live traffic into a continuously growing evaluation set rather than just a dashboard.
Human evaluation: a recurring review panel scored a sample of real answers for correctness and usefulness — this both calibrated the automated judges and caught failure modes the metrics missed.
A/B testing: we ran retrieval variants head-to-head — chunking strategies, embedding models, rerank on/off, top-k — on the gold set and on live traffic, choosing configurations on measured RAGAS scores and human ratings rather than intuition.
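
For the retrieval metrics named above, a minimal sketch of recall@k and MRR over a gold set might look like the following. The data layout (a list of pairs holding the retrieved ids and the set of known relevant ids) and the passage ids are assumptions for illustration; they are not the project's gold set or its results.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over every question in the gold set."""
    if not runs:
        raise ValueError("gold set is empty")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    # Each entry: (ids returned by the retriever, ids of the known source passages).
    gold_runs = [
        (["p1", "p7", "p3"], {"p3"}),
        (["p9", "p2"], {"p2", "p4"}),
        (["p5"], {"p8"}),
    ]
    print(evaluate(gold_runs, k=3))
```

Running the same function over two retrieval variants on the same gold set gives the head-to-head comparison described in the A/B step, at least for the offline half of it.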

Testing, deployment & operations

The evaluation suite ran as a regression gate: a change that lowered groundedness or recall on the gold set blocked the release, the same way a failing unit test would. Deterministic components (ingestion, chunking, the retrieval API) had ordinary unit/integration tests; the probabilistic layer had the eval suite. The system was deployed cloud-native on Azure Kubernetes Service (AKS) with GitOps (Flux) and CI/CD via GitHub Actions, behind the EU data boundary, with monitoring on latency, retrieval quality, and the user-feedback signal. Captured "bad answer" feedback fed straight back into improving retrieval and prompts.
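
The regression gate itself is conceptually simple. A hedged sketch, assuming metrics arrive as a plain name-to-score mapping and that a small tolerance absorbs run-to-run noise from the LLM judge; the function name, the 0.01 tolerance and every number below are made up for illustration and are not results from this project:

```python
import sys


def regression_failures(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """List every metric that is missing or dropped more than `tolerance` below baseline."""
    failures = []
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is None:
            failures.append(f"{name}: missing from candidate run")
        elif new < base - tolerance:
            failures.append(f"{name}: {base:.3f} -> {new:.3f}")
    return failures


if __name__ == "__main__":
    # Illustrative numbers only.
    baseline = {"faithfulness": 0.91, "recall@5": 0.84}
    candidate = {"faithfulness": 0.92, "recall@5": 0.80}

    failures = regression_failures(baseline, candidate)
    for failure in failures:
        print(f"REGRESSION {failure}")
    # A non-zero exit code is what makes a CI job fail and block the release.
    sys.exit(1 if failures else 0)
```

In a CI pipeline this would run as a step after the eval suite, with the baseline scores stored alongside the gold set so that any change to either is reviewed together.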

Leading the team

I owned the architecture and led delivery across a small team. I split ownership cleanly — a data engineer on the ingestion/indexing pipeline, an MLOps engineer on the AKS deployment and observability, a data scientist on retrieval evaluation, with me holding the retrieval architecture and the eval strategy. I set the standard that no retrieval change ships without a number from the gold set, ran design reviews on the chunking and reranking decisions, and mentored a junior member through building the evaluation harness — deliberately, because the eval suite is the asset that keeps the system honest after launch.

The hardest part & what I learned

The hardest part was retrieval quality on messy, inconsistent real documents — the gap between a clean demo corpus and the actual enterprise content is enormous, and it's all in chunking and reranking. The lesson: a trustworthy RAG system is a retrieval-and-evaluation problem wearing an LLM costume. Invest in the gold set and the grounding discipline early, and the model almost takes care of itself; skip them and no model will save you.
