KI Interview — A Real-Time AI Interviewer for Live Television

A pre-recorded demo is forgiving; a live broadcast is not. This system listened, reasoned, and spoke back in near real time to conduct an interview — on air, in German, with no second takes. It ran live on national television, which meant every architectural decision had to survive an audience and a fixed broadcast clock.

The problem & constraints

The goal was a speech-to-speech agent that could hold a responsive, coherent, on-camera interview. "Live" turns the comfortable assumptions of an LLM product into hard engineering constraints: no retries, visible latency, overlapping and noisy audio, an unpredictable human guest, German throughout, and a hard requirement that nothing inappropriate ever leave the speakers. The non-functional requirements — latency, safety, and graceful failure — were the actual project.

Architecture & why this stack

Streaming speech-to-speech on Azure — each stage starts on the previous one’s partial output; a producer holds a kill-switch across the live path. Observability, safety and delivery are cross-cutting.

I designed it as a streaming, three-stage pipeline — speech-to-text → dialogue manager (LLM) → text-to-speech — with an event bus between stages so each could start work on partial output from the previous one instead of waiting for completion.

Why streaming, not request/response: perceived responsiveness is the sum of three stages. Streaming partial STT transcripts let the LLM begin planning before the guest finished speaking, and streaming LLM tokens let TTS begin synthesising the first words while the rest were still being generated. That overlap was the single biggest lever on "does it feel conversational."
Why a fine-tuned Whisper for STT: German broadcast audio with accents and domain vocabulary; general ASR transcription errors propagate into wrong answers, so accuracy at the front of the pipeline mattered most. I used Whisper large-v2 fine-tuned with LoRA on ~400 hours of labelled in-domain audio — the same fine-tuning work as my Whisper case study.
Why cloud + EU data boundary: deployed on Azure with EU data residency, because broadcast content and guest audio are sensitive; I'd rather inherit a compliant boundary than bolt privacy on later.

The LLM layer: dialogue, prompting, guardrails

The model didn't "freestyle." I treated it as a constrained dialogue manager: a structured interview plan defined the arc and the must-hit beats, and the LLM improvised wording within that frame. Prompts were layered — a stable system contract (persona, tone, hard "never do" rules), the interview plan, and a rolling, summarised conversation state to keep context bounded and latency predictable.

Guardrails were defence-in-depth, not a single prompt: input checks on the transcript, output moderation before anything reached TTS, topic/length constraints in the prompt, deterministic fallbacks for low-confidence turns, and — the last line of defence — a human producer with a hard kill-switch. On live TV you assume the model will eventually say something you don't want, and you engineer so that it can't reach air when it does.

Measuring quality: evaluation & A/B testing

You can't improve what you don't measure, so before any live attempt we evaluated on three axes:

Latency: end-to-end response time tracked as a budget per stage, with explicit targets so the conversation stayed in a natural rhythm; we profiled and optimised the slowest stage each iteration.
Transcription accuracy: word error rate on held-out, domain-representative audio — the metric that predicted downstream answer quality.
Dialogue quality: a rubric (relevance, coherence, safety, staying on-plan) scored by an LLM-as-judge, first calibrated against human ratings so we trusted the automated score before relying on it.
Tracing & human review with Langfuse: every run — rehearsal and live — was traced in Langfuse (per-stage latency, transcript, model output, rubric scores), and a structured human evaluation with feedback collection from the production crew fed directly back into the prompts and the safety rules.

We A/B-tested offline: prompt variants and model/voice configurations run against a fixed bank of recorded interview turns and adversarial inputs, scored on the same rubric and judged against the human ratings, so changes were chosen on evidence rather than vibes before they ever went near a broadcast.

Testing, deployment & operations

An LLM system needs more than unit tests. We built a scenario test harness that replayed real and adversarial audio — silence, interruptions, cross-talk, profanity, off-topic detours — and asserted on safe, graceful behaviour. Deterministic components had ordinary unit/integration tests; the probabilistic ones had an evaluation suite that ran like a regression gate, so a prompt or model change that lowered the rubric score blocked the release.

It deployed on Azure (containerised, with the model services behind the EU boundary), and the live event ran against a rehearsed runbook: dry runs, health checks, real-time latency monitoring, and a clear human-in-the-loop fallback if any stage degraded on air.

Leading the team

I led this as the architect and the engineering lead. I split the work along the pipeline — a data-scientist owning the speech/ASR side, an MLOps engineer owning deployment and the real-time infrastructure, and myself owning the dialogue/LLM layer and the overall latency and safety budget — with shared interfaces agreed up front so the parts integrated cleanly. I ran tight code review on the safety-critical paths, paired with a junior team member on the guardrail logic, and kept the team focused on the non-functional requirements that actually decided success. The discipline I insisted on: nothing reaches the live path without passing the scenario harness.

The hardest part & what I learned

The hardest part was the unhappy paths under a real-time clock — making the system degrade gracefully when the audio was messy or the guest went off-script, without adding latency that broke the conversational feel. The lesson, which has shaped how I build agentic systems since: the model is the easy 20%. Latency engineering, layered guardrails, evaluation you can trust, and a human fallback are the 80% that decide whether it survives contact with the real world — the conviction behind my writing on the harness.

← Back to all work