Fine-Tuning Whisper for German — ~40% Fewer Errors

Out of the box, Whisper is impressive — until it meets a regional German accent or domain jargon, and the transcript quietly falls apart. Everything downstream (subtitles, search, analytics) inherits those errors. So I fixed the model rather than patching the symptoms — and cut word error rate by roughly 40% on the content we actually handle.

The problem & constraints

General-purpose ASR underperforms exactly where it matters in practice: dialects, accents, names, and domain-specific vocabulary. For our German speech, the base model's word error rate was high enough that downstream products — especially subtitling — suffered visibly. The brief: a measurable, deployable accuracy gain on representative audio, achieved efficiently enough to retrain as data grew.

Architecture & why this approach

Data-centric training on Azure ML — the dataset is the primary artifact; the WER eval gate is a hard release gate, so a model ships only if it beats the current one on the held-out set.

Data first, because data is the real lever. The largest gains came from data, not modelling tricks. I collected and labelled roughly 400 hours of in-domain German audio as ground truth — deliberately capturing the broadcast-domain jargon, names and accents that general ASR mangles — and built a reproducible pipeline around it. I treated the dataset as the primary artifact: versioned, with clean train/validation/test splits drawn to reflect production conditions.
Augmentation for robustness. I used segmentation to slice and recombine clips into more training examples, plus speed, pitch and noise/reverb perturbation, so the model generalised to real recording conditions instead of overfitting clean studio audio — effectively multiplying the labelled hours.
LoRA over full fine-tuning — a deliberate trade-off. Parameter-efficient fine-tuning of Whisper large-v2 with LoRA (low-rank adapters on the attention projections) instead of retraining the whole model. Why: it's far cheaper to train and store, it runs on accessible hardware (including Apple-Silicon/MPS for iteration and GPUs for full runs), and it avoids the catastrophic forgetting a naive full fine-tune risks — you specialise the model while keeping its general strength. The approach mirrors my open-source whisper-german-finetuning work.

Measuring accuracy: evaluation & A/B testing

Accuracy was tracked as word error rate on a held-out, domain-representative test set — deliberately not a flattering public benchmark, because the internal set is what predicts downstream quality. I broke WER down by dialect and content type so a headline average couldn't hide a regression on a hard slice. Improvements were validated by A/B-style comparisons: candidate data mixes, augmentation strategies, and LoRA configurations trained and evaluated against the same frozen test set, so every change was a measured decision. Because WER alone can hide errors that matter to readers, I paired it with human evaluation — native speakers spot-checking transcripts, plus feedback collection from the subtitling team on real output — to confirm the metric tracked perceived quality. The headline result: a ~40% relative WER reduction versus base Whisper large-v2.

Testing, deployment & operations

The evaluation ran as a regression gate in the training pipeline — a candidate that didn't beat the current model on the held-out set didn't ship. Data and config were versioned so any model was reproducible from scratch. The fine-tuned model was packaged for the downstream transcription/subtitling pipeline, with WER monitored on fresh production samples to catch drift as the kind of audio evolved.

Leading the team

I led the work and used it to develop the team. I set the evaluation protocol up front (the held-out set, the per-slice breakdown, the regression gate) so everyone optimised against the same honest number, and I mentored master's students on the data-curation and augmentation side — turning "collect some audio" into a rigorous, versioned dataset practice. I kept the focus on the metric that mattered to the business (downstream transcription quality), not on chasing benchmark vanity.

The hardest part & what I learned

The hardest part was the long tail — strong dialects and domain terms where even humans disagree — and resisting the urge to overfit the average at their expense. The lesson, which generalises across applied ML: the win is almost never a bigger model. It's the right data, an efficient adaptation (LoRA), and an honestly-measured evaluation. A 40% error cut from data + adapters beats waiting for the next foundation model.

← Back to all work