My Kid Said "Bloop" and Two AI Models Lost Their Minds

I have 201 recordings of my kids talking to Blip. Every wake word, every follow-up, every mumble and sound effect — captured as 16 kHz mono WAVs with matching transcripts. I've been sitting on this data for three days, and today I finally wrote a benchmark to see how bad the transcription actually is.

The answer: worse than I thought, but not in the way I expected.

The setup

Blip's production STT is faster-whisper small.en running on a local RTX 4090. It transcribes in about half a second, never leaves the machine, and handles Jaxsen's speech pretty well. But I've always suspected it was missing things — especially with Adalind, who talks fast and drops consonants like they owe her money.

On the inference box (an RTX PRO 6000 with 96 GB VRAM), I have whisper-large-v3-turbo running as a FastAPI server. Roughly three times the parameters (about 809M versus 244M), multilingual, supposedly more accurate. I wanted to know: how much better is it, actually, on real kid speech?

So I wrote bench_stt.py. Feed it every WAV from data/sessions/, run both models, compute word error rate between them, spit out an HTML report. Simple enough.
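
A minimal sketch of the WER step, for anyone who wants to reproduce the numbers. This is not the contents of bench_stt.py: the names normalize and wer are illustrative, and the normalization (lowercase, strip punctuation) is an assumption about what a fair comparison needs.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation so "Blip!" and "blip" compare equal."""
    # Keep word characters, whitespace, and apostrophes; everything else becomes a space.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        # Assumption: an empty reference scores 0 only if the hypothesis is empty too.
        return 0.0 if not hyp else float("inf")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or free match)
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the edit count is divided by the number of reference words, wer("Yeah", "Yep") comes out as 1.0, and a long hypothesis against a short reference can score well above 1.0.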

The first results were wild

Backend A (small.en, cached from live sessions) versus Backend C (large-v3-turbo, fresh inference): 47% word error rate. Nearly half the words differed.

That number is meaningless on its own — I was measuring disagreement, not accuracy against ground truth. But the disagreements told a story.

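The top disagreement, at 3,171% WER (WER divides the edit count by the number of reference words, so a runaway transcript against a short reference can blow far past 100%):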

Jaxsen said: "Bloop bloop bloop bloop! BLURP! Hey! Blip!"
small.en heard: "Bloop bloop bloop bloop! BLURP! Hey! blip!"
large-v3-turbo heard: "Það er að það er að það er að það er að það er að það er að..."

The larger model decided my eight-year-old was speaking Icelandic. Not just on this clip — on dozens of them. Every time the kids made non-verbal sounds, silly noises, or laughed, large-v3-turbo snapped into some Nordic language and started hallucinating repeating phrases. "Bee-paw! Bee-paw!" became "Bípa, bípa, bípa." A giggle became "Það er hann. Það er hann."

Why this happens

whisper-large-v3-turbo is multilingual. It was trained on speech in 99 languages, and its first job on every audio clip is to figure out which language it's hearing. Adult conversational English? No problem. But an eight-year-old making explosion sounds while narrating an imaginary shark battle? The language detector has no idea what to do with that, so it picks whichever language scores highest for that sound pattern. Apparently Icelandic.

small.en doesn't have this problem because it's English-only. There's no language detection step. It hears "bloop" and writes "bloop" because that's the closest English token it's got. Which is, honestly, the right answer.

The fix took one line

The Voxtral ASR server on the inference box was calling model.generate() without specifying a language. I added language='en', task='transcribe' and restarted. Took longer to SSH into the box than to write the fix.
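
A sketch of what that change looks like, assuming the server wraps a Hugging Face transformers-style Whisper model and processor. The names transcribe and FORCED_DECODE are hypothetical, not the server's actual code.

```python
# Pin decoding to English transcription instead of letting the model
# auto-detect the language on every clip.
FORCED_DECODE = {"language": "en", "task": "transcribe"}


def transcribe(model, processor, audio, sampling_rate=16000):
    """Transcribe one mono clip with the language forced to English.

    Assumes `model` and `processor` behave like transformers'
    WhisperForConditionalGeneration and WhisperProcessor.
    """
    # Convert raw samples into the log-mel features the model expects.
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    # Before the fix this call had no language or task arguments.
    token_ids = model.generate(inputs.input_features, **FORCED_DECODE)
    # Decode token ids back to text and drop special tokens.
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0].strip()
```

Keeping the forced arguments in one constant makes it harder for a second call site to quietly fall back to auto-detection.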

After the patch, the Icelandic hallucinations stopped. "Bloop bloop" from Jaxsen became "Hey, look!" — still wrong, but at least it's English. The model replaced its Icelandic repetition loops with English repetition loops. Progress, I guess.

Post-fix results, all 201 turns:

Backend	WER vs Reference	Avg Latency
faster-whisper small.en	50%	~500ms (local)
whisper-large-v3-turbo	0% (reference)	515ms (network)

The 50% WER is still misleading. On clear speech — Jaxsen asking "How do trees make air?" — both models produce identical transcripts. The divergence is almost entirely on ambiguous audio: single-word utterances ("Yeah" vs "Yep"), non-verbal sounds, and the edges of sentences where kids trail off.

What surprised me

Neither model is correct on the hard cases. When Jaxsen says "Ooh" with rising excitement, small.en hears "Ooh" and large-v3-turbo hears "I think it's a bit of a bit of a bit of a..." — a hallucination loop, 443 words from a one-syllable exclamation. The bigger model confidently generates more garbage.

I expected the larger model to be strictly better. It isn't. It's differently wrong. On clean speech it matches or beats small.en. On kid noise — which is maybe a quarter of these recordings — it's actively worse because it has more parameters to be wrong with.

The other thing: latency is nearly identical, 515 ms over the network versus ~500 ms local. For a background verifier that's fine. In production, the bigger model would buy no speed, so there's no reason to switch from the local model.

Where this goes

I'm adding two more backends to the benchmark. Cohere Transcribe — a 2B parameter ASR model that claims 5.42% WER on adult speech and supports 14 languages — will run as Backend B. And gpt-omni/mini-omni, a 0.5B end-to-end speech model, will be Backend D. Neither is installed yet. I want the benchmark harness ready before I start pulling models onto the inference box.

The real plan is dual-STT. Whisper small.en stays in production — it's fast, local, and good enough for immediate responses. Cohere runs in the background on the same audio, and if its transcript disagrees significantly, the correction gets injected into the next conversation turn. The kid never waits for the second model. Blip just quietly gets smarter between sentences.
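
The "disagrees significantly" check could be sketched like this. Everything here is an assumption rather than a finished design: the function name pending_correction, the 0.3 disagreement threshold, and the length guard that throws out a verifier transcript when it is several times longer than the fast one (the hallucination-loop case from earlier).

```python
import re


def _words(text):
    """Lowercase, strip punctuation, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def _edit_distance(a, b):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def pending_correction(fast_text, verifier_text, threshold=0.3, max_ratio=3.0):
    """Return the verifier transcript if it should override the fast one, else None."""
    fast, slow = _words(fast_text), _words(verifier_text)
    if not slow:
        return None
    # Guard (assumed): a verifier transcript far longer than the fast one
    # is more likely a repetition loop than a real correction.
    if len(slow) > max_ratio * max(len(fast), 1):
        return None
    # Normalize by the longer transcript so the score stays between 0 and 1.
    disagreement = _edit_distance(fast, slow) / max(len(fast), len(slow))
    return verifier_text if disagreement >= threshold else None
```

With these defaults, identical transcripts return None, "how do bees make hair" against "How do trees make air?" returns the verifier text, and a one-word "Ooh" against a thirteen-word loop is discarded by the length guard.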

I also looked at end-to-end speech conversation models — Qwen2.5-Omni-7B, Kimi-Audio, Moshi, Hertz-dev — that could theoretically replace the entire Whisper + LLM + TTS pipeline with one model. Qwen2.5-Omni fits on the 96 GB inference box and scores well on benchmarks. I'm going to run it through the Blip eval suite to see if a single model can actually teach a kid spelling as well as the dedicated pipeline can. I have my doubts, but the architecture simplification would be worth it if the quality holds.

For now, though, the benchmark told me what I needed to know: small.en is the right production choice, bigger isn't always better on kid speech, and the language detection bug on the inference server had been silently corrupting every background transcription I'd ever run. That last one stings a little.