
When Two Kids Talk at Once, Blip Hears Gibberish

Jaxsen was mid-sentence — "Can you help me spell dinosaur?" — when Adalind walked in and started talking over him. "Blip, tell me a story about a dragon!" Blip transcribed it as "Can you help me spell dragon story about a dinosaur." Which is not a sentence anyone has ever said.

This is the cocktail party problem, and it's been an open research area in speech processing since the 1950s. Humans solve it effortlessly: you can follow one voice in a noisy room without thinking about it. Machines are still terrible at it. Whisper, the STT engine Blip runs on, handles one speaker well. Give it two speakers and it blends them into word salad.

What the problem sounds like

Here's a clean recording of one kid asking for spelling help:

One speaker. Whisper gets this right every time.

Now two kids at the same time — the sibling starts talking 1.5 seconds in:

Two speakers overlapping. Whisper picks whichever is louder at each moment.

Add a parent calling from the kitchen:

And the real scenario — everything at once. Two kids, a parent, TV in the background:

The nightmare input. Whisper produces something, but it's nobody's actual sentence.

Why Whisper fails here

Whisper was trained on 680,000 hours of labeled audio — mostly single-speaker recordings, podcasts with clean turn-taking, audiobooks. It learned to transcribe one dominant voice and treat everything else as background noise. When two voices are equally loud, it doesn't pick one and ignore the other. It interleaves words from both speakers into a single transcript, producing sentences that neither person said.

For Blip, this matters more than it would for a dictation app. A wrong word in dictation is annoying. A wrong word in a kid's spelling lesson becomes a wrong answer that the model confidently "corrects." Adalind says "B-A-R-N" and Blip hears "B-A-R-N dinner ready" because the parent was talking, and now the LLM responds to a sentence about dinner instead of validating the spelling.

Three fixes, increasing effort

I'm testing all three.

Approach A — just see how bad it is

Before building anything, measure the actual damage. Take the four mixed clips above, run them through faster-whisper (small.en, the model Blip uses in production), and compare the transcripts against the known ground truth. If Whisper naturally favors the dominant speaker and the kid is usually closest to the mic — the Jabra Speak2 40 sits right in front of them — then maybe the problem is smaller than I think.
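The comparison step can be sketched without any dependencies. This is a minimal word error rate (WER) function, assuming a simple normalization (lowercase, punctuation stripped) that I made up for illustration; the function names are hypothetical, and a library such as jiwer would do the same job with more careful text handling.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "Can you help me spell dinosaur?"
    heard = "Can you help me spell dragon story about a dinosaur."
    print(f"WER: {wer(truth, heard):.2f}")
```

On the opening example, the transcript contains every reference word plus four inserted ones, so the score is 4 errors over 6 reference words, roughly 0.67. WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth remembering when two voices get merged into one transcript.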

Cost: one hour. Zero new dependencies.

Approach B — diarize first, then transcribe

pyannote/speaker-diarization-3.1 has 11 million downloads and an MIT license. It takes raw audio and outputs "who spoke when" — time-stamped speaker segments, including overlapping regions. I'd run it before Whisper: for non-overlapping segments, transcribe directly. For overlapping segments, use pyannote's companion model (speech-separation-ami-1.0) to split the voices into separate streams, then transcribe each one.
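The routing logic (direct transcription for single-speaker stretches, separation for overlapping ones) depends on cutting the timeline at every speaker boundary. Here is a rough sketch of that step. The `Turn` tuple is a hypothetical stand-in for diarization output, not pyannote's own type; I'm assuming the turns have already been pulled out of the pipeline result as plain (start, end, speaker) values.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarized speaker turn (hypothetical stand-in for pipeline output)."""
    start: float
    end: float
    speaker: str


def split_regions(turns: list[Turn]) -> list[tuple[float, float, frozenset]]:
    """Cut the timeline at every turn boundary and label each piece
    with the set of speakers active in it. Silent gaps are skipped."""
    boundaries = sorted({t.start for t in turns} | {t.end for t in turns})
    regions = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        active = frozenset(t.speaker for t in turns if t.start < hi and t.end > lo)
        if not active:
            continue  # nobody is speaking in this gap
        # Merge with the previous piece when it is contiguous and has the same speakers.
        if regions and regions[-1][1] == lo and regions[-1][2] == active:
            regions[-1] = (regions[-1][0], hi, active)
        else:
            regions.append((lo, hi, active))
    return regions


if __name__ == "__main__":
    turns = [Turn(0.0, 3.0, "SPEAKER_00"), Turn(1.5, 4.0, "SPEAKER_01")]
    for start, end, speakers in split_regions(turns):
        route = "separate, then transcribe" if len(speakers) > 1 else "transcribe directly"
        print(f"{start:.1f}-{end:.1f}s {sorted(speakers)} -> {route}")
```

Any region with more than one active speaker would go to the separation model; everything else goes straight to Whisper.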

The speaker IDs are anonymous ("Speaker 1", "Speaker 2"). To know which one is Jaxsen, I'd need to match the voice embeddings against an enrollment clip, a 5-second recording of Jaxsen saying something at setup time. pyannote ships speaker embedding models, so the matching should come down to comparing each diarized speaker's embedding against the enrolled one.
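The matching itself is a small amount of code once the embeddings exist. A minimal sketch, assuming each speaker's embedding is already available as a plain list of floats: the embeddings below are toy 3-dimensional vectors, the 0.5 threshold is a placeholder that would need tuning on real enrollment clips, and `match_enrolled` is a hypothetical helper, not a pyannote function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_enrolled(enrolled, speakers, threshold=0.5):
    """Return the anonymous label whose embedding is closest to the enrolled
    voice, or None if no speaker reaches the (assumed) similarity threshold."""
    best_label, best_score = None, threshold
    for label, embedding in speakers.items():
        score = cosine(enrolled, embedding)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    jaxsen = [1.0, 0.0, 0.0]  # toy enrollment embedding
    diarized = {
        "SPEAKER_00": [0.9, 0.1, 0.0],
        "SPEAKER_01": [0.0, 1.0, 0.0],
    }
    print(match_enrolled(jaxsen, diarized))
```

Returning None when nothing clears the threshold matters here: if Jaxsen wasn't one of the voices in the clip, the pipeline should say so rather than hand the LLM someone else's sentence.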

Cost: a day of integration. Adds ~300ms of preprocessing latency per turn. The models are small (~80 MB) and run on CPU.

Approach C — extract the target kid's voice before transcription

Forget diarization. Forget figuring out who spoke when. Instead: enroll Jaxsen's voice once (short reference clip at device setup), then use a target speaker extraction model to pull only his voice out of every audio input. Everything else — sister, parent, TV, dishwasher — gets suppressed before Whisper ever sees the audio.

ClearerVoice-Studio from ModelScope does this with their MossFormer2 architecture. Apache 2.0 license. The input is the mixed audio plus a short enrollment embedding of the target speaker. The output is a clean waveform of just that speaker. Feed that to Whisper and you get a transcript of only what the enrolled kid said.

This is the cleanest solution for Blip's specific problem. Blip already knows which kid is active (the profile switch system from the memory post handles that). It just needs the audio to match.

Cost: two to three days. Larger model (~200 MB), runs on GPU. Adds ~500ms of preprocessing but suppresses all competing audio: not just other speakers, but TV, music, and household noise.

What I haven't tested yet

All three approaches assume the problem is real and frequent. Is it? I don't actually know how often Jaxsen and Adalind talk over each other during a Blip session. The session recorder logs transcripts but not raw audio (by design — I don't want to store recordings of my kids). I could add a temporary overlap detector that flags when pyannote sees multiple speakers but doesn't record the audio, just the frequency. That'd tell me whether this is a "happens every session" problem or a "happens once a week" problem.
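A sketch of what that detector could log, assuming the diarizer's turns are available as (start, end) pairs in seconds. It keeps three numbers per session and nothing else: no audio, no transcript, no speaker identity. The function name and the output keys are hypothetical.

```python
def overlap_stats(turns):
    """Sweep over turn boundaries and total up how many seconds had
    at least one speaker and how many had two or more."""
    events = []
    for start, end in turns:
        events.append((start, 1))   # a speaker starts
        events.append((end, -1))    # a speaker stops
    # Tuples sort by time, then by delta, so at equal times a stop (-1)
    # is processed before a start (+1) and touching turns don't count as overlap.
    events.sort()

    speech = overlap = 0.0
    active, prev_time = 0, None
    for time, delta in events:
        if prev_time is not None and active > 0:
            speech += time - prev_time
            if active > 1:
                overlap += time - prev_time
        active += delta
        prev_time = time

    ratio = overlap / speech if speech else 0.0
    return {"speech_s": speech, "overlap_s": overlap, "overlap_ratio": ratio}


if __name__ == "__main__":
    # Second speaker starts 1.5 s into a 3 s utterance, as in the two-kid clip.
    print(overlap_stats([(0.0, 3.0), (1.5, 4.0)]))
```

Averaging `overlap_ratio` over a week or two of sessions would separate the "every session" case from the "once a week" case.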

If it's once a week, Approach A is fine and the kid just repeats themselves. If it's every session, Approach C is worth the integration cost. The answer determines whether I spend two to three days on this or move on to something else.

The datasets I found for this

If I do need to train or evaluate a separation model, there's good data available. LibriheavyMix has 100-9,000 hours of 1-4 speaker mixtures with reverberation (CC-BY-4.0, on HuggingFace). CHiME-5/6 has 40 hours of real dinner party recordings in actual homes, acoustically close to Blip's living room. SparseLibriMix has partial overlap (more realistic than full overlap) for quick evaluation. And DIHARD III explicitly includes child-adult recordings from the CHILDES corpus, which is the closest match to Blip's actual scenario.

SimClass is the most interesting one I found — 391 hours of simulated classroom audio with 25 spatialized child speakers, built in a Unity game engine with realistic room acoustics. Twenty percent of the files have deliberate overlap. It's the only dataset I know of that combines children's voices with realistic multi-speaker overlap. Published in 2025, from Stanford.

Where this goes next

I'm starting with Approach A tonight — run the four mixed clips through faster-whisper and see what comes out. If the WER is above 30% on the two-kids-overlapping scenario, I'll move to B (diarization) and then C (target extraction) in sequence. If Whisper handles it better than I expect — which is possible, the Jabra mic does have beamforming — I'll add the overlap frequency detector and wait for real-world data before investing further.

The audio clips above are synthetic (generated with edge-tts, different voices mixed in ffmpeg). Real kids' overlapping speech is messier — half-words, giggles, one kid's voice cracking into a yell. I won't know how well any of this works until I test it on actual Jaxsen-and-Adalind audio. But the synthetic clips give me a starting point and a baseline WER to improve against.
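For anyone who wants to reproduce a clip like the two-kid one, here is one way the mix could be built. This is a sketch, not the exact command behind the clips above: the file names are placeholders, and it assumes ffmpeg's `adelay` filter to shift the second voice by 1.5 seconds and `amix` to sum the two inputs.

```python
import subprocess


def mix_command(first: str, second: str, out: str, delay_s: float = 1.5) -> list[str]:
    """Build an ffmpeg argument list that overlays `second` on `first`,
    with `second` starting `delay_s` seconds in."""
    delay_ms = int(round(delay_s * 1000))
    graph = (
        # Delay the second input; the value is repeated so both channels
        # shift if the file is stereo (extra values are ignored for mono).
        f"[1:a]adelay={delay_ms}|{delay_ms}[late];"
        # Sum both voices and keep going until the longer one ends.
        f"[0:a][late]amix=inputs=2:duration=longest[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", first,
        "-i", second,
        "-filter_complex", graph,
        "-map", "[mix]",
        out,
    ]


if __name__ == "__main__":
    cmd = mix_command("kid_spelling.wav", "sibling_story.wav", "two_kids.wav")
    print(" ".join(cmd))
    # Uncomment to actually run it (requires ffmpeg on PATH and the input files):
    # subprocess.run(cmd, check=True)
```

By default `amix` scales its inputs down as it sums them, so the relative loudness of the two voices in the result may need adjusting if the test is about which speaker dominates.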