Getting a Computer to Understand My 7-Year-Old

Adalind is six — almost seven. She talks fast, drops word endings, mixes up sentence structures mid-sentence, and sometimes just trails off into silence when she gets distracted by something across the room. Getting Blip to reliably understand her turned out to be the hardest single problem in Phase 1.

Why kids' speech is harder

Speech recognition models are only as good as their training data, and most of that data is adult speech: podcasts, audiobooks, phone calls, recorded meetings. Children's voices are systematically underrepresented, and they differ in ways that matter: higher pitch, different formant frequencies, inconsistent pronunciation, and more disfluencies (the "um"s and "uh"s and false starts). Error rates on standard models can be two to three times higher for young children than for adults.

Jaxsen, at nine, is close enough to adult speech patterns that Whisper handles him well. Adalind at six is still in the harder zone. "Blip, can we do a spelling bee" came through as "Blip, can we do a stealing bee" on the first pass. Not terrible, but the failure modes matter when you're building something for a kid.
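To put a number on a miss like that one, here is a minimal sketch of word error rate (WER), the usual metric for this kind of comparison. It assumes "error rate" above means WER; the function name is illustrative and not part of Blip.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word was dropped
                curr[j - 1] + 1,     # extra word was inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "Blip, can we do a spelling bee",
    "Blip, can we do a stealing bee",
)
print(f"{wer:.3f}")  # one substitution in seven words -> 0.143
```

One wrong word in seven looks small as a percentage, but it happened to be the word that carried the whole request.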

Watching them use it

These are raw clips from sessions at the desk, with Jaxsen and Adalind working through spelling rounds with Blip. Faces are blurred to keep them off the internet.

Session 1 — first spelling round

Session 2 — trying harder words

Session 3 — Adalind solo

Session 4 — longer back-and-forth

What actually helped

The biggest improvement came from the microphone, not the model. The Anker PowerConf S330 has a four-microphone array designed for conference room pickup: far-field, with beamforming and echo cancellation built in. Switching from a standard USB mic to this reduced my transcription errors by more than half. A better input signal is worth more than a better model.

Second: prompt conditioning. Whisper accepts an optional text prompt that biases the output toward expected vocabulary. I built a dynamic prompt from the current activity — if we're doing a spelling bee, the prompt seeds Whisper with common spelling-bee vocabulary. If Adalind says a word that's on the list, Whisper is more likely to transcribe it correctly.
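A minimal sketch of what building that prompt could look like. The activity names, the vocabulary table, and the `build_prompt` helper are all hypothetical, and the word budget is a rough stand-in for Whisper's limited prompt window rather than an exact token count.

```python
# Hypothetical per-activity vocabulary, for illustration only.
ACTIVITY_VOCAB = {
    "spelling_bee": ["spelling bee", "spell", "letter"],
}


def build_prompt(activity: str, word_list: list[str], max_words: int = 100) -> str:
    """Build a comma-separated vocabulary hint for the current activity."""
    terms = []
    seen = set()
    # Session words go first so they survive the word budget.
    for term in list(word_list) + ACTIVITY_VOCAB.get(activity, []):
        key = term.lower()
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        terms.append(term)

    kept, used = [], 0
    for term in terms:
        n_words = len(term.split())
        if used + n_words > max_words:
            break  # stop once the word budget is spent
        kept.append(term)
        used += n_words
    return ", ".join(kept)


prompt = build_prompt("spelling_bee", ["giraffe", "because"])
print(prompt)  # giraffe, because, spelling bee, spell, letter

# Assumption: with the openai-whisper Python package, the hint would be
# passed as the initial_prompt argument, e.g.
#   model.transcribe(audio_path, initial_prompt=prompt)
```

The prompt only nudges the decoder toward these words. It is not a hard constraint, so a word on the list can still be mistranscribed.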

Third: graceful degradation. When the transcription confidence is low, or the result is short and ambiguous (one or two words that don't form a clear intent), Blip asks her to repeat rather than guessing. Kids are fine with "Sorry, I didn't catch that. Can you say it again?" Adults find it annoying. Kids just repeat.
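The decision rule can be sketched as a small function. This assumes the recognizer exposes an average log-probability and a no-speech probability for each utterance (the openai-whisper package reports `avg_logprob` and `no_speech_prob` per segment). The thresholds and the intent word list are placeholders to tune, not values from Blip.

```python
import re

# Hypothetical words that signal a clear intent even in a short utterance.
KNOWN_INTENT_WORDS = {"spell", "spelling", "repeat", "next", "stop", "hint"}


def should_ask_repeat(
    text: str,
    avg_logprob: float,
    no_speech_prob: float,
    min_logprob: float = -1.0,   # placeholder threshold
    max_no_speech: float = 0.6,  # placeholder threshold
    short_words: int = 2,
) -> bool:
    """Return True when asking the child to repeat is safer than guessing."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return True  # nothing usable was transcribed
    if no_speech_prob > max_no_speech:
        return True  # probably background noise, not speech
    if avg_logprob < min_logprob:
        return True  # the model itself is unsure
    if len(words) <= short_words and not any(w in KNOWN_INTENT_WORDS for w in words):
        return True  # too short to carry a clear intent
    return False


print(should_ask_repeat("can we do a spelling bee", -0.3, 0.05))  # False
print(should_ask_repeat("um the", -0.3, 0.05))                    # True
```

Checking confidence before length means a confident one-word command like "next" still goes through, while a shaky long sentence does not.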

What I still haven't solved

The trail-off problem. Adalind starts a sentence and stops when she gets distracted — leaving Blip waiting for a silence timeout and then transcribing half a thought. I haven't found a clean solution yet. Shorter silence timeouts help but create false cut-offs mid-sentence. Longer timeouts mean waiting longer when she genuinely finishes. There's no right answer; there's just a tradeoff.
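The tradeoff is easy to see in a toy endpointing sketch. It assumes a voice activity detector has already labeled each 100 ms frame as speech or silence; `find_endpoint` and the frame pattern are made up for illustration and are not Blip's actual pipeline.

```python
def find_endpoint(speech_flags, frame_ms, silence_timeout_ms):
    """Return the index of the frame where the utterance is declared over.

    Returns None if the silence timeout is never reached.
    """
    silent_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_timeout_ms:
                return i
    return None


# 1.0 s of speech, a 0.6 s mid-sentence pause, 1.0 s more speech, then silence.
flags = [True] * 10 + [False] * 6 + [True] * 10 + [False] * 20

short = find_endpoint(flags, frame_ms=100, silence_timeout_ms=500)
long = find_endpoint(flags, frame_ms=100, silence_timeout_ms=1500)
print(short)  # 14: cuts off inside the pause, before she resumes at frame 16
print(long)   # 40: waits 1.5 s after she actually finished at frame 25
```

With a 500 ms timeout the second half of the sentence is lost. With 1500 ms the whole sentence is captured, but every finished utterance pays the extra wait.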

I'm also planning to fine-tune a small model specifically on kids' speech at some point — not Whisper, but the local routing model (Llama-based). That's further out. For now, the hardware + prompt conditioning approach gets us to good-enough.

The practical lesson

If you're building voice interfaces for kids: spend the money on the microphone before you spend time tuning the model. The hardware ceiling on cheap mics is lower than you'd think, and no amount of model tuning compensates for a noisy, low-gain audio signal. Get the input right first.