Blip Build Log #2: Adding Speech Practice — and Why It's Harder Than It Sounds

I see kids in the clinic. Not as a primary pediatric practice — I'm focused on men's health — but parents come in, and sometimes their kids are with them, and you notice things. One thing I've noticed is how many kids are working through speech articulation issues. The R sound. The S sound. The TH. Their parents are scheduling speech therapy, doing homework between sessions, and trying to find ways to make a six-year-old practice the same sound fifty times without a meltdown.

It's hard. Consistent practice between therapy sessions is one of the strongest predictors of progress in articulation treatment. But getting a kid to practice is a different problem than getting a kid to attend a session. The SLP is trained. The session is structured. The homework is a sheet of paper with words on it.

So I'm adding a speech practice module to Blip.

What this is — and what it isn't

I want to be precise here, because the line matters. This module is a practice supplement. It helps kids repeat their therapy targets in a structured, patient, low-stakes environment between sessions. It does not assess speech, does not diagnose disorders, and does not replace a speech-language pathologist. Any child using this feature should already be working with an SLP who has identified target sounds and given the parents a practice plan.

What Blip adds: patience, consistency, and a reason to practice. Kids will practice for a voice they've named and a hedgehog they've bonded with in a way they will not practice for a worksheet.

The technical challenge: Whisper wasn't built for this

Blip uses faster-whisper for speech-to-text. It's good. But it wasn't designed for phoneme-level articulation analysis — it was designed to transcribe what someone said, not to evaluate how they said it. Those are different problems.

A child who substitutes W for R ("wabbit" instead of "rabbit") will often get transcribed as saying the correct word, because Whisper knows what was intended from context. That's useful for general transcription. It's useless for articulation practice, where the whole point is to catch the substitution.

The workaround: compare the transcription to what was expected. If the target word is "rabbit," check whether the transcription actually shows the R or whether it shows a W-substitution pattern like "wabbit." Claude then turns the result of that comparison into appropriate feedback. This is not phoneme-level acoustic analysis. We're working with transcription artifacts, so it only sees substitutions that survive into the transcript; when Whisper normalizes "wabbit" to "rabbit," there is nothing left to flag. Even so, in practice it catches the most common substitutions reliably enough to be useful.
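A minimal sketch of that comparison step, in Python. Everything named here (SUBSTITUTIONS, TrialResult, check_trial) is hypothetical and for illustration only, not Blip's actual code. The only substitution taken from this post is W for R; any other entries would have to come from the child's SLP. It assumes the transcript is already a plain string, for example the joined segment text from faster-whisper.

```python
import re
from dataclasses import dataclass

# Hypothetical table: target sound -> letters a transcript might show instead.
# Only the W-for-R entry comes from the post.
SUBSTITUTIONS = {"r": ["w"]}


@dataclass
class TrialResult:
    target_word: str
    heard: str
    outcome: str  # "match", "substitution", or "other"


def normalize(text: str) -> str:
    """Lowercase the text and drop everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def check_trial(target_word: str, transcript: str, target_sound: str) -> TrialResult:
    """Compare a transcript against the expected word for one practice trial."""
    target = normalize(target_word)
    heard = normalize(transcript)
    if heard == target:
        return TrialResult(target_word, heard, "match")
    # Does the transcript equal the target with the sound swapped out?
    for sub in SUBSTITUTIONS.get(target_sound, []):
        if target.replace(target_sound, sub) == heard:
            return TrialResult(target_word, heard, "substitution")
    return TrialResult(target_word, heard, "other")


result = check_trial("rabbit", "Wabbit.", "r")
print(result.outcome)
```

This sketch replaces every occurrence of the target letter at once and relies on exact spelling, so a transcript like "wabit" would land in "other." A "match" is also only as trustworthy as the transcript: it cannot tell a clean R from one that Whisper corrected on its own.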

I'm also keeping the target list tight: R, S, L, TH (voiced and voiceless), SH, CH, J, K, G. The phonemes that actually show up in clinical practice. Not a complete phoneme inventory — a practical one.

ADHD-friendly by default

This is the design principle I keep coming back to. Every child benefits from practice tools designed for the child with the shortest attention span, the lowest tolerance for repetition, and the strongest reaction to correction. If you build for that kid, you've built for everyone. If you build for the compliant, patient, average child, you've built something that works for a narrower population than you think.

What that means concretely:

No correction language. Blip never says "wrong" or "try again." It reflects and redirects. "That one's tricky — here's how my tongue does it: [demo]. Want to try the sneaky way?"
Disguised repetition. Practicing the R sound twelve times in a row is the goal. Feeling like you practiced it twelve times is what kills the session. The activities cycle through the same target in different game contexts — word hunt, story fill-in, silly sentence — so the repetition is embedded rather than naked.
Micro-burst structure. Sessions start at three minutes. Not because three minutes is optimal for acquisition — it's not — but because a child who succeeds at three minutes and wants more will ask for more. A child who is forced through fifteen minutes will refuse tomorrow.
Variable rewards. Kids habituate to fixed reward schedules. Variable ones maintain engagement. Blip's celebration tiers (a quiet "nice," a medium reaction, a full character animation) fire on a weighted random schedule, not after every correct trial.
Voluntary continuation. At the end of each micro-burst, Blip asks if the child wants to keep going. It doesn't push. Over weeks, the average session length naturally extends — because the child is choosing it, not being required to endure it.
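Two of those principles are easy to sketch in Python: the weighted random celebration and the three-minute micro-burst. The tier names, the weights, and both function names below are assumptions made up for illustration; the post does not specify the real values.

```python
import random

# Hypothetical weights. "none" means no celebration fires on this trial,
# which is what keeps the schedule from rewarding every correct answer.
CELEBRATION_WEIGHTS = {"none": 50, "quiet": 30, "medium": 15, "full": 5}

# Sessions start at three minutes.
BURST_SECONDS = 3 * 60


def pick_celebration(rng: random.Random) -> str:
    """Draw one celebration tier according to the weights above."""
    tiers = list(CELEBRATION_WEIGHTS)
    weights = list(CELEBRATION_WEIGHTS.values())
    return rng.choices(tiers, weights=weights, k=1)[0]


def burst_finished(elapsed_seconds: float, burst_seconds: float = BURST_SECONDS) -> bool:
    """True once the micro-burst is over and Blip should ask about continuing."""
    return elapsed_seconds >= burst_seconds


rng = random.Random()
print(pick_celebration(rng))
```

Passing the random.Random instance in, rather than using the module-level functions, makes the schedule reproducible in tests by seeding it.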

The SLP relationship

I've thought about this carefully. The parent dashboard for this module includes a session export — date, target phoneme, trial count, approximate accuracy — that a parent can share with their child's SLP. The format is plain text, not a clinical report, and the language reflects that: "Jaxsen practiced his target sound for 4 minutes on Tuesday. He got 18 out of 24 trials right." The SLP can decide what to do with that information.
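As a rough sketch of that export in Python: one record per session, rendered as a single plain sentence. SessionSummary and format_summary are hypothetical names, and the wording is a close variant of the example above that also names the target sound. The date is an arbitrary Tuesday chosen for the example.

```python
from dataclasses import dataclass
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


@dataclass
class SessionSummary:
    child_name: str
    day: date
    target_sound: str
    minutes: int
    correct: int
    trials: int


def format_summary(s: SessionSummary) -> str:
    """Render one session as a plain-language sentence, not a clinical report."""
    weekday = WEEKDAYS[s.day.weekday()]  # date.weekday(): Monday is 0
    unit = "minute" if s.minutes == 1 else "minutes"
    return (
        f"{s.child_name} practiced the {s.target_sound} sound for "
        f"{s.minutes} {unit} on {weekday} and got "
        f"{s.correct} out of {s.trials} trials right."
    )


summary = SessionSummary("Jaxsen", date(2024, 1, 2), "R", 4, 18, 24)
print(format_summary(summary))
```

Reporting raw counts instead of a percentage is deliberate in this sketch: "18 out of 24" reads as a description of one session rather than a score.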

What I'm not doing: building a "score" that parents or kids fixate on. Articulation work is not about accuracy rates; it's about generalization — getting the sound to appear spontaneously in conversation. A practice module can support that process. It cannot measure it.

Privacy, again

All voice processing stays on the device. This matters more for this module than for spelling bees. We're talking about children with speech differences practicing sounds they find difficult. That audio is sensitive. It doesn't go anywhere.

The only external call is text to Claude: the transcription and the target word, not the audio. Claude generates the feedback response. Audio is transcribed on the device and is never stored.
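One way to keep that boundary honest is to build the outgoing request in a single place that accepts only text. A minimal Python sketch follows; build_feedback_request and the field names are hypothetical, and the actual call to Claude is left out.

```python
import json


def build_feedback_request(target_word: str, transcript: str) -> str:
    """Build the text-only body that leaves the device.

    Anything that is not a str (raw audio bytes, for instance) is rejected.
    """
    for name, value in (("target_word", target_word), ("transcript", transcript)):
        if not isinstance(value, str):
            raise TypeError(f"{name} must be text, got {type(value).__name__}")
    return json.dumps({"target_word": target_word, "transcript": transcript})


print(build_feedback_request("rabbit", "wabbit"))
```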

What's next

This module builds on top of Phase 2 (Claude integration), which I'm finishing now. I'm planning to test it with Jaxsen first — he's been in speech therapy and he is an honest critic. If he asks to use it again, it's working. If he doesn't, I have more work to do.

More from the build log as it comes together.