Blip Build Log #1: Getting the Audio Pipeline Working
Phase 1 of Blip is done. The loop works: say "Hey Blip," the system wakes up, listens, transcribes what you said, sends it to Claude, gets a response, and speaks it back. Start to finish, under two seconds on good hardware. Here's what it took.
The wake word problem
The always-on layer has to be cheap — it runs 24 hours a day waiting for a trigger. A full speech recognition model running continuously would eat CPU and burn memory. Instead, wake word detection uses a purpose-built library called Porcupine from Picovoice. It uses under 1% CPU and runs entirely on-device. No audio ever leaves the machine until after the wake word fires.
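For a sense of what that always-on loop looks like, here is a minimal sketch, not Blip's actual code. It assumes the `pvporcupine` Python package, whose `process()` call returns a keyword index (0 or higher) when the wake word fires and -1 otherwise. The `wait_for_wake_word` and `make_porcupine` names, the access key, and the keyword file path are all placeholders made up for illustration.

```python
from typing import Iterable, Optional, Protocol, Sequence


class FrameDetector(Protocol):
    """Anything with a Porcupine-style process() method."""

    def process(self, pcm: Sequence[int]) -> int: ...


def wait_for_wake_word(detector: FrameDetector,
                       frames: Iterable[Sequence[int]]) -> Optional[int]:
    """Feed audio frames to the detector until the wake word fires.

    Returns the index of the frame that triggered the detection, or None
    if the frame stream ends first. Nothing is sent anywhere here: the
    frames only ever reach the local detector.
    """
    for i, frame in enumerate(frames):
        if detector.process(frame) >= 0:  # -1 means "no keyword in this frame"
            return i
    return None


def make_porcupine(access_key: str, keyword_path: str):
    """Build a real Porcupine engine (hypothetical helper).

    Imported lazily so the loop above can be tested without the package.
    keyword_path would point at the custom "Hey Blip" model file.
    """
    import pvporcupine
    return pvporcupine.create(access_key=access_key,
                              keyword_paths=[keyword_path])
```

In a real loop the frames would come from the microphone in chunks of `frame_length` 16-bit samples at the engine's `sample_rate`, and the speech recognizer would only start once this function returns.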
Getting a custom "Hey Blip" wake word trained was the first real friction point. Picovoice has a web console where you record samples and it trains a model file for you — their free tier covers personal use. The model is a small binary file (~3KB) that loads into the library. Works well.
Speech-to-text with kids' voices
This is harder than it sounds. Most voice recognition is trained predominantly on adult speech. Kids' pacing is uneven (rushed in one phrase, drawn out in the next), and they mispronounce things, use incomplete sentences, and restart or interrupt themselves mid-thought. Error rates on standard models are noticeably worse for kids than for adults.
I'm using faster-whisper, a reimplementation of OpenAI's Whisper model that's built for speed. The "small" English-only model runs comfortably on the mini PC's integrated GPU. Accuracy is good enough for the use case: Jaxsen and Adalind are understood correctly the vast majority of the time. The system handles misrecognitions gracefully: if it's not sure what was said, Blip asks them to say it again.
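One way the "not sure, ask again" check could work is sketched below. It is an illustration under assumptions, not Blip's actual logic. faster-whisper segments expose `text`, `avg_logprob`, and `no_speech_prob`, and the sketch reads only those. The thresholds, the `needs_repeat` and `transcribe` names, and the `Segment` stand-in class are made up here and would need tuning against real recordings of kids.

```python
from typing import Iterable, NamedTuple, Tuple


class Segment(NamedTuple):
    """Stand-in with the fields this sketch reads from a faster-whisper segment."""
    text: str
    avg_logprob: float
    no_speech_prob: float


def needs_repeat(segments: Iterable, min_avg_logprob: float = -1.0,
                 max_no_speech_prob: float = 0.6) -> bool:
    """Return True when the transcript looks too shaky to act on.

    Both thresholds are illustrative guesses, not tuned values.
    """
    segs = list(segments)
    if not segs or not "".join(s.text for s in segs).strip():
        return True  # nothing was transcribed at all
    mean_logprob = sum(s.avg_logprob for s in segs) / len(segs)
    worst_no_speech = max(s.no_speech_prob for s in segs)
    return mean_logprob < min_avg_logprob or worst_no_speech > max_no_speech_prob


def transcribe(path: str) -> Tuple[str, bool]:
    """Transcribe a recording and report whether Blip should ask again.

    Requires the faster-whisper package; imported lazily so needs_repeat
    can be exercised without it. Device and precision are left at the
    library defaults here.
    """
    from faster_whisper import WhisperModel
    model = WhisperModel("small.en")
    segments, _info = model.transcribe(path, beam_size=5)
    segs = list(segments)  # transcribe() yields segments lazily
    return " ".join(s.text.strip() for s in segs), needs_repeat(segs)
```

In practice the model would be loaded once at startup rather than on every call; it is created inside `transcribe` here only to keep the sketch short.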
Text-to-speech that sounds like a character
Blip needs a voice that sounds friendly and a little playful, not a generic assistant voice. I'm using Piper, a fast local TTS engine from Rhasspy. The Lessac medium voice is the best balance of quality and speed I've found that runs locally without a GPU. Edge-TTS (Microsoft's cloud TTS, free tier) is the fallback if Piper has issues.
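The Piper-first, Edge-TTS-second arrangement can be sketched as a simple ordered fallback. This is a rough illustration, not the real implementation. It assumes the `piper` command-line tool is on the PATH with its `--model` and `--output_file` flags, and the `edge-tts` Python package. The model filename, the cloud voice name, and the function names are placeholders chosen for the example.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# An engine is a (name, synthesize) pair; synthesize writes audio to out_path.
Engine = Tuple[str, Callable[[str, str], None]]


def piper_tts(text: str, out_path: str) -> None:
    """Synthesize locally with the piper CLI (model path is a placeholder)."""
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx",
         "--output_file", out_path],
        input=text.encode("utf-8"), check=True, timeout=30,
    )


def edge_tts_cloud(text: str, out_path: str) -> None:
    """Synthesize with edge-tts (voice name is an assumption).

    Note that this path sends the text to a cloud service, and the audio
    it saves may be in a different format than Piper's WAV output.
    """
    import asyncio
    import edge_tts
    asyncio.run(edge_tts.Communicate(text, "en-US-AriaNeural").save(out_path))


def speak_with_fallback(text: str, out_path: str,
                        engines: Sequence[Engine]) -> str:
    """Try each engine in order; return the name of the one that worked."""
    errors = []
    for name, synthesize in engines:
        try:
            synthesize(text, out_path)
            return name
        except Exception as exc:  # any failure moves on to the next engine
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS engines failed: " + "; ".join(errors))
```

A call would look like `speak_with_fallback("Hi there!", "reply.wav", [("piper", piper_tts), ("edge", edge_tts_cloud)])`. Keeping the engines as plain callables means the ordering can be tested without either tool installed.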
The voice isn't perfect. It's noticeably synthetic. But kids adapt faster than adults do — within a few sessions, Jaxsen stopped noticing it wasn't a "real" voice and just started talking to Blip like it was a person.
Stitching it together
The audio pipeline is managed by a Python state machine: IDLE → LISTENING → THINKING → SPEAKING → back to IDLE. Each state transition plays a sound effect (a short chime when waking up, a soft tone while processing) so kids have non-verbal feedback about what Blip is doing.
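A stripped-down sketch of that state cycle is below. The state names come from the description above; everything else (the `Pipeline` class, the `advance` method, the cue filenames, and the injected `play_cue` callback) is hypothetical and only there to show the shape. Only the two cues mentioned above are mapped; the others could be added to the same table.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# The fixed cycle: IDLE -> LISTENING -> THINKING -> SPEAKING -> IDLE.
NEXT_STATE = {
    State.IDLE: State.LISTENING,
    State.LISTENING: State.THINKING,
    State.THINKING: State.SPEAKING,
    State.SPEAKING: State.IDLE,
}

# Sound cue played on entering a state (filenames are placeholders).
ENTRY_CUES = {
    State.LISTENING: "wake_chime.wav",
    State.THINKING: "thinking_tone.wav",
}


class Pipeline:
    """Tracks the current state and plays a cue on each mapped transition."""

    def __init__(self, play_cue: Callable[[str], None] = lambda _f: None):
        self.state = State.IDLE
        self._play_cue = play_cue  # injected so tests need no audio device

    def advance(self) -> State:
        """Move to the next state in the cycle and play its cue, if any."""
        self.state = NEXT_STATE[self.state]
        cue = ENTRY_CUES.get(self.state)
        if cue is not None:
            self._play_cue(cue)
        return self.state
```

Keeping the transitions in a lookup table makes it hard to end up in an impossible state, and passing the cue player in as a callback means the cycle can be exercised without speakers attached.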
The Anker speakerphone handles echo cancellation automatically — Blip's own voice doesn't trigger the wake word while it's speaking. This was one of the things I was most worried about, and it just works out of the box.
What's next
Phase 2 is wiring in Claude as the actual brain. Right now Blip can listen and speak but it's not intelligent — it echoes back what you said, like a parrot. The state machine needs a real orchestrator: conversation history, activity detection, age-appropriate guardrails. That's the next piece.
More in the next build log. If you want to follow along, the Building & Tinkering category is where I'm posting these.