Blip Is Getting Faster — Latency Fixes, a Local Model Win, and What Comes Next

The SER5 was taking about 8 seconds from when a kid finished talking to when Blip started responding. Eight seconds is not a conversation — it's a timeout. Jaxsen started repeating himself, assuming Blip hadn't heard. Adalind stopped mid-question a few times and wandered off.

I spent the better part of a week on this, and the number is now around 2.5 seconds. Here's what I changed, what I learned along the way, and where I'm taking this next.

Why 8 seconds

The SER5 has no GPU. Every stage of Blip's pipeline that I'd built assuming CUDA had to fall back to CPU or network, and I hadn't fully accounted for what that meant until I started watching the timing logs.

STT was the worst offender. faster-whisper small.en on the SER5's AMD Ryzen CPU was running about 4 seconds per kid utterance. Kids talk in short bursts — "Hey Blip, how do you spell dinosaur" is about 3 seconds of audio — and 4 seconds to transcribe it is already longer than the thing being transcribed. That's just broken.

After that: Claude Sonnet takes around 2.5 seconds for a typical response, and OmniVoice, the TTS running on the inference box, returns in about 1.5 seconds. Those three stages alone add up to roughly 8 seconds, with some Python overhead on top.
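To make the budget concrete, here is a small sketch that adds up the approximate per-stage numbers quoted above and shows each stage's share. The stage names are labels I made up for illustration, and the figures are the rounded ones from this post, not fresh measurements.

```python
# Approximate per-stage latencies quoted in the post, in seconds.
# The dictionary keys are illustrative labels, not real config names.
BEFORE = {
    "stt_local_cpu": 4.0,   # faster-whisper small.en on the SER5 CPU
    "llm_claude": 2.5,      # typical Claude Sonnet response
    "tts_omnivoice": 1.5,   # OmniVoice on the inference box
}


def total_latency(stages: dict[str, float]) -> float:
    """Sum stage latencies, assuming the stages run strictly one after another."""
    return sum(stages.values())


def worst_stage(stages: dict[str, float]) -> str:
    """Return the name of the slowest stage."""
    return max(stages, key=stages.get)


if __name__ == "__main__":
    total = total_latency(BEFORE)
    for name, seconds in BEFORE.items():
        print(f"{name:>14}: {seconds:.1f}s ({seconds / total:.0%})")
    print(f"{'total':>14}: {total:.1f}s, slowest stage: {worst_stage(BEFORE)}")
```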

Three fixes, not one

I'd hoped there was one thing I could change. There wasn't. The latency was spread across three different stages, and each one needed its own fix.

STT: go remote. The inference box runs Voxtral on a Blackwell GPU. A 3-second kid utterance comes back in about 300ms, including the network round-trip over the 10Gb direct link. That's roughly a 13x improvement on the transcription stage (4 seconds down to 0.3). Yes, this adds a network dependency: if the inference box is down, STT fails. I made peace with that. The alternative was 4-second transcription on every single utterance, which is worse than an occasional failure.
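The trade-off above (fast when the box is up, a clean failure when it isn't) could be sketched roughly like this. Everything here is hypothetical: `send` stands in for whatever actually ships audio to the inference box, and `SttUnavailable` and the 1.5-second timeout are names and values I picked for illustration.

```python
import time
from typing import Callable


class SttUnavailable(RuntimeError):
    """Raised when the remote STT backend cannot be reached or times out."""


def transcribe_remote(
    audio: bytes,
    send: Callable[[bytes, float], str],
    timeout_s: float = 1.5,  # assumed value, not taken from the real setup
) -> tuple[str, float]:
    """Send audio to the remote STT and return (transcript, elapsed_seconds).

    `send` is a hypothetical transport callable that takes the audio and a
    timeout and returns text. Network failures, including timeouts, are
    subclasses of OSError and get turned into one well-defined error so the
    caller can fail fast instead of hanging.
    """
    start = time.perf_counter()
    try:
        text = send(audio, timeout_s)
    except OSError as exc:  # covers TimeoutError and ConnectionError
        raise SttUnavailable(f"remote STT failed: {exc}") from exc
    return text.strip(), time.perf_counter() - start
```

The caller can catch `SttUnavailable` and have Blip say it didn't catch that, rather than leaving the kid in silence.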

The wait during LLM: play something. There's an unavoidable gap while the language model generates the response. The fix isn't to make the LLM faster — it's to fill the silence. Blip now starts playing a short filler phrase ("Hmm, let me think about that..." or "Ooh good one...") the moment STT finishes, then overlaps the beginning of the real response with the tail of the filler audio. The gap is still there. It just doesn't feel like a gap anymore. Kids don't notice pauses if something is happening. They notice silence.
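The core of the filler trick is starting the LLM request and the filler playback at the same moment. Here is a minimal asyncio sketch of that idea. `play` and `generate_reply` are stand-ins that just sleep, the durations are made up, and the sketch waits for the filler to finish rather than modeling the crossfade into the real response.

```python
import asyncio
import random

FILLERS = ["Hmm, let me think about that...", "Ooh good one..."]


async def play(text: str, seconds: float, log: list[str]) -> None:
    """Stand-in for audio playback: log start and end around a sleep."""
    log.append(f"start:{text}")
    await asyncio.sleep(seconds)
    log.append(f"end:{text}")


async def generate_reply(prompt: str, seconds: float) -> str:
    """Stand-in for the LLM call: sleep, then return a canned reply."""
    await asyncio.sleep(seconds)
    return f"reply to: {prompt}"


async def respond(
    prompt: str,
    log: list[str],
    llm_seconds: float = 0.05,
    filler_seconds: float = 0.02,
) -> str:
    # Start the LLM request first so it runs while the filler is playing.
    reply_task = asyncio.create_task(generate_reply(prompt, llm_seconds))
    filler_task = asyncio.create_task(
        play(random.choice(FILLERS), filler_seconds, log)
    )
    reply = await reply_task
    await filler_task  # never cut the filler off mid-phrase
    await play(reply, 0.0, log)
    return reply


if __name__ == "__main__":
    events: list[str] = []
    asyncio.run(respond("how do you spell dinosaur", events))
    print("\n".join(events))
```

Total wall-clock time is roughly the longer of the two waits, not their sum, and the kid hears something from the first moment.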

Phrase cache for common responses. Greetings, session openers, and a handful of drill prompts ("Let's try another one!" / "Ready for the next word?") get pre-generated at startup and cached. When Blip needs one of those, it plays from the cache rather than hitting OmniVoice. That removes the TTS round-trip entirely for a decent chunk of the things Blip actually says.
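A phrase cache like this can be very small. The sketch below is one way to do it, under assumptions: `synthesize` is a hypothetical callable wrapping the TTS backend, and normalizing by case and whitespace is my choice, not necessarily how Blip matches phrases.

```python
import hashlib
from typing import Callable, Iterable

# Two of the drill prompts mentioned above; the real list is longer.
CACHED_PHRASES = ["Let's try another one!", "Ready for the next word?"]


def _key(text: str) -> str:
    """Hash a case- and whitespace-normalized phrase into a cache key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class PhraseCache:
    """In-memory cache of pre-synthesized audio for common phrases."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        # `synthesize` is a hypothetical text -> audio bytes function.
        self._synthesize = synthesize
        self._audio: dict[str, bytes] = {}

    def warm(self, phrases: Iterable[str]) -> None:
        """Pre-generate audio for each phrase; meant to run once at startup."""
        for phrase in phrases:
            self._audio[_key(phrase)] = self._synthesize(phrase)

    def get_audio(self, text: str) -> bytes:
        """Return cached audio if present, otherwise synthesize on demand."""
        cached = self._audio.get(_key(text))
        if cached is not None:
            return cached
        return self._synthesize(text)
```

Cache misses fall through to normal synthesis, so the cache only ever makes things faster.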

Combined: 8 seconds to 2.5. The session recorder is running and I've been watching the logs. The kids stopped repeating themselves.

A detour into spelling

While I was reworking the latency stack, I ran a benchmark I'd been putting off: head-to-head, blip-edu versus Claude Sonnet on spelling drills specifically.

The routing layer had been sending all spelling requests to Claude. My assumption was that spelling required accurate phonics knowledge and careful pacing — the kind of thing a 7B local model wouldn't do as well. I was wrong.

blip-edu scored 8 out of 8 on the test set. Claude scored 7 out of 8. And blip-edu's average response time was 434ms, compared to Claude's 2,990ms. That's nearly seven times faster in a category where it also wins on quality.

I changed one line in the router. Spelling now goes local.

What I think is happening: blip-edu was trained on 14,000 teacher-student conversations that included a lot of spelling drill examples — call a word, confirm spelling, give a hint, encourage the right letter sounds. That's exactly the interaction pattern in the test. The model knows the script. It doesn't need the full capacity of a cloud LLM to execute it well. If anything, Claude's extra reasoning overhead was slowing it down without adding anything.

What the routing layer looks like now

The hybrid router I built classifies each request into an activity type and picks a backend. After the spelling change, the split is:

Local (blip-edu): spelling, math, voice-quality responses
Cloud (Claude): creative requests, trivia, emotional moments, multi-turn conversations, safety signals

Safety always goes to Claude. That's a hard rule: a false negative from blip-edu on a child distress signal has consequences that a math drill mistake doesn't. Everything else is a quality/cost/latency trade-off, and the benchmark data is what drives the routing decisions.
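As a rough illustration of the split above, a routing table with a hard safety override might look like the following. The activity labels, the `Backend` enum, and the choice to send unknown activities to the cloud are all assumptions for the sketch; the real router also has to classify the request first, which isn't shown.

```python
from enum import Enum


class Backend(Enum):
    LOCAL = "blip-edu"
    CLOUD = "claude"


# Illustrative activity labels mirroring the split described above.
ROUTES = {
    "spelling": Backend.LOCAL,
    "math": Backend.LOCAL,
    "voice_quality": Backend.LOCAL,
    "creative": Backend.CLOUD,
    "trivia": Backend.CLOUD,
    "emotional": Backend.CLOUD,
    "multi_turn": Backend.CLOUD,
    "safety": Backend.CLOUD,
}


def pick_backend(activity: str, safety_flag: bool = False) -> Backend:
    """Pick a backend for an already-classified request.

    The safety check comes first and cannot be overridden by the table.
    Unknown activities default to the cloud model, which is an assumption
    made for this sketch.
    """
    if safety_flag or activity == "safety":
        return Backend.CLOUD
    return ROUTES.get(activity, Backend.CLOUD)
```

With a table like this, moving spelling to the local model really is a one-line change.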

Where this is going

The SER5 fix works, but it exposed something I've been avoiding thinking about: the architecture assumes the AI processing happens near the hardware. The SER5 has to reach out to the inference box for STT and TTS already. Claude is cloud-only. blip-edu is local on the inference box. The SER5 is basically a thin client that happens to run the wake word detector and the Electron UI.

Once I accepted that, a different architecture became obvious.

I'm building a cloud brain. The plan is to run the full orchestration layer — session management, intent detection, LLM routing, TTS — on a GPU instance on AWS. A g4dn.xlarge with a T4 GPU, scheduled to run 15 hours a day at about $150/month. DynamoDB for session and learning progress state. FastAPI wrapper that lets any device make a text-in, audio-out request over HTTPS.
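For a sanity check on that budget, here is the arithmetic implied by the two numbers above. The 30-day month is an assumption, and the result is the hourly rate the budget implies, not a quoted AWS price.

```python
HOURS_PER_DAY = 15          # scheduled uptime from the plan above
MONTHLY_BUDGET_USD = 150.0  # rough monthly figure from the plan above
DAYS_PER_MONTH = 30         # assumption for the estimate


def monthly_hours(hours_per_day: int = HOURS_PER_DAY,
                  days: int = DAYS_PER_MONTH) -> int:
    """Instance-hours per month on the schedule."""
    return hours_per_day * days


def implied_hourly_rate(budget_usd: float = MONTHLY_BUDGET_USD) -> float:
    """Hourly rate the instance must average to stay within the budget."""
    return budget_usd / monthly_hours()


if __name__ == "__main__":
    print(f"{monthly_hours()} instance-hours/month")
    print(f"${implied_hourly_rate():.3f}/hour to stay at ${MONTHLY_BUDGET_USD:.0f}")
```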

The thin clients send text (or audio, transcribed locally) and get back a voice response. The SER5 still handles the wake word and the room audio. But the iPhone the kids carry around? That can just be Blip too. No special hardware, no CUDA, no inference box dependencies. The session state lives in DynamoDB and follows the kid from room to room.

This is a bigger architectural change than anything I've done since Phase 1. I don't know how long it'll take. The first milestone is getting the FastAPI layer running on the instance with blip-edu loaded, and being able to send it a text turn and get audio back. Everything else builds from there.

Longer post on that architecture coming when there's something running to write about.