Deploying Blip on a $300 Mini PC Is Humbling

The whole time I was building Blip on the workstation, I was also lying to myself a little. The workstation has an RTX 4090. Whisper runs in 200ms. OmniVoice TTS is local. Everything is fast, and I took credit for the snappy feel of the system without admitting how much of it was just raw hardware doing the work.

The Beelink SER5 MAX is the machine that actually goes in the kids' room. AMD Ryzen 7 5825U, integrated Radeon graphics, 32GB RAM. About $300 on Amazon. It's a good mini PC. It has no CUDA cores.

That distinction matters a lot when you've built a system that depends on GPU inference at every stage.

The STT problem

faster-whisper on the workstation takes around 200ms to transcribe a 3-second kid utterance. On the SER5, running the same small.en model on CPU, that number is closer to 4 seconds, which is unacceptable for a voice-interactive system. The whole premise falls apart if there's a 4-second dead silence after the kid finishes talking.

My first instinct was to just use the smallest Whisper model possible — tiny.en gets down to about 1.5 seconds on that CPU. But the accuracy is noticeably worse on kid speech, and "noticeably worse" on a spelling tutor means wrong answers and confused kids. I spent a Friday testing every combination of model size and compute type before accepting that the answer wasn't going to be local CPU inference.
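A sweep over model size and compute type can be sketched in a few lines. This is a minimal sketch, not the actual test script: the helper names (time_transcribe, make_faster_whisper, sweep) are hypothetical, and it assumes the faster-whisper package with its WhisperModel class and a short WAV file of a real utterance.

```python
import time
from statistics import median


def time_transcribe(transcribe, audio_path, runs=5):
    """Return the median wall-clock seconds for transcribe(audio_path)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)
        timings.append(time.perf_counter() - start)
    return median(timings)


def make_faster_whisper(model_size, compute_type):
    """Build a transcribe callable backed by faster-whisper on CPU."""
    # Imported lazily so the timing helpers work without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type=compute_type)

    def transcribe(audio_path):
        segments, _info = model.transcribe(audio_path)
        # segments is a lazy generator; consuming it is what runs the decoding.
        return " ".join(segment.text.strip() for segment in segments)

    return transcribe


def sweep(audio_path, model_sizes, compute_types, make=make_faster_whisper):
    """Time every (model size, compute type) pair, fastest first."""
    results = []
    for size in model_sizes:
        for ctype in compute_types:
            seconds = time_transcribe(make(size, ctype), audio_path)
            results.append((seconds, size, ctype))
    return sorted(results)
```

Calling sweep("utterance.wav", ["tiny.en", "base.en", "small.en"], ["int8", "float32"]) would give a ranked list of timings. It only measures speed; accuracy on kid speech still has to be judged by reading the transcripts.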

The SER5 now sends audio to the inference box over the 10Gb direct link. Transcription happens on Machine B's Blackwell GPU and comes back in under 300ms including network round-trip. Works well. Adds a dependency I don't love.

Audio device configuration is never simple

The workstation uses an Anker PowerConf S3 — a conference speaker with decent noise cancellation, omnidirectional mic, and audio that sounds pretty good through PipeWire. Plug it in, run wpctl set-default on the right sink and source, update the config, done.

The SER5 production setup uses a Jabra Speak2 40. Same category of device, different everything. Different USB descriptors, different sound card numbering, different PipeWire behavior, different volume normalization characteristics. The first time I plugged it in and ran Blip, the wake word sensitivity was completely wrong — calibrated for the S3's input curve, and the Jabra just sounded different enough that detection was firing on ambient noise and missing actual "Hey Blip" calls about 30% of the time.

Not 100% of the time. 30%. Which is the worst case, because the kids just thought Blip was being "silly" instead of broken. It took me longer to track down than it should have because I wasn't there watching every session.

The fix was recalibrating the silence threshold and microphone gain in the SER5-specific config, and setting PipeWire to lock to the Jabra source explicitly by name rather than by device index — because device indices change when you unplug and replug things, which kids do constantly.
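The same name-over-index idea can be applied on the application side too. The sketch below is an illustration at a different layer than the PipeWire setting described above: find_device_by_name is a hypothetical helper, and it assumes device entries shaped like the dicts returned by sounddevice.query_devices() (with "name", "max_input_channels" and "max_output_channels" keys). The exact name string the Jabra reports is also an assumption.

```python
def find_device_by_name(devices, name_fragment, need_input=True):
    """Return the index of the first device whose name contains name_fragment.

    devices is a sequence of dicts like those from sounddevice.query_devices().
    The index is looked up fresh each time, so it survives a replug that
    renumbers the devices.
    """
    channel_key = "max_input_channels" if need_input else "max_output_channels"
    wanted = name_fragment.lower()
    for index, device in enumerate(devices):
        # Skip entries that match by name but cannot capture (or play) audio.
        if wanted in device["name"].lower() and device[channel_key] > 0:
            return index
    raise LookupError(f"no audio device matching {name_fragment!r}")
```

With sounddevice installed, find_device_by_name(sounddevice.query_devices(), "Jabra") would resolve the current input index at startup instead of trusting a number saved in the config.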

Two machines, two configs

This sounds obvious, but it took me a few bad syncs before I fully internalized it: the workstation and the SER5 cannot share a config.yaml. They run different audio devices, point at different TTS endpoints, and have different silence thresholds that reflect the acoustic properties of different rooms.

The workstation TTS endpoint is http://localhost:8100 — OmniVoice runs locally. The SER5 endpoint is http://inference:8100 — it reaches across the LAN to the inference box, because there's no local GPU to run OmniVoice. I accidentally pushed the wrong config once and spent 20 minutes confused about why TTS was failing before I noticed the endpoint mismatch.

They're now in separate directories and managed separately. No symlinking, no "shared base with overrides," nothing clever. Just two configs, clearly named, touched independently. The cost of that simplicity is that I have to remember to update both when something structural changes. I've forgotten twice.
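One cheap guard against forgetting is a check that compares the shape of the two files while ignoring their values. This is a sketch under assumptions, not something the setup currently has: key_paths and structural_drift are hypothetical helpers, and the configs are assumed to load into nested dicts (for example via yaml.safe_load).

```python
def key_paths(config, prefix=""):
    """Flatten a nested dict into a set of dotted key paths."""
    paths = set()
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty sections; only leaf keys are recorded.
            paths |= key_paths(value, path)
        else:
            paths.add(path)
    return paths


def structural_drift(config_a, config_b):
    """Return (keys only in a, keys only in b); values are ignored."""
    a, b = key_paths(config_a), key_paths(config_b)
    return sorted(a - b), sorted(b - a)
```

The two files keep their different endpoints and thresholds, but a key that exists in one and not the other shows up immediately. Both lists coming back empty means the configs still have the same structure.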

What happens when inference goes down

The inference box is a desktop machine running 24/7 in my office. It's reliable until it isn't. I've had three unplanned restarts since I set this up — one was a BIOS update that auto-rebooted, one was a kernel panic I still don't fully understand, and one was my own fault (I was benchmarking something and pushed the GPU harder than I expected).

When inference goes down, the SER5 loses TTS entirely. Blip can still hear the kids, because audio capture and wake word detection run locally, but it can't respond. The failure mode is silence, which is alarming to an 8-year-old who asked a question and got nothing back.

There's a Piper fallback configured. When OmniVoice on the inference box is unreachable, Blip falls back to a local Piper voice running on the SER5's CPU. It sounds different — noticeably flatter — but it works, and at least Blip can say "I'm having a little trouble with my voice right now, but I can still hear you." The kids seem to accept this. Jaxsen once asked if Blip had a cold. I said yes.
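The shape of that fallback is simple enough to sketch. This is a minimal illustration, not Blip's actual code: speak_with_fallback is a hypothetical name, and primary and fallback stand in for whatever functions call OmniVoice and Piper. It assumes the primary client raises an OSError subclass when the server is unreachable, which covers ConnectionError, TimeoutError, socket errors, and urllib's URLError.

```python
def speak_with_fallback(text, primary, fallback, on_fallback=None):
    """Synthesize text with the primary voice, or the fallback if unreachable.

    primary and fallback are callables taking the text and returning audio.
    on_fallback, if given, is called with the error so the outage can be logged.
    """
    try:
        return primary(text)
    except OSError as error:
        # Network-level failure: the remote TTS server could not be reached.
        if on_fallback is not None:
            on_fallback(error)
        return fallback(text)
```

Giving the primary call a short timeout matters as much as the fallback itself. Without one, a hung connection produces the same long silence the fallback is meant to prevent.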

The latency difference is real

On the workstation: say something, hear a response in about 800ms start-to-finish. On the SER5: closer to 1.2 seconds, sometimes more. Most of that difference is the extra network hops — audio goes out to inference for STT, response comes back from inference for TTS.

800ms feels fast. 1.2 seconds feels like a pause. That's not just intuition: research on conversational turn-taking suggests that people start to read silences longer than about 700ms as discomfort or disengagement. The SER5 is right at the edge where the interaction starts feeling slightly sluggish, and I notice it more now than when I was testing exclusively on the workstation.

The fix is on the roadmap: torch.compile on the OmniVoice server, moving Whisper large-v3-turbo to the inference box permanently, and tuning the sentence buffer so the first words come out faster even if the full response takes longer. I haven't implemented any of that yet. Ask me again in a month.
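To make the sentence-buffer idea concrete, here is one possible tuning, sketched under assumptions rather than taken from Blip: chunk_for_tts and min_first_chars are hypothetical names. The first chunk is allowed to break at a comma or other clause boundary so speech can start sooner, and later chunks wait for a full sentence.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]\s+")
# For the first chunk only, a comma, semicolon or colon also counts as a break.
CLAUSE_END = re.compile(r"[.!?,;:]\s+")


def chunk_for_tts(fragments, min_first_chars=12):
    """Yield speakable chunks from a stream of text fragments."""
    buffer = ""
    first = True
    for fragment in fragments:
        buffer += fragment
        while True:
            if first:
                # Skip boundaries that would make the first chunk too short.
                match = CLAUSE_END.search(buffer, min_first_chars)
            else:
                match = SENTENCE_END.search(buffer)
            if match is None:
                break
            # Keep the punctuation mark, drop the whitespace after it.
            yield buffer[:match.start() + 1].strip()
            buffer = buffer[match.end():]
            first = False
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk would go to TTS as soon as it is available. Lowering min_first_chars trades a more natural first phrase for a faster first sound.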

What Phase 8 actually means

The plan has always been to eventually run Blip on Windows, since that's what the SER5 would ship with to a non-technical family. Right now it runs Linux. The Windows port is Phase 8, and it's been Phase 8 for a while.

The problem is that the audio stack is completely different on Windows. PipeWire doesn't exist there. openWakeWord has a Windows build but it's finicky. The sounddevice library (python-sounddevice) works, but device enumeration behaves differently, and all the look-up-the-device-by-name logic I wrote assumes PipeWire semantics. I'm not looking forward to it.

For now, both kids are using the Linux SER5, and it mostly works. "Mostly works" is a lower bar than I wanted, but it's an honest description of where things are. The session recorder is on, I'm watching the JSONL logs, and I'm tuning based on what I actually see happening rather than what I expect to happen. That's a better development loop than what I had when everything was running on the workstation.