Testing Blip Without a Kid in the Room

For a long time my only way to test Blip was to actually run it. Say "Hey Blip," wait for the chime, ask something, wait for the response, decide if it was wrong. This is not a good testing workflow. It's slow, it depends on me being awake, and it only covers whatever I happen to say. There's no systematic coverage.

I wanted something that would let me run every interaction type — math drill, spelling, profile switch, emotional support, boundary push — in one shot, measure every stage, and tell me what's working and what isn't. No child required.

What the test harness does

The core idea: generate synthetic child-voice audio with edge-tts, then pipe it through Blip's full stack from STT to TTS response. Every stage is timed and logged independently.

The pipeline has five stages:

Voice generation — edge-tts renders the test phrase as a child-like voice (pitch-shifted, rate-adjusted). Saved as 16kHz WAV.
STT — faster-whisper small.en transcribes the WAV. CUDA on the workstation, CPU on SER5.
Intent detection — regex + pattern matcher that catches drill entry, profile switches, parent mode. Runs in under a millisecond.
LLM — either a mock (canned response, instant) or live (blip-edu via Ollama proxy).
TTS — OmniVoice synthesizes the response. Measures round-trip bytes and latency.
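
The per-stage timing can be sketched roughly like this. It's a minimal illustration, not the actual harness: every stage function below is a hypothetical stub standing in for the real edge-tts, faster-whisper, intent router, LLM, and OmniVoice calls, and names like `timed` and `run_utterance` are made up for the example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of one stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


# Hypothetical stubs. In a real harness these would call edge-tts,
# faster-whisper, the intent router, the LLM backend, and OmniVoice.
def generate_voice(phrase):
    return b"fake-wav-bytes"


def transcribe(wav_bytes):
    return "let's do some math"


def detect_intent(transcript):
    return "drill_enter:math" if "math" in transcript else "freeplay"


def mock_llm(transcript, intent):
    return "Okay! What is two plus three?"


def synthesize(reply):
    return b"\x00" * 1024


def run_utterance(phrase):
    """Push one test phrase through all five stages, timing each one."""
    timings = {}
    with timed("voice_gen", timings):
        wav = generate_voice(phrase)
    with timed("stt", timings):
        transcript = transcribe(wav)
    with timed("intent", timings):
        intent = detect_intent(transcript)
    with timed("llm", timings):
        reply = mock_llm(transcript, intent)
    with timed("tts", timings):
        audio = synthesize(reply)
    return {
        "transcript": transcript,
        "intent": intent,
        "reply": reply,
        "tts_bytes": len(audio),
        "timings_ms": timings,
    }
```

Keeping the timings in a plain dict keyed by stage name makes it easy to log each stage on its own line and to drop the synthetic voice-gen stage when summing user-facing latency.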

I wrote 12 test utterances covering the main interaction types. Two child personas: a 9-year-old boy (slightly pitched-up, neutral rate) and a 5-year-old girl (softer, slower). Running the full suite takes about 30 seconds in mock mode.

First run results

12 of 12 passed on the first real run. The numbers:

Stage	Range	Notes
Voice gen (edge-tts)	643–1231ms	Longer for longer phrases
STT (CUDA, small.en)	73–357ms	Fast enough for production
Intent detection	<1ms	Pure pattern match, no model
LLM (blip-edu via proxy)	~1400ms	First token latency
TTS (OmniVoice)	~1250ms	Full audio synthesis
Total	2.6–4.9s	End-to-end

Under 5 seconds end-to-end for a full voice interaction. That's livable. The voice gen stage isn't part of the real pipeline (it's synthetic input), so the actual user-facing latency is closer to 2.7 to 3.0 seconds from speech end to spoken response, going by the STT, LLM, and TTS rows above. That's what a kid waits.

The speech correction problem

One of my test phrases was "The wabbit wan weally fast" — deliberate articulation errors to simulate a younger child with an /r/ → /w/ substitution. This is a real speech pattern. Lots of kids do it.

Whisper transcribed it as: "The rabbit run really fast."

That is impressive and also a problem, depending on what you're trying to do. For understanding intent — great. The child was talking about a rabbit running fast, Whisper got it right. For speech therapy — that's exactly the information I needed, destroyed. The whole point of the speech practice mode is to catch the substitution, give feedback, help the child practice the /r/ sound. If Whisper corrects it before the analyzer sees it, there's nothing to analyze.

This is why Blip has a separate phoneme analysis layer (WavLM) that runs on raw audio, not on Whisper output. The diagnostic confirmed that the separation is necessary, not optional.

Intent detection accuracy

The intent router did well on the structured commands. "Let's do some math" → drill_enter: math. "I want to practice spelling" → drill_enter: spelling. "Switch to Addie" → Whisper transcribed "Addy" but the profile matcher still mapped it correctly to the right profile.
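
A regex router with a fuzzy profile match might look something like the sketch below. This is an assumption about the shape of the thing, not Blip's actual router: the patterns, the `PROFILES` list (only Addie comes from the test run, "Max" is made up), and the 0.6 similarity cutoff are all illustrative.

```python
import difflib
import re

# Hypothetical profile list. Only "Addie" appears in the test run.
PROFILES = ["Addie", "Max"]

# Illustrative patterns, not the real router's.
DRILL_RE = re.compile(r"\b(?:let'?s do|practice)\b.*?\b(math|spelling)\b", re.I)
SWITCH_RE = re.compile(r"\bswitch to (\w+)", re.I)


def match_profile(name, profiles=PROFILES, cutoff=0.6):
    """Map a possibly misspelled name ("Addy") to a known profile, or None."""
    by_lower = {p.lower(): p for p in profiles}
    hits = difflib.get_close_matches(name.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


def detect_intent(text):
    """Return an (intent, argument) pair for one transcript."""
    switch = SWITCH_RE.search(text)
    if switch:
        profile = match_profile(switch.group(1))
        if profile:
            return ("profile_switch", profile)
    drill = DRILL_RE.search(text)
    if drill:
        return ("drill_enter", drill.group(1).lower())
    return ("freeplay", None)
```

Fuzzy matching on the captured name is what lets a Whisper spelling like "Addy" still land on the right profile; the cutoff would need tuning against real names so that two similar-sounding siblings don't collide.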

"Parent mode" didn't trigger a parent_mode intent. That one needs a fix — there's a mismatch between the phrase I expected to work and what the intent router is actually looking for. Easy fix, useful catch.

"Tell me a story about a dragon in space" came back as drill_enter: story, which is technically correct (Blip treats creative storytelling as a mode) but debatable — freeplay conversation would probably be a better classification for this specific phrasing. Something to tune.

The Whisper / number problem

When the test utterance was just "Seven," Whisper transcribed it as "7." The digit, not the word. This is consistent behavior: Whisper normalizes number words to numerals whenever it's confident. For any downstream step that relies on string matching, that matters. If the math drill answer scorer is looking for "seven" and gets "7," it fails the match.

The fix is a text normalization step between STT output and any downstream string comparison: seven → 7, three → 3, etc. I haven't built that yet. The test found it.
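
A first pass at that normalization step could be as small as the sketch below. It only handles zero through twenty, and the function names (`normalize_numbers`, `answers_match`) are placeholders I made up for the example, not anything that exists in Blip yet.

```python
import re

# Number words zero..twenty. Larger numbers and compounds like
# "twenty one" are out of scope for this sketch.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
    "fourteen": "14", "fifteen": "15", "sixteen": "16", "seventeen": "17",
    "eighteen": "18", "nineteen": "19", "twenty": "20",
}

# Longest words first so "sixteen" is never cut short at "six".
_WORDS = sorted(NUMBER_WORDS, key=len, reverse=True)
_NUMBER_RE = re.compile(r"\b(" + "|".join(_WORDS) + r")\b", re.I)


def normalize_numbers(text):
    """Replace standalone number words with digits: 'Seven' -> '7'."""
    return _NUMBER_RE.sub(lambda m: NUMBER_WORDS[m.group(1).lower()], text)


def canonical(text):
    """Normalize numbers, then drop case and surrounding punctuation."""
    return normalize_numbers(text).strip().strip(".!?,").strip().lower()


def answers_match(expected, heard):
    """Compare an expected answer to STT output after normalization."""
    return canonical(expected) == canonical(heard)
```

Running both sides of the comparison through the same function means the scorer doesn't care whether the expected answer was stored as "seven" or "7", or whether Whisper added a trailing period.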

SER5 latency: the numbers are worse than I estimated

I ran the same 12-utterance suite with STT sent over SSH to the SER5 instead of running locally on the workstation. The SER5 is a Beelink AMD mini PC with no CUDA — small.en runs on CPU only.

Machine	STT range	Total round-trip
Workstation (CUDA)	73–357ms	2.6–4.9s
SER5 (CPU via SSH)	5,565–5,957ms	7.6–8.3s

5.5 to 6 seconds of STT latency, consistently. The SSH overhead (scp + remote Python startup) adds maybe 200–500ms, so the actual on-device inference is somewhere between 5 and 5.8 seconds per utterance. That's worse than my earlier estimate of 4 seconds; the benchmark is running under different load conditions than my previous test.

The good news: transcription accuracy is identical. The SER5 got the same words right and wrong as the workstation. It's purely a speed problem, not a quality problem.

This confirms that keeping STT on the workstation (via proxy) is the right call for SER5 production use. A 6-second silence after a child finishes speaking would ruin the experience.

What the text emulator found

After building the text-only harness (no audio), I ran 9 workflows covering math drills, spelling, speech therapy, ADHD patterns, boundaries, profile switching, parent mode, freeplay, and edge cases. 2 of 9 workflows passed fully on first run.

The failures split into two categories. The first category is real bugs: "Actually can we do a story?" doesn't trigger a drill exit intent (the pattern matching doesn't handle "can we do" as an exit signal). "Parent mode" PIN entry doesn't have session state — the system knows you entered parent mode but doesn't stay in it for the next turn. These are actual Blip bugs, found by testing.
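
One possible shape for fixing both bugs is a small per-session state object that persists across turns. The sketch below is a guess at that shape, not the fix that actually went in: the `Session` class, the exit regex, and the "exit parent mode" phrase are all hypothetical, and PIN verification is left out entirely.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative exit patterns. The real router's patterns may differ.
DRILL_EXIT_RE = re.compile(
    r"\b(?:can we do|let'?s do|i want)\b.*\b(?:story|something else)\b"
    r"|\b(?:stop|quit)\b",
    re.I,
)


@dataclass
class Session:
    mode: str = "freeplay"  # "freeplay", "drill", or "parent"
    drill_topic: Optional[str] = None


def handle_turn(session, text):
    """Classify one turn and update session state in place."""
    lowered = text.lower()
    if session.mode == "parent":
        # Parent mode is sticky: it persists until explicitly exited.
        if "exit parent mode" in lowered:
            session.mode = "freeplay"
            return "parent_exit"
        return "parent_command"
    if "parent mode" in lowered:
        session.mode = "parent"
        return "parent_mode"
    if session.mode == "drill":
        if DRILL_EXIT_RE.search(text):
            session.mode, session.drill_topic = "freeplay", None
            return "drill_exit"
        return "drill_answer"
    return "freeplay"
```

The point of the sketch is that mode lives on the session rather than being re-derived from each utterance, so a turn after "Parent mode" is still interpreted as a parent command.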

The second category is a harness design issue. The mock canned responses don't contain topic-specific words ("volcano", "lava", "robot") because the mock doesn't know about topics — it returns the same generic freeplay response for everything outside of drills. Scoring those keyword assertions in mock mode is meaningless. They should only run against real Claude in live mode.

This is actually a useful architectural distinction: mock mode validates routing (did the right backend get called?) and intent detection (did the right intent fire?). Live mode validates response quality (did Claude say something age-appropriate and educationally relevant?). Conflating the two in mock mode produces false failures.

I'm splitting the test criteria accordingly. Routing and intent tests run in both modes. Content and tone quality tests are live-only.
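
In code, that split could be as simple as a flag on each assertion. This is a minimal sketch with made-up names (`Check`, `run_checks`, the result dict keys); the real harness may organize its criteria differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    predicate: Callable[[dict], bool]
    live_only: bool = False  # content/tone checks only make sense in live mode


def run_checks(turn_result, checks, mode):
    """Evaluate checks for one turn. mode is "mock" or "live"."""
    outcome = {}
    for check in checks:
        if check.live_only and mode != "live":
            # Skipped, not failed: mock replies carry no topic content.
            outcome[check.name] = "skipped"
        elif check.predicate(turn_result):
            outcome[check.name] = "pass"
        else:
            outcome[check.name] = "fail"
    return outcome


# Hypothetical checks for a freeplay turn about volcanoes.
CHECKS = [
    Check("intent_is_freeplay", lambda r: r["intent"] == "freeplay"),
    Check("backend_is_claude", lambda r: r["backend"] == "claude"),
    Check("mentions_volcano", lambda r: "volcano" in r["reply"].lower(), live_only=True),
]

# A mock-mode turn result with the generic canned reply.
MOCK_TURN = {
    "intent": "freeplay",
    "backend": "claude",
    "reply": "That sounds fun! Tell me more.",
}
```

Reporting live-only checks as "skipped" rather than silently dropping them keeps the mock-mode report honest about what it did not verify.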

Update: 7/9 passing, 0 failures

After a few hours of iteration — fixing intent signatures, correcting the drill state machine, marking content assertions as live-only — the text emulator is at 7/9 workflows passing with 0 failures in mock mode. The 2 remaining warnings are genuine behavioral findings:

A topic-change question asked mid-drill (like "what's the biggest dragon ever?") routes to the Ollama drill backend instead of Claude's creative path. In the real system the LLM context would handle this: Blip would recognize the question is off-topic and respond accordingly. The mock doesn't have that context, so it routes wrong. This is a known emulator limitation for now.
The mock "I don't know" response in the spelling drill doesn't have enough encouraging-tone signal words. This is a mock content gap that'll be verified in live mode.

The workflows now cover 50 turns across 9 interaction types: math drills, spelling, speech therapy, ADHD patterns, boundary pushing, multi-child switching, parent mode, freeplay, and edge cases. Intent routing and backend selection are tested on every turn.

Next: run the full workflow suite against live Claude API and score the response quality. That's where the real signal is — whether the content is age-appropriate, encouraging without being saccharine, and educationally accurate. The mock mode test harness will serve as the regression baseline.