From Flat Piper to Voice Cloning: Blip's Voice Overhaul

Blip started with Piper's en_US-amy-medium voice — a 60 MB neural TTS model that runs entirely on CPU and produces the audio equivalent of a friendly sign in a waiting room. Intelligible. Unobjectionable. Completely without character.

"Oh no, that must have been so hard for you" came out in exactly the same flat pitch as "Want to try math?" My 8-year-old could tell the empathy wasn't real. He didn't say that explicitly — he just kept pausing after Blip's responses, like he was waiting for something.

Voice is not decoration in a voice-first system. It's the whole product. Here's what the overhaul looked like.

Why Piper Amy wasn't working

Piper is good at what it does: fast, offline, small, reasonable quality. For home automation or screen readers, Amy is fine. For a learning companion that talks to children about dinosaurs and math problems and the feelings they had at recess, Amy's relentless evenness becomes a problem.

The issue isn't fidelity. Modern neural TTS sounds human enough. The issue is expressiveness: the range of pitch, pace, and emphasis that humans use to signal meaning, emotion, and engagement. Piper's Amy voice is trained to be neutral and pleasant. It succeeds. Neutral and pleasant is not what a kids' tutor needs.

There's also a secondary issue: the voice doesn't match the character. Blip is described to the kids as a hedgehog, a learning buddy, something warm and playful. Amy sounds like none of those things. There was a mismatch between what the character was supposed to be and what the kids heard every time it opened its mouth.

Switching to F5-TTS voice cloning

F5-TTS is a flow-matching speech synthesis model that can clone any voice from a short reference clip — 6 to 15 seconds of audio plus a transcription. It doesn't need a full training run. You provide the reference, it locks onto the timbre and cadence of that speaker, and everything it synthesizes comes out in that voice.

I set up a FastAPI server on the inference box (RTX PRO 6000, 96 GB VRAM) wrapping F5-TTS on port 8100. The orchestrator POSTs each response to /synthesize and gets back a WAV file. Synthesis latency on a 20-word sentence is under 300 ms. If the server is unreachable, the code falls back to Piper automatically: no crash, just a voice regression that's obvious and fixable.
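A minimal sketch of that fallback path, not the actual orchestrator code. The port and the /synthesize path come from the setup above; the host, the JSON request shape, and the function names are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable

# Port and path are from the text; localhost is an assumed host.
F5_URL = "http://localhost:8100/synthesize"


def synthesize_f5(text: str, url: str = F5_URL, timeout: float = 2.0) -> bytes:
    """POST text to the F5-TTS wrapper and return the WAV bytes it sends back."""
    body = json.dumps({"text": text}).encode("utf-8")  # assumed request schema
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def synthesize_with_fallback(
    text: str,
    primary: Callable[[str], bytes],
    fallback: Callable[[str], bytes],
) -> tuple[bytes, str]:
    """Try the cloned voice first; on a network failure use the local voice."""
    try:
        return primary(text), "f5"
    except OSError as exc:
        # URLError, connection refusals, and socket timeouts are all OSError subclasses.
        print(f"F5-TTS unreachable ({exc}); falling back to Piper")
        return fallback(text), "piper"
```

Returning the backend name alongside the audio makes the regression easy to spot in logs, since a run of "piper" entries means the inference box is down.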

There were two bugs to work around before this was usable in production. The first: F5-TTS auto-splits long text into batches and concatenates the results, but the tensor shapes don't always match. Fix: the server splits text into individual sentences and calls infer() once per sentence, concatenating the audio arrays with 150 ms silence gaps. The second: F5-TTS's phonemizer crashes on em-dashes, smart quotes, and ellipsis characters. Claude loves em-dashes. Fix: a _sanitize_text() helper that replaces all non-ASCII punctuation before the text reaches the model.

The Poppy problem

F5-TTS clones whatever voice is in its reference audio. After trying a few generic neural voice clips (fine quality, no personality), I found reference audio for Poppy from the Trolls movies. Bright, young, energetic — exactly the register Blip should be in.

The cloned voice was good. The kids used it happily for one session. Then Addie said, "Wait, that sounds like Poppy."

And that was it. Once she said it, Jaxsen heard it too. The voice stopped being Blip's voice and became Poppy-doing-a-thing. Every subsequent interaction had this faint weirdness of a recognizable character from one context appearing in a different one. Kids don't have the adult habit of compartmentalizing that kind of thing. When the voice sounds like someone they know, they're talking to that someone.

The lesson: voice cloning from a recognizable character only works if the character is obscure enough that the kids won't catch it, or distinctive enough that it's clearly intentional — like branding. Poppy was neither.

The second voice

The second attempt: a different reference voice. Same approach, less recognizable source. Softer than Poppy, clearer, less relentlessly cheerful. It can sustain a 30-second explanation without exhausting the listener, and it's warm enough that short coaching phrases ("Nice work!" "You're so close!") don't sound sarcastic.

The kids' reaction in the first session: nothing. Nobody stopped to say "that sounds like someone." Jaxsen kept talking. Addie kept talking. The voice got out of the way.

That's the target: a voice that the kids stop noticing, because it simply belongs to Blip. That cloned voice became Blip's canonical voice.

Filler sounds and why one isn't enough

With the voice solved, the next problem was dead air.

After the kid finishes talking, there's a gap while Claude generates a response — typically 2 to 5 seconds depending on the question's complexity. In that window the system is silent. To a child expecting a conversation, complete silence reads as: Blip froze, Blip didn't hear me, or Blip doesn't know what to say. Jaxsen's tell was reliable: at the 3-second mark, he would say "Blip?" Testing with stopwatch precision by accident.

The solution is filler sounds: pre-rendered thinking phrases played immediately after the kid stops speaking, while the API call is in flight. A FillerCache organizes them into categories — thinking sounds ("Hmm," "Let me think"), acknowledgments ("Got it," "Ooh"), transition phrases ("Okay so," "Well") — and picks based on semantic signals in what the kid said. Questions trigger thinking sounds. Short factual answers trigger acknowledgments. Long explanations trigger "Okay so."
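The category selection can be approximated with a few cheap heuristics. This is a sketch only: the phrase lists come from the description above, while the question-word list, the 15-word threshold, and the function names are assumptions, and the real FillerCache may use different signals.

```python
import random

# Phrase lists are from the text; the category keys are hypothetical names.
FILLERS = {
    "thinking": ["Hmm,", "Let me think"],
    "ack": ["Got it,", "Ooh"],
    "transition": ["Okay so,", "Well"],
}

# Speech-to-text often drops the question mark, so also check the first word.
QUESTION_WORDS = ("what", "why", "how", "when", "where", "who", "can", "does", "is")
LONG_UTTERANCE_WORDS = 15  # assumed cutoff for a "long explanation"


def pick_category(utterance: str) -> str:
    """Map what the kid said to a filler category."""
    words = utterance.lower().split()
    looks_like_question = utterance.rstrip().endswith("?") or (
        bool(words) and words[0] in QUESTION_WORDS
    )
    if looks_like_question:
        return "thinking"
    if len(words) >= LONG_UTTERANCE_WORDS:
        return "transition"
    return "ack"


def pick_filler(utterance: str, rng=random) -> str:
    """Choose a random phrase from the matching category."""
    return rng.choice(FILLERS[pick_category(utterance)])
```

Passing a seeded random.Random as rng keeps the choice reproducible in tests while staying random in production.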

The first implementation played one filler and went silent. It made things worse. A single "Hmm" followed by three seconds of nothing is more jarring than pure silence, because now the expectation is set and then violated. Real people don't say "Hmm" and then go completely silent for three seconds. They say "Hmm… let me think about that… okay so…"

The fix is a continuous filler loop: an asyncio background task that picks fillers, plays them, waits a randomized gap (0.3 to 0.5 seconds after the first filler, 0.4 to 0.8 seconds after subsequent ones for natural rhythm variation), then picks the next one. An asyncio.Event signals the loop to stop when Claude responds. The loop finishes the current phrase before exiting — no mid-word cutoff — and the transition into the real response is seamless.
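A minimal sketch of such a loop, assuming a pick callable that returns the next phrase and an async play callable that resolves when playback ends (both hypothetical). The gap ranges are the ones given above.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def filler_loop(
    stop: asyncio.Event,
    pick: Callable[[], str],
    play: Callable[[str], Awaitable[None]],
    first_gap: tuple[float, float] = (0.3, 0.5),
    later_gap: tuple[float, float] = (0.4, 0.8),
) -> int:
    """Play fillers until stop is set; return how many were played."""
    played = 0
    while not stop.is_set():
        # Playback is never cancelled, so the current phrase always finishes.
        await play(pick())
        played += 1
        low, high = first_gap if played == 1 else later_gap
        try:
            # Wait out the gap, but wake immediately if the response arrives.
            await asyncio.wait_for(stop.wait(), timeout=random.uniform(low, high))
        except asyncio.TimeoutError:
            pass  # gap elapsed with no response yet; play another filler
    return played


async def respond(get_reply, pick, play) -> str:
    """Run the filler loop in the background while the reply is generated."""
    stop = asyncio.Event()
    loop_task = asyncio.create_task(filler_loop(stop, pick, play))
    reply = await get_reply()
    stop.set()
    await loop_task  # returns once the phrase in progress has finished
    return reply
```

Waiting on the event instead of sleeping through the gap means the loop exits the moment the response is ready, unless a phrase is mid-playback.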

All filler sounds are pre-rendered in Blip's voice at startup and cached as WAVs. Playing a cached WAV is near-instant. There's no synthesis latency on fillers, which is why the loop can stay responsive even on short gap durations.

After the filler loop went in, Jaxsen stopped saying "Blip?" at the three-second mark. He started saying it at maybe the eight-second mark on very complex questions — which is about when I'd get impatient too.

Voice consistency: the two-voices bug

One problem the kids found that I didn't anticipate: when the filler sounds were still in Piper Amy's voice (before I re-rendered them in Blip's cloned voice), Jaxsen said the filler words "sound weird."

He didn't say "that's a different voice." He said weird. Children don't analyze voice inconsistency — they feel it as wrongness. The mismatch between Amy's flat neutral fillers and Blip's warmer, more inflected responses was registering as an uncanny valley in audio.

After re-rendering all 70 cached phrases in Blip's voice, the fillers disappeared into the conversation the way they're supposed to. You stop noticing them, which means they're working.

The same principle applied when I tried slowing Blip's voice down 10% for clarity using librosa.effects.time_stretch. Phase vocoder artifacts appeared that I could barely detect at my desk. Jaxsen, without knowing anything about audio processing, said it "sounds a little weird slowed down." I reverted to natural speed the same day. I was solving a problem that didn't exist.

The listen-transition gap

The last voice-related problem was the subtlest — and the one that took longest to find because it doesn't show up as voice quality. It shows up as missed words.

The barge-in system races TTS playback against a speech monitor. When TTS finished first (normal turn, no interruption), the code cancelled the monitor coroutine and awaited the cancellation before returning. That await took 50 to 200 milliseconds. Then the display state had to update, the listen loop had to start, and the silence-detection timer had to initialize. Total dead zone: 100 to 400 milliseconds between the end of Blip's sentence and the moment audio capture was actually listening.

That window is invisible to an adult who waits a beat before responding. It's fatal for a child who starts talking the instant Blip finishes. Jaxsen's first syllable was being clipped consistently. "Because" was arriving as "ause." "What if" was arriving as "at if." The transcripts looked like someone was mumbling. The actual problem was a 200 ms scheduling seam.

The fix: background-cancel the monitor instead of awaiting it. The monitor reads from a shared audio queue that stays alive until end of interaction, so cancelling it asynchronously is safe. A done callback absorbs the CancelledError so it doesn't surface as an event loop warning. The listen loop starts immediately. One line of log output confirms the transition: "TTS complete → listen mode (no barge-in; monitor cancel is async)."
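The pattern looks roughly like this. It is a sketch under assumptions: the function names and logger name are hypothetical, and the real code presumably does more in the barge-in branch. The log message is the one quoted above.

```python
import asyncio
import logging

log = logging.getLogger("blip.voice")  # hypothetical logger name


def _absorb_cancel(task: asyncio.Task) -> None:
    """Done callback: swallow the expected cancellation, report anything else."""
    try:
        task.result()
    except asyncio.CancelledError:
        pass  # expected: we requested the cancel
    except Exception:
        log.exception("background task failed while shutting down")


async def play_with_barge_in(tts_coro, monitor_coro) -> str:
    """Race TTS playback against the speech monitor and report who won."""
    tts = asyncio.create_task(tts_coro)
    monitor = asyncio.create_task(monitor_coro)
    done, _ = await asyncio.wait({tts, monitor}, return_when=asyncio.FIRST_COMPLETED)

    if tts in done:
        # Request the cancel but do not await it; cleanup finishes in the background.
        monitor.cancel()
        monitor.add_done_callback(_absorb_cancel)
        tts.result()  # re-raise playback errors, if any
        log.info("TTS complete → listen mode (no barge-in; monitor cancel is async)")
        return "listen"

    # The kid started talking first: stop playback the same way.
    tts.cancel()
    tts.add_done_callback(_absorb_cancel)
    return "barge_in"
```

The caller can start the listen loop as soon as this returns "listen", while the monitor's cleanup runs on later event loop iterations.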

After the fix, Jaxsen's transcripts stopped dropping first syllables.

What real-time audio UX actually costs

Every one of these problems — the flat voice, the character recognition, the dead air, the filler mismatch, the transition gap — was invisible in development. I tested on myself. I tested with the sound playing from across the room. None of them appeared.

They all appeared the moment the kids used it.

Children are a harder test than adults in one specific way: they have no tolerance for anything that feels wrong and no vocabulary to describe what's wrong. They just stop engaging. Or they work around it compulsively — Jaxsen saying "Blip?" at the three-second mark every single time was a workaround, not a complaint. He'd adapted to a system that was failing him without knowing it was failing him.

Real-time audio UX has no slack. A 200 ms pause is imperceptible in most software contexts. In a live conversation, it's the difference between being heard and not being heard. A voice mismatch is inaudible to the developer who knows it's two different TTS backends. To the kid, it's a system that feels deeply inconsistent in a way they can't articulate.

The test I should have done from the start: watch the kids use it without helping them, without explaining anything, without being in the room. The problems reveal themselves in about ten minutes. Everything else is guesswork.