I Trained Five Versions of Blip's Brain. TriviaQA Won, and I Still Didn't Use It.
blip-edu v2 wins 1 test out of 28 when it runs against Claude Sonnet, DeepSeek-V3, and a handful of capable local models on a general benchmark. One. Out of twenty-eight.
I am completely fine with this.
blip-edu doesn't compete against those models. It runs locally, in about a quarter of a second, at zero cost, for two kids — one who is eight and obsessed with dinosaurs and multiplication tables, and one who is six and currently interested in spelling "breakfast" and convincing Blip to tell her another story about the dragon. The model doesn't need to win a general benchmark. It needs to know these two kids well enough to talk to them.
The question I wanted to answer was whether I could make it any better.
The five experiments
The full technical breakdown is in this earlier post. The short version: blip-edu is a Qwen2.5-7B fine-tune, trained on 14,000 synthetic teacher-student conversations I generated with Claude Sonnet as the tutor. All the training examples are specifically about talking to kids: age-appropriate vocabulary, short answers, warmth without condescension. That's why it scores poorly against general models and reasonably well at the specific thing it's supposed to do.
The hypothesis I was testing: what if I blended in real training data from outside that synthetic set? Would the model generalize better, talk more naturally, handle off-script moments more gracefully?
I trained five variants. Each one was a 50/50 blend of the existing v2 data and one new dataset: real child-directed speech transcripts (or the closest proxy I could find), short fictional stories for kids, curated instruction-following data, general multi-turn conversation, and factual question-answer pairs from TriviaQA. Same base model, same LoRA config, everything identical except the mix.
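For anyone who wants to picture the blend, here is a minimal sketch of one way to build a fixed-size mix from two datasets. It assumes each dataset is a plain Python list of examples and that the total size is held constant, which the experiments above may or may not have done. The function name `blend` and its parameters are illustrative, not the actual training pipeline.

```python
import random


def blend(base, extra, extra_fraction=0.5, size=None, seed=0):
    """Return a shuffled mix where `extra_fraction` of the examples come from `extra`.

    `base` and `extra` are lists of training examples. `size` is the total number
    of examples in the mix (defaults to len(base)). Sampling is without replacement.
    """
    if not 0.0 <= extra_fraction <= 1.0:
        raise ValueError("extra_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the same mix can be rebuilt later
    size = len(base) if size is None else size
    n_extra = round(size * extra_fraction)
    n_base = size - n_extra
    if n_base > len(base) or n_extra > len(extra):
        raise ValueError("not enough examples to build a mix of this size")
    mixed = rng.sample(base, n_base) + rng.sample(extra, n_extra)
    rng.shuffle(mixed)  # avoid all base examples coming before all new ones
    return mixed
```

Keeping the seed and the total size fixed across variants is one way to make "everything identical except the mix" hold at the data level too.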
What won
Not what I expected.
I assumed the child-directed speech data would win, or maybe the conversation data. TriviaQA is question-answer pairs about history and geography and pop culture. It has nothing to do with talking to children. And yet: 7 wins out of 28 tests, compared to v2's 3. The categories where it pulled ahead were creative tasks — the open-ended questions I'd labeled "Claude territory" in the benchmark setup, the ones I assumed a fine-tuned 7B would reliably lose.
My best guess — and I want to be clear it's a guess — is that TriviaQA training data tends toward short, confident, declarative answers. You ask a question, you get a direct answer, no hedging, no meta-commentary about the nature of the question. That turns out to be exactly the right voice for a tutor talking to an eight-year-old. Less "that's a great question, let's think about this together" and more "it was the brachiosaurus, and here's why that's interesting."
The problem
blip-edu v2 responds in about 0.26 seconds on average. The TriviaQA variant? 1.61 seconds, roughly six times slower.
For a voice assistant where a kid is waiting for the response, that gap is the difference between "Blip is thinking" and "did Blip break?" Jaxsen, the eight-year-old, will stop talking mid-sentence if there's more than two seconds of silence. I watched it happen during testing and it was immediately obvious the latency was a problem, regardless of what the benchmark scores said.
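An average can hide the slow responses that actually trigger the silence problem, so it helps to look at the worst case and the share of responses over the limit as well. The sketch below shows one way to collect that, assuming the model is wrapped in a callable that takes a prompt and returns when the reply is ready. The names `time_responses`, `summarize`, and `respond` are hypothetical. The two-second default comes from the silence threshold described above.

```python
import time
from statistics import mean


def time_responses(respond, prompts, clock=time.perf_counter):
    """Return the wall-clock latency in seconds of `respond(prompt)` for each prompt."""
    latencies = []
    for prompt in prompts:
        start = clock()
        respond(prompt)  # hypothetical callable wrapping the local model
        latencies.append(clock() - start)
    return latencies


def summarize(latencies, silence_limit_s=2.0):
    """Report the mean, the worst case, and the share of responses over the limit."""
    over = sum(1 for seconds in latencies if seconds > silence_limit_s)
    return {
        "mean_s": mean(latencies),
        "max_s": max(latencies),
        "share_over_limit": over / len(latencies),
    }
```

Running both variants over the same prompt list and comparing `share_over_limit` would show how often each one crosses the point where a kid gives up, which the mean alone does not.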
The standard next step would have been a Phase B ratio sweep — test the TriviaQA blend at 25%, 50%, and 75% to find where the quality gain disappears and the speed comes back. I didn't run it.
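For reference, this is roughly what that sweep would have involved in terms of data. The sketch assumes the total number of training examples stays fixed while the TriviaQA share changes. Both `sweep_plan` and the fixed-total assumption are mine for illustration, not something taken from the actual setup.

```python
def sweep_plan(total_examples, extra_fractions=(0.25, 0.50, 0.75)):
    """List how many new and base examples each run in a ratio sweep would use."""
    plan = []
    for fraction in extra_fractions:
        n_extra = round(total_examples * fraction)
        plan.append(
            {
                "extra_fraction": fraction,
                "n_extra": n_extra,                    # examples from the new dataset
                "n_base": total_examples - n_extra,    # examples from the existing v2 data
            }
        )
    return plan
```

With a 14,000-example total, the 25% run would use 3,500 TriviaQA examples and 10,500 v2 examples, and the 75% run would flip those numbers. Each entry is a separate training run, which is where the cost of a sweep comes from.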
Why I stopped there
Partly because I sat down to plan Phase B and realized something: the 7-wins-versus-3 result was inside a tournament that only included blip-edu variants. When I looked at the same models in the full benchmark, comparing against everything and not just each other, the margins between all the variants were effectively noise. Best average score among the new variants: ab-childes at 39.4 out of 50. blip-edu-v2: 39.5.
Those are not meaningfully different numbers.
So the TriviaQA variant won a head-to-head tournament against inferior competition but didn't move the needle against the actual baseline when measured cleanly. That's not a compelling case for six more hours of training runs. The real bottleneck in Blip's quality right now isn't which 7B variant handles the math drill. It's the routing layer — the part that decides whether to call the 7B at all, or whether to send the request to Claude Sonnet instead. Fix that and you get a larger quality improvement than any training data blend I've tried.
What I actually learned
When you have a model fine-tuned on domain-specific data that's genuinely working, adding general datasets at a 50/50 blend mostly adds noise. The model already knows it's talking to kids — that information is baked into 14,000 training examples. TriviaQA helped marginally because it added directness, which happened to be useful. But the help wasn't worth the cost.
The more interesting finding is about what "winning" means. In the mini-tournament, the TriviaQA variant clearly won. In the full picture, it was a wash. I had defined my success metric as "+3 wins over v2" without thinking clearly about what population of models I was measuring against. A win in a weak field is not the same thing as an improvement. I should have known this and set up the test differently.
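The effect is easy to reproduce in a few lines. The sketch below assumes each model gets a numeric score per test and that a test is "won" by the single highest scorer among whichever models are in the field, with ties awarding no win. That scoring rule and the names `count_wins` and `mean_score` are assumptions for illustration. The real benchmark may count wins differently.

```python
def count_wins(scores, field):
    """Count the tests won by each model in `field`.

    `scores` maps a model name to its list of per-test scores, all in the same
    test order. Only models named in `field` compete; ties award no win.
    """
    n_tests = len(scores[field[0]])
    wins = {model: 0 for model in field}
    for test in range(n_tests):
        best = max(scores[model][test] for model in field)
        leaders = [model for model in field if scores[model][test] == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


def mean_score(scores, model):
    """Average score for one model, independent of who else is in the field."""
    return sum(scores[model]) / len(scores[model])
```

Calling `count_wins` twice on the same scores, once with only the variants in `field` and once with every model, gives different win counts for the same variant, while `mean_score` does not move. That is why the average is the safer number to compare against a baseline.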
blip-edu-v2 stays as the operational model. Maybe I revisit this when the routing layer is better, or when I have actual CHILDES data to train on instead of a proxy. The proxy I used — a different random sample of TinyStories — didn't tell me anything about child-directed speech. It just told me what blip-edu-v2 looks like at a slightly different mix ratio.
Last week Jaxsen asked Blip to help with multiplication by 7s, and Blip walked him through it without me. Adalind spelled "breakfast" and heard "nice job" and was pleased. The latency on that interaction was 0.26 seconds.
That's the benchmark that matters.