Five Datasets, One Question: Does Training Data Source Matter for blip-edu?

blip-edu v2 was trained on 14,000 synthetic examples: conversations I generated using Claude Sonnet as the teacher. In an earlier benchmark run the model got 1 win out of 28. Not bad for a 7B, but not dominant either, and not improving reliably over v1. At some point I started wondering whether the ceiling I was hitting was a data volume problem or a data source problem.

Synthetic data is convenient. It's cheap to make, covers whatever categories I want, and arrives already formatted. But it's also all written by the same author (Claude), which means all 14,000 examples probably share the same rhythm, the same hedging patterns, the same preference for complete sentences. If the model is learning to mimic that rhythm, is it learning anything that would generalize to how a real seven-year-old actually talks?

That question is what this test was about.

The five variants

Phase A is a category sweep. Five variants, each one a 50/50 blend of the existing v2 synthetic data and one new open dataset. Same base model (Qwen2.5-7B-Instruct), same LoRA config (r=16, alpha 32, 3 epochs), everything identical except what gets mixed in.
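A 50/50 blend like this can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: the function name `blend_datasets`, the `new_fraction` parameter, the fixed seed, and the toy records are all assumptions.

```python
import random

def blend_datasets(base, new, total, new_fraction=0.5, seed=0):
    """Return `total` examples: `new_fraction` from `new`, the rest from `base`.

    Hypothetical helper; samples without replacement and shuffles the result.
    """
    n_new = round(total * new_fraction)
    n_base = total - n_new
    if n_new > len(new) or n_base > len(base):
        raise ValueError("not enough examples to build the requested blend")
    rng = random.Random(seed)  # fixed seed so the blend is reproducible
    mixed = rng.sample(base, n_base) + rng.sample(new, n_new)
    rng.shuffle(mixed)  # avoid training on all of one source first
    return mixed

# Toy stand-ins for the v2 synthetic set and one open dataset.
synthetic = [{"source": "v2", "id": i} for i in range(100)]
open_data = [{"source": "open", "id": i} for i in range(100)]
blend = blend_datasets(synthetic, open_data, total=40)
```

Keeping the seed fixed matters here: if the blend changes between runs, a difference in wins could come from the sample rather than from the dataset.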

ab-childes — the CHILDES baseline. 10,000 total examples. I wanted real child-directed speech here — transcripts of adults talking to kids, the kind of thing you'd train on if you were trying to teach a model how adults actually talk to children. True CHILDES isn't packaged for easy download, so I used the closest available proxy: a second sample of TinyStories with a different random seed. Not ideal. This variant mostly tells me what the v2 data alone looks like in a different blend ratio.
ab-stories — TinyStories. 13,000 total. Short, simple narratives designed for language model training on child-appropriate text. The hypothesis: story training might improve Blip's narrative responses and emotional engagement.
ab-instruct — SmolTalk. 13,000 total. A curated instruction-following dataset from HuggingFace's smol-tools collection. The hypothesis: more instruction data might sharpen Blip's task clarity when it's running a drill.
ab-chat — UltraChat. 13,000 total. Multi-turn conversational data, much more diverse in subject matter than the synthetic v2 examples. The hypothesis: general conversation breadth might improve Blip's handling of off-script moments.
ab-knowledge — TriviaQA. 13,000 total. Real question-answer pairs covering a broad factual range. The hypothesis: factual grounding might help Blip on trivia and science tasks where it currently hedges too much.

All five also competed against blip-edu v1 and v2 in the same benchmark run, so there's a direct comparison against the pure-synthetic baselines.

The setup took longer than the training

I should have planned for three days and scheduled the GPU runs up front. Instead the pipeline failed twice before it worked.

First attempt: I forgot the training infrastructure lives on the inference box (hostname inference, RTX PRO 6000 Blackwell, 96 GB VRAM), not my local workstation. I wrote the pipeline script assuming a local venv that doesn't exist anymore; we moved training to inference back when we were working on v2. Every step failed because the venv path was wrong, but the script kept going and logged "Training complete" after each failure, because I'd written set -uo pipefail and forgotten the -e. So it completed five iterations of nothing and tried to run a benchmark with zero trained models.
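The failure mode is a driver that keeps going after a step exits non-zero. The pipeline in this post is a shell script, where the fix is set -euo pipefail; the sketch below shows the same fail-fast idea in Python, purely as an illustration. `run_step` and the stand-in commands are hypothetical and not part of the real pipeline.

```python
import subprocess
import sys

def run_step(name, cmd):
    """Run one pipeline step and stop the whole run if it exits non-zero."""
    # check=True raises CalledProcessError on a non-zero exit code,
    # which plays the role of `set -e` in a shell script.
    subprocess.run(cmd, check=True)
    print(f"{name}: complete")

# Stand-in commands: one succeeds, one fails the way a bad venv path would.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

run_step("setup", ok_cmd)
try:
    run_step("train", bad_cmd)
    reached_benchmark = True
except subprocess.CalledProcessError as err:
    # The "complete" message is never printed for a failed step.
    reached_benchmark = False
    failed_code = err.returncode
```

The point is that the success message sits after the check, so it cannot be logged for a step that failed.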

Second attempt: disk full. The GGUF conversion pipeline creates an intermediate f16 file (~14 GB) before quantizing down to Q4_K_M (~4.5 GB). I wasn't accounting for that, so by the third variant, the inference box was out of disk space. I moved all the f16 files to cold storage and tried again.
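A rough disk budget makes the problem visible. The per-file sizes are the approximate figures from the paragraph above; the function itself, and the option of deleting each f16 file right after quantizing, are assumptions for illustration (what I actually did was move the files to cold storage). Adapters and merged weights are not counted.

```python
F16_GB = 14.0  # approximate intermediate f16 GGUF size, from the text
Q4_GB = 4.5    # approximate Q4_K_M size, from the text

def peak_gguf_disk_gb(n_variants, keep_f16):
    """Peak disk used by GGUF files alone after converting n_variants models."""
    if keep_f16:
        # Every f16 intermediate stays on disk next to its quantized file.
        return n_variants * (F16_GB + Q4_GB)
    # Only one f16 file exists at a time; finished variants keep just the Q4 file.
    return (n_variants - 1) * Q4_GB + (F16_GB + Q4_GB)

keep = peak_gguf_disk_gb(5, keep_f16=True)    # 92.5 GB
clean = peak_gguf_disk_gb(5, keep_f16=False)  # 36.5 GB
```

Under these assumptions, leaving the intermediates in place costs about 56 GB more across five variants than cleaning up as you go.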

Third attempt, after fixing both: the pipeline ran on inference via SSH and trained on the 96 GB card, which is comically large for a 7B model (the whole thing fits in about 6 GB and training uses maybe 12 GB). The first variant's adapter and merged weights still existed from the second attempt, so that variant skipped straight to GGUF conversion. Training time per variant: ~70 minutes. That's five variants and roughly six hours of actual GPU work, though wall-clock was longer while I fixed things between runs.

What I was actually measuring

I used the same 28-test blip_learn benchmark suite I've been using throughout this series: spelling, math, trivia, emotional support, safety refusal, greeting, voice quality, session management, and multi-turn conversation. Claude Opus 4.6 and DeepSeek-V3-0324 both judge each output. The question is whether any blend beats v2's 1 win, and if so, which one and in which categories.

What would "winning" look like? At minimum, +3 wins over v2 on 28 tests, with no regression on the child-voice and safety tests — those are the categories where getting it wrong matters. A variant that wins 4 trivia tests but starts saying "as an AI language model" to a six-year-old is not better, it's broken differently.
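Written down as a check, the bar looks something like this. It is a sketch of the rule as stated, not code from the benchmark: the function name, the category labels, and the example numbers are all hypothetical.

```python
def clears_bar(candidate_wins, baseline_wins, regressed_categories,
               min_gain=3, protected=("child_voice", "safety")):
    """True if the candidate gains enough wins and regresses on no protected category."""
    gained_enough = candidate_wins - baseline_wins >= min_gain
    no_protected_regression = not set(regressed_categories) & set(protected)
    return gained_enough and no_protected_regression

# Hypothetical cases, not benchmark results.
plain_gain = clears_bar(4, 1, regressed_categories=[])
broken_differently = clears_bar(5, 1, regressed_categories=["safety"])
too_small = clears_bar(3, 1, regressed_categories=[])
```

The second case is the one that matters: a bigger win count does not rescue a variant that regresses on a protected category.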

Results

Rank	Model	Wins / 28	Avg rank	Avg time
1	blip-edu-ab-knowledge (TriviaQA blend)	7	4.36	1.61s
2	blip-edu v1 (baseline)	5	3.79	0.30s
2	blip-edu-ab-childes (TinyStories proxy blend)	5	3.93	0.33s
4	blip-edu-ab-instruct (SmolTalk blend)	4	3.82	1.65s
5	blip-edu v2 (synthetic baseline)	3	3.82	0.26s
6	blip-edu-ab-chat (UltraChat blend)	2	4.04	1.62s
6	blip-edu-ab-stories (TinyStories blend)	2	4.25	1.63s

28 tests, dual judge (Claude Opus 4.6 + DeepSeek-V3-0324), blind scoring. Blip fine-tune variants only — the full benchmark suite includes cloud models and larger locals that aren't the point here.
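The Rank column can be reproduced from the win counts with standard competition ranking, where tied models share a rank and the next rank is skipped. The numbers below are copied from the table; whether the benchmark tooling computes ranks this way is an assumption, and `competition_ranks` is a hypothetical helper.

```python
# (model, wins out of 28, average response time in seconds), from the table.
results = [
    ("blip-edu-ab-knowledge", 7, 1.61),
    ("blip-edu v1", 5, 0.30),
    ("blip-edu-ab-childes", 5, 0.33),
    ("blip-edu-ab-instruct", 4, 1.65),
    ("blip-edu v2", 3, 0.26),
    ("blip-edu-ab-chat", 2, 1.62),
    ("blip-edu-ab-stories", 2, 1.63),
]

def competition_ranks(rows):
    """Rank by wins, descending; tied models share the rank of the first in the tie."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranks = []
    for i, row in enumerate(ordered):
        if i > 0 and row[1] == ordered[i - 1][1]:
            ranks.append(ranks[-1])  # same wins as the previous row: share its rank
        else:
            ranks.append(i + 1)      # otherwise rank is the 1-based position
    return list(zip(ranks, ordered))

ranked = competition_ranks(results)
total_wins = sum(wins for _, wins, _ in results)
```

One sanity check falls out of this: the win counts sum to 28, one per test.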

TriviaQA won. I did not expect that.

My theory going in was that SmolTalk (instruction-following data) would win on drill tasks, or that UltraChat (diverse conversation) would win on the off-script moments. TriviaQA was a long shot — it's question-answer pairs about history and geography and pop culture, not kid-directed anything. And yet: 7 wins, compared to v2's 3. That's more than double, and it cleared the +3 threshold I set as the bar for "this actually matters."

Where did the 7 wins come from? Math (1), greeting (1), safety (1), multi-turn (1), creative (3). The creative tests are interesting — those are the ones I labeled "Claude territory" in the benchmark, the open-ended questions where I expected a fine-tuned 7B to struggle. ab-knowledge won all three. My best guess is that the TriviaQA training data has a lot of short, confident, declarative answers — and that's exactly what you want for a kids' tutor that shouldn't hedge everything to death.

The caveat is the response time. blip-edu v1 answers in 0.30 seconds on average. blip-edu-ab-knowledge takes 1.61 seconds. For a voice assistant where a kid is waiting for Blip to respond, that gap is noticeable. I'm not sure yet whether the quality gain is worth the latency hit, or whether Phase B will widen or close that gap as the blend ratio changes.

v2 still lost to v1.

I keep expecting v2 to pull ahead of v1 and it keeps not doing that. This run: v1 gets 5 wins, v2 gets 3. The extra 5,500 examples I added for v2, the new categories, the conversational math swap: none of it pushed v2 clearly past the model that came before it. v2 has a slightly better average score (39.5 vs 37.9) but wins fewer individual tests. At this point I think v2 is more consistently adequate and v1 is more occasionally excellent, which, in a benchmark that rewards wins over averages, looks like a regression.

Stories and chat didn't help much

TinyStories (ab-stories) and UltraChat (ab-chat) both landed at 2 wins — below v2. For TinyStories, I think I understand why: the stories dataset is very repetitive ("Once upon a time, there was a little...") and probably just added variance without adding signal. UltraChat surprised me more. That dataset covers a genuinely wide range of topics and conversation styles, and I expected it to generalize. Instead it seems to have added verbosity — ab-chat responses averaged 1.62 seconds, versus v1's 0.30. Lots of words, fewer wins.

ab-childes is the interesting one: 5 wins, 0.33s response time, nearly identical to v1. Since I couldn't get real CHILDES data and ended up using another TinyStories sample as a proxy, this variant is mostly just "v2 data at a different mix ratio." The fact that it ties v1 suggests the blend ratio matters — but the proxy substitution means I can't draw conclusions about child-directed speech specifically.

What's next: Phase B

ab-knowledge cleared the bar. Phase B runs three versions of the TriviaQA blend at 25%, 50%, and 75% of the training mix — same base v2 data, different ratios — to find out whether the 7-win result holds up, scales, or disappears. The response time question is the one I care about most: does a lighter mix (25% TriviaQA) get closer to v1's 0.30s while keeping some of the quality gain?
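One thing worth working out before Phase B is how much TriviaQA each ratio needs. The sketch below assumes the v2 data is held fixed and only the TriviaQA count changes, which is one reading of "same base v2 data, different ratios"; holding the total fixed instead would give different numbers. The helper name and the base size of 6,000 are hypothetical, not figures from this post.

```python
def new_examples_needed(base_count, new_fraction):
    """Examples of the new dataset needed so it makes up new_fraction of the mix,
    keeping all base_count base examples."""
    if not 0 <= new_fraction < 1:
        raise ValueError("new_fraction must be in [0, 1)")
    # new / (base + new) = f  =>  new = base * f / (1 - f)
    return round(base_count * new_fraction / (1 - new_fraction))

BASE = 6000  # hypothetical base-set size, for illustration only
plan = {f: new_examples_needed(BASE, f) for f in (0.25, 0.50, 0.75)}
```

With a fixed base, the 75% mix needs three times as much new data as base data, so the runs also differ in total dataset size. That is worth keeping in mind when comparing them.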

If it turns out that 50% was already too much and the model is learning to give long factual answers when short warm ones are better, I'll know that. If 75% blows it up entirely, I'll know that too. Is the quality gain worth the latency? I genuinely don't know. That's why there's a Phase B.