blip-edu v2: What I Learned Trying to Make My Kid-Tutor Model Better

Six hours of GPU time. About sixteen dollars. Two new model variants. And the headline result is that neither one beat the original.

I should back up. Yesterday's post covered blip-edu v1, a Qwen2.5-7B-Instruct model with a LoRA adapter trained on 8,500 synthetic kid-tutor conversations. It scored 2 wins out of 28 on the Blip benchmark, sat in Tier 2, and taught me something I probably should have already known: fine-tuning teaches style, not capability. The 1,500 math-drill examples couldn't fix the fact that a 7B model just can't do arithmetic reliably.

That post ended with two hypotheses I wanted to test. First: swap the math data. Instead of verification math ("is 6+7=13?"), train on conversational math ("I'm stuck, can you help?") and sidestep the arithmetic ceiling entirely. Second: try Qwen2.5-Coder-7B-Instruct as the base instead of Qwen2.5-7B-Instruct, on the theory that code pretraining, with its massive exposure to arithmetic literals and numeric comparisons, would transfer to better math after fine-tuning.

One of those hypotheses worked exactly as I predicted. The other was wrong in a way I find genuinely interesting.

The v2 dataset

14,000 examples across nine categories, generated via Claude Sonnet 4.6. The changes from v1:

Dropped math_drill (1,500). Replaced with math_conversational (1,500) — prompts like "I'm stuck on this problem," "what's the trick to remember 7×8," "this is hard" where Blip's correct response is a strategy or a rephrasing, not an arithmetic answer.
Added multi_turn (2,000). 3-5 turn conversations with explicit context carryover — Blip has to reference what was said earlier in the exchange.
Added emotional_support (1,500). Was folded into general_chat in v1; splitting it creates a focused training target.
Added greeting (500).
Grew safety_redirect from 500 to 1,000. v1 averaged rank 8 in a 12-model field on safety with only 500 examples, so I wanted more.
Grew spelling_drill 1,000 → 1,500 and science_qa 1,000 → 1,500.
Kept general_chat (3,000) and story_collab (1,500) unchanged.
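
The mix is easy to sanity-check in a few lines. A minimal sketch: the v2 counts come straight from the list above, while the v1 counts are reconstructed from the "from X to Y" notes on the assumption that nothing else changed. The dict layout and the `summarize` helper are illustrative only, not the real generator config.

```python
# Per-category example counts for the v2 dataset, taken from the list above.
V2_MIX = {
    "math_conversational": 1500,
    "multi_turn": 2000,
    "emotional_support": 1500,
    "greeting": 500,
    "safety_redirect": 1000,
    "spelling_drill": 1500,
    "science_qa": 1500,
    "general_chat": 3000,
    "story_collab": 1500,
}

# v1 counts implied by the change notes (assumption: no other differences).
V1_MIX = {
    "math_drill": 1500,
    "safety_redirect": 500,
    "spelling_drill": 1000,
    "science_qa": 1000,
    "general_chat": 3000,
    "story_collab": 1500,
}


def summarize(mix):
    """Return (number of categories, total examples) for a dataset mix."""
    return len(mix), sum(mix.values())


print(summarize(V2_MIX))  # (9, 14000)
print(summarize(V1_MIX))  # (6, 8500)
```

Both totals line up with the figures quoted in this post: 14,000 examples across nine categories for v2, and 8,500 for v1.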

Generating the full set took about 90 minutes with 8 parallel workers hitting the Anthropic API. I burned maybe 20 minutes debugging a timeout problem — multi-turn examples produce 6,000-10,000 output tokens per batch of 20, which blew past the 120-second SDK timeout. Dropping multi-turn batches to 8 examples each fixed it. Obvious in hindsight. Not obvious before the first stack trace.
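
The batch-size fix is simple arithmetic once you look at tokens per example. A rough sketch, assuming output tokens scale roughly linearly with batch size (which I did not measure precisely); `tokens_per_example`, `expected_batch_tokens`, and `chunk` are hypothetical helpers, not the actual generator code.

```python
def tokens_per_example(batch_tokens, batch_size):
    """Average output tokens per example for an observed batch."""
    return batch_tokens / batch_size


def expected_batch_tokens(per_example, batch_size):
    """Projected output tokens for a batch, assuming linear scaling."""
    return per_example * batch_size


def chunk(items, size):
    """Split a list of prompts into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Observed: 6,000-10,000 output tokens per batch of 20 multi-turn examples.
low = tokens_per_example(6000, 20)    # 300.0 tokens per example
high = tokens_per_example(10000, 20)  # 500.0 tokens per example

# Projected output per batch after dropping to 8 examples.
print(expected_batch_tokens(low, 8), expected_batch_tokens(high, 8))  # 2400.0 4000.0

# 2,000 multi-turn examples at 8 per batch.
print(len(chunk(list(range(2000)), 8)))  # 250
```

Smaller batches mean more requests (250 instead of 100 for the multi-turn category), but each one generates well under half the output of the batches that were timing out.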

Two training runs, same data

Same v2 dataset, two different base models, identical LoRA hyperparameters (rank 16, alpha 32, learning rate 2e-4, 3 epochs). Unsloth on the workstation 4090.

blip-edu:v2 — base is unsloth/Qwen2.5-7B-Instruct (same as v1). Trained ~90 minutes. Final training loss around 0.95.
blip-edu-coder — base is unsloth/Qwen2.5-Coder-7B-Instruct. Trained ~90 minutes. Final training loss around 1.05.
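
For reference, here is the two-run setup written down as data. This is a sketch, not Unsloth's API: `LoraRun` is a hypothetical container I made up for illustration. The hyperparameter values are the ones listed above, and the `scaling` property uses the standard LoRA convention of scaling the low-rank update by alpha divided by rank.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoraRun:
    """Hypothetical record of one fine-tuning run (not a real library class)."""
    base_model: str
    rank: int = 16
    alpha: int = 32
    learning_rate: float = 2e-4
    epochs: int = 3

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.alpha / self.rank


RUNS = {
    "blip-edu:v2": LoraRun("unsloth/Qwen2.5-7B-Instruct"),
    "blip-edu-coder": LoraRun("unsloth/Qwen2.5-Coder-7B-Instruct"),
}

for name, run in RUNS.items():
    print(name, run.base_model, run.scaling)

# The two runs should differ only in the base model.
v2, coder = RUNS["blip-edu:v2"], RUNS["blip-edu-coder"]
print(replace(coder, base_model=v2.base_model) == v2)  # True
```

Keeping everything except `base_model` identical is what makes the later comparison between the two variants meaningful.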

That 0.1 gap in training loss was the first clue. The Coder base fought the kid-tutor persona harder than the Instruct base did — same data, same hyperparameters, but the loss curves sat apart the entire run. Teaching a code-generation model to coo reassuring things to a frustrated 7-year-old just... takes more convincing, I guess.

Fourteen models, twenty-eight tests

Added both new variants to config.yaml and ran the full Blip benchmark suite with Claude Opus 4.6 and DeepSeek-V3-0324 both judging every test.

Rank	Model	Wins	Avg rank	Avg time	Avg tokens
1	GLM-5	8	3.89	4.7s	127
2	MiniMax-M2.5	5	5.71	5.6s	147
3	Claude Opus 4.6	4	4.64	3.5s	52
4	Claude Sonnet 4.6	3	4.25	3.2s	46
4	DeepSeek-V3-0324	3	5.39	4.2s	36
6	blip-edu-coder (mine)	1	7.32	1.5s	45
6	qwen3.5-abliterated	1	7.39	0.7s	61
6	blip-edu:v2 (mine)	1	8.11	1.5s	45
6	blip-edu (v1, mine)	1	8.25	1.5s	45
6	DeepSeek-R1:32b	1	8.57	3.3s	41
11	qwen2.5-coder:32b	0	6.11	3.3s	38
12	Hermes3:8b	0	9.89	1.6s	58
13	qwen3.5-35b-a3b-opus-distilled	0	11.89	6.4s	465
14	qwen3.5-27b-opus-distilled	0	13.57	11.4s	495

So the top line is boring

All three blip-edu variants: 1 win out of 28. Tied with qwen3.5-abliterated and DeepSeek-R1:32b for sixth place. That's actually one win fewer than v1 got in yesterday's run (2 wins), but the variance post already showed the noise floor is ±2-3 wins per model on this benchmark. Tier-level? Still Tier 2. Below the cloud frontier, above the unusable floor.

The big story where v2 lifts the model into Tier 1 did not happen.

But the per-category numbers are where I spent most of the afternoon staring at a spreadsheet, because those moved.

Where the data actually moved

v1 (yesterday's run) vs v2 (today's run) by category average rank. Lower is better.

Category	v1 avg rank	v2 avg rank	Δ (v2−v1)	Notes
math	9.5	6.2	−3.3 ✓	The v2 swap worked
emotional	5.0	4.7	−0.3	marginal
greeting	8.7	7.0	−1.7	new category helped
spelling	5.8	6.5	+0.7	slight regression
multi_turn	4.0	9.0	+5.0 ✗	Training for it made it worse
creative	6.0	9.0	+3.0 ✗	regression
trivia	6.3	9.0	+2.7 ✗	regression
safety	8.0	11.7	+3.7 ✗	grew category, got worse
voice_quality	8.3	11.3	+3.0 ✗	regression
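
The Δ column is just v2 minus v1 per category, so it can be recomputed from the two rank columns. A small sketch using the numbers in the table above; `ROWS` and `deltas` are names I picked for illustration, and the rounding to one decimal only exists to hide floating-point fuzz.

```python
# (v1 avg rank, v2 avg rank) per category, copied from the table above.
ROWS = {
    "math": (9.5, 6.2),
    "emotional": (5.0, 4.7),
    "greeting": (8.7, 7.0),
    "spelling": (5.8, 6.5),
    "multi_turn": (4.0, 9.0),
    "creative": (6.0, 9.0),
    "trivia": (6.3, 9.0),
    "safety": (8.0, 11.7),
    "voice_quality": (8.3, 11.3),
}


def deltas(rows):
    """Return v2 - v1 per category; negative means v2 ranked better."""
    return {cat: round(v2 - v1, 1) for cat, (v1, v2) in rows.items()}


d = deltas(ROWS)
improved = [cat for cat, x in d.items() if x < 0]
regressed = [cat for cat, x in d.items() if x > 0]

print(improved)            # ['math', 'emotional', 'greeting']
print(len(regressed))      # 6
print(min(d, key=d.get))   # math
print(max(d, key=d.get))   # multi_turn
```

Three categories improved and six got worse, with math the largest gain and multi_turn the largest loss.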

I sat with this table for a while.

Math: the swap worked

v1 averaged rank 9.5 in a 12-model field on math tests. That was its worst category, despite math being the second-largest training bucket. v2 averages 6.2 in a 14-model field on the same tests. Middle of the pack, above some 32-billion-parameter competitors. Biggest rank improvement of any category.

Why? Because I stopped asking the model to do something it can't do. The old math_drill data rewarded correct arithmetic — and a 7B model will get arithmetic wrong. The new math_conversational data rewards pedagogy — "here's a trick for remembering 7×8" — and the model never has to commit to a specific answer. I didn't make it smarter. I just stopped punishing it for being dumb.

Multi-turn: the one that still bothers me

Here's what I can't quite shake. v1 had zero multi-turn training examples. Zero. And it ranked 4.0 on multi-turn tests — its best category, top half of the whole field. So I added 2,000 explicit multi-turn examples in v2, figuring I'd push that number even higher.

v2 ranks 9.0 on multi-turn. A five-rank drop. I trained it on the thing it was best at and made it worse at that thing.

My best theory, and I want to be honest that it's just a theory, is that the base Qwen2.5-7B-Instruct already had strong multi-turn handling baked in from its own training. v1's LoRA, trained on single-turn data, left that behavior alone. v2's multi-turn examples taught the model a specific pattern of context carryover, and that pattern was apparently less flexible than what the base already knew. The adapter overrode a working behavior with a narrower one.

If that reading is correct, the lesson is uncomfortable: sometimes the best thing you can do for a capability is leave it alone. Don't fine-tune what already works.

Safety and the rest

I doubled the safety_redirect examples from 500 to 1,000 and got worse results — rank 8.0 to 11.7. Voice quality regressed. Creative regressed. Trivia regressed.

The pattern is pretty clear: v2's dataset specialized the model more narrowly, and that narrowing cost it in categories where the base model's generic flexibility was the thing doing the work. The math improvement wasn't free — it came out of general capability. Classic fine-tuning tradeoff that I should have seen coming but didn't fully appreciate until the numbers were in front of me.

I should flag the variance caveat from the previous post: the 28-test Blip benchmark has about ±2-3 wins of noise per model on a single run, and category-level averages carry similar uncertainty. Some of these "regressions" might be noise. I'm reading the direction, not the exact magnitude.

The Coder hypothesis — wrong, but interesting

The whole reason I trained blip-edu-coder was to test whether code pretraining — all that exposure to arithmetic literals, comparisons, numeric libraries — would give the model a leg up on math after fine-tuning. Cleaner hypothesis than bolting on tool-use training.

It ranks 11.8 on math. Worst of the three variants. The code-trained base, despite having seen more raw arithmetic than Instruct ever did, produces worse arithmetic pedagogy. So that's falsified.

Why, though? Probably the same specialization cost, amplified. The Coder base had more to unlearn to adopt the warm kid-tutor persona, and what it lost in that unlearning was the numeric confidence that would have helped. That 0.1 training loss gap was real, and on math it showed up in the benchmark.

But here's the thing I didn't expect: blip-edu-coder posted the best overall average rank of the three variants, 7.32 versus 8.11 for v2 and 8.25 for v1. It won a safety test where the other two didn't. It ranked best of the three on creative (5.0), multi_turn (5.0), voice_quality (7.7), and trivia (7.3). Counting safety, that's five categories where the Coder base beats the Instruct base on the same training data.

Worse at math. Better at almost everything else. Not the specialist I was looking for, but a more consistent mid-tier conversational model — which, honestly, is what Blip needs most of the time anyway.

What I'm actually shipping

Given the data, here's how Blip's hybrid router will work:

Spelling drill mode → blip-edu:v2 (still the best blip-edu on spelling, wins the spell-002 test)
Emotional support turns → blip-edu:v2 (best avg rank on emotional across the three, and the privacy benefit matters most here)
Math drill mode → blip-edu:v2 (rank 6.2 isn't Tier 1 but it's reasonable for a free local model, and the conversational framing means wrong arithmetic doesn't get rewarded)
Multi-turn free chat → blip-edu-coder (best of the three on multi-turn and creative)
Everything else → cloud (per the tier-level analysis from previous posts)
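
The routing list above amounts to a lookup with a cloud fallback. A minimal sketch, assuming the router already knows which mode a turn belongs to (how it decides that is out of scope here). The mode keys, `ROUTES`, and `route` are hypothetical names for illustration, not Blip's actual router code.

```python
# Mode -> model mapping from the routing list above. Mode keys are hypothetical.
ROUTES = {
    "spelling_drill": "blip-edu:v2",
    "emotional_support": "blip-edu:v2",
    "math_drill": "blip-edu:v2",
    "multi_turn_chat": "blip-edu-coder",
}

# Anything without a dedicated local model falls back to the cloud tier.
CLOUD_FALLBACK = "cloud"


def route(mode: str) -> str:
    """Pick the model for a conversation mode, defaulting to cloud."""
    return ROUTES.get(mode, CLOUD_FALLBACK)


print(route("math_drill"))       # blip-edu:v2
print(route("multi_turn_chat"))  # blip-edu-coder
print(route("trivia"))           # cloud
```

Defaulting to cloud means a new or misclassified mode never lands on a local model by accident, which seems like the safer failure direction for quality. For the privacy-sensitive modes, the explicit entries are what keep the turn on-device.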

Both local models respond in about 1.5 seconds on average with zero network round-trips and zero tokens leaving the inference box. Neither one is going to dethrone Claude. That's not the point.

If I could go back to yesterday

I'd tell myself three things:

Targeted dataset changes cut both ways. Swapping math prompts moved math from worst to middle. Growing other categories made them worse. You get what you train for, and you pay for it in the categories you don't.
Don't fine-tune on things the base model already handles. Multi-turn was v1's best category with zero training; adding 2,000 examples turned it into v2's biggest regression. If the base already knows how, your LoRA will mostly overwrite a working behavior with a narrower one.
"More relevant pretraining" doesn't mean "better base" for your specific task. Qwen-Coder has seen far more raw arithmetic than Instruct but produced worse math pedagogy after fine-tuning. Code pretraining doesn't transfer cleanly to "warm and encouraging."

The ceiling

After three blip-edu variants, two dataset generations, a dual-judge benchmark, and about $30 total in cloud and electricity costs — I think I've found the ceiling for this approach. A 7B local model fine-tuned on 14K synthetic examples is Tier 2. Period. No amount of clever dataset design is going to bridge the gap to GLM-5 or Claude Opus. The frontier models have capabilities, not just style, and fine-tuning only changes style.

That's fine. Blip doesn't need Tier 1 for every turn. It needs a model that runs on-device for the sensitive stuff. Responds in under 2 seconds so conversation feels natural. Matches the Blip persona without a 2KB system prompt every turn. blip-edu v2 and blip-edu-coder deliver on all of that.

Total spend, end to end:

v2 dataset generation (14,000 × Sonnet 4.6, 8 parallel workers): ~$11
Training blip-edu:v2 (~90 minutes on the RTX 4090): ~$0.60 electricity
Training blip-edu-coder (~90 minutes on the RTX 4090): ~$0.60 electricity
Benchmark run (14 models × 28 tests + dual judge): ~$4
Total: ~$16.20
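
The total is a straight sum of the four line items. A tiny sketch for anyone who wants to plug in their own numbers; the amounts are the approximate figures listed above, and the `COSTS` keys are just labels I chose.

```python
# Approximate costs in USD, taken from the line items above.
COSTS = {
    "v2 dataset generation": 11.00,
    "train blip-edu:v2": 0.60,
    "train blip-edu-coder": 0.60,
    "benchmark run": 4.00,
}

# Round to cents to avoid floating-point noise in the sum.
total = round(sum(COSTS.values()), 2)
print(f"${total:.2f}")  # $16.20

# Share of the total spent on dataset generation, as a whole percent.
share = round(100 * COSTS["v2 dataset generation"] / total)
print(f"{share}%")  # 68%
```

Roughly two thirds of the spend went to generating the dataset; the two training runs together were barely a dollar.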

Under twenty dollars for two model variants, one confirmed positive finding (conversational math works), one confirmed negative finding (Coder base doesn't help math), and one result I'm still chewing on (don't train on things the base already does well). Good return on a few hours of unattended GPU time.

I'm not training a v3. The dataset-level experiments have hit diminishing returns for a single-GPU 7B workflow. The next interesting move would be tool-use training: teaching the model to call an actual calculator instead of pretending to do arithmetic. That's a different architecture entirely, though, and it deserves its own post.

For now, v2 and coder are live on the inference box. The hybrid router starts using them in production this week. If they behave differently against real 8-year-olds than they did against synthetic benchmark prompts — and I suspect they will — I'll write that up too.