I Trained My Own Kid-Tutor LLM. Here's How It Did Against the Frontier Models

Fifty-eight minutes on a 4090. That's how long it took to train a model that would go head-to-head against Claude Opus and DeepSeek-V3 on a 28-test kid-tutor benchmark. The model is called blip-edu: a 7B Qwen2.5 with a LoRA adapter fine-tuned on 8,500 synthetic conversations about spelling, math, stories, and emotional support for kids. The GGUF came out at 4.7 GB. It runs on the inference box in about 1.5 seconds per response, which makes it the second-fastest model in the suite, behind only qwen3.5-abliterated at 0.8 seconds.

It won 2 out of 28 tests. Dead-middle Tier 2. And the two it won taught me more about what fine-tuning actually does than four previous benchmark posts combined.

Why bother?

Every post in this series has ended with some version of "just use Claude." That conclusion still holds on raw quality. But quality isn't the only axis that matters when a six-year-old is talking to your device.

Privacy is the big one. When a kid says "I'm scared" or "my friend hit me," that utterance shouldn't leave the device. Routing those turns through a cloud API means a preschooler's confession lands in someone else's logs. I don't want that. Local inference keeps it on the box.

Latency matters too. Cloud round-trips add 300-800 ms of network overhead per turn, and for a kid interacting with an AI, that's the difference between a conversation and waiting on hold. Local inference skips the network hop entirely; in the benchmark below, blip-edu averaged 1.5 seconds per full response.

Then there's cost. Math drill mode runs dozens of practice problems per session. On a cloud API that is pennies per session, and the pennies recur every session, every day. Local inference is a sunk cost plus electricity. If a local model can hit even 70% of cloud quality on the categories that matter, those three reasons add up fast.

The build, in plain English

Step 1 — generate the dataset (Claude as teacher)

I had Claude Sonnet 4.6 write 8,500 example conversations between a child and Blip. Six categories, each with its own prompt template:

general_chat — 3,000 examples of free-form kid talk
math_drill — 1,500 examples of arithmetic with kid feedback
story_collab — 1,500 collaborative storytelling turns
spelling_drill — 1,000 spelling practice exchanges
science_qa — 1,000 kid science questions with age-appropriate answers
safety_redirect — 500 careful redirects for sensitive topics

Each example is one child utterance and one ideal Blip response, with context (child age, theme, drill mode) attached. The category prompts encode pedagogical rules: never overpraise, never use markdown, gently correct wrong answers, redirect emotional moments toward a grown-up. Claude follows these rules consistently, which means the training data has uniform style — exactly what a small model needs to learn a persona.
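
To make the dataset shape concrete, here is a minimal sketch of the category mix and of what one record might look like. The counts come from the list above; the field names on the record are hypothetical, since the only thing stated here is that each example carries a child utterance, an ideal response, and some context.

```python
from dataclasses import dataclass

# Category sizes as listed above.
CATEGORY_COUNTS = {
    "general_chat": 3000,
    "math_drill": 1500,
    "story_collab": 1500,
    "spelling_drill": 1000,
    "science_qa": 1000,
    "safety_redirect": 500,
}


@dataclass
class Example:
    # Hypothetical field names, chosen for illustration only.
    category: str
    child_age: int
    theme: str
    child_utterance: str
    blip_response: str


def category_shares(counts):
    """Return each category's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


TOTAL_EXAMPLES = sum(CATEGORY_COUNTS.values())
SHARES = category_shares(CATEGORY_COUNTS)
```

The six categories sum to 8,500, with general chat at a bit over a third of the data and safety redirects at roughly 6%.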

Why use Claude as the teacher instead of writing 8,500 examples by hand? Because Anthropic already spent an enormous amount of effort aligning Claude for safety and helpfulness. Generating training data with Claude carries much of that behavior into the local model in one batch, rather than us spending months on RLHF. Total cost for the dataset: under $5.

Step 2 — fine-tune via LoRA (Unsloth on a 4090)

LoRA stands for Low-Rank Adaptation. Instead of updating all 7 billion parameters of the base model — which would take days and tens of gigabytes of disk — you freeze the base and add a small adapter. About 40 million extra parameters in this case, 0.53% of the total. Only the adapter trains. The result is a tiny ~150 MB file you can swap on top of any matching base.
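
The "about 40 million" figure can be checked by hand. A rank-r adapter on a linear layer adds r x (inputs + outputs) parameters. The sketch below applies that to the seven projection types at rank 16, with the layer shapes assumed from the public Qwen2.5-7B config (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128). Treat those shapes as an assumption, not something taken from the training script.

```python
# Layer shapes for Qwen2.5-7B, assumed from the public model config.
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
HEAD_DIM = 128
NUM_KV_HEADS = 4
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 512, because of grouped-query attention

RANK = 16

# (in_features, out_features) for each adapted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, projections, num_layers):
    """Count adapter parameters: A is (rank x in), B is (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


ADAPTER_PARAMS = lora_params(RANK, PROJECTIONS, NUM_LAYERS)
ADAPTER_MIB_FP32 = ADAPTER_PARAMS * 4 / 2**20
```

Under those assumptions the adapter comes to 40,370,176 parameters, which is about 0.53% of a roughly 7.6 billion parameter base, and about 154 MiB if saved at 4 bytes per parameter.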

The settings:

Base: unsloth/Qwen2.5-7B-Instruct (Apache 2.0, 7B params, 4-bit loaded)
LoRA rank 16, alpha 32, dropout 0.05
Target modules: q, k, v, o, gate, up, down (all attention + MLP projections)
Learning rate 2e-4, cosine schedule, 3% warmup
3 epochs, batch size 4, grad accum 4 (effective batch 16)
Trained on assistant turns only — user/system messages masked from the loss

That last bullet matters more than it looks. Without masking, the model eats capacity learning what kids say instead of how to respond to them. Mask the loss, and 100% of the gradient signal goes toward teaching the response style.
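
Mechanically, masking usually means setting the label for every non-assistant token to an ignore value so the loss skips it. Here is a toy sketch of that idea in plain Python. The token ids and role tags are made up, and this is not the training script's actual code; the only real-world fact it leans on is that -100 is the default ignore index for PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch's cross-entropy loss


def mask_non_assistant(token_ids, roles):
    """Build labels that keep assistant tokens and ignore everything else."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must be the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy sequence with made-up token ids: system prompt, child turn, Blip reply.
token_ids = [11, 12, 21, 22, 23, 31, 32, 33, 34]
roles = ["system"] * 2 + ["user"] * 3 + ["assistant"] * 4
labels = mask_non_assistant(token_ids, roles)
```

Only the last four positions keep real labels, so only Blip's reply contributes to the gradient.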

Total training time: 58 minutes for 3 epochs over 8,500 examples. Final training loss 0.95. Unsloth's custom Triton kernels make this roughly 2x faster than a stock Hugging Face Trainer setup.
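
For a sense of scale, the settings above imply the following step count and throughput. This is back-of-envelope arithmetic from the numbers quoted in this post, assuming the final partial batch of each epoch is kept; a real trainer may count steps slightly differently.

```python
import math

EXAMPLES = 8500
EPOCHS = 3
BATCH_SIZE = 4
GRAD_ACCUM = 4
TRAIN_MINUTES = 58

effective_batch = BATCH_SIZE * GRAD_ACCUM
# Assumption: the last partial batch of each epoch is kept, not dropped.
steps_per_epoch = math.ceil(EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
seconds_per_step = TRAIN_MINUTES * 60 / total_steps
examples_per_second = EXAMPLES * EPOCHS / (TRAIN_MINUTES * 60)
```

That works out to 532 optimizer steps per epoch, 1,596 in total, at a little over 2 seconds per step and roughly 7 examples per second.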

Step 3 — convert to GGUF + register with Ollama

After training, merge the LoRA adapter back into the base (one train.py --merge step), convert the merged checkpoint to GGUF via llama.cpp's converter, scp the 4.7 GB Q4_K_M file to the inference box, register it with Ollama via a Modelfile, and it's queryable as ollama run blip-edu.

Total VRAM at inference: ~6 GB with a 4096-token context cap. By the standards of this benchmark, that's tiny — the opus-distilled Qwens take 16-21 GB each.
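
The Ollama side is small. A Modelfile needs little more than a FROM line pointing at the GGUF, plus any parameters you want baked in. The sketch below just builds that text. The GGUF file name is hypothetical, and the actual Modelfile used for blip-edu may set more than this.

```python
def build_modelfile(gguf_path, num_ctx=4096):
    """Return Modelfile text that points Ollama at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        # Matches the 4096-token context cap mentioned above.
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file name; the real path is not given in this post.
MODELFILE_TEXT = build_modelfile("./blip-edu-q4_k_m.gguf")
```

Saved as Modelfile, that text is what ollama create blip-edu -f Modelfile would consume, after which ollama run blip-edu works as described.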

The benchmark result

I added blip-edu to config.yaml with conversational sampling settings (temperature 0.5, default top_p, no repetition penalty — same lesson as qwen-coder Run 3) and ran the full 28-test Blip suite with all 12 models, dual-judged by Claude Opus and DeepSeek-V3-0324.

Rank	Model	Wins	Avg rank	Avg time	Avg tokens out
1	DeepSeek-V3-0324	7	4.00	2.9s	35
2	Claude Opus 4.6	4	4.11	4.0s	53
2	GLM-5	4	4.25	9.0s	126
4	Claude Sonnet 4.6	3	4.21	6.9s	46
5	MiniMax-M2.5	2	4.96	5.1s	128
5	qwen2.5-coder:32b	2	5.82	3.3s	39
5	qwen3.5-abliterated:35b-a3b	2	6.57	0.8s	53
5	blip-edu (mine)	2	7.00	1.5s	44
9	DeepSeek-R1:32b	1	7.21	3.2s	36
9	Hermes3:8b	1	7.96	1.6s	62
11	qwen3.5-35b-a3b-opus-distilled	0	10.29	3.5s	446
12	qwen3.5-27b-opus-distilled	0	11.61	11.4s	496

2 wins out of 28 — same band as qwen-coder-32b, qwen-abliterated, and MiniMax-M2.5. Above DeepSeek-R1 and Hermes3. Solidly Tier 2. Did not crack Tier 1.
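
In case the two headline columns are unclear: a win is a first-place finish on a test, and average rank is the mean finishing position across all tests. Here is a minimal sketch of that bookkeeping on made-up data. It assumes each test yields one combined ordering of models; how the two judges' verdicts get merged into that ordering is not covered here.

```python
from collections import defaultdict
from statistics import mean


def summarize(rankings):
    """rankings: list of per-test model orderings, best model first."""
    wins = defaultdict(int)
    ranks = defaultdict(list)
    for ordered in rankings:
        wins[ordered[0]] += 1
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {
        model: {"wins": wins[model], "avg_rank": round(mean(ranks[model]), 2)}
        for model in ranks
    }


# Three made-up tests with three made-up models, not the real benchmark data.
toy_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]
summary = summarize(toy_rankings)
```

The two numbers can disagree, which is why the table reports both: a model can rack up a couple of wins while still averaging a mediocre rank everywhere else.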

The two wins that mattered

The count isn't the interesting part. The which is.

blip-spell-002: a "celebrate the correct spelling" prompt. blip-edu beat all 11 other models, including Claude Opus and Sonnet. Spelling drill was tied for its fourth-largest training category (1,000 examples), and the celebration pattern is stylistically distinctive: short, warm, never over-praising. The fine-tune learned this exactly.
blip-emo-001: an emotional support prompt where a kid expressed distress. blip-edu's response followed the trained pattern: validate briefly, redirect to a grown-up, no lecture, no overreaction. It beat Claude Sonnet, which usually owns this category, on this specific test.

Both wins came from categories where the training data had a clear, narrow stylistic target. The fine-tune did what fine-tunes are supposed to do: teach a small model a specific tone that it then executes faster and cheaper than the larger models can.

Math was supposed to be the strength

Per-category average rank for blip-edu (lower is better, 1 = best, 12 = worst):

multi_turn: 4.0 (best category — top half)
emotional: 5.0
spelling: 5.8
creative: 6.0
trivia: 6.3
safety: 8.0
voice_quality: 8.3
greeting: 8.7
math: 9.5 (worst category; third-from-bottom in the field)

I stared at that math number for a while. 1,500 math examples, hand-curated for variety, trained for 3 full epochs. And yet blip-edu ranks ninth-and-a-half — worse than every active local model. Only the two opus-distilled bottom-feeders sit below it.

What happened? I had this backwards going in, and I want to say it plainly because it's the most important thing I learned from the whole project.

Fine-tuning teaches style, not capability

The math training data taught blip-edu the format of a kid-tutor math response — warm, encouraging, gentle correction, the count-up trick. But it can't teach it to actually do arithmetic. The underlying Qwen2.5-7B knows arithmetic about as well as it did before training: well enough to handle 6+7, not well enough to consistently get 13+27 or 8x9 right. When the test prompt asks the model to verify or correct a wrong answer, it produces a kid-friendly response in the right tone — and gets the math wrong about a third of the time. Wrong-and-warm scores worse than right-and-cold in the judges' eyes.

The cloud models win math because they can actually multiply two-digit numbers. blip-edu knows what a kind math tutor sounds like. It doesn't know what 7x8 is with high reliability. The 1,500 training examples taught it the wrapper, not the substance.

So here's the rule I'd write on a wall: fine-tuning a 7B model on 8,500 examples teaches it to talk like a kid tutor. It doesn't teach it to be one. The persona, the warmth, the sentence length, the style of correction — all of those transfer cleanly. The actual cognitive capabilities — math, reasoning, factual recall, coherent multi-step thought — stay wherever the base model left them. If you want a small model that's smarter, fine-tuning won't get you there. If you want one that sounds right in a specific context, it works exactly as advertised.

The multi-turn surprise

Here's what I didn't expect: multi-turn was blip-edu's best category, averaging 4.0 — ahead of two Claude models on this run. And the training set contains exactly zero multi-turn examples. Every single example is one child utterance plus one ideal response.

My read: the base Qwen2.5-7B already had reasonable multi-turn capability, and the kid-tutor LoRA didn't damage it. The persona training pulled the model toward Blip's style without overwriting its existing context awareness. That's a quiet win — many narrow fine-tunes do wreck the base model's general capabilities, and this one seems not to have. I think. I'd want another run or two to be confident, but the signal is encouraging.

The variance caveat

This is one run. The previous post ran the same benchmark three times and found ±2-3 wins of noise across runs. blip-edu got 2 wins here; on a different run it might get 1 or 4. The category-level placements are noisy too: a different draw of judge nondeterminism could shift one of those wins to a different model.

What I'm most confident about, given the variance bounds:

blip-edu is firmly in Tier 2 (1-3 wins range), like its size peers
It's not in Tier 1 (4-7 wins range with the cloud frontier models)
Its category strengths (spelling, emotional, multi-turn) are likely real even though the exact win counts are noisy
Its math weakness is structural, not noise: the base model can't do arithmetic reliably, and no rerun will fix that

What I'd change for v2

Add multi-turn training data. blip-edu already places best on multi-turn despite no training for it. With 1,500-2,000 multi-turn examples it might genuinely compete with Claude Opus on this category. The training pipeline supports it — just need a new multi_turn category in generate_dataset.py with conversation sequences instead of single-turn pairs.
Drop the math "verify the kid's arithmetic" prompts. Replace with math-style conversational prompts that don't require correct arithmetic to win — "I'm stuck on this problem" followed by "Let's work through it together. What part do you think is hard?" Reward the persona without testing capability the model doesn't have.
Larger dataset: 15,000-25,000 examples. 8,500 is on the small side for chat-domain LoRA. More data costs maybe $10-15 with Sonnet 4.6 as the teacher. Training time scales roughly linearly, so expect about 1.7 hours at 15,000 examples and closer to 3 hours at 25,000 on the 4090.
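
The dataset-size estimates are easy to sanity-check by scaling the v1 numbers linearly. This is a rough sketch, assuming both training time and generation cost grow in proportion to example count and treating the roughly $5 dataset cost as an upper bound.

```python
V1_EXAMPLES = 8500
V1_TRAIN_MINUTES = 58
V1_DATASET_COST_USD = 5.0  # the "~$5" figure from this post, used as an upper bound


def scale_linearly(v1_value, new_examples, v1_examples=V1_EXAMPLES):
    """Scale a v1 measurement in proportion to dataset size."""
    return v1_value * new_examples / v1_examples


def estimate(new_examples):
    return {
        "train_hours": round(scale_linearly(V1_TRAIN_MINUTES, new_examples) / 60, 2),
        "dataset_cost_usd": round(scale_linearly(V1_DATASET_COST_USD, new_examples), 2),
    }


low = estimate(15_000)
high = estimate(25_000)
```

That gives about 1.71 hours and $8.82 at 15,000 examples, and about 2.84 hours and $14.71 at 25,000. Under linear scaling, training time crosses the two-hour mark at around 17,600 examples.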

I'm not going to try a bigger base model. The 14B Qwens won't help with the math problem and they'd cut into the local-fast story. And I'm not doing more epochs — 3 epochs at this dataset size already showed loss plateauing. The fix is the data, not the training loop.

Routing decisions

Based on this run plus the variance bounds from the previous post, here's how I'd actually route Blip's traffic:

Spelling drill → blip-edu. Won blip-spell-002, fast, local, free, keeps data on the device. Even if it's not the absolute best, it's good enough at the right cost.
Emotional support → blip-edu. Won emo-001. Privacy benefit is highest here — kid distress should not leave the device.
Math drill → still cloud. A local model that gets the answer wrong a third of the time is worse for a kid than the latency cost of going to the cloud.
Multi-turn free chat → could go either way. blip-edu averaged rank 4.0 here despite no multi-turn training data, so it's a defensible local fallback when cloud is unavailable.
Everything else → cloud. The tier-based reasoning from previous posts still holds.
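
Written as code, the routing table is only a few lines. This is a sketch with hypothetical category names and a hypothetical cloud_available flag. One choice is mine rather than anything decided above: when the cloud is down, categories without an endorsed local fallback return None, so the caller has to handle it explicitly instead of quietly sending math to the local model.

```python
LOCAL = "blip-edu"
CLOUD = "cloud"

# Hypothetical category keys mirroring the list above.
ROUTES = {
    "spelling_drill": LOCAL,
    "emotional_support": LOCAL,
    "math_drill": CLOUD,
    "multi_turn_chat": CLOUD,
}

# Only multi-turn chat is described above as an acceptable local fallback.
LOCAL_FALLBACK_OK = {"multi_turn_chat"}


def route(category, cloud_available=True):
    """Pick a backend for one turn; unknown categories default to cloud."""
    target = ROUTES.get(category, CLOUD)
    if target == CLOUD and not cloud_available:
        # Assumption of this sketch: no silent local fallback for other categories.
        return LOCAL if category in LOCAL_FALLBACK_OK else None
    return target
```

For example, route("math_drill", cloud_available=False) returns None rather than falling back to a model that gets the answer wrong a third of the time.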

Cost

Total spend on blip-edu v1, end to end:

Dataset generation (8,500 examples via Sonnet 4.6): ~$5
GPU time (58 min on the 4090, mostly electricity): ~$0.50
Benchmark run (12 models x 28 tests + dual judge): ~$3.72
Storage (4.7 GB GGUF on inference box): negligible

Under $10 to build, train, deploy, and benchmark a custom kid-tutor model that wins 2 of 28 tests against frontier cloud models and runs in 1.5 seconds on a single GPU. Not "GPT-4 at home." A useful tool for the specific tasks it was trained for, at a price that makes the privacy and latency arguments easy to honor.
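
The "under $10" total is just the line items above added up, with storage counted as zero:

```python
# Line items from the cost list above, in US dollars.
COSTS_USD = {
    "dataset_generation": 5.00,
    "gpu_time": 0.50,
    "benchmark_run": 3.72,
    "storage": 0.00,
}

total_usd = round(sum(COSTS_USD.values()), 2)
benchmark_share = round(100 * COSTS_USD["benchmark_run"] / total_usd, 1)
```

That comes to $9.22, and about 40% of it is the benchmark run rather than the model itself.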

I keep thinking about that math result though. The model sounds exactly like a patient, warm tutor — and then tells a kid that 8 times 9 is 63. Style without substance is worse than no style at all when the answer actually matters. That's the line I'm going to use to decide what gets routed locally and what doesn't, for everything I build after this.