The 31B Merge Finally Worked. The Qwen3 Fine-Tune Is Broken.

The BF16 path worked. Last post I was debugging an empty directory — Gemma 4 31B's LoRA adapter trained cleanly, but every export attempt either produced nothing or died with an EOF reading stdin. The issue was Gemma4ClippableLinear, a custom Unsloth wrapper around the attention projection layers. PEFT can't dequantize through a non-standard layer class. Result: empty merge output, no crash, no useful error.

The fix was to skip the Unsloth quantized base entirely. Download the BF16 weights directly from HuggingFace — 62 GB, two shards — patch the adapter config to point at those instead, then merge with standard PEFT. No ClippableLinear in the way. It worked. blip-edu-gemma4 has been sitting in Ollama on the inference box for two days waiting for last night's benchmark to run.
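A minimal sketch of that workaround, assuming the adapter directory holds a standard PEFT adapter_config.json with a base_model_name_or_path field. The function names and paths are hypothetical, and AutoModelForCausalLM is an assumption about which loader class fits this checkpoint; it is not the exact script used here.

```python
import json
from pathlib import Path


def patch_adapter_config(adapter_dir: str, bf16_base: str) -> str:
    """Point the adapter at the BF16 base and return the previous base path."""
    cfg_path = Path(adapter_dir) / "adapter_config.json"
    cfg = json.loads(cfg_path.read_text())
    previous = cfg.get("base_model_name_or_path", "")
    cfg["base_model_name_or_path"] = bf16_base
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return previous


def merge_adapter(adapter_dir: str, bf16_base: str, out_dir: str) -> None:
    """Merge the LoRA adapter into the BF16 base with standard PEFT."""
    # Heavy imports stay local so the config patch can run without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the checkpoint loads through AutoModelForCausalLM.
    base = AutoModelForCausalLM.from_pretrained(bf16_base, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
    merged.save_pretrained(out_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(bf16_base).save_pretrained(out_dir)
```

Because the base is loaded unquantized, there is nothing to dequantize during the merge, which is the step that failed on the wrapped layers.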

blip-edu-qwen3 also ran. Mostly timed out.

The full run: 30 models, 28 tests

Same setup I've been using since the start of this series: 28 tasks across spelling drills, arithmetic, trivia, greetings, emotional support, safety refusal, voice quality, multi-turn session management, and creative storytelling. Dual judge: Claude Opus 4.6 and DeepSeek-V3-0324 score independently and the two scores are averaged. Labels are blinded per test so neither judge knows which model it's reading. 30 models total this run, the biggest field so far.
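The blind labelling and judge averaging can be sketched roughly as follows. This is an illustration only: the function names, the "Response N" label format, and the per-test seed are assumptions, not the harness's actual code.

```python
import random
from statistics import mean


def blind_labels(models: list[str], seed: int) -> dict[str, str]:
    """Map opaque labels to model names, reshuffled for each test via the seed."""
    rng = random.Random(seed)
    shuffled = models[:]
    rng.shuffle(shuffled)
    return {f"Response {i + 1}": name for i, name in enumerate(shuffled)}


def combine_judges(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict[str, float]:
    """Average two judges' scores label by label."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("judges scored different label sets")
    return {label: mean([scores_a[label], scores_b[label]]) for label in scores_a}
```

The judges only ever see the labels; the mapping back to model names is applied after both score sets are in.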

#	Model	Total score	Wins
1	DeepSeek-V3-0324	1025.0	3
2	GLM-5.1	1021.5	4
3	Claude Sonnet 4.6	1018.0	3
4	Claude Opus 4.6	1006.5	3
5	GLM-5	981.5	2
6	MiniMax M2.5	969.5	0
7	GLM-Z1-32B	961.5	0
8	GLM-4.5 Air (new)	955.0	1
9	blip-edu-gemma4 (new)	921.0	2
10	GLM-4.5 Air Q2 (new)	918.0	0
—	blip-edu v1	824.0	0
—	blip-edu-glm-9b	658.0	0
—	blip-edu-qwen3 (new)	135.5	0

Scores are summed across 23 scored tests (5 tests excluded due to model errors or missing outputs). Max possible: ~1150. qwen3-235b, qwen-coder-32b-fp8, qwen3-30b-a3b, and gemma4-31b base had infrastructure failures and are excluded from the main table.
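For reference, the totals arithmetic looks roughly like this. The 50-point per-test ceiling is an inference from the ~1150 maximum over 23 scored tests, and the function names are hypothetical.

```python
def total_score(per_test: dict[str, float], excluded: set[str]) -> float:
    """Sum a model's averaged judge scores over the tests that were not excluded."""
    return sum(score for test_id, score in per_test.items() if test_id not in excluded)


def max_possible(n_tests: int, n_excluded: int, per_test_max: float = 50.0) -> float:
    """Upper bound on the total; per_test_max=50 is an assumption inferred from ~1150 / 23."""
    return (n_tests - n_excluded) * per_test_max
```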

blip-edu-gemma4: first fine-tune to crack the top 10

921 points. 9th out of 30 models. Two individual wins. That's the best result any blip-edu variant has gotten — better than blip-edu v1 at 824, better than the five dataset A/B variants from the previous run (which topped out at 829 with ab-childes). The 31B base and r=16 LoRA combination, trained on 14,000 v2 examples over three epochs, actually transferred the behavioral characteristics without destroying what the base model was good at.

Per-category, it looks like this:

Model	Spell	Math	Trivia	Greet	Emo	Safety	Voice	Multi	Creative
DeepSeek-V3-0324	45.1	47.8	43.8	46.2	40.5	43.5	45.0	48.8	40.0
GLM-4.5 Air	38.6	46.2	44.8	41.7	33.3	40.5	42.0	49.0	40.8
blip-edu-gemma4	43.6	35.3	41.5	44.5	41.7	21.2	34.5	48.5	41.8
blip-edu v1	43.5	31.0	31.2	42.5	40.2	23.2	37.5	35.0	30.8

That 21.2 on safety is a problem. blip-edu v1 scores 23.2, which isn't much better. Fine-tuning for warmth and educational engagement apparently comes at the cost of the behaviors that handle "tell me your home address" or "what's a bad word." The safety tests are exactly the ones that matter most for a children's product running without a parent in the loop.

Everything else looks good. Emotional support at 41.7 is competitive with the best general models. Multi-turn at 48.5 matches Claude Opus. Creative at 41.8 lands within half a point of Claude Sonnet's 42.2 and ahead of DeepSeek-V3-0324's 40.0, which I did not expect from a fine-tune. I'm not entirely sure why.

Math at 35.3 is weak, which matches the base behavior of the blip-edu series generally. That's not new information. The interesting thing is that the 31B base pushes multi-turn and creative performance meaningfully higher than any earlier version, while keeping the warm-tutor personality mostly intact.

The Qwen3 fine-tune is broken

blip-edu-qwen3 completed 8 tests out of 28. The other 20 hit the 120-second timeout.

Of the 8 that finished, the responses were wrong in a way that had nothing to do with speed. One spelling test came back with "C H E R !!\nA blog about cats, books, and the people who love them." Another was an 84-second stream that opened with "What words should I try? Okay, let's see. I want to practice spelling. What wor". That isn't a tutor response; it's an internal monologue. The model was continuing its own training distribution instead of adopting the persona.
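A per-test cutoff like the 120-second one can be sketched as below. The generate callable stands in for whatever client the runner uses, and the names are hypothetical; a real runner would also cancel the underlying request rather than just abandoning the thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional


def run_with_timeout(
    generate: Callable[[str], str], prompt: str, timeout_s: float = 120.0
) -> tuple[Optional[str], float, str]:
    """Return (response, elapsed_seconds, status) where status is "ok" or "timeout"."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(generate, prompt)
    try:
        response: Optional[str] = future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        response, status = None, "timeout"
    finally:
        # Do not block on a stuck call; the worker thread is left to finish on its own.
        pool.shutdown(wait=False)
    return response, time.monotonic() - start, status
```

Recording elapsed time alongside the status makes it easy to tell a slow-but-finished stream, like the 84-second one, apart from a hard timeout.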

Qwen3-32B base scored 268 in this same run (22 tests completed), so the base model runs fine. The fine-tune is what's broken: something in the training either didn't converge or actively destabilized the output distribution. My first guess was that the LoRA rank was too low (r=16 on a 32B dense model leaves roughly 0.1% of the weights trainable, a lower ratio than I used on Qwen2.5-7B), but it's also possible that Qwen3's thinking-mode scaffolding is interfering. The model is the instruct variant, not the thinking variant, but there might be residual chain-of-thought scaffolding in the instruction templates that the LoRA amplified instead of overriding.
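The rank-versus-coverage arithmetic is easy to check. A LoRA adapter on a d_out x d_in matrix adds rank * (d_in + d_out) trainable parameters. The layer shapes below are hypothetical placeholders, not the real Qwen3-32B dimensions or the target modules used in this run, so the printed ratios are illustrative only.

```python
def lora_trainable_params(rank: int, shapes: list[tuple[int, int]]) -> int:
    """Each adapted (d_out, d_in) matrix adds A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


def coverage(rank: int, shapes: list[tuple[int, int]], total_params: float) -> float:
    """Fraction of the model's parameters that the adapter trains."""
    return lora_trainable_params(rank, shapes) / total_params


# Hypothetical setup: 64 layers, two square 5120 x 5120 projections adapted per layer.
shapes = [(5120, 5120)] * (64 * 2)
for r in (16, 32, 64):
    print(f"r={r}: {coverage(r, shapes, 32e9):.4%} of weights trainable")
```

Coverage scales linearly with rank, so going from r=16 to r=32 doubles it whatever the real shapes turn out to be.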

I haven't diagnosed it yet. I'm not even sure which of those is right. Either way, blip-edu-qwen3 isn't usable.

GLM-4.5 Air: 8th place on its first run

106 billion total parameters, 12 billion active per token — a mixture-of-experts architecture from ZhipuAI. Runs via llama-cpp-python on port 11435 (Ollama doesn't handle sharded GGUFs, which is apparently still an open issue). The model emits thinking traces; the runner strips them before scoring. What I'm measuring is the actual response after the reasoning phase, not the reasoning itself.
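Stripping the reasoning before scoring might look something like this. It assumes the traces arrive wrapped in <think>...</think> tags, which is a guess about the output format rather than something confirmed here; the function name is hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_thinking(text: str) -> str:
    """Remove reasoning traces so only the final answer is scored."""
    cleaned = THINK_BLOCK.sub("", text)
    # A trace that never closes (for example, truncated output) has no answer after it.
    cleaned = UNCLOSED_THINK.sub("", cleaned)
    return cleaned.strip()
```

Handling the unclosed case matters for timeouts: a response cut off mid-reasoning should score as empty, not as a tutor reply.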

955.0 total score, 1 win, 8th out of 30. Strong on multi-turn (49.0) and math (46.2). The emotional support score is low at 33.3 — a general-purpose reasoning model doesn't naturally reach for warmth when a kid says they're frustrated, and the thinking overhead probably doesn't help with that. But for factual tasks it's genuinely competitive with the 32B cloud models.

I also ran a Q2_K_XL quantized version via Ollama (single-file, no sharding issues). That scored 918.0 — 37 points behind Q4_K_M, which is a smaller gap than I expected for Q2 quantization. At 106B parameters, even a heavily quantized model is working with a lot of weights. The Q4 version is clearly better, but the Q2 fallback isn't useless, which matters for scenarios where you want the model on a smaller memory footprint.

The base model failures

qwen3-30b-a3b and gemma4-31b (the unmodified base, not the fine-tune) both returned empty strings on most tests and scored 286.5 each, tied near the bottom of the field. This isn't a real score. It's an inference failure. Both models completed test 1 fine, then produced nothing for tests 2 through 28. My best guess is a context accumulation issue in Ollama: something about how the chat history was being passed caused the model to generate zero tokens, returning an empty completion instead of erroring. These models aren't actually bad. I've benchmarked both before; Gemma 4 31B base scored 43.9 last run. The result here is a runner bug, not a capability result.
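One way to guard against this in the runner, sketched under the assumption that history leaking between tests is the cause: build a fresh message list for every test and flag empty completions instead of passing them to the judges. The chat callable and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    test_id: str
    response: str
    status: str  # "ok" or "empty"


def run_tests(
    chat: Callable[[list[dict]], str], tests: dict[str, list[dict]]
) -> list[RunResult]:
    """Run each test with its own history so nothing carries over between tests."""
    results = []
    for test_id, messages in tests.items():
        history = [dict(m) for m in messages]  # fresh copy per test
        response = chat(history) or ""
        status = "ok" if response.strip() else "empty"
        results.append(RunResult(test_id, response, status))
    return results


def is_infrastructure_failure(results: list[RunResult], max_empty_fraction: float = 0.5) -> bool:
    """Treat a run dominated by empty completions as a runner problem, not a score."""
    empty = sum(r.status == "empty" for r in results)
    return bool(results) and empty / len(results) > max_empty_fraction
```

The 0.5 threshold is arbitrary; the point is that a model producing nothing on most tests gets pulled from the ranking automatically.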

I'll fix the accumulation issue and rerun those two. The rest of the results stand.

Where this leaves the routing decisions

blip-edu-gemma4 is the pick for emotional and multi-turn tasks in production, but the safety gap has to be addressed first: I'm not deploying a 21.2-safety model to kids without a safety layer in front of it. Whether that's a classifier, a safety-specific LoRA layer on top, or retraining with safety examples mixed in, I don't know yet.

GLM-4.5 Air is the new candidate for math and factual tasks on the inference box. It handles math at 46.2 and multi-turn at 49.0 — better than GLM-Z1-32B on multi-turn at 45.5, worse than the cloud models overall but now genuinely close enough to matter.
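As a rough picture of how these routing decisions could be wired up: the model identifiers, category keys, the "unassigned" fallback, and the "safety-layer" handler are all placeholder names, and the safety check is a pluggable callable because that design is still undecided.

```python
from typing import Callable

# Only the assignments stated above; other categories are left unassigned.
ROUTES = {
    "emotional": "blip-edu-gemma4",
    "multi_turn": "blip-edu-gemma4",
    "math": "glm-4.5-air",
    "factual": "glm-4.5-air",
}


def route(
    category: str,
    prompt: str,
    is_safety_sensitive: Callable[[str], bool],
    fallback: str = "unassigned",
    safety_handler: str = "safety-layer",
) -> str:
    """Pick a backend for a request, checking safety before any category routing."""
    if is_safety_sensitive(prompt):
        return safety_handler
    return ROUTES.get(category, fallback)
```

Running the safety check first means a low-safety fine-tune never sees a sensitive prompt, whichever category it was filed under.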

The Qwen3 fine-tune needs a fresh training run. Higher rank (r=32 or r=64), and I'll check the instruction template before training to make sure thinking-mode scaffolding isn't contaminating the chat format.
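A sketch of both checks, with several assumptions: that thinking scaffolding would show up as literal <think> tags in the rendered prompt, that lora_alpha is set to twice the rank, and that the attention projections are the target modules. None of these are confirmed settings for this project.

```python
THINK_MARKERS = ("<think>", "</think>")


def find_thinking_markers(rendered_prompt: str) -> list[str]:
    """Return any thinking-mode markers present in a rendered chat prompt."""
    return [marker for marker in THINK_MARKERS if marker in rendered_prompt]


def render_prompt(model_id: str, messages: list[dict]) -> str:
    """Render messages through the tokenizer's chat template without tokenizing."""
    from transformers import AutoTokenizer  # local import: only needed for this step

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


def make_lora_config(rank: int = 32):
    """Build a higher-rank LoRA config; alpha and target modules are assumptions."""
    from peft import LoraConfig

    return LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
```

Running find_thinking_markers over a handful of rendered training examples before the run starts is a cheap way to catch scaffolding that would otherwise end up in every sample.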

Benchmark harness and training scripts at github.com/drbarry-blip. Raw results and judge outputs available on request.