Gemma 4 Crashed the Benchmark, and I Found a Bug I Didn't Know I Had

I ran a big batch last night: 26 models, 28 tests, everything queued up to run overnight. Half the results were garbage. Eight models came back with empty content, and another showed up with a 1,039-character average response when every other model was averaging 40–200 characters. I thought it was an Ollama version issue. It wasn't.

GLM-Z1-32B was outputting its full chain of thought into the content field. Not Ollama's separate thinking key — the actual message.content that gets passed to the judge. The response started with <think>, ran for 800+ tokens of internal reasoning, hit </think>, and then gave a real answer. My runner extracted content from message.content, grabbed the whole thing, and the judge evaluated a thinking trace instead of a kid tutor response. GLM-Z1 was scoring fine on some tests — reasoning models do fine when the judge reads their reasoning — but the scores weren't measuring what I thought they were measuring.

The fix was one regex. Strip <think>...</think> from content before scoring. The re-run happened this morning. The real GLM-Z1 score is 43.3/50.
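A minimal sketch of that kind of fix, assuming the reasoning always arrives as a complete <think>...</think> block inside the content string. The function name strip_thinking is made up for this example, and the exact pattern the runner uses may differ.

```python
import re

# Matches a complete <think>...</think> block. DOTALL lets "." span newlines,
# and the non-greedy ".*?" stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_thinking(content: str) -> str:
    """Return the response text with inline reasoning blocks removed."""
    return THINK_BLOCK.sub("", content).strip()


raw = "<think>\nThe kid asked 2 + 2.\nThat is 4.\n</think>\nGreat question! 2 + 2 is 4."
print(strip_thinking(raw))
```

This version does nothing about an unclosed <think> tag, which could happen if generation is cut off mid-reasoning; that case would need its own handling.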

The actual results

Four new local models, properly scored for the first time:

Rank	Model	Wins	Avg score	Avg time	Avg output tokens
1	Gemma 4 31B (local, new)	13	43.9/50	3.9s	37
2	GLM-Z1-32B (local, fixed)	7	43.3/50	3.8s	204*
3	Qwen3-32B (local)	5	43.2/50	3.0s	38
4	Qwen3-30B-A3B MoE (local)	3	38.6/50	2.0s	282
5	blip-edu-glm-32b (mine)	0	26.1/50	54.3s	2722
6	blip-edu-glm-9b (mine)	0	26.0/50	14.9s	2009

* GLM-Z1 token count includes thinking chain; content delivered to judge has thinking stripped.

To put those scores in context against the full leaderboard: Gemma 4 31B at 43.9 sits between Claude Sonnet 4.6 (44.4) and Claude Opus 4.6 (43.7). On a benchmark of 28 kid-tutor tasks — spelling, math drill, trivia, emotional support, greetings, safety redirection, voice quality, multi-turn conversation, creative storytelling — a Google model I pulled from Ollama is now the best local model I've tested. By a real margin. 13 wins out of 28 tests.

Gemma 4 leads the two categories that matter most for Blip

The per-category breakdown:

Model	Spell	Math	Trivia	Greet	Emotional	Safety	Voice	Multi-turn	Creative
Gemma 4 31B	43.1	43.1	41.3	44.5	44.2	39.8	47.3	48.5	45.3
GLM-Z1-32B	43.4	46.8	40.5	42.7	41.5	39.3	46.3	45.5	43.7
Qwen3-32B	42.1	46.8	47.8	41.3	42.2	37.2	46.2	39.2	43.7
Qwen3-30B-A3B	38.0	45.2	36.3	37.8	36.7	40.0	37.3	34.5	38.2

Voice quality: Gemma 4 at 47.3. Multi-turn: Gemma 4 at 48.5. Those are Blip's two most important categories, the ones that determine whether a conversation feels natural or stilted. A kid asking for help with spelling isn't just evaluating correctness; they're deciding whether they want to keep talking to the thing. Multi-turn is where you find out whether a model can track context, refer back to earlier turns, and vary its phrasing. Gemma 4 leads both.

GLM-Z1 and Qwen3 both hit 46.8 on math — tied for the category lead — and Qwen3 dominates trivia at 47.8. If the routing question is specifically "which local model handles factual Q&A best," the answer changed today.
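The category leads can be read straight off the per-category table. A short sketch that does it mechanically, with the scores copied from that table; the function name category_leaders is just for this example.

```python
# Per-category average scores copied from the table above (out of 50).
CATEGORIES = ["Spell", "Math", "Trivia", "Greet", "Emotional",
              "Safety", "Voice", "Multi-turn", "Creative"]
SCORES = {
    "Gemma 4 31B":   [43.1, 43.1, 41.3, 44.5, 44.2, 39.8, 47.3, 48.5, 45.3],
    "GLM-Z1-32B":    [43.4, 46.8, 40.5, 42.7, 41.5, 39.3, 46.3, 45.5, 43.7],
    "Qwen3-32B":     [42.1, 46.8, 47.8, 41.3, 42.2, 37.2, 46.2, 39.2, 43.7],
    "Qwen3-30B-A3B": [38.0, 45.2, 36.3, 37.8, 36.7, 40.0, 37.3, 34.5, 38.2],
}


def category_leaders(scores, categories):
    """Return {category: (top score, [models tied at that score])}."""
    leaders = {}
    for i, cat in enumerate(categories):
        best = max(row[i] for row in scores.values())
        tied = [model for model, row in scores.items() if row[i] == best]
        leaders[cat] = (best, tied)
    return leaders


for cat, (best, models) in category_leaders(SCORES, CATEGORIES).items():
    print(f"{cat:<11} {best:>5.1f}  {', '.join(models)}")
```

Ties are kept as a list rather than broken arbitrarily, which is why math comes back with two models.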

MoE doesn't buy quality, just speed

Qwen3-30B-A3B is a mixture-of-experts model with 30 billion total parameters and 3 billion active per token. At inference, that means it runs at 141 tokens per second, about three to four times faster than the dense 32B models. The tradeoff I was hoping wouldn't exist turned out to exist: 38.6/50 versus 43+ for the three dense models, a gap of 4.6 to 5.3 points. It shows up in most categories. The exceptions are safety, where the MoE's 40.0 is the best of the four, and math, where its 45.2 sits only 1.6 behind the leaders.
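The numbers in this section can be checked against the first results table. The sketch below assumes the 141 tokens-per-second figure is simply average output tokens divided by average time; that derivation is my reading of the table, not something the benchmark output states.

```python
# Values copied from the results table.
moe_tokens, moe_seconds, moe_score = 282, 2.0, 38.6
dense_scores = {"Gemma 4 31B": 43.9, "GLM-Z1-32B": 43.3, "Qwen3-32B": 43.2}

# Assumed definition: end-to-end throughput = average tokens / average time.
throughput = moe_tokens / moe_seconds

# Score gap between each dense model and the MoE, rounded to one decimal.
gaps = {model: round(score - moe_score, 1) for model, score in dense_scores.items()}

print(f"MoE throughput: {throughput:.0f} tokens/s")
for model, gap in gaps.items():
    print(f"{model}: +{gap} over Qwen3-30B-A3B")
```

The same division is not a fair throughput estimate for the short-answer models, since a 37-token reply is dominated by fixed per-call overhead rather than decode speed.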

Speed matters for Blip — latency is real when a kid is waiting for an answer — but not if the answer quality drops this much. The MoE variant stays in the registry as a potential classifier model for routing decisions, not as a responder.

The GLM base was the wrong base for this fine-tune

I trained two blip-edu variants on GLM-Z1 base models (a 9B and a 32B) on the hypothesis that reasoning-pretrained models might handle arithmetic tutoring better than instruction-tuned Qwen2.5. GLM-Z1 is genuinely strong at math, and blip-edu:v2 loses every math test, so the reasoning went: put a better math brain in the fine-tune, get better math performance out.

What I got instead: both models regularly ran into the 4096-token generation limit. The 9B variant averaged 2,009 output tokens per response and took 14.9 seconds per call. The 32B averaged 2,722 tokens and 54.3 seconds. A greeting test ("Hi, I'm Blip! What should we do?") came back as a 14,721-character stream of consciousness that included spelling drills the kid never asked for, mid-conversation corrections to answers that never happened, and four separate topic pivots.

The LoRA (r=16, alpha 32, 3 epochs) trains adapter weights amounting to about 0.3% of the base model's parameter count. That's enough to redirect style on a model that's already been trained to stop talking. GLM-Z1 was trained to keep talking: to show work, follow chains of reasoning, and not terminate early. The fine-tune couldn't override that. The two variants scored 26.1 (32B) and 26.0 (9B) out of 50, both worse than blip-edu v1 from three weeks ago. Hypothesis falsified.
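For a sense of where a figure like 0.3% comes from: LoRA adds two small matrices per adapted weight, so the trainable count is rank times the sum of the weight's two dimensions. The sketch below uses made-up layer shapes purely for illustration. They are not GLM-Z1's real configuration, and the choice of four adapted projections per layer is also an assumption.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


# HYPOTHETICAL shapes for illustration only, not GLM-Z1's actual config.
HIDDEN = 4096
LAYERS = 40
ADAPTED_PER_LAYER = 4          # assumed: four square attention projections
BASE_PARAMS = 9_000_000_000    # a nominal "9B" model

RANK, ALPHA = 16, 32           # the r and alpha quoted in the post

adapter_params = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = adapter_params / BASE_PARAMS
scaling = ALPHA / RANK         # standard LoRA scales the update by alpha / r

print(f"adapter params: {adapter_params:,}")
print(f"fraction of base: {fraction:.2%}")
print(f"update scaling: {scaling}")
```

With these illustrative shapes the adapter comes out to a fraction of a percent of the base, the same order of magnitude as the 0.3% above. Because the count is linear in rank, r=64 would quadruple it.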

If I wanted to fine-tune on a reasoning base for math, I'd need a much higher-rank LoRA (r=64+, more adapter capacity), a full fine-tune instead of an adapter, or a base model that's been instruction-tuned for stopping behavior first and reasoning-capable second. For now, blip-edu:v2 stays as the math fallback. It doesn't win math tests. But it doesn't generate a novel when you say hello.

Updated routing

Gemma 4 31B is now the primary local routing target for voice-quality-sensitive and multi-turn tasks. GLM-Z1-32B takes math (tied with Qwen3 on math but with a slightly higher overall average, and the math-tutoring framing suits a verbose model better than a single-answer test). Qwen3-32B takes trivia and factual Q&A. The fine-tuned tasks it was already handling stay with blip-edu:v2.
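Written out as a lookup table, the routing looks roughly like this. The category keys and the route function are hypothetical names for this sketch, and the model strings are the display names from the tables rather than real Ollama tags. The blip-edu:v2 tasks are left out because they aren't listed here.

```python
from typing import Optional

# Task category -> local model, per the routing described above.
# Keys are hypothetical labels; values are display names, not Ollama tags.
ROUTES = {
    "voice": "Gemma 4 31B",
    "multi_turn": "Gemma 4 31B",
    "math": "GLM-Z1-32B",
    "trivia": "Qwen3-32B",
    "factual_qa": "Qwen3-32B",
}


def route(category: str) -> Optional[str]:
    """Return the model for a category, or None if no route is defined."""
    return ROUTES.get(category)


for category in ("voice", "math", "trivia", "spelling"):
    print(category, "->", route(category))
```

Returning None for an unrouted category keeps the fallback decision with the caller instead of quietly picking a default model.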

The full leaderboard now has five local models in the 38–44 range and a clear drop below that. The gap between the top local tier and cloud is down to 0.5 points. That's close enough to make the routing math interesting — what's the right latency/cost/quality tradeoff when Gemma 4 running locally is essentially indistinguishable from Opus?

I don't have a firm answer yet. But it's a better problem to have than "all local models lose by ten points."