Gemma 4 Crashed the Benchmark and a Bug I Didn't Know I Had

I ran a big batch last night: 26 models, 28 tests, everything queued up to run overnight. Half the results were garbage. Eight models came back with empty content, and another showed up with a 1,039-character average response when every other model was averaging 40–200 characters. I thought it was an Ollama version issue. It wasn't.

GLM-Z1-32B was outputting its full chain of thought into the content field. Not Ollama's separate thinking key — the actual message.content that gets passed to the judge. The response started with <think>, ran for 800+ tokens of internal reasoning, hit </think>, and then gave a real answer. My runner extracted content from message.content, grabbed the whole thing, and the judge evaluated a thinking trace instead of a kid tutor response. GLM-Z1 was scoring fine on some tests — reasoning models do fine when the judge reads their reasoning — but the scores weren't measuring what I thought they were measuring.

The fix was one regex. Strip <think>...</think> from content before scoring. The re-run happened this morning. The real GLM-Z1 score is 43.3/50.
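A minimal sketch of that kind of fix, assuming the thinking trace always arrives as a complete <think>...</think> block inside the content string. The function name and the sample response are made up for illustration; this is not the runner's actual code.

```python
import re

# Matches a complete <think>...</think> block. DOTALL lets "." span newlines,
# and the non-greedy ".*?" stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(content: str) -> str:
    """Remove inline thinking blocks and return what is left, trimmed."""
    return THINK_BLOCK.sub("", content).strip()


# Hypothetical response shaped like the one described above.
raw = "<think>\nThe kid asked for 7 x 8...\n</think>\nSeven times eight is 56!"
print(strip_thinking(raw))
```

One caveat with this sketch: a trace that gets cut off before </think> (for example at a token limit) is left untouched, so a real runner would probably want to flag that case separately rather than score it.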

The actual results

Four local models, properly scored for the first time, plus my two GLM-based fine-tunes:

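| # | Model | Wins | Avg score | Avg time | Tokens |
|---|-------|------|-----------|----------|--------|
| 1 | Gemma 4 31B (local, new) | 13 | 43.9/50 | 3.9s | 37 |
| 2 | GLM-Z1-32B (local, fixed) | 7 | 43.3/50 | 3.8s | 204* |
| 3 | Qwen3-32B (local) | 5 | 43.2/50 | 3.0s | 38 |
| 4 | Qwen3-30B-A3B MoE (local) | 3 | 38.6/50 | 2.0s | 282 |
| 5 | blip-edu-glm-32b (mine) | 0 | 26.1/50 | 54.3s | 2722 |
| 6 | blip-edu-glm-9b (mine) | 0 | 26.0/50 | 14.9s | 2009 |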

* GLM-Z1 token count includes thinking chain; content delivered to judge has thinking stripped.

To put those scores in context against the full leaderboard: Gemma 4 31B at 43.9 sits between Claude Sonnet 4.6 (44.4) and Claude Opus 4.6 (43.7). On a benchmark of 28 kid-tutor tasks — spelling, math drill, trivia, emotional support, greetings, safety redirection, voice quality, multi-turn conversation, creative storytelling — a Google model I pulled from Ollama is now the best local model I've tested. By a real margin. 13 wins out of 28 tests.

Gemma 4 leads the two categories that matter most for Blip

The per-category breakdown:

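| Model | Spell | Math | Trivia | Greet | Emotional | Safety | Voice | Multi-turn | Creative |
|-------|-------|------|--------|-------|-----------|--------|-------|------------|----------|
| Gemma 4 31B | 43.1 | 43.1 | 41.3 | 44.5 | 44.2 | 39.8 | 47.3 | 48.5 | 45.3 |
| GLM-Z1-32B | 43.4 | 46.8 | 40.5 | 42.7 | 41.5 | 39.3 | 46.3 | 45.5 | 43.7 |
| Qwen3-32B | 42.1 | 46.8 | 47.8 | 41.3 | 42.2 | 37.2 | 46.2 | 39.2 | 43.7 |
| Qwen3-30B-A3B | 38.0 | 45.2 | 36.3 | 37.8 | 36.7 | 40.0 | 37.3 | 34.5 | 38.2 |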

Voice quality: Gemma 4 at 47.3. Multi-turn: Gemma 4 at 48.5. Those are Blip's two most important categories — the ones that determine whether a conversation feels natural or stilted. A kid asking for help with spelling isn't just evaluating correctness, they're evaluating whether they want to keep talking to the thing. Multi-turn is where you find out whether a model can track context, refer back to earlier in the conversation, vary its phrasing. Gemma 4 leads both.

GLM-Z1 and Qwen3 both hit 46.8 on math — tied for the category lead — and Qwen3 dominates trivia at 47.8. If the routing question is specifically "which local model handles factual Q&A best," the answer changed today.
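The category leads quoted above can be read straight off the per-category table. Here is a small sketch that does it mechanically, ties included. The scores are copied from the table; the function and variable names are illustrative.

```python
CATEGORIES = ["Spell", "Math", "Trivia", "Greet", "Emotional",
              "Safety", "Voice", "Multi-turn", "Creative"]

# Per-category scores copied from the breakdown table, in CATEGORIES order.
SCORES = {
    "Gemma 4 31B":   [43.1, 43.1, 41.3, 44.5, 44.2, 39.8, 47.3, 48.5, 45.3],
    "GLM-Z1-32B":    [43.4, 46.8, 40.5, 42.7, 41.5, 39.3, 46.3, 45.5, 43.7],
    "Qwen3-32B":     [42.1, 46.8, 47.8, 41.3, 42.2, 37.2, 46.2, 39.2, 43.7],
    "Qwen3-30B-A3B": [38.0, 45.2, 36.3, 37.8, 36.7, 40.0, 37.3, 34.5, 38.2],
}


def category_leaders(scores, categories):
    """Map each category to the list of models sharing its top score."""
    leaders = {}
    for i, category in enumerate(categories):
        best = max(row[i] for row in scores.values())
        leaders[category] = [m for m, row in scores.items() if row[i] == best]
    return leaders


leaders = category_leaders(SCORES, CATEGORIES)
for category, models in leaders.items():
    print(f"{category}: {', '.join(models)}")
```

Returning a list per category instead of a single winner keeps the math tie visible rather than letting dictionary order quietly pick one model.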

MoE doesn't buy quality, just speed

Qwen3-30B-A3B is a mixture-of-experts model with 30 billion total parameters and 3 billion active per token. At inference, that means it runs at 141 tokens per second, about three to four times faster than the dense 32B models. The tradeoff I was hoping wouldn't exist turned out to exist: 38.6/50 versus 43+ for the three dense models. That's a gap of at least four and a half points, and it shows up in almost every category.
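The arithmetic behind those numbers, as a quick sketch. It assumes the 141 tokens-per-second figure is the MoE row's average token count divided by its average time from the results table; the post doesn't say how it was measured, but the numbers line up.

```python
# Figures copied from the results table.
DENSE_AVG_SCORE = {"Gemma 4 31B": 43.9, "GLM-Z1-32B": 43.3, "Qwen3-32B": 43.2}
MOE_AVG_SCORE = 38.6
MOE_AVG_TOKENS = 282
MOE_AVG_SECONDS = 2.0

# Assumed derivation of the throughput figure: tokens per response / seconds per response.
throughput = MOE_AVG_TOKENS / MOE_AVG_SECONDS

# Score gap between each dense model and the MoE model, rounded to one decimal.
gaps = {model: round(score - MOE_AVG_SCORE, 1)
        for model, score in DENSE_AVG_SCORE.items()}

print(f"MoE throughput: {throughput:.0f} tokens/s")
for model, gap in gaps.items():
    print(f"{model} leads the MoE model by {gap} points")
```

The smallest gap, against Qwen3-32B, is 4.6 points; against Gemma 4 it is 5.3.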

Speed matters for Blip — latency is real when a kid is waiting for an answer — but not if the answer quality drops this much. The MoE variant stays in the registry as a potential classifier model for routing decisions, not as a responder.

The GLM base was the wrong base for this fine-tune

I trained two blip-edu variants on GLM-Z1 base models, a 9B and a 32B, on the hypothesis that reasoning-pretrained models might handle arithmetic tutoring better than instruction-tuned Qwen2.5. GLM-Z1 is genuinely strong at math, and the existing blip-edu:v2 loses every math test. The reasoning went: put a better math brain in the fine-tune, get better math performance out.

What I got instead: both models maxed out the 4096-token generation limit on nearly every test. The 9B variant averaged 2,009 output tokens per response and took 14.9 seconds per call. The 32B averaged 2,722 tokens and 54 seconds. A greeting test — "Hi, I'm Blip! What should we do?" — came back as a 14,721-character stream-of-consciousness that included spelling drills the kid never asked for, mid-conversation corrections to answers that never happened, and four separate topic pivots.

The LoRA (r=16, alpha 32, 3 epochs) adds trainable adapter weights equal to about 0.3% of the base model's parameter count. That's enough to redirect style on a model that's already been trained to stop talking. GLM-Z1 was trained to keep talking: to show work, follow chains of reasoning, and not terminate early. The fine-tune couldn't override that. The two variants scored 26.1 and 26.0 out of 50, both worse than blip-edu v1 from three weeks ago. Hypothesis falsified.

If I wanted to fine-tune on a reasoning base for math, I'd need either a much higher-rank LoRA (r=64+, more trainable parameters), a full supervised fine-tune instead, or a base model that's been instruction-tuned for stopping behavior first and reasoning-capable second. For now, blip-edu:v2 stays as the math fallback. It doesn't win math tests. But it doesn't generate a novel when you say hello.
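For a rough sense of what a higher rank buys: a LoRA adapter on one weight matrix adds two low-rank factors, so its parameter count grows linearly with r. The sketch below shows that scaling. The 4096-wide layer is a placeholder, not GLM-Z1's actual dimensions, and the projection assumes the same set of adapted layers at both ranks.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The adapter is two factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def scaled_fraction(fraction: float, old_rank: int, new_rank: int) -> float:
    """Project the trainable fraction to a new rank, same adapted layers assumed."""
    return fraction * new_rank / old_rank


# Placeholder layer width for illustration only.
HIDDEN = 4096
per_matrix_r16 = lora_params(HIDDEN, HIDDEN, 16)
per_matrix_r64 = lora_params(HIDDEN, HIDDEN, 64)

# Start from the roughly 0.3% figure quoted above for r=16.
projected = scaled_fraction(0.003, old_rank=16, new_rank=64)

print(per_matrix_r16, per_matrix_r64)
print(f"r=64 would train roughly {projected:.1%} of the parameter count")
```

Under those assumptions, going from r=16 to r=64 quadruples the adapter size, from about 0.3% to about 1.2% of the base model's parameter count. Whether that is enough to override a base model's habit of not stopping is a separate, empirical question.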

Updated routing

Gemma 4 31B is now the primary local routing target for voice-quality-sensitive and multi-turn tasks. GLM-Z1-32B takes math (tied with Qwen3 on the math category but slightly higher on overall average, and the math-tutoring framing suits a verbose model better than a single-answer test). Qwen3-32B takes trivia and factual Q&A. blip-edu:v2 keeps the fine-tuned tasks it was already handling.
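Written down as a lookup table, that routing might look like the sketch below. The category keys, the function name, and the way unavailable models are passed in are all assumptions for illustration; the model names are the display names used in this post, not Ollama tags. blip-edu:v2 appears only as the math fallback because the post doesn't list which other tasks it handles.

```python
from typing import Optional

# Ordered candidates per task category: first entry is the primary target.
ROUTES: dict[str, list[str]] = {
    "voice": ["Gemma 4 31B"],
    "multi_turn": ["Gemma 4 31B"],
    "math": ["GLM-Z1-32B", "blip-edu:v2"],  # blip-edu:v2 stays as the math fallback
    "trivia": ["Qwen3-32B"],
    "factual_qa": ["Qwen3-32B"],
}


def route(category: str, unavailable: frozenset = frozenset()) -> Optional[str]:
    """Return the first candidate for the category that is not marked unavailable.

    Returns None for unknown categories or when every candidate is unavailable,
    leaving the caller to decide what happens next.
    """
    for model in ROUTES.get(category, []):
        if model not in unavailable:
            return model
    return None


print(route("multi_turn"))
print(route("math", unavailable=frozenset({"GLM-Z1-32B"})))
```

Returning None instead of a hard-coded default keeps the "what about everything else" decision out of the table, which seems right while the local-versus-cloud tradeoff is still open.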

The full leaderboard now has five local models in the 38–44 range and a clear drop below that. The gap between the top local tier and cloud is down to 0.5 points. That's close enough to make the routing math interesting — what's the right latency/cost/quality tradeoff when Gemma 4 running locally is essentially indistinguishable from Opus?

I don't have a firm answer yet. But it's a better problem to have than "all local models lose by ten points."