Gemma 4 Crashed the Benchmark and a Bug I Didn't Know I Had
I ran a big batch last night: 26 models, 28 tests, everything queued up to run overnight. Half the results were garbage. Eight models came back with empty content, and another showed up with a 1,039-character average response when every other model was averaging 40–200 characters. I thought it was an Ollama version issue. It wasn't.
GLM-Z1-32B was outputting its full chain of thought into the content field. Not Ollama's separate thinking key, but the actual message.content that gets passed to the judge. The response started with <think>, ran for 800+ tokens of internal reasoning, hit </think>, and then gave a real answer. My runner extracted content from message.content, grabbed the whole thing, and the judge evaluated a thinking trace instead of a kid-tutor response. GLM-Z1 was scoring fine on some tests (reasoning models do fine when the judge reads their reasoning), but the scores weren't measuring what I thought they were measuring.
The fix was one regex: strip <think>...</think> from content before scoring. The re-run happened this morning. The real GLM-Z1 score is 43.3/50.
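A minimal sketch of that kind of fix, assuming the thinking trace is always wrapped in a literal `<think>...</think>` pair inside the content string. The function name `strip_thinking` and the sample response are illustrative, not taken from the actual runner.

```python
import re

# Non-greedy match with DOTALL so the pattern spans the newlines inside a
# thinking block but stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(content: str) -> str:
    """Remove every complete <think>...</think> block from a response."""
    return THINK_BLOCK.sub("", content).strip()


# Hypothetical response, shaped like the GLM-Z1 output described above.
raw = "<think>\nThe kid asked 2 + 2. Keep it short.\n</think>\n2 + 2 is 4. Nice work!"
print(strip_thinking(raw))  # prints: 2 + 2 is 4. Nice work!
```

One caveat with this pattern: a response that gets cut off before `</think>` has no complete block to match, so it passes through unchanged and would still need separate handling.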
The actual results
Four new local models, properly scored for the first time, alongside my two GLM-based blip-edu fine-tunes:
| # | Model | Wins | Avg score | Avg time | Tokens |
|---|---|---|---|---|---|
| 1 | Gemma 4 31B (local, new) | 13 | 43.9/50 | 3.9s | 37 |
| 2 | GLM-Z1-32B (local, fixed) | 7 | 43.3/50 | 3.8s | 204* |
| 3 | Qwen3-32B (local) | 5 | 43.2/50 | 3.0s | 38 |
| 4 | Qwen3-30B-A3B MoE (local) | 3 | 38.6/50 | 2.0s | 282 |
| 5 | blip-edu-glm-32b (mine) | 0 | 26.1/50 | 54.3s | 2722 |
| 6 | blip-edu-glm-9b (mine) | 0 | 26.0/50 | 14.9s | 2009 |
* GLM-Z1 token count includes thinking chain; content delivered to judge has thinking stripped.
To put those scores in context against the full leaderboard: Gemma 4 31B at 43.9 sits between Claude Sonnet 4.6 (44.4) and Claude Opus 4.6 (43.7). On a benchmark of 28 kid-tutor tasks — spelling, math drill, trivia, emotional support, greetings, safety redirection, voice quality, multi-turn conversation, creative storytelling — a Google model I pulled from Ollama is now the best local model I've tested. By a real margin. 13 wins out of 28 tests.
Gemma 4 leads the two categories that matter most for Blip
The per-category breakdown:
| Model | Spell | Math | Trivia | Greet | Emotional | Safety | Voice | Multi-turn | Creative |
|---|---|---|---|---|---|---|---|---|---|
| Gemma 4 31B | 43.1 | 43.1 | 41.3 | 44.5 | 44.2 | 39.8 | 47.3 | 48.5 | 45.3 |
| GLM-Z1-32B | 43.4 | 46.8 | 40.5 | 42.7 | 41.5 | 39.3 | 46.3 | 45.5 | 43.7 |
| Qwen3-32B | 42.1 | 46.8 | 47.8 | 41.3 | 42.2 | 37.2 | 46.2 | 39.2 | 43.7 |
| Qwen3-30B-A3B | 38.0 | 45.2 | 36.3 | 37.8 | 36.7 | 40.0 | 37.3 | 34.5 | 38.2 |
Voice quality: Gemma 4 at 47.3. Multi-turn: Gemma 4 at 48.5. Those are Blip's two most important categories — the ones that determine whether a conversation feels natural or stilted. A kid asking for help with spelling isn't just evaluating correctness, they're evaluating whether they want to keep talking to the thing. Multi-turn is where you find out whether a model can track context, refer back to earlier in the conversation, vary its phrasing. Gemma 4 leads both.
GLM-Z1 and Qwen3 both hit 46.8 on math — tied for the category lead — and Qwen3 dominates trivia at 47.8. If the routing question is specifically "which local model handles factual Q&A best," the answer changed today.
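The category leaders can be read straight off the per-category table. The sketch below does that mechanically, using the scores exactly as listed above; the helper name `category_leaders` is just for illustration.

```python
CATEGORIES = ["Spell", "Math", "Trivia", "Greet", "Emotional",
              "Safety", "Voice", "Multi-turn", "Creative"]

# Scores copied from the per-category table, one row per model.
SCORES = {
    "Gemma 4 31B":   [43.1, 43.1, 41.3, 44.5, 44.2, 39.8, 47.3, 48.5, 45.3],
    "GLM-Z1-32B":    [43.4, 46.8, 40.5, 42.7, 41.5, 39.3, 46.3, 45.5, 43.7],
    "Qwen3-32B":     [42.1, 46.8, 47.8, 41.3, 42.2, 37.2, 46.2, 39.2, 43.7],
    "Qwen3-30B-A3B": [38.0, 45.2, 36.3, 37.8, 36.7, 40.0, 37.3, 34.5, 38.2],
}


def category_leaders(scores, categories):
    """Map each category to the list of models sharing the top score."""
    leaders = {}
    for i, category in enumerate(categories):
        best = max(row[i] for row in scores.values())
        leaders[category] = [m for m, row in scores.items() if row[i] == best]
    return leaders


for category, models in category_leaders(SCORES, CATEGORIES).items():
    print(f"{category}: {', '.join(models)}")
```

Returning a list per category keeps ties visible, which matters for math, where two models share the top score.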
MoE doesn't buy quality, just speed
Qwen3-30B-A3B is a mixture-of-experts model with 30 billion total parameters and 3 billion active per token. At inference time it runs at 141 tokens per second, about three to four times faster than the dense 32B models. The tradeoff I was hoping wouldn't exist turned out to exist: 38.6/50 versus 43.2 to 43.9 for the three dense models. That is a gap of 4.6 to 5.3 points. It shows up in most categories, with safety (where the MoE model narrowly leads at 40.0) and math (45.2, ahead of Gemma 4's 43.1) as the exceptions.
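For the record, here is the arithmetic behind that quality gap, using the average scores from the results table. The helper name `score_gaps` is illustrative.

```python
MOE_SCORE = 38.6  # Qwen3-30B-A3B average, from the results table

# Average scores of the three dense models, from the same table.
DENSE_SCORES = {"Gemma 4 31B": 43.9, "GLM-Z1-32B": 43.3, "Qwen3-32B": 43.2}


def score_gaps(moe_score, dense_scores):
    """Points by which each dense model beats the MoE model, to one decimal."""
    return {model: round(score - moe_score, 1)
            for model, score in dense_scores.items()}


print(score_gaps(MOE_SCORE, DENSE_SCORES))
# {'Gemma 4 31B': 5.3, 'GLM-Z1-32B': 4.7, 'Qwen3-32B': 4.6}
```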
Speed matters for Blip — latency is real when a kid is waiting for an answer — but not if the answer quality drops this much. The MoE variant stays in the registry as a potential classifier model for routing decisions, not as a responder.
The GLM base was the wrong base for this fine-tune
I trained two blip-edu variants on GLM-Z1 base models, a 9B and a 32B, on the hypothesis that reasoning-pretrained models might handle arithmetic tutoring better than instruction-tuned Qwen2.5. GLM-Z1 is genuinely strong at math, and blip-edu:v2 loses every math test. The reasoning went: put a better math brain in the fine-tune, get better math performance out.
What I got instead: both models maxed out the 4096-token generation limit on nearly every test. The 9B variant averaged 2,009 output tokens per response and took 14.9 seconds per call. The 32B averaged 2,722 tokens and 54 seconds. A greeting test — "Hi, I'm Blip! What should we do?" — came back as a 14,721-character stream-of-consciousness that included spelling drills the kid never asked for, mid-conversation corrections to answers that never happened, and four separate topic pivots.
The LoRA (r=16, alpha 32, 3 epochs) trains adapter weights amounting to about 0.3% of the base model's parameter count. That's enough to redirect style on a model that's already been trained to stop talking. GLM-Z1 was trained to keep talking: to show work, follow chains of reasoning, and not terminate early. The fine-tune couldn't override that. The two models scored 26.1 and 26.0 out of 50, both worse than blip-edu v1 from three weeks ago. Hypothesis falsified.
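To give a feel for why a rank-16 adapter is such a small slice of the model, here is a rough sketch of the parameter count for a single adapted weight matrix. The 4096 x 4096 shape is a hypothetical example, not GLM-Z1's actual layer size, and the whole-model figure also depends on which matrices are adapted, so this is not expected to reproduce the 0.3% number.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of one d_in x d_out matrix.

    LoRA freezes the original matrix and learns two low-rank factors,
    A (rank x d_in) and B (d_out x rank), so it trains
    rank * (d_in + d_out) parameters for that matrix.
    """
    full_params = d_in * d_out
    adapter_params = rank * (d_in + d_out)
    return adapter_params / full_params


# Hypothetical 4096 x 4096 projection matrix (an assumption for illustration).
for rank in (16, 64):
    fraction = lora_param_fraction(4096, 4096, rank)
    print(f"r={rank}: {fraction:.2%} of the matrix")
```

Alpha does not appear here because it only scales the adapter's contribution (by alpha / r, so 2 for r=16 and alpha 32); it does not change how many parameters are trained. Quadrupling the rank quadruples the adapter size for each adapted matrix.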
If I wanted to fine-tune on a reasoning base for math, I'd need either a much higher-rank LoRA (r=64+, more weight coverage), a full supervised fine-tune instead, or a base model that's been instruction-tuned for stopping behavior first and reasoning-capable second. For now, blip-edu:v2 stays as the math fallback. It doesn't win math tests. But it doesn't generate a novel when you say hello.
Updated routing
Gemma 4 31B is now the primary local routing target for voice-quality-sensitive and multi-turn tasks. GLM-Z1-32B takes math (tied with Qwen3 on the math category but with a slightly higher overall average, and the math-tutoring framing suits a verbose model better than a single-answer test). Qwen3-32B takes trivia and factual Q&A. blip-edu:v2 keeps the fine-tuned tasks it was already handling.
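As a rough sketch, the routing above could be expressed as an ordered preference table. The category keys, the `route` function, and the choice of Gemma 4 as the default for unlisted categories are all assumptions for illustration; the tasks blip-edu:v2 already handles are not listed, so it only appears here in its math-fallback role.

```python
# Ordered preferences per task category: first available model wins.
ROUTES = {
    "voice": ["Gemma 4 31B"],
    "multi_turn": ["Gemma 4 31B"],
    "math": ["GLM-Z1-32B", "blip-edu:v2"],  # blip-edu:v2 as the math fallback
    "trivia": ["Qwen3-32B"],
    "factual_qa": ["Qwen3-32B"],
}

# Assumption: unlisted categories go to the top-scoring local model.
DEFAULT_ROUTE = ["Gemma 4 31B"]


def route(category: str, unavailable: frozenset = frozenset()) -> str:
    """Pick the first preferred model for a category that is not unavailable."""
    for model in ROUTES.get(category, DEFAULT_ROUTE):
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for category {category!r}")


print(route("math"))                                          # GLM-Z1-32B
print(route("math", unavailable=frozenset({"GLM-Z1-32B"})))   # blip-edu:v2
```

Keeping the preferences as ordered lists makes it easy to add a cloud model at the end of each list later, if the latency/cost/quality tradeoff ends up favoring one.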
The full leaderboard now has five local models in the 38–44 range and a clear drop below that. The gap between the top local tier and cloud is down to 0.5 points. That's close enough to make the routing math interesting — what's the right latency/cost/quality tradeoff when Gemma 4 running locally is essentially indistinguishable from Opus?
I don't have a firm answer yet. But it's a better problem to have than "all local models lose by ten points."