Two New Local 32Bs Walked Into the Benchmark — Qwen3-32B and GLM-Z1

I changed three things at once and got a result I can't cleanly attribute to any of them. Bad experimental design. Good data anyway.

The three things: I pulled Qwen3-32B and GLM-Z1-32B-0414 onto the inference box (20 GB and 19 GB Q4_K_M GGUFs, both dense 32B models, both running locally on the RTX PRO 6000 for the first time), AND I turned on three calibration tweaks that had been sitting in runner.py for a week: min-P sampling at 0.05, per-category temperature overrides in the test YAML, and a prompt-reordering trick where I moved the format constraint to line 2 of each system prompt to take advantage of LLM primacy bias. Should have isolated the variables. Didn't.

Both new 32Bs scored 3 wins on their debut

That ties them with Claude Sonnet and GLM-5. One win behind Opus. For models running on a single GPU at 4-bit quantization, that's better than I expected. I had braced for something closer to four hours of benchmarking disappointment, honestly, and it never came.

They're stylistically different in a way that matters. Qwen3-32B averages 41 output tokens per response. Tight. Punchy. Exactly what a TTS pipeline wants when a kid asks "what's 7 plus 5." GLM-Z1-32B averages 232 tokens, more than five times the output, and you can feel the reasoning-model lineage in every response. Z.ai built it as a "thinking" variant and even in non-thinking mode it still writes like it's showing its work.

Qwen3 won a math test, a safety test, and a voice-quality test. GLM-Z1 won a creative test, a greeting, and also a voice-quality test. Is it weird that the most verbose model in the benchmark won a voice-quality test? A little. The judges apparently valued the content richness enough to overlook the length.

#	Model	Wins	Avg rank	Avg time	Tokens
1	Claude Opus 4.6	4	5.36	3.1s	49
2	Claude Sonnet 4.6	3	4.75	3.4s	45
3	GLM-5 (cloud)	3	5.64	6.0s	132
4	Qwen3-32B (local, new)	3	6.71	4.2s	41
5	GLM-Z1-32B (local, new)	3	6.82	8.0s	232
6	blip-edu-coder (mine)	3	8.86	1.3s	44
7	MiniMax-M2.5	2	5.79	6.3s	142
8	blip-edu:v2 (mine)	2	9.39	0.3s	47
9	blip-edu v1 tag (mine)	2	9.79	1.3s	45
10	DeepSeek-V3-0324	1	5.82	3.1s	37
11	qwen2.5-coder:32b	1	7.50	3.9s	40
12	qwen3.5-abliterated	1	9.71	7.7s	51
13	qwen-coder-32b-fp8 (vLLM)	0	—	—	all errored
14	DeepSeek-R1:32b	0	9.96	3.7s	38
15	Hermes3:8b	0	11.18	1.7s	54
16	qwen3.5-35b opus-distilled	0	13.29	12.5s	441
17	qwen3.5-27b opus-distilled	0	15.43	13.6s	500

The calibration effect was bigger than the new models

This is the part I didn't see coming. My three blip-edu fine-tunes — same weights as yesterday, not retrained, nothing changed about the models themselves — went from 3 total wins to 7. More than doubled. On the same benchmark suite.

What changed was how I called them. Min-P at 0.05 filters garbage tokens adaptively: it drops any token whose probability is below 5% of the top token's probability, so the cutoff tightens when the model is confident and loosens when it isn't. Per-category temperature (0.65 for creative, 0.35 for math and spelling) matches the sampling to the task instead of using one temperature for everything. And moving the format constraint to line 2 of the system prompt exploits primacy bias: models pay more attention to instructions that appear early.
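For anyone who wants to see the mechanics, here is a minimal sketch of min-P filtering combined with per-category temperature. It is not the actual runner.py code: the function names, the `DEFAULT_TEMPERATURE` fallback, and the choice to apply temperature before min-P are all assumptions for illustration (inference engines differ on sampler order).

```python
import math

# Per-category temperatures quoted in the post.
CATEGORY_TEMPERATURE = {"creative": 0.65, "math": 0.35, "spelling": 0.35}
DEFAULT_TEMPERATURE = 0.5  # hypothetical fallback, not from the post


def softmax(logits, temperature):
    """Convert raw logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.05):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def filtered_distribution(logits, category, min_p=0.05):
    """Apply the category's temperature, then the min-P cutoff."""
    temperature = CATEGORY_TEMPERATURE.get(category, DEFAULT_TEMPERATURE)
    return min_p_filter(softmax(logits, temperature), min_p)


if __name__ == "__main__":
    # Two plausible tokens and two junk tokens; the junk should be removed.
    print(filtered_distribution([5.0, 4.5, 0.0, -2.0], "math"))
```

The point of the relative threshold is that the top token always survives, and low-probability junk is cut no matter how many junk tokens there are.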

None of that costs anything. Zero retraining, zero API spend, zero additional compute. Just thinking harder about how to call a model you already have. I spent two days and $15 training blip-edu variants, and then a few config changes that took twenty minutes produced a bigger improvement than the training did. Annoying? A little. But the lesson is real: small local models are more sensitive to inference settings than frontier models, because they're living closer to the edge of "gets it right" versus "drifts off." Good calibration keeps them on the right side of that edge more often.

Should I have done this first? Obviously. Did I? No. I went straight to training because training feels like progress and config changes feel like fiddling. That's backwards, and I know it now.

blip-edu-coder pulled ahead

Of the three blip-edu fine-tunes, the Coder-base variant (trained yesterday) scored 3 wins — emotional, spelling, voice quality. The Instruct-base v2 got 2. The v1 tag got 2. Within the noise floor for sure, but consistent with yesterday's finding where the Coder variant also had the best average rank of the three.

Still can't do math. Didn't win any of the four math tests. But the math weakness is less important now because the other categories all got better with the calibration changes. I'm promoting blip-edu-coder to primary local model for Blip, with v2 as the math-drill fallback. v1 gets retired.

The vLLM row is blank

All 28 requests to the vLLM endpoint came back HTTP 400. The min-P parameter I added to every provider turned out to be incompatible with vLLM 0.19.0's speculative decoding — a limitation I discovered the hard way, 28 times in a row. The other 16 models ran fine, the judge skipped the errored entries, and I fixed it the same night with a per-profile opt-in flag. That fix and its benchmark results are in the next post.
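A per-profile opt-in can be as small as the sketch below. This is a guess at the shape, not the actual fix: the profile keys, the `supports_min_p` flag name, and `build_sampling_params` are all hypothetical, and the real request payload will depend on each provider's API.

```python
# Hypothetical provider profiles; the keys and flag name are illustrative only.
PROFILES = {
    "qwen3-32b-local": {"supports_min_p": True},
    "qwen-coder-32b-fp8-vllm": {"supports_min_p": False},
}


def build_sampling_params(profile, temperature, min_p=0.05):
    """Build sampling parameters, adding min_p only for profiles that opt in."""
    params = {"temperature": temperature}
    # Default to False so a new provider never receives a parameter
    # it has not been checked against.
    if profile.get("supports_min_p", False):
        params["min_p"] = min_p
    return params


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(name, build_sampling_params(profile, temperature=0.35))
```

Making the flag opt-in rather than opt-out means an unverified provider gets the conservative payload by default.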

Routing updates

Blip's local menu grew this week. Spelling and emotional support go to blip-edu-coder (fast, private, won both categories). Voice quality goes to Qwen3-32B (concise at 41 tokens, won a voice test). Creative stories go to GLM-Z1-32B (verbose enough for long-form, won a creative test). Math drill stays with blip-edu:v2. Everything else still escalates to cloud.
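Written down as a lookup table, the routing above looks roughly like this. The category keys, the `CLOUD_FALLBACK` label, and the `route` function are assumptions for illustration; only the category-to-model assignments come from the paragraph above.

```python
# Category-to-model assignments from this post; key names are hypothetical.
ROUTES = {
    "spelling": "blip-edu-coder",
    "emotional": "blip-edu-coder",
    "voice_quality": "Qwen3-32B",
    "creative": "GLM-Z1-32B",
    "math": "blip-edu:v2",
}
CLOUD_FALLBACK = "cloud"  # placeholder label for "escalate to cloud"


def route(category):
    """Return the local model for a category, or escalate to cloud."""
    return ROUTES.get(category, CLOUD_FALLBACK)


if __name__ == "__main__":
    for category in ("spelling", "creative", "math", "science"):
        print(category, "->", route(category))
```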

Fifty minutes of wall clock, four dollars in cloud judging. I went in hoping the two new 32Bs would clear the noise floor. They did that and then some — tied with Claude Sonnet on win count. Not bad for models running on a single card at 4-bit quant.