After the Fix: Qwen-Coder vLLM BF16 vs Ollama Q4_K_M, Side by Side

896 milliseconds. That was the smoke test after I patched runner.py — the first clean response from the vLLM endpoint since I accidentally broke it by adding min-P sampling to every provider without checking whether vLLM actually supported it alongside speculative decoding. It doesn't. Every request came back HTTP 400, all 28 tests errored out, and I shipped yesterday's post with that whole column missing.

The fix was 15 lines. Gate min-P behind a per-profile send_min_p flag, default it off for the openai_compat provider since min-P isn't even part of the OpenAI spec, leave Ollama unchanged. I could rant about how "add a parameter to every provider" is never as safe as it sounds, but the short version is: different upstreams have different feature gates, and the right pattern is per-profile opt-in.
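Here's a minimal sketch of that per-profile opt-in. It is not the actual runner.py diff: only send_min_p and the openai_compat provider name come from the real change. The Profile class, the build_sampling_params helper, the profile labels, and the 0.7 / 0.05 values are placeholders I made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    name: str                 # hypothetical profile label
    provider: str             # "openai_compat" or "ollama"
    send_min_p: bool = False  # opt-in: off unless a profile asks for it


def build_sampling_params(profile: Profile, temperature: float,
                          min_p: Optional[float]) -> dict:
    """Build the sampling parameters to send for one profile."""
    params = {"temperature": temperature}
    # Only forward min_p when this profile has explicitly opted in.
    if profile.send_min_p and min_p is not None:
        params["min_p"] = min_p
    return params


# Placeholder profiles: the vLLM one keeps the default (off),
# the Ollama one opts in so its behavior stays unchanged.
vllm_profile = Profile("qwen-coder-vllm", "openai_compat")
ollama_profile = Profile("qwen-coder-ollama", "ollama", send_min_p=True)

print(build_sampling_params(vllm_profile, 0.7, 0.05))
print(build_sampling_params(ollama_profile, 0.7, 0.05))
```

The point of the design is that a new sampling parameter starts out disabled everywhere and gets switched on one profile at a time, so an upstream that rejects it never sees it.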

Same model, two stacks, one run

Rows 7 and 9 in the leaderboard below are the same model. Qwen2.5-Coder-32B-Instruct, same weights, same temperature. One copy served by vLLM at BF16 with speculative decoding and prefix caching. The other served by Ollama at Q4_K_M, the way every other local model in this benchmark has been running for the past week. Same 28 prompts. Same dual judge.

Metric	vLLM BF16	Ollama Q4_K_M
Wins (out of 28)	2	1
Average rank	7.37	8.04
Avg response time	1.9s	3.6s
Avg output tokens	38	42
VRAM	~22 GB	~18 GB

vLLM wins on everything except VRAM footprint. The speed gap is the real headline: 1.9× faster (3.6s down to 1.9s), and that's not noise. It comes straight from speculative decoding and prefix caching. The +1 win and the 0.67 average-rank improvement could be variance. The latency cannot.

This is the second time I've seen this pattern. Yesterday's coding-suite A/B showed vLLM BF16 winning 4 of 5 tasks at 38.8/50 vs Ollama's 34.6/50 — a 12% quality lift on completely different prompts. Two independent confirmations now. BF16 with a real serving stack just produces better output than Q4 through Ollama, even on a kid-tutor task suite that shouldn't care about quantization artifacts.
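For anyone who wants to check the arithmetic behind those three figures, here is a quick sketch. The inputs are copied from the table above and from yesterday's coding-suite scores; the variable names are just mine.

```python
# Figures from the side-by-side table and yesterday's coding-suite A/B.
vllm_latency_s, ollama_latency_s = 1.9, 3.6
vllm_rank, ollama_rank = 7.37, 8.04
vllm_score, ollama_score = 38.8, 34.6  # both out of 50

speedup = ollama_latency_s / vllm_latency_s   # how many times faster vLLM is
rank_gain = ollama_rank - vllm_rank           # lower rank is better
quality_lift = vllm_score / ollama_score - 1  # relative score improvement

print(f"speedup: {speedup:.1f}x")             # speedup: 1.9x
print(f"rank gain: {rank_gain:.2f}")          # rank gain: 0.67
print(f"quality lift: {quality_lift:.0%}")    # quality lift: 12%
```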

What it actually won

blip-math-001. A math drill prompt where a kid says the wrong answer and the model has to correct gently. vLLM beat all 16 other models on that one test. Worth pausing on: math has been the weakest local category in every single benchmark run this series. A local model winning a math test outright? First time.

blip-voice-003. Voice quality — short, TTS-friendly, no markdown. vLLM averaged rank 2.67 across all three voice tests, which is the best of any model in the entire benchmark for that category. And at 1.9 seconds per response, the kid hears Blip answer almost immediately. How many of the cloud models can say that?

The numbers I don't trust yet

Sonnet jumped from 3 wins to 6 in this run. That's a 3-win swing on the same 28 tests. Opus dropped from 4 to 2. The Claude family total barely moved, from 7 to 8; the two models mostly swapped which tests each one won. I'm calling that noise until a third run says otherwise.

The blip-edu family collapsed from 7 wins to 2. I had just written a whole post about blip-edu-coder being the new primary local model, and now it's at zero wins. Honestly? I shouldn't have committed to a routing change based on a single run. I knew the variance floor was ±2-3 wins. I did it anyway. Lesson re-learned.

One thing that has never moved across any run in this series: the opus-distilled Qwen variants sit at the bottom. Zero wins, worst average ranks, every time. That's the only finding I'd bet money on at this point.

17 models, full table

#	Model	Wins	Avg rank	Avg time	Avg tok
1	Claude Sonnet 4.6	6	5.59	2.9s	45
2	GLM-5 (cloud)	3	4.48	8.1s	130
3	DeepSeek-V3-0324	3	6.56	4.4s	37
4	GLM-Z1-32B (local)	3	8.26	7.8s	240
5	Claude Opus 4.6	2	6.26	2.9s	49
6	Qwen3-32B (local)	2	6.37	4.1s	41
7	qwen-coder-32b-fp8 (vLLM BF16)	2	7.37	1.9s	38
8	blip-edu (mine)	2	10.41	1.2s	45
9	qwen-coder-32b (Ollama Q4)	1	8.04	3.6s	42
10	DeepSeek-R1:32b	1	8.96	3.3s	36
11	qwen3.5-abliterated	1	9.22	7.9s	58
12	Hermes3:8b	1	11.89	1.6s	56
13	MiniMax-M2.5	0	7.93	5.6s	144
14	blip-edu-coder	0	10.19	1.2s	44
15	blip-edu:v2	0	10.41	0.3s	45
16	qwen3.5-35b opus-distilled	0	14.52	13.0s	468
17	qwen3.5-27b opus-distilled	0	16.56	13.4s	495

So what actually changes

I'm adding the vLLM entry to Blip's router for voice-quality turns. Rank 2.67 average on voice, 1.9-second latency, runs on my own hardware. That's a better fit than anything else in the lineup for the "kid hears the response read aloud" use case.

The Ollama Q4 version becomes the fallback for when vLLM is down. Same weights, slower and slightly worse, but it doesn't need a second serving process to be healthy. I'm not touching the FP8/NVFP4 Docker upgrade yet — BF16 is already winning the A/B, and the throughput gain from real quantized serving isn't blocking anything I'm currently doing.
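Roughly, the routing rule looks like the sketch below. It is not Blip's actual router. I'm assuming the table labels double as profile keys, that turns carry a "voice" kind, and that a health check can be passed in as a callable. All three are illustrative assumptions.

```python
from typing import Callable

# Assumed profile keys, borrowed from the leaderboard labels.
VLLM_PROFILE = "qwen-coder-32b-fp8"   # vLLM, BF16
OLLAMA_PROFILE = "qwen-coder-32b"     # Ollama, Q4_K_M


def pick_backend(turn_kind: str,
                 is_healthy: Callable[[str], bool],
                 default_backend: str) -> str:
    """Send voice turns to vLLM, falling back to Ollama if vLLM is down."""
    if turn_kind != "voice":
        # Everything else keeps whatever the router already uses.
        return default_backend
    if is_healthy(VLLM_PROFILE):
        return VLLM_PROFILE
    return OLLAMA_PROFILE


# Example: vLLM healthy vs. vLLM down (health checks stubbed out).
print(pick_backend("voice", lambda name: True, "some-default"))
print(pick_backend("voice", lambda name: False, "some-default"))
```

Passing the health check in as a function keeps the rule itself free of network code, which makes it easy to test without either server running.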

The blip-edu routing decision from yesterday? Shelved. I'm running a third round before I commit to anything based on win counts that swing by ±3 between runs. The variance protocol from earlier in this series exists for exactly this kind of situation and I should have followed it the first time.
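One way to turn "wait for a third run" into a check is sketched below. To be clear, this is not the variance protocol from earlier in the series, just a simple stand-in: require at least three runs, and require that a model's win count not swing by more than the noise floor across them. The function name and both thresholds are my assumptions.

```python
from typing import Sequence


def ready_to_commit(wins_per_run: Sequence[int],
                    min_runs: int = 3,
                    noise_floor: int = 3) -> bool:
    """Return True only if there are enough runs and the win count is stable."""
    if len(wins_per_run) < min_runs:
        return False  # not enough evidence yet
    spread = max(wins_per_run) - min(wins_per_run)
    return spread <= noise_floor


# The blip-edu family so far: 7 wins, then 2. Two runs is not enough.
print(ready_to_commit([7, 2]))  # False
```

Under this rule, yesterday's routing change would never have been made after a single run.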

Total spend on this whole benchmarking series so far: about $24 over four days. Twelve runs, eight posts, three trained models, one validated A/B between serving stacks. Not bad for a few evenings of GPU time and some cloud judging credits.