After the Fix: Qwen-Coder vLLM BF16 vs Ollama Q4_K_M, Side by Side
896 milliseconds. That was the smoke test after I patched runner.py — the first
clean response from the vLLM endpoint since I accidentally broke it by adding
min-P sampling to every provider without checking whether vLLM actually supported
it alongside speculative decoding. It doesn't. Every request came back
HTTP 400, all 28 tests errored out, and I shipped yesterday's post with
that whole column missing.
The fix was 15 lines. Gate min-P behind a per-profile send_min_p flag, default
it off for the openai_compat provider since min-P isn't even part of the OpenAI spec,
leave Ollama unchanged. I could rant about how "add a parameter to every provider"
is never as safe as it sounds, but the short version is: different upstreams have
different feature gates, and the right pattern is per-profile opt-in.
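Here's a minimal sketch of that opt-in pattern. It is not the actual runner.py patch: the ProviderProfile class, the build_sampling_params helper, the profile names, and the temperature and min-P values are all made up for illustration. Only the send_min_p flag and the openai_compat provider name come from the real fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderProfile:
    """Hypothetical per-profile settings (only send_min_p and openai_compat are from the post)."""
    name: str
    provider: str             # e.g. "openai_compat" or "ollama"
    send_min_p: bool = False  # opt-in: off unless the upstream is known to accept min-P


def build_sampling_params(profile: ProviderProfile, temperature: float, min_p: float) -> dict:
    """Return sampling params, adding min_p only for profiles that opted in."""
    params = {"temperature": temperature}
    if profile.send_min_p:
        params["min_p"] = min_p  # key name is an assumption about the upstream API
    return params


# Assumed profiles: vLLM behind the OpenAI-compatible provider, Ollama opted in.
vllm_profile = ProviderProfile(name="qwen-coder-vllm", provider="openai_compat")
ollama_profile = ProviderProfile(name="qwen-coder-ollama", provider="ollama", send_min_p=True)

print(build_sampling_params(vllm_profile, 0.7, 0.05))    # min_p omitted
print(build_sampling_params(ollama_profile, 0.7, 0.05))  # min_p included
```

The point of defaulting the flag to off is that a new profile can't inherit a parameter its upstream might reject. You have to turn it on deliberately.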
Same model, two stacks, one run
Rows 7 and 9 in the leaderboard below are the same model. Qwen2.5-Coder-32B-Instruct, same weights, same temperature. One copy served by vLLM at BF16 with speculative decoding and prefix caching. The other served by Ollama at Q4_K_M, the way every other local model in this benchmark has been running for the past week. Same 28 prompts. Same dual judge.
| Metric | vLLM BF16 | Ollama Q4_K_M |
|---|---|---|
| Wins (out of 28) | 2 | 1 |
| Average rank | 7.37 | 8.04 |
| Avg response time | 1.9s | 3.6s |
| Avg output tokens | 38 | 42 |
| VRAM | ~22 GB | ~18 GB |
vLLM wins on everything except VRAM footprint. The speed gap is the real headline: 1.9× faster (3.6s down to 1.9s), and that's not noise. It's the structural advantage of spec decode and prefix caching. The +1 win and +0.67 improvement in average rank could be variance. The latency cannot.
This is the second time I've seen this pattern. Yesterday's coding-suite A/B showed vLLM BF16 winning 4 of 5 tasks at 38.8/50 vs Ollama's 34.6/50 — a 12% quality lift on completely different prompts. Two independent confirmations now. BF16 with a real serving stack just produces better output than Q4 through Ollama, even on a kid-tutor task suite that shouldn't care about quantization artifacts.
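For anyone checking the arithmetic, the three headline deltas fall straight out of the numbers quoted above. The variable names are mine; the values are copied from the comparison table and the coding-suite scores.

```python
# Numbers copied from the comparison table and yesterday's coding-suite A/B.
vllm_time_s, ollama_time_s = 1.9, 3.6
vllm_rank, ollama_rank = 7.37, 8.04    # lower average rank is better
vllm_score, ollama_score = 38.8, 34.6  # out of 50

speedup = ollama_time_s / vllm_time_s
rank_gain = ollama_rank - vllm_rank
quality_lift = (vllm_score - ollama_score) / ollama_score

print(f"speedup: {speedup:.2f}x")
print(f"rank gain: {rank_gain:.2f}")
print(f"quality lift: {quality_lift:.1%}")
```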
What it actually won
blip-math-001. A math drill prompt where a kid says the wrong answer
and the model has to correct gently. vLLM beat all 16 other models on that one test.
Worth pausing on: math has been the weakest local category in every single
benchmark run this series. A local model winning a math test outright? First time.
blip-voice-003. Voice quality — short, TTS-friendly, no markdown.
vLLM averaged rank 2.67 across all three voice tests, which is the best of
any model in the entire benchmark for that category. And at 1.9 seconds
per response, the kid hears Blip answer almost immediately. How many
of the cloud models can say that?
The numbers I don't trust yet
Sonnet jumped from 3 wins to 6 in this run. That's a 3-win swing on the same 28 tests. Opus dropped from 4 to 2. The Claude family total barely moved, 7 to 8: mostly they just swapped which tests each one won. I'm calling that noise until a third run says otherwise.
The blip-edu family collapsed from 7 wins to 2. I had just written a whole post about blip-edu-coder being the new primary local model, and now it's at zero wins. Honestly? I shouldn't have committed to a routing change based on a single run. I knew the variance floor was ±2-3 wins. I did it anyway. Lesson re-learned.
One thing that has never moved across any run in this series: the opus-distilled Qwen variants sit at the bottom. Zero wins, worst average ranks, every time. That's the only finding I'd bet money on at this point.
17 models, full table
| # | Model | Wins | Avg rank | Avg time | Avg tok |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 6 | 5.59 | 2.9s | 45 |
| 2 | GLM-5 (cloud) | 3 | 4.48 | 8.1s | 130 |
| 3 | DeepSeek-V3-0324 | 3 | 6.56 | 4.4s | 37 |
| 4 | GLM-Z1-32B (local) | 3 | 8.26 | 7.8s | 240 |
| 5 | Claude Opus 4.6 | 2 | 6.26 | 2.9s | 49 |
| 6 | Qwen3-32B (local) | 2 | 6.37 | 4.1s | 41 |
| 7 | qwen-coder-32b-fp8 (vLLM BF16) | 2 | 7.37 | 1.9s | 38 |
| 8 | blip-edu (mine) | 2 | 10.41 | 1.2s | 45 |
| 9 | qwen-coder-32b (Ollama Q4) | 1 | 8.04 | 3.6s | 42 |
| 10 | DeepSeek-R1:32b | 1 | 8.96 | 3.3s | 36 |
| 11 | qwen3.5-abliterated | 1 | 9.22 | 7.9s | 58 |
| 12 | Hermes3:8b | 1 | 11.89 | 1.6s | 56 |
| 13 | MiniMax-M2.5 | 0 | 7.93 | 5.6s | 144 |
| 14 | blip-edu-coder | 0 | 10.19 | 1.2s | 44 |
| 15 | blip-edu:v2 | 0 | 10.41 | 0.3s | 45 |
| 16 | qwen3.5-35b opus-distilled | 0 | 14.52 | 13.0s | 468 |
| 17 | qwen3.5-27b opus-distilled | 0 | 16.56 | 13.4s | 495 |
So what actually changes
I'm adding the vLLM entry to Blip's router for voice-quality turns. Rank 2.67 average on voice, 1.9-second latency, runs on my own hardware. That's a better fit than anything else in the lineup for the "kid hears the response read aloud" use case.
The Ollama Q4 version becomes the fallback for when vLLM is down. Same weights, slower and slightly worse, but it doesn't need a second serving process to be healthy. I'm not touching the FP8/NVFP4 Docker upgrade yet — BF16 is already winning the A/B, and the throughput gain from real quantized serving isn't blocking anything I'm currently doing.
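The primary-plus-fallback arrangement can be sketched in a few lines. This is an illustration under assumptions, not Blip's actual router: the profile names and the pick_voice_backend function are hypothetical, and the health check is passed in as a stub where a real one would probe the serving endpoint.

```python
from typing import Callable

# Hypothetical profile names; the post only says vLLM is primary for voice
# turns and the Ollama Q4 copy is the fallback.
VOICE_PRIMARY = "qwen-coder-vllm-bf16"
VOICE_FALLBACK = "qwen-coder-ollama-q4"


def pick_voice_backend(is_healthy: Callable[[str], bool]) -> str:
    """Prefer the vLLM profile for voice turns; fall back to Ollama if vLLM is down."""
    if is_healthy(VOICE_PRIMARY):
        return VOICE_PRIMARY
    return VOICE_FALLBACK


# Usage with stubbed health checks.
print(pick_voice_backend(lambda name: True))                   # vLLM up: primary
print(pick_voice_backend(lambda name: name != VOICE_PRIMARY))  # vLLM down: fallback
```

Injecting the health check keeps the routing decision testable without either server running.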
The blip-edu routing decision from yesterday? Shelved. I'm running a third round before I commit to anything based on win counts that swing by ±3 between runs. The variance protocol from earlier in this series exists for exactly this kind of situation and I should have followed it the first time.
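One way to make that discipline mechanical is a small gate in front of any routing change. This is only a sketch of the idea, not the variance protocol itself: the ready_to_commit function, the three-run minimum, and the rule "lead must exceed the noise floor in every run" are assumptions, and the win counts in the usage lines are invented inputs, not benchmark results.

```python
def ready_to_commit(wins_candidate: list[int], wins_incumbent: list[int],
                    min_runs: int = 3, noise_floor: int = 3) -> bool:
    """True only if the candidate out-won the incumbent by more than the
    noise floor in every run, across at least min_runs runs."""
    if len(wins_candidate) != len(wins_incumbent):
        raise ValueError("need one win count per run for both models")
    if len(wins_candidate) < min_runs:
        return False  # a single run is never enough
    return all(c - i > noise_floor for c, i in zip(wins_candidate, wins_incumbent))


# Invented win counts, purely to exercise the gate.
print(ready_to_commit([7], [1]))              # one run only
print(ready_to_commit([7, 6, 8], [2, 2, 3]))  # consistent lead above the floor
print(ready_to_commit([7, 2, 5], [2, 2, 1]))  # lead vanishes in run two
```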
Total spend on this whole benchmarking series so far: about $24 over four days. Twelve runs, eight posts, three trained models, one validated A/B between serving stacks. Not bad for a few evenings of GPU time and some cloud judging credits.