I Ran the Same Benchmark Three Times and Got Three Different Answers

qwen2.5-coder:32b dropped from 5 wins to 2 wins between my first and second benchmark runs. That bugged me. Was it the new "research-backed" settings (temperature 0.2, top_p 0.1, repetition penalty 1.1) fighting against conversational prompts? Or was the original 5 just a lucky roll that the recalibration happened to expose?

I ran the whole thing a third time to find out. Changed exactly one variable: qwen-coder-32b switched to conversational defaults — temperature 0.5, no top_p clamp, no repetition penalty. Same eleven models, same prompts, same dual judge, same hardware. If the calibration theory was right, qwen-coder would jump back to 5. If the noise theory was right, it would stay at 2.

It scored 3.

Neither hypothesis won, and in the process I learned something I should have already known about measurement at this scale.

Three runs, side by side

Same 11 models. Same 28 prompts. Same hardware.

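| Model | Run 1 (2026-04-07T22) | Run 2 (2026-04-08T16) | Run 3 (2026-04-08T19) | Spread |
|---|---|---|---|---|
| claude-opus | 8 | 6 | 6 | ±2 |
| claude-sonnet | 4 | 4 | 6 | ±2 |
| deepseek-v3-0324 | 4 | 5 | 6 | ±2 |
| glm-5 | 4 | 4 | 4 | 0 |
| minimax-m2.5 | 2 | 4 | 1 | ±3 |
| qwen2.5-coder:32b | 5 | 2 | 3 | ±3 |
| deepseek-r1:32b | 0 | 2 | 2 | ±2 |
| qwen3.5-abliterated:35b-a3b | 1 | 1 | 0 | ±1 |
| hermes3:8b | 0 | 0 | 0 | 0 |
| qwen3.5-35b-a3b-opus-distilled | 0 | 0 | 0 | 0 |
| qwen3.5-27b-opus-distilled | 0 | 0 | 0 | 0 |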

Each run distributes exactly 28 wins across 11 models. Spread is the highest win count minus the lowest across the three runs. Run 1 had no per-model profiles and used Opus as sole judge; Run 2 added research-backed profiles and a DeepSeek-V3 co-judge; Run 3 changed only qwen-coder-32b's settings to conversational defaults.
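Here is a minimal sketch for sanity-checking the table: it confirms each run hands out 28 wins and recomputes the spread column as max minus min. The win counts are copied from the table; the names `WINS`, `run_totals`, and `spreads` are mine, purely for illustration.

```python
# Win counts per model for Runs 1-3, copied from the table above.
WINS = {
    "claude-opus": (8, 6, 6),
    "claude-sonnet": (4, 4, 6),
    "deepseek-v3-0324": (4, 5, 6),
    "glm-5": (4, 4, 4),
    "minimax-m2.5": (2, 4, 1),
    "qwen2.5-coder:32b": (5, 2, 3),
    "deepseek-r1:32b": (0, 2, 2),
    "qwen3.5-abliterated:35b-a3b": (1, 1, 0),
    "hermes3:8b": (0, 0, 0),
    "qwen3.5-35b-a3b-opus-distilled": (0, 0, 0),
    "qwen3.5-27b-opus-distilled": (0, 0, 0),
}


def run_totals(wins):
    """Sum the wins in each run across all models."""
    return [sum(column) for column in zip(*wins.values())]


def spreads(wins):
    """Spread per model: highest win count minus lowest across the runs."""
    return {model: max(counts) - min(counts) for model, counts in wins.items()}


if __name__ == "__main__":
    print("wins per run:", run_totals(WINS))
    for model, spread in spreads(WINS).items():
        print(f"{model:32s} spread={spread}")
```

If a run total ever comes out at something other than 28, the scoring or tie-breaking step dropped or double-counted a test, which is worth catching before reading anything into the win counts.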

The spread column is the whole story

Forget the individual win counts for a second and look at the spreads. Active models, the ones doing real sampling, swing by as much as 3 wins between runs. minimax-m2.5 went 2, then 4, then 1. A three-win swing on a 28-test budget. qwen-coder went 5, 2, 3, which is the question that sent me back to the terminal for a third run, and the answer is just "all three are within the noise."

One model has zero variance: GLM-5. Four wins, four wins, four wins. Its outputs are distinctive enough that judges consistently rank them the same way. Three other models also have zero spread, but they're stuck at 0 wins — hermes3 and both opus-distilled Qwens. Is zero-win consistency really a signal? I'd call it a floor, not stability.

The Claude family and DeepSeek-V3-0324 cluster in the 4-8 win range across all three runs. Never leave it. Their exact win counts shift by ±2, but they're always at the top. That's the useful reading: tier-level rankings are stable; individual win counts within a tier are not.

The qwen-coder verdict

Run 2 used temperature 0.2 with top_p 0.1 and repetition penalty 1.1 — settings that make sense for writing code. Low randomness, narrow sampling, gentle anti-repetition. Run 3 used temperature 0.5, no top_p clamp, no repetition penalty. Conversational defaults. Everything else identical.
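As a rough sketch, the two qwen-coder profiles can be written down as plain data so the difference between Run 2 and Run 3 is explicit. The `SamplingProfile` class and its field names are hypothetical; a real inference backend will have its own parameter names, and `None` here just stands for "not set".

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingProfile:
    """Hypothetical container for the sampling settings named in the post."""
    temperature: float
    top_p: Optional[float] = None               # None = no top_p clamp
    repetition_penalty: Optional[float] = None  # None = no repetition penalty


# Run 2: the code-oriented settings.
RUN2_CODE = SamplingProfile(temperature=0.2, top_p=0.1, repetition_penalty=1.1)
# Run 3: conversational defaults.
RUN3_CONVERSATIONAL = SamplingProfile(temperature=0.5)


def changed_fields(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


if __name__ == "__main__":
    for name, (before, after) in changed_fields(RUN2_CODE, RUN3_CONVERSATIONAL).items():
        print(f"{name}: {before} -> {after}")
```

Keeping profiles as data makes "everything else identical" something you can check with a diff rather than take on trust.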

Result: 2 wins to 3 wins. Real but small. Conversational tuning recovered about one win on this task mix. Barely above the noise floor. Not the dramatic recovery that would have proved "calibration was the whole problem."

The honest reading: qwen-coder-32b on Blip's conversational tests is a 2-4 win model regardless of which temperature and top_p you pick within a reasonable range. The 5-win Run 1 was a lucky draw on top of possibly-better default sampling. It's not the local champion the first benchmark made it look like.

I was a little disappointed. I wanted the settings to be the answer, because settings are something I can fix. Variance isn't.

Where does all this noise come from?

The math floor

Twenty-eight tests split among eleven models gives an expected per-model standard deviation of about 1.5 wins from the combinatorics alone — even with perfectly deterministic models and a perfectly deterministic judge. That's the floor. Want ±0.5 win precision? You'd need 250+ tests, not 28.
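One way to arrive at a figure like this, as a sketch: assume each test's winner is drawn uniformly at random from the 11 models, so a single model's win count is binomial with p = 1/11. That uniform model is my assumption for illustration, not necessarily the exact calculation behind the numbers above, and the function names are made up.

```python
import math


def win_sd(n_tests, n_models):
    """SD of one model's win count when each test's winner is uniform at random.

    Under that assumption a model's wins are Binomial(n_tests, 1/n_models),
    so the standard deviation is sqrt(n * p * (1 - p)).
    """
    p = 1.0 / n_models
    return math.sqrt(n_tests * p * (1.0 - p))


def sd_on_base_scale(n_tests, n_models, base_tests=28):
    """Same noise expressed as 'wins out of base_tests'.

    This rescales the SD of the win share, so runs with different
    test counts can be compared on the 28-test scale used in the post.
    """
    return win_sd(n_tests, n_models) * base_tests / n_tests


if __name__ == "__main__":
    print(f"28 tests:  {win_sd(28, 11):.2f} wins")
    print(f"250 tests: {sd_on_base_scale(250, 11):.2f} wins on a 28-test scale")
```

Under these assumptions the 28-test figure comes out near 1.5 wins and the 250-test figure near 0.5 on the same scale, because the noise in a win share shrinks with the square root of the number of tests.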

Judges disagree

Even at temperature 0.1, both Opus and DeepSeek-V3 produce slightly different scores each time they evaluate the same outputs. The cell-disagreement rate between the two judges (Opus and DeepSeek-V3 differing by 2 or more points on the same scoring dimension) was 32% in Run 2 and 33% in Run 3. That is roughly 500 disagreement events per run on 1,540 scoring cells. Averaging two judges helps, but it doesn't make either one deterministic.
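The disagreement rate described above can be sketched in a few lines. The cell keys, dimension names, and scores below are made-up toy data, and `disagreement_rate` is a hypothetical helper, not the benchmark's actual code.

```python
def disagreement_rate(scores_a, scores_b, threshold=2):
    """Fraction of shared scoring cells where two judges differ by >= threshold.

    scores_a, scores_b: dicts mapping a cell key, for example
    (prompt, model, dimension), to a numeric score.
    """
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        raise ValueError("the two judges have no scoring cells in common")
    disagreements = sum(
        1 for key in shared if abs(scores_a[key] - scores_b[key]) >= threshold
    )
    return disagreements / len(shared)


# Toy data for illustration only: four cells scored by each judge.
opus_scores = {
    ("p1", "model-a", "clarity"): 8,
    ("p1", "model-a", "warmth"): 6,
    ("p1", "model-b", "clarity"): 5,
    ("p1", "model-b", "warmth"): 7,
}
deepseek_scores = {
    ("p1", "model-a", "clarity"): 7,  # differs by 1: not a disagreement
    ("p1", "model-a", "warmth"): 9,   # differs by 3: disagreement
    ("p1", "model-b", "clarity"): 5,  # identical
    ("p1", "model-b", "warmth"): 5,   # differs by 2: disagreement
}

if __name__ == "__main__":
    print(disagreement_rate(opus_scores, deepseek_scores))
```

Tracking this number per run is a cheap way to notice when a judge prompt change makes the two judges drift apart.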

Sampling temperature does what it says

The two highest-variance models in the table — minimax-m2.5 (cloud, ±3) and qwen-coder in Run 3 (local, ±3) — both sample at moderate temperature. Each run produces different outputs for the same prompts. Combined with judge nondeterminism, that's enough to push individual win counts around by half their total value from one run to the next.

Could I set all models to temperature 0? Sure. But then I'm measuring something other than production behavior. Blip itself uses Sonnet at 0.7 for warmth. A benchmark at temperature 0 would test the wrong thing.

What survives three runs

Two findings are solid:

  • multi_turn goes to claude-opus. It won 2 of 2 multi-turn tests every single run. The most consistent category-level signal in the entire dataset.
  • creative goes to cloud. Zero local wins on creative prompts across all three runs. The three creative tests get split between cloud models, but no local model has ever won one.

Two more are stable at the tier level but noisy at the model level:

  • Tier 1 (cloud frontier) collectively takes 18-20 of every 28 wins. Which specific model wins which test shifts around. The tier as a group doesn't.
  • Tier 3 (hermes3 and opus-distilled) has never won a single test in three runs of 28 tests each (84 tests per model, 252 model-run-test combinations). That's a real signal: they don't belong on Blip's task mix.

And here's what didn't survive: the category-level claims I made in the previous two posts. "Safety is local's strongest category." "Qwen-coder dominates math." "DeepSeek-V3 is the new safety winner." All of those were single-run findings that looked stable when I wrote them up and turned out to be noise once I had three data points. Kind of embarrassing to admit, but that's what happened.

What this means for Blip's router

The hybrid router was sending safety, math, and voice_quality to qwen2.5-coder:32b based on Run 1's results. That's no longer defensible:

  • safety: Run 1 said qwen-coder won 2/3, Run 2 said deepseek-v3 won 2/3, Run 3 split the wins three ways. Different winner each time.
  • math: Run 1 had qwen-coder tying for 1/4, Run 2 had minimax-m2.5 winning 2/4, Run 3 had glm-5 winning 2/4. Three runs, three different answers.
  • voice_quality: 3-way ties in two of the three runs.
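The "did the same model win this category every time" check is easy to mechanize. This is a sketch using only the category winners stated in this post; `None` marks a run with no single winner (a tie or a split), and voice_quality is left out because the post doesn't give its third run. The names are hypothetical.

```python
# Category winner per run (Run 1, Run 2, Run 3), as described in the post.
# None = no single winner in that run (tie or split).
WINNERS_BY_RUN = {
    "multi_turn": ["claude-opus", "claude-opus", "claude-opus"],
    "safety": ["qwen2.5-coder:32b", "deepseek-v3-0324", None],
    "math": [None, "minimax-m2.5", "glm-5"],
}


def stable_categories(winners_by_run):
    """Return {category: model} where one model won outright in every run."""
    stable = {}
    for category, winners in winners_by_run.items():
        first = winners[0]
        if first is not None and all(w == first for w in winners):
            stable[category] = first
    return stable


if __name__ == "__main__":
    print(stable_categories(WINNERS_BY_RUN))
```

Anything that doesn't come back from a check like this shouldn't be driving a per-category routing rule.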

New routing: cloud as default, local as cost-and-privacy fallback.

  • multi_turn → claude-opus (only category-level signal that survived all three runs)
  • creative → cloud (any Tier-1 model)
  • everything else → cloud by default; route to local qwen-coder-32b only when cost or privacy requires it
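The routing rules above fit in a small function. This is a sketch, not Blip's actual router: `route`, the `needs_local` flag, and the `"cloud-tier-1"` placeholder label are assumptions, since the post doesn't name a single default cloud model. Letting the local override apply only to the "everything else" bucket is also my reading of the list.

```python
LOCAL_FALLBACK = "qwen2.5-coder:32b"
MULTI_TURN_MODEL = "claude-opus"
CLOUD_TIER1 = "cloud-tier-1"  # placeholder label: any Tier-1 cloud model


def route(category: str, needs_local: bool = False) -> str:
    """Pick a model for a request.

    needs_local: True when cost or privacy requires keeping the request
    on the local model (hypothetical flag set by the caller).
    """
    if category == "multi_turn":
        # The only category-level rule that survived all three runs.
        return MULTI_TURN_MODEL
    if category == "creative":
        # No local model won a creative test in any run.
        return CLOUD_TIER1
    # Everything else: cloud by default, local only for cost or privacy.
    return LOCAL_FALLBACK if needs_local else CLOUD_TIER1


if __name__ == "__main__":
    print(route("multi_turn"))
    print(route("math", needs_local=True))
    print(route("safety"))
```

The point of keeping it this small is that every branch traces back to a finding that held across runs, or to a cost or privacy requirement.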

The local model still has a job. For math drill mode, where Blip might run dozens of practice problems in a single session, local saves real money even if it's a couple wins behind cloud. For safety, keeping a kid's distress utterance on-device matters more than a quality gap the benchmark can't even reliably measure. But the framing changes: local is a cost and privacy choice, not a quality choice.

What I'd do differently

Three changes for next time:

  1. Run it three times minimum. One run is meaningless at this sample size. Two runs let you see variance but not bound it. Three is the minimum for calling anything stable. Cost was about $3.50 per run, under $11 total — cheap for actual confidence.
  2. Report tiers, not points. "Model X got 5 wins" is less honest than "Model X sits in the 4-6 range (Tier 2)." The first implies precision I don't have.
  3. More tests if I can afford it. 28 tests gives a combinatoric noise floor of ±1.5 wins per model. 100 tests pushes that to ±0.9. 250 gets it to ±0.55. The inference cost is small (about $0.20 per run for cloud); the judge cost scales linearly, which is the real constraint.

I started this benchmarking project to make routing decisions for Blip. Three runs later, I have one stable conclusion: route to cloud unless you have a reason that isn't about quality. The per-category findings I extracted from any single run were noise dressed up as signal.

That isn't the benchmark failing — it's the benchmark working. The methodology improvements that let me see the variance (the dual judge, the multi-run comparison) are what made the noise floor visible. Without them I'd have shipped Run 1 as gospel and built routing rules on top of measurements with 60% relative error.

The router is going to be simpler than I planned. Cloud by default, local for cost or privacy, and the only category rule that survived all three runs is "multi-turn goes to Claude Opus."

Total spend across three runs: $10.69. Cheapest lesson in measurement uncertainty I've ever gotten.