The Corrected Benchmark: 9 Models, 5 Suites, Calibrated Settings

The first post identified three methodology errors. The second described the fix — a per-model profile system that sets temperature, system prompt handling, and repetition penalty to match each model's documented requirements. This post has the corrected results.

What changed in the setup

Nine models across five task suites (Blip tasks, coding, clinical, reasoning, and writing). 396 outputs total. The two Opus-distilled models were removed — they scored zero wins in the first run and averaged 2,500+ tokens per response on tasks that need 40. They may be excellent models for other use cases; they're not useful for Blip's routing layer.

DeepSeek-R1 now runs with no system prompt and no few-shot examples, at temperature 0.6. Hermes3 now runs with repeat_penalty: 1.1. Qwen-Coder runs at temperature 0.2 with top_p 0.1. Every model's settings are recorded in the report alongside its scores.

GLM-5 hit an OpenRouter credit limit mid-run and errored on 6 of 44 tests. Its results are incomplete. I'm noting this rather than excluding it because the wins it did accumulate are meaningful even with the partial data.

Overall results

Judge wins across all 44 judged tests (5 suites):

Claude Opus 4.6: 12 wins
Claude Sonnet 4.6: 10 wins
GLM-5 (partial — 6 errors): 8 wins
Qwen-Coder:32b: 5 wins
MiniMax-M2.5: 3 wins
DeepSeek-V3-0324: 3 wins
DeepSeek-R1:32b: 1 win
Hermes3:8b: 1 win
Qwen-Abliterated:35B-A3B: 1 win

Blip-specific results

On the 28 Blip task prompts specifically:

GLM-5: 7 wins
Claude Opus: 5 wins
Qwen-Coder: 5 wins
Claude Sonnet: 4 wins
MiniMax-M2.5: 2 wins
DeepSeek-V3-0324: 2 wins
DeepSeek-R1: 1 win
Hermes3:8b: 1 win
Qwen-Abliterated: 1 win

What the calibration actually changed

DeepSeek-R1: 0 → 1 win

Correcting R1's settings — removing the system prompt, removing few-shot examples, setting temperature to 0.6 — moved it from zero wins to one. That's a real improvement, but it's not the reversal I expected. Running R1 under wrong conditions didn't bury a competitive model; it confirmed that R1's internal reasoning style, while impressive for complex tasks, produces over-deliberated responses for the short structured outputs Blip needs. A spelling bee answer shouldn't require a multi-step reasoning trace.

Hermes3: 0 → 1 win

Adding repeat_penalty: 1.1 fixed the looping problem and let Hermes3 compete fairly. It won one task — a structured safety response — and performed consistently without degrading mid-conversation. It's still the fastest model by a large margin (184 tokens/sec, vs Claude's 27–29) and its one win on a safety task at 4.7 GB still matters for the router design.

GLM-5 dominated Blip tasks despite errors

Seven wins in 22 eligible Blip tests (6 were errored out) is a strong performance. GLM-5's outputs were consistently concise, correctly formatted for TTS, and accurate. Its pricing ($0.72/$2.30 per million tokens) is higher than DeepSeek-V3 but its results justify it for categories where it wins. I need to top up the OpenRouter credits before drawing final conclusions, but the directional signal is clear.

Claude continues to dominate non-Blip suites

Coding: Sonnet 3, Opus 2. Reasoning: Opus 2, MiniMax 1. Writing: Sonnet 1, Opus 1. Clinical: Sonnet 2, Opus 2 (with GLM-5 and DeepSeek-V3 each picking up one). For tasks requiring deep domain knowledge, sustained coherence, and careful judgment, the frontier models aren't close to being replaced by local ones.

What this means for Blip's routing layer

The data now supports a clearer routing table than the first run produced:

Safety responses: Qwen-Coder (local, free, dominated this category)
Spelling, trivia, structured tasks: GLM-5 or Claude Opus
Emotional and conversational: Claude Sonnet
Math and multi-turn: Qwen-Coder (local) or Claude Opus
Front door / classification: Hermes3:8b (fast, cheap, reliable under load)

The architecture that's emerging: Hermes3 handles classification and simple front-door tasks at <200ms; Qwen-Coder handles structured local tasks (safety, math formatting, spelling validation); cloud escalates to GLM-5 or Claude for creative and conversational tasks where local models consistently underperform.

What's still unresolved

Judge affinity remains a live concern. Claude Opus as judge may favor outputs that resemble Claude's training distribution. The corrected benchmark didn't solve this — it only removed the most obvious sources of artificial score degradation. A proper multi-judge setup (at least one non-Anthropic judge) would give more confidence in the relative rankings between local models and Claude.

GLM-5's credit errors need a clean full run before I treat its 7-win score as final. It's the most promising cloud-affordable model in the Blip task categories, which makes getting clean data on it important.

The results are good enough to proceed with Blip's routing layer. They're not good enough to be the last benchmark I ever run.