The Settings That Change Everything: Per-Model Calibration Before the Re-Run
The first benchmark run sent identical settings to every model. Temperature 0.3. System prompt in the system role. No repetition penalty. Same for all eleven models. That's the wrong way to do this, and I knew it was wrong before the run finished — I just didn't have the per-model profiles implemented yet.
The results article published those numbers. This article documents exactly what was wrong, what the corrected settings are, and what's different between the two runs — so when the re-run completes, readers can evaluate whether the rankings actually changed or held up.
How we know the first run was misconfigured
Each result file from the benchmark runner includes a profile_applied
field — the exact settings used for that model on that test. It's there for
transparency, so a reader can verify conditions were appropriate.
In every result from the 2026-04-07 run: no profile_applied field.
It's absent because the per-model profiles weren't implemented yet when the run
executed. The harness used the global defaults for every model.
The global defaults that ran
| Setting | Value applied to all models |
|---|---|
temperature |
0.3 |
| System prompt role | system (for all models) |
repetition_penalty |
Not set |
max_tokens |
4,096 |
num_ctx |
4,096 (added mid-run to prevent VRAM exhaustion) |
| Few-shot examples | None (Blip suite uses per-test system prompts, not few-shot injection) |
What should have run instead
The per-model calibration research is documented in the methodology article. Here's the complete diff — what each model actually ran vs. what it should run:
| Model | Ran (first run) | Will run (re-run) | Impact |
|---|---|---|---|
| DeepSeek-R1:32b | temp 0.3, system prompt in system role | temp 0.6, system prompt merged into user message | High — two critical bugs |
| Hermes3:8b | temp 0.3, no repetition_penalty | temp 0.8, repetition_penalty 1.1 | High — both settings wrong |
| Qwen2.5-Coder:32b | temp 0.3, no top_p | temp 0.2, top_p 0.1, rep_penalty 1.18 | Moderate — code precision tasks |
| Qwen3.5-abliterated:35b | temp 0.3 | temp 0.5, rep_penalty 1.05 | Low — more natural chat output expected |
| GLM-5 | temp 0.3 | temp 0.5 | Low — general MoE chat model |
| MiniMax-M2.5 | temp 0.3 | temp 0.5 | Low |
| DeepSeek-V3-0324 | temp 0.3 | temp 0.3 (unchanged — V3 is not R1) | None |
| Claude Sonnet 4.6 | temp 0.3 | temp 0.3 (matches Anthropic recommendation) | None |
| Claude Opus 4.6 | temp 0.3 | temp 0.3 (matches Anthropic recommendation) | None |
The two models where this matters most
DeepSeek-R1: two critical errors
R1 wasn't trained to follow system role instructions. It was trained to reason from user prompts. Putting instructions in the system role — which is what the first run did to every model including R1 — causes it to ignore the instructions or enter logic drift. This is documented in DeepSeek's own release notes and reproduced consistently by the community.
The second issue: temperature 0.3 is too low for R1. DeepSeek's official recommendation is 0.5–0.7 (0.6 center). At 0.3, R1 produces incoherent or repetitive output on tasks that require reasoning. The first run used 0.3.
R1 scored zero wins across 28 Blip tests in the first run. That result may be measuring a crippled version of the model, not its actual capability. The re-run will be the first honest data point for R1 on this suite.
Hermes3:8b: both settings off
NousResearch's own example code for Hermes3 uses temperature 0.8. The first run used 0.3 — nearly the opposite end of the range. At 0.3, the model produces stilted, over-formal output that loses the conversational warmth Hermes3 was tuned for.
Without repetition_penalty, Hermes3 loops on multi-turn conversations.
The Blip test suite includes multi-turn tests. In the first run, the model
didn't have the penalty set. Any multi-turn test where it looped would have
scored poorly — not because the model is bad, but because it was operating
without a documented required setting.
What the first run's results can still tell us
Not all of the first run is in question. The models that ran on correct settings — Claude Opus, Claude Sonnet, DeepSeek-V3-0324 — produced valid results. Their rankings against each other are trustworthy. The re-run may adjust scores slightly due to temperature changes in other models, but the Claude models' relative performance should be stable.
Qwen2.5-Coder's results are probably directionally correct even with the wrong temperature. The 5 wins it scored (safety, math, voice quality) are in categories where a slightly higher temperature wouldn't have cost it the win. But we'll know for certain after the re-run.
What changes in the code
The calibration profiles are now in runner.py as a
MODEL_PROFILES dict. Each model has:
temperature— model-specific value from official docs or community researchuse_system_prompt—Falsefor R1 (instructions merge into user message)repetition_penalty— set where documented to prevent loopingtop_p— set where meaningful (Qwen-Coder at 0.1 for code precision)reason— the documented source for each setting choice
Each result will now include a profile_applied field with the exact
settings used. The HTML report will surface these per model, so a reader can
verify the conditions rather than taking the runner's word for it.
What the re-run found
The calibrated run completed April 8, 2026 — 9 models, 44 tests each, 396 total
outputs. The profile_applied field is present in every result file.
Here's what changed:
- Claude Opus: 8 → 12 wins. Claude Sonnet: 4 → 10 wins. The Claude models dominated more clearly when every other model was also properly configured — the gap widened because the field got more accurate, not less competitive.
- GLM-5: 4 → 8 wins, avg 40.7/50. Jumped to the strongest non-Anthropic model in the suite. Its calibration change was modest (temp 0.3 → 0.5), but that was enough to move it from a three-way tie into a clear third place.
- DeepSeek-R1: 0 → 1 win, avg 31.1/50. The calibration helped — correct temperature and system prompt routing fixed the logic drift — but R1 is still at the bottom of the pack on Blip tasks. It's a reasoning model. Spelling bees aren't its context.
- Hermes3:8b: 0 → 1 win, avg 28.2/50. Also improved, also still last. The correct temperature and repetition_penalty fixed the looping behavior, but the model is genuinely weaker than the others on this suite.
- Qwen2.5-Coder: held at 5 wins. Was the best local model in the first run; still the best local model. GLM-5 outperforms it overall but GLM-5 is OpenRouter (cloud), not local.
The results article has been updated with the calibrated numbers and a before/after column for every model.