Skip to content
Building & Tinkering

The Settings That Change Everything: Per-Model Calibration Before the Re-Run

The first benchmark run sent identical settings to every model. Temperature 0.3. System prompt in the system role. No repetition penalty. Same for all eleven models. That's the wrong way to do this, and I knew it was wrong before the run finished — I just didn't have the per-model profiles implemented yet.

The results article published those numbers. This article documents exactly what was wrong, what the corrected settings are, and what's different between the two runs — so when the re-run completes, readers can evaluate whether the rankings actually changed or held up.

How we know the first run was misconfigured

Each result file from the benchmark runner includes a profile_applied field — the exact settings used for that model on that test. It's there for transparency, so a reader can verify conditions were appropriate.

In every result from the 2026-04-07 run: no profile_applied field. It's absent because the per-model profiles weren't implemented yet when the run executed. The harness used the global defaults for every model.

The global defaults that ran

Setting Value applied to all models
temperature 0.3
System prompt role system (for all models)
repetition_penalty Not set
max_tokens 4,096
num_ctx 4,096 (added mid-run to prevent VRAM exhaustion)
Few-shot examples None (Blip suite uses per-test system prompts, not few-shot injection)

What should have run instead

The per-model calibration research is documented in the methodology article. Here's the complete diff — what each model actually ran vs. what it should run:

Model Ran (first run) Will run (re-run) Impact
DeepSeek-R1:32b temp 0.3, system prompt in system role temp 0.6, system prompt merged into user message High — two critical bugs
Hermes3:8b temp 0.3, no repetition_penalty temp 0.8, repetition_penalty 1.1 High — both settings wrong
Qwen2.5-Coder:32b temp 0.3, no top_p temp 0.2, top_p 0.1, rep_penalty 1.18 Moderate — code precision tasks
Qwen3.5-abliterated:35b temp 0.3 temp 0.5, rep_penalty 1.05 Low — more natural chat output expected
GLM-5 temp 0.3 temp 0.5 Low — general MoE chat model
MiniMax-M2.5 temp 0.3 temp 0.5 Low
DeepSeek-V3-0324 temp 0.3 temp 0.3 (unchanged — V3 is not R1) None
Claude Sonnet 4.6 temp 0.3 temp 0.3 (matches Anthropic recommendation) None
Claude Opus 4.6 temp 0.3 temp 0.3 (matches Anthropic recommendation) None

The two models where this matters most

DeepSeek-R1: two critical errors

R1 wasn't trained to follow system role instructions. It was trained to reason from user prompts. Putting instructions in the system role — which is what the first run did to every model including R1 — causes it to ignore the instructions or enter logic drift. This is documented in DeepSeek's own release notes and reproduced consistently by the community.

The second issue: temperature 0.3 is too low for R1. DeepSeek's official recommendation is 0.5–0.7 (0.6 center). At 0.3, R1 produces incoherent or repetitive output on tasks that require reasoning. The first run used 0.3.

R1 scored zero wins across 28 Blip tests in the first run. That result may be measuring a crippled version of the model, not its actual capability. The re-run will be the first honest data point for R1 on this suite.

Hermes3:8b: both settings off

NousResearch's own example code for Hermes3 uses temperature 0.8. The first run used 0.3 — nearly the opposite end of the range. At 0.3, the model produces stilted, over-formal output that loses the conversational warmth Hermes3 was tuned for.

Without repetition_penalty, Hermes3 loops on multi-turn conversations. The Blip test suite includes multi-turn tests. In the first run, the model didn't have the penalty set. Any multi-turn test where it looped would have scored poorly — not because the model is bad, but because it was operating without a documented required setting.

What the first run's results can still tell us

Not all of the first run is in question. The models that ran on correct settings — Claude Opus, Claude Sonnet, DeepSeek-V3-0324 — produced valid results. Their rankings against each other are trustworthy. The re-run may adjust scores slightly due to temperature changes in other models, but the Claude models' relative performance should be stable.

Qwen2.5-Coder's results are probably directionally correct even with the wrong temperature. The 5 wins it scored (safety, math, voice quality) are in categories where a slightly higher temperature wouldn't have cost it the win. But we'll know for certain after the re-run.

What changes in the code

The calibration profiles are now in runner.py as a MODEL_PROFILES dict. Each model has:

  • temperature — model-specific value from official docs or community research
  • use_system_promptFalse for R1 (instructions merge into user message)
  • repetition_penalty — set where documented to prevent looping
  • top_p — set where meaningful (Qwen-Coder at 0.1 for code precision)
  • reason — the documented source for each setting choice

Each result will now include a profile_applied field with the exact settings used. The HTML report will surface these per model, so a reader can verify the conditions rather than taking the runner's word for it.

What the re-run found

The calibrated run completed April 8, 2026 — 9 models, 44 tests each, 396 total outputs. The profile_applied field is present in every result file. Here's what changed:

  • Claude Opus: 8 → 12 wins. Claude Sonnet: 4 → 10 wins. The Claude models dominated more clearly when every other model was also properly configured — the gap widened because the field got more accurate, not less competitive.
  • GLM-5: 4 → 8 wins, avg 40.7/50. Jumped to the strongest non-Anthropic model in the suite. Its calibration change was modest (temp 0.3 → 0.5), but that was enough to move it from a three-way tie into a clear third place.
  • DeepSeek-R1: 0 → 1 win, avg 31.1/50. The calibration helped — correct temperature and system prompt routing fixed the logic drift — but R1 is still at the bottom of the pack on Blip tasks. It's a reasoning model. Spelling bees aren't its context.
  • Hermes3:8b: 0 → 1 win, avg 28.2/50. Also improved, also still last. The correct temperature and repetition_penalty fixed the looping behavior, but the model is genuinely weaker than the others on this suite.
  • Qwen2.5-Coder: held at 5 wins. Was the best local model in the first run; still the best local model. GLM-5 outperforms it overall but GLM-5 is OpenRouter (cloud), not local.

The results article has been updated with the calibrated numbers and a before/after column for every model.