What I Learned Benchmarking 6 Local LLMs — and Why Most Model Comparisons Are Wrong

I needed to choose the right local LLM for Blip's routing layer — the model that handles structured tasks like spelling bees and math rounds before escalating to Claude. That required a real benchmark: same tasks, fair conditions, honest scoring. I built a harness, ran 28 tests across 6 models, and got results I almost published before noticing I'd made three fundamental errors that would have made the entire comparison meaningless.

This is what I found, what I got wrong, and what a fair model comparison actually requires.

The models

All running locally on the inference box (RTX PRO 6000 Blackwell, 96GB VRAM) via Ollama:

Qwen2.5-Coder:32b — code-specialized, ChatML template
DeepSeek-R1:32b — reasoning model with internal <think> blocks
Hermes3:8b — instruction-following generalist, Llama 3.1 base
Qwen3.5-27B Opus-Distilled — dense model trained on Claude Opus reasoning chains
Qwen3.5-35B-A3B Opus-Distilled — MoE variant of the above
Qwen3.5-Abliterated:35B-A3B — same MoE architecture, safety filters removed

Plus Claude Sonnet 4.6 and Opus 4.6 via AWS Bedrock as reference models and judge.

Three errors that would have invalidated the results

1. DeepSeek-R1 and system prompts

My benchmark sent a system prompt to every model. For most models, that's correct — they were trained to follow system-role instructions. For DeepSeek-R1, it's a mistake that degrades performance. R1 was trained to reason from user prompts. Putting instructions in the system role triggers hallucination and circular reasoning loops.

The fix: for R1, merge all instructions into the user message. No system prompt at all. Running R1 with a system prompt and calling it a fair comparison is like testing a sprinter with one shoe and reporting the result as their actual speed.

2. DeepSeek-R1 and few-shot examples

My output calibration script added few-shot examples to every prompt — show the model two examples of the expected output format, then ask for a third. This improves most models. For R1, it degrades performance. R1 tries to mimic the examples instead of using its native reasoning engine. Zero-shot consistently outperforms few-shot for R1.

If you benchmark R1 with few-shot examples, you're measuring how well it mimics patterns, not how well it reasons. The results look bad and tell you nothing.

3. Hermes3 and repetition penalty

Without a repetition penalty, Hermes3 loops on multi-turn conversations. My benchmark didn't set this parameter. The model would start looping mid-conversation and score poorly on coherence — not because it's a bad model, but because I hadn't read the documentation carefully enough. Setting repetition_penalty: 1.1 fixes it.

What model-specific tuning actually looks like

After identifying these problems, I built per-model profiles that set the right conditions for each. The key parameters that vary by model:

System prompt: Full / compact / merge-into-user / none
Few-shot examples: Include / skip
Temperature: Ranges from 0.2 (Qwen-Coder, code precision) to 0.8 (Hermes3, natural chat)
Repetition penalty: Only where documented to help
Output stripping: Remove <think> blocks before scoring (R1, Opus-distilled models)

The <think> block issue deserves a separate note. DeepSeek-R1 and the Opus-distilled models produce internal reasoning traces wrapped in <think></think> tags before the actual answer. These can be ten times longer than the answer itself. If you score the raw output — including the thinking trace — you're measuring something different from what you'd measure if you scored just the answer. The judge has to strip them first.

The judge affinity problem

I'm using Claude Opus as the benchmark judge. That creates a structural problem: the Opus-distilled models in the benchmark were trained on Claude Opus reasoning chains. When Opus evaluates a model that sounds like Opus, it may prefer that output not because it's better, but because it's stylistically familiar. This is judge affinity — same-distribution preference that inflates scores for similar models.

I haven't fully solved this. The partial fixes: use multiple judges where possible, score on objective criteria rather than subjective preference (did the spelling bee word come back in the right format? did the math answer parse correctly?), and flag stylistic similarity in the report so I'm at least aware of it.

What the results actually showed

With proper per-model settings, the benchmark became useful. For Blip's specific tasks — structured spelling, math routing, trivia formatting — the smaller, faster models performed surprisingly well when configured correctly. Hermes3:8b at the right temperature, with repetition penalty set, handled spelling bee turns reliably and returned clean structured output. It loads faster and costs less VRAM than any of the 32B+ models.

The reasoning models (R1, Opus-distilled) over-reasoned simple tasks. A spelling bee prompt shouldn't trigger a multi-paragraph internal deliberation. For Blip's routing layer — the part that handles "is this a spelling question or a math question, and here's the answer" — the overhead isn't worth it. Claude handles the creative and conversational work where that reasoning depth actually matters.

What I'm improving in the harness

The benchmark code needs three changes before I trust the results for final model selection:

Per-model prompt handling — a MODEL_PROFILES dictionary in the runner that controls system prompt tier, few-shot, temperature, and repetition penalty per model. No more one-size-fits-all prompting.
Pre-scoring output calibration — strip thinking tags, remove verbose preambles ("As an AI language model..."), normalize formatting before the judge sees any output.
Transparency in the report — the HTML report should show exactly which settings were used for each model, so a reader can verify the conditions were appropriate, not just trust the scores.

The broader lesson

Most published LLM comparisons don't account for any of this. They pick a benchmark, run every model with the same settings, and report the numbers. The model that happens to work well under generic conditions wins; models with different optimal configurations lose on a test that wasn't designed for them.

This isn't a criticism of the benchmark authors — it's genuinely hard to run per-model calibrated comparisons at scale. But if you're making a real decision (which model goes into production for a specific use case), you need to understand how each model is designed to be used before you test it. Otherwise you're measuring how well models perform under wrong conditions, not how well they actually perform.

For Blip, the decision isn't finalized yet. But the methodology is now solid enough that when the final numbers come in, I'll believe them.