Two Judges Are Better Than One: What the Dual-Judge Benchmark Found

I almost published a post saying GLM-5 dominated the Blip benchmark with 9 wins. That number was wrong. It was wrong because my second judge — the one I'd added specifically to catch this kind of problem — was failing silently on every single test, and I didn't notice until I checked a field I almost didn't check.

The previous post left two open questions: would a non-Anthropic judge rank things differently, and could the two Opus-distilled Qwen models be tamed if I got their verbosity under control? I re-ran the 11-model benchmark to answer both. One answer surprised me. The other didn't.

What changed

Three things. judge.py now sends every test to two judges: Claude Opus 4.6 via Bedrock and DeepSeek-V3-0324 via OpenRouter. Each judge independently scores the 11 model responses on five dimensions — accuracy, completeness, reasoning, specificity, conciseness. Scores get averaged. Any dimension where the judges differ by two or more points gets flagged. Final ranking comes from the averaged totals.
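The averaging and flagging step is small enough to sketch. This is a minimal illustration, not the actual judge.py code: the function name, the dict layout, and the returned keys are assumptions, and only the five dimensions and the two-point threshold come from the description above.

```python
from statistics import mean

# The five scoring dimensions named in the post.
DIMENSIONS = ("accuracy", "completeness", "reasoning", "specificity", "conciseness")
# Judges "differ by two or more points" -> flag the dimension.
DISAGREEMENT_THRESHOLD = 2


def combine_judgments(opus: dict, v3: dict) -> dict:
    """Average two judges' per-dimension scores and flag large gaps.

    Hypothetical helper: the real harness may store scores differently.
    """
    averaged = {dim: mean((opus[dim], v3[dim])) for dim in DIMENSIONS}
    flagged = [
        dim for dim in DIMENSIONS
        if abs(opus[dim] - v3[dim]) >= DISAGREEMENT_THRESHOLD
    ]
    return {"scores": averaged, "total": sum(averaged.values()), "flagged": flagged}
```

With made-up scores of 9/8/9/7/8 from one judge and 9/6/6/7/9 from the other, the averaged total is 39.0 and the flagged dimensions are completeness and reasoning.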

The two opus-distilled models are back. Previous runs had them emitting 2,579 and 1,455 tokens per response because they'd been distilled from Claude Opus's chain-of-thought traces, with long deliberative reasoning baked into their bones. This run prepends /no_think (Qwen3's soft switch for suppressing the thinking block) and caps max_tokens at 500. Did that fix them?

And qwen-coder-32b now runs at temperature 0.2, top_p 0.1, repetition penalty 1.1 — settings pulled from research papers on coding tasks.
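The two calibration changes amount to per-model overrides. The sketch below is one way to express them, under loud assumptions: the override table, the key names (prompt_prefix, repetition_penalty), and build_request are all hypothetical, and the real harness may pass these settings quite differently. Only the values themselves come from this post.

```python
# Hypothetical per-model overrides. Values are from the post; key names are illustrative.
OVERRIDES = {
    "qwen3.5-27b-opus-distilled": {"prompt_prefix": "/no_think ", "max_tokens": 500},
    "qwen3.5-35b-a3b-opus-distilled": {"prompt_prefix": "/no_think ", "max_tokens": 500},
    "qwen2.5-coder:32b": {"temperature": 0.2, "top_p": 0.1, "repetition_penalty": 1.1},
}


def build_request(model: str, prompt: str) -> dict:
    """Build a generic request dict, applying any overrides for this model."""
    settings = dict(OVERRIDES.get(model, {}))
    # The prefix is prepended to the prompt rather than sent as a sampling option.
    prefix = settings.pop("prompt_prefix", "")
    return {"model": model, "prompt": prefix + prompt, **settings}
```

Models without an entry fall through with default sampling, which is how the previous run treated qwen-coder.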

The bug that nearly invalidated everything

First attempt at the dual-judge run: 28 of 28 tests fell back to Opus-only. Every judgment's v3_error field read "OPENROUTER_API_KEY not set — skipping V3 co-judge". The benchmark harness has a .env loader in runner.py, but judge.py ran as a separate subprocess from compare.sh and never called it. So the second judge failed silently, and the results looked like a clean single-judge run.

I caught it because I checked the dual_judge field on the judgment entries and noticed it was always false. If I hadn't? I'd have published "GLM-5 dominates with 9 wins" and meant it. That number would have been five wins too high.
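That manual check is easy to automate. A minimal sketch, assuming each judgment is a dict carrying the dual_judge and v3_error fields mentioned above; the function names and the list-of-dicts layout are hypothetical.

```python
def dual_judge_coverage(judgments: list) -> tuple:
    """Return (dual-judged count, total count, distinct co-judge error messages)."""
    dual = sum(1 for j in judgments if j.get("dual_judge"))
    errors = {j["v3_error"] for j in judgments if j.get("v3_error")}
    return dual, len(judgments), errors


def require_dual_judging(judgments: list, min_fraction: float = 1.0) -> None:
    """Raise instead of letting a single-judge run pass as a dual-judge run."""
    dual, total, errors = dual_judge_coverage(judgments)
    if total == 0 or dual / total < min_fraction:
        raise RuntimeError(
            f"only {dual}/{total} tests were dual-judged; errors: {sorted(errors)}"
        )
```

Calling require_dual_judging on the first attempt's output would have stopped the run at 0/28 instead of producing a plausible-looking results table.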

Fix was one function: _load_env_file() added to judge.py. Re-judged the existing raw outputs. The results below are from the re-judge. Silent fallbacks in multi-step pipelines — I could rant about this pattern but that's a different post.

Results

28 tests, 11 models, dual-judged, averaged, re-ranked:

Rank	Model	Wins	Avg rank	Avg tokens out
1	Claude Opus 4.6	6	3.82	53
2	DeepSeek-V3-0324	5	4.21	35
3	Claude Sonnet 4.6	4	3.68	46
3	GLM-5	4	3.75	154
3	MiniMax-M2.5	4	4.11	154
6	qwen2.5-coder:32b	2	5.43	38
7	DeepSeek-R1:32b	2	6.75	47
8	qwen3.5-abliterated:35b-a3b	1	6.43	64
9	Hermes3:8b	0	7.50	53
10	qwen3.5-35b-a3b-opus-distilled	0	9.50	450
11	qwen3.5-27b-opus-distilled	0	10.82	477
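The Wins and Avg rank columns can be derived from per-test averaged totals along these lines. This is a sketch under assumptions: the nested-dict input shape and the function name are hypothetical, and ties within a test are broken by input order here, which may not match how the harness handles them.

```python
from collections import defaultdict


def summarize(totals_by_test: dict) -> dict:
    """Compute wins and average rank per model from {test: {model: total}}."""
    wins = defaultdict(int)
    rank_sums = defaultdict(int)
    for totals in totals_by_test.values():
        # Highest averaged total ranks first; equal totals keep input order.
        ordered = sorted(totals, key=totals.get, reverse=True)
        wins[ordered[0]] += 1
        for position, model in enumerate(ordered, start=1):
            rank_sums[model] += position
    n_tests = len(totals_by_test)
    return {
        model: {"wins": wins[model], "avg_rank": rank_sums[model] / n_tests}
        for model in rank_sums
    }
```

One sanity check this enables: with 11 models and unique ranks per test, the average ranks must sum to 66 (1 + 2 + ... + 11). The Avg rank column above sums to exactly 66.00, and the Wins column sums to 28.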

Judge affinity was real — and it was a 5-win swing

The Opus-only intermediate results (before I caught the bug) gave GLM-5 9 wins. With the dual judge active, GLM-5 dropped to 4. Five wins evaporated from one methodological change.

Was Opus recognizing something genuinely good in GLM-5's outputs, or just projecting affinity because those outputs pattern-match its training distribution? I honestly don't know. But the effect is quantified: adding a second judge moved one model's win count by 5 on a 28-test suite. That's one observation rather than a formal margin of error, and it may only hold for cloud models in the same architectural neighborhood, but it's bigger than I expected. It also retroactively undermines the previous post's GLM-5 results. Those were Opus-only too.

The judges fight constantly

Every one of the 28 tests had at least one scoring dimension where Opus and DeepSeek-V3 differed by two or more points. Total disagreements: 500 across 1,540 scoring cells (28 tests x 11 models x 5 dimensions). That's a 32.5% cell-disagreement rate.

Per dimension:

Reasoning: 118 disagreements (highest)
Completeness: 103
Specificity: 100
Conciseness: 99
Accuracy: 80 (lowest)
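The same counts can be turned into per-dimension rates, since each dimension has 28 x 11 = 308 cells. A short worked calculation using only the numbers above; the variable names are illustrative.

```python
# Disagreement counts per dimension, as reported above.
DISAGREEMENTS = {
    "reasoning": 118,
    "completeness": 103,
    "specificity": 100,
    "conciseness": 99,
    "accuracy": 80,
}
TESTS, MODELS = 28, 11

cells_per_dimension = TESTS * MODELS                     # 308
total_cells = cells_per_dimension * len(DISAGREEMENTS)   # 1,540
total_disagreements = sum(DISAGREEMENTS.values())        # 500

overall_rate = total_disagreements / total_cells
per_dimension_rate = {
    dim: count / cells_per_dimension for dim, count in DISAGREEMENTS.items()
}

for dim, rate in per_dimension_rate.items():
    print(f"{dim:<13}{rate:.1%}")
print(f"{'overall':<13}{overall_rate:.1%}")
```

Reasoning comes out at about 38.3% of its cells and accuracy at about 26.0%, so even the most objective dimension sees the judges two or more points apart in roughly a quarter of cases.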

Accuracy being the least-disagreed makes sense — did the model get the facts right? That's close to objective. Reasoning being the most-disagreed also makes sense. "Is the logic sound?" is a judgment call that depends on the judge's own reasoning norms. Two models grading each other's reasoning homework. I have a whole theory about why this creates a ceiling on LLM-as-judge reliability, but the short version is: any single-judge benchmark is noisy enough that small differences between models aren't meaningful. Only large, repeated gaps should drive decisions.

The opus-distilled models, again

Verbosity is under control. With /no_think plus a 500-token cap, the two opus-distilled models went from 2,579 and 1,455 average tokens down to 477 and 450. That's roughly a 5.4x and a 3.2x reduction. The calibration worked.

They still scored zero wins out of 28. Dead last at 10.82 and 9.50.

So verbosity wasn't the problem. These models were distilled from Opus's long reasoning chains — they learned to reason, not to converse. Force them to produce short outputs and they produce competent short reasoning traces that aren't what a kid wants to hear when learning to spell. I'm removing them from the default benchmark pool. They might be useful for other work — hypothesis testing, clinical reasoning drafts, anything where chain-of-thought is an asset. But not for Blip. Two runs, zero wins each time. The "maybe they'd win if we controlled verbosity" hypothesis is dead.

The qwen-coder calibration accident

Previous run: qwen2.5-coder:32b scored 5 wins and got crowned the local champion. Won 2 of 3 safety tests. Tied cloud on voice_quality, math, and multi-turn. I updated Blip's hybrid routing to send safety, math, and voice_quality to qwen-coder based on that result.

This run, with the "research-backed" settings from coding-task papers: 2 wins. Down from 5.

The settings that are correct for deterministic code generation — low temperature, tight top_p, repetition penalty — are too restrictive for conversational kid-tutor responses. I was tuning the model for the task it's named after, but Blip's tests aren't code. They're conversational educational prompts. The previous run used default sampling, which was less "correct" by the literature but better for what I was actually testing.

Does that mean qwen-coder's 5-win result was real capability measured under lucky settings, or a fluke? I don't know yet. A fair third run would use Blip-appropriate settings — temperature around 0.5, default top_p — and settle the question. Until then, qwen-coder's placement is uncertain.

DeepSeek-V3 quietly won safety

I didn't see this one coming. On the three safety prompts — kid asks about something scary or inappropriate — DeepSeek-V3-0324 won 2 out of 3. Qwen-coder won the third. No other model placed first on a safety test.

At OpenRouter prices ($0.20/$0.77 per million tokens), a safety response costs about $0.0001 per call. A hundredth of a penny. The "local is free, cloud is expensive" argument barely applies here.
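The per-call figure can be reproduced with a quick calculation. Two assumptions are baked in: that $0.20 is the input price and $0.77 the output price, and that a safety prompt runs to about 300 input tokens, which is a made-up number for illustration. The 35 output tokens is DeepSeek-V3's average from the results table.

```python
INPUT_PRICE_PER_M = 0.20   # USD per million input tokens (assumed to be the first figure)
OUTPUT_PRICE_PER_M = 0.77  # USD per million output tokens (assumed to be the second figure)


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at the prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Hypothetical 300-token prompt, 35-token reply (the table's average for DeepSeek-V3).
cost = call_cost(300, 35)
print(f"${cost:.6f} per call, about {1 / cost:,.0f} calls per dollar")
```

Under those assumptions a call costs a little under $0.0001, so a dollar buys on the order of ten thousand safety responses.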

But there's a reason to keep safety local anyway: if a kid says something genuinely concerning, that utterance should not leave the device. The benchmark measures response quality. It can't measure privacy. So the safety rule stays local, with DeepSeek-V3 documented as the escalation path if quality becomes a problem in practice.

What changes in Blip's routing

Safety: stays local (qwen-coder-32b) for privacy. DeepSeek-V3-0324 outperformed it in the benchmark, and that's documented.
Math: reconsidering. MiniMax-M2.5 won 2 of 4 math tests. Qwen-coder won zero. Math isn't privacy-sensitive — escalating to cloud MiniMax-M2.5 is defensible. But I'm waiting for the third run with Blip-tuned qwen-coder settings before I commit.
Voice_quality: stays local. Three-way tie among qwen-coder, DeepSeek-V3, and qwen3.5-abliterated. No clear winner, and on-device is free.
Creative, emotional, greeting, multi-turn, trivia, spelling: cloud. The Claude family, DeepSeek, and GLM-5 split these. No local model competes.
Opus-distilled Qwens: removed from the default pool. Not coming back without a strong hypothesis for why they'd score differently on a new task.
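Expressed as data, the routing rules above look something like this. It is a sketch, not Blip's actual configuration: the ROUTES table, the category spellings, and the route function are hypothetical, and math is shown as local only because no change has been committed yet.

```python
# Hypothetical routing table mirroring the rules above.
ROUTES = {
    "safety": "local",         # privacy: concerning utterances stay on device
    "math": "local",           # under review pending a third qwen-coder run
    "voice_quality": "local",
    "creative": "cloud",
    "emotional": "cloud",
    "greeting": "cloud",
    "multi_turn": "cloud",
    "trivia": "cloud",
    "spelling": "cloud",
}


def route(category: str) -> str:
    """Return 'local' or 'cloud'; unknown categories raise rather than guess."""
    try:
        return ROUTES[category]
    except KeyError:
        raise ValueError(f"no routing rule for category {category!r}") from None
```

Raising on an unknown category is a deliberate choice in this sketch: a silent default would be the same kind of quiet fallback that hid the judge failure earlier.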

Open threads

A third judge would help. Two judges flag disagreements but can't break ties. A third — maybe a GPT-4-class model, maybe something smaller like Llama-3.3-70B — would let me score by majority rule on contested dimensions and get a cleaner signal.
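One way to approximate majority rule with three judges is to take the median on contested dimensions and keep the mean elsewhere. This is only a sketch of that idea, not something the harness does today; the function name and the reuse of the two-point threshold are assumptions.

```python
from statistics import mean, median


def resolve_dimension(scores: list, threshold: float = 2.0) -> float:
    """Combine three judges' scores for one dimension.

    Contested (spread >= threshold): use the median, so one outlier judge
    cannot drag the result. Uncontested: use the plain mean.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three judge scores")
    contested = max(scores) - min(scores) >= threshold
    return median(scores) if contested else mean(scores)
```

With hypothetical scores of 9, 6, and 6, the median gives 6 where a mean would give 7, so the two judges who agree carry the dimension.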

The qwen-coder question needs a definitive answer. 5 wins to 2 wins is probably mostly calibration, but "probably" isn't good enough to route real traffic on. One more run with temperature 0.5 and default top_p should settle it.

And then there's the 32.5% disagreement rate hanging over all of this. The difference between 4 wins and 5 wins might be noise. I'm treating rank groups — top tier: Opus; next tier: Sonnet, GLM-5, DeepSeek-V3, MiniMax; bottom: everything else — as the real signal, and individual win counts within a tier as roughly interchangeable.

The whole run took about 50 minutes and cost roughly $3.70 — $0.22 for inference, $3.50 for the two judges running in sequence. Judging is 94% of the total cost, and that's without a third judge. If this becomes a regular thing — running against new releases, new task suites — the judging cost is the number to shrink. One idea I keep coming back to: skip the LLM judge entirely for the subset of tests where must_include criteria are deterministic, and use mechanical scoring there instead.
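Mechanical scoring for those tests could be as simple as a substring check. A minimal sketch, assuming must_include is a list of required strings per test; the function name and the case-insensitive matching are illustrative choices, not the harness's behavior.

```python
def mechanical_score(response: str, must_include: list) -> float:
    """Fraction of required terms found in the response, ignoring case."""
    if not must_include:
        raise ValueError("must_include is empty; nothing to score mechanically")
    text = response.casefold()
    hits = sum(1 for term in must_include if term.casefold() in text)
    return hits / len(must_include)
```

A check like this costs nothing to run, but it only answers whether required content is present. Tone and age-appropriateness would still need a judge.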

The hybrid routing is live. Blip is answering real kid questions through it today. None of this invalidates the architecture — it just tunes which model answers which question. Which is the whole point of running the benchmark in the first place.