Skip to content

I Had a Council of AIs Grade Itself. Here's What It Said.

The first production run crashed into the wall I knew was there. Five models voting on fifteen claims, and the local Qwen returned zero votes. Just — nothing. The other four covered it, aggregation kept working, and the session produced a useful report. But one of my voters sat silent, and the numbers show exactly why the thing survives when that happens.

I built an AI Council. Five LLMs run in parallel on a topic, critique each other's responses anonymously, then a claim-extractor pulls their discussion into 8-15 atomic statements and every model votes on every statement. One editor model writes a narrative grounded in the vote outcomes. No single model gets to decide anything on its own.

The idea isn't mine. Justin Zhao et al. published "Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks" at NAACL 2025 — 20 models judging each other's emotional-intelligence responses, with Bradley-Terry pairwise aggregation and per-judge bias normalization. Andrej Karpathy had sketched the concept earlier; burtenshaw turned it into a Hugging Face Space. What I wanted was different: not a leaderboard across models, but a deliberation engine for strategic questions — hiring, product direction, clinical protocol calls — where one model's quiet blind spot could send me the wrong way.

I spent about fourteen hours on the whole thing, end to end. Brainstorm, spec, 15-task plan, fifteen subagent-driven implementation rounds with two-stage code review per task, three benchmark rounds to tune per-model settings, one mid-build design pivot, two live runs, and a WhatsApp delivery path I had to entirely rewrite because the original code was pointing at the wrong gateway. The whole thing is live now, producing a council-YYYY-MM-DD-HHMM.md narrative plus a .votes.json structured artifact to every session's output directory.

The pipeline

Seven stages, counting the aggregation step:

  1. Stage 0 — live research. PubMed for clinical topics, arXiv and Hacker News for tech, Tavily and Brave otherwise. Routed by keyword classification on the topic text.
  2. Stage 1 — N independent opinions, parallel. Each model sees the research brief and writes a 400-600 word analysis.
  3. Stage 2 — anonymized peer review. Every model reviews every other model's Stage 1 output, with identities hashed to Advisor-XXXX via MD5 so nobody knows whose text they're grading. This is lifted from LMSYS-style pairwise evals — you get less deference when models can't tell each other apart.
  4. Stage 2.5 — claim extraction. Claude reads Stages 1 and 2 and writes out 8-15 atomic claims, each tagged as recommendation / risk / factual_claim / opportunity. Prompt enforces at least one risk claim. Prevents the rosy-bias drift where extractors only surface the positive recommendations and leave the concerns buried in prose.
  5. Stage 2.75 — democratic vote. Every council model votes on every claim with a 5-tier enum (strongly_approve / approve / abstain / reject / strongly_reject), plus a 0-10 confidence and a one-sentence reason. Parallel. Each voter gets the claim list in a different randomized order, seeded off a stable MD5 of the model name so the benchmark is reproducible across runs.
  6. Aggregate — pure Python. Per-voter bias normalization (each voter's session mean subtracted from each of their vote integers, stored as a diagnostic field). Raw means drive tier assignment: HIGH if mean ≥ 1.0 and std_dev < 0.8, CONTESTED if std_dev ≥ 1.5, REJECTED if mean < 0 and std_dev < 1.5, else MEDIUM.
  7. Stage 3 — editor narrative. Claude (or the preset chairman if no Claude) writes a 6-section report that describes the vote outcome. Hard prompt constraint: do not introduce new recommendations. The editor is a reporter, not a decider.

The insight that sold me on the architecture isn't the voting itself — it's that the chairman stops being a single point of failure. My original pipeline had one model synthesizing everything; whoever wrote the final report got to pick what made the cut, and whatever biases that model carried bled straight into my decision. Making the chairman an editor tied to the vote log means a contested item can't be quietly buried. If the council split, the narrative has to say so.

The normalization pivot

Midway through implementation I hit an inconsistency in my own spec. I'd written that per-voter bias normalization feeds the aggregate — subtract each voter's mean from each of their votes before computing the claim-level mean, the same council-normalization trick from the Zhao paper's Bradley-Terry math. The implementer built it. The tests passed. And then five of eleven tier-classification tests failed because "all-approve → HIGH" is not how normalized scores behave.

Think about it. If every voter votes strongly_approve on every claim, each voter's mean is +2. After subtraction each normalized vote is 0. The normalized aggregate for any given claim is zero, not +2. The HIGH threshold (mean ≥ 1.0) is unreachable when voters don't differentiate between claims. The feature that was supposed to correct for harsh voters was erasing the signal from unanimous ones.

Three options: (a) update all my tests and thresholds to reflect normalized-score semantics, (b) keep normalization but only for tier computation, (c) compute both — tier on raw scores, store normalized as a diagnostic field for later analysis.

I went with C. The reasoning was: with only 3-5 voters, per-voter bias estimates are noisy anyway. Normalized scores belong in the .votes.json artifact so I can look at trends across weeks of runs and decide later whether to promote them to tier computation. Meanwhile, the intuitive semantic of "all-approve → HIGH" stays intact. I'd rather ship with a feature demoted to diagnostic than ship something that surprises me the first time it runs.

The benchmark that found my bugs

I didn't trust the per-model settings I'd drafted from docs. DeepSeek R1's model card explicitly warns against temperature=0 (endless repetitions). Qwen3's official guidance recommends 0.7 for non-thinking mode but community discussions report the /no_think flag is unstable. Gemini 2.5 Pro can't fully disable thinking. None of that synthesizes cleanly into one global temperature, so I wrote a benchmark harness that runs every model through the vote stage five times against a frozen fixture and reports format validity, self-consistency, and latency.

First run took 51 minutes across 56 runs. Four models hit 100% format validity — Claude, qwen-openclaw (a custom Qwen3 fine-tune), cloud DeepSeek-R1, Grok-3. Three failed hard.

Gemini 2.5 Pro at 0% validity. Every single run returned empty content. I'd written a passthrough for thinking_budget: 128 using body["provider"] = {"google": {"thinking_budget": settings["thinking_budget"]}} — and OpenRouter was rejecting that outright with "Unrecognized key 'google'". Two problems: the correct OpenRouter syntax is body["reasoning"] = {"max_tokens": N} at the root level, and Gemini's minimum reasoning budget is 1024 tokens, not 128. My value was below the floor and my placement was wrong. After removing the broken parameter entirely, Gemini hit 100%. That's my bug, not Google's.

qwen3.5:27b at 60% validity. Three of five vote calls at temperature=0.7 returned zero parsed votes. Lowering to 0.5 plus adding Ollama's native "format": "json" grammar-sampling parameter bumped it to 80% in one round and 60-80% in a second (the metric is genuinely noisy at this sample size — small sample plus model stochasticity). Still below my 95% acceptance threshold. I left it in the council anyway. The aggregation is N-1 resilient, and I'd rather have a diverse voter bench with one flaky member than a homogeneous one.

GLM-4.5-Air at 0% validity. This one took the longest to figure out. A direct curl to Ollama worked — returned 5171 characters of content in 4 minutes. The benchmark path returned 0 content in 0.2 seconds. After tracing through _dispatch_call step by step and finally running it through a Python reproducer that imported the benchmark's own functions, I caught the str(asyncio.TimeoutError()) trick — Python's default string rep for that exception class is empty. My benchmark's except Exception as e: error = str(e) was silently recording an empty-string error, which I'd been reading as "succeeded with no content" rather than "timed out at 300 seconds". Changed it to repr(e) or f"{type(e).__name__}(no message)" and the real problem surfaced immediately: GLM-4.5-Air with format: json enabled produces mid-response thinking tokens that corrupt the JSON grammar state. Without format: json, it works — but a 44 GB model with 47 s of cold-start load_duration plus ~7 tokens/sec of eval time can't finish a full vote response inside a 300 s budget.

I swapped GLM-4.5-Air out of the free preset. The replacement is stock qwen3.5:27b, which isn't dramatically better but at least fits the time budget. Model-family diversity took a hit — free preset now has two Qwen variants plus local DeepSeek-R1 — but shipping beats a theoretical leaderboard. I'll revisit if gemma4 or a different local model proves out.

The first real session

At 7:40 AM I ran the council on its own readiness question: "what are the 3 most important things to verify when the AI council voting pipeline first goes live?" Took 330 seconds. 15 claims extracted. 2 of 3 models voted (Local-Qwen silent again — the benchmark warning landed in production on the very first run). 13 claims landed HIGH, 2 MEDIUM, zero contested.

The 87% HIGH rate is a calibration problem I don't have a fix for yet. The spec's target distribution was roughly 30/40/20/10 across HIGH/MEDIUM/CONTESTED/REJECTED, and I'm way off. Tier thresholds are too loose. I'm leaving them for now — one session isn't enough to tune against, and an analytics script I wrote flags "HIGH > 60% consistently" as an adjust-signal that triggers after ~10 real sessions accumulate. The daily 2 AM cron will get me there inside two weeks.

The first substantive session — the one I actually wanted — came from feeding it a 650-line handoff document about rebuilding my voice assistant's pipeline (Moshi-based full-duplex → streaming ASR → LoRA-controlled token-emitting LLM → streaming TTS; different post). I condensed the plan into a ~1400 character topic and let the full 5-model preset chew on it for 5.5 minutes. Fifteen claims came out. The council's strongest warning (approval 1.75, confidence 9.2 of 10) was about migration safety: my Phase 5 had no canary deployment, no shadow-mode spec, no rollback trigger. The council's unanimous strongly_approve claim (approval 2.0 of 2.0, the maximum possible) was a shadow-run recommendation that I hadn't even put in the original plan.

That was the moment it earned its keep. Not the tier distribution, not the format validity, not the per-voter bias math — the fact that a council of five models, given a plan I'd spent real effort on, immediately found the most dangerous gap and proposed the safety mechanism that belonged there. Would I have gotten there anyway? Probably. Would I have gotten there in six minutes? I don't think so.

Asking the council to grade itself

I fed the council a 1764-character summary of its own design — stages, voting math, per-model benchmark results, the 100%-HIGH-tier problem, the qwen3.5 flakiness, the normalization pivot, the Karpathy / Zhao citations it descends from. Asked it to name three strongest design choices, three biggest weaknesses in its own aggregation, and three most valuable next steps. Told it to prefer sharp critique over polite hedging. Five models, six-minute session, fifteen claims extracted.

Twelve HIGH, three MEDIUM, zero contested, zero rejected. So the council — asked to diagnose why its 100% HIGH rate is a problem — produced an 80% HIGH rate. I laughed out loud when I saw that number. The editor's summary flagged the irony directly: "The session's most pointed finding is self-referential: 100% of claims landed HIGH tier, precisely mirroring the inaugural session's failure mode the council was convened to diagnose." Which is fair.

The tied-for-highest-confidence claim, unanimous strongly_approve from all four responding voters:

Equal-weight voting assigns the same decisional weight to qwen3.5:27b (60-80% format validity) as to Claude or Gemini (100% format validity), treating reliability differentials as irrelevant. [approval: 2.00/2.00, confidence: 9.5/10]

Translation: my council thinks its own flaky voter is counting too much. And yeah. The second 2.00/2.00 claim was the session's self-burn: the tier system produced zero discriminating signal in its first real run. Council knows its tiers are broken.

Three weaknesses the council surfaced that I hadn't put in my own plan:

  • Shared-corpus collusion (single-source insight, approved). "When 6 of 8 council models agree on a claim, that agreement may reflect shared pretraining corpora rather than independent epistemic convergence, and the pipeline currently has no mechanism to detect or flag this model-collusion risk." One advisor raised it. The rest affirmed it. The fact that it was a unique insight about undetectable convergence — and nobody else independently surfaced the same concern — is itself a mild illustration of the gap it describes.
  • Editor narrative has no oversight. "If a single model writes the narrative, it can systematically reframe or soften the democratic voting output in ways that override the council's signal without detection." [1.50 approval, 8.0 confidence]. I read the editor's output every morning. Nobody audits it against the raw vote data. That's a real gap.
  • Don't tune thresholds on 15 claims. "Proposing specific new threshold numbers based solely on the 15-claim inaugural session risks overfitting those thresholds to an unrepresentative sample before adversarial calibration is performed." [1.75, 8.5]. I read that one twice. The council was telling me not to turn the knobs I was planning to turn based on the data I had. Which is exactly what my Phase 3 runbook said to do.

qwen3.5:27b voted zero times in this session too. Third live run in a row, third time silent. That's not volatility. That's a pattern I should stop pretending is noise.

What's next

Based on what the council said, not what I planned Monday:

Calibration corpus before threshold tuning. The council's top recommendation: build 20-30 claims with known controversy levels and tune tier boundaries until known-controversial claims reliably score CONTESTED. Not pick-a-number-and-see. My runbook was going to eyeball the thresholds at week 3 based on whatever sessions accumulated. That approach was lazy and the council told me so. I need a fixture first, generated from real disputed topics where I already know which way the verdict should swing.

Synthetic canary claims embedded in every session. One or two claims per run with pre-established expected tier outcomes. If a canary claim that should land CONTESTED lands HIGH instead, the tier math has drifted and the alert fires. Cheap to add — one extra claim per session, chosen ahead of time from a fixed set. Probably the lowest-effort-per-value item on this list.

Reliability-weighted voting, capped at 3x. Each model's vote weight becomes its rolling format-validity rate over the most recent 10 sessions. This was already in my Phase 3 plan. What the council added: cap weight differentials so no single model can dominate. Without the cap, Claude and Gemini end up with ~1.65x the weight of qwen3.5 and the whole thing collapses into a 2-model oligarchy that reads like an ensemble but isn't. The 3x ratio preserves epistemic diversity. I wouldn't have thought of the cap on my own — that's the council earning its keep.

Editor audit mechanism. This one I don't have a clean answer for. Options the council surfaced: diff the editor's tier summary against the raw aggregated data to catch systematic softening; run a second blind editor in parallel and compare outputs; constrain the editor to a structured-output schema that forces specific fields. I lean toward option one — programmatic consistency checks between editor narrative claims and the .votes.json ground truth. Need to think about this more before coding. Running two editors feels like over-engineering for a $0.15 session.

Collusion detection. The hardest one, and I don't have a working idea. Pairwise vote-disagreement rate across sessions might proxy independence — if two models always agree on every claim, their responses are suspiciously correlated even when both are technically "right". Fleiss' kappa or Krippendorff's alpha would give a formal inter-rater agreement number. Both are noisy at three voters though, and with five voters and 15 claims per session the signal-to-noise is marginal. I'll need 30+ sessions to even start measuring meaningfully. Park it for now, revisit in Phase 4.

Three of the five items were already on my list. Two — editor audit and collusion detection — were not. That's where the council paid for itself on this run. It found two holes in my plan that I hadn't seen, and flagged them specifically enough that I know what to build.

One thing I keep coming back to

The council is at its most useful when it catches me doing something I was already about to do. The Sally voice-pipeline review yesterday found a migration-safety gap in a plan I'd been staring at for six hours. This self-review told me not to tune thresholds from 15 data points. If I'd shipped Phase 3 the way I'd written it Monday, I would've hardcoded tier thresholds at week 3 against maybe fourteen real sessions and only noticed the overfitting months later. Or never.

That's the value I didn't expect. Not the 100% HIGH tier. Not the per-voter bias normalization diagnostic. A council of five models, given a thing you already have opinions about, will tell you the opinions you should have had.

It costs about fifteen cents a session.

References