26 Models, Half a Crash, and One Clear Winner
I got greedy. Twenty-six models in one benchmark run — local Ollama, vLLM BF16, OpenRouter cloud, Bedrock Claude, five CHILDES fine-tunes, two GLM-Z1 LoRAs, and a 142 GB Qwen3-235B running via mmap across the GPU and system RAM. I wanted one clean leaderboard that ranked everything I'd built and pulled over the past week.
I got about half of one.
What broke
Fifteen of twenty-eight judge calls came back with parse errors. Over half the test scores are missing. Several models (qwen3-235b, gemma4-31b, qwen3-32b, qwen3-30b-a3b) emitted zero output tokens in their responses, which means they either timed out or never loaded. Average response time for those models: 0.5 seconds. That's not inference. That's Ollama failing to load a 20 GB model because the previous one hasn't finished unloading from VRAM.
The inference box has 96 GB of VRAM. vLLM was sitting on 74 GB of it. That leaves 22 GB for Ollama to cycle through 20+ models, some of which are 19-20 GB each. The math was never going to work. I knew that going in — I'd seen the cold-load cycling in earlier runs — but I figured "more models, more data, let the slow ones be slow." What I didn't account for was that Ollama just gives up if it can't allocate enough memory for the model, and returns an empty response fast enough that the runner logs it as "done" without flagging the failure.
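A guard along these lines would have flagged the empty responses instead of logging them as "done". This is a minimal sketch, not the actual runner code: `ModelResult` and `classify` are hypothetical names, and the sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """One model response as a runner might record it (hypothetical shape)."""
    model: str
    output_tokens: int
    elapsed_s: float


def classify(result: ModelResult) -> str:
    """Label a response so an empty one never counts as a success."""
    if result.output_tokens == 0:
        # Zero tokens means there is nothing to judge,
        # no matter how quickly the call returned.
        return "failed_empty"
    return "ok"


# Illustrative sample: one empty response, one normal response.
results = [
    ModelResult("qwen3-235b", output_tokens=0, elapsed_s=0.5),
    ModelResult("blip-edu:v2", output_tokens=44, elapsed_s=0.2),
]
failed = [r.model for r in results if classify(r) != "ok"]
print("failed to respond:", failed)
```

The point of the design is that "returned quickly" and "returned something" are checked separately, so a fast empty reply can't masquerade as a completed call.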
Should have stopped vLLM first. Didn't. That's on me.
GLM-5.1 didn't care about any of that
Four wins out of thirteen scored tests. Average rank 5.5 out of 26. First time in the benchmark, added via OpenRouter at $1.26/$3.96 per million tokens — the most expensive cloud model in the lineup by a factor of two, and it earned it.
It won both math tests outright. Beat Claude Opus on math. Won a greeting test and a voice-quality test. Z.ai's GLM-5.1 is the successor to GLM-5, which has been a steady 3-4 win performer across the whole series. The successor is better. Whether it's $1.26-per-million-tokens better than GLM-5 at $0.72 — I'd want a clean full run before making that call.
The CHILDES variants showed up
Two of the five new CHILDES-based fine-tunes scored wins. blip-edu-ab-stories took a spelling test and a voice-quality test — 2 wins, which ties it with GLM-5 (cloud) on this run. blip-edu-ab-childes won a creative test.
These are Qwen2.5-7B LoRAs trained on a blend of real child-language data from the CHILDES corpus plus synthetic Claude-generated examples. A different dataset philosophy than my v1 and v2 blip-edu variants, which used purely synthetic data. The fact that both approaches produce models that win tests, in different categories, is actually the most useful finding from this run, even with the infrastructure problems.
Is real child-language data in the training mix the thing that makes the difference? Can't say from one partial run. But it's the first hypothesis I'd test in a clean re-run.
The 235B scored nothing
Zero wins. Average rank 22.1 out of 26. Zero output tokens. The model I spent two days pulling (142 GB download), benchmarking for TTFT (117ms — genuinely fast), and writing an entire brainstorm design around didn't produce a single scoreable response in the full benchmark run.
The standalone serving tests from last night — 8.67 tok/s, sub-200ms TTFT, clean kid-tutor responses — were real. Those ran with the full 96 GB VRAM available. Today's benchmark ran with vLLM eating 74 GB. The 235B needs the mmap split to work (96 GB GPU + 46 GB CPU RAM), and with only 22 GB of VRAM free, the model couldn't load enough layers onto the GPU to function.
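The arithmetic behind that split, as a rough sketch. It counts weights only and ignores KV cache and runtime overhead, so the real numbers would be somewhat worse; `split` is a hypothetical helper, not anything Ollama exposes.

```python
MODEL_GB = 142          # Qwen3-235B size from this post
GPU_TOTAL_GB = 96       # VRAM on the inference box
VLLM_RESERVED_GB = 74   # what vLLM was holding during the run


def split(model_gb: float, gpu_free_gb: float) -> tuple[float, float]:
    """Return (GB placed on GPU, GB spilled to system RAM), weights only."""
    on_gpu = min(model_gb, gpu_free_gb)
    return on_gpu, model_gb - on_gpu


standalone = split(MODEL_GB, GPU_TOTAL_GB)
shared = split(MODEL_GB, GPU_TOTAL_GB - VLLM_RESERVED_GB)

for label, (gpu, ram) in [("vLLM stopped", standalone), ("vLLM running", shared)]:
    print(f"{label}: {gpu} GB on GPU ({gpu / MODEL_GB:.0%}), {ram} GB in RAM")
```

With vLLM stopped, roughly two thirds of the weights sit on the GPU. With vLLM running, it's closer to one sixth, and 120 GB would have to come from system RAM.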
The 235B can work locally. It just can't share the GPU with vLLM. I need to either run it in a dedicated benchmark pass with vLLM stopped, or accept that the 235B is a "stop everything else, load this one model, run inference" kind of setup. Not great for a production router. Fine for a quality-tier fallback when latency isn't the constraint.
The partial leaderboard
Thirteen tests scored cleanly. The rest are noise. I'm publishing this table with that caveat — treat it as directional, not authoritative.
| Rank | Model | Wins | Avg rank | Avg time | Output tokens |
|---|---|---|---|---|---|
| 1 | GLM-5.1 (cloud, new) | 4 | 5.5 | 4.7s | 203 |
| 2 | GLM-5 (cloud) | 2 | 5.7 | 7.0s | 125 |
| 2 | blip-edu-ab-stories (mine, new) | 2 | 9.9 | 1.3s | 45 |
| 4 | Claude Opus | 1 | 6.9 | 4.3s | 47 |
| 4 | MiniMax-M2.5 | 1 | 8.0 | 3.8s | 158 |
| 4 | blip-edu:v2 (mine) | 1 | 9.9 | 0.2s | 44 |
| 4 | GLM-Z1-32B (local) | 1 | 10.0 | 25.6s | 244 |
| 4 | blip-edu-ab-childes (mine, new) | 1 | 10.7 | 2.2s | 46 |
| 9 | Claude Sonnet | 0 | 6.9 | 2.5s | 46 |
| 10 | vLLM qwen-coder-32b-fp8 | 0 | 8.2 | 2.1s | 42 |
| 11 | DeepSeek-V3-0324 | 0 | 8.2 | 4.3s | 41 |
Models with 0 output tokens omitted — they failed to load, not failed to compete. Full 26-model data in the report HTML.
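The table's order looks like wins descending, then average rank ascending, with zero-token models filtered out first. A sketch of that filter-and-sort follows. The sort key is inferred from the table, not taken from the benchmark code, and the row tuples are a hand-copied subset.

```python
# (model, wins, avg_rank, output_tokens), copied from this post
rows = [
    ("GLM-5 (cloud)", 2, 5.7, 125),
    ("Claude Sonnet", 0, 6.9, 46),
    ("GLM-5.1 (cloud, new)", 4, 5.5, 203),
    ("Claude Opus", 1, 6.9, 47),
    ("qwen3-235b", 0, 22.1, 0),  # zero output tokens: failed to load
]

# Drop models that never produced output, then sort by
# most wins first and best (lowest) average rank second.
scored = [r for r in rows if r[3] > 0]
leaderboard = sorted(scored, key=lambda r: (-r[1], r[2]))

for model, wins, avg_rank, _tokens in leaderboard:
    print(f"{model}: {wins} wins, avg rank {avg_rank}")
```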
What I'm doing differently next time
Stop vLLM before the run. Free all 96 GB for Ollama. The 26-model config is fine — the models just need room to load and unload. With the full GPU available, even the 235B should get a fair shot (it proved it can serve at 8.67 tok/s in the standalone test).
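A pre-flight check could enforce this before any model gets loaded. The sketch below shells out to `nvidia-smi` with its `--query-gpu=memory.free` CSV query, which reports MiB per GPU. The function names and the 90 GB threshold are assumptions for illustration, not part of the existing runner.

```python
import subprocess

REQUIRED_FREE_GB = 90  # hypothetical threshold; tune to the largest model in the run


def parse_free_mib(csv_text: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]


def query_free_gb() -> float:
    """Ask nvidia-smi for free memory summed across GPUs, converted MiB -> GiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(parse_free_mib(out)) / 1024


def preflight(free_gb: float, required_gb: float = REQUIRED_FREE_GB) -> None:
    """Abort before the run if another process is still holding VRAM."""
    if free_gb < required_gb:
        raise RuntimeError(
            f"only {free_gb:.0f} GB VRAM free, need {required_gb} GB; stop vLLM first"
        )
```

Calling `preflight(query_free_gb())` at the top of the benchmark script would turn a run like this one into an immediate error instead of half a leaderboard.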
The judge parse errors are a separate problem. Fifteen failures out of twenty-eight means either the prompt is too long for 26-model comparisons (each model's output gets pasted into the judge prompt, and at 26 models that's a lot of context), or one of the two judges is choking on the volume. I need to check whether the failures are all from Opus, all from DeepSeek-V3, or split. That tells me whether to increase max_tokens on the judge call or split the scoring into smaller batches.
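Both halves of that check are easy to sketch: tally parse failures per judge, and chunk the model outputs into smaller groups per judge call. The log record shape, the sample records, and the batch size of 9 are all hypothetical.

```python
from collections import Counter

# Hypothetical log records: (judge, test_id, parsed_ok)
judge_calls = [
    ("opus", "math-1", True),
    ("deepseek-v3", "math-1", False),
    ("opus", "greeting-1", False),
    ("deepseek-v3", "greeting-1", False),
]


def failures_by_judge(calls):
    """Map each judge to (parse failures, total calls)."""
    totals, failures = Counter(), Counter()
    for judge, _test_id, parsed_ok in calls:
        totals[judge] += 1
        if not parsed_ok:
            failures[judge] += 1
    return {judge: (failures[judge], totals[judge]) for judge in totals}


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(failures_by_judge(judge_calls))
print([len(b) for b in batches(list(range(26)), 9)])
```

One caveat on batching: if the judge ranks models within a batch, ranks from different batches aren't directly comparable, so the scoring would need something like a shared reference response in every batch.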
GLM-5.1 gets to stay in the lineup. Four wins on a partial run is enough signal to justify the higher token cost. The CHILDES fine-tunes get a proper evaluation in the clean re-run — they deserve a fair comparison against the v1/v2/coder variants on the same set of scored tests. And the 235B gets its own dedicated pass, vLLM off, full VRAM, because the latency data from last night says it can compete if you give it the resources.
Cost of this run: about $5 in cloud judging, plus whatever the failed calls cost (probably another $2 in wasted Opus tokens on prompts that didn't parse). Total series spend is now around $31 over five days. Most of that went to the judge, not the inference. Still cheaper than one month of a junior engineer's time, and I have twelve benchmark runs, nine blog posts, five trained models, and one very clear picture of what works and what doesn't.