Skip to content

The Smaller Model Won, and the 70B Ran at 1.6 Tokens Per Second

I was benchmarking three local models to pick a voice-assistant fallback and all three returned 8-second first-token latency. Eight seconds. For "yes." My first thought was the models were broken. They weren't. My GPU was just eating itself.

Three services on the inference box had each pinned their own model in VRAM with keep_alive=-1, each assuming they were the only caller, and Ollama was silently juggling them in and out of the RTX PRO 6000's 96 GB of memory. The benchmark itself triggered a fourth load on top. Nothing was broken — they were just all fighting for the same GPU. So I built a router, A/B/C tested the three candidates properly, and the smallest one won. Story below.

What I was testing, and why

Sally — my voice assistant — runs on Bedrock Sonnet 4.6 in the cloud. When Bedrock is healthy she sounds smart. The question I was testing: when Bedrock fails (5xx, 3-second timeout, network partition), which local model takes over? It has to be fast enough that the handoff doesn't sound like she died. Doesn't have to be brilliant — I already have brilliant. Just needs to keep the conversation going for the <1% of turns where the cloud path drops.

Three candidates:

  • A — qwen-openclaw:latest. 32.8B Q4_K_M. Qwen3 family. Already the current primary. Proven.
  • B — llama3.3:70b-instruct-q3_K_M. 34 GB on disk. The "bigger is smarter" pick the AI Council talked me into considering.
  • C — glm-4.7-flash:q4_K_M. 30B MoE. "Flash" in the name. Speed-tuned.

Llama 3.3 70B Q4_K_M doesn't fit — it wants 114 GiB of system RAM and the box only has 105 GiB available. I downgraded to Q3_K_M so it would load at all. That matters for the result later.

The router came first

I couldn't trust any benchmark run on the inference box because every service on that box was fighting for VRAM. GLM-Z1 was pinned forever from a test I forgot to unload. OmniVoice TTS was live. Parakeet STT was live. IndexTTS was live. When I ran the benchmark, whatever I was loading would push something else out, and whatever I called next would reload from disk. Nothing showed the real first-token latency — everything showed load-from-disk time.

So I stood up LiteLLM as a router on the host box. One endpoint. Named tiers — sally-primary, sally-fallback, openclaw-default, local-default — mapped to concrete backends (Bedrock, Ollama, eventually vLLM). Every caller now goes through one door. When I change the mapping, every service sees the change without a redeploy. And — the thing I care about most — the local-* namespace is guaranteed never to route to cloud, so FlexStudio, ComfyUI, and the LoRA trainer can use it without worrying about privacy leakage. The router config has a CI-style assertion at the bottom of the file: any entry with the local- prefix that points at a non-local endpoint is a privacy bug.

Building that took longer than I planned, because LiteLLM did two things I didn't expect. First, it bailed at startup saying bedrock_converse/ isn't a real provider name — the right prefix is just bedrock/ even when you want the Converse API. Second, its Ollama adapter returned completely empty content for Qwen3, eating the response silently. Direct Ollama calls returned "Pong. 🏓". The adapter was dropping thinking-mode tokens and never recovering. I swapped to Ollama's OpenAI-compatible endpoint (/v1/chat/completions) and the problem disappeared. LiteLLM version 1.83.9, for the record. Might be fixed by the time you read this.

The benchmark

Five voice-representative prompts, each model freshly loaded, router-warmed once, then measured. System prompt is Sally's real one — "1-2 short sentences, under 25 words, no markdown." Numbers below are cold cold load in milliseconds, plus per-prompt time to first token (TTFT) and total response time, plus tokens per second.

ModelCold loadTTFT p50TTFT maxTotal p50Tok/sec
GLM-4.7-Flash Q4_K_M (30B MoE)6.7 s119 ms133 ms317 ms67.6
qwen-openclaw:latest (Qwen3 32B)19.2 s138 ms*55.4 s*470 ms*62.1
Llama 3.3 70B Q3_K_M28.5 s3.6 s49.5 s7.7 s1.6

* Qwen3 went bimodal — two of five prompts were fast (119-138 ms), three blew up to 53-55 seconds. More on that below.

GLM-4.7-Flash is the winner and it's not close

Every prompt. 105-133 ms TTFT. 65-70 tokens per second. Answers sounded like a person, not a reasoning model showing its work. It handled the refusal case cleanly — when I asked it to play music, it said "I can't do that, Barry. I'm just a text-based AI. But I can..." It knew what it was. Didn't fake a tool call, didn't apologize five ways.

Qwen-openclaw was second on the prompts that landed fast, but three of its five prompts took 53-55 seconds. I think I know why. Qwen3 has thinking mode baked in. Ollama accepts a think: false field, but I'm not sure every code path honors it. On those three prompts it looks like Qwen either burned the token budget on a hidden reasoning block or got evicted from VRAM between turns despite my keep_alive=-1. Inconsistent latency is worse than slow latency when you're on a phone call — you can't plan around it. Out.

Now the 70B

1.6 tokens per second. One point six. On an RTX PRO 6000 Blackwell with 96 GB of VRAM. A Q3-quantized 70B should hit 15-25 tok/s on that card if it's loaded into GPU memory properly. 1.6 tok/s means it was running substantially on the CPU.

I think what happened: the box's swap was 100% full when I started the afternoon. I fixed it — added a 16 GB swapfile, set vm.swappiness=10 — but by the time I pulled the 70B Q4 (failed, needed 114 GiB RAM), deleted it, pulled the Q3 (fit), and smoke-tested, some combination of fragmented memory, kernel page cache pressure, and Ollama's layer-offload heuristic decided the safer path was to put the attention layers on the CPU. I didn't verify this with a nvidia-smi probe during the hot loop, so this is a guess — but 1.6 tok/s is the number you'd expect from a CPU-heavy 70B, not a GPU one.

Either way: the practical verdict is the same. On this hardware today, Llama 3.3 70B Q3 is not a viable fallback. Not because the model is bad. Because the system around the model can't load it reliably into the hot path.

What the AI Council got wrong

I ran two council sessions before I pulled any weights — five-model voting panels, structured claim extraction, the whole thing. The council locked in Llama 3.3 70B Q6_K as the recommended primary for a later vLLM migration, confidence 8.4. It reasoned from vLLM's stable llama3_json tool-call parser and the proven pedigree of the Llama 70B series. Both defensible. Both correct, even.

What the council didn't weight heavily enough: the room-temperature reality of my inference box. GLM-4.7-Flash serves first tokens in 120 ms on this hardware today, and Llama 3.3 70B — even if we get it onto vLLM and it properly sits in VRAM — still has to compete with the STT, TTS, and memory-encoding services for the same GPU. Flash is 19 GB loaded. Llama Q6 is 57 GB. That difference isn't intelligence. That difference is every other service on the box.

I'll still put vLLM + Llama Q6 in the roadmap for the "thinking on demand" tier — when you specifically ask Sally to reason about something hard, off-cloud. But the fallback tier — the one that fires when Bedrock is having a bad day — that's Flash. Shipping it to Sally's config tonight.

The lesson I keep relearning

Benchmarks lie when you don't control the environment. My first run last night showed every model at 8-second TTFT and I thought I'd broken something. I hadn't broken anything. I was just reading noise. Same thing bit me on the calibrated benchmark in March — the thing I tested last wasn't the thing I thought I was testing, because something else in the environment had shifted.

If you're running local models on a shared GPU, the single most useful thing you can build isn't the next model. It's a router that stops three services from politely ignoring each other while they chew through each other's KV cache.

Config, router code, and raw benchmark JSON are in ~/litellm-router/ and /tmp/benchmark-data.json on my machine. If you want to reproduce this and your GPU is different from mine, expect different results — that's the whole point.