The Smaller Model Won, and the 70B Ran at 1.6 Tokens Per Second
I was benchmarking three local models to pick a voice-assistant fallback and all three returned 8-second first-token latency. Eight seconds. For "yes." My first thought was the models were broken. They weren't. My GPU was just eating itself.
Three services on the inference box had each pinned their own model in VRAM
with keep_alive=-1, each assuming they were the only caller, and
Ollama was silently juggling them in and out of the RTX PRO 6000's 96 GB of
memory. The benchmark itself triggered a fourth load on top. Nothing was
broken — they were just all fighting for the same GPU. So I built a
router, A/B/C tested the three candidates properly, and the smallest one
won. Story below.
What I was testing, and why
Sally — my voice assistant — runs on Bedrock Sonnet 4.6 in the cloud. When Bedrock is healthy she sounds smart. The question I was testing: when Bedrock fails (5xx, 3-second timeout, network partition), which local model takes over? It has to be fast enough that the handoff doesn't sound like she died. Doesn't have to be brilliant — I already have brilliant. Just needs to keep the conversation going for the <1% of turns where the cloud path drops.
Three candidates:
- A — qwen-openclaw:latest. 32.8B Q4_K_M. Qwen3 family. Already the current primary. Proven.
- B — llama3.3:70b-instruct-q3_K_M. 34 GB on disk. The "bigger is smarter" pick the AI Council talked me into considering.
- C — glm-4.7-flash:q4_K_M. 30B MoE. "Flash" in the name. Speed-tuned.
Llama 3.3 70B Q4_K_M doesn't fit — it wants 114 GiB of system RAM and the box only has 105 GiB available. I downgraded to Q3_K_M so it would load at all. That matters for the result later.
The router came first
I couldn't trust any benchmark run on the inference box because every service on that box was fighting for VRAM. GLM-Z1 was pinned forever from a test I forgot to unload. OmniVoice TTS was live. Parakeet STT was live. IndexTTS was live. When I ran the benchmark, whatever I was loading would push something else out, and whatever I called next would reload from disk. Nothing showed the real first-token latency — everything showed load-from-disk time.
So I stood up LiteLLM as a router on the host box. One endpoint. Named tiers
— sally-primary, sally-fallback, openclaw-default,
local-default — mapped to concrete backends (Bedrock, Ollama, eventually
vLLM). Every caller now goes through one door. When I change the mapping,
every service sees the change without a redeploy. And — the thing I care about
most — the local-* namespace is guaranteed never to route to
cloud, so FlexStudio, ComfyUI, and the LoRA trainer can use it without
worrying about privacy leakage. The router config has a CI-style assertion at
the bottom of the file: any entry with the local- prefix that
points at a non-local endpoint is a privacy bug.
Building that took longer than I planned, because LiteLLM did two things I
didn't expect. First, it bailed at startup saying
bedrock_converse/ isn't a real provider name — the right prefix
is just bedrock/ even when you want the Converse API. Second,
its Ollama adapter returned completely empty content for Qwen3, eating the
response silently. Direct Ollama calls returned "Pong. 🏓".
The adapter was dropping thinking-mode tokens and never recovering. I swapped
to Ollama's OpenAI-compatible endpoint (/v1/chat/completions) and
the problem disappeared. LiteLLM version 1.83.9, for the record. Might be
fixed by the time you read this.
The benchmark
Five voice-representative prompts, each model freshly loaded, router-warmed once, then measured. System prompt is Sally's real one — "1-2 short sentences, under 25 words, no markdown." Numbers below are cold cold load in milliseconds, plus per-prompt time to first token (TTFT) and total response time, plus tokens per second.
| Model | Cold load | TTFT p50 | TTFT max | Total p50 | Tok/sec |
|---|---|---|---|---|---|
| GLM-4.7-Flash Q4_K_M (30B MoE) | 6.7 s | 119 ms | 133 ms | 317 ms | 67.6 |
| qwen-openclaw:latest (Qwen3 32B) | 19.2 s | 138 ms* | 55.4 s* | 470 ms* | 62.1 |
| Llama 3.3 70B Q3_K_M | 28.5 s | 3.6 s | 49.5 s | 7.7 s | 1.6 |
* Qwen3 went bimodal — two of five prompts were fast (119-138 ms), three blew up to 53-55 seconds. More on that below.
GLM-4.7-Flash is the winner and it's not close
Every prompt. 105-133 ms TTFT. 65-70 tokens per second. Answers sounded like a person, not a reasoning model showing its work. It handled the refusal case cleanly — when I asked it to play music, it said "I can't do that, Barry. I'm just a text-based AI. But I can..." It knew what it was. Didn't fake a tool call, didn't apologize five ways.
Qwen-openclaw was second on the prompts that landed fast, but three of its
five prompts took 53-55 seconds. I think I know why. Qwen3 has thinking mode
baked in. Ollama accepts a think: false field, but I'm not sure
every code path honors it. On those three prompts it looks like Qwen either
burned the token budget on a hidden reasoning block or got evicted from VRAM
between turns despite my keep_alive=-1. Inconsistent latency is
worse than slow latency when you're on a phone call — you can't plan around
it. Out.
Now the 70B
1.6 tokens per second. One point six. On an RTX PRO 6000 Blackwell with 96 GB of VRAM. A Q3-quantized 70B should hit 15-25 tok/s on that card if it's loaded into GPU memory properly. 1.6 tok/s means it was running substantially on the CPU.
I think what happened: the box's swap was 100% full when I started the
afternoon. I fixed it — added a 16 GB swapfile, set
vm.swappiness=10 — but by the time I pulled the 70B Q4 (failed,
needed 114 GiB RAM), deleted it, pulled the Q3 (fit), and smoke-tested, some
combination of fragmented memory, kernel page cache pressure, and Ollama's
layer-offload heuristic decided the safer path was to put the attention layers
on the CPU. I didn't verify this with a nvidia-smi probe during
the hot loop, so this is a guess — but 1.6 tok/s is the number you'd expect
from a CPU-heavy 70B, not a GPU one.
Either way: the practical verdict is the same. On this hardware today, Llama 3.3 70B Q3 is not a viable fallback. Not because the model is bad. Because the system around the model can't load it reliably into the hot path.
What the AI Council got wrong
I ran two council sessions before I pulled any weights — five-model voting
panels, structured claim extraction, the whole thing. The council locked in
Llama 3.3 70B Q6_K as the recommended primary for a later vLLM migration,
confidence 8.4. It reasoned from vLLM's stable llama3_json
tool-call parser and the proven pedigree of the Llama 70B series. Both
defensible. Both correct, even.
What the council didn't weight heavily enough: the room-temperature reality of my inference box. GLM-4.7-Flash serves first tokens in 120 ms on this hardware today, and Llama 3.3 70B — even if we get it onto vLLM and it properly sits in VRAM — still has to compete with the STT, TTS, and memory-encoding services for the same GPU. Flash is 19 GB loaded. Llama Q6 is 57 GB. That difference isn't intelligence. That difference is every other service on the box.
I'll still put vLLM + Llama Q6 in the roadmap for the "thinking on demand" tier — when you specifically ask Sally to reason about something hard, off-cloud. But the fallback tier — the one that fires when Bedrock is having a bad day — that's Flash. Shipping it to Sally's config tonight.
The lesson I keep relearning
Benchmarks lie when you don't control the environment. My first run last night showed every model at 8-second TTFT and I thought I'd broken something. I hadn't broken anything. I was just reading noise. Same thing bit me on the calibrated benchmark in March — the thing I tested last wasn't the thing I thought I was testing, because something else in the environment had shifted.
If you're running local models on a shared GPU, the single most useful thing you can build isn't the next model. It's a router that stops three services from politely ignoring each other while they chew through each other's KV cache.
Config, router code, and raw benchmark JSON are in
~/litellm-router/ and /tmp/benchmark-data.json on
my machine. If you want to reproduce this and your GPU is different from
mine, expect different results — that's the whole point.