I Tried to Use the TurboQuant Paper on My GPU. Here's What Actually Ran.
A few days ago, Google Research published a post about something called TurboQuant — a technique for compressing large language model KV caches to just 3 bits per value while preserving benchmark accuracy. I read it the morning it went up, got genuinely excited, pulled up the associated paper, looked for the code repository, and found nothing. No GitHub link. No implementation section. Just a footnote pointing to a preprint on arXiv and a note that it would be presented at ICLR 2026.
This is a familiar feeling. AI research moves on two tracks: what gets published and what actually runs on hardware you own. The gap between them can be months or years. TurboQuant is brilliant on paper. You cannot use it today.
But the ideas in it are worth understanding — partly because they're technically interesting, and partly because they illuminate what you can use right now if you're running a Blackwell GPU and want to push local model performance further. That's what this post is about.
What TurboQuant actually does
First, a clarification that isn't obvious from the headline: TurboQuant compresses the KV cache, not the model weights. These are different problems.
Weight quantization (GPTQ, AWQ, NVFP4) shrinks the model itself so it takes up less VRAM at rest. You load a smaller version of the model and it stays that size. KV cache quantization shrinks the memory that accumulates during inference — the "working memory" the model uses to keep track of what it's already processed. A 32,000-token context at BF16 creates a KV cache that can eat 16–32 GB on top of the model weights, depending on architecture. That's the problem TurboQuant is solving.
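To make the 16–32 GB figure concrete, here is a back-of-the-envelope calculator. The two model shapes are illustrative assumptions (full multi-head attention, roughly what 7B- and 13B-class models use), not measurements from a specific checkpoint. Models with grouped-query attention keep far fewer KV heads and land well below this range.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading 2 covers keys and values; bytes_per_value=2 is BF16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Hypothetical shapes, assuming every attention head keeps its own K/V.
small = kv_cache_bytes(tokens=32_768, layers=32, kv_heads=32, head_dim=128)
large = kv_cache_bytes(tokens=32_768, layers=40, kv_heads=40, head_dim=128)

print(f"32-layer model: {small / GIB:.1f} GiB")
print(f"40-layer model: {large / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, so doubling the context doubles the cache while the weights stay the same size.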
The technique is a two-stage pipeline. First, PolarQuant: rather than quantizing KV vectors in Cartesian coordinates (where values can be arbitrarily distributed and you need per-vector normalization constants to recover them), it applies a random rotation and then converts the vectors to polar coordinates. The angles of these rotated vectors concentrate tightly around predictable positions, a known mathematical property of high-dimensional random rotations. Because the distribution is predictable, the quantizer can skip the expensive per-vector metadata that normally accounts for much of the overhead in extreme compression. That leaves "most of the compression power" for the data itself.
The second stage, QJL, applies a 1-bit Johnson-Lindenstrauss transform to the residual — the difference between the quantized and actual value — reducing each leftover discrepancy to a single sign bit. The math guarantees that distance relationships between vectors are approximately preserved even after this extreme compression.
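The sign-bit idea can be sketched in a few lines. This is a toy version, not the paper's implementation: it assumes a dense Gaussian projection matrix, applies it to a raw vector rather than to a quantization residual, and uses far more sign bits (`m = 4096`) than a real cache could afford, purely so the estimate is stable in a demo. It relies on a standard identity for Gaussian projections: the expected value of (s·q)·sign(s·k) is sqrt(2/π)·⟨q,k⟩/‖k‖.

```python
import math
import random


def qjl_encode(vec, projection):
    """Keep only the sign of each random projection, plus the vector's norm."""
    signs = [1 if sum(p * v for p, v in zip(row, vec)) >= 0 else -1
             for row in projection]
    norm = math.sqrt(sum(v * v for v in vec))
    return signs, norm


def qjl_inner_product(query, signs, norm, projection):
    """Estimate <query, key> using only the key's sign bits and norm."""
    m = len(projection)
    total = 0.0
    for row, s in zip(projection, signs):
        total += s * sum(p * q for p, q in zip(row, query))
    # Undo the sqrt(2/pi) shrinkage introduced by taking signs.
    return math.sqrt(math.pi / 2) * norm * total / m


rng = random.Random(0)
dim, m = 16, 4096  # m is deliberately large for this demo
projection = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]

key = [rng.gauss(0, 1) for _ in range(dim)]
query = [rng.gauss(0, 1) for _ in range(dim)]

signs, norm = qjl_encode(key, projection)
exact = sum(q * k for q, k in zip(query, key))
approx = qjl_inner_product(query, signs, norm, projection)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The key is stored as one bit per projection plus a single norm, while the query stays in full precision. That asymmetry is what lets the attention score be estimated without reconstructing the key.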
The claimed results: at least 6x KV cache memory reduction on long-context benchmarks, with "perfect downstream results across all benchmarks" and up to 8x throughput gain at 4-bit on an H100. If those numbers hold in production, a 32K context that needs 32 GB of KV cache at BF16 would fit in a little over 5 GB.
Why does it work when other extreme quantization schemes fail? The key insight is eliminating the quantization constant. Standard INT4 weight quantization stores a scale factor and zero point per group of values — that metadata is necessary to reconstruct the original values, but it also partially negates the compression. At very low bit widths, the metadata eats an increasing fraction of the bits you saved. PolarQuant sidesteps this because the distribution is known in advance — no per-vector bookkeeping required.
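The metadata cost is easy to put numbers on. The group size and the 16-bit scale and zero point below are assumed values chosen for illustration. Real formats vary, but the trend is the same.

```python
def effective_bits(value_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits stored per value once per-group metadata is included."""
    return value_bits + (scale_bits + zero_point_bits) / group_size


# Assumed layout: groups of 32 values, each with a 16-bit scale and zero point.
for bits in (8, 4, 3, 2):
    total = effective_bits(bits, group_size=32, scale_bits=16, zero_point_bits=16)
    overhead = (total - bits) / total
    print(f"{bits}-bit payload -> {total:.2f} bits/value ({overhead:.0%} metadata)")
```

With this layout the metadata adds one bit per value regardless of payload width, so its share of the stored bits rises as the payload shrinks.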
Why you can't use it
The paper is arXiv:2504.19874, slated for ICLR 2026. There is no public implementation. The Google Research blog post links to the paper but not to a code repository. No integration exists in vLLM, SGLang, or llama.cpp. The technique was tested on Gemma and Mistral on H100 GPUs, without disclosing the architecture details that would let you reproduce it.
This isn't a criticism. Research papers routinely precede usable software by 6–18 months. FlashAttention was published in 2022, landed as stable in PyTorch in 2023, and became the obvious default in most inference stacks in 2024. TurboQuant is probably on a similar trajectory. Check back in Q3 2026.
The production equivalent that ships today
Here's the useful reframe: TurboQuant and the hardware features available on a modern Blackwell GPU are solving adjacent problems with different tools. The goal in both cases is to move more model capability through less memory. The paths differ.
The RTX PRO 6000 Blackwell I use for inference has a feature called NVFP4 — native 4-bit floating-point tensor core instructions added in the Blackwell architecture. This is weight compression, not KV compression, but the effect on VRAM utilization is dramatic: a 32-billion-parameter model in BF16 occupies roughly 64 GB of VRAM. The same model in NVFP4 drops to roughly 16 GB, leaving 80 GB free for KV cache and activations. You effectively get 4x the model in the same VRAM budget.
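The arithmetic behind those figures is just parameters times bits. The sketch below ignores per-block scale metadata, so a real 4-bit checkpoint will come out somewhat larger. The 96 GB capacity is an assumption inferred from the 16 GB used plus 80 GB free quoted above.

```python
def weight_gb(params, bits_per_weight):
    """Approximate weight footprint in GB (10**9 bytes), ignoring scale metadata."""
    return params * bits_per_weight / 8 / 1e9


VRAM_GB = 96  # assumed card capacity: 16 GB of weights + 80 GB free
params = 32e9

bf16 = weight_gb(params, 16)
fp4 = weight_gb(params, 4)
print(f"BF16: {bf16:.0f} GB, 4-bit: {fp4:.0f} GB, "
      f"free after 4-bit: {VRAM_GB - fp4:.0f} GB")
```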
The critical difference from previous INT4 formats is that NVFP4 runs natively on Blackwell's tensor cores. Earlier 4-bit quantization was a memory trick — you loaded weights as INT4, dequantized them to BF16 on the fly, and ran BF16 compute. The math throughput was unchanged; you only saved bandwidth. Blackwell computes the matrix multiplications themselves in FP4, which is where the roughly 2x throughput gain over FP8 comes from. You get both benefits: less memory and faster compute.
On paper, this ships today in vLLM v0.18+ as --quantization nvfp4, although the 0.19.0 wheel I installed rejected that flag name, as described further down. One practical note for RTX PRO 6000 Blackwell (compute capability sm_120): the MoE-specific FP4 backend isn't auto-detected on this card per vLLM issue #31085. For MoE models, set VLLM_USE_FLASHINFER_MOE_FP4=1 before starting the server. Dense models like qwen2.5-coder:32b work without the flag.
For the KV cache side — the part TurboQuant is targeting — the production option is FP8 KV cache, also in vLLM and supported on Blackwell. FP8 is 8-bit floating point rather than TurboQuant's 3-bit polar scheme, so it's less aggressive, but it's real and it works. The catch: FP8 KV cache and prefix caching are still incompatible in most vLLM configurations (issue #3156, long-standing). You have to pick one. For Blip-style short-context chat with heavy prefix reuse, prefix caching wins and I skipped FP8 KV. For long-context or batch-heavy workloads, FP8 KV is the right trade.
Speculative decoding: the other throughput multiplier
Quantization changes how much model fits in memory. Speculative decoding changes how many tokens you get out per second for a given model.
The idea is simple. Autoregressive generation is sequential by design, since you can't produce token N+1 until you have token N. But verifying N candidate tokens in parallel is almost free: the large model scores them all in a single forward pass, the same pass it would have used to produce just one token. If a small "draft" model correctly predicts the next 4–5 tokens most of the time, you get up to 4–5 tokens per large-model pass instead of one.
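How much you gain depends on how often the draft is right. A simple model, assuming each drafted token is accepted independently with the same probability (real acceptance rates vary by position and content), gives the expected tokens per large-model pass as a geometric sum:

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens emitted per large-model pass.

    Assumes each drafted token is accepted independently with probability
    accept_prob, and that every pass yields one token of its own: the
    correction after a rejection, or a bonus token if all drafts pass.
    """
    if accept_prob >= 1.0:
        return draft_len + 1
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


for p in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(p, 5)
    print(f"accept={p:.1f}, draft=5 -> {tokens:.2f} tokens/pass")
```

Under this model, a five-token draft only approaches five tokens per pass when the acceptance rate is around 90%. At 50% it yields about two.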
The simplest version doesn't even need a separate draft model. N-gram speculation looks for repeating patterns in the context itself: if the token sequence "def calculate" appeared earlier, the tokens that followed it then are a good guess for what follows its second occurrence. This turns out to work surprisingly well for code and structured output, which is most of what I use the coder model for. vLLM ships this as --speculative-config '{"method":"ngram",...}'. No second model to serve, no additional VRAM, measurable gains on structured generation tasks.
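The lookup itself is small enough to sketch. This is an illustration of the idea, not vLLM's implementation: the function name, the whitespace-style tokens, and the `ngram_size` and `num_draft` parameters are all hypothetical.

```python
def propose_from_ngram(tokens, ngram_size=2, num_draft=4):
    """Return draft tokens copied from after the latest earlier match, or []."""
    if len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []


context = ["def", "calculate", "(", "x", ")", ":", "return", "x", "*", "2",
           "def", "calculate"]
print(propose_from_ngram(context))
```

The proposed tokens are only guesses. The large model still verifies them in its forward pass and discards everything after the first mismatch, so a bad match costs a little wasted verification rather than a wrong answer.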
What I actually changed (and what broke)
Yesterday the inference box was running everything through Ollama with GGUF files: INT4 quantization via llama.cpp, no speculative decoding, no Blackwell-specific compute paths. I installed vLLM into a dedicated Python venv and wrote a serve script. Then I hit three compatibility walls in sequence.
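First try: --quantization nvfp4. Unknown method. The vLLM 0.19.0 build I installed offers mxfp4 as a flag name, not nvfp4. (MXFP4 is a related but separate 4-bit microscaling format, not a synonym for NVFP4.)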
Second try: --quantization mxfp4. PTX toolchain mismatch. The pre-built vLLM wheel for cu128 includes CUDA 12.8 support, but the FlashInfer kernels required for FP4 compute were not compiled for sm_120 (Blackwell). To get NVFP4 working you'd need to build vLLM from source with Blackwell-specific flags. That's a multi-hour build I didn't want to do for a benchmark run.
Third try: --quantization fp8. This one started up: the model loaded, the server came online, and it responded to requests. But the responses were garbage. The FP8 on-the-fly quantization uses a CUTLASS ScaledMM kernel whose tensor layout is incompatible with the Triton attention backend on vLLM 0.19.0. The underlying FlashAttention2 PTX also doesn't target sm_120, so you're forced to use Triton attention anyway. The result: activations go NaN somewhere in the FP8 × Triton attention stack, and you get random tokens instead of answers.
What actually ran: BF16, Triton attention, eager mode, n-gram spec decode. Strip out FP8, force --attention-backend TRITON_ATTN (since FlashAttention2 PTX won't compile for sm_120), add --enforce-eager to skip CUDAGraph profiling (which also hits the bad PTX path), and keep n-gram speculative decoding and prefix caching. That combination starts, serves real responses, and passes smoke tests.
The NVFP4 story is a "not yet" rather than a "no." Once vLLM publishes Blackwell-native wheels (probably in a point release once RTX PRO 6000 hardware gets wider distribution), the mxfp4 path should work without a source build. Until then, BF16 is the baseline.
The benchmark results
With vLLM serving BF16 on port 8001 and Ollama serving the Q4_K_M GGUF on port 11434, I ran the same five coding tests through both and used the dual-judge system (Claude Opus 4.6 + DeepSeek-V3-0324, anonymized, position-randomized) to score the outputs.
| Test | Ollama Q4_K_M | vLLM BF16 | Ollama ms | vLLM ms |
|---|---|---|---|---|
| Next.js API route with auth | 32.0 / 50 | 44.0 / 50 ✓ | 17,327 | 42,416 |
| Python data pipeline | 44.5 / 50 ✓ | 43.5 / 50 | 15,508 | 38,518 |
| React component with state | 43.0 / 50 | 45.5 / 50 ✓ | 18,535 | 52,978 |
| Complex SQL query | 36.0 / 50 | 41.0 / 50 ✓ | 16,740 | 41,001 |
| Debug this code | 17.5 / 50 | 20.0 / 50 ✓ | 14,952 | 30,289 |
| Average | 34.6 / 50 | 38.8 / 50 | 16,612 | 41,040 |
vLLM BF16 wins 4 of 5 quality tests. The average score is 38.8 vs 34.6 — about a 12% quality improvement. The cost is latency: vLLM BF16 is roughly 2.5x slower per request.
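For anyone who wants to check the summary figures, they can be recomputed directly from the table rows. The numbers below are copied from the table. Only the variable names are mine.

```python
from statistics import mean

# (Ollama score, vLLM score, Ollama ms, vLLM ms), copied from the table.
results = {
    "Next.js API route with auth": (32.0, 44.0, 17_327, 42_416),
    "Python data pipeline": (44.5, 43.5, 15_508, 38_518),
    "React component with state": (43.0, 45.5, 18_535, 52_978),
    "Complex SQL query": (36.0, 41.0, 16_740, 41_001),
    "Debug this code": (17.5, 20.0, 14_952, 30_289),
}

rows = list(results.values())
ollama_score = mean(r[0] for r in rows)
vllm_score = mean(r[1] for r in rows)
slowdown = mean(r[3] for r in rows) / mean(r[2] for r in rows)
quality_gain = vllm_score / ollama_score - 1
vllm_wins = sum(1 for r in rows if r[1] > r[0])

print(f"scores: {ollama_score:.1f} vs {vllm_score:.1f} (+{quality_gain:.0%})")
print(f"vLLM wins {vllm_wins} of {len(rows)}, {slowdown:.2f}x slower on average")
```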
The explanation isn't subtle. Q4_K_M is 18.5 GB on the GPU; BF16 is 64 GB. The Q4_K_M model reads roughly 3.5x less weight memory per forward pass, which directly maps to faster token generation on a memory-bandwidth-limited workload. vLLM's n-gram speculative decoding is probably offsetting some of that — without it, BF16 would likely be even slower — but the memory bandwidth difference dominates.
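A rough ceiling makes the bandwidth argument concrete. The sketch assumes single-stream decoding reads every weight once per token and nothing else matters. The bandwidth value is an assumed round number, so substitute your card's spec. Only the ratio between the two results is independent of that assumption.

```python
def max_decode_tokens_per_s(weight_gb, bandwidth_gb_s):
    """Ceiling on decode speed if each token requires reading all weights once."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1800  # assumed round figure, not a measured value
q4 = max_decode_tokens_per_s(18.5, BANDWIDTH_GB_S)
bf16 = max_decode_tokens_per_s(64.0, BANDWIDTH_GB_S)
print(f"Q4_K_M ceiling: {q4:.0f} tok/s, BF16 ceiling: {bf16:.0f} tok/s, "
      f"ratio {q4 / bf16:.2f}x")
```

The measured 2.5x latency gap is smaller than this roughly 3.5x ratio. That is consistent with speculative decoding recovering some of the difference, though it does not prove it, since the two runs also differ in serving stack.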
The quality gap is also explicable: 4-bit quantization introduces rounding error that accumulates through the layers. For code generation specifically, small errors in early token choices can cascade into structurally wrong output. BF16, the unquantized baseline here, adds no such error. The Opus and DeepSeek-V3 judges agreed on this in 4 of 5 tests (the one Ollama win was a 1-point gap, essentially a tie).
What this result doesn't measure is the scenario vLLM is actually designed for: concurrent requests. Ollama is geared toward serving one request at a time; vLLM's batching, prefix caching, and speculative decoding show their value under load. For a single-user home setup, Ollama Q4_K_M is the practical choice: faster, lower VRAM, good enough quality. For any multi-user deployment, the calculus shifts toward vLLM BF16 or, once Blackwell wheels land, vLLM NVFP4, provided its output quality holds up against the BF16 baseline.
The bigger question TurboQuant points at
The reason the TurboQuant paper is interesting even if you can't run it is what it implies about the direction of model scaling. KV cache is currently the main bottleneck for long-context serving. A 72-billion-parameter model running at BF16 with a 128K context window can produce a KV cache that dwarfs the model weights themselves. If 3-bit KV compression with "perfect downstream results" actually holds up in production, the effective context window of current hardware grows several-fold without changing anything else: by bit count alone, about 5x against a BF16 cache (16 bits down to 3) and nearly 3x against an FP8 cache.
Combined with NVFP4 weights, you get a compounding effect: the model is smaller, the context buffer is smaller, and what's left of your VRAM goes to concurrent requests. That's the path to running frontier-class models on a single consumer GPU — not bigger hardware, but better utilization of the hardware that exists.
TurboQuant is probably a real step on that path. It just isn't a usable step yet. When the code ships, the upgrade is straightforward: swap out the KV cache quantization backend in vLLM and re-run the benchmark. When Blackwell-native vLLM wheels land, I'll re-run the comparison with NVFP4. The BF16 result above sets the quality baseline that any quantization scheme needs to match.