Is 3-bit KV Cache the Holy Grail? A reality check on Google’s TurboQuant

Author(s): Ravi Yogesh

Originally published on Towards AI.

10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and the entropy data holds a finding no one talks about.

When Google published TurboQuant at ICLR 2026, the headline was hard to ignore: compress your LLM key-value cache to 3 bits, preserve quality, and get up to 6× memory savings.

I built a 10-experiment evaluation pipeline, ran it across three models (Gemma-2B base, Gemma-2B-IT, and TinyLlama 1.1B Chat), and measured everything I could: factual accuracy, RAG retrieval quality, multi-task generation fidelity, throughput, memory footprint, layer sensitivity, and something most quantization write-ups omit entirely: what compression does to attention entropy.

Figure 0: Full experiment suite – quality, memory, throughput and mixed-bit results across all three models

What is TurboQuant and why is it different?

Most KV-cache quantization schemes treat compression as a reconstruction problem: minimize the mean squared error on the cached vectors. TurboQuant is smarter than that. It targets what actually matters to attention: inner-product preservation.

Stage 1 – PolarQuant

Each pair of consecutive vector coordinates is converted to polar form (radius + angle). The radius is kept in full precision. The angle, which falls in a bounded, approximately uniform range once a random rotation has been applied, is quantized at a low bitwidth. The rotation is the real trick: it spreads outlier energy across coordinates so that scalar quantization of the angle is not wrecked by a few extreme values.
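As a rough illustration of the stage described above, here is a minimal pure-Python sketch: rotate, convert coordinate pairs to polar form, keep the radius, and quantize the angle uniformly. It is a simplification under stated assumptions, not the paper's algorithm or the code used in these experiments. The rotation is built from random plane (Givens) rotations, and every function name here is hypothetical.

```python
import math
import random

def random_givens(dim, count, seed=0):
    """Draw a fixed sequence of random plane rotations (i, j, theta)."""
    rng = random.Random(seed)
    plan = []
    for _ in range(count):
        i, j = rng.sample(range(dim), 2)
        plan.append((i, j, rng.uniform(0.0, 2.0 * math.pi)))
    return plan

def rotate(vec, plan, inverse=False):
    """Apply the plane rotations in order, or undo them when inverse=True."""
    out = list(vec)
    steps = reversed(plan) if inverse else plan
    for i, j, theta in steps:
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        out[i], out[j] = c * out[i] - s * out[j], s * out[i] + c * out[j]
    return out

def polar_quantize(vec, bits):
    """Encode consecutive coordinate pairs (even-length vec) as radius + angle code."""
    levels = 1 << bits
    step = 2.0 * math.pi / levels
    radii, codes = [], []
    for x, y in zip(vec[0::2], vec[1::2]):
        radii.append(math.hypot(x, y))              # radius kept in full precision
        angle = math.atan2(y, x) % (2.0 * math.pi)  # map the angle to [0, 2*pi)
        codes.append(int(angle / step) % levels)    # uniform angle bin index
    return radii, codes

def polar_dequantize(radii, codes, bits):
    """Rebuild the pairs from each radius and the centre of its angle bin."""
    step = 2.0 * math.pi / (1 << bits)
    vec = []
    for r, code in zip(radii, codes):
        angle = (code + 0.5) * step
        vec.extend((r * math.cos(angle), r * math.sin(angle)))
    return vec

rng = random.Random(1)
key = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for one cached key vector
plan = random_givens(len(key), count=32)
radii, codes = polar_quantize(rotate(key, plan), bits=3)
approx = rotate(polar_dequantize(radii, codes, bits=3), plan, inverse=True)
```

Because the radius is stored exactly and the rotation is orthogonal, the reconstruction keeps the vector's norm; all of the error comes from the angle bins.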

Stage 2 – QJL (Quantized Johnson-Lindenstrauss)

The reconstruction error from Stage 1 is projected onto a random subspace, and only the sign of each projection is stored: 1 bit per projection. The Johnson–Lindenstrauss lemma implies that random projections approximately preserve inner products, so this stage specifically corrects the dot-product error that distorts the attention scores, not just the MSE. That difference separates TurboQuant from naive quantizers.
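To make the 1-bit idea concrete, here is a minimal sketch of a sign-projection inner-product estimator. It assumes Gaussian projection rows and assumes the residual norm is stored alongside the sign bits; both are assumptions of this sketch, and all names are hypothetical. It relies on the identity E[<g, q> · sign(<g, r>)] = sqrt(2/π) · <q, r> / ||r|| for a standard Gaussian vector g.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_projection(rows, dim, seed=0):
    """Random Gaussian projection matrix, shared by encoder and decoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

def encode_residual(residual, projection):
    """Keep one sign bit per projection, plus the residual norm."""
    signs = [1 if dot(row, residual) >= 0.0 else -1 for row in projection]
    return signs, math.sqrt(dot(residual, residual))

def estimate_inner_product(query, signs, norm, projection):
    """Estimate <query, residual> from the stored sign bits.

    Rescaling the averaged products by sqrt(pi/2) * ||residual|| removes the
    sqrt(2/pi) / ||residual|| factor in the Gaussian identity.
    """
    total = sum(dot(row, query) * s for row, s in zip(projection, signs))
    return math.sqrt(math.pi / 2.0) * norm * total / len(projection)

rng = random.Random(7)
residual = [rng.gauss(0.0, 0.1) for _ in range(16)]   # stand-in for a Stage 1 error vector
query = [rng.gauss(0.0, 1.0) for _ in range(16)]
projection = make_projection(rows=4000, dim=16)
signs, norm = encode_residual(residual, projection)
estimate = estimate_inner_product(query, signs, norm, projection)
exact = dot(query, residual)
```

The estimate targets the dot product directly, which is the quantity attention scores depend on. The error shrinks roughly with the square root of the number of projections; the 4000 rows used here are only there to make the demo stable, not a realistic bit budget.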

Disclaimer: My implementation wraps the model's forward passes and compresses the cached K and V tensors between decode steps, using int8 containers with per-vector float16 scale factors. It is not a packed 3-bit kernel, and it does not use fused CUDA or Triton kernels. This difference matters a lot for the speed results and to a lesser extent for the memory results. I will be completely clear about both.

Experimental setup

Ten experiments, two phases. All text evaluations used a fixed external encoder (sentence-transformers/all-MiniLM-L6-v2), separate from the models under test. Throughput was measured with a fixed 80-step greedy decode benchmark, alternating the load order to reduce hot/cold GPU bias.
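The throughput protocol can be sketched as a small timing harness. This is an illustrative sketch, not the harness used for the results: the function names are hypothetical, the untimed rounds are assumed to be warm-up, and alternating which path runs first stands in for the alternating load order.

```python
import statistics
import time

def time_run(run, tokens_per_run):
    """Time one decode run and return tokens per second."""
    start = time.perf_counter()
    run()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return tokens_per_run / elapsed

def compare_throughput(run_baseline, run_variant, tokens_per_run=80, trials=5, warmup=2):
    """Return median tokens/s for both paths and the variant/baseline ratio."""
    for _ in range(warmup):              # untimed rounds (assumed to be warm-up)
        run_baseline()
        run_variant()
    rates = {"baseline": [], "variant": []}
    for trial in range(trials):
        order = [("baseline", run_baseline), ("variant", run_variant)]
        if trial % 2 == 1:               # alternate which path runs first
            order.reverse()
        for name, run in order:
            rates[name].append(time_run(run, tokens_per_run))
    base = statistics.median(rates["baseline"])
    variant = statistics.median(rates["variant"])
    return base, variant, variant / base
```

Each callable is expected to perform one complete fixed-length greedy decode. On a GPU it would also need to block until the work has finished (for example by calling torch.cuda.synchronize() before returning), otherwise the timer only measures kernel launch time.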

1. 3-Bit is the real operating point – and entropy explains why

Start with a synthetic bit-depth comparison: PolarQuant alone vs. the full TurboQuant-style path, measured on attention MSE and KL divergence.

At 2 bits, TurboQuant performs worse than PolarQuant alone. At 3 and 4 bits, the combination wins decisively.
Figure 1 (E1): Attention MSE and KL divergence versus bit depth – PolarQuant versus TurboQuant-style.
Figure 2 (E2): Rate-distortion curve – measured MSE versus anchored 4^-B reference (log scale)
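For context on the 4^-B reference: a uniform scalar quantizer with step Δ has an MSE of roughly Δ²/12, and every extra bit halves Δ, so the MSE falls about 4× per bit. The sketch below checks that scaling on uniformly distributed synthetic samples. Real key vectors are not uniform, so this only shows where the reference slope comes from; the helper names are hypothetical.

```python
import random

def uniform_quantize(x, bits, lo=-1.0, hi=1.0):
    """Snap x to the centre of one of 2**bits equal-width bins on [lo, hi]."""
    levels = 1 << bits
    step = (hi - lo) / levels
    index = min(int((x - lo) / step), levels - 1)
    return lo + (index + 0.5) * step

def quantization_mse(samples, bits):
    return sum((x - uniform_quantize(x, bits)) ** 2 for x in samples) / len(samples)

rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
mse_by_bits = {b: quantization_mse(samples, b) for b in (2, 3, 4)}
# Each ratio should sit close to 4, the per-bit factor behind the 4^-B curve.
ratios = [mse_by_bits[b] / mse_by_bits[b + 1] for b in (2, 3)]
```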

The MSE table shows that 2-bit is worse, but it doesn't explain why. Attention entropy does. Entropy measures how concentrated or spread out the softmax distribution is: higher entropy means the model is attending broadly, lower entropy means it is locked onto a few positions.
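For readers who want the definition in code, here is a minimal sketch of attention entropy: the Shannon entropy (in nats) of the softmax over one query's attention scores. The function names are illustrative and are not taken from the experiment code.

```python
import math

def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Shannon entropy (nats) of softmax(scores)."""
    return -sum(p * math.log(p) for p in softmax(scores) if p > 0.0)

broad = attention_entropy([0.0] * 8)             # equal scores: attention spread evenly
peaked = attention_entropy([10.0] + [0.0] * 7)   # one dominant score: attention locked on
```

With eight equal scores the entropy reaches its maximum of ln 8 ≈ 2.08 nats, while the peaked case drops to nearly zero. An entropy collapse under quantization means the compressed scores look more like the second case than the uncompressed ones did.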

The main difference is between random key distributions (noise-floor conditions) and structured key distributions: key vectors with directional clusters, which is closer to what real LLM attention heads actually produce.

Structured distributions (closer to real LLM key patterns) show 65% entropy collapse at 2 bits: the model is forced to hyper-focus on a small fraction of the context positions.

At 2-bit compression, structured key distributions experience 65% entropy collapse. The quantizer is actively reshaping the attention geometry in ways that cut off access to the broader context. It's not just elevated MSE. This is an under-reported reason why 2-bit compression silently hurts multi-hop reasoning, long-document synthesis, and complex RAG chains.

2. Quality results are stronger than expected

Factual QA (40 questions, 3 seeds)

RAG-style context-based QA (12 contexts, 3 seeds)

Multi-task semantic similarity (factual, reasoning, coding, summary) between the baseline and TQ-3bit generations was 1.0 in all categories for all three models.

Figure 3 (E7): Multi-task semantic similarity radar – TQ-3bit vs baseline (1.0 in all categories)

3. There are two very different versions of memory math

For Gemma-2B (18 layers, MQA with 1 KV head per group, 256-dim heads), the ideal 3-bit target vs. what the int8 prototype actually stores:

The difference of 2.69× between the ideal and the prototype is entirely explained by the storage format: int8 costs 8 bits/element instead of 3, plus float16 scale metadata per token vector.
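The arithmetic behind these ratios can be reproduced in a few lines. This sketch assumes an fp16 baseline cache and one float16 scale per cached K or V vector (per layer, per token); those assumptions are consistent with the ~2× and 2.69× figures but are spelled out here rather than taken from the experiment code.

```python
LAYERS = 18      # Gemma-2B decoder layers
KV_HEADS = 1     # multi-query attention: a single shared KV head
HEAD_DIM = 256   # dimensions per head

def kv_bits_per_token(bits_per_element, scale_bits_per_vector=0):
    """Bits of K+V cache stored per token across all layers."""
    vectors = LAYERS * KV_HEADS * 2          # one K and one V vector per layer and head
    return vectors * (HEAD_DIM * bits_per_element + scale_bits_per_vector)

baseline = kv_bits_per_token(16)                            # assumed fp16 cache
ideal_3bit = kv_bits_per_token(3)                           # packed 3-bit target
prototype = kv_bits_per_token(8, scale_bits_per_vector=16)  # int8 + float16 scale

prototype_vs_ideal = prototype / ideal_3bit   # 2.6875, the 2.69x gap
baseline_vs_prototype = baseline / prototype  # about 1.98, the "~2x" saving
baseline_vs_ideal = baseline / ideal_3bit     # about 5.33, the "~5x" headline
```

In bytes per token that is 18,432 for fp16, 9,288 for the int8 prototype, and 3,456 for an ideal packed 3-bit layout.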
Figure 4 (E4): Left: Ideal target by bitwidth. Right: prototype storage versus ideal 3-bit target

If someone tells you TurboQuant gives 5× memory savings, ask what storage backend they are using. An int8 prototype gives ~2×. That is still useful (it can extend the context window or increase batch sizes on constrained hardware), but it is a different product story. The ~5× headlines require truly packed sub-byte storage with a custom memory layout.

4. Speed is model dependent – here's the honest reason why

Throughput sweep: 80 fixed greedy decode steps, 5 measured trials, 2 warm-up rounds, three prompt lengths.

Headline (512-token prompt)

Full prompt-length sweep (TQ/baseline ratio)

Values above 1.0 mean that TQ-3bit is faster. TinyLlama is about 10× faster at baseline, making it proportionally more sensitive to cache I/O savings.

The speed question is not "Is TurboQuant fast?" It is "Is your model's bottleneck memory bandwidth or compute?" Fast models like TinyLlama spend proportionally more time on cache I/O, so compressing the cache relieves that bottleneck. Compute-heavy models like Gemma-2B are less responsive. Know your bottleneck before optimizing for it.

5. Mixed-bit schedules are an easy win no one talks about

Layer sensitivity analysis showed that for both Gemma variants, compression sensitivity peaks in layers 7–10 – the middle of the stack. Uniform bit allocation is leaving quality on the table for free.

Figure 6 (E3): Per-layer reconstruction MSE – sensitivity peaks in the middle layers (7-10) for Gemma-2B

A sensitivity-ranked schedule (4 bits for the top 25% most sensitive layers, 3 bits for the middle, 2 bits for the least sensitive) comes to ~2.94 effective bits:

~19% MSE improvement over uniform 3-bit at about the same average bit budget
Figure 7 (E10): Per-layer MSE – uniform 3-bit vs. sensitivity-ranked mixed-bit (green = improvement area)

In practice the scan looks like this:

def compute_layer_sensitivity(model, bits=3):
    """Single-pass sensitivity scan over key-projection layers."""
    # turboquant_apply is not defined in this snippet; it is assumed to
    # return the reconstructed (dequantized) tensor as its first value.
    sensitivity = {}
    for i, layer in enumerate(model.model.layers):
        W = layer.self_attn.k_proj.weight.data
        W_hat, _, _, _ = turboquant_apply(W, bits)
        mse = ((W - W_hat.to(W.dtype)) ** 2).mean().item()
        sensitivity[i] = mse  # higher MSE → more sensitive layer
    return sensitivity

def build_mixed_bit_schedule(sensitivity, top_pct=0.25, bottom_pct=0.25):
    """Assign 4/3/2 bits by sensitivity rank."""
    ranked = sorted(sensitivity.items(), key=lambda x: x[1], reverse=True)
    n = len(ranked)
    n_top, n_bottom = int(n * top_pct), int(n * bottom_pct)
    schedule = {i: 3 for i in sensitivity}
    for idx, _ in ranked[:n_top]:
        schedule[idx] = 4  # most sensitive → more bits
    # Slice from n - n_bottom so that n_bottom == 0 selects nothing
    # (ranked[-0:] would select every layer).
    for idx, _ in ranked[n - n_bottom:]:
        schedule[idx] = 2  # least sensitive → fewer bits
    return schedule

One pass over the layers. Three lines of logic. ~19% MSE improvement on mid-sized models. If you are building TurboQuant into an inference system, run this scan before defaulting to uniform bit allocation.
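A quick way to sanity-check a schedule's average bit budget is to compute it directly. The helper below is a hypothetical illustration. The layer counts are my own arithmetic for an 18-layer stack, not splits reported in the experiments: equal-sized 4-bit and 2-bit groups always average exactly 3.0 bits, and a figure near 2.94 is consistent with one extra layer in the 2-bit group.

```python
def effective_bits(layer_counts):
    """Average bits per layer for a {bitwidth: number_of_layers} schedule."""
    total_layers = sum(layer_counts.values())
    return sum(bits * n for bits, n in layer_counts.items()) / total_layers

# 18 layers with equal-sized 4-bit and 2-bit groups
symmetric = effective_bits({4: 4, 3: 10, 2: 4})    # 54 / 18 = 3.0

# Hypothetical split with one extra 2-bit layer
asymmetric = effective_bits({4: 4, 3: 9, 2: 5})    # 53 / 18, about 2.94
```

Comparing the mixed schedule against uniform 3-bit is only fair if this number stays at or below 3.0, which is why it is worth computing before quoting an MSE improvement.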

5 things to consider in your next production decision

  1. 3 bits is your quality-neutral default. 2 bits distorts attention geometry on realistic key distributions. Validate thoroughly before deploying at that compression level.
  2. Plan your memory budget around ~2× today, ~5× eventually. The algorithm supports the headlines. The engineering (packed sub-byte storage + fused kernels) does not yet exist outside Google's internal stack.
  3. Run a sensitivity scan before deploying a uniform bitwidth. One pass, measurable quality improvement, essentially free.
  4. Speed benefits are architecture dependent. TinyLlama sped up by 20-81%; Gemma-2B slowed down by 5%. Know whether your serving bottleneck is memory bandwidth or compute.
  5. Production wins require kernel integration. Watch vLLM and TensorRT-LLM for TurboQuant-style packed attention support. That's when the throughput story changes.

Bottom line

TurboQuant is not a free 5× win that you can drop into production tomorrow. It is more interesting than that.

The experiments suggest a clear gap between algorithmic promise and current engineering reality. On quality, the results are encouraging: 3-bit quantization held up well, and the mixed-bit schedule helped stabilize the sensitive middle layers in Gemma-2B. On memory, the theoretical savings are real, but today's PyTorch implementations are limited by int8 containers and the lack of packed sub-byte kernels. On speed, the story largely depends on the model and runtime path.

The most useful takeaway is not simply that "lower bits are better." It is that quantization changes the geometry of attention. The 2-bit runs did not merely add harmless noise; they collapsed entropy on structured key distributions, which helps explain why context-sensitive tasks deteriorate so rapidly under aggressive compression.

This makes the practical lesson quite concrete: treat 3-bit quantization as a safe default, scan layers before choosing a uniform bitwidth, and reserve 2-bit compression for places where the model can actually tolerate it. Mixed-bit scheduling matters more in production than headline compression ratio.

The algorithm is sound, the quality findings are promising, and the deployment story will ultimately be decided by kernel support. It is worth taking the paper's numbers seriously, but not blindly. The real production opportunity begins when packed storage and fused attention kernels catch up with the algorithm.

References

  1. Zandieh et al. (2026). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. openreview.net/forum?id=tO3ASKZlock
  2. Zandieh et al. (2025). arXiv preprint arXiv:2504.19874.
  3. Google Research Blog. TurboQuant: Redefining AI efficiency with extreme compression.
  4. Zandieh et al. (2024). PolarQuant: Leveraging polar transformations for efficient KV cache quantization. AISTATS 2026.
  5. Johnson & Lindenstrauss (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26.
  6. Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. arXiv:2309.06180
  7. Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245
  8. Hooper et al. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. arXiv:2401.18079
  9. GitHub repository with the source code used for the above experiments: github.com/quartzap/turboquant
