Author(s): Ravi Yogesh
Originally published on Towards AI.
10 experiments, 3 models, one honest verdict: The quality story is real, the speed story needs a disclaimer, and entropy is a discovery in data no one talks about.
⏱ ~14 min read🔬 Deep Dive⚙️ LLM Estimation🗜 Quantification🚀 Serving
When Google published TurboQuant at ICLR 2026, the headline was hard to ignore: Compress your LLM key-value cache to 3 bits, preserve quality, get up to 6× memory savings.
I created a 10-experiment evaluation pipeline, ran it across three models – Gemma-2b Base, Gemma-2b-IT, and TinyLama 1.1b Chat – and measured everything I could: factual accuracy, RAG retrieval quality, multi-task generation fidelity, throughput, memory footprint, layer sensitivity, and something most quantization write-ups omit entirely: entropy. What does compression do?

What is TurboQuant and why is it different?
Most KV-cache quantization schemes treat compression as a reconstruction problem: minimize the mean square error on the cached vectors. TurboQuant is much smarter than that. It targets what really matters to focus attention on – in-product protection.
Stage 1 – PolarQuant
Each pair of consecutive vector coordinates is converted to polar form (radius + angle). The radius remains in full flow. The angle – which remains in a limited, approximately uniform range after an alternating random rotation – is quantized at lower bitwidths. Rotation is the real trick: it spreads the external energy so that the scalar quantization on the angle is not destroyed by a few prime values.
Stage 2 – QJL (Quantized Johnson-Lindenstrauss)
The reconstruction error from step 1 is interpolated into a random subspace, and only Signal Each projection is stored – 1 bit per projection. The Johnson–Lindenstrauss lemma states that random guesses approximately preserve inner products. So this step specifically corrects the dot-product error that distorts the attention score, not just the MSE. That difference separates TurboQuant from naive quantizers.
Disclaimer: My implementation combines forward passes of the model and compresses the cached K and V tensors between decode stages using int8 containers with per-vector float16 scale factors. It is not a packed 3-bit kernel – it does not use fused CUDA or Triton. This difference matters a lot for speed results and to a lesser extent for memory results. I will be completely clear about both.
experimental setup
Ten experiments, two phases. A fixed external encoder was used in all text evaluations (sentence-transformers/all-MiniLM-L6-v2) different from the models tested. Throughput used a fixed 80-step greedy decode benchmark with alternating load orders to reduce hot/cold GPU bias.
1. 3-Bit is the real operating point – and entropy explains why
Start with a synthetic bit-depth comparison – PolarQuant alone vs. full TurboQuant-style path focusing on MSE and KL divergence.


MSE table shows that 2-bit is worse, but doesn’t explain why Why. Attention uses entropy. Entropy measures how concentrated or spread out the softmax distribution is – higher entropy means the model is participating broadly, lower entropy means it is locked to certain positions.
The main difference is between random key distributions (noise-floor conditions) and structured key distributions – vectors with key directional clusters, which is closer to what real LLM attention heads actually produce.
At 2-bit compression, structured key distributions experience 65% entropy collapse – the quantizer is actively reshaping the attention geometry in ways that cut off access to the broader context. It’s not just advanced MSE. This is a less reported reason why 2-bit compression silently hurts multi-hop reasoning, long-document synthesis, and complex RAG chains.
2. Quality results are stronger than expected
Factual QA (40 questions, 3 seeds)
RAG-style context-based QA (12 contexts, 3 seeds)
Multi-task semantic similarity (factual, reasoning, coding, summary) between the baseline and TQ-3 bit generations: 1.0 in all categories and all three models.

3. There are two very different versions of memory math
For Gemma-2B (18 layers, MQA with 1 KV head per group, 256-dim heads), the ideal 3-bit target vs. what the int8 prototype actually stores:

If someone tells you TurboQuant gives 5× memory savings, ask what storage backend they are using. An int8 prototype gives ~2×. It’s still useful – it can expand the context window or increase batch sizes on restricted hardware – but that’s a different product story. ~5× titles require really packed sub-byte storage with a custom memory layout.
4. Speed is model dependent – here’s the honest reason why
Throughput sweep: 80 fixed greedy decode steps, 5 measured trials, 2 optional rounds, three accelerated lengths.
Title (512-token prompt)
Full prompt-length sweep (TQ/baseline ratio)
The question of speed is not “Is the TurboQuant fast?” This is “Is the bottleneck of your model memory bandwidth or arithmetic?” Fast models like TinyLama spend proportionally more time on cache I/O – compressing the cache reduces that bottleneck. Compute-heavy models like Gemma-2B are less responsive. Know your bottleneck before optimizing for it.
5. Mixed-bit schedules are an easy win no one talks about
Layer sensitivity analysis showed that for both Gemma variants, compression sensitivity peaks in layers 7–10 – the middle of the stack. Uniform bit allocation is leaving quality on the table for free.

Sensitivity-ranking schedule – 4 bits for the top 25% most sensitive layers, 3 bits for the middle, 2 bits for the least sensitive – at ~2.94 effective bits:

In practice the scan looks like this:
def compute_layer_sensitivity(model, bits=3):
"""Single-pass sensitivity scan over key-projection layers."""
sensitivity = {}
for i, layer in enumerate(model.model.layers):
W = layer.self_attn.k_proj.weight.data
W_hat, _, _, _ = turboquant_apply(W, bits)
mse = ((W - W_hat.to(W.dtype)) ** 2).mean().item()
sensitivity(i) = mse
return sensitivitydef build_mixed_bit_schedule(sensitivity, top_pct=0.25, bottom_pct=0.25):
"""Assign 4/3/2 bits by sensitivity rank."""
ranked = sorted(sensitivity.items(), key=lambda x: x(1), reverse=True)
n = len(ranked)
schedule = {i: 3 for i in sensitivity}
for idx, _ in ranked(:int(n * top_pct)):
schedule(idx) = 4 # most sensitive → more bits
for idx, _ in ranked(-int(n * bottom_pct):):
schedule(idx) = 2 # least sensitive → fewer bits
return schedule
A forward pass. Three lines of argument. ~19% MSE improvement on medium-weight models. If you are building TurboQuant into an inference system, run this scan before defaulting to equal bit allocation.
5 things to consider in your next production decision
- 3 bits is your quality-neutral default. 2 bits distort geometry focusing on realistic key distribution. Perform thorough validation before deploying at that compression level.
- Plan on memory budget around ~2× today, ~5× eventually. Supports algorithmic titles. The engineering (packed sub-byte storage + fused kernel) does not yet exist outside Google’s internal stack.
- Run a sensitivity scan before deploying uniform bitwidth. One forward pass, measurable quality improvement, essentially free.
- Speed benefits are architecture dependent. Tinylama intensified by 20-81%; Gemma-2B slowed down by 5%. Know whether your serving bottleneck is memory bandwidth or compute.
- Production wins require kernel integration. See vLLM and TensorRT-LLM for TurboQuant-style packed attention support. That’s when the throughput story changes.
bottom line
TurboQuant is not a free 5x win that you can drop into production tomorrow. This is even more interesting than that.
Experiments suggest a clear gap between algorithmic promise and current engineering reality. In terms of quality, the results are encouraging: 3-bit quantization held up well, and the mixed-bit schedule helped stabilize the delicate middle layers in Gemma-2B. On memory, the theoretical savings are real, but today’s PyTorch implementations are limited by the lack of int8 containers and packed sub-byte kernels. On speed, the story largely depends on the model and runtime path.
The most useful solution is not simply that “lower bits are better.” It is that quantization changes the geometry of attention. The 2-bit run not only added harmless noise; They collapsed entropy into structured key distributions, which helps explain why context-sensitive functions deteriorate so rapidly under aggressive compression.
This makes the practical lesson quite concrete: treat 3-bit quantization as a safe default, scan layers before choosing a uniform bitwidth, and reserve 2-bit compression for places where the model can actually tolerate it. Mixed-bit scheduling matters more in production than headline compression ratio.
The algorithm is reliable, the quality findings are promising, and the deployment story will ultimately be decided by kernel support. It is worth taking paper data seriously, but not blindly. The real production opportunity begins when packed storage and fused attention kernel algorithms catch up.
Reference
- Zandih et al. (2026). TurboQuant: Online vector quantization with near-optimal distortion rate. ICLR 2026. openreview.net/forum?id=tO3ASKZlock
- Zandih et al. (2025). arXiv preprint arXiv:2504.19874.
- Google Research Blog. TurboQuant: Redefining AI efficiency with extreme compression.
- Zandih et al. (2024). PolarQuant: Leveraging polar transformations for efficient KV cache quantization. aistats 2026.
- Johnson & Lindenstrauss (1984). Extension of the Lipschitz mapping in Hilbert space. contemporary mathematics26.
- Kwon et al. (2023). Efficient memory management for serving large language models with PagedAttention. sosp 2023. arXiv:2309.06180
- Ainslie et al. (2023). GQA: Training a generalized multi-query transformer model from multi-head checkpoints. arXiv:2305.13245
- Hooper et al. (2024). KVQuant: Towards 10 million context length LLM estimation with KV cache quantization. arXiv:2401.18079
- GitHub repository with the source code used for the above experiments: github.com/quartzap/turboquant
Published via Towards AI