
Commit 68f9516

docs: add TurboQuant KV quantization documentation
- Add TurboQuant section based on Google Research's technology (ICLR 2026)
- Explain principles: PolarQuant (high-quality compression) + QJL (error elimination)
- Add K=4bit QJL4 + V=2bit MSE quantization scheme details
- Include performance benchmark on H200 with Qwen3-30B-A3B-Base
- Document limitations: PytorchEngine only, no MLA, no speculative decoding
- Add optional dependency for fast_hadamard_transform
- Improve quant_policy CLI help text with TurboQuant reference
1 parent e1bfc96 commit 68f9516

3 files changed

Lines changed: 175 additions & 4 deletions


docs/en/quantization/kv_quant.md

Lines changed: 85 additions & 1 deletion
@@ -21,6 +21,74 @@ In summary, LMDeploy kv quantization has the following advantages:

3. KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
4. Efficient inference, with int8/int4 kv quantization applied to llama2-7b, RPS is improved by around 30% and 40% respectively compared to fp16

## TurboQuant

LMDeploy supports KV quantization based on [Google Research's TurboQuant technology](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) (to be presented at ICLR 2026), achieving a higher compression ratio with near-zero accuracy loss through a K=4bit QJL4 + V=2bit MSE combination.

### Principles

TurboQuant achieves efficient compression through two key steps; a toy numerical sketch follows the list:

1. **High-quality compression (PolarQuant method)**: First randomly rotates the data vectors (using orthogonal transforms such as the Hadamard transform). This step simplifies the data's geometry, making it easy to apply a standard, high-quality quantizer to each coordinate of the vector individually. This stage spends most of the compression budget (the majority of the bits) on capturing the main direction and magnitude of the original vector.
2. **Eliminating hidden errors (QJL method)**: Uses a small, residual amount of compression power (just 1 bit) to apply the QJL (Quantized Johnson-Lindenstrauss) algorithm to the tiny amount of error left over from the first stage. The QJL stage acts as a mathematical error-checker that eliminates bias, leading to more accurate attention scores.
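
A toy numerical sketch of the two steps, assuming NumPy/SciPy are available and using placeholder codebook values; it only mirrors the structure described above and is not the actual LMDeploy kernel:

```python
# Toy illustration of the two TurboQuant steps on one head_dim-sized vector.
# NOT LMDeploy's kernel: the codebook and scaling rule here are placeholders.
import numpy as np
from scipy.linalg import hadamard

d = 64                                          # head_dim, must be a power of 2
rng = np.random.default_rng(0)
x = rng.standard_normal(d).astype(np.float32)   # one K (or V) vector

# Step 1: random rotation via a randomized Hadamard transform (orthonormal),
# which spreads the vector's energy evenly across coordinates.
signs = rng.choice(np.array([-1.0, 1.0], dtype=np.float32), size=d)
H = hadamard(d).astype(np.float32) / np.sqrt(d)
x_rot = H @ (signs * x)

# Step 1 (cont.): coarse per-coordinate quantization with a small codebook
# (a stand-in for the 3-bit Lloyd-Max codebook used on the K path).
codebook = np.linspace(-1.75, 1.75, 8, dtype=np.float32)       # 8 levels = 3 bits
scale = np.abs(x_rot).max() / np.abs(codebook).max()
idx = np.abs(x_rot[:, None] / scale - codebook[None, :]).argmin(axis=1)
x_hat = codebook[idx] * scale

# Step 2: spend 1 extra bit per coordinate on the sign of the leftover error
# (the QJL idea: a cheap sign-based sketch that cancels quantization bias).
residual = x_rot - x_hat
residual_sign = np.signbit(residual)            # 1 bit per coordinate

print("norm preserved by rotation:", np.allclose(np.linalg.norm(x), np.linalg.norm(x_rot), atol=1e-4))
print("coarse-stage MSE:", float(np.mean(residual ** 2)))
```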

### K/V Quantization Scheme

- **K Path - QJL4 Quantization**:

  - Uses a 3-bit Lloyd-Max codebook for MSE quantization (captures the main information)
  - Uses 1-bit QJL to store the residual sign (eliminates error bias)
  - Each token's K is compressed to 4 bits per element

- **V Path - MSE int2 Quantization**:

  - Uses a 2-bit Lloyd-Max codebook for MSE quantization
  - Each token's V is compressed to 2 bits per element
  - Stores normalization coefficients for dequantization
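
An illustrative V-path quantizer matching this layout (a 2-bit codebook index per element plus a per-token normalization coefficient); the 4-level codebook approximates the Lloyd-Max quantizer for unit-variance Gaussian data, and both the codebook and the scaling rule are assumptions, not LMDeploy's kernels:

```python
# Minimal sketch of the V path (2-bit MSE) for one token's V vector.
import numpy as np

V_CODEBOOK = np.array([-1.510, -0.4528, 0.4528, 1.510], dtype=np.float32)  # 2 bits

def quantize_v(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Return (2-bit codebook indices, per-token normalization coefficient)."""
    scale = float(v.std()) + 1e-6          # stored alongside the indices
    idx = np.abs(v[:, None] / scale - V_CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_v(idx: np.ndarray, scale: float) -> np.ndarray:
    return V_CODEBOOK[idx] * scale

v = np.random.default_rng(1).standard_normal(128).astype(np.float32)
idx, scale = quantize_v(v)
v_hat = dequantize_v(idx, scale)
print("bits per element:", 2, "| reconstruction MSE:", float(np.mean((v - v_hat) ** 2)))
```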

### Advantages

- **Near-lossless accuracy**: The PolarQuant + QJL combination achieves a high compression rate while maintaining model accuracy
- **Higher compression ratio**: K 4bit + V 2bit = 3 bits on average, a further reduction compared to int4's 4 bits (see the arithmetic below)
- **Eliminates quantization bias**: The QJL algorithm acts as an error checker, effectively eliminating quantization-induced bias
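
For reference, the arithmetic behind the roughly 5x figure quoted in the benchmark takeaway below, ignoring the small overhead of the stored per-token scales:

```python
# KV cache bits per element pair (one K element + one V element).
fp16_bits = 16 + 16            # K + V in fp16
turbo_bits = 4 + 2             # K: 3-bit index + 1-bit sign, V: 2-bit index
print(fp16_bits / turbo_bits)  # ~5.3x smaller, before per-token scale overhead
```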

### Performance Benchmark

Tested on H200 with the Qwen3-30B-A3B-Base model and the ShareGPT dataset:

| Metric             | Baseline (quant_policy=0) | TurboQuant (quant_policy=42) | Change     |
| ------------------ | ------------------------- | ---------------------------- | ---------- |
| Input throughput   | 2368.8 tok/s              | 2195.8 tok/s                 | -7.3%      |
| Output throughput  | 2186.7 tok/s              | 2027.0 tok/s                 | -7.3%      |
| Request throughput | 10.74 req/s               | 9.96 req/s                   | -7.3%      |
| Mean E2E latency   | 5.888s                    | 6.348s                       | +7.8%      |
| Mean TTFT          | 1.139s                    | 1.235s                       | +8.4%      |
| Mean TPOT          | 0.024s                    | 0.026s                       | +8.3%      |
| Mean ITL           | 0.059s                    | 0.059s                       | ~unchanged |

**Test configuration**: GPU: H200, Model: Qwen3-30B-A3B-Base, Dataset: ShareGPT, Concurrency: 64, Requests: 5000

**Takeaway**: TurboQuant K4V2 achieves roughly 5x KV cache memory reduction at about 7%-8% end-to-end performance overhead, a reasonable trade-off for memory-bound serving scenarios.

### Limitations

- **PytorchEngine only**: TurboQuant currently supports only the PyTorch engine, not the TurboMind engine
- **MLA not supported**: Does not support the Multi-head Latent Attention architecture
- **Speculative decoding not supported**: Does not support speculative decoding
- Requires head_dim to be a power of 2
- Requires the `fast_hadamard_transform` package for best performance (optional)

### Optional Dependency

TurboQuant uses the Hadamard transform to accelerate the quantization process. Installing `fast_hadamard_transform` provides better performance:

```shell
pip install fast_hadamard_transform
```

Without this dependency, TurboQuant still works correctly, but performance may be slightly reduced.
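
A minimal sketch of the kind of optional-import fallback this describes, assuming the package exposes `hadamard_transform(x, scale=...)`; this is not LMDeploy's actual detection code:

```python
# Illustrative fallback: prefer the fused CUDA kernel from fast_hadamard_transform,
# otherwise rotate with a plain matmul against an explicit Hadamard matrix.
import torch
from scipy.linalg import hadamard

try:
    from fast_hadamard_transform import hadamard_transform  # optional dependency
    HAVE_FHT = True
except ImportError:
    HAVE_FHT = False

def rotate(x: torch.Tensor) -> torch.Tensor:
    """Apply an orthonormal Hadamard rotation along the last (power-of-2) dim."""
    d = x.shape[-1]
    if HAVE_FHT and x.is_cuda:
        return hadamard_transform(x, scale=d ** -0.5)
    H = torch.as_tensor(hadamard(d), dtype=x.dtype, device=x.device) * d ** -0.5
    return x @ H
```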

In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in LMDeploy. But before that, please ensure that lmdeploy is installed.

```shell
@@ -31,7 +99,7 @@ pip install lmdeploy

Applying kv quantization and inference via LMDeploy is quite straightforward. Simply set the `quant_policy` parameter.

- **LMDeploy specifies that `quant_policy=4` stands for 4-bit kv, whereas `quant_policy=8` indicates 8-bit kv.**
+ **LMDeploy specifies that `quant_policy=4` stands for 4-bit kv, `quant_policy=8` indicates 8-bit kv, and `quant_policy=42` indicates TurboQuant.**

### Offline inference

@@ -49,6 +117,22 @@ print(response)
lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
```

### TurboQuant

TurboQuant uses `quant_policy=42` and is **PytorchEngine only**:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
    quant_policy=42,  # TurboQuant: K=4bit QJL4 + V=2bit MSE
)
pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
response = pipe(["Hello, how are you?"], gen_config=GenerationConfig(max_new_tokens=30))
print(response[0].text)
```
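
For serving, the same policy can be passed through the CLI flag updated in `lmdeploy/cli/utils.py` below; a hedged example, with the PyTorch backend selected explicitly since TurboQuant is PytorchEngine only:

```shell
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch --quant-policy 42
```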

## Evaluation

We apply kv quantization of LMDeploy to several LLM models and utilize OpenCompass to evaluate the inference accuracy. The results are shown in the table below:

docs/zh_cn/quantization/kv_quant.md

Lines changed: 85 additions & 1 deletion
@@ -21,6 +21,74 @@ LMDeploy kv 4/8 bit quantization and inference support the following NVIDIA GPU models:

3. kv int8 quantization has almost lossless accuracy, and kv int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS improves by nearly 30% and 40% respectively compared to fp16

## TurboQuant Quantization

LMDeploy supports a KV quantization scheme based on [Google Research's TurboQuant technology](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) (to be presented at ICLR 2026), which combines K=4bit QJL4 with V=2bit MSE to achieve a higher compression ratio with nearly lossless accuracy.

### Principles

TurboQuant achieves efficient compression through two key steps:

1. **High-quality compression (PolarQuant method)**: First randomly rotates the data vectors (using orthogonal transforms such as the Hadamard transform). This step simplifies the data's geometry, so that a standard, high-quality quantizer can be applied to each coordinate of the vector individually. This stage spends most of the compression budget (the majority of the bits) on capturing the main direction and magnitude of the original vector.

2. **Eliminating hidden errors (QJL method)**: Applies the QJL (Quantized Johnson-Lindenstrauss) algorithm to the small error left over from the first stage, using only a little remaining compression budget (just 1 bit). The QJL stage acts as a mathematical error checker that eliminates bias, yielding more accurate attention scores.

### K/V Quantization Scheme

- **K Path - QJL4 Quantization**

  - Uses a 3-bit Lloyd-Max codebook for MSE quantization (captures the main information)
  - Uses 1-bit QJL to store the residual sign (eliminates error bias)
  - Each token's K is compressed to 4 bits per element

- **V Path - MSE int2 Quantization**

  - Uses a 2-bit Lloyd-Max codebook for MSE quantization
  - Each token's V is compressed to 2 bits per element
  - Stores normalization coefficients for dequantization

### Advantages

- **Near-lossless accuracy**: The PolarQuant + QJL combination achieves a high compression rate while maintaining model accuracy
- **Higher compression ratio**: K 4bit + V 2bit = 3 bits on average, a further reduction compared to int4's 4 bits
- **Eliminates quantization bias**: The QJL algorithm acts as an error checker, effectively eliminating quantization-induced bias

### Performance Benchmark

Tested on H200 with the Qwen3-30B-A3B-Base model and the ShareGPT dataset:

| Metric             | Baseline (quant_policy=0) | TurboQuant (quant_policy=42) | Change     |
| ------------------ | ------------------------- | ---------------------------- | ---------- |
| Input throughput   | 2368.8 tok/s              | 2195.8 tok/s                 | -7.3%      |
| Output throughput  | 2186.7 tok/s              | 2027.0 tok/s                 | -7.3%      |
| Request throughput | 10.74 req/s               | 9.96 req/s                   | -7.3%      |
| Mean E2E latency   | 5.888s                    | 6.348s                       | +7.8%      |
| Mean TTFT          | 1.139s                    | 1.235s                       | +8.4%      |
| Mean TPOT           | 0.024s                    | 0.026s                       | +8.3%      |
| Mean ITL            | 0.059s                    | 0.059s                       | ~unchanged |

**Test configuration**: GPU: H200, Model: Qwen3-30B-A3B-Base, Dataset: ShareGPT, Concurrency: 64, Requests: 5000

**Takeaway**: TurboQuant K4V2 delivers roughly 5x KV cache memory compression at an end-to-end performance cost of about 7%-8%, a reasonable trade-off for memory-bound serving scenarios.

### Limitations

- **PytorchEngine only**: TurboQuant currently supports only the PyTorch engine, not the TurboMind engine
- **MLA not supported**: Does not support the Multi-head Latent Attention architecture
- **Speculative decoding not supported**: Does not support speculative decoding
- Requires head_dim to be a power of 2
- Requires the `fast_hadamard_transform` package for best performance (optional)

### Optional Dependency

TurboQuant uses the Hadamard transform to accelerate the quantization process. Installing `fast_hadamard_transform` provides better performance:

```shell
pip install fast_hadamard_transform
```

Without this dependency, TurboQuant still works correctly, but performance may be slightly reduced.

Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please install lmdeploy

```shell
@@ -31,7 +99,7 @@ pip install lmdeploy

Applying kv quantization via LMDeploy is very simple; just set the `quant_policy` parameter.

- **LMDeploy specifies that `qant_policy=4` stands for kv int4 quantization and `quant_policy=8` stands for kv int8 quantization.**
+ **LMDeploy specifies that `quant_policy=4` stands for kv int4 quantization, `quant_policy=8` stands for kv int8 quantization, and `quant_policy=42` stands for TurboQuant quantization.**

### Offline Inference

@@ -49,6 +117,22 @@ print(response)
lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
```

### TurboQuant Quantization

TurboQuant quantization uses `quant_policy=42` and is **PytorchEngine only**:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
    quant_policy=42,  # TurboQuant: K=4bit QJL4 + V=2bit MSE
)
pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
response = pipe(["Hello, how are you?"], gen_config=GenerationConfig(max_new_tokens=30))
print(response[0].text)
```

## Accuracy Evaluation

We apply LMDeploy's kv quantization to several LLM models and use OpenCompass to evaluate inference accuracy. The results are shown in the table below:

lmdeploy/cli/utils.py

Lines changed: 5 additions & 2 deletions
@@ -267,11 +267,14 @@ def max_batch_size(parser):
     def quant_policy(parser, default: int = 0):
         """Add argument quant_policy to parser."""

+        from lmdeploy.messages import QuantPolicy
+
         return parser.add_argument('--quant-policy',
                                    type=int,
                                    default=0,
-                                   choices=[0, 4, 8],
-                                   help='Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv')
+                                   choices=list(QuantPolicy),
+                                   help='KV cache quantization policy. '
+                                   '0: no quantization; 4: 4-bit; 8: 8-bit; 42: TurboQuant (K4V2)')

     @staticmethod
     def rope_scaling_factor(parser):
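
The enum imported above (`lmdeploy.messages.QuantPolicy`) is not shown in this commit; a hypothetical sketch of its likely shape, to illustrate why `choices=list(QuantPolicy)` still works with `type=int`:

```python
# Hypothetical sketch only -- the real QuantPolicy definition may differ.
from enum import IntEnum

class QuantPolicy(IntEnum):
    NONE = 0         # no KV quantization
    KV_INT4 = 4      # 4-bit kv cache
    KV_INT8 = 8      # 8-bit kv cache
    TURBOQUANT = 42  # TurboQuant: K=4bit QJL4 + V=2bit MSE

# With an IntEnum, argparse's type=int conversion turns "--quant-policy 42" into
# the int 42, and 42 == QuantPolicy.TURBOQUANT, so the membership check against
# choices=list(QuantPolicy) passes.
print(list(QuantPolicy))
```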
