
Commit 68f9516

docs: add TurboQuant KV quantization documentation
- Add TurboQuant section based on Google Research's technology (ICLR 2026)
- Explain principles: PolarQuant (high-quality compression) + QJL (error elimination)
- Add K=4bit QJL4 + V=2bit MSE quantization scheme details
- Include performance benchmark on H200 with Qwen3-30B-A3B-Base
- Document limitations: PytorchEngine only, no MLA, no speculative decoding
- Add optional dependency for fast_hadamard_transform
- Improve quant_policy CLI help text with TurboQuant reference
1 parent e1bfc96 commit 68f9516

3 files changed

Lines changed: 175 additions & 4 deletions


docs/en/quantization/kv_quant.md

Lines changed: 85 additions & 1 deletion
@@ -21,6 +21,74 @@ In summary, LMDeploy kv quantization has the following advantages:

3. KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
4. Efficient inference, with int8/int4 kv quantization applied to llama2-7b, RPS is improved by around 30% and 40% respectively compared to fp16

## TurboQuant

LMDeploy supports KV quantization based on [Google Research's TurboQuant technology](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) (to be presented at ICLR 2026), achieving a higher compression ratio with near-zero accuracy loss through a K=4bit QJL4 + V=2bit MSE combination.

### Principles

TurboQuant achieves efficient compression through two key steps; a toy numerical sketch follows the list:

1. **High-quality compression (PolarQuant method)**: First randomly rotates the data vectors (using orthogonal transforms such as the Hadamard transform). This step simplifies the data's geometry, making it easy to apply a standard, high-quality quantizer to each coordinate of the vector individually. This stage spends most of the compression budget (the majority of the bits) on capturing the main direction and magnitude of the original vector.
2. **Eliminating hidden errors (QJL method)**: Uses a small, residual amount of compression power (just 1 bit) to apply the QJL (Quantized Johnson-Lindenstrauss) algorithm to the tiny amount of error left over from the first stage. The QJL stage acts as a mathematical error-checker that eliminates bias, leading to more accurate attention scores.
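
A toy numerical sketch of the two steps, assuming NumPy/SciPy are available and using placeholder codebook values; it only mirrors the structure described above and is not the actual LMDeploy kernel:

```python
# Toy illustration of the two TurboQuant steps on one head_dim-sized vector.
# NOT LMDeploy's kernel: the codebook and scaling rule here are placeholders.
import numpy as np
from scipy.linalg import hadamard

d = 64                                          # head_dim, must be a power of 2
rng = np.random.default_rng(0)
x = rng.standard_normal(d).astype(np.float32)   # one K (or V) vector

# Step 1: random rotation via a randomized Hadamard transform (orthonormal),
# which spreads the vector's energy evenly across coordinates.
signs = rng.choice(np.array([-1.0, 1.0], dtype=np.float32), size=d)
H = hadamard(d).astype(np.float32) / np.sqrt(d)
x_rot = H @ (signs * x)

# Step 1 (cont.): coarse per-coordinate quantization with a small codebook
# (a stand-in for the 3-bit Lloyd-Max codebook used on the K path).
codebook = np.linspace(-1.75, 1.75, 8, dtype=np.float32)       # 8 levels = 3 bits
scale = np.abs(x_rot).max() / np.abs(codebook).max()
idx = np.abs(x_rot[:, None] / scale - codebook[None, :]).argmin(axis=1)
x_hat = codebook[idx] * scale

# Step 2: spend 1 extra bit per coordinate on the sign of the leftover error
# (the QJL idea: a cheap sign-based sketch that cancels quantization bias).
residual = x_rot - x_hat
residual_sign = np.signbit(residual)            # 1 bit per coordinate

print("norm preserved by rotation:", np.allclose(np.linalg.norm(x), np.linalg.norm(x_rot), atol=1e-4))
print("coarse-stage MSE:", float(np.mean(residual ** 2)))
```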

### K/V Quantization Scheme

- **K Path - QJL4 Quantization**:

  - Uses a 3-bit Lloyd-Max codebook for MSE quantization (captures the main information)
  - Uses 1-bit QJL to store the residual sign (eliminates error bias)
  - Each token's K is compressed to 4 bits per element

- **V Path - MSE int2 Quantization**:

  - Uses a 2-bit Lloyd-Max codebook for MSE quantization
  - Each token's V is compressed to 2 bits per element
  - Stores normalization coefficients for dequantization
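
An illustrative V-path quantizer matching this layout (a 2-bit codebook index per element plus a per-token normalization coefficient); the 4-level codebook approximates the Lloyd-Max quantizer for unit-variance Gaussian data, and both the codebook and the scaling rule are assumptions, not LMDeploy's kernels:

```python
# Minimal sketch of the V path (2-bit MSE) for one token's V vector.
import numpy as np

V_CODEBOOK = np.array([-1.510, -0.4528, 0.4528, 1.510], dtype=np.float32)  # 2 bits

def quantize_v(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Return (2-bit codebook indices, per-token normalization coefficient)."""
    scale = float(v.std()) + 1e-6          # stored alongside the indices
    idx = np.abs(v[:, None] / scale - V_CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_v(idx: np.ndarray, scale: float) -> np.ndarray:
    return V_CODEBOOK[idx] * scale

v = np.random.default_rng(1).standard_normal(128).astype(np.float32)
idx, scale = quantize_v(v)
v_hat = dequantize_v(idx, scale)
print("bits per element:", 2, "| reconstruction MSE:", float(np.mean((v - v_hat) ** 2)))
```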

### Advantages

- **Near-lossless accuracy**: The PolarQuant + QJL combination achieves a high compression rate while maintaining model accuracy
- **Higher compression ratio**: K 4bit + V 2bit = 3 bits on average, a further reduction compared to int4's 4 bits (see the arithmetic below)
- **Eliminates quantization bias**: The QJL algorithm acts as an error checker, effectively eliminating quantization-induced bias
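
For reference, the arithmetic behind the roughly 5x figure quoted in the benchmark takeaway below, ignoring the small overhead of the stored per-token scales:

```python
# KV cache bits per element pair (one K element + one V element).
fp16_bits = 16 + 16            # K + V in fp16
turbo_bits = 4 + 2             # K: 3-bit index + 1-bit sign, V: 2-bit index
print(fp16_bits / turbo_bits)  # ~5.3x smaller, before per-token scale overhead
```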

### Performance Benchmark

Tested on H200 with the Qwen3-30B-A3B-Base model and the ShareGPT dataset:

| Metric             | Baseline (quant_policy=0) | TurboQuant (quant_policy=42) | Change     |
| ------------------ | ------------------------- | ---------------------------- | ---------- |
| Input throughput   | 2368.8 tok/s              | 2195.8 tok/s                 | -7.3%      |
| Output throughput  | 2186.7 tok/s              | 2027.0 tok/s                 | -7.3%      |
| Request throughput | 10.74 req/s               | 9.96 req/s                   | -7.3%      |
| Mean E2E latency   | 5.888s                    | 6.348s                       | +7.8%      |
| Mean TTFT          | 1.139s                    | 1.235s                       | +8.4%      |
| Mean TPOT          | 0.024s                    | 0.026s                       | +8.3%      |
| Mean ITL           | 0.059s                    | 0.059s                       | ~unchanged |

**Test configuration**: GPU: H200, Model: Qwen3-30B-A3B-Base, Dataset: ShareGPT, Concurrency: 64, Requests: 5000

**Takeaway**: TurboQuant K4V2 achieves roughly 5x KV cache memory reduction at about 7%-8% end-to-end performance overhead, a reasonable trade-off for memory-bound serving scenarios.

### Limitations

- **PytorchEngine only**: TurboQuant currently supports only the PyTorch engine, not the TurboMind engine
- **MLA not supported**: Does not support the Multi-head Latent Attention architecture
- **Speculative decoding not supported**: Does not support speculative decoding
- Requires head_dim to be a power of 2
- Requires the `fast_hadamard_transform` package for best performance (optional)

### Optional Dependency

TurboQuant uses the Hadamard transform to accelerate the quantization process. Installing `fast_hadamard_transform` provides better performance:

```shell
pip install fast_hadamard_transform
```

Without this dependency, TurboQuant still works correctly, but performance may be slightly reduced.
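
A minimal sketch of the kind of optional-import fallback this describes, assuming the package exposes `hadamard_transform(x, scale=...)`; this is not LMDeploy's actual detection code:

```python
# Illustrative fallback: prefer the fused CUDA kernel from fast_hadamard_transform,
# otherwise rotate with a plain matmul against an explicit Hadamard matrix.
import torch
from scipy.linalg import hadamard

try:
    from fast_hadamard_transform import hadamard_transform  # optional dependency
    HAVE_FHT = True
except ImportError:
    HAVE_FHT = False

def rotate(x: torch.Tensor) -> torch.Tensor:
    """Apply an orthonormal Hadamard rotation along the last (power-of-2) dim."""
    d = x.shape[-1]
    if HAVE_FHT and x.is_cuda:
        return hadamard_transform(x, scale=d ** -0.5)
    H = torch.as_tensor(hadamard(d), dtype=x.dtype, device=x.device) * d ** -0.5
    return x @ H
```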

In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in LMDeploy. But before that, please ensure that lmdeploy is installed.

```shell
@@ -31,7 +99,7 @@ pip install lmdeploy

Applying kv quantization and inference via LMDeploy is quite straightforward. Simply set the `quant_policy` parameter.

- **LMDeploy specifies that `quant_policy=4` stands for 4-bit kv, whereas `quant_policy=8` indicates 8-bit kv.**
+ **LMDeploy specifies that `quant_policy=4` stands for 4-bit kv, `quant_policy=8` indicates 8-bit kv, and `quant_policy=42` indicates TurboQuant.**

### Offline inference

@@ -49,6 +117,22 @@ print(response)
lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
```

### TurboQuant

TurboQuant uses `quant_policy=42` and is **PytorchEngine only**:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
    quant_policy=42,  # TurboQuant: K=4bit QJL4 + V=2bit MSE
)
pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
response = pipe(["Hello, how are you?"], gen_config=GenerationConfig(max_new_tokens=30))
print(response[0].text)
```
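
For serving, the same policy can be passed through the CLI flag updated in `lmdeploy/cli/utils.py` below; a hedged example, with the PyTorch backend selected explicitly since TurboQuant is PytorchEngine only:

```shell
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch --quant-policy 42
```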

## Evaluation

We apply kv quantization of LMDeploy to several LLM models and utilize OpenCompass to evaluate the inference accuracy. The results are shown in the table below:

docs/zh_cn/quantization/kv_quant.md

Lines changed: 85 additions & 1 deletion
@@ -21,6 +21,74 @@ LMDeploy kv 4/8 bit quantization and inference support the following NVIDIA GPU models:

3. kv int8 quantization has almost lossless accuracy, and kv int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS improves by nearly 30% and 40% respectively compared to fp16

## TurboQuant Quantization

LMDeploy supports a KV quantization scheme based on [Google Research's TurboQuant technology](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) (to be presented at ICLR 2026), which combines K=4bit QJL4 with V=2bit MSE to achieve a higher compression ratio with nearly lossless accuracy.

### Principles

TurboQuant achieves efficient compression through two key steps:

1. **High-quality compression (PolarQuant method)**: First randomly rotates the data vectors (using orthogonal transforms such as the Hadamard transform). This step simplifies the data's geometry, so that a standard, high-quality quantizer can be applied to each coordinate of the vector individually. This stage spends most of the compression budget (the majority of the bits) on capturing the main direction and magnitude of the original vector.

2. **Eliminating hidden errors (QJL method)**: Applies the QJL (Quantized Johnson-Lindenstrauss) algorithm to the small error left over from the first stage, using only a little remaining compression budget (just 1 bit). The QJL stage acts as a mathematical error checker that eliminates bias, yielding more accurate attention scores.

### K/V Quantization Scheme

- **K Path - QJL4 Quantization**

  - Uses a 3-bit Lloyd-Max codebook for MSE quantization (captures the main information)
  - Uses 1-bit QJL to store the residual sign (eliminates error bias)
  - Each token's K is compressed to 4 bits per element

- **V Path - MSE int2 Quantization**

  - Uses a 2-bit Lloyd-Max codebook for MSE quantization
  - Each token's V is compressed to 2 bits per element
  - Stores normalization coefficients for dequantization

### Advantages

- **Near-lossless accuracy**: The PolarQuant + QJL combination achieves a high compression rate while maintaining model accuracy
- **Higher compression ratio**: K 4bit + V 2bit = 3 bits on average, a further reduction compared to int4's 4 bits
- **Eliminates quantization bias**: The QJL algorithm acts as an error checker, effectively eliminating quantization-induced bias

### Performance Benchmark

Tested on H200 with the Qwen3-30B-A3B-Base model and the ShareGPT dataset:

| Metric             | Baseline (quant_policy=0) | TurboQuant (quant_policy=42) | Change     |
| ------------------ | ------------------------- | ---------------------------- | ---------- |
| Input throughput   | 2368.8 tok/s              | 2195.8 tok/s                 | -7.3%      |
| Output throughput  | 2186.7 tok/s              | 2027.0 tok/s                 | -7.3%      |
| Request throughput | 10.74 req/s               | 9.96 req/s                   | -7.3%      |
| Mean E2E latency   | 5.888s                    | 6.348s                       | +7.8%      |
| Mean TTFT          | 1.139s                    | 1.235s                       | +8.4%      |
| Mean TPOT           | 0.024s                    | 0.026s                       | +8.3%      |
| Mean ITL            | 0.059s                    | 0.059s                       | ~unchanged |

**Test configuration**: GPU: H200, Model: Qwen3-30B-A3B-Base, Dataset: ShareGPT, Concurrency: 64, Requests: 5000

**Takeaway**: TurboQuant K4V2 delivers roughly 5x KV cache memory compression at an end-to-end performance cost of about 7%-8%, a reasonable trade-off for memory-bound serving scenarios.

### Limitations

- **PytorchEngine only**: TurboQuant currently supports only the PyTorch engine, not the TurboMind engine
- **MLA not supported**: Does not support the Multi-head Latent Attention architecture
- **Speculative decoding not supported**: Does not support speculative decoding
- Requires head_dim to be a power of 2
- Requires the `fast_hadamard_transform` package for best performance (optional)

### Optional Dependency

TurboQuant uses the Hadamard transform to accelerate the quantization process. Installing `fast_hadamard_transform` provides better performance:

```shell
pip install fast_hadamard_transform
```

Without this dependency, TurboQuant still works correctly, but performance may be slightly reduced.

Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please install lmdeploy

```shell
@@ -31,7 +99,7 @@ pip install lmdeploy

Applying kv quantization via LMDeploy is very simple; just set the `quant_policy` parameter.

- **LMDeploy specifies that `qant_policy=4` stands for kv int4 quantization and `quant_policy=8` stands for kv int8 quantization.**
+ **LMDeploy specifies that `quant_policy=4` stands for kv int4 quantization, `quant_policy=8` stands for kv int8 quantization, and `quant_policy=42` stands for TurboQuant quantization.**

### Offline Inference

@@ -49,6 +117,22 @@ print(response)
lmdeploy serve api_server internlm/internlm2_5-7b-chat --quant-policy 8
```

### TurboQuant Quantization

TurboQuant quantization uses `quant_policy=42` and is **PytorchEngine only**:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    tp=1,
    cache_max_entry_count=0.8,
    quant_policy=42,  # TurboQuant: K=4bit QJL4 + V=2bit MSE
)
pipe = pipeline("Qwen/Qwen3-8B", backend_config=engine_config)
response = pipe(["Hello, how are you?"], gen_config=GenerationConfig(max_new_tokens=30))
print(response[0].text)
```

## Accuracy Evaluation

We apply LMDeploy's kv quantization to several LLM models and use OpenCompass to evaluate inference accuracy. The results are shown in the table below:

lmdeploy/cli/utils.py

Lines changed: 5 additions & 2 deletions
@@ -267,11 +267,14 @@ def max_batch_size(parser):
     def quant_policy(parser, default: int = 0):
         """Add argument quant_policy to parser."""

+        from lmdeploy.messages import QuantPolicy
+
         return parser.add_argument('--quant-policy',
                                    type=int,
                                    default=0,
-                                   choices=[0, 4, 8],
-                                   help='Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv')
+                                   choices=list(QuantPolicy),
+                                   help='KV cache quantization policy. '
+                                   '0: no quantization; 4: 4-bit; 8: 8-bit; 42: TurboQuant (K4V2)')

     @staticmethod
     def rope_scaling_factor(parser):
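
The enum imported above (`lmdeploy.messages.QuantPolicy`) is not shown in this commit; a hypothetical sketch of its likely shape, to illustrate why `choices=list(QuantPolicy)` still works with `type=int`:

```python
# Hypothetical sketch only -- the real QuantPolicy definition may differ.
from enum import IntEnum

class QuantPolicy(IntEnum):
    NONE = 0         # no KV quantization
    KV_INT4 = 4      # 4-bit kv cache
    KV_INT8 = 8      # 8-bit kv cache
    TURBOQUANT = 42  # TurboQuant: K=4bit QJL4 + V=2bit MSE

# With an IntEnum, argparse's type=int conversion turns "--quant-policy 42" into
# the int 42, and 42 == QuantPolicy.TURBOQUANT, so the membership check against
# choices=list(QuantPolicy) passes.
print(list(QuantPolicy))
```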
