In summary, LMDeploy kv quantization has the following advantages:
3. KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS is improved by around 30% and 40% respectively compared to fp16
## TurboQuant
LMDeploy supports KV quantization based on [Google Research's TurboQuant technology](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) (to be presented at ICLR 2026), achieving a higher compression ratio with near-zero accuracy loss by combining 4-bit QJL4 quantization for K with 2-bit MSE quantization for V.
### Principles
TurboQuant achieves efficient compression through two key steps:
1. **High-quality compression (PolarQuant method)**: first randomly rotates the data vectors (using orthogonal transforms like the Hadamard transform). This clever step simplifies the data's geometry, making it easy to apply a standard, high-quality quantizer to each part of the vector individually. This stage uses most of the compression power (the majority of the bits) to capture the main concept and strength of the original vector.
2. **Eliminating hidden errors (QJL method)**: uses a small, residual amount of compression power (just 1 bit) to apply the QJL (Quantized Johnson-Lindenstrauss) algorithm to the tiny amount of error left over from the first stage. The QJL stage acts as a mathematical error-checker that eliminates bias, leading to more accurate attention scores. A toy sketch of both steps follows this list.
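To make the two steps concrete, here is a toy NumPy sketch. This is not LMDeploy's implementation: a uniform 3-bit grid stands in for the Lloyd-Max codebook, and `scipy` provides the Hadamard matrix.

```python
import numpy as np
from scipy.linalg import hadamard

d = 64                                         # head_dim; must be a power of 2
rng = np.random.default_rng(0)
x = rng.standard_normal(d).astype(np.float32)  # one K/V head vector

# Step 1: random orthogonal rotation = random sign flips + Hadamard transform
signs = rng.choice([-1.0, 1.0], size=d).astype(np.float32)
H = hadamard(d).astype(np.float32) / np.sqrt(d)
x_rot = H @ (signs * x)

# Coarse quantization: a uniform 8-level (3-bit) grid stands in for Lloyd-Max
scale = np.abs(x_rot).max() / 3.5
codes = np.clip(np.round(x_rot / scale + 3.5), 0, 7)   # integer codes in [0, 7]
x_hat = (codes - 3.5) * scale                          # reconstruction

# Step 2: spend 1 extra bit per coordinate on the sign of the leftover error;
# QJL uses these signs to debias the attention-score (inner product) estimate
residual_sign = np.sign(x_rot - x_hat).astype(np.int8)
```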
### K/V Quantization Scheme
- **K Path - QJL4 Quantization**:
  - Uses a 3-bit Lloyd-Max codebook for MSE quantization (captures the main information)
  - Uses 1-bit QJL to store the residual sign (eliminates error bias)
  - Each token's K is compressed to 4 bits
- **V Path - MSE int2 Quantization** (see the sketch after this list):
  - Uses a 2-bit Lloyd-Max codebook for MSE quantization
  - Each token's V is compressed to 2 bits
  - Stores normalization coefficients for dequantization
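As a rough illustration of the V path, the sketch below uses the classic 2-bit Lloyd-Max codebook for a unit Gaussian (levels near ±0.4528 and ±1.510; after the random rotation, coordinates are approximately Gaussian) together with a stored per-token scale. This is a simplified NumPy stand-in, not the actual kernel:

```python
import numpy as np

# 2-bit Lloyd-Max codebook for a unit Gaussian (classic textbook values)
V_CODEBOOK = np.array([-1.510, -0.4528, 0.4528, 1.510], dtype=np.float32)

def quantize_v(v: np.ndarray):
    """Return 2-bit codes plus the normalization coefficient kept for dequant."""
    scale = float(np.std(v)) or 1.0            # per-token scale to store
    codes = np.abs(v[:, None] / scale - V_CODEBOOK[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_v(codes: np.ndarray, scale: float) -> np.ndarray:
    # Look up codewords and undo the normalization
    return V_CODEBOOK[codes] * scale
```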
### Advantages
- **Near-zero accuracy loss**: the PolarQuant + QJL combination achieves a high compression rate while maintaining model accuracy
- **Higher compression ratio**: 4-bit K + 2-bit V averages 3 bits per element, a further reduction compared to int4 kv's 4 bits
**Takeaway**: TurboQuant K4V2 achieves ~5x KV cache memory reduction with about 7%-8% end-to-end performance overhead, which looks like a reasonable trade-off for memory-bound serving scenarios.
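The ~5x figure follows from simple arithmetic on the bit widths, ignoring the small overhead of the stored normalization coefficients:

```python
K_BITS, V_BITS, FP16_BITS = 4, 2, 16
avg_bits = (K_BITS + V_BITS) / 2               # K and V are the same size per token
print(f"average bits/element: {avg_bits}")                      # 3.0
print(f"ideal reduction vs fp16: {FP16_BITS / avg_bits:.1f}x")  # 5.3x, ~5x after scale overhead
```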
### Limitations
- **PytorchEngine only**: TurboQuant currently supports only the PyTorch engine, not the TurboMind engine
- **MLA not supported**: does not support the Multi-head Latent Attention architecture
- **Speculative decoding not supported**: does not support speculative decoding
- Requires `head_dim` to be a power of 2 (see the check after this list)
- The `fast_hadamard_transform` package is optional but recommended for best performance
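The power-of-2 requirement on `head_dim` comes from the Hadamard-based rotation. A quick sanity check (an illustrative helper, not an LMDeploy API):

```python
def is_power_of_two(n: int) -> bool:
    # n has exactly one bit set iff n & (n - 1) == 0
    return n > 0 and (n & (n - 1)) == 0

assert is_power_of_two(128)   # e.g. llama2-7b uses head_dim = 128
```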
81
+
82
+
### Optional Dependency
TurboQuant uses the Hadamard transform to accelerate the quantization process. Installing `fast_hadamard_transform` provides better performance:
```shell
pip install fast_hadamard_transform
```
Without this dependency, TurboQuant still works correctly, but performance may be slightly reduced.
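To check at runtime whether the fast path is available, a guarded import works. This sketch assumes the package exposes `hadamard_transform`, as in its published examples:

```python
try:
    from fast_hadamard_transform import hadamard_transform  # CUDA fast path
    HAS_FHT = True
except ImportError:
    HAS_FHT = False  # TurboQuant still works; the transform just runs slower
```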
In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference with LMDeploy. But before that, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy
```
Applying kv quantization and inference via LMDeploy is quite straightforward. Simply set the `quant_policy` parameter.
**LMDeploy specifies that `quant_policy=4` stands for 4-bit kv, `quant_policy=8` indicates 8-bit kv, and `quant_policy=42` indicates TurboQuant.**
For example, a minimal sketch of enabling TurboQuant on the PyTorch engine (the pipeline setup below is an assumption for illustration; adjust the model path as needed):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# quant_policy=42 enables TurboQuant (PyTorch engine only)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig(quant_policy=42))
response = pipe.infer("Hello, how are you?", max_new_tokens=30)
print(response.text)
```
## Evaluation
We apply kv quantization of LMDeploy to several LLM models and utilize OpenCompass to evaluate the inference accuracy. The results are shown in the table below: