| uniform 4b K + Q4 V | 3.8x | -7.8% | Simple, no delta overhead |
| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |

### QK-norm aware compression (Gemma 4)
Models that apply QK-norm, such as Gemma 4, normalize key vectors onto the unit sphere, which produces extremely sparse distributions: of 256 dimensions, only about 56 are active. Standard 4-bit quantization destroys the directional information in these keys; cosine similarity between the original and quantized keys drops to 0.62.

quant.cpp automatically detects QK-normed models, keeps their keys in FP32, and quantizes only the values to Q4. Keys stay at full precision while value memory shrinks by **3.5x**.

| Config | Compression | Quality (Gemma 4) |
|--------|-------------|-------------------|
| FP32 K + Q4 V (auto) | 3.5x V savings | Correct: "Paris", "서울" (Seoul) |
| 4-bit K (forced) | 3.8x total | Broken: cosine=0.62 |
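
To illustrate how coarse quantization can erase the direction of a vector dominated by a few large components, here is a minimal Python sketch. It is not quant.cpp's detection logic, which this section does not describe. The symmetric per-vector 4-bit scheme, the function names, and the `min_cosine` threshold are all assumptions made for illustration.

```python
import math


def quantize_roundtrip_4bit(vec):
    """Quantize to integer levels in [-7, 7] with one scale per vector,
    then dequantize. Illustrative scheme, not quant.cpp's quantizer."""
    peak = max(abs(x) for x in vec)
    if peak == 0.0:
        return list(vec)
    scale = peak / 7.0
    return [max(-7, min(7, round(x / scale))) * scale for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def choose_key_format(keys, min_cosine=0.99):
    """Hypothetical policy: keep keys in FP32 when the worst round-trip
    cosine similarity falls below min_cosine, otherwise allow Q4."""
    worst = min(cosine_similarity(k, quantize_roundtrip_4bit(k)) for k in keys)
    return ("q4" if worst >= min_cosine else "fp32"), worst


# One large component plus many small ones: the small ones round to zero.
peaky_key = [1.0] + [0.05] * 400
print(choose_key_format([peaky_key]))
```

In this toy scheme, any component smaller than half the quantization step rounds to zero, so a vector with one large entry and many small ones loses most of its direction and the hypothetical policy falls back to FP32 keys.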

### Delta compression

Standard KV caching stores each key vector as-is. Delta mode instead stores the residual `key[t] - reconstruct(key[t-1])`, much as video P-frames store only the difference from the previous frame.
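
The residual scheme above might be sketched as follows in Python. The uniform scalar quantizer, the `step` parameter, and the function names are assumptions for illustration; the text does not specify how quant.cpp encodes the residuals. Subtracting the reconstructed previous key, rather than the original one, keeps the encoder and decoder in sync, so quantization error does not accumulate across tokens.

```python
def quantize(vec, step):
    """Uniform scalar quantizer (assumed): map each value to an integer code."""
    return [round(x / step) for x in vec]


def dequantize(codes, step):
    """Inverse of quantize, up to rounding error of at most step / 2."""
    return [c * step for c in codes]


def delta_encode(keys, step=0.05):
    """Encode the first key directly, then each later key as a quantized
    residual against the reconstruction of the previous key."""
    frames, prev = [], None
    for key in keys:
        if prev is None:
            codes = quantize(key, step)
            recon = dequantize(codes, step)
        else:
            codes = quantize([k - p for k, p in zip(key, prev)], step)
            recon = [p + d for p, d in zip(prev, dequantize(codes, step))]
        frames.append(codes)
        prev = recon  # track what the decoder will see, not the original key
    return frames


def delta_decode(frames, step=0.05):
    """Rebuild keys by accumulating dequantized residuals."""
    keys, prev = [], None
    for codes in frames:
        delta = dequantize(codes, step)
        recon = delta if prev is None else [p + d for p, d in zip(prev, delta)]
        keys.append(recon)
        prev = recon
    return keys


# Slowly drifting keys produce small residual codes after the first frame.
keys = [[0.1 * t + 0.01 * i for i in range(4)] for t in range(20)]
frames = delta_encode(keys)
decoded = delta_decode(frames)
```

Because consecutive keys in this toy input differ only slightly, every frame after the first holds small integer codes, which is what makes the residuals cheaper to store than the keys themselves.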