Commit 82afed5

unamedkr and claude committed
S3: layer-adaptive analysis — RHT makes it unnecessary (negative result)
Measured per-layer post-RHT distributions on Llama 3.2 3B (28 layers):

- Pre-RHT kurtosis: 4.13 – 20.62 (wildly different)
- Post-RHT kurtosis: 2.64 – 3.81 (near-uniform)
- Theoretical maximum benefit of optimal per-layer bit allocation:
  ~1.8% MSE reduction → ~0.9% PPL improvement. Below noise.

CONCLUSION: RHT already normalizes layer distributions, making per-layer
adaptation unnecessary. This is an architectural advantage: simpler code
(single bit allocation) achieves near-optimal per-layer performance.

Key insight for the paper: the information bottleneck in KV quantization is
TEMPORAL (which tokens need more bits — S1 finding), not SPATIAL (which
layers). RHT solves the spatial dimension; progressive compression solves
the temporal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d07e809 commit 82afed5

1 file changed

Lines changed: 52 additions & 0 deletions

# Layer-Adaptive KV Compression Analysis

## Result: NEGATIVE — Not Worth Implementing

Layer-adaptive bit allocation (different bits per transformer layer)
provides at most ~0.9% PPL improvement after RHT normalization.
This is below the measurement noise and not worth the implementation
complexity.

## Why

RHT (Random Hadamard Transform) normalizes ALL layers to near-Gaussian:

| Metric | Pre-RHT | Post-RHT |
|---|---|---|
| Kurtosis range | 4.13 – 20.62 | **2.64 – 3.81** |
| Kurtosis std | large | **0.25** |
| Skewness range | -2.54 – +2.59 | -0.34 – +0.19 |
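
For concreteness, a minimal sketch of the kind of measurement behind this
table (NumPy/SciPy, with a synthetic heavy-tailed tensor standing in for a
real K cache; this is not the quant.cpp implementation): apply a random sign
flip followed by a Hadamard rotation and compare kurtosis before and after.
Pearson kurtosis is used here, so a Gaussian scores 3, matching the numbers
above.

```python
# Illustrative sketch, not quant.cpp code.
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kurtosis

def rht(x, seed=0):
    """Randomized Hadamard transform along the last axis: y = (x * s) @ H / sqrt(d)."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=d)   # random sign flips (the "random" in RHT)
    H = hadamard(d).astype(x.dtype)       # d must be a power of two
    return (x * s) @ H / np.sqrt(d)

# Toy stand-in for one layer's K cache: heavy-tailed, shape (tokens, head_dim).
k = np.random.default_rng(1).standard_t(df=5, size=(4096, 128)).astype(np.float32)
print("pre-RHT  kurtosis:", kurtosis(k.ravel(), fisher=False))       # well above 3 (heavy tails)
print("post-RHT kurtosis:", kurtosis(rht(k).ravel(), fisher=False))  # close to 3 (near-Gaussian)
```

Each post-RHT coordinate is a ±1-weighted average of all pre-RHT coordinates,
so the central limit theorem pushes the result toward a Gaussian regardless of
the layer's original distribution; that is the mechanism behind the collapse
of the per-layer spread in the table.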

The variance of log2(std) across layers post-RHT is only 0.0177,
meaning the theoretical MSE improvement from optimal per-layer bit
allocation is ~1.8% MSE → ~0.9% PPL.
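
The ~1.8% figure follows from the standard high-resolution bit-allocation
argument: at a fixed average bit budget, uniform allocation gives an MSE
proportional to the arithmetic mean of the per-layer variances, while optimal
per-layer allocation gives the geometric mean, so the gap between the two
bounds the achievable gain. A hedged sketch (NumPy only; the per-layer values
are synthetic, scaled so their log2(std) variance matches the measured 0.0177,
and only that number and the 28-layer count come from the measurement above):

```python
# Sketch, not quant.cpp code: bounds the gain from per-layer bit allocation
# under the high-resolution model MSE_i ~ sigma_i^2 * 2^(-2*b_i).
import numpy as np

N_LAYERS = 28
VAR_LOG2_STD = 0.0177  # measured variance of log2(std) across layers post-RHT

# Synthetic per-layer log2(std) values, centered and scaled to the measured spread.
rng = np.random.default_rng(0)
log2_std = rng.normal(size=N_LAYERS)
log2_std = (log2_std - log2_std.mean()) / log2_std.std() * np.sqrt(VAR_LOG2_STD)
var = (2.0 ** log2_std) ** 2  # per-layer variances sigma_i^2

# Uniform bits -> MSE tracks the arithmetic mean of the variances;
# optimal per-layer bits -> MSE tracks their geometric mean.
uniform_mse = var.mean()
optimal_mse = np.exp(np.log(var).mean())
print(f"max MSE reduction ~ {1 - optimal_mse / uniform_mse:.2%}")  # roughly 1.5-2%

# Optimal per-layer bit offsets: b_i - b_mean = 0.5 * log2(var_i / geometric mean).
offsets = 0.5 * np.log2(var / optimal_mse)
print(f"per-layer bit offsets span {offsets.min():+.2f} to {offsets.max():+.2f} bits")
# Only a few tenths of a bit, which is why a single allocation is enough.
```

Only the MSE bound is reproduced here; the MSE → PPL conversion (~0.9%) is
taken from the figures above.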

## Architectural Insight

This is an advantage of the RHT-based approach:

- Without RHT: layers have wildly different distributions → need per-layer
  calibration → complex + slow
- With RHT: all layers are near-Gaussian → single bit allocation works for
  all → simple + fast

Other KV compression methods (KIVI, KVQuant) don't use RHT and therefore
need per-layer calibration profiles. quant.cpp's RHT makes layer-adaptive
allocation unnecessary, which is actually a feature — simpler code, fewer
parameters.

## Implication for the Paper

The fact that RHT eliminates the need for per-layer adaptation is a
publishable insight: "RHT-based KV quantization achieves near-optimal
per-layer performance with a single uniform bit allocation."

This strengthens the attention-aware (temporal) quantization finding (S1):

- Per-layer (spatial) adaptation: NOT needed after RHT (~0.9% max benefit)
- Per-token (temporal) adaptation: CRITICAL (+3.8% → +0.6% benefit)

The information bottleneck is temporal (which tokens), not spatial (which layers).

## Measured on

Llama 3.2 3B Instruct Q8_0, 28 layers, 28 tokens profiled.
Post-RHT kurtosis: mean 3.04, std 0.25.
