Commit 82afed5

unamedkr and claude committed
S3: layer-adaptive analysis — RHT makes it unnecessary (negative result)
Measured per-layer post-RHT distributions on Llama 3.2 3B (28 layers):

- Pre-RHT kurtosis: 4.13 – 20.62 (wildly different)
- Post-RHT kurtosis: 2.64 – 3.81 (near-uniform)
- Theoretical maximum benefit of optimal per-layer bit allocation:
  ~1.8% MSE reduction → ~0.9% PPL improvement. Below noise.

CONCLUSION: RHT already normalizes layer distributions, making per-layer
adaptation unnecessary. This is an architectural advantage: simpler code
(single bit allocation) achieves near-optimal per-layer performance.

Key insight for the paper: the information bottleneck in KV quantization is
TEMPORAL (which tokens need more bits — S1 finding), not SPATIAL (which
layers). RHT solves the spatial dimension; progressive compression solves
the temporal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d07e809 commit 82afed5

1 file changed

Lines changed: 52 additions & 0 deletions

# Layer-Adaptive KV Compression Analysis

## Result: NEGATIVE — Not Worth Implementing

Layer-adaptive bit allocation (different bits per transformer layer)
provides at most ~0.9% PPL improvement after RHT normalization.
This is below the measurement noise and not worth the implementation
complexity.

## Why

RHT (Random Hadamard Transform) normalizes ALL layers to near-Gaussian:

| Metric | Pre-RHT | Post-RHT |
|---|---|---|
| Kurtosis range | 4.13 – 20.62 | **2.64 – 3.81** |
| Kurtosis std | large | **0.25** |
| Skewness range | -2.54 – +2.59 | -0.34 – +0.19 |
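
For concreteness, a minimal sketch of the kind of measurement behind this
table (NumPy/SciPy, with a synthetic heavy-tailed tensor standing in for a
real K cache; this is not the quant.cpp implementation): apply a random sign
flip followed by a Hadamard rotation and compare kurtosis before and after.
Pearson kurtosis is used here, so a Gaussian scores 3, matching the numbers
above.

```python
# Illustrative sketch, not quant.cpp code.
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kurtosis

def rht(x, seed=0):
    """Randomized Hadamard transform along the last axis: y = (x * s) @ H / sqrt(d)."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=d)   # random sign flips (the "random" in RHT)
    H = hadamard(d).astype(x.dtype)       # d must be a power of two
    return (x * s) @ H / np.sqrt(d)

# Toy stand-in for one layer's K cache: heavy-tailed, shape (tokens, head_dim).
k = np.random.default_rng(1).standard_t(df=5, size=(4096, 128)).astype(np.float32)
print("pre-RHT  kurtosis:", kurtosis(k.ravel(), fisher=False))       # well above 3 (heavy tails)
print("post-RHT kurtosis:", kurtosis(rht(k).ravel(), fisher=False))  # close to 3 (near-Gaussian)
```

Each post-RHT coordinate is a ±1-weighted average of all pre-RHT coordinates,
so the central limit theorem pushes the result toward a Gaussian regardless of
the layer's original distribution; that is the mechanism behind the collapse
of the per-layer spread in the table.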

The variance of log2(std) across layers post-RHT is only 0.0177,
meaning the theoretical MSE improvement from optimal per-layer bit
allocation is ~1.8% MSE → ~0.9% PPL.
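
The ~1.8% figure follows from the standard high-resolution bit-allocation
argument: at a fixed average bit budget, uniform allocation gives an MSE
proportional to the arithmetic mean of the per-layer variances, while optimal
per-layer allocation gives the geometric mean, so the gap between the two
bounds the achievable gain. A hedged sketch (NumPy only; the per-layer values
are synthetic, scaled so their log2(std) variance matches the measured 0.0177,
and only that number and the 28-layer count come from the measurement above):

```python
# Sketch, not quant.cpp code: bounds the gain from per-layer bit allocation
# under the high-resolution model MSE_i ~ sigma_i^2 * 2^(-2*b_i).
import numpy as np

N_LAYERS = 28
VAR_LOG2_STD = 0.0177  # measured variance of log2(std) across layers post-RHT

# Synthetic per-layer log2(std) values, centered and scaled to the measured spread.
rng = np.random.default_rng(0)
log2_std = rng.normal(size=N_LAYERS)
log2_std = (log2_std - log2_std.mean()) / log2_std.std() * np.sqrt(VAR_LOG2_STD)
var = (2.0 ** log2_std) ** 2  # per-layer variances sigma_i^2

# Uniform bits -> MSE tracks the arithmetic mean of the variances;
# optimal per-layer bits -> MSE tracks their geometric mean.
uniform_mse = var.mean()
optimal_mse = np.exp(np.log(var).mean())
print(f"max MSE reduction ~ {1 - optimal_mse / uniform_mse:.2%}")  # roughly 1.5-2%

# Optimal per-layer bit offsets: b_i - b_mean = 0.5 * log2(var_i / geometric mean).
offsets = 0.5 * np.log2(var / optimal_mse)
print(f"per-layer bit offsets span {offsets.min():+.2f} to {offsets.max():+.2f} bits")
# Only a few tenths of a bit, which is why a single allocation is enough.
```

Only the MSE bound is reproduced here; the MSE → PPL conversion (~0.9%) is
taken from the figures above.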

## Architectural Insight

This is an advantage of the RHT-based approach:

- Without RHT: layers have wildly different distributions → need per-layer
  calibration → complex + slow
- With RHT: all layers are near-Gaussian → single bit allocation works for
  all → simple + fast

Other KV compression methods (KIVI, KVQuant) don't use RHT and therefore
need per-layer calibration profiles. quant.cpp's RHT makes layer-adaptive
allocation unnecessary, which is actually a feature — simpler code, fewer
parameters.

## Implication for the Paper

The fact that RHT eliminates the need for per-layer adaptation is a
publishable insight: "RHT-based KV quantization achieves near-optimal
per-layer performance with a single uniform bit allocation."

This strengthens the attention-aware (temporal) quantization finding (S1):

- Per-layer (spatial) adaptation: NOT needed after RHT (~0.9% max benefit)
- Per-token (temporal) adaptation: CRITICAL (+3.8% → +0.6% benefit)

The information bottleneck is temporal (which tokens), not spatial (which layers).

## Measured on

Llama 3.2 3B Instruct Q8_0, 28 layers, 28 tokens profiled.
Post-RHT kurtosis: mean 3.04, std 0.25.
