
Commit 4268590

unamedkr and claude committed
HONEST: 4th correction — Tim Dettmers comment was general, not direct feedback to us
User asked to verify whether Tim Dettmers' HIGGS attribution comment in llama.cpp #20969 was actually directed at quant.cpp specifically. After re-checking the discussion timeline:

- 2026-04-07 21:49 @unamedkr (us) first comment in thread
- 2026-04-07 22:08 @Xcc313r4n7 "compares to Rotorquant?"
- 2026-04-07 23:56 @TimDettmers HIGGS comment — top-level, no replyTo, no @-mention
- 2026-04-08 00:11 @caiovicentino (replying to TheTom)
- 2026-04-08 03:14 @unamedkr (us) "@TimDettmers — thank you"

Tim's comment was a top-level message to the thread (which has 6+ forks all loosely calling their work 'TurboQuant'), NOT a direct reply to our comment, NOT an @-mention of us. The substance applied to us along with everyone else in the thread, and we voluntarily chose to update our docs and reply with thanks. But Tim did not single us out, and framing his comment as 'Tim Dettmers gave us direct feedback' overstates the relationship.

Updated:
- README.md / README.ko.md: the HIGGS reference now says 'we added this attribution after seeing Tim Dettmers' general comment in #20969 asking participants in that thread to credit HIGGS instead. His comment was not directed at us specifically, but the substance applied to our naming as well, and we chose to update accordingly.'
- docs/papers/quant_cpp_arxiv_draft.md: Acknowledgements section rewritten with the honest framing.
- bench/results/turboquant_reproduction.md: attribution update note rewritten.
- CHANGELOG.md: the v0.6.5 entry now lists this as the 4th honest correction in the v0.6.x series and explicitly notes that the v0.6.4 commit message (commit 9481870) overstated the framing.

Saved to memory as feedback_dont_personalize_general_comments.md so future sessions distinguish (a) a direct reply / @-mention from (b) a general top-level comment whose substance happens to apply.

The substance of the correction (HIGGS attribution) is unchanged. Only the framing of how the feedback reached us has been corrected.

Honest corrections so far:
- v0.6.0 'lossless 7×' → '+6.3% PPL'
- v0.6.4 'beats fp32' → '−7% vs fp32 (NEON)'
- v0.6.5 'with Metal' → 'without Metal (user default)'
- v0.6.5 post 'Tim gave us feedback' → 'general comment we observed'

Each was caught by validation. Validation > marketing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent efbc023 commit 4268590

5 files changed

Lines changed: 6 additions & 5 deletions


CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -42,11 +42,12 @@ No source code changes — the CMake default was already `OFF`. The bug was in o

### Honest corrections so far in the v0.6.x series

- This is now the **third** honest correction we've caught and fixed before it spread:
+ This is now the **fourth** honest correction we've caught and fixed before it spread:

1. **v0.6.0**: "lossless 7× compression" → measured "+6.3% PPL on Llama 3.2 3B"
2. **v0.6.4**: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
3. **v0.6.5**: "benchmarks with Metal" → re-measured "benchmarks without Metal (which is the user default)"
+ 4. **v0.6.5 (post-release)**: "Tim Dettmers gave us direct feedback" → "Tim's general comment to a thread we participate in happened to apply to us; we incorporated it voluntarily, not as a direct response". Earlier docs and the v0.6.4 commit message overstated the relationship; the substance of HIGGS attribution is unchanged but the framing has been corrected in README, README.ko, the arXiv draft, and `bench/results/turboquant_reproduction.md`.

Each correction was caught by the validation discipline documented in our `feedback_validation_first` memory. **Validation > marketing.**

README.ko.md

Lines changed: 1 addition & 1 deletion
@@ -483,7 +483,7 @@ Runs on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.

quant.cpp is an independent implementation of published research. The Variant F architecture (RHT preprocessing + scalar Lloyd-Max codebook, no QJL stage) sits in the lineage of two prior works:

- - **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern for weight quantization. Our `tq_rht.c` (Walsh-Hadamard + Rademacher) follows this pattern. *Thanks to Tim Dettmers for pointing this out in the [llama.cpp #20969 discussion](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16481725).*
+ - **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern for weight quantization. Our `tq_rht.c` (Walsh-Hadamard + Rademacher) follows this pattern. *We added this attribution after seeing Tim Dettmers' general comment in [llama.cpp discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969) asking participants in that thread (where 6+ forks are loosely using the name "TurboQuant") to credit HIGGS. His comment was not directed at us specifically, but the substance applied to our naming as well, and we corrected it voluntarily.*
- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874). TurboQuant applies the rotation pattern to the **KV cache**, adding a 1-bit QJL residual and per-channel outlier handling. Our work started as a direct port of TurboQuant and was simplified over 9 rounds of the Karpathy loop (QJL removed, outlier channels removed) into the current Variant F. We do not claim the shipped variant is the TurboQuant algorithm — it is an empirically derived simplification.
- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617). The polar-coordinate KV quantization that our `tq_polar.c` baseline implements.
- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482). The 1-bit sketch building block. Used in our `tq_qjl.c` baseline; we found it contributed ~zero to attention scores in the Variant F regime and dropped it.

README.md

Lines changed: 1 addition & 1 deletion
@@ -508,7 +508,7 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce

quant.cpp is an independent implementation of published research. The Variant F architecture (RHT preprocessing + scalar Lloyd-Max codebook on rotated values, no QJL stage) sits in a lineage that combines two prior works:

- - **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern (for weight quantization). Our `tq_rht.c` Walsh-Hadamard + Rademacher implementation follows this pattern. *Credit to Tim Dettmers ([discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16481725)) for pointing this out.*
+ - **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern (for weight quantization). Our `tq_rht.c` Walsh-Hadamard + Rademacher implementation follows this pattern. *We added this attribution after seeing [Tim Dettmers' general comment in llama.cpp discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969) asking participants in that thread (which uses "TurboQuant" loosely across many forks) to credit HIGGS instead. His comment was not directed at us specifically, but the substance applied to our naming as well, and we chose to update accordingly.*
- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874). TurboQuant applies the rotation pattern to the **KV cache** with a 1-bit QJL residual stage and per-channel outlier handling. Our work started as a literal port of TurboQuant; through 9 rounds of Karpathy-loop iteration we simplified it (dropped QJL, dropped outlier channels) into the current Variant F. We do not claim our shipped variant is the TurboQuant algorithm — it is an empirically-derived simplification.
- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617). The polar-coordinate KV quantization that our `tq_polar.c` baseline implements.
- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482). The 1-bit sketch building block. Used in our `tq_qjl.c` baseline; we found it contributed ~zero to attention scores in the Variant F regime and dropped it.
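
To make the HIGGS entry above concrete: the pattern it describes is a Rademacher sign flip followed by a fast Walsh-Hadamard rotation, with the rotated values then snapped to an MSE-optimized scalar grid. The sketch below only illustrates that pattern and is not quant.cpp's actual `tq_rht.c` code; the function names (`fwht`, `rht_forward`, `quantize_block`), the per-block `scale`, and the `grid` interface are illustrative assumptions.

```c
/* Minimal sketch of the HIGGS-style RHT + scalar-grid pattern described
 * above. Illustrative only: names and interfaces are assumptions, not
 * the real tq_rht.c API. */
#include <math.h>
#include <stdint.h>

/* In-place fast Walsh-Hadamard transform with orthonormal scaling.
 * n must be a power of two. */
static void fwht(float *x, int n) {
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += len << 1) {
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    float s = 1.0f / sqrtf((float)n);
    for (int i = 0; i < n; i++) x[i] *= s;
}

/* Randomized Hadamard transform: multiply by a Rademacher (+/-1) sign
 * vector, then rotate with the Walsh-Hadamard transform. */
static void rht_forward(float *x, const int8_t *sign, int n) {
    for (int i = 0; i < n; i++) x[i] *= (float)sign[i];
    fwht(x, n);
}

/* Quantize each rotated value to the nearest entry of a small grid
 * (e.g. a Lloyd-Max / MSE-optimized codebook), after a per-block scale. */
static void quantize_block(const float *x, uint8_t *codes, int n,
                           const float *grid, int grid_size, float scale) {
    for (int i = 0; i < n; i++) {
        float v = x[i] / scale;
        int best = 0;
        float best_err = fabsf(v - grid[0]);
        for (int k = 1; k < grid_size; k++) {
            float err = fabsf(v - grid[k]);
            if (err < best_err) { best_err = err; best = k; }
        }
        codes[i] = (uint8_t)best;
    }
}
```

Dequantization reverses the steps in this sketch: look up `grid[code] * scale`, apply `fwht` again, then reapply the same sign vector (the orthonormal Hadamard matrix is its own inverse, as is the diagonal of ±1 signs).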

bench/results/turboquant_reproduction.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Variant F derivation — from TurboQuant literal port to HIGGS-style simplification

- > **Important attribution update (2026-04-08)**: Following [Tim Dettmers' comment in llama.cpp #20969](https://github.com/ggml-org/llama.cpp/discussions/20969), we now credit **HIGGS** (Malinovskii et al., Nov 2024, [arXiv:2411.17525](https://arxiv.org/abs/2411.17525)) for the Random Hadamard Transform + scalar grid quantization pattern. The shipped Variant F is structurally closest to HIGGS (RHT + MSE-optimal grids on rotated values), applied to KV cache like TurboQuant, with both the QJL residual stage and the per-channel outlier split removed through ablation. We do **not** claim our shipped variant is the published TurboQuant algorithm — it is an empirically-derived simplification arrived at through 9 Karpathy-loop rounds.
+ > **Important attribution update (2026-04-08)**: After observing [Tim Dettmers' general comment in llama.cpp discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969), which was directed at the thread's participants in general (6+ forks were all loosely calling their work "TurboQuant") rather than at us specifically, we recognized that the substance applied to our naming as well and updated our docs to credit **HIGGS** (Malinovskii et al., Nov 2024, [arXiv:2411.17525](https://arxiv.org/abs/2411.17525)) for the Random Hadamard Transform + scalar grid quantization pattern. The shipped Variant F is structurally closest to HIGGS (RHT + MSE-optimal grids on rotated values), applied to the KV cache like TurboQuant, with both the QJL residual stage and the per-channel outlier split removed through ablation. We do **not** claim our shipped variant is the published TurboQuant algorithm — it is an empirically-derived simplification arrived at through 9 Karpathy-loop rounds.

docs/papers/quant_cpp_arxiv_draft.md

Lines changed: 1 addition & 1 deletion
@@ -265,7 +265,7 @@ The full Karpathy-loop history is in `bench/results/turboquant_reproduction.md`

## Acknowledgements

- Tim Dettmers ([discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969)) for pointing out the HIGGS attribution. Mohamed Chorfa for the bug fix PRs (#12, #13). The ggml-org / llama.cpp community for the Discussion #20969 venue for KV quantization research.
+ We thank Tim Dettmers, whose [general comment in llama.cpp discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969) (a thread in which 6+ independent forks were all loosely calling their work "TurboQuant") asked the discussion's participants to credit HIGGS instead. His comment was not directed at us specifically, but the substance applied to our naming as well, and we updated our docs and this paper accordingly. We also thank Mohamed Chorfa for the bug-fix PRs (#12, #13), and the ggml-org / llama.cpp community for hosting Discussion #20969 as a venue for KV quantization research.

## References
