# Changelog

## [0.6.0] — 2026-04-08

### Highlights

- **🏆 turbo_kv_4b is the new champion** — Beats both `uniform_4b` and llama.cpp `q4_0` KV at the same 4-bit budget on Llama 3.2 3B (PPL 14.28 vs 14.41 vs ~14.99). Reached after 6 rounds of Karpathy-loop iteration starting from a literal port of [Google TurboQuant (ICLR 2026)](https://arxiv.org/abs/2504.19874).
- **CLI default switched** — `quant model.gguf` now uses `turbo_kv_4b` automatically. `uniform_4b` remains available via `-k uniform_4b`.
- **Honest TurboQuant reproduction story** — Full ablation history, public issue #14, no overstated claims. The shipped `turbo_kv_*` is structurally simpler than the paper (single-stage RHT + Lloyd-Max codebook + ‖x‖) but empirically beats the literal two-stage port on our benchmark.
- **@quantcpp/wasm npm package** — `npm install @quantcpp/wasm` drops a 192 KB GGUF inference engine into any web project.
- **Windows CI green** — `pthread_cond_wait` SRWLOCK deadlock fixed, MSVC `__builtin_*` shims added, hard-coded `/tmp` paths in tests removed, `M_PI` handled in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- **Public PR & issue triage** — PR #12 (5 critical bug fixes from MChorfa) cherry-picked into main; PR #13's reformatting noise rejected, with the examples README and CMake separation salvaged.

### KV quantization

The `turbo_kv_3b` / `turbo_kv_4b` block layouts changed in this release. The `qjl_signs` field is gone — Karpathy-loop ablation showed it contributed exactly zero to attention scores (byte-identical outputs with and without it). The freed 16 bytes per block now hold a 2× larger Lloyd-Max codebook: same total block size, finer reconstruction, single-stage estimator.

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---:|---:|---:|
| FP32 baseline | 32 | 13.56 | — |
| **`turbo_kv_4b`** ⭐ | 4 | **14.28** | **+5.3%** |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | ~+10.5% |

### Strategy & positioning

- New `docs/positioning.md` — quant.cpp is the single-header C reference engine for the embedded niche (iOS, Android, WASM, MSVC, microcontrollers, game engines)
- README repositioned to an honest "production = turbo_kv_4b, research = building blocks" framing with full PPL methodology
- Citations to the Google TurboQuant, PolarQuant, and QJL papers added throughout

### Tooling & ecosystem

- `wasm/package.json` + ESM `index.mjs` + `index.d.ts` for npm publishing
- `examples/README.md` (cherry-picked from PR #13) — comprehensive embedding-examples doc
- CMake `TQ_BUILD_EXAMPLES` option; single-header examples link only against libm + threads
- Windows CI test timeouts bumped to 600s for slow non-vectorized builds

### Bug fixes (cherry-picked from PR #12)

- `tq_qjl.c`: NaN guard now requires `dim > 0`
- `tq_uniform.c`: heap-allocate the Q8 query buffer (was a 512-byte stack array)
- `tq_transformer.c`: NULL-check the key/value cache `calloc` results
- `tq_ops.c`: on Windows, `pthread_cond_wait` must use `SleepConditionVariableSRW`, not `SleepConditionVariableCS` (caused the test_ops deadlock on the first Windows green run)

### Tracked for next release (issue #14 follow-ups)

- Per-channel outlier handling (the Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for higher quality at ~5 bpc

## [0.5.0] — 2026-04-05

### Highlights