
Commit 5f6387b

unamedkr and claude committed
v0.6.0: turbo_kv_4b champion, TurboQuant story, follow-ups tracked
CHANGELOG.md:
- New v0.6.0 entry as the headline release of the Variant F win
- Document the entire arc: literal port → ablation → 6 Karpathy rounds → drop QJL → reinvest in larger codebook → beats baseline
- KV PPL comparison table, layout change description, all v0.6 work

ROADMAP.md:
- Move turbo_kv_4b/3b into Production-ready section with measured PPL
- Mark issue #14 as partially resolved (Variant F shipped; per-channel outlier handling and Llama 3.1 8B reproduction tracked in #15)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b72f05f commit 5f6387b

2 files changed

Lines changed: 61 additions & 11 deletions


CHANGELOG.md

Lines changed: 49 additions & 0 deletions
@@ -1,5 +1,54 @@
# Changelog

## [0.6.0] — 2026-04-08

### Highlights

- **🏆 turbo_kv_4b is the new champion** — Beats both `uniform_4b` and llama.cpp `q4_0` KV at the same 4-bit budget on Llama 3.2 3B (PPL 14.28 vs 14.41 vs ~14.99). Reached after 6 rounds of Karpathy-loop iteration starting from a literal port of [Google TurboQuant (ICLR 2026)](https://arxiv.org/abs/2504.19874).
- **CLI default switched** — `quant model.gguf` now uses `turbo_kv_4b` automatically. `uniform_4b` remains available via `-k uniform_4b`.
- **Honest TurboQuant reproduction story** — full ablation history, public issue #14, no overstated claims. The shipped `turbo_kv_*` is structurally simpler than the paper (single-stage RHT + Lloyd-Max codebook + ‖x‖) but empirically beats the literal two-stage port on our benchmark. A rough sketch of this single-stage structure follows this list.
- **@quantcpp/wasm npm package** — `npm install @quantcpp/wasm` drops a 192KB GGUF inference engine into any web project.
- **Windows CI green** — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `__builtin_*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- **Public PR & issue triage** — PR #12 (5 critical bug fixes from MChorfa) cherry-picked into main; PR #13 rejected as reformatting noise, with its examples README + CMake separation salvaged.
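To make the "single-stage RHT + Lloyd-Max codebook + ‖x‖" phrase concrete, here is a minimal, self-contained C sketch of that structure. It is illustrative only: the function and parameter names are hypothetical (not quant.cpp's actual encoder API), and it assumes a power-of-two block dimension and a small Lloyd-Max codebook fitted offline to a unit Gaussian.

```c
/* Hypothetical sketch of a single-stage "RHT + Lloyd-Max codebook + ||x||"
 * encoder. Names and layout are illustrative, not quant.cpp's real API. */
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* In-place fast Walsh-Hadamard transform; dim must be a power of two. */
static void fwht(float *x, size_t dim) {
    for (size_t h = 1; h < dim; h <<= 1)
        for (size_t i = 0; i < dim; i += h << 1)
            for (size_t j = i; j < i + h; j++) {
                float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
}

/* Randomized Hadamard: random +/-1 sign flips, FWHT, 1/sqrt(dim) scaling. */
static void rht(float *x, const int8_t *signs, size_t dim) {
    for (size_t i = 0; i < dim; i++) x[i] *= (float)signs[i];
    fwht(x, dim);
    const float s = 1.0f / sqrtf((float)dim);
    for (size_t i = 0; i < dim; i++) x[i] *= s;
}

/* Encode one block: rotate, store ||x||, scale coordinates to roughly unit
 * variance, then map each one to its nearest Lloyd-Max codeword index. */
static void encode_block(float *x, const int8_t *signs, size_t dim,
                         const float *codebook, int cb_size,
                         uint8_t *idx_out, float *norm_out) {
    rht(x, signs, dim);

    float norm = 0.0f;
    for (size_t i = 0; i < dim; i++) norm += x[i] * x[i];
    norm = sqrtf(norm);
    *norm_out = norm;                                    /* the ||x|| part   */

    const float inv = (norm > 0.0f) ? sqrtf((float)dim) / norm : 0.0f;
    for (size_t i = 0; i < dim; i++) {
        const float v = x[i] * inv;
        int best = 0;
        float best_d = fabsf(v - codebook[0]);
        for (int k = 1; k < cb_size; k++) {              /* nearest codeword */
            const float d = fabsf(v - codebook[k]);
            if (d < best_d) { best_d = d; best = k; }
        }
        idx_out[i] = (uint8_t)best;
    }
}
```

Decoding would look up `codebook[idx]` and rescale by `norm / sqrt(dim)`; because the rotation is orthogonal, attention dot products can also be taken directly in the rotated domain against an RHT-rotated query.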
### KV quantization

The `turbo_kv_3b` / `turbo_kv_4b` block layouts changed in this release. The `qjl_signs` field is gone — Karpathy-loop ablation showed it contributed byte-identical zero to attention scores. The freed 16 bytes per block are now used for a 2× larger Lloyd-Max codebook. Same total block size, finer reconstruction, single-stage estimator.
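Read as storage, the change is essentially a field swap inside a fixed-size block. The sketch below is hypothetical: field names and byte counts are invented for illustration, and only the idea (drop `qjl_signs`, reinvest its 16 bytes in a finer codebook at the same total size) comes from this release.

```c
/* Hypothetical before/after block layouts; NOT the actual on-disk format.
 * Both structs have the same total size, matching the description above. */
#include <stdint.h>

/* Pre-v0.6 style: two-stage estimator with a QJL sign hash. */
typedef struct {
    float   norm;           /* ||x|| of the rotated block                  */
    uint8_t qjl_signs[16];  /* 1-bit sign hash (ablated: zero score effect) */
    uint8_t cb_idx[48];     /* e.g. 128 elements at 3 bits/elem             */
} turbo_kv_block_old;

/* v0.6 style: qjl_signs dropped, its 16 bytes fund one extra index bit,
 * i.e. a 2x larger Lloyd-Max codebook; single-stage estimator. */
typedef struct {
    float   norm;
    uint8_t cb_idx[64];     /* the same 128 elements at 4 bits/elem         */
} turbo_kv_block_new;
```

Doubling the codebook costs exactly one extra bit per stored index, which is where the freed 16 bytes go in this sketch.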
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---:|---:|---:|
| FP32 baseline | 32 | 13.56 | |
| **`turbo_kv_4b`** | 4 | **14.28** | **+5.3%** |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |
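For clarity, the Δ column is the relative perplexity increase over the FP32 baseline:

$$
\Delta_{\mathrm{FP32}} \;=\; \frac{\mathrm{PPL} - \mathrm{PPL}_{\mathrm{FP32}}}{\mathrm{PPL}_{\mathrm{FP32}}},
\qquad\text{e.g.}\quad \frac{14.28 - 13.56}{13.56} \approx +5.3\%.
$$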
### Strategy & positioning

- New `docs/positioning.md` — quant.cpp = the single-header C reference engine for the embedded niche (iOS, Android, WASM, MSVC, microcontrollers, game engines)
- README repositioned to an honest "production = turbo_kv_4b, research = building blocks" framing with full PPL methodology
- Citations to the Google TurboQuant, PolarQuant, and QJL papers added throughout
### Tooling & ecosystem

- `wasm/package.json` + ESM `index.mjs` + `index.d.ts` for npm publishing
- `examples/README.md` (cherry-picked from PR #13) — comprehensive embedding examples doc
- CMake `TQ_BUILD_EXAMPLES` option; single-header examples link only against libm + threads
- Windows CI test timeouts bumped to 600s for slow non-vectorized builds
### Bug fixes (cherry-picked from PR #12)

- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate the Q8 query buffer (was a 512B stack array)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: Windows pthread_cond_wait must use `SleepConditionVariableSRW`, not `SleepConditionVariableCS` (caused the test_ops deadlock on the first Windows green run); see the sketch after this list
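On the last item: the pattern behind the fix is that an SRWLOCK-based mutex must be paired with `SleepConditionVariableSRW`; `SleepConditionVariableCS` expects a `CRITICAL_SECTION`, and that mismatch is what deadlocked test_ops. A minimal sketch of such a pthread-style shim follows (wrapper names are illustrative, not quant.cpp's actual ones).

```c
/* Sketch of a pthread-style condition wrapper over Win32 SRWLOCK.
 * Wrapper names are illustrative; the SRW-vs-CS pairing rule is the point. */
#ifdef _WIN32
#include <windows.h>

typedef SRWLOCK            tq_mutex_t;
typedef CONDITION_VARIABLE tq_cond_t;

static void tq_mutex_init(tq_mutex_t *m)   { InitializeSRWLock(m); }
static void tq_mutex_lock(tq_mutex_t *m)   { AcquireSRWLockExclusive(m); }
static void tq_mutex_unlock(tq_mutex_t *m) { ReleaseSRWLockExclusive(m); }

static void tq_cond_init(tq_cond_t *c)     { InitializeConditionVariable(c); }

static int tq_cond_wait(tq_cond_t *c, tq_mutex_t *m) {
    /* SRW lock held exclusively -> must use the ...SRW variant.
     * SleepConditionVariableCS here is the mismatch that deadlocked. */
    return SleepConditionVariableSRW(c, m, INFINITE, 0) ? 0 : -1;
}

static void tq_cond_signal(tq_cond_t *c)    { WakeConditionVariable(c); }
static void tq_cond_broadcast(tq_cond_t *c) { WakeAllConditionVariable(c); }
#endif /* _WIN32 */
```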
### Tracked for next release (issue #14 follow-ups)

- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for higher quality at ~5 bpc
## [0.5.0] — 2026-04-05

### Highlights

ROADMAP.md

Lines changed: 12 additions & 11 deletions
@@ -49,26 +49,27 @@ The world's simplest way to add LLM to a C/C++ project.
 A C reference engine for KV cache quantization research.
 
 ### Production-ready
+- [x] **`turbo_kv_4b`** — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
+- [x] **`turbo_kv_3b`** — RHT + 3-bit Lloyd-Max codebook (PPL 15.39, +13.5%)
 - [x] `uniform_4b` KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
 - [x] `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
 - [x] Delta compression (P-frame encoding)
 - [x] QK-norm aware compression (Gemma 4 / hybrid attention models)
 - [x] Plugin architecture (3 functions to add new type)
 - [x] 35 unit tests
 
-### Building blocks (research, not yet production-ready)
+### Building blocks
 - [x] Random Hadamard Transform (`tq_rht.c`)
-- [x] Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`)
-- [x] 1-bit QJL sign hash (`tq_qjl.c`)
+- [x] Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`, 1–4 bit)
+- [x] 1-bit QJL sign hash (`tq_qjl.c`) — research, contributes ~0 to scores in our regime
 - [x] PolarQuant (polar coordinate) compression (`tq_polar.c`)
-- [x] `turbo_kv_*` types composing the building blocks (paper structure, gap in quality)
-
-### Open: TurboQuant paper reproduction
-- [ ] Close the gap on `turbo_kv_*` quality vs Google paper — see issue #14
-- [ ] Per-channel outlier handling (paper's 32-channel split)
-- [ ] QJL constant verification for Rademacher rows
-- [ ] Per-head rotation seeds
-- [ ] Regression test pinning `turbo_kv_4b` PPL on Llama 3.2 3B ≤ 14.5
+
+### TurboQuant paper reproduction (issue #14, partially resolved)
+- [x] Identify the gap in literal port (commit 4da6915 — QJL contributes byte-identical zero)
+- [x] Variant F: drop QJL stage, double codebook size (commit ac3c46a — beats baseline)
+- [ ] Per-channel outlier handling (Google paper's 32-channel split)
+- [ ] Paper-faithful Llama 3.1 8B + LongBench-E reproduction
+- [ ] 5-bit codebook variant for ~5 bpc quality budget
 
 ### Planned (after Direction 2 reproduction)
 - [ ] "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
