Commit 9b31d94
WBS 1.2-1.4: llama.cpp fork with TurboQuant KV type
Applied patch to refs/llama.cpp/:
- GGML_TYPE_TQ_KV_1B = 41 added to the type system (see the sketch after this list)
- ggml-turbo-quant.c/h compiled into ggml-base library
- CLI: --cache-type-k tq_kv_1b works
- Build: compiles cleanly, llama-simple runs at 46 tok/s
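
The first two items boil down to the usual two-touch ggml pattern: pin an
enum value, then register the type's traits. Below is a minimal sketch of
that shape, not the fork's actual code: trait field names drift across
ggml versions, and block_tq_kv_1b, the block size of 64, and the row codec
names are assumptions.

    // ggml.h: new entry in enum ggml_type. Pinning the value to 41 keeps
    // GGUF files and serialized KV caches stable; GGML_TYPE_COUNT must
    // remain the final entry.
    enum ggml_type {
        // ... existing types 0..40 ...
        GGML_TYPE_TQ_KV_1B = 41,
        GGML_TYPE_COUNT,
    };

    // ggml.c: register the traits so ggml_type_name(), ggml_row_size(),
    // etc. understand the new type. The row codecs are the entry points
    // exported by ggml-turbo-quant.c/h.
    [GGML_TYPE_TQ_KV_1B] = {
        .type_name    = "tq_kv_1b",
        .blck_size    = 64,                       // assumed block granularity
        .type_size    = sizeof(block_tq_kv_1b),   // assumed block struct
        .is_quantized = true,
        .to_float     = (ggml_to_float_t)   dequantize_row_tq_kv_1b,
        .from_float   = (ggml_from_float_t) quantize_row_tq_kv_1b,
    },

On the llama.cpp side, the --cache-type-k string "tq_kv_1b" also needs an
entry in the name-to-type mapping (kv_cache_type_from_str in many llama.cpp
trees) before the CLI item above works.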
Quality issue: SmolLM2 (head_dim=64) produces garbled output with the 1-bit type.
Root cause: 1-bit dequant reconstruction only reaches ~0.8 cosine similarity,
which is not enough for llama.cpp's standard attention path. We hit the same
issue in our own engine, where an FP32 attention path serves as the fallback;
llama.cpp has no equivalent fallback.
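
The ~0.8 figure itself is what a bare 1-bit sign code predicts, which is a
useful sanity check. The sketch below is not TurboQuant's codec; it
reconstructs each row as sign(x) times a per-row scale, for which
Gaussian-like rows land at cosine ~ sqrt(2/pi) ~ 0.798:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define HEAD_DIM 64
    #define TRIALS   1000

    static float cosine_sim(const float *a, const float *b, int n) {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (int i = 0; i < n; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (sqrtf(na) * sqrtf(nb));
    }

    /* Box-Muller: standard-normal samples, standing in for K-cache rows. */
    static float randn(void) {
        float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
        float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
        return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
    }

    int main(void) {
        float x[HEAD_DIM], y[HEAD_DIM];
        float acc = 0.0f;

        for (int t = 0; t < TRIALS; t++) {
            float scale = 0.0f;
            for (int i = 0; i < HEAD_DIM; i++) {
                x[i] = randn();
                scale += fabsf(x[i]);
            }
            /* Mean-|x| scale minimizes L2 error; cosine is scale-invariant. */
            scale /= HEAD_DIM;
            for (int i = 0; i < HEAD_DIM; i++) {
                y[i] = (x[i] >= 0.0f) ? scale : -scale; /* 1-bit reconstruction */
            }
            acc += cosine_sim(x, y, HEAD_DIM);
        }

        /* Expect ~0.80 = sqrt(2/pi): each quantized key perturbs its q.k
           logit, and softmax amplifies the per-key errors into a visibly
           wrong attention mix. */
        printf("mean cosine over %d rows: %.3f\n", TRIALS, acc / TRIALS);
        return 0;
    }

Compiles standalone with: cc -O2 cos_check.c -lm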
Next: test on models with head_dim≥128, or implement higher-bit KV type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>