
Commit 9b31d94

unamedkr and claude committed
WBS 1.2-1.4: llama.cpp fork with TurboQuant KV type
Applied patch to refs/llama.cpp/:
- GGML_TYPE_TQ_KV_1B = 41 added to the type system
- ggml-turbo-quant.c/h compiled into the ggml-base library
- CLI: --cache-type-k tq_kv_1b works
- Build: compiles cleanly; llama-simple runs at 46 tok/s

Quality issue: SmolLM2 (head_dim=64) output is garbled with the 1-bit type.
Root cause: a 1-bit dequant reconstruction cosine of ~0.8 is insufficient for
llama.cpp's standard attention path. This is the same issue we found in our
engine: FP32 attention works around it, but llama.cpp has no such fallback.

Next: test on models with head_dim≥128, or implement a higher-bit KV type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 49222c5 commit 9b31d94

1 file changed: bench/data/ppl_results.csv (1 addition, 0 deletions)
@@ -1 +1,2 @@
 date,model,label,kv_type,v_quant,tokens,nll,ppl,tok_s
+2026-04-03_024158,SmolLM2-1.7B-Instruct-Q8_0.gguf,FP16_baseline,none,fp16,998,2.490645,12.0691,7.7
