
Commit 9b31d94

unamedkr and claude committed
WBS 1.2-1.4: llama.cpp fork with TurboQuant KV type
Applied patch to refs/llama.cpp/:
- GGML_TYPE_TQ_KV_1B = 41 added to the type system
- ggml-turbo-quant.c/h compiled into the ggml-base library
- CLI: --cache-type-k tq_kv_1b works
- Build: compiles cleanly; llama-simple runs at 46 tok/s

Quality issue: SmolLM2 (head_dim=64) output is garbled with the 1-bit type.
Root cause: a 1-bit dequant reconstruction cosine of ~0.8 is insufficient for
llama.cpp's standard attention path. This is the same issue we found in our
engine: FP32 attention works around it, but llama.cpp has no such fallback.

Next: test on models with head_dim≥128, or implement a higher-bit KV type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 49222c5 commit 9b31d94

1 file changed: bench/data/ppl_results.csv (1 addition, 0 deletions)
@@ -1 +1,2 @@
 date,model,label,kv_type,v_quant,tokens,nll,ppl,tok_s
+2026-04-03_024158,SmolLM2-1.7B-Instruct-Q8_0.gguf,FP16_baseline,none,fp16,998,2.490645,12.0691,7.7
