
Commit e3ae4b5

unamedkr and claude committed
docs: Reddit response drafts for 3 unanswered questions
Addresses: llama.cpp integration plan, codebook domain sensitivity, GPU memory benefits for 3090 users. Key learning: "zero quality loss" was overstated, fixed to "almost no" with PPL numbers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 835b7d0 commit e3ae4b5

1 file changed: 49 additions & 0 deletions
# Reddit r/LocalLLM Response Drafts (2026-04-03)

Post: 16 upvotes, 5.4K views, 19 comments

---
## @MrRandom04 — "re-implementing all of llama.cpp just to add whatever approach"

We don't intend to replace llama.cpp. We have a self-contained llama.cpp integration patch (`integrations/llamacpp/patch/`, 4 files, ~1000 lines) that adds `--cache-type-k tq_kv_1b` as a drop-in option. The standalone engine exists for research and to verify the algorithm on multiple architectures (Llama, Gemma, Qwen, Qwen-MoE — 4 verified). The goal is to get TurboQuant KV into llama.cpp as a native cache type.

---
## @dinerburgeryum — "codebook calibration sensitive to out-of-domain data?"

Good question. The **1-bit path doesn't use a codebook at all** — it's just `sign(RHT(key))`, so there's nothing to calibrate and nothing domain-sensitive. The RHT seed is fixed per-block and model-independent.
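To make the 1-bit path concrete, here is a minimal NumPy sketch of the idea (illustrative only — the function names, blocking, and bit-packing are simplified assumptions, not our kernel code):

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis (length must be a power of 2)."""
    x = np.array(x, dtype=np.float32, copy=True)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_key_1bit(key, seed=0):
    """1-bit key quantization: fixed-seed random sign flips + Hadamard rotation, then take the sign."""
    rng = np.random.default_rng(seed)                    # fixed, data-independent seed (per block in the real kernel)
    flips = rng.choice([-1.0, 1.0], size=key.shape[-1])  # random diagonal of the RHT
    rotated = fwht(key * flips)                          # RHT(key)
    return np.where(rotated >= 0, 1, -1).astype(np.int8) # 1 bit per dimension (packing omitted)
```

Because the rotation uses a fixed seed and never looks at the data, there is nothing in this path that can drift out of domain.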
The codebook is only used for the 3-bit and 4-bit modes (Lloyd-Max, optimal for N(0,1)). Our `--calibrate` tool showed a 49.7% MSE improvement with model-specific codebooks, but the 1-bit path skips all of this.
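For reference, the textbook Lloyd-Max construction for a unit Gaussian looks like this (an empirical, k-means-style sketch — not our `--calibrate` implementation):

```python
import numpy as np

def lloyd_max_gaussian(bits, n_samples=200_000, iters=100, seed=0):
    """Lloyd-Max scalar quantizer for N(0,1): alternate nearest-codeword assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(n_samples)
    levels = 2 ** bits
    # initialize codewords at evenly spaced sample quantiles
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)  # nearest codeword
        for k in range(levels):
            chosen = samples[idx == k]
            if chosen.size:
                codebook[k] = chosen.mean()                                # centroid update
    return np.sort(codebook)

# lloyd_max_gaussian(3) → the 8 optimal levels for a standard normal input
```

A model-specific codebook presumably comes from running the same kind of fit on real activations instead of N(0,1) samples — that is where domain sensitivity could enter, but again only for the 3-bit and 4-bit modes.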
---

## @Viper-Reflex — "does this make my 24GB 3090 run bigger models?"

KV compression helps most with **long contexts**, not bigger models. With 1-bit K + Q4 V, KV memory drops ~5x (rough arithmetic in the sketch below). For a 27B model at 32K context:
- Before: ~2.5 GB KV cache
- After: ~500 MB KV cache → frees ~2 GB for longer context or larger batch
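
Back-of-envelope arithmetic behind those numbers (the per-block scale overhead here is an assumed round figure, not a measured one):

```python
# Bits stored per cached element (one K value and one V value per position, head, and channel):
fp16_bits = 16 + 16                  # FP16 K + FP16 V
tq_bits = (1 + 0.5) + (4 + 0.5)      # 1-bit K + Q4 V, plus ~0.5 bit/element assumed for per-block scales

ratio = fp16_bits / tq_bits          # ≈ 5.3x smaller KV cache
print(f"compression ≈ {ratio:.1f}x, so ~2.5 GB of FP16 KV shrinks to ~{2.5 / ratio * 1024:.0f} MB")
```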
If you're already fitting a model in 24GB, TurboQuant lets you push context from 32K → 100K+ on the same hardware. But it won't help you fit a model that's too large for VRAM (weight memory is separate from KV cache).

Note: we currently don't have CUDA GPU acceleration (it compiles but is untested). That's next on the roadmap.

---
## @Blizado — "zero quality loss claim" (already responded)
Updated README: "almost no quality loss (PPL +0.03%)".

Clarification:
- K-only (V kept in FP16): PPL is exactly +0.00% — measured identical on both Gemma 4B and SmolLM2 1.7B (Llama arch)
- K + Q4 V: PPL +0.03% — near-zero, not zero
- "byte-identical" refers to greedy decoding up to ~100 tokens, not arbitrarily long sequences

---
## Key takeaways from Reddit feedback

1. **"zero quality loss" was overstated** → fixed to "almost no" with exact PPL numbers
2. **"why not just integrate into llama.cpp?"** → we have a patch; that's the plan
3. **Technical curiosity is high** — 5.4K views; people want to understand the math
4. **Skepticism is healthy** — the Blizado/No-Manufacturer criticism pushed us to be more precise
