
Commit 9e09e3d

unamedkr and claude committed
docs: Reddit responses for BillDStrong, ganonfirehouse420, TopChard1274
- BillDStrong: why 1-bit works (softmax robustness, not better MSE)
- ganonfirehouse420: concrete context extension example (32K→128K+)
- TopChard1274: KV compression helps all hardware tiers equally

Total: 6 response drafts covering all substantive Reddit questions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 189581b commit 9e09e3d

1 file changed

Lines changed: 48 additions & 0 deletions


docs/pr/2026-04-03-reddit-responses.md

@@ -41,9 +41,57 @@ Clarification:

---

## @BillDStrong — "What magic is this. I didn't realize there was a 1-bit version"
Good observation — the paper (TurboQuant, ICLR 2026) focuses on 2.5-bit and 3.5-bit configurations. The 1-bit version is our extension of the paper's framework.
The key insight: the paper's RHT (Randomized Hadamard Transform) makes the quantization error **unbiased** for inner products at any bit-width. We pushed this to the extreme — 1 bit = just the sign of each dimension after RHT. Mathematically, this gives a cosine similarity of 2/pi ≈ 0.637 (we measured 0.634), which is the information-theoretic maximum for sign-only quantization.
Why does 1-bit "beat" 2-3 bit? It doesn't in terms of reconstruction quality (MSE is worse). But for **attention scoring** (which only needs inner product ranking, not exact values), the softmax function is surprisingly tolerant of noise. The attention weights after softmax are nearly identical because:
1. RHT distributes errors uniformly (no systematic bias)
2. Softmax amplifies the largest scores and suppresses small ones
3. The top-attended tokens stay the same even with noisy scores
So it's not that 1-bit is "better" — it's that attention is robust enough that 1-bit is sufficient.
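If you want to poke at this yourself, here is a minimal NumPy sketch of the idea, not the repo's actual kernel: it rotates keys with a random-sign Walsh-Hadamard transform as a stand-in for the paper's RHT, keeps only the sign of each coordinate plus one scale per key, and compares exact vs. 1-bit attention after softmax. The pi/2 correction for the sign estimator's shrinkage and the toy "a few genuinely relevant keys" setup are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_relevant = 128, 512, 8   # toy head_dim / cache size

def rht(x, signs):
    """Random-sign fast Walsh-Hadamard transform (stand-in for the paper's RHT)."""
    y = x * signs
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(d)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy data: most keys are unrelated to the query, a handful are genuinely
# relevant (mimics the peaked score distribution of real attention).
q = rng.standard_normal(d)
keys = rng.standard_normal((n_keys, d))
keys[:n_relevant] = 0.45 * q + rng.standard_normal((n_relevant, d))

signs = rng.choice([-1.0, 1.0], size=d)
rot_q = rht(q, signs)
rot_keys = np.array([rht(k, signs) for k in keys])

# 1-bit keys: sign of each rotated coordinate plus one scale per key.
# The pi/2 factor undoes the sign estimator's shrinkage so inner products are
# unbiased (an assumption about where a real kernel would fold this correction).
scales = np.abs(rot_keys).mean(axis=1, keepdims=True) * (np.pi / 2)
keys_1bit = scales * np.sign(rot_keys)

true_scores = rot_keys @ rot_q / np.sqrt(d)    # == keys @ q / sqrt(d); RHT is orthonormal
approx_scores = keys_1bit @ rot_q / np.sqrt(d)

w_true, w_approx = softmax(true_scores), softmax(approx_scores)
top_true = np.argsort(true_scores)[-n_relevant:]
top_approx = np.argsort(approx_scores)[-n_relevant:]

print("top-8 overlap:", len(set(top_true) & set(top_approx)), "/", n_relevant)
print("attention mass on true top-8:", round(float(w_true[top_true].sum()), 3),
      "(exact) vs", round(float(w_approx[top_true].sum()), 3), "(1-bit keys)")
```

On this toy setup the top-attended tokens and the attention mass they receive come out essentially the same, which is points 1-3 above in miniature.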
---
## @ganonfirehouse420 — "I hope I will be able to have a huge context for my local models in the future"
That's exactly the use case. With 1-bit K + Q4 V, KV cache memory drops ~5x. Concrete example:
```
Gemma 3 4B at 32K context:
FP16 KV: 4,352 MB → barely fits in 16GB with model weights
1-bit K + Q4 V: 885 MB → room for 128K+ context on same hardware
```
For a 16GB Mac or laptop, this means going from 32K → 100K+ context without any hardware upgrade. The limiting factor shifts from KV memory to model weight memory.
This is available today — `./build/tq_run model.gguf -p "your long prompt" -k turbo_kv_1b -v q4 --ctx 131072`. The `--ctx` flag overrides the default context limit.
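If you want to check the arithmetic behind the table above, here is a rough size calculation. The attention shape used (34 layers, 4 KV heads, head_dim 256) is what reproduces the 4,352 MB FP16 figure for Gemma 3 4B, and the effective bit-widths (about 1.5 bits for 1-bit K with scales, about 5 bits for Q4 V) are assumptions that account for per-block scale overhead rather than the exact on-disk format.

```python
# Rough KV-cache size check for the numbers quoted above.
# Assumed Gemma 3 4B attention shape: 34 layers, 4 KV heads, head_dim 256.
# Effective bit-widths include per-block scale overhead (assumptions).

LAYERS, KV_HEADS, HEAD_DIM = 34, 4, 256
MIB = 1024 * 1024

def kv_mib(ctx_tokens, bits_k, bits_v):
    """KV cache size in MiB for a given context length and per-element bit-widths."""
    elems_per_token = LAYERS * KV_HEADS * HEAD_DIM   # per K, and again per V
    bytes_per_token = elems_per_token * (bits_k + bits_v) / 8
    return ctx_tokens * bytes_per_token / MIB

for ctx in (32 * 1024, 128 * 1024):
    fp16 = kv_mib(ctx, 16, 16)            # FP16 K and V
    compressed = kv_mib(ctx, 1.5, 5.0)    # ~1-bit K + ~Q4 V, with scales
    print(f"{ctx:>7} tokens: FP16 {fp16:7.0f} MiB | 1-bit K + Q4 V {compressed:6.0f} MiB "
          f"({fp16 / compressed:.1f}x smaller)")
```

The ~5x ratio falls straight out of the bit-widths: 32 bits of FP16 per stored K/V element pair versus roughly 6.5 effective bits.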
---
## @TopChard1274 — "big breakthroughs ... seem brutal for people who invested in nearly-unaffordable systems"
Appreciate the perspective. A few thoughts:
1. **KV compression helps everyone** — whether you have 8GB, 24GB, or 80GB. The ratio is the same (5x KV reduction). High-end systems benefit by running longer contexts or larger batches, not just by fitting a model.
2. **This doesn't obsolete hardware** — weight memory is still the bottleneck for model size. A 70B model still needs ~35GB for Q4 weights regardless of KV compression. What changes is that you can push context much further on whatever hardware you have.
3. **The "0.03% quality loss" criticism is fair** — some critics in this thread pushed back on "zero loss" and they're right. We've updated to "almost no quality loss" with exact PPL numbers. The honest framing matters more than hype.
The real unlock is for use cases like RAG with long documents, code analysis with large repos, or multi-turn conversations that previously hit context limits.
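To put rough numbers on point 2: assuming a typical 70B-class attention shape (80 layers, 8 KV heads, head_dim 128) and the same effective bit-widths as above, weights remain the dominant cost while the long-context KV cache stops being a second, equally large one. These architecture numbers and bit-widths are assumptions for illustration.

```python
# Back-of-the-envelope for point 2: weights vs. KV cache for a 70B-class model.
# Architecture numbers (80 layers, 8 KV heads, head_dim 128) are typical for
# this size class and are assumptions, as are the effective bit-widths.

GIB = 1024 ** 3
params = 70e9
weights_q4_gib = params * 4 / 8 / GIB   # ~4 bits/param -> ~32.6 GiB (the "~35GB" above)

def kv_gib(ctx, bits_k, bits_v, layers=80, kv_heads=8, head_dim=128):
    elems = layers * kv_heads * head_dim   # per K, and again per V
    return ctx * elems * (bits_k + bits_v) / 8 / GIB

ctx = 128 * 1024
print(f"Q4 weights:            {weights_q4_gib:5.1f} GiB")
print(f"FP16 KV @ 128K:        {kv_gib(ctx, 16, 16):5.1f} GiB")
print(f"1-bit K + Q4 V @ 128K: {kv_gib(ctx, 1.5, 5.0):5.1f} GiB")
```

Even at 128K context, the compressed KV cache stays well under the weight footprint, so existing high-memory systems get longer contexts or bigger batches rather than becoming obsolete.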
---
## Key takeaways from Reddit feedback

1. **"zero quality loss" was overstated** → fixed to "almost no" with exact PPL
2. **"why not just integrate into llama.cpp?"** → we have a patch, that's the plan
3. **Technical curiosity is high** — 5.4K views, people want to understand the math
4. **Skepticism is healthy** — the Blizado/No-Manufacturer criticism pushed us to be more precise
5. **1-bit vs 2-3 bit confusion** → clarified: softmax robustness, not better MSE
6. **Long context is the killer app** — multiple users asking about context extension
7. **Hardware democratization** resonates — people want more from existing hardware
