ROOT CAUSE: Integer (Hamming) attention had only ~68% sign accuracy,
causing catastrophic PPL explosion at >128 tokens on ALL models.
The previous "byte-identical" results were due to FP32 fallback at
seq_len <= 128 (int_attn_threshold).

FIX: Disable integer attention entirely (int_attn_threshold = INT_MAX).
1-bit KV now uses: quantized STORAGE (1-bit) + FP32 ATTENTION (dequant).
Memory savings come from compressed storage, not integer attention.

VERIFIED:
- SmolLM2 1.7B (Llama), 800 tokens: PPL 11.07 = 11.07 (+0.00%)
- No more PPL explosion at any sequence length.

The architecture: store → 1-bit → dequant → FP32 attention.
Same memory savings, correct math.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
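For illustration, a minimal C sketch of the store → 1-bit → dequant → FP32 attention path described above (with int_attn_threshold = INT_MAX, the integer scoring path is never taken). The function names and the mean-absolute-value scale are assumptions for this sketch, not the project's actual API.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the 1-bit KV path: store only sign bits plus a per-vector scale,
 * dequantize back to FP32, and run the attention math in FP32. */

/* Pack the signs of a d-dimensional FP32 vector into ceil(d/8) bytes and
 * return the mean absolute value as the dequantization scale. */
float pack_sign_bits(const float *v, int d, uint8_t *bits) {
    memset(bits, 0, (size_t)((d + 7) / 8));
    float sum_abs = 0.0f;
    for (int i = 0; i < d; i++) {
        sum_abs += fabsf(v[i]);
        if (v[i] >= 0.0f) bits[i / 8] |= (uint8_t)(1u << (i % 8));
    }
    return sum_abs / (float)d;
}

/* Dequantize: +scale where the sign bit is set, -scale where it is not. */
void dequant_to_fp32(const uint8_t *bits, float scale, int d, float *out) {
    for (int i = 0; i < d; i++)
        out[i] = (bits[i / 8] & (1u << (i % 8))) ? scale : -scale;
}

/* FP32 attention logit against a 1-bit-stored key: dequantize into scratch,
 * then take the usual scaled dot product. No integer (Hamming) scoring. */
float attn_logit(const float *q, const uint8_t *k_bits, float k_scale,
                 int d, float *scratch) {
    dequant_to_fp32(k_bits, k_scale, d, scratch);
    float dot = 0.0f;
    for (int i = 0; i < d; i++) dot += q[i] * scratch[i];
    return dot / sqrtf((float)d);
}
```

The cache still shrinks to 1 bit per element plus one scale per vector; only the scoring math stays in full precision.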
File changed: docs/pr/2026-04-03-reddit-responses.md (+63, -6 lines)
---

## @MrRandom04 (follow-up) — "Why not just fork llama.cpp?"

> "It is very hard for me to trust the correctness of a re-implementation of such a complex codebase... Why not just fork llama.cpp and add it in there so we know the code for all the other crucial parts is fairly robust and dependable?"

Valid concern. Two reasons for the standalone engine:

1. **Algorithm verification across architectures.** We needed to test TurboQuant KV on Llama, Gemma (sliding window), Qwen3.5 (DeltaNet hybrid), and Qwen-MoE (256 experts) — each with very different attention mechanisms. A standalone engine let us control every variable and measure PPL impact precisely. Debugging quantization bugs inside llama.cpp's 200K+ line codebase would have been much harder during research.

2. **The integration path is real.** `integrations/llamacpp/` has a working GGML type registration that adds TurboQuant types alongside existing Q4/Q8 types. The plan is an upstream PR — not maintaining a parallel engine forever.

You're right that a fork would give more confidence in correctness. Once the algorithm is validated (which is what the standalone engine proved), the next step is exactly that — getting it into llama.cpp where it benefits from their battle-tested infrastructure. The standalone engine is the research prototype; llama.cpp integration is the production path.

---

## @OftenTangential — "36 is an absurd PPL for Gemma 3 4B"

> "36 is an absurd ppl for Gemma 3 4B on English text lol. That implies it's literally outputting GPT-2 levels of coherence... Either your perplexity test set is bad, or the baseline implementation is broken."

Fair point — the PPL of 35.99 is high for Gemma 3 4B. Here's the context:

1. **Short test set (101 tokens).** This was a small fixed prompt used to compare FP16 vs 1-bit, not a proper benchmark corpus. PPL on short sequences is noisy and inflated — it doesn't reflect the model's true capability on longer text.

2. **What matters is the delta, not the absolute value.** The point of the measurement is FP16 → 1-bit: 35.99 → 36.00 (+0.03%). Whether the baseline is 6.0 or 36.0, a +0.03% delta from quantization is negligible.

3. **Confirmed on SmolLM2 1.7B (Llama arch) with lower baseline PPL.** SmolLM2 gives baseline 5.84 PPL on 105 tokens — a more expected range for a small model. 1-bit K: 5.84 (+0.00%). This cross-architecture result is stronger evidence.

You're right that a proper PPL evaluation should use a standard benchmark (WikiText-2, C4) with thousands of tokens. That's on the roadmap. The 101-token measurement was meant to show the quantization delta, not the model's absolute quality.
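To make the delta argument concrete, here is a minimal sketch of how perplexity and the relative quantization delta are computed. The numbers and names are illustrative, not output from the actual evaluation.

```c
#include <math.h>
#include <stdio.h>

/* Perplexity is exp of the mean negative log-likelihood over the evaluated
 * tokens, so on a 101-token prompt a handful of hard tokens inflates it. */
double perplexity(const double *token_logprobs, int n_tokens) {
    double nll = 0.0;
    for (int i = 0; i < n_tokens; i++) nll -= token_logprobs[i];
    return exp(nll / (double)n_tokens);
}

int main(void) {
    /* Illustrative values: baseline FP16 run vs 1-bit KV cache run. */
    double ppl_fp16 = 35.99;
    double ppl_1bit = 36.00;

    /* The quantization delta is relative, so it is comparable whether the
     * baseline PPL is 6.0 or 36.0. */
    double delta_pct = 100.0 * (ppl_1bit - ppl_fp16) / ppl_fp16;
    printf("delta = %+.2f%%\n", delta_pct);   /* prints ~ +0.03% */
    return 0;
}
```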

---

## @Viper-Reflex — ":O ty for the info!"

(No response needed — positive acknowledgment.)

---

## @MaybeADragon — "Em dashes. No more to be said."

Implying the post was AI-generated due to em dash usage. Not worth engaging directly — the code and results speak for themselves. If asked seriously, the response is: the code is open source, 30K lines of C, 32 test suites, all reproducible.

---

## @Candid_Koala_3602 — "Can TurboQuant also replace transformers in the same mechanism? Angular mappings instead of weights?"

> "Can TurboQuant also replace transformers in the same mechanism? That would be the real win. Angular mappings instead of weights?"

Interesting idea. Short answer: TurboQuant doesn't replace the transformer architecture — it compresses the **data** (KV cache, weights) that the transformer operates on.

But the underlying insight — that angular/directional information is sufficient for attention — is related to what you're describing. The 1-bit path essentially reduces attention to cosine similarity via sign hashing, which is a form of angular mapping. Whether this could extend to replacing weight matrices with purely angular representations is an open research question.
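As a rough illustration of the sign-hashing idea, here is a small sketch (toy dimension, hypothetical names, not the engine's code) showing that the dot product of two sign-quantized vectors is determined entirely by their Hamming distance, i.e. by the angle between the sign vectors.

```c
#include <stdint.h>
#include <stdio.h>

#define DIM 8   /* toy dimension so one byte holds all the sign bits */

/* Quantize a vector to one sign bit per coordinate (keeps only direction). */
uint8_t sign_bits(const float *v) {
    uint8_t bits = 0;
    for (int i = 0; i < DIM; i++)
        if (v[i] >= 0.0f) bits |= (uint8_t)(1u << i);
    return bits;
}

/* Dot product of the two ±1 sign vectors, from the Hamming distance:
 * matching bits contribute +1, mismatches -1, so the score is
 * DIM - 2 * popcount(a ^ b) = DIM * cos(angle between the sign vectors).
 * Attention scoring on sign codes is therefore a purely angular comparison. */
int sign_dot(uint8_t a, uint8_t b) {
    return DIM - 2 * __builtin_popcount((unsigned)(a ^ b));
}

int main(void) {
    float q[DIM] = { 0.9f, -0.2f, 0.4f, -1.1f,  0.3f, 0.7f, -0.5f, 0.1f };
    float k[DIM] = { 1.2f, -0.1f, 0.6f, -0.9f, -0.2f, 0.8f, -0.4f, 0.2f };
    printf("sign dot = %d (max %d)\n", sign_dot(sign_bits(q), sign_bits(k)), DIM);
    return 0;
}
```

Note that this integer sign-dot is the Hamming-attention path the commit above disables; in the current configuration the sign bits are only a storage format, and the actual comparison happens in FP32 after dequantization.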

The closest existing work is probably binary/ternary weight networks (BWN/TWN) and more recently BitNet (1-bit weights). TurboQuant's contribution is showing that the **KV cache** specifically tolerates extreme quantization because attention is inherently a ranking operation, not a reconstruction operation.

---

## Key takeaway from Reddit feedback

1. **"zero quality loss" was overstated** → fixed to "almost no" with exact PPL

2. **"why not just integrate into llama.cpp?"** → we have a patch, standalone is for research; llama.cpp PR is the production path

3. **"why not fork llama.cpp?"** → valid, standalone engine proved the algorithm, next step is upstream integration

4. **"PPL 36 is absurd for Gemma 4B"** → fair: short test set (101 tokens) inflates PPL; delta (+0.03%) is what matters, confirmed on SmolLM2 at lower baseline PPL

5. **Technical curiosity is high** — 5.4K views, people want to understand the math

6. **Skepticism is healthy** — Blizado/No-Manufacturer/OftenTangential criticism pushed us to be more precise

7. **1-bit vs 2-3 bit confusion** → clarified: softmax robustness, not better MSE

8. **Long context is the killer app** — multiple users asking about context extension

9. **Hardware democratization** resonates — people want more from existing hardware

10. **Need proper PPL benchmark** → WikiText-2/C4 with 1000+ tokens, not 101-token micro test