# Reddit r/LocalLLaMA — The Working Memory Cliff

**Title:** `[Research] We measured the working memory cliff of Llama-3.2-1B/3B-Q4 — both fall to 0% retrieval at <1% of their nominal 128K context window`

**Flair:** `Research`

---

## Body

We've been pushing a "load the whole document instead of chunking" position for a while in this sub. Last month we showed [chunk-RAG hallucinated 7/7 questions while full-document inference got 7/7 correct](https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md) on a synthetic 600-token document with Llama-3.2-3B and 6.4× KV compression. Several of you pointed out — correctly — that 600 tokens is not a stress test.

So we ran 204 needle-in-a-haystack (NIAH) trials at context lengths 256–2048 to find where the model actually breaks. The result is sharper than we expected.

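For anyone new to NIAH: each trial plants one out-of-place fact (the needle) at a controlled depth inside filler text (the haystack), then asks the model to retrieve it. A minimal sketch of trial construction, assuming a generic tokenizer with `encode`/`decode`; the names here are illustrative, not the internals of `bench/niah_test.sh`:

```python
# Illustrative NIAH trial construction -- not the repo's actual harness.
NEEDLE = "The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023."
QUESTION = "Who is the chief financial officer of Northwind Logistics?"

def build_trial(haystack: str, ctx_tokens: int, depth: float, tok) -> str:
    """Truncate the haystack to ~ctx_tokens and splice NEEDLE in at
    `depth` (0.0 = prompt start, 1.0 = prompt end). `tok` is any
    tokenizer with encode/decode, e.g. the target model's."""
    ids = tok.encode(haystack)
    overhead = len(tok.encode(NEEDLE)) + len(tok.encode(QUESTION)) + 32  # chat-template headroom
    ids = ids[:max(ctx_tokens - overhead, 0)]
    cut = int(len(ids) * depth)
    spliced = ids[:cut] + tok.encode("\n" + NEEDLE + "\n") + ids[cut:]
    return tok.decode(spliced) + "\n\nQuestion: " + QUESTION + "\nAnswer:"
```
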
### The cliff

**Llama-3.2-1B-Instruct-Q8_0** (graded cliff):

| ctx (tokens) | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 256 | 89% | 89% |
| **512** | **100%** | **100%** |
| 1024 | 44% | 22% |
| 1536 | 0% | 0% |
| 2048 | 0% | 0% |

**Llama-3.2-3B-Instruct-Q4** (default loader; **step function**):

| ctx (tokens) | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 512 | 100% | 100% |
| **1024** | **100%** | **100%** |
| **1280** | **0%** | **0%** |
| 1536–2048 | 0% | 0% |

**1024 → 1280 is a 256-token step.** 18/18 correct → 0/18. There is no degradation interval: the model goes from following the chat-template instruction perfectly to ignoring it completely within a single 25% increase in context length.

Both models hit their effective working-memory ceiling at **less than 1% of the nominal 128K context window** (1B Q8 ≈ 0.4%, 3B Q4 ≈ 0.78%).

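Those percentages are just the last-perfect context length over the nominal window; the arithmetic, for anyone checking:

```python
nominal = 128 * 1024           # 131,072-token nominal window
print(512 / nominal * 100)     # 0.390625 -> ~0.4%  (1B Q8 last-perfect ctx)
print(1024 / nominal * 100)    # 0.78125  -> ~0.78% (3B Q4 last-perfect ctx)
```
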
### KV compression is orthogonal to the cliff

We compared 6.4× `turbo_kv_4b -v q4 --k-window 128` against an FP32 KV baseline across the same grid. **18 of 20 (model × ctx × method) cells are bit-for-bit identical between baseline and compressed.** The single disagreement is the 1B cliff cell, where both methods sit at the noise floor anyway.

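If you want to redo the cell-by-cell comparison yourself from the raw CSVs, the shape of the check is below. The file name and column names (`model`, `ctx`, `kv`, `correct`) are assumptions for illustration; check the actual schema in `bench/results/niah/`:

```python
import pandas as pd

# Per-(model, ctx) accuracy for each KV method, then the delta.
# File and column names are assumed, not the repo's actual schema.
df = pd.read_csv("bench/results/niah/all_runs.csv")
acc = df.groupby(["model", "ctx", "kv"])["correct"].mean().unstack("kv")
acc["delta"] = acc["fp32"] - acc["compressed"]
print(acc[acc["delta"] != 0])  # cells where compression changed the score
```
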
We also re-ran the cliff transition with FP32 *weights* (`TQ_NO_Q4=1`) to rule out a quantization confound. Same cliff, same location: 100% at 1024, 0% at 1280. **The cliff is a model property, not a KV-cache property and not a weight-quantization artifact.**

### The failure mode is *not* "I don't know"

At and beyond the cliff, the model produces one of three things. The first two (wikitext continuation, header echoes) are unsurprising. The third is the consequential one.

**Synthesised hallucination, 1B fp32 ctx=1024:**

> *"In 2023 Boulter was hired as the chief financial officer..."*

The haystack is a Wikipedia article about Robert Boulter (an English actor). The needle is "The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023." The model **fused them**: it produced a coherent invented sentence grafting the needle's "2023 hire" onto Boulter's biography.

This is the same silent-hallucination failure mode that vector RAG produces on a retrieval miss, now appearing in the regime that was supposed to *eliminate* it.

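Because the consequential failure looks fluent, grading has to distinguish "retrieved" from "fused". A rough classifier in the spirit of what we did; the buckets are the three failure modes above plus retrieval success, but the string heuristics are illustrative, not the repo's actual grader:

```python
def classify(output: str) -> str:
    """Bucket one model output; heuristics are illustrative."""
    text = output.lower()
    if "sarah chen" in text:
        return "retrieved"                  # needle answer present
    if output.lstrip().startswith(("==", "#")):
        return "header_echo"                # echoes wikitext section headers
    if "2023" in text and "boulter" in text:
        return "synthesised_hallucination"  # needle fused onto haystack entity
    return "continuation"                   # plain wikitext continuation
```
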
### Honest scope

- 2 models (1B, 3B), 3 needles, 1 language (English), 1 content domain (Wikipedia bio), 204 trials total.
- We tried 8B (Llama-3.1-Q4_K_M), but each inference takes ~12 min on Metal, so the full grid is infeasible until we optimize the Q4_K_M kernel. Deferred to v2.
- The 22 pp gap at the 1B cliff cell between fp32 and compressed is not statistically significant at n=9 (see the sketch after this list). We ran into a CLI bug attempting the seed sweep, fixed it mid-round, and left the proper stochastic robustness check for v2.
- We tried 6 prompt formats and settled on the most permissive one. Format sensitivity is a separate ceiling worth measuring.

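On the significance point: 44% vs 22% at n=9 is 4/9 vs 2/9 (our reading of the percentages), and a Fisher exact test puts that gap deep inside noise:

```python
from scipy.stats import fisher_exact

# 1B, ctx=1024: fp32 KV 4/9 correct vs 6.4x-compressed 2/9 correct.
table = [[4, 5],   # fp32:       correct, incorrect
         [2, 7]]   # compressed: correct, incorrect
_, p = fisher_exact(table)
print(p)  # ~0.62 -- nowhere near significant at this n
```
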
If you can falsify the cliff with a different model, prompt, or language, we want the data.

### Try it

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
cmake -B build_metal -DTQ_BUILD_METAL=ON && cmake --build build_metal -j8

# 1B Q8 sweep (~30 min on M-series)
MODEL=models/Llama-3.2-1B-Instruct-Q8_0.gguf \
  NIAH_CONTEXTS="256 512 1024 1536 2048" \
  bash bench/niah_test.sh
```

### Links

- **Tech report (arXiv-style draft)**: `docs/paper/working-memory-cliff.md` in the repo
- **Master table**: `bench/results/niah/master_table.md`
- **Raw CSVs + per-run CLI logs**: `bench/results/niah/`
- **GitHub**: https://github.com/quantumaikr/quant.cpp

### What this changes about "Beyond RAG"

The honest reframing: **Beyond RAG works for documents that fit in the model's *effective* working memory, which is 2–3 orders of magnitude smaller than the nominal context window** for the configurations we measured. Memory-wise, a 128K context fits in 9.5 GB on a 16 GB Mac. Retrieval-wise, the 3B Q4 model stops following instructions at ~1024 tokens regardless of compression. Edge vendors making "long-context replaces RAG" claims should publish effective working memory measurements alongside memory allocation numbers; the gap between the two is enormous.

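On the memory side of that gap, the raw KV-cache footprint follows directly from the model shape: 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element. A sketch using Llama-3.2-3B's published shape (28 layers, 8 KV heads under GQA, head dim 128); this counts the cache alone, so it won't reproduce the 9.5 GB figure, which also covers weights and runtime allocation:

```python
def kv_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """Raw KV-cache size: K and V tensors across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

full = kv_bytes(28, 8, 128, 128 * 1024, 2)  # Llama-3.2-3B, fp16 cache
print(full / 2**30)        # 14.0 GiB at the full 128K window
print(full / 6.4 / 2**30)  # ~2.2 GiB after 6.4x compression
```
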
If you have NIAH data at 1B–3B scale on llama.cpp / MLC / ollama defaults, we'd love to compare. We want to know whether this is a quant.cpp loader artifact or universal in this regime.