Commit a895624

unamedkr and claude committed
docs(pr): add Reddit r/LocalLLaMA Working Memory Cliff post (EN + KO)
Concise (~5.2 KB) Reddit-ready drafts in both English and Korean for today's launch of the Working Memory Cliff tech report. Each post:

- Acknowledges the v0.12 Beyond RAG claim and the criticism it received ("600 tokens is not a stress test")
- Headlines the cliff with the two model tables (1B graded, 3B step-function)
- Explains compression orthogonality (18/20 cells bit-for-bit)
- Reports the FP32-weights control eliminating the quantization confound
- Includes the Boulter + Sarah Chen synthesised hallucination example as the most viral failure-mode evidence
- Documents honest scope limitations
- Provides 5-minute reproduction commands
- Reframes Beyond RAG: works for documents fitting in *effective* working memory (2-3 orders of magnitude smaller than nominal)

Trimmed Discussion Questions section to keep under 5.5 KB while preserving the substantive contribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e07abce commit a895624

2 files changed: 182 additions & 0 deletions

# Reddit r/LocalLLaMA — The Working Memory Cliff (Korean)

**Title:** `[Research] We measured the working memory cliff of Llama-3.2-1B/3B-Q4 — both models fall to 0% retrieval at less than 1% of their nominal 128K context window`

**Flair:** `Research`

---

## Body

Last month we shared results measured on Llama-3.2-3B where [chunk-RAG hallucinated on 7/7 questions while 6.4× KV compression plus full-document loading answered 7/7 correctly](https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md). That was a synthetic 600-token document, and several commenters made the fair point that "600 tokens is not a stress test."

So we ran 204 NIAH trials across ctx 256–2048 to measure where the model actually breaks. The result is far steeper than we expected.

### The cliff
**Llama-3.2-1B-Instruct-Q8_0** (graded cliff):

| ctx | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 256 | 89% | 89% |
| **512** | **100%** | **100%** |
| 1024 | 44% | 22% |
| 1536 | 0% | 0% |
| 2048 | 0% | 0% |
**Llama-3.2-3B-Instruct-Q4** (default loader; **step function**):

| ctx | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 512 | 100% | 100% |
| **1024** | **100%** | **100%** |
| **1280** | **0%** | **0%** |
| 1536–2048 | 0% | 0% |
**The gap from 1024 → 1280 is 256 tokens.** 18/18 → 0/18. There is **no** degradation interval. The model flips from following the chat-template instruction perfectly to ignoring it completely within a 25% increase in context length.

Both models hit the limit of their effective working memory at **less than 1%** of the nominal 128K context window (1B Q8 ≈ 0.4%, 3B Q4 ≈ 0.78%).
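Those percentages are just the last fully correct context length divided by the nominal window; a quick sanity check, reading 128K as the usual 131,072 tokens (our reading, not a figure from the report):

```bash
# last ctx with 100% retrieval vs. the nominal 128K (131072-token) window
python3 -c "print(f'1B Q8: {512/131072:.2%}   3B Q4: {1024/131072:.2%}')"
# 1B Q8: 0.39%   3B Q4: 0.78%
```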
### KV compression is orthogonal to the cliff

We compared 6.4× `turbo_kv_4b -v q4 --k-window 128` against the FP32 KV baseline on the same grid. **18 of the 20 (model × ctx × method) cells are bit-for-bit identical between baseline and compressed.** The single disagreement is a 1B cliff cell where both methods are already at the noise floor.

To rule out a quantization confound, we also re-measured the cliff transition with FP32 *weights* (`TQ_NO_Q4=1`). Same cliff, same location: 100% at ctx=1024, 0% at ctx=1280, even with FP32 weights. **The cliff is a model property, not a KV-cache property and not a weight-quantization artifact.**
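To repeat just that control, something like the following should work, assuming `bench/niah_test.sh` honors the same environment variables as the reproduction command below; the 3B model path is a placeholder of ours, not a filename from the repo:

```bash
# re-check the 1024 -> 1280 transition with FP32 weights (TQ_NO_Q4=1 is the report's FP32-weights switch)
MODEL=models/Llama-3.2-3B-Instruct.gguf \
NIAH_CONTEXTS="1024 1280" \
TQ_NO_Q4=1 \
bash bench/niah_test.sh
```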
### The failure mode is not "I don't know"

Above the cliff, the model outputs one of three things. The first two (wikitext continuation, header echo) are unremarkable. The third is the decisive one.

**Synthesised hallucination, 1B fp32 ctx=1024:**

> *"In 2023 Boulter was hired as the chief financial officer..."*

The haystack is the Wikipedia article on the English actor Robert Boulter. The needle is "The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023." The model **fused the two facts** — it produced a coherent fabricated sentence grafting the needle's "2023 hire" onto Boulter's biography.

This is **the same silent-hallucination failure mode** that vector RAG produces on a retrieval miss — occurring in the very regime that was supposed to *eliminate* it.
### Honest scope

- 2 models (1B, 3B), 3 needles, 1 language (English), 1 domain (Wikipedia biography), 204 trials total
- We also tried 8B (Llama-3.1-Q4_K_M), but at ~12 minutes per inference on Metal the full grid is impractical. v2 work
- The 22 pp gap between fp32 and compressed at the 1B cliff cell is not statistically significant at n=9 — we found a CLI bug while attempting the seed sweep, fixed it in the same round, and deferred a proper stochastic robustness check to v2
- We tried 6 prompt formats and used the most permissive one. Format sensitivity is a separate ceiling that needs its own measurement

If you can falsify the cliff with a different model, prompt, or language, we welcome the data.
### Reproduction

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
cmake -B build_metal -DTQ_BUILD_METAL=ON && cmake --build build_metal -j8

# 1B Q8 sweep (~30 min on M-series)
MODEL=models/Llama-3.2-1B-Instruct-Q8_0.gguf \
NIAH_CONTEXTS="256 512 1024 1536 2048" \
bash bench/niah_test.sh
```
### Links

- **Tech report (arXiv-style draft)**: `docs/paper/working-memory-cliff.md` in the repo
- **Master table**: `bench/results/niah/master_table.md`
- **Raw CSVs + per-run CLI logs**: `bench/results/niah/`
- **GitHub**: https://github.com/quantumaikr/quant.cpp
### "Beyond RAG"에 대한 재해석
88+
89+
정직한 재해석: **Beyond RAG는 모델의 *effective* working memory에 들어가는 문서에 대해서만 동작하며, 그 크기는 명목 context window의 100분의 1에서 1000분의 1입니다.** 메모리 측면에서는 16GB Mac에 128K context가 9.5GB로 들어갑니다. Retrieval 측면에서는 3B Q4 모델이 ~1024토큰부터 instruction을 따르지 않습니다 (압축 여부와 무관). "Long-context replaces RAG"를 주장하는 edge-device 벤더는 메모리 할당 숫자와 함께 effective working memory 측정값도 함께 발표해야 합니다 — 격차가 거대합니다.
90+
91+
llama.cpp / MLC / ollama default에서 1B–3B 스케일 NIAH 측정 데이터를 가진 분 계신다면 비교해 보고 싶습니다. 이것이 quant.cpp 로더의 artifact인지, 이 regime의 보편적 현상인지 확인하고 싶습니다.
# Reddit r/LocalLLaMA — The Working Memory Cliff

**Title:** `[Research] We measured the working memory cliff of Llama-3.2-1B/3B-Q4 — both fall to 0% retrieval at <1% of their nominal 128K context window`

**Flair:** `Research`

---

## Body

We've been pushing a "load the whole document instead of chunking" position for a while in this sub. Last month we showed [chunk-RAG hallucinated on 7/7 questions while full-document inference got 7/7 correct](https://github.com/quantumaikr/quant.cpp/blob/main/docs/beyond-rag-manifesto.md) on a synthetic 600-token document with Llama-3.2-3B and 6.4× KV compression. Several of you pointed out — correctly — that 600 tokens is not a stress test.

So we ran 204 NIAH trials at context lengths 256–2048 to find where the model actually breaks. The result is sharper than we expected.

### The cliff
**Llama-3.2-1B-Instruct-Q8_0** (graded cliff):

| ctx | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 256 | 89% | 89% |
| **512** | **100%** | **100%** |
| 1024 | 44% | 22% |
| 1536 | 0% | 0% |
| 2048 | 0% | 0% |
**Llama-3.2-3B-Instruct-Q4** (default loader; **step function**):

| ctx | fp32 KV | 6.4× compressed |
|---:|:-:|:-:|
| 512 | 100% | 100% |
| **1024** | **100%** | **100%** |
| **1280** | **0%** | **0%** |
| 1536–2048 | 0% | 0% |
**1024 → 1280 is 256 tokens.** 18/18 → 0/18. There is no degradation interval. The model goes from following the chat-template instruction perfectly to completely ignoring it within a single 25% step in context length.

Both models hit the limit of their effective working memory at **less than 1% of their nominal 128K context window** (1B Q8 ≈ 0.4%, 3B Q4 ≈ 0.78%).
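For concreteness, those percentages are just the last fully correct context length over the nominal window; a quick check, taking 128K as the usual 131,072 tokens (our reading, not a number from the report):

```bash
# last ctx with 100% retrieval vs. the nominal 128K (131072-token) window
python3 -c "print(f'1B Q8: {512/131072:.2%}   3B Q4: {1024/131072:.2%}')"
# 1B Q8: 0.39%   3B Q4: 0.78%
```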
### KV compression is orthogonal to the cliff

We compared 6.4× `turbo_kv_4b -v q4 --k-window 128` against the FP32 KV baseline across the same grid. **18 of 20 (model × ctx × method) cells are bit-for-bit identical between baseline and compressed.** The single disagreement is the 1B cliff cell where both methods sit at the noise floor anyway.

We also re-ran the cliff transition with FP32 *weights* (`TQ_NO_Q4=1`) to rule out a quantization confound. Same cliff, same location: 100% at 1024, 0% at 1280, both with FP32 weights. **The cliff is a model property, not a KV-cache property and not a weight-quantization artifact.**
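If you only want to repeat that control, something like this should do it, assuming `bench/niah_test.sh` reads the same environment variables as the repro command below; the 3B model path is our placeholder, not a filename from the repo:

```bash
# re-check the 1024 -> 1280 transition with FP32 weights (TQ_NO_Q4=1 is the report's FP32-weights switch)
MODEL=models/Llama-3.2-3B-Instruct.gguf \
NIAH_CONTEXTS="1024 1280" \
TQ_NO_Q4=1 \
bash bench/niah_test.sh
```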
### The failure mode is *not* "I don't know"

Above the cliff, the model produces one of three things. The first two (wikitext continuation, header echoes) are unsurprising. The third is the consequential one.

**Synthesised hallucination, 1B fp32 ctx=1024:**

> *"In 2023 Boulter was hired as the chief financial officer..."*

The haystack is a Wikipedia article about Robert Boulter (English actor). The needle is "The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023." The model **fused them** — produced a coherent invented sentence grafting the needle's "2023 hire" onto Boulter's biography.

This is the same silent-hallucination failure mode that vector RAG produces on a retrieval miss — happening in the regime that was supposed to *eliminate* it.
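If it helps to picture what a single trial looks like, here is a rough sketch of the prompt assembly; the haystack filename and the question wording are ours for illustration, and the real logic lives in `bench/niah_test.sh`:

```bash
# hypothetical single-trial assembly: bury the needle mid-haystack, then ask for it
NEEDLE="The chief financial officer of Northwind Logistics is Sarah Chen, hired in 2023."
QUESTION="Who is the chief financial officer of Northwind Logistics, and when were they hired?"
HAYSTACK=haystack_boulter.txt   # Wikipedia bio text trimmed to the target ctx budget (placeholder file)
half=$(( $(wc -l < "$HAYSTACK") / 2 ))
{
  head -n "$half" "$HAYSTACK"            # first half of the distractor text
  echo "$NEEDLE"                         # needle inserted at ~50% depth
  tail -n +"$((half + 1))" "$HAYSTACK"   # second half of the distractor text
  echo
  echo "$QUESTION"
} > prompt.txt
```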
### Honest scope

- 2 models (1B, 3B), 3 needles, 1 language (English), 1 content domain (Wikipedia bio), 204 trials total.
- We tried 8B (Llama-3.1-Q4_K_M), but each inference takes ~12 min on Metal — the full grid is infeasible until we optimize the Q4_K_M kernel. v2 work.
- The 22 pp gap at the 1B cliff cell between fp32 and compressed is not statistically significant at n=9 — we ran into a CLI bug attempting the seed sweep, fixed it mid-round, and left the proper stochastic robustness check for v2.
- We tried 6 prompt formats and settled on the most permissive one. Format sensitivity is a separate ceiling worth measuring.

If you can falsify the cliff with a different model, prompt, or language, we want the data.
### Try it

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
cmake -B build_metal -DTQ_BUILD_METAL=ON && cmake --build build_metal -j8

# 1B Q8 sweep (~30 min on M-series)
MODEL=models/Llama-3.2-1B-Instruct-Q8_0.gguf \
NIAH_CONTEXTS="256 512 1024 1536 2048" \
bash bench/niah_test.sh
```
### Links

- **Tech report (arXiv-style draft)**: `docs/paper/working-memory-cliff.md` in the repo
- **Master table**: `bench/results/niah/master_table.md`
- **Raw CSVs + per-run CLI logs**: `bench/results/niah/`
- **GitHub**: https://github.com/quantumaikr/quant.cpp
### What this changes about "Beyond RAG"

The honest reframing: **Beyond RAG works for documents that fit in the model's *effective* working memory, which is 2–3 orders of magnitude smaller than the nominal context window** for the configurations we measured. Memory-wise, 128K context fits in 9.5 GB on a 16 GB Mac. Retrieval-wise, the 3B Q4 model stops following instructions at ~1024 tokens regardless of compression. Edge vendors making "long-context replaces RAG" claims should publish effective working memory measurements alongside memory allocation numbers — the gap is enormous.

If you have NIAH data at 1B–3B scale on llama.cpp / MLC / ollama defaults, we'd love to compare. We want to know if this is a quant.cpp loader artifact or universal in this regime.
