Commit 361c40f

unamedkr and claude committed
test: regression guards for today's DeltaNet fix + long-seq stress + README RLV
Hardening layer around f0091fc (Qwen3.5 DeltaNet regression):

1. scripts/test_models.sh — add Qwen3.5-4B STRICT case ("Hi" → "Hello")
   using --chat. Tier expanded to 8/8 PASS. If the DeltaNet guard ever
   regresses, this test fails immediately.

2. scripts/check_sync.sh — add [8] DeltaNet vs Phi-3 fused-QKV check.
   Inspects the `if (wqkv_t ...)` conditional in both quant.h and
   src/engine/tq_model.c — requires ssm_probe / layer_is_deltanet in the
   test. Verified: simulated regression → exit 1; current state → 0.

3. scripts/check_stale.sh — new. Compares each binary's mtime to its
   actual source (split-sources newest .c/.h, or quant.h for
   single-header binaries). Motivated by today's investigation where a
   stale build/quant-server-unified loaded a model differently from a
   fresh build/quant, making the DeltaNet bug look like a tokenization
   diff.

4. scripts/test_long_seq.sh — new. Generates 500 tokens at T=0 and
   rejects runs below 80% printable chars. Catches autoregressive
   collapse / repetition traps / NaN-spew that PPL (teacher-forced)
   can't see. 6/6 models PASS at 100% printable.

5. README.md — add v3 update block announcing RLV 10/10 on 12K-token
   wikitext (crosses the working memory cliff described in v2 update).
   Add "10/10 on 12K tokens" to the hero stats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f0091fc commit 361c40f

5 files changed

Lines changed: 238 additions & 1 deletion

README.md

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,7 @@

 <table align="center">
 <tr>
+<td align="center"><b>10/10 on 12K tokens</b><br>RLV crosses the cliff</td>
 <td align="center"><b>7/7 vs 0/7</b><br>Beyond RAG measured</td>
 <td align="center"><b>6.4x compression</b><br>+3% PPL</td>
 <td align="center"><b>128K context</b><br>on 16GB Mac</td>
@@ -160,6 +161,8 @@ The bug was using the same tool for both. The fix is using each for what it's go

 > **v2 update — the Working Memory Cliff (2026-04-11):** We followed up the v1 result with 204 NIAH trials across 1B and 3B at context lengths 256–2048, plus a 6-trial FP32-weights control. Both models hit a sharp cliff at **less than 1% of their nominal 128K context window** (1B Q8 at 512–1024, 3B Q4 at 1024–1280 *as a step function*). The 6.4× KV compression is bit-for-bit identical to FP32 baseline in 18 of 20 cells, so the cliff is a model property — not a KV property and not a weight-quantization artifact. The honest reframing: Beyond RAG works for documents that fit in the model's *effective* working memory, which is 2–3 orders of magnitude smaller than the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HF blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).

+> **v3 update — Crossing the Cliff with RLV (2026-04-14):** If the cliff is real, the fix is to stop asking one LLM call to hold a full document in working memory. **RLV (Read-Locate-Verify)** is a 5-stage pipeline — gist → locate → lookup → verify → research — where each stage stays below the ~1K-token cliff while the *document* can be arbitrarily long. On 12K-token wikitext (≈10× the cliff for Llama 3.2 3B Q4), **RLV scores 10/10** vs. 8/10 for verify-only and 1/10 for long-context-only. Key trick: BM25 + Reciprocal Rank Fusion does the locating; the LLM is only a tiebreaker. Runs on the same 16GB Mac as the 3B model — no RAG index, no embeddings. [`bench/rlv/`](bench/rlv/) · [`docs/phase3_rlv_challenge.md`](docs/phase3_rlv_challenge.md)
+
 ---

 ## More Features
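
The v3 note says the locate stage fuses rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows, assuming each ranker writes one chunk ID per line, best first. The helper name `rrf_fuse`, the file layout, and the constant `k=60` (a commonly used default) are illustrative assumptions and are not taken from `bench/rlv/`.

```shell
#!/usr/bin/env bash
# Reciprocal Rank Fusion sketch: score(d) = sum over rankings of 1 / (k + rank).
# Hypothetical helper; each input file holds one chunk ID per line, best first.
rrf_fuse() {
  LC_ALL=C awk -v k=60 '
    NF { score[$1] += 1 / (k + FNR) }   # FNR is the rank within the current file
    END { for (d in score) printf "%.6f\t%s\n", score[d], d }
  ' "$@" | LC_ALL=C sort -t $'\t' -k1,1nr -k2,2 | cut -f2
}

# Demo: two rankings that disagree on the top hit.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'c1\nc2\nc3\n' > "$tmp/bm25.txt"    # stand-in for a BM25 ranking
printf 'c2\nc3\nc1\n' > "$tmp/other.txt"   # stand-in for a second ranker
rrf_fuse "$tmp/bm25.txt" "$tmp/other.txt"
```

A chunk that sits near the top of both lists beats one that is first in a single list and last in the other, which is why the fused order here starts with `c2`.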

scripts/check_stale.sh

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# check_stale.sh — warn when built binaries are older than their sources.
+#
+# Why: on 2026-04-15 we spent significant time chasing a "CLI vs server
+# behavior mismatch" that turned out to be a stale server binary linked
+# against an older libturboquant that pre-dated a Qwen3.5 DeltaNet fix.
+# Both tools loaded the same GGUF file and produced different results.
+#
+# Usage: bash scripts/check_stale.sh [build_dir]
+# Exit codes: 0 = all fresh, 1 = at least one binary is stale.
+
+set -u
+BUILD="${1:-build}"
+RED='\033[0;31m'
+YELLOW='\033[1;33m'
+GREEN='\033[0;32m'
+NC='\033[0m'
+
+if [ ! -d "$BUILD" ]; then
+    echo "build dir '$BUILD' not found"
+    exit 0
+fi
+
+mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }
+
+# Compare binaries to the newest SOURCE file — not to build artifacts,
+# since CMake touches dylibs even when nothing changes.
+#   lib_src_mtime — newest .c/.h/.m/.metal in src/ and include/
+#   header_mtime  — quant.h mtime (covers single-header binaries)
+lib_src_mtime=0
+lib_src_path=""
+# Walk find's output and keep the newest mtime (no -printf or sort needed).
+while IFS= read -r f; do
+    m=$(mtime "$f")
+    if [ "$m" -gt "$lib_src_mtime" ]; then
+        lib_src_mtime="$m"
+        lib_src_path="$f"
+    fi
+done < <(find src include -type f \( -name '*.c' -o -name '*.h' -o -name '*.m' -o -name '*.metal' \) 2>/dev/null)
+
+header_mtime=0
+if [ -f "quant.h" ]; then
+    header_mtime=$(mtime "quant.h")
+fi
+
+if [ "$lib_src_mtime" -eq 0 ] && [ "$header_mtime" -eq 0 ]; then
+    echo "no sources found — are you in the repo root?"
+    exit 0
+fi
+
+[ "$lib_src_mtime" -gt 0 ] && echo "split-source newest: $lib_src_path ($(date -r "$lib_src_mtime" 2>/dev/null || date -d "@$lib_src_mtime" 2>/dev/null))"
+[ "$header_mtime" -gt 0 ] && echo "single-header ref: quant.h ($(date -r "$header_mtime" 2>/dev/null || date -d "@$header_mtime" 2>/dev/null))"
+echo ""
+
+STALE=0
+CHECKED=0
+
+check_bin() {
+    local bin="$1"
+    local ref_mtime="$2"
+    local ref_label="$3"
+    if [ ! -f "$bin" ]; then return; fi
+    CHECKED=$((CHECKED + 1))
+    local m age hours
+    m=$(mtime "$bin")
+    age=$((ref_mtime - m))
+    local name
+    name=$(basename "$bin")
+    if [ "$age" -gt 0 ]; then
+        STALE=$((STALE + 1))
+        hours=$((age / 3600))
+        echo -e " ${RED}${NC} $name is STALE vs $ref_label (older by ${hours}h)"
+    else
+        echo -e " ${GREEN}${NC} $name is fresh vs $ref_label"
+    fi
+}
+
+# Split-source binaries — compared against newest src/ file.
+if [ "$lib_src_mtime" -gt 0 ]; then
+    for bin in \
+        "$BUILD/quant" \
+        "$BUILD/quant-server" \
+        "$BUILD/tq_convert" \
+        "$BUILD/standalone"; do
+        check_bin "$bin" "$lib_src_mtime" "split sources"
+    done
+fi
+
+# Single-header binaries — compile quant.h directly, compared against it.
+if [ "$header_mtime" -gt 0 ]; then
+    for bin in \
+        "$BUILD/quant-server-unified" \
+        "$BUILD/single_header_example"; do
+        check_bin "$bin" "$header_mtime" "quant.h"
+    done
+fi
+
+echo ""
+if [ "$STALE" -gt 0 ]; then
+    echo -e "${RED}$STALE/$CHECKED binaries are stale.${NC} Rebuild with:"
+    echo " cmake --build $BUILD --target quant-server-unified single_header_example quant"
+    exit 1
+fi
+echo -e "${GREEN}All $CHECKED binaries are fresh.${NC}"
+exit 0
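
The core question the script answers is whether a reference source is newer than a binary. As a hedged alternative sketch, bash's built-in `-nt` file test makes that comparison without parsing `stat` output. The function name `is_stale` and the temp-file names are hypothetical, and unlike `check_stale.sh` this version does not report how many hours old the binary is.

```shell
#!/usr/bin/env bash
# is_stale BIN REF: succeeds (exit 0) when REF was modified more recently than BIN.
# Hypothetical helper, not part of the commit.
is_stale() {
  local bin="$1" ref="$2"
  [[ -e "$bin" && -e "$ref" && "$ref" -nt "$bin" ]]
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
touch -t 202601010000 "$tmp/quant"     # pretend binary, built earlier
touch -t 202601020000 "$tmp/quant.h"   # pretend source, edited a day later

if is_stale "$tmp/quant" "$tmp/quant.h"; then
  echo "quant is STALE vs quant.h"
else
  echo "quant is fresh vs quant.h"
fi
```

The trade-off is that `-nt` only gives a yes/no answer, so the age in hours still needs an explicit mtime lookup.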

scripts/check_sync.sh

Lines changed: 29 additions & 0 deletions
@@ -140,6 +140,35 @@ echo "[7] GGUF dequant memory free"
 check_both_have "free(layer->attn_norm)" "free(layer->attn_norm)" \
     "$HEADER" "src/engine/tq_model.c"

+# --- 8. DeltaNet / Phi-3 fused-QKV disambiguation ---
+# Regression guard (f0091fc, 2026-04-15): when a layer has attn_qkv.weight,
+# the Phi-3 fused-QKV path must NOT trigger for Qwen3.5 DeltaNet layers.
+# Both files must probe for ssm_a (DeltaNet marker) and skip the fused path.
+echo ""
+echo "[8] DeltaNet vs Phi-3 fused-QKV disambiguation"
+check_guard() {
+    local label="$1"
+    local file="$2"
+    # Inspect the `if (wqkv_t ...)` conditional: it must also test a DeltaNet
+    # marker (ssm_probe, layer_is_deltanet, or !deltanet). A bare `if (wqkv_t)`
+    # means the Phi-3 fused-QKV path will mis-match Qwen3.5 DeltaNet layers.
+    local cond
+    cond=$(grep -E "if \(wqkv_t[^)]*\)" "$file" | head -1)
+    if [ -z "$cond" ]; then
+        echo -e " ${YELLOW}${NC} $label: no wqkv_t conditional found (OK if Phi-3 path absent)"
+        return
+    fi
+    if echo "$cond" | grep -qE "ssm_probe|layer_is_deltanet|is_deltanet|!deltanet"; then
+        echo -e " ${GREEN}${NC} $label: DeltaNet guard in conditional"
+    else
+        echo -e " ${RED}${NC} $label: bare 'if (wqkv_t)' — Phi-3 path will mis-match Qwen3.5 DeltaNet layers"
+        echo " offending line: $cond"
+        ERRORS=$((ERRORS + 1))
+    fi
+}
+check_guard "quant.h" "$HEADER"
+check_guard "split-source tq_model.c" "src/engine/tq_model.c"
+
 # --- Summary ---
 echo ""
 echo "========================================="
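
The commit message mentions verifying the guard against a simulated regression. A minimal sketch of that kind of check is below. The C fragments are hypothetical stand-ins for the real sources, and `has_deltanet_guard` is an illustrative name that returns a status code instead of incrementing `ERRORS`.

```shell
#!/usr/bin/env bash
# has_deltanet_guard FILE
#   0 = first `if (wqkv_t ...)` line also names a DeltaNet marker
#   1 = bare conditional (the regression case)
#   2 = no wqkv_t conditional found
has_deltanet_guard() {
  local cond
  cond=$(grep -E 'if \(wqkv_t[^)]*\)' "$1" | head -1)
  [ -z "$cond" ] && return 2
  echo "$cond" | grep -qE 'ssm_probe|is_deltanet|!deltanet'
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
# Hypothetical one-line fragments: one guarded, one simulating the regression.
printf 'if (wqkv_t && !layer_is_deltanet) {\n' > "$tmp/guarded.c"
printf 'if (wqkv_t) {\n'                       > "$tmp/bare.c"

for f in guarded bare; do
  has_deltanet_guard "$tmp/$f.c"
  echo "$f.c -> exit $?"
done
```

Because the check is a line-level grep, it only inspects the first matching conditional and would miss a guard placed on a separate line.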

scripts/test_long_seq.sh

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# test_long_seq.sh — autoregressive stress test.
+#
+# Why: PPL (teacher-forced) can be fine while T=0 generation collapses
+# after a few hundred tokens — a failure mode KV compression bugs
+# typically produce. This test generates 500 tokens at T=0 and rejects
+# runs where printable chars fall below 80% of the output (indicating
+# repetition-trap garbage, NaN-spew, or token-ID soup).
+#
+# Complements test_models.sh (which checks coherence of the first 10 tokens).
+
+set -u
+MODELS_DIR="${1:-models}"
+QUANT_BIN="${QUANT_BIN:-./build/quant}"
+N_TOKENS=500
+PASS=0
+FAIL=0
+SKIP=0
+TMP=$(mktemp)
+trap 'rm -f "$TMP"' EXIT
+
+if [[ ! -x "$QUANT_BIN" ]]; then
+    echo "ERROR: $QUANT_BIN not built." >&2
+    exit 1
+fi
+
+run_long() {
+    local model="$1"
+    local prompt="$2"
+    local chat_flag="${3:-}"
+    local extra_env="${4:-TQ_NO_METAL=1}"
+
+    if [[ ! -f "$MODELS_DIR/$model" ]]; then
+        printf " %-50s [SKIP] not found\n" "$model"
+        SKIP=$((SKIP + 1))
+        return
+    fi
+
+    env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" $chat_flag \
+        -p "$prompt" -n "$N_TOKENS" -T 0 > "$TMP" 2>/dev/null
+
+    local total printable ratio preview
+    total=$(wc -c < "$TMP")
+    # Printable = ASCII printable + whitespace + valid UTF-8 multibyte
+    # Approximation: chars passing tr -cd '[:print:][:space:]' OR bytes >= 0x80.
+    printable=$(tr -cd '[:print:][:space:]\200-\377' < "$TMP" | wc -c)
+
+    if [ "$total" -lt 100 ]; then
+        printf " %-50s [FAIL] too short (%d bytes)\n" "$model" "$total"
+        FAIL=$((FAIL + 1))
+        return
+    fi
+
+    # integer percentage
+    ratio=$(( printable * 100 / total ))
+    # Preview: first 60 chars, newlines squashed
+    preview=$(tr '\n' ' ' < "$TMP" | tr -s ' ' | cut -c1-60)
+
+    if [ "$ratio" -ge 80 ]; then
+        printf " %-50s [PASS] %d%% printable, %d bytes | '%s...'\n" \
+            "$model" "$ratio" "$total" "$preview"
+        PASS=$((PASS + 1))
+    else
+        printf " %-50s [FAIL] %d%% printable, %d bytes | '%s...'\n" \
+            "$model" "$ratio" "$total" "$preview"
+        FAIL=$((FAIL + 1))
+    fi
+}
+
+echo "=== quant.cpp Long-Sequence Stress Test (N=$N_TOKENS, T=0) ==="
+echo "Models dir: $MODELS_DIR"
+echo ""
+
+# Short story continuation prompts — must sustain coherent generation.
+run_long "Llama-3.2-1B-Instruct-Q8_0.gguf" \
+    "Once upon a time in a small village by the sea, there lived a young woman named Elena who"
+run_long "Llama-3.2-3B-Instruct-Q8_0.gguf" \
+    "Once upon a time in a small village by the sea, there lived a young woman named Elena who"
+run_long "Phi-3.5-mini-instruct-Q8_0.gguf" \
+    "Here is a short essay on the importance of clear writing:"
+run_long "Phi-3.5-mini-instruct-Q4_K_M.gguf" \
+    "Here is a short essay on the importance of clear writing:"
+run_long "Qwen3.5-4B-Q4_K_M.gguf" \
+    "Write a short story about a robot who learns to paint" "--chat"
+run_long "gemma-4-e2b-it-Q8_0.gguf" \
+    "Write a short paragraph about the solar system:"
+
+echo ""
+echo "--- Summary ---"
+echo " PASS: $PASS"
+echo " FAIL: $FAIL"
+echo " SKIP: $SKIP"
+
+[ "$FAIL" -gt 0 ] && exit 1
+exit 0
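
The pass/fail decision rests on one number: the integer share of output bytes that survive the `tr` filter. A small sketch of that calculation on synthetic input is below. The helper name `printable_pct` is hypothetical, and the `LC_ALL=C` setting and the empty-file guard are additions that the script itself does not have.

```shell
#!/usr/bin/env bash
# printable_pct FILE: integer percentage of bytes that are printable ASCII,
# whitespace, or high-bit bytes (>= 0x80, counted as UTF-8 payload).
printable_pct() {
  local total printable
  total=$(( $(wc -c < "$1") ))   # arithmetic expansion drops wc's padding
  printable=$(( $(LC_ALL=C tr -cd '[:print:][:space:]\200-\377' < "$1" | wc -c) ))
  if [ "$total" -eq 0 ]; then echo 0; return; fi
  echo $(( printable * 100 / total ))
}

tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
printf 'ab\001\002' > "$tmp"                 # 2 printable bytes + 2 control bytes
echo "control-byte sample: $(printable_pct "$tmp")%"
printf 'Elena walked to the sea.\n' > "$tmp"
echo "plain-text sample:   $(printable_pct "$tmp")%"
```

Counting every high-bit byte as printable keeps multilingual output from failing, at the cost of also accepting invalid UTF-8 sequences.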

scripts/test_models.sh

Lines changed: 6 additions & 1 deletion
@@ -30,6 +30,7 @@ run_test() {
     local expected="$3"
     local tier="$4"
     local extra_env="${5:-}"
+    local extra_args="${6:-}"

     if [[ ! -f "$MODELS_DIR/$model" ]]; then
         printf " %-50s [SKIP] not found\n" "$model"
@@ -40,7 +41,7 @@ run_test() {
     local out
     # Capture full output (replace newlines with space) — avoids missing
     # output when first line is empty (newline-prefixed generation).
-    out=$(env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" -p "$prompt" -n 10 -T 0 2>/dev/null | tr '\n' ' ' | sed 's/  */ /g')
+    out=$(env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" $extra_args -p "$prompt" -n 10 -T 0 2>/dev/null | tr '\n' ' ' | sed 's/  */ /g')

     case "$tier" in
         STRICT)
@@ -86,6 +87,10 @@ echo "--- COHERENT tier (must produce non-garbage text) ---"
 run_test "Llama-3.2-1B-Instruct-Q8_0.gguf" "Hello" "" COHERENT "TQ_NO_METAL=1"
 run_test "Llama-3.2-3B-Instruct-Q8_0.gguf" "Hello" "" COHERENT "TQ_NO_METAL=1"
 run_test "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" "Hello" "" COHERENT "TQ_NO_METAL=1"
+# Regression guard: DeltaNet layers must NOT be treated as self_attn.
+# Fixed in f0091fc (2026-04-15); before that CLI produced whitespace garbage.
+# Uses --chat so the ChatML template wrapping is tested end-to-end.
+run_test "Qwen3.5-4B-Q4_K_M.gguf" "Hi" "Hello" STRICT "TQ_NO_METAL=1" "--chat"

 echo ""
 echo "--- Summary ---"
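
The test folds the model's output onto one line before comparing it with the expected string. A minimal sketch of that capture-and-compare step follows. It assumes STRICT means "the output contains the expected substring"; the tier's actual body is not shown in this diff, so `strict_match` is a hypothetical stand-in, as is the sample model output.

```shell
#!/usr/bin/env bash
# normalize: fold newlines into spaces, then squeeze runs of spaces (reads stdin).
normalize() { tr '\n' ' ' | sed 's/  */ /g'; }

# strict_match OUTPUT EXPECTED: assumed STRICT rule, succeeds when OUTPUT
# contains EXPECTED as a substring.
strict_match() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Newline-prefixed generation, as described in the script's comment.
out=$(printf '\nHello!\n\nHow can I help?\n' | normalize)
if strict_match "$out" "Hello"; then
  echo "[PASS] '$out'"
else
  echo "[FAIL] '$out'"
fi
```

Normalizing first means a reply that starts with an empty line still matches, which is the case the original comment calls out.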
