test: regression guards for today's DeltaNet fix + long-seq stress + README RLV

unamedkr · claude · unamedkr · commit 361c40f363c0 · 2026-04-15T08:22:40.000+09:00
Hardening layer around f0091fc (Qwen3.5 DeltaNet regression):

1. scripts/test_models.sh: add Qwen3.5-4B STRICT case ("Hi" → "Hello")
   using --chat. Tier expanded to 8/8 PASS. If the DeltaNet guard ever
   regresses, this test fails immediately.

2. scripts/check_sync.sh: add [8] DeltaNet vs Phi-3 fused-QKV check.
   Inspects the `if (wqkv_t ...)` conditional in both quant.h and
   src/engine/tq_model.c, and requires ssm_probe / layer_is_deltanet in
   the test. Verified: simulated regression → exit 1; current state → 0.

3. scripts/check_stale.sh: new. Compares each binary's mtime to its
   actual source (split-sources newest .c/.h, or quant.h for
   single-header binaries). Motivated by today's investigation where a
   stale build/quant-server-unified loaded a model differently from a
   fresh build/quant, making the DeltaNet bug look like a tokenization
   diff.

4. scripts/test_long_seq.sh: new. Generates 500 tokens at T=0 and
   rejects runs below 80% printable chars. Catches autoregressive
   collapse / repetition traps / NaN-spew that PPL (teacher-forced)
   can't see. 6/6 models PASS at 100% printable.

5. README.md: add v3 update block announcing RLV 10/10 on 12K-token
   wikitext (crosses the working memory cliff described in v2 update).
   Add "10/10 on 12K tokens" to the hero stats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -13,6 +13,7 @@
 
 <table align="center">
 <tr>
+<td align="center"><b>10/10 on 12K tokens</b><br>RLV crosses the cliff</td>
 <td align="center"><b>7/7 vs 0/7</b><br>Beyond RAG measured</td>
 <td align="center"><b>6.4x compression</b><br>+3% PPL</td>
 <td align="center"><b>128K context</b><br>on 16GB Mac</td>
@@ -160,6 +161,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
 
 > **v2 update — the Working Memory Cliff (2026-04-11):** We followed up the v1 result with 204 NIAH trials across 1B and 3B at context lengths 256–2048, plus a 6-trial FP32-weights control. Both models hit a sharp cliff at **less than 1% of their nominal 128K context window** (1B Q8 at 512–1024, 3B Q4 at 1024–1280 *as a step function*). The 6.4× KV compression is bit-for-bit identical to FP32 baseline in 18 of 20 cells, so the cliff is a model property — not a KV property and not a weight-quantization artifact. The honest reframing: Beyond RAG works for documents that fit in the model's *effective* working memory, which is 2–3 orders of magnitude smaller than the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HF blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
 
+> **v3 update — Crossing the Cliff with RLV (2026-04-14):** If the cliff is real, the fix is to stop asking one LLM call to hold a full document in working memory. **RLV (Read-Locate-Verify)** is a 5-stage pipeline — gist → locate → lookup → verify → research — where each stage stays below the ~1K-token cliff while the *document* can be arbitrarily long. On 12K-token wikitext (≈10× the cliff for Llama 3.2 3B Q4), **RLV scores 10/10** vs. 8/10 for verify-only and 1/10 for long-context-only. Key trick: BM25 + Reciprocal Rank Fusion does the locating; the LLM is only a tiebreaker. Runs on the same 16GB Mac as the 3B model — no RAG index, no embeddings. [`bench/rlv/`](bench/rlv/) · [`docs/phase3_rlv_challenge.md`](docs/phase3_rlv_challenge.md)
+
 ---
 
 ## More Features
diff --git a/scripts/check_stale.sh b/scripts/check_stale.sh
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# check_stale.sh: warn when built binaries are older than their sources.
+#
+# Why: on 2026-04-15 we spent significant time chasing a "CLI vs server
+# behavior mismatch" that turned out to be a stale server binary linked
+# against an older libturboquant that pre-dated a Qwen3.5 DeltaNet fix.
+# Both tools loaded the same GGUF file and produced different results.
+#
+# Usage: bash scripts/check_stale.sh [build_dir]
+# Exit codes: 0 = all fresh (or nothing to check), 1 = at least one binary is stale.
+
+set -u
+BUILD="${1:-build}"
+RED='\033[0;31m'
+YELLOW='\033[1;33m'
+GREEN='\033[0;32m'
+NC='\033[0m'
+
+if [ ! -d "$BUILD" ]; then
+    echo "build dir '$BUILD' not found"
+    exit 0
+fi
+
+mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }  # GNU first: GNU `stat -f` means --file-system
+
+# Compare binaries to the newest SOURCE file — not to build artifacts,
+# since CMake touches dylibs even when nothing changes.
+#   lib_src_mtime    — newest .c/.h in src/ and include/
+#   header_mtime     — quant.h mtime (covers single-header binaries)
+lib_src_mtime=0
+lib_src_path=""
+# Walk the sources with find and keep the newest mtime (see mtime() for macOS/Linux stat).
+while IFS= read -r f; do
+    m=$(mtime "$f")
+    if [ "$m" -gt "$lib_src_mtime" ]; then
+        lib_src_mtime="$m"
+        lib_src_path="$f"
+    fi
+done < <(find src include -type f \( -name '*.c' -o -name '*.h' -o -name '*.m' -o -name '*.metal' \) 2>/dev/null)
+
+header_mtime=0
+if [ -f "quant.h" ]; then
+    header_mtime=$(mtime "quant.h")
+fi
+
+if [ "$lib_src_mtime" -eq 0 ] && [ "$header_mtime" -eq 0 ]; then
+    echo "no sources found — are you in the repo root?"
+    exit 0
+fi
+
+[ "$lib_src_mtime" -gt 0 ] && echo "split-source newest: $lib_src_path ($(date -r "$lib_src_mtime" 2>/dev/null || date -d "@$lib_src_mtime" 2>/dev/null))"
+[ "$header_mtime" -gt 0 ] && echo "single-header ref:   quant.h ($(date -r "$header_mtime" 2>/dev/null || date -d "@$header_mtime" 2>/dev/null))"
+echo ""
+
+STALE=0
+CHECKED=0
+
+check_bin() {
+    local bin="$1"
+    local ref_mtime="$2"
+    local ref_label="$3"
+    if [ ! -f "$bin" ]; then return; fi
+    CHECKED=$((CHECKED + 1))
+    local m age hours
+    m=$(mtime "$bin")
+    age=$((ref_mtime - m))
+    local name
+    name=$(basename "$bin")
+    if [ "$age" -gt 0 ]; then
+        STALE=$((STALE + 1))
+        hours=$((age / 3600))
+        echo -e "  ${RED}✗${NC} $name is STALE vs $ref_label (older by ${hours}h)"
+    else
+        echo -e "  ${GREEN}✓${NC} $name is fresh vs $ref_label"
+    fi
+}
+
+# Split-source binaries — compared against newest src/ file.
+if [ "$lib_src_mtime" -gt 0 ]; then
+    for bin in \
+        "$BUILD/quant" \
+        "$BUILD/quant-server" \
+        "$BUILD/tq_convert" \
+        "$BUILD/standalone"; do
+        check_bin "$bin" "$lib_src_mtime" "split sources"
+    done
+fi
+
+# Single-header binaries — compile quant.h directly, compared against it.
+if [ "$header_mtime" -gt 0 ]; then
+    for bin in \
+        "$BUILD/quant-server-unified" \
+        "$BUILD/single_header_example"; do
+        check_bin "$bin" "$header_mtime" "quant.h"
+    done
+fi
+
+echo ""
+if [ "$STALE" -gt 0 ]; then
+    echo -e "${RED}$STALE/$CHECKED binaries are stale.${NC} Rebuild with:"
+    echo "  cmake --build $BUILD --target quant-server-unified single_header_example quant"
+    exit 1
+fi
+echo -e "${GREEN}All $CHECKED binaries are fresh.${NC}"
+exit 0
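
The staleness rule in `check_stale.sh` above (a binary is stale when its mtime is older than its reference source) can be tried in isolation. The following is a minimal sketch, not part of the commit: `is_stale`, `fake_bin`, and `fake_src` are hypothetical names, and the files are empty placeholders whose timestamps are set with `touch -t`. The sketch tries GNU `stat -c` before BSD `stat -f`, because GNU `stat -f` reports filesystem status rather than file status.

```shell
#!/usr/bin/env bash
# Sketch only: is_stale and the temp-file names are hypothetical, not repo code.

# Portable mtime in epoch seconds: GNU stat first, then BSD/macOS stat.
mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }

# Succeeds (exit 0) when the binary is strictly older than its reference file.
is_stale() {
    local bin="$1" ref="$2"
    [ "$(mtime "$bin")" -lt "$(mtime "$ref")" ]
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
touch -t 202604150800 "$demo/fake_bin"   # "built" at 08:00
touch -t 202604150900 "$demo/fake_src"   # "edited" at 09:00

if is_stale "$demo/fake_bin" "$demo/fake_src"; then
    echo "fake_bin is STALE"
else
    echo "fake_bin is fresh"
fi
```

Comparing against source files rather than build artifacts mirrors the script's own rationale: artifacts can be touched by the build system even when nothing changed.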
diff --git a/scripts/check_sync.sh b/scripts/check_sync.sh
@@ -140,6 +140,35 @@ echo "[7] GGUF dequant memory free"
 check_both_have "free(layer->attn_norm)" "free(layer->attn_norm)" \
     "$HEADER" "src/engine/tq_model.c"
 
+# --- 8. DeltaNet / Phi-3 fused-QKV disambiguation ---
+# Regression guard (f0091fc, 2026-04-15): when a layer has attn_qkv.weight,
+# the Phi-3 fused-QKV path must NOT trigger for Qwen3.5 DeltaNet layers.
+# Both files must probe for ssm_a (DeltaNet marker) and skip the fused path.
+echo ""
+echo "[8] DeltaNet vs Phi-3 fused-QKV disambiguation"
+check_guard() {
+    local label="$1"
+    local file="$2"
+    # Inspect the `if (wqkv_t ...)` conditional: it must also test a DeltaNet
+    # marker (ssm_probe, layer_is_deltanet, or !deltanet). A bare `if (wqkv_t)`
+    # means the Phi-3 fused-QKV path will mis-match Qwen3.5 DeltaNet layers.
+    local cond
+    cond=$(grep -E "if \(wqkv_t[^)]*\)" "$file" | head -1)
+    if [ -z "$cond" ]; then
+        echo -e "  ${YELLOW}—${NC} $label: no wqkv_t conditional found (OK if Phi-3 path absent)"
+        return
+    fi
+    if echo "$cond" | grep -qE "ssm_probe|layer_is_deltanet|is_deltanet|!deltanet"; then
+        echo -e "  ${GREEN}✓${NC} $label: DeltaNet guard in conditional"
+    else
+        echo -e "  ${RED}✗${NC} $label: bare 'if (wqkv_t)' — Phi-3 path will mis-match Qwen3.5 DeltaNet layers"
+        echo "      offending line: $cond"
+        ERRORS=$((ERRORS + 1))
+    fi
+}
+check_guard "quant.h" "$HEADER"
+check_guard "split-source tq_model.c" "src/engine/tq_model.c"
+
 # --- Summary ---
 echo ""
 echo "========================================="
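
Check [8] above boils down to two greps: find the first `if (wqkv_t ...)` line, then require a DeltaNet marker on it. The sketch below reproduces that logic against two throwaway files so the pass and fail cases can be seen side by side. `has_deltanet_guard` and the sample C lines are hypothetical stand-ins; the real conditionals in quant.h and tq_model.c are not shown in this diff and may look different.

```shell
#!/usr/bin/env bash
# Sketch only: has_deltanet_guard and the sample C lines are hypothetical.

# Exit 0 if the first `if (wqkv_t ...)` line mentions a DeltaNet marker,
# 1 if that conditional is bare, 2 if no such conditional exists.
has_deltanet_guard() {
    local file="$1" cond
    cond=$(grep -E 'if \(wqkv_t[^)]*\)' "$file" | head -1)
    [ -n "$cond" ] || return 2
    printf '%s\n' "$cond" | grep -qE 'ssm_probe|is_deltanet|!deltanet'
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf '%s\n' '    if (wqkv_t && !layer_is_deltanet) {' > "$demo/guarded.c"
printf '%s\n' '    if (wqkv_t) {'                       > "$demo/bare.c"

for f in guarded bare; do
    if has_deltanet_guard "$demo/$f.c"; then
        echo "$f: guard present"
    else
        echo "$f: guard missing"
    fi
done
```

Like the original, this is a textual check on a single line: it only inspects the first matching conditional and cannot see a guard applied on an earlier line.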
diff --git a/scripts/test_long_seq.sh b/scripts/test_long_seq.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# test_long_seq.sh — autoregressive stress test.
+#
+# Why: PPL (teacher-forced) can be fine while T=0 generation collapses
+# after a few hundred tokens — a failure mode KV compression bugs
+# typically produce. This test generates 500 tokens at T=0 and rejects
+# runs where printable chars fall below 80% of the output (indicating
+# repetition-trap garbage, NaN-spew, or token-ID soup).
+#
+# Complements test_models.sh (which checks coherence of the first 10 tokens).
+
+set -u
+MODELS_DIR="${1:-models}"
+QUANT_BIN="${QUANT_BIN:-./build/quant}"
+N_TOKENS=500
+PASS=0
+FAIL=0
+SKIP=0
+TMP=$(mktemp)
+trap 'rm -f "$TMP"' EXIT
+
+if [[ ! -x "$QUANT_BIN" ]]; then
+    echo "ERROR: $QUANT_BIN not built." >&2
+    exit 1
+fi
+
+run_long() {
+    local model="$1"
+    local prompt="$2"
+    local chat_flag="${3:-}"
+    local extra_env="${4:-TQ_NO_METAL=1}"
+
+    if [[ ! -f "$MODELS_DIR/$model" ]]; then
+        printf "  %-50s [SKIP] not found\n" "$model"
+        SKIP=$((SKIP + 1))
+        return
+    fi
+
+    env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" $chat_flag \
+        -p "$prompt" -n "$N_TOKENS" -T 0 > "$TMP" 2>/dev/null
+
+    local total printable ratio preview
+    total=$(wc -c < "$TMP")
+    # Printable = ASCII printable + whitespace + any byte >= 0x80 (a cheap
+    # stand-in for UTF-8 multibyte text). LC_ALL=C makes tr work on raw bytes.
+    printable=$(LC_ALL=C tr -cd '[:print:][:space:]\200-\377' < "$TMP" | wc -c)
+
+    if [ "$total" -lt 100 ]; then
+        printf "  %-50s [FAIL] too short (%d bytes)\n" "$model" "$total"
+        FAIL=$((FAIL + 1))
+        return
+    fi
+
+    # integer percentage
+    ratio=$(( printable * 100 / total ))
+    # Preview: first 60 chars, newlines squashed
+    preview=$(tr '\n' ' ' < "$TMP" | tr -s ' ' | cut -c1-60)
+
+    if [ "$ratio" -ge 80 ]; then
+        printf "  %-50s [PASS] %d%% printable, %d bytes | '%s...'\n" \
+            "$model" "$ratio" "$total" "$preview"
+        PASS=$((PASS + 1))
+    else
+        printf "  %-50s [FAIL] %d%% printable, %d bytes | '%s...'\n" \
+            "$model" "$ratio" "$total" "$preview"
+        FAIL=$((FAIL + 1))
+    fi
+}
+
+echo "=== quant.cpp Long-Sequence Stress Test (N=$N_TOKENS, T=0) ==="
+echo "Models dir: $MODELS_DIR"
+echo ""
+
+# Short story continuation prompts — must sustain coherent generation.
+run_long "Llama-3.2-1B-Instruct-Q8_0.gguf" \
+    "Once upon a time in a small village by the sea, there lived a young woman named Elena who"
+run_long "Llama-3.2-3B-Instruct-Q8_0.gguf" \
+    "Once upon a time in a small village by the sea, there lived a young woman named Elena who"
+run_long "Phi-3.5-mini-instruct-Q8_0.gguf" \
+    "Here is a short essay on the importance of clear writing:"
+run_long "Phi-3.5-mini-instruct-Q4_K_M.gguf" \
+    "Here is a short essay on the importance of clear writing:"
+run_long "Qwen3.5-4B-Q4_K_M.gguf" \
+    "Write a short story about a robot who learns to paint" "--chat"
+run_long "gemma-4-e2b-it-Q8_0.gguf" \
+    "Write a short paragraph about the solar system:"
+
+echo ""
+echo "--- Summary ---"
+echo "  PASS: $PASS"
+echo "  FAIL: $FAIL"
+echo "  SKIP: $SKIP"
+
+[ "$FAIL" -gt 0 ] && exit 1
+exit 0
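
The printable-ratio metric used by `test_long_seq.sh` above can be checked on small inputs without a model. This is a minimal sketch under stated assumptions: `printable_pct` and the two sample files are hypothetical, and `LC_ALL=C` is set so that `tr` operates on raw bytes regardless of the caller's locale. As in the script, every byte at or above 0x80 counts as printable, so the metric approximates rather than validates UTF-8.

```shell
#!/usr/bin/env bash
# Sketch only: printable_pct and the sample files are hypothetical.

# Print the integer percentage of bytes that are ASCII printable, whitespace,
# or >= 0x80 (a rough proxy for UTF-8 multibyte text). Empty input yields 0.
printable_pct() {
    local file="$1" total printable
    total=$(wc -c < "$file" | tr -d ' ')
    [ "$total" -gt 0 ] || { echo 0; return; }
    printable=$(LC_ALL=C tr -cd '[:print:][:space:]\200-\377' < "$file" | wc -c | tr -d ' ')
    echo $(( printable * 100 / total ))
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf 'Once upon a time\n' > "$demo/good.txt"          # 17 printable/whitespace bytes
printf 'ab\001\002\003\004\005\006' > "$demo/bad.txt"   # 2 printable + 6 control bytes

echo "good: $(printable_pct "$demo/good.txt")%"
echo "bad:  $(printable_pct "$demo/bad.txt")%"
```

With the 80% threshold from the script, the first file would pass (100%) and the second would fail (25%).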
diff --git a/scripts/test_models.sh b/scripts/test_models.sh
@@ -30,6 +30,7 @@ run_test() {
     local expected="$3"
     local tier="$4"
     local extra_env="${5:-}"
+    local extra_args="${6:-}"
 
     if [[ ! -f "$MODELS_DIR/$model" ]]; then
         printf "  %-50s [SKIP] not found\n" "$model"
@@ -40,7 +41,7 @@ run_test() {
     local out
     # Capture full output (replace newlines with space) — avoids missing
     # output when first line is empty (newline-prefixed generation).
-    out=$(env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" -p "$prompt" -n 10 -T 0 2>/dev/null | tr '\n' ' ' | sed 's/  */ /g')
+    out=$(env $extra_env "$QUANT_BIN" "$MODELS_DIR/$model" $extra_args -p "$prompt" -n 10 -T 0 2>/dev/null | tr '\n' ' ' | sed 's/  */ /g')
 
     case "$tier" in
         STRICT)
@@ -86,6 +87,10 @@ echo "--- COHERENT tier (must produce non-garbage text) ---"
 run_test "Llama-3.2-1B-Instruct-Q8_0.gguf"     "Hello" ""  COHERENT "TQ_NO_METAL=1"
 run_test "Llama-3.2-3B-Instruct-Q8_0.gguf"     "Hello" ""  COHERENT "TQ_NO_METAL=1"
 run_test "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf"   "Hello" ""  COHERENT "TQ_NO_METAL=1"
+# Regression guard: DeltaNet layers must NOT be treated as self_attn.
+# Fixed in f0091fc (2026-04-15); before that, the CLI produced whitespace garbage.
+# Uses --chat so the ChatML template wrapping is tested end-to-end.
+run_test "Qwen3.5-4B-Q4_K_M.gguf"              "Hi" "Hello" STRICT "TQ_NO_METAL=1" "--chat"
 
 echo ""
 echo "--- Summary ---"