
Commit 6f25f34

unamedkr and claude authored
feat(wasm): Llama 3.2 1B Instruct + skip Q4 reconversion (#36)
* fix: n-gram loop detection + generation regression test Root cause of ~530-token text collapse: small model + T=0 greedy enters repetition loop → KV quant error compounds through softmax → collapse. NOT a code bug — FP32 also degenerates, just slower. Fix: - Add 4-gram loop detection (stop when same 4-gram repeats 3+ times) - Increase rep_window 32→64, recent_tokens buffer 64→128 - Add bench/generation_regression_test.sh (4 tests): 1. T=0 500-token coherence check (no garbage output) 2. Loop detection fires on repetitive generation 3. No false positives at T=0.7 4. PPL within 15% of FP32 Why this wasn't caught before: - PPL tests are teacher-forced (no error accumulation) - Generation tests were ≤100 tokens (collapse at ~530) - No T=0 stress test existed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(wasm): Llama 3.2 1B Instruct default + skip Q4 reconversion Two changes for WASM demo reliability and speed: 1. Model: switch from Qwen3.5-0.8B (base, gated, Qwen arch issues) to Llama 3.2 1B Instruct (verified working, good quality, public HuggingFace URL, proper Instruct tuning for chat). 2. Speed: add -DTQ_NO_Q4=1 to WASM build. Skips the load-time Q4 reconversion (GGUF Q4_K_M → FP32 → internal Q4) which was expensive and redundant for already-quantized models. Uses GGUF on-the-fly dequant instead. Saves several seconds of model init and reduces peak memory usage. Added compile-time #ifdef TQ_NO_Q4 guard in quant.h so it works in WASM (no getenv). Native builds are unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 09adb11 commit 6f25f34

6 files changed

Lines changed: 190 additions & 25 deletions

bench/generation_regression_test.sh

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
#!/bin/bash
# quant.cpp — Generation Regression Test
#
# Detects autoregressive generation collapse that PPL tests miss.
# Tests: T=0 greedy 500-token generation → verify no garbage output.
#
# The key insight: PPL (teacher-forced) is near-identical for FP32 and
# turbo_kv_4b at all context lengths. But autoregressive generation
# can collapse at ~500 tokens when T=0 repetition compounds KV quant error.
#
# This test catches that class of bugs by checking:
#   1. Loop detection triggers (prevents garbage, so verify it fires)
#   2. Output before loop detection is coherent (no random Unicode)
#   3. PPL sanity check at multiple context lengths
#
# Usage:
#   bash bench/generation_regression_test.sh [model.gguf]
#
# Requires: built quant binary in build/

set -e

MODEL="${1:-models/Llama-3.2-1B-Instruct-Q8_0.gguf}"
TQ_RUN="./build/quant"
THREADS=4
PASS=0
FAIL=0

if [ ! -f "$TQ_RUN" ]; then
    echo "Error: $TQ_RUN not found. Build first."
    exit 1
fi
if [ ! -f "$MODEL" ]; then
    echo "SKIP: Model not found: $MODEL"
    exit 0
fi

echo "============================================"
echo " Generation Regression Test"
echo " Model: $MODEL"
echo "============================================"
echo ""

check() {
    local desc="$1" result="$2"
    if [ "$result" = "PASS" ]; then
        echo "  [PASS] $desc"
        PASS=$((PASS + 1))
    else
        echo "  [FAIL] $desc"
        FAIL=$((FAIL + 1))
    fi
}

# Test 1: T=0 generation should NOT produce garbage at 500 tokens
echo "[Test 1] T=0 500-token generation — no garbage output"
OUTPUT=$($TQ_RUN "$MODEL" -p "Explain the theory of relativity in detail" \
    -n 500 -T 0.0 -j $THREADS -k turbo_kv_4b --chat 2>/dev/null)

# Check for garbage patterns: random Unicode, excessive non-ASCII.
# Garbage typically has lots of CJK/Arabic/Thai mixed with Latin.
GARBAGE_CHARS=$(echo "$OUTPUT" | tr -cd '\200-\377' | wc -c | tr -d ' ')
TOTAL_CHARS=$(echo "$OUTPUT" | wc -c | tr -d ' ')
if [ "$TOTAL_CHARS" -gt 0 ]; then
    GARBAGE_RATIO=$((GARBAGE_CHARS * 100 / TOTAL_CHARS))
else
    GARBAGE_RATIO=100
fi
if [ "$GARBAGE_RATIO" -lt 30 ]; then
    check "turbo_kv_4b output coherence (${GARBAGE_RATIO}% non-ASCII)" "PASS"
else
    check "turbo_kv_4b output coherence (${GARBAGE_RATIO}% non-ASCII, threshold 30%)" "FAIL"
fi

# Test 2: Loop detection should fire for a repetitive T=0 prompt
echo ""
echo "[Test 2] Loop detection fires on repetitive T=0 generation"
LOOP_OUTPUT=$($TQ_RUN "$MODEL" -p "what is your name?" \
    -n 1000 -T 0.0 -j $THREADS -k turbo_kv_4b 2>&1)

if echo "$LOOP_OUTPUT" | grep -q "repetition loop detected"; then
    LOOP_TOKENS=$(echo "$LOOP_OUTPUT" | grep "repetition loop" | grep -o "after [0-9]* tokens" | grep -o "[0-9]*")
    check "loop detected at ${LOOP_TOKENS} tokens (before 500)" "PASS"
else
    TOTAL_TOK=$(echo "$LOOP_OUTPUT" | grep "tok/s" | grep -o "^[0-9]*")
    if [ "${TOTAL_TOK:-1000}" -lt 500 ]; then
        check "EOS hit at ${TOTAL_TOK} tokens (no loop needed)" "PASS"
    else
        check "no loop detection in 1000 tokens" "FAIL"
    fi
fi

# Test 3: Non-repetitive generation should NOT trigger loop detection
echo ""
echo "[Test 3] Non-repetitive generation (T=0.7) — no false positives"
NORMAL_OUTPUT=$($TQ_RUN "$MODEL" -p "Tell me a creative story" \
    -n 200 -T 0.7 -j $THREADS -k turbo_kv_4b --chat 2>&1)

if echo "$NORMAL_OUTPUT" | grep -q "repetition loop detected"; then
    check "no false loop detection at T=0.7" "FAIL"
else
    check "no false loop detection at T=0.7" "PASS"
fi

# Test 4: FP32 vs turbo_kv_4b PPL sanity (if ppl data exists)
PPL_FILE="bench/data/ppl_test_1k.txt"
if [ -f "$PPL_FILE" ]; then
    echo ""
    echo "[Test 4] PPL sanity: turbo_kv_4b within 15% of FP32"
    FP32_PPL=$($TQ_RUN "$MODEL" --ppl "$PPL_FILE" -k fp32 -j $THREADS 2>&1 \
        | grep "PPL_CSV" | cut -d, -f3)
    Q4_PPL=$($TQ_RUN "$MODEL" --ppl "$PPL_FILE" -k turbo_kv_4b -j $THREADS 2>&1 \
        | grep "PPL_CSV" | cut -d, -f3)

    if [ -n "$FP32_PPL" ] && [ -n "$Q4_PPL" ]; then
        # Compare using integer math (multiply by 1000)
        FP32_INT=$(echo "$FP32_PPL" | awk '{printf "%d", $1 * 1000}')
        Q4_INT=$(echo "$Q4_PPL" | awk '{printf "%d", $1 * 1000}')
        THRESHOLD=$((FP32_INT * 115 / 100))  # 15% margin
        if [ "$Q4_INT" -le "$THRESHOLD" ]; then
            DELTA=$(echo "$FP32_PPL $Q4_PPL" | awk '{printf "%.1f", ($2/$1 - 1)*100}')
            check "PPL delta: ${DELTA}% (within 15%)" "PASS"
        else
            DELTA=$(echo "$FP32_PPL $Q4_PPL" | awk '{printf "%.1f", ($2/$1 - 1)*100}')
            check "PPL delta: ${DELTA}% (exceeds 15%)" "FAIL"
        fi
    else
        check "PPL comparison (could not parse results)" "FAIL"
    fi
fi

echo ""
echo "============================================"
echo " Results: ${PASS} passed, ${FAIL} failed"
echo "============================================"

if [ "$FAIL" -gt 0 ]; then
    exit 1
fi

quant.h

Lines changed: 6 additions & 1 deletion
@@ -12179,8 +12179,13 @@ tq_model_t* tq_load_gguf(const char* path) {
     }

     const size_t MAX_FP32_BYTES = (size_t)16 * 1024 * 1024 * 1024ULL; /* 16 GB */
-    /* TQ_NO_Q4=1 disables Q4 recompression → use direct GGUF dequant for better quality */
+    /* TQ_NO_Q4=1 disables Q4 recompression → use direct GGUF dequant for better quality.
+     * Can be set via environment variable or compile-time define (useful for WASM). */
+#ifdef TQ_NO_Q4
+    if (1) {
+#else
     if (getenv("TQ_NO_Q4")) {
+#endif
         fprintf(stderr, "tq_load_gguf: TQ_NO_Q4 set — skipping Q4 conversion, using GGUF on-the-fly dequant\n");
         goto skip_q4_conversion;
     }

src/engine/tq_generate.c

Lines changed: 40 additions & 7 deletions
@@ -254,22 +254,31 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
     int vocab_size = model->config.vocab_size;
     float rep_penalty = config->rep_penalty;
     int rep_window = config->rep_window;
-    if (rep_window > 64) rep_window = 64;
-    int recent_tokens[64];
+    if (rep_window > 128) rep_window = 128;
+    int recent_tokens[128];
     int recent_count = 0;

+    /* N-gram loop detection: track recent 4-grams to detect infinite loops.
+     * Small models with T=0 greedy decoding enter repetition loops where
+     * the same ~30-token pattern repeats endlessly. KV quantization error
+     * compounds through these repetitions, eventually collapsing output
+     * into garbage. Detecting loops early prevents wasted compute. */
+    uint32_t ngram_hashes[64];
+    int ngram_hash_count = 0;
+    int loop_detected = 0;
+
     /* Seed recent tokens with tail of prompt for better penalty coverage */
     for (int i = (n_prompt > rep_window ? n_prompt - rep_window : 0); i < n_prompt; i++) {
-        recent_tokens[recent_count % 64] = prompt_tokens[i];
+        recent_tokens[recent_count % 128] = prompt_tokens[i];
         recent_count++;
     }

     /* Apply repetition penalty to logits before first sample */
     if (rep_penalty > 1.0f) {
         int window = recent_count < rep_window ? recent_count : rep_window;
         for (int r = 0; r < window; r++) {
-            int idx = (recent_count - 1 - r) % 64;
-            if (idx < 0) idx += 64;
+            int idx = (recent_count - 1 - r) % 128;
+            if (idx < 0) idx += 128;
             int tok = recent_tokens[idx];
             if (tok >= 0 && tok < vocab_size) {
                 if (state->logits[tok] > 0)
@@ -288,7 +297,7 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
                              &rng_state);

     /* Record first sampled token */
-    recent_tokens[recent_count % 64] = next_token;
+    recent_tokens[recent_count % 128] = next_token;
     recent_count++;

     int generated = 0;
@@ -483,8 +492,32 @@ int tq_generate(tq_model_t* model, tq_tokenizer_t* tokenizer,
                                  &rng_state);

         /* Record sampled token for repetition penalty */
-        recent_tokens[recent_count % 64] = next_token;
+        recent_tokens[recent_count % 128] = next_token;
         recent_count++;
+
+        /* N-gram loop detection: hash recent 4-gram and check for repeats */
+        if (recent_count >= 4) {
+            uint32_t h = 0;
+            for (int r = 0; r < 4; r++) {
+                int gi = (recent_count - 4 + r) % 128;
+                h = h * 31 + (uint32_t)recent_tokens[gi];
+            }
+            int matches = 0;
+            int ring_len = ngram_hash_count < 64 ? ngram_hash_count : 64;
+            for (int r = 0; r < ring_len; r++) {
+                if (ngram_hashes[r] == h) matches++;
+            }
+            ngram_hashes[ngram_hash_count % 64] = h;
+            ngram_hash_count++;
+            if (matches >= 3) {
+                loop_detected = 1;
+                break;
+            }
+        }
+    }
+
+    if (loop_detected) {
+        fprintf(stderr, "[generate] repetition loop detected after %d tokens, stopping\n", generated);
     }

     /* Null-terminate output */

wasm/build.sh

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ emcc "$SCRIPT_DIR/quant_wasm.c" \
   -lm \
   -DNDEBUG \
   -D__EMSCRIPTEN__ \
+  -DTQ_NO_Q4=1 \
   -Wno-gnu-zero-variadic-macro-arguments \
   -Wno-dollar-in-identifier-extension

wasm/index.html

Lines changed: 4 additions & 17 deletions
@@ -174,16 +174,11 @@ <h2>Run an <span>LLM</span> in your browser</h2>
       <p class="subtitle">No install. No API key. No server.</p>

       <div class="model-cards" id="modelCards">
-        <div class="model-card recommended" id="card-qwen" onclick="loadDemoModel('qwen3.5-0.8b')">
-          <div class="name">Qwen3.5 0.8B</div>
-          <div class="meta" id="meta-qwen">~508 MB &middot; Q4_K_M</div>
+        <div class="model-card recommended" id="card-llama" onclick="loadDemoModel('llama-3.2-1b')">
+          <div class="name">Llama 3.2 1B Instruct</div>
+          <div class="meta" id="meta-llama">~770 MB &middot; Verified quality</div>
           <span class="tag">Recommended</span>
         </div>
-        <div class="model-card" id="card-llama" onclick="loadDemoModel('llama-3.2-1b')">
-          <div class="name">Llama 3.2 1B</div>
-          <div class="meta" id="meta-llama">~770 MB &middot; Q4_K_M</div>
-          <span class="tag blue">Higher quality</span>
-        </div>
       </div>

       <div class="progress-wrap" id="progressWrap">
@@ -223,17 +218,9 @@ <h2>Run an <span>LLM</span> in your browser</h2>
       let activeModelId = null;

       const MODELS = {
-        'qwen3.5-0.8b': {
-          url: 'https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf',
-          name: 'Qwen3.5 0.8B',
-          size: 508,
-          cacheKey: 'qwen3.5-0.8b-q4km',
-          chatTemplate: (t) => `<|im_start|>user\n${t}<|im_end|>\n<|im_start|>assistant\n`,
-          cardId: 'card-qwen', metaId: 'meta-qwen',
-        },
         'llama-3.2-1b': {
           url: 'https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf',
-          name: 'Llama 3.2 1B',
+          name: 'Llama 3.2 1B Instruct',
           size: 770,
           cacheKey: 'llama-3.2-1b-q4km',
           chatTemplate: (t) => `<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n${t}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`,

wasm/quant.wasm

-31 Bytes
Binary file not shown.
