
Commit 9b9ce04
Authored by unamedkr and claude
Fix Qwen RMSNorm: revert runtime +1 for GGUF + switch demo to Qwen3.5 (#24)
PR #23 incorrectly added RMSNorm +1 for all Qwen-family GGUF models. Investigation reveals:

- Qwen2/Qwen3: standard RMSNorm (weight * norm(x)), no +1 needed
- Qwen3.5/Gemma: use (1+weight), but llama.cpp's GGUF converter already bakes +1 into the weights during conversion
- Runtime +1 was double-applying for Qwen3.5 and incorrectly applying for Qwen2/3, causing activation explosion

Fix: skip runtime +1 for all GGUF models. Only apply it for non-GGUF (raw checkpoint) DeltaNet models.

Also switch the WASM demo default from Qwen3-0.6B Q4_K_M (broken due to double-quantization on a tiny model) to Qwen3.5-0.8B Q4_K_M (~508 MB), which produces coherent output at 25 tok/s.

Verified:

- Qwen3.5 0.8B Q8_0: coherent English output
- Llama 3.2 1B Q8_0: coherent English output (unchanged)
- Qwen3 0.6B Q4_K_M: real words now (was garbage Unicode), but quality limited by double-quantization on the 0.6B model

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: a44df86

5 files changed: 25 additions & 31 deletions

bindings/python/quantcpp/__init__.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -53,10 +53,10 @@
         "smollm2-135m-instruct-q8_0.gguf",
         135,
     ),
-    "Qwen3-0.6B": (
-        "unsloth/Qwen3-0.6B-GGUF",
-        "Qwen3-0.6B-Q4_K_M.gguf",
-        378,
+    "Qwen3.5-0.8B": (
+        "unsloth/Qwen3.5-0.8B-GGUF",
+        "Qwen3.5-0.8B-Q4_K_M.gguf",
+        508,
     ),
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
```

quant.h

Lines changed: 5 additions & 18 deletions
```diff
@@ -9982,24 +9982,11 @@ static tq_model_t* tq_load_safetensors(const char* path) {
 
     free(tensors);
 
-    /* Qwen RMSNorm adjustment: Qwen's RMSNorm computes
-     * output = norm(x) * (1.0 + weight), NOT norm(x) * weight.
-     * We bake the "+1" into the weight so tq_rmsnorm can stay as
-     * out = x * rsqrt * weight.
-     *
-     * This applies to: input_layernorm, post_attention_layernorm,
-     * model.norm, q_norm, k_norm.
-     * It does NOT apply to: linear_attn.norm (Qwen3_5RMSNormGated
-     * uses plain weight without +1).
-     *
-     * Applies to all Qwen-family models (qwen2, qwen3, qwen3_5, etc.)
-     * Detected by arch string or DeltaNet presence. */
-    int is_qwen_family = (model->config.delta_n_heads > 0);
-    if (model->gguf_ctx) {
-        const tq_gguf_ctx_t* gctx = (const tq_gguf_ctx_t*)model->gguf_ctx;
-        if (strstr(gctx->arch, "qwen") != NULL) is_qwen_family = 1;
-    }
-    if (is_qwen_family) {
+    /* Qwen3.5 (DeltaNet hybrid) RMSNorm adjustment.
+     * Only for non-GGUF models (raw checkpoints). GGUF files from
+     * llama.cpp already have +1 baked in by the converter.
+     * Qwen2/Qwen3 use standard RMSNorm and never need +1. */
+    if (model->config.delta_n_heads > 0 && !model->gguf_ctx) {
         int dim_h = model->config.hidden_dim;
         int head_dim_h = model->config.head_dim;
```

src/engine/tq_model.c

Lines changed: 7 additions & 0 deletions
```diff
@@ -4065,6 +4065,13 @@ skip_q4_conversion: ;
 
 #undef GGUF_KEY
 
+    /* NOTE: No runtime RMSNorm +1 adjustment for GGUF models.
+     * - Qwen2/Qwen3: standard RMSNorm (weight * norm(x)), no +1 needed.
+     * - Qwen3.5/Gemma: use (1+weight) convention, but llama.cpp's GGUF
+     *   converter already bakes +1 into the weights during conversion.
+     *   Adding +1 at runtime would double-apply and cause activation explosion.
+     *   The Gemma heuristic above (mean > 2.0 check) handles the Gemma case. */
+
     /* Initialize persistent Metal GPU buffers for layer-level compute */
 #ifdef TQ_HAS_METAL
     {
```

wasm/index.html

Lines changed: 9 additions & 9 deletions
```diff
@@ -121,11 +121,11 @@ <h2>LLM in Your Browser</h2>
       <p style="margin-bottom:16px; color:#6ee7b7; font-size:15px">No install. No API key. No server. Just click.</p>
 
       <div class="model-cards" id="modelCards">
-        <div class="model-card recommended" onclick="loadDemoModel('qwen3-0.6b')">
-          <div class="name">Qwen3 0.6B</div>
-          <div class="meta">~378 MB download &middot; Q4_K_M</div>
+        <div class="model-card recommended" onclick="loadDemoModel('qwen3.5-0.8b')">
+          <div class="name">Qwen3.5 0.8B</div>
+          <div class="meta">~508 MB download &middot; Q4_K_M</div>
           <span class="tag">Recommended</span>
-          <div class="meta" style="margin-top:4px">Fast, multilingual, good for demo</div>
+          <div class="meta" style="margin-top:4px">Fast, multilingual, best quality/size</div>
         </div>
         <div class="model-card" onclick="loadDemoModel('llama-3.2-1b')">
           <div class="name">Llama 3.2 1B</div>
@@ -167,11 +167,11 @@ <h2>LLM in Your Browser</h2>
 
   // ---- Model registry ----
   const MODELS = {
-    'qwen3-0.6b': {
-      url: 'https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf',
-      name: 'Qwen3-0.6B Q4_K_M',
-      size: '~378 MB',
-      cacheKey: 'qwen3-0.6b-q4km',
+    'qwen3.5-0.8b': {
+      url: 'https://huggingface.co/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf',
+      name: 'Qwen3.5-0.8B Q4_K_M',
+      size: '~508 MB',
+      cacheKey: 'qwen3.5-0.8b-q4km',
       chatTemplate: (text) => `<|im_start|>user\n${text}<|im_end|>\n<|im_start|>assistant\n`,
     },
     'llama-3.2-1b': {
```

wasm/quant.wasm

-39 bytes (binary file not shown)
