
Commit a492db9

unamedkr and claude committed
fix(gemma4): add V-norm + correct layer_output_scale to simple multiply
CRITICAL FINDINGS from comparison against the llama.cpp gemma4-iswa.cpp source:

1. V-norm was missing (line 92 in llama.cpp): Vcur = ggml_rms_norm(Vcur, eps) — a weight-free RMS normalization of the V projection output. Added for Gemma 4 when QK-norm is present.

2. layer_output_scale — confirmed that a simple multiply is correct (llama.cpp line 228): cur = ggml_mul(cur, out_scale). It is applied to the ENTIRE layer output, residual included; the model was trained with this scaling.

3. KV sharing (llama.cpp lines 79, 105-109): has_kv(il) controls whether K/V are computed or reused from the cache. Shared layers pass nullptr for Kcur/Vcur to build_attn(). (A rough sketch of this dispatch follows below.)

4. Gemma 4 chat template confirmed: <|turn>/<turn|>/<|think|>

Output is still garbage after the V-norm fix. Remaining candidates:

- KV sharing (still disabled, needs a proper implementation)
- other subtle differences in the attention computation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
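
For finding 3, a minimal sketch of the sharing dispatch in plain C, to make the intended behaviour concrete. This is NOT code from this commit or from llama.cpp: the parameter kv_share_group and the helpers kv_source_layer() / layer_has_kv() are hypothetical names, and a real implementation would also skip the K/V projections and the new V-norm on shared layers and index the KV cache by the source layer.

    /* Hypothetical sketch of the KV-sharing dispatch described in finding 3.
     * llama.cpp expresses the same idea via has_kv(il) and by passing
     * nullptr Kcur/Vcur to build_attn() on shared layers. */

    /* layer whose cached K/V a given layer should read from */
    static int kv_source_layer(int l, int kv_share_group) {
        if (kv_share_group <= 1) return l;      /* sharing disabled: every layer owns its K/V */
        return l - (l % kv_share_group);        /* e.g. group=2: layers 0,1 -> 0; layers 2,3 -> 2 */
    }

    /* nonzero if layer l computes (and caches) its own K/V; analogue of has_kv(il) */
    static int layer_has_kv(int l, int kv_share_group) {
        return kv_source_layer(l, kv_share_group) == l;
    }

    /* usage sketch inside the per-layer loop:
     *   if (layer_has_kv(l, g)) { compute K/V, apply V-norm, write cache slot l; }
     *   else                    { reuse cache slot kv_source_layer(l, g);        } */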
1 parent 0c6aa0e commit a492db9

1 file changed

quant.h (22 additions, 10 deletions)

@@ -14320,6 +14320,20 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         }
     }
 
+    /* Gemma 4: V also gets RMS norm (weight-free, just normalization).
+     * llama.cpp gemma4-iswa.cpp line 92: Vcur = ggml_rms_norm(Vcur, eps)
+     * This is applied when QK-norm is present (same condition). */
+    if (c->is_gemma4 && layer->k_norm) {
+        for (int h = 0; h < n_kv_heads; h++) {
+            /* Weight-free RMS norm: pass NULL weight → just normalize */
+            float* vh = s->v + h * head_dim;
+            float ss = 0.0f;
+            for (int d = 0; d < head_dim; d++) ss += vh[d] * vh[d];
+            ss = 1.0f / sqrtf(ss / head_dim + c->rms_norm_eps);
+            for (int d = 0; d < head_dim; d++) vh[d] *= ss;
+        }
+    }
+
     /* Apply RoPE (partial or full) */
     if (c->partial_rotary_factor > 0.0f && c->partial_rotary_factor < 1.0f) {
         /* Partial RoPE: only apply to first partial_rotary_factor * head_dim dims */
@@ -15676,15 +15690,13 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
         tq_add(s->x, s->x, ple_proj_out, dim);
     }
 
-    /* Gemma 4: layer_output_scale scales layer CONTRIBUTION only.
-     * x_next = x_input + los * (x_current - x_input)
-     * This preserves the residual signal. With los=0.0178, only
-     * the layer's attn+ffn+PLE contribution is scaled down.
-     *
-     * CRITICAL: "x *= los" was WRONG — it destroys the residual
-     * (los=0.0178 multiplied onto the accumulated residual = catastrophic).
-     * The residual-separation formula is the correct implementation.
-     * TQ_NO_LOS=1 disables for debugging. */
+    /* Gemma 4: layer_output_scale — simple elementwise multiply.
+     * llama.cpp gemma4-iswa.cpp line 228: cur = ggml_mul(cur, out_scale)
+     * This multiplies the ENTIRE layer output (including residual).
+     * Despite out_scale being small (e.g., 0.0178 for layer 0), this is
+     * correct — the model was trained with this scaling, and each layer
+     * learns to compensate. The alternative "residual-separation" formula
+     * was WRONG (it prevents proper gradient flow as trained). */
     if (layer->layer_output_scale != 0.0f && !getenv("TQ_NO_LOS")) {
         float los = layer->layer_output_scale;
         if (pos == 0 && getenv("TQ_DEBUG") && l < 3) {
@@ -15696,7 +15708,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
             fprintf(stderr, "[DEBUG] layer%d pre_scale min=%.3f max=%.3f (los=%.4f)\n", l, minv, maxv, los);
         }
         for (int i = 0; i < dim; i++) {
-            s->x[i] = layer_residual_buf[i] + los * (s->x[i] - layer_residual_buf[i]);
+            s->x[i] *= los;
         }
     }
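
For reference, a standalone sketch (not part of quant.h; the residual and output values are made up for illustration) that contrasts the two layer_output_scale formulas touched by this commit, using the example value los = 0.0178 mentioned in the comments:

    /* Standalone contrast of the simple multiply now used above with the
     * removed "residual-separation" formula. */
    #include <stdio.h>

    int main(void) {
        float los      = 0.0178f;  /* example layer_output_scale for layer 0 */
        float residual = 4.0f;     /* layer input, x before attn/ffn/PLE */
        float x        = 4.5f;     /* layer output before scaling (input + contribution) */

        float simple    = los * x;                          /* s->x[i] *= los  (this commit / llama.cpp) */
        float separated = residual + los * (x - residual);  /* previous formula, now removed */

        printf("simple multiply    : %.4f\n", simple);      /* ~0.0801: whole output, residual included, is scaled */
        printf("residual-separation: %.4f\n", separated);   /* ~4.0089: residual passes through unscaled */
        return 0;
    }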

0 commit comments
