fix(gemma4): revert layer_output_scale to residual-separation formula

unamedkr · claude · unamedkr · commit a86a83758f37 · 2026-04-13T08:06:28.000+09:00
R13: "x *= los" destroys the residual. With los=0.0178 multiplied onto
the whole stream at every layer, the embedding is scaled to ~0 after
35 layers. Reverted to the original formula:
  x = x_input + los * (x_current - x_input)
which preserves the residual and scales only the layer's contribution.
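
As an illustration only, the two variants can be sketched side by side.
The helper names below are hypothetical and do not exist in quant.h;
x_in stands for the residual stream saved before the layer ran
(layer_residual_buf in the diff).

```c
#include <stddef.h>

/* Hypothetical helper: scale only what the layer added.
 *   x    : stream after the layer (attn + FFN + PLE), updated in place
 *   x_in : stream saved before the layer ran
 *   los  : layer_output_scale for this layer */
static void scale_layer_contribution(float *x, const float *x_in,
                                     float los, size_t dim) {
    for (size_t i = 0; i < dim; i++) {
        x[i] = x_in[i] + los * (x[i] - x_in[i]);
    }
}

/* The reverted variant, for comparison: scales the accumulated
 * residual and the layer's contribution alike. */
static void scale_whole_output(float *x, float los, size_t dim) {
    for (size_t i = 0; i < dim; i++) {
        x[i] *= los;
    }
}
```

With the first helper, an element the layer left unchanged (x == x_in)
passes through untouched. With the second, the incoming stream is
multiplied by los on every layer; assuming the same los=0.0178 on all 35
layers, the original embedding would end up scaled by 0.0178^35, which
is far below float32 range.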

Added TQ_NO_LOS=1 env var for debugging without layer_output_scale.
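
A minimal sketch of the gate, assuming it matches the condition in the
diff below. los_enabled is a hypothetical name used only here. Note that
the check tests for presence of the variable, not its value, so
TQ_NO_LOS=0 would also disable scaling.

```c
#include <stdlib.h>

/* Hypothetical helper mirroring the condition in the diff: apply the
 * scale only when it is non-zero and TQ_NO_LOS is not set at all. */
static int los_enabled(float layer_output_scale) {
    return layer_output_scale != 0.0f && getenv("TQ_NO_LOS") == NULL;
}
```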

Output is still garbage. An A/B test shows garbage both with and
without layer_output_scale (only the pattern differs), so the remaining
problem is in the forward pass itself, not in this scaling step.
Waiting for llama.cpp reference output to confirm whether the GGUF
file itself is valid.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/quant.h b/quant.h
@@ -15668,11 +15668,16 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
             tq_add(s->x, s->x, ple_proj_out, dim);
         }
 
-        /* Gemma 4: layer_output_scale — simple multiplication of entire output.
-         * llama.cpp reference (gemma4-iswa.cpp): cur = ggml_mul(cur, out_scale)
-         * Previous implementation incorrectly separated residual contribution.
-         * The correct approach is a straight elementwise multiply. */
-        if (layer->layer_output_scale != 0.0f) {
+        /* Gemma 4: layer_output_scale scales layer CONTRIBUTION only.
+         * x_next = x_input + los * (x_current - x_input)
+         * This preserves the residual signal. With los=0.0178, only
+         * the layer's attn+ffn+PLE contribution is scaled down.
+         *
+         * CRITICAL: "x *= los" was WRONG — it destroys the residual
+         * (los=0.0178 multiplied onto the accumulated residual = catastrophic).
+         * The residual-separation formula is the correct implementation.
+         * TQ_NO_LOS=1 disables for debugging. */
+        if (layer->layer_output_scale != 0.0f && !getenv("TQ_NO_LOS")) {
             float los = layer->layer_output_scale;
             if (pos == 0 && getenv("TQ_DEBUG") && l < 3) {
                 float maxv = 0, minv = 0;
@@ -15683,7 +15688,7 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
                 fprintf(stderr, "[DEBUG] layer%d pre_scale min=%.3f max=%.3f (los=%.4f)\n", l, minv, maxv, los);
             }
             for (int i = 0; i < dim; i++) {
-                s->x[i] *= los;
+                s->x[i] = layer_residual_buf[i] + los * (s->x[i] - layer_residual_buf[i]);
             }
         }