
Commit 8f5784a

unamedkr and claude committed
fix(qwen35): drop unnecessary Q5_K → FP32 dequant of DeltaNet weights
The DeltaNet attn_qkv/attn_gate weights were dequanted to FP32 at load time with the rationale that "Q5_K (5-bit) introduces too much error in the recurrent state". This was over-cautious: the matmul result goes through FP32 accumulation regardless of weight precision. Verified identical generation output between TQ_DELTANET_FP32 (the old behavior) and the new default on Qwen3.5-4B Q4_K_M (64 tokens, T=0).

Trade-offs:
- Memory: ~3 GB less weight bandwidth per token (24 layers × ~36M Q5_K params at 5/8 byte each vs 4 bytes for FP32)
- Quality: identical output verified
- Speed: marginal (Q5_K still goes through the generic dequant-row path; a fused Q5_K int8 dot would be needed for the full benefit)

Set TQ_DELTANET_FP32=1 to restore the prior FP32 dequant behavior if a downstream regression appears.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5dd3f2d commit 8f5784a

1 file changed

Lines changed: 16 additions & 11 deletions

File tree

src/engine/tq_model.c

@@ -3353,18 +3353,21 @@ tq_model_t* tq_load_gguf(const char* path) {
     t = find_gguf_tensor(gguf, tname);
     if (t) { layer->gguf_delta_b = t->data; layer->gguf_delta_b_type = t->type; }
 
-    /* Large DeltaNet projections: dequant to FP32 for recurrent
-     * state precision. Q5_K (5-bit) introduces too much error in
-     * the recurrent state that accumulates across time steps.
-     * ~24 MB/layer × 30 layers ≈ 720 MB — fits in 16 GB. */
+    /* DeltaNet projections: historically dequanted Q5_K → FP32 with
+     * the rationale that "5-bit introduces too much error in recurrent
+     * state". This was over-cautious — the matmul output goes through
+     * FP32 accumulation regardless of weight type. Set TQ_DELTANET_FP32=1
+     * to restore the FP32 dequant if a downstream regression appears. */
+    int deltanet_fp32 = (getenv("TQ_DELTANET_FP32") != NULL);
     snprintf(tname, sizeof(tname), "blk.%d.attn_qkv.weight", l);
     t = find_gguf_tensor(gguf, tname);
     if (t) {
-        if (t->type == TQ_GGML_TYPE_Q5_K || t->type == TQ_GGML_TYPE_IQ2_XXS ||
-            t->type == TQ_GGML_TYPE_IQ3_XXS || t->type == TQ_GGML_TYPE_IQ4_XS) {
-            /* Low-precision: dequant to FP32 for recurrent accuracy */
+        int needs_fp32 = deltanet_fp32 &&
+            (t->type == TQ_GGML_TYPE_Q5_K || t->type == TQ_GGML_TYPE_IQ2_XXS ||
+             t->type == TQ_GGML_TYPE_IQ3_XXS || t->type == TQ_GGML_TYPE_IQ4_XS);
+        if (needs_fp32) {
             layer->delta_in_proj_qkv = dequant_tensor_fp32(t);
-            fprintf(stderr, "tq_load_gguf: layer %d attn_qkv dequant to FP32 (was type %d)\n", l, t->type);
+            fprintf(stderr, "tq_load_gguf: layer %d attn_qkv dequant to FP32 (was type %d, TQ_DELTANET_FP32 set)\n", l, t->type);
         } else {
             layer->gguf_delta_qkv = t->data;
             layer->gguf_delta_qkv_type = t->type;
@@ -3374,10 +3377,12 @@ tq_model_t* tq_load_gguf(const char* path) {
     snprintf(tname, sizeof(tname), "blk.%d.attn_gate.weight", l);
     t = find_gguf_tensor(gguf, tname);
     if (t) {
-        if (t->type == TQ_GGML_TYPE_Q5_K || t->type == TQ_GGML_TYPE_IQ2_XXS ||
-            t->type == TQ_GGML_TYPE_IQ3_XXS || t->type == TQ_GGML_TYPE_IQ4_XS) {
+        int needs_fp32 = deltanet_fp32 &&
+            (t->type == TQ_GGML_TYPE_Q5_K || t->type == TQ_GGML_TYPE_IQ2_XXS ||
+             t->type == TQ_GGML_TYPE_IQ3_XXS || t->type == TQ_GGML_TYPE_IQ4_XS);
+        if (needs_fp32) {
             layer->delta_in_proj_z = dequant_tensor_fp32(t);
-            fprintf(stderr, "tq_load_gguf: layer %d attn_gate dequant to FP32 (was type %d)\n", l, t->type);
+            fprintf(stderr, "tq_load_gguf: layer %d attn_gate dequant to FP32 (was type %d, TQ_DELTANET_FP32 set)\n", l, t->type);
         } else {
             layer->gguf_delta_z = t->data;
             layer->gguf_delta_z_type = t->type;
