Commit 8f5784a
fix(qwen35): drop unnecessary Q5_K → FP32 dequant of DeltaNet weights
The DeltaNet attn_qkv/attn_gate weights were dequantized to FP32 at
load time with the rationale that "Q5_K (5-bit) introduces too much
error in the recurrent state". This was over-cautious: the matmul
accumulates in FP32 regardless of weight precision, and expanding
already-quantized weights at load time cannot recover precision.
Generation output was verified identical between TQ_DELTANET_FP32=1
(old behavior) and the new default on Qwen3.5-4B Q4_K_M (64 tokens,
T=0).
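The reasoning above can be illustrated with a small sketch. It assumes a toy symmetric 5-bit block format (not the real Q5_K layout) and uses Python floats as a stand-in for the FP32 accumulator; all function names are hypothetical. It compares expanding the weights once at load time against dequantizing each value inside the dot product.

```python
import random

BLOCK = 32  # toy block size; an assumption, not the real Q5_K layout


def quantize_5bit(row):
    """Toy symmetric 5-bit quantization: one scale per block, codes in [-16, 15]."""
    blocks = []
    for i in range(0, len(row), BLOCK):
        chunk = row[i:i + BLOCK]
        scale = max(abs(v) for v in chunk) / 15 or 1.0
        codes = [max(-16, min(15, round(v / scale))) for v in chunk]
        blocks.append((scale, codes))
    return blocks


def dequantize(blocks):
    """Expand every block to plain floats (the old load-time behavior)."""
    return [scale * c for scale, codes in blocks for c in codes]


def dot(weights, x):
    """Dot product over already-expanded weights with a float accumulator."""
    acc = 0.0
    for w, xi in zip(weights, x):
        acc += w * xi
    return acc


def dot_dequant_on_the_fly(blocks, x):
    """Dot product that dequantizes each weight as it is consumed."""
    acc = 0.0
    i = 0
    for scale, codes in blocks:
        for c in codes:
            acc += (scale * c) * x[i]
            i += 1
    return acc


rng = random.Random(0)
row = [rng.uniform(-1, 1) for _ in range(256)]
x = [rng.uniform(-1, 1) for _ in range(256)]

q = quantize_5bit(row)
y_old = dot(dequantize(q), x)         # expand once at load, then multiply
y_new = dot_dequant_on_the_fly(q, x)  # keep quantized, expand per element
print(y_old == y_new)
```

Both paths feed the same dequantized values into the same accumulator in the same order, so the results match exactly. A fused int8 dot product that also quantizes activations would not carry this guarantee.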
Trade-offs:
- Memory bandwidth: ~3 GB less weight traffic per token (24 layers ×
~36M Q5_K params per layer, at ~5/8 byte each vs 4 bytes for FP32)
- Quality: identical output verified
- Speed: marginal (Q5_K still uses generic dequant-row path; would
need Q5_K int8 fused dot for full benefit)
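The bandwidth figure in the list above can be reproduced with a short calculation. This is a sketch using only the approximate numbers from the commit message; it assumes the ~36M parameter count is per layer and that every DeltaNet weight is read once per decoded token. The 5/8 byte figure ignores per-block scale overhead.

```python
# Figures taken from the commit message; all are approximate.
LAYERS = 24
PARAMS_PER_LAYER = 36_000_000  # "~36M Q5_K params", assumed to be per layer
FP32_BYTES = 4.0
Q5_K_BYTES = 5 / 8             # commit's figure; ignores block scale overhead


def weight_bytes_per_token(bytes_per_param: float) -> float:
    """Bytes of DeltaNet weights read while decoding one token."""
    return LAYERS * PARAMS_PER_LAYER * bytes_per_param


fp32_gb = weight_bytes_per_token(FP32_BYTES) / 1e9
q5k_gb = weight_bytes_per_token(Q5_K_BYTES) / 1e9
saved_gb = fp32_gb - q5k_gb

print(f"FP32 {fp32_gb:.2f} GB, Q5_K {q5k_gb:.2f} GB, saved {saved_gb:.2f} GB")
```

Under these assumptions the FP32 copy costs about 3.46 GB of reads per token against about 0.54 GB for Q5_K, a difference of roughly 2.9 GB, which is consistent with the "~3 GB" stated.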
Set TQ_DELTANET_FP32=1 to restore the prior FP32 dequant behavior if
a downstream regression appears.
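The opt-out described above could be gated roughly as follows. This is an illustrative sketch only: the function names, the tuple return value, and the rule that only the exact value "1" enables the fallback are assumptions, not taken from the changed file.

```python
import os


def use_fp32_deltanet(env=None) -> bool:
    """Return True when the FP32 fallback is requested via TQ_DELTANET_FP32=1.

    Assumption: only the exact value "1" enables it; the real parsing may differ.
    """
    env = os.environ if env is None else env
    return env.get("TQ_DELTANET_FP32", "") == "1"


def load_deltanet_weight(q5k_blob, dequantize, env=None):
    """Keep the quantized blob by default; expand to FP32 only on request.

    `q5k_blob` and `dequantize` are hypothetical stand-ins for the loader's
    tensor data and its Q5_K dequantization routine.
    """
    if use_fp32_deltanet(env):
        return ("fp32", dequantize(q5k_blob))
    return ("q5_k", q5k_blob)
```

Passing the environment as a parameter keeps the decision testable without mutating the process environment.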
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5dd3f2d commit 8f5784a
1 file changed
Lines changed: 16 additions & 11 deletions
Diff hunks (line contents not preserved):
@@ -3353,18 +3353,21 @@
@@ -3374,10 +3377,12 @@