Commit 7d10b63
fix: support head_dim > TQ_BK (Qwen3.5 head_dim=256)
Qwen3.5 uses head_dim=256 while TQ_BK=128. Two fixes:
1. KV quantize: loop over TQ_BK-sized blocks per head instead of
   truncating to 128 elements. The upper 128 dims of each head were
   being zeroed, driving PPL to ~1.6M and producing garbage output.
   (See the first sketch below.)
2. Fused attention: guard the fused path with head_dim <= TQ_BK. For
   head_dim=256, fall back to the dequantize → FP32 attention path,
   which is slower but correct. Fused multi-block attention is left
   as a future optimization. (See the second sketch below.)
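A minimal sketch of fix 1 in C, assuming a per-block quantizer named
quantize_block and a contiguous float layout per head; neither name
appears in this diff, so treat both as placeholders:

    #include <stddef.h>

    #define TQ_BK 128  /* quantization block width (from the commit) */

    /* Placeholder for the kernel's real per-block quantizer. */
    void quantize_block(const float *src, unsigned char *dst);

    /* Fix 1 (sketch): emit one quantized block per TQ_BK dims, so a
     * head_dim=256 head yields two blocks. The old code quantized a
     * single block and left dims [TQ_BK, head_dim) zeroed. */
    void quantize_kv_head(const float *head, int head_dim,
                          unsigned char *out, size_t block_bytes)
    {
        for (int off = 0; off < head_dim; off += TQ_BK) {
            quantize_block(head + off, out);
            out += block_bytes;
        }
    }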
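And a sketch of the fix-2 dispatch; the guard (head_dim <= TQ_BK) is
from the commit, while every function name below is invented for
illustration:

    /* Placeholder signatures; the real interfaces are not shown in
     * this diff. TQ_BK as in the sketch above. */
    void fused_quant_attention(const float *q, const void *kv_q, float *out);
    void dequantize_kv(const void *kv_q, float *kv_f32);
    void fp32_attention(const float *q, const float *kv, float *out);

    /* Fix 2 (sketch): only heads that fit in one quant block take the
     * fused kernel; head_dim=256 dequantizes and runs FP32 attention,
     * which is slower but numerically correct. */
    void attend(const float *q, const void *kv_q, float *kv_f32,
                float *out, int head_dim)
    {
        if (head_dim <= TQ_BK) {
            fused_quant_attention(q, kv_q, out);  /* fast path */
        } else {
            dequantize_kv(kv_q, kv_f32);          /* fallback */
            fp32_attention(q, kv_f32, out);
        }
    }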
Before: Qwen3.5 + turbo_kv_4b → PPL 1,663,480 (garbage)
After: Qwen3.5 + turbo_kv_4b → coherent text at 12.2 tok/s
35/35 tests pass. Llama models (head_dim=128) unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent bc44a6a
1 file changed: 12 additions & 4 deletions

[Diff body not captured in this extract. Three hunks in the file: old
lines 1443-1444 replaced by new lines 1443-1452, old line 1511 replaced
by new line 1519, and old line 1662 replaced by new line 1670.]