Commit f50ffa7
fix(compute): CPU dequant fallback for Q4_K when K%256!=0
The DequantQ4KF32 GPU kernel uses blocks_per_row=K/256 (integer
division), which truncates when K is not 256-aligned. For Gemma3-1B
(hidden_size=1152, 1152%256=128), this missed the last 128 values
per row, producing incorrect results.
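
The truncation can be checked with a few lines of arithmetic. The sketch below is illustrative Python, not the project's kernel code; `QK_K` and the helper names are hypothetical, with 256 taken from the block size stated in the message above.

```python
# Illustrative arithmetic only; not the actual DequantQ4KF32 kernel.
QK_K = 256  # values per Q4_K block, per the commit message (name is an assumption)


def blocks_per_row_truncating(k: int) -> int:
    """Blocks per row as the kernel computes it: integer division."""
    return k // QK_K


def values_covered(k: int) -> int:
    """Number of values per row that the truncated block count reaches."""
    return blocks_per_row_truncating(k) * QK_K


def values_missed(k: int) -> int:
    """Trailing values per row that are never dequantized."""
    return k - values_covered(k)


if __name__ == "__main__":
    k = 1152  # Gemma3-1B hidden_size
    print(blocks_per_row_truncating(k))  # 4
    print(values_covered(k))             # 1024
    print(values_missed(k))              # 128
```

For a 256-aligned K such as 2048, `values_missed` returns 0, which is why the problem only shows up on unaligned shapes.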
When K%256!=0, fall back to CPU dequantize + H2D upload for the
general GEMM path. The GEMV path already gates on k%256==0.
1 parent 5f21cbb commit f50ffa7
1 file changed
Lines changed: 22 additions & 4 deletions
[Diff content not preserved; only the line numbers survive.]
Hunk 1: original lines 1500-1501 removed, new lines 1500-1509 added (2 deletions, 10 additions).
Hunk 2: original lines 1620-1621 removed, new lines 1628-1639 added (2 deletions, 12 additions).