Commit 5f19e54
fix(cuda): byte-wise loads in Q5_0 GEMV for ARM64 alignment
The Q5_0 GEMV kernel read the qh field at blk+2 with a single 4-byte
load. Q5_0 blocks are 22 bytes (not a multiple of 4), so block n
starts at byte offset 22*n and that load cannot be 4-byte aligned for
every block. On ARM64 Grace Hopper this caused
cudaErrorMisalignedAddress (error 716), a sticky error that stays on
the CUDA context and breaks all subsequent operations.
Use byte-wise __ldg loads for both the fp16 scale (blk[0:2]) and
the uint32 qh field (blk[2:6]), matching the alignment-safe pattern
used in the Q5_0 dequant kernel.
1 parent f6a8f2e commit 5f19e54
1 file changed
Lines changed: 11 additions & 5 deletions
@@ -70,11 +70,17 @@
(diff body not captured: original lines 73-77 removed, new lines 73-83 added)
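The following is a minimal sketch of the byte-wise load pattern the commit message describes, not the actual patch. All helper names (`q5_load_u8`, `q5_0_load_d_bits`, `q5_0_load_qh`, `q5_0_qh_offset`, `q5_0_selfcheck`) are hypothetical. It assumes the Q5_0 block layout is a 2-byte fp16 scale, a 4-byte qh field, and 16 bytes of packed quants, and that values are stored little-endian. The `Q5_HD` macro is an assumption that lets the same code build as plain C++ on the host or as device code under nvcc, where single-byte reads go through `__ldg`.

```cpp
#include <cassert>
#include <cstdint>

// Builds as plain C++ (host) or as CUDA under nvcc. Names are hypothetical.
#ifdef __CUDACC__
#define Q5_HD __host__ __device__ __forceinline__
#else
#define Q5_HD inline
#endif

// Assumed Q5_0 block layout: fp16 scale (2) + qh (4) + packed quants (16).
constexpr int Q5_0_BLOCK_BYTES = 22;

// A single-byte load is always aligned. Device code uses the read-only
// data cache via __ldg; host code uses a plain load.
Q5_HD uint8_t q5_load_u8(const uint8_t* p) {
#ifdef __CUDA_ARCH__
    return __ldg(p);
#else
    return *p;
#endif
}

// Raw fp16 bits of the scale at blk[0:2], assembled little-endian.
Q5_HD uint16_t q5_0_load_d_bits(const uint8_t* blk) {
    return (uint16_t)(q5_load_u8(blk) | (q5_load_u8(blk + 1) << 8));
}

// uint32 qh field at blk[2:6], assembled little-endian from four bytes
// so that no 4-byte-aligned address is ever required.
Q5_HD uint32_t q5_0_load_qh(const uint8_t* blk) {
    return  (uint32_t)q5_load_u8(blk + 2)
         | ((uint32_t)q5_load_u8(blk + 3) << 8)
         | ((uint32_t)q5_load_u8(blk + 4) << 16)
         | ((uint32_t)q5_load_u8(blk + 5) << 24);
}

// Byte offset of the qh field of block n, relative to the buffer base.
Q5_HD int q5_0_qh_offset(int n) { return Q5_0_BLOCK_BYTES * n + 2; }

// Host-side check on block 2, whose qh field sits at offset 46
// (46 % 4 == 2, so a direct 4-byte load there would be misaligned).
inline bool q5_0_selfcheck() {
    alignas(4) uint8_t buf[3 * Q5_0_BLOCK_BYTES] = {};
    uint8_t* blk = buf + 2 * Q5_0_BLOCK_BYTES;
    blk[0] = 0x00; blk[1] = 0x3C;                               // fp16 1.0
    blk[2] = 0x78; blk[3] = 0x56; blk[4] = 0x34; blk[5] = 0x12; // qh
    assert(q5_0_qh_offset(2) % 4 != 0);
    return q5_0_load_d_bits(blk) == 0x3C00 &&
           q5_0_load_qh(blk) == 0x12345678u;
}
```

With a 4-byte-aligned base pointer, `q5_0_qh_offset(n)` is a multiple of 4 only for odd `n`, so no single wide load is safe for every block. In device code the 16 bits returned by `q5_0_load_d_bits` could be reinterpreted as a half with `__ushort_as_half` from `cuda_fp16.h`. Four byte loads cost more instructions than one 32-bit load, but they cannot fault on alignment.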