Commit 34aba3b
fix(cuda): remove float4 alignment requirement from gemv_q8_kernel
The gemv_q8_kernel cast the activation pointer x (float*) to float4*
for 16-byte vectorized loads into shared memory. When x is not 16-byte
aligned (common on ARM64/Grace Hopper with pool allocations), the
float4 load faults and the failure is reported as a "misaligned
address" error at the following cudaMemcpy.
Replace float4 global loads with per-element __ldg loads. Shared
memory float4 accesses are unaffected (shared memory is always
16-byte aligned). Performance impact: minimal -- the global-to-shared
load is a one-time cost per block, not in the inner loop.
Fixes: Gemma3 inference "misaligned address" on DGX Spark GB10.
Root cause confirmed via compute-sanitizer --tool memcheck.
1 parent 0e9fb78 commit 34aba3b
1 file changed
Lines changed: 7 additions & 9 deletions
Diff hunk (line contents not preserved): original lines 30-45, new lines 30-43.
- Original line 33 replaced by new lines 33-38.
- Original lines 35-42 replaced by new line 40.
- Original lines 30-32, 34, and 43-45 unchanged.
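The commit message describes replacing float4 global loads with per-element `__ldg` loads when staging `x` into shared memory. The following is a minimal sketch of that load pattern, not the actual `gemv_q8_kernel` code: the kernel `stage_and_sum`, the helper `staged_load_matches_reference`, the sizes, and the summation used to verify the result are all hypothetical names and choices made for illustration. It assumes a CUDA-capable device and relies on `cudaMalloc` returning a base pointer aligned to at least 256 bytes, so an offset of one float yields a pointer that is 4-byte aligned but not 16-byte aligned.

```cuda
#include <cstdio>
#include <cstdint>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the staging step: copy the activation vector x
// into shared memory one float at a time. A scalar __ldg load only needs the
// natural 4-byte alignment of float, unlike a float4 load (16 bytes).
__global__ void stage_and_sum(const float* __restrict__ x, int n, float* out) {
    extern __shared__ float sx[];
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        sx[i] = __ldg(x + i);
    }
    __syncthreads();
    // Thread 0 sums the staged values so the host can verify the copy.
    if (threadIdx.x == 0) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) acc += sx[i];
        *out = acc;
    }
}

// Returns true when the kernel reproduces the host-side sum for an input
// pointer offset by `offset_floats` elements from a cudaMalloc base.
bool staged_load_matches_reference(int offset_floats) {
    const int n = 256;
    std::vector<float> h(n + offset_floats);
    for (size_t i = 0; i < h.size(); ++i) h[i] = 0.5f * float(i % 7);

    float* d_buf = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_buf, h.size() * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) {
        cudaFree(d_buf);
        return false;
    }
    cudaMemcpy(d_buf, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Offsetting the base pointer mimics a sub-allocation from a memory pool.
    const float* d_x = d_buf + offset_floats;
    bool aligned16 = reinterpret_cast<uintptr_t>(d_x) % 16 == 0;
    printf("x is %s16-byte aligned\n", aligned16 ? "" : "not ");

    stage_and_sum<<<1, 128, n * sizeof(float)>>>(d_x, n, d_out);

    // Kernel faults are reported asynchronously, so check the copy's status.
    float got = 0.0f;
    cudaError_t err = cudaMemcpy(&got, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    float want = 0.0f;
    for (int i = 0; i < n; ++i) want += h[offset_floats + i];
    return std::fabs(got - want) < 1e-3f;
}
```

Calling `staged_load_matches_reference(1)` from a `main` exercises the misaligned case. Casting `d_x` to `const float4*` inside the kernel instead would be expected to fault for that offset, which is the failure mode the commit removes. Running the binary under `compute-sanitizer --tool memcheck`, as the commit message mentions, is one way to confirm which load is at fault.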