Skip to content

Commit 5f19e54

Browse files
committed
fix(cuda): byte-wise loads in Q5_0 GEMV for ARM64 alignment
The Q5_0 GEMV kernel had misaligned 4-byte reads at blk+2 for the qh field. Q5_0 blocks are 22 bytes (not a multiple of 4), so blocks after the first start at 22*n which is misaligned. On ARM64 Grace Hopper this caused cudaErrorMisalignedAddress (error 716) which stuck on the CUDA context and broke all subsequent operations. Use byte-wise __ldg loads for both the fp16 scale (blk[0:2]) and the uint32 qh field (blk[2:6]), matching the alignment-safe pattern used in the Q5_0 dequant kernel.
1 parent f6a8f2e commit 5f19e54

1 file changed

Lines changed: 11 additions & 5 deletions

File tree

internal/cuda/kernels/gemv_q5_0.cu

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -70,11 +70,17 @@ __global__ void gemv_q5_0_kernel(
7070
for (int bi = lane_id; bi < blocks_per_row; bi += Q5_0_WARP_SIZE) {
7171
const uint8_t* blk = row_data + bi * Q5_0_BLOCK_BYTES;
7272

73-
/* Read fp16 d. */
74-
float d = __half2float(__ldg((const __half*)(blk)));
75-
76-
/* Read qh (32 high bits). */
77-
uint32_t qh = __ldg((const uint32_t*)(blk + 2));
73+
/* Read fp16 d using byte-wise load (ARM64 alignment safety).
74+
* Q5_0 blocks are 22 bytes — not a multiple of 4, so blk may
75+
* be misaligned for uint16/uint32 casts after the first block. */
76+
uint16_t d_bits = (uint16_t)__ldg(&blk[0]) | ((uint16_t)__ldg(&blk[1]) << 8);
77+
float d = __half2float(*reinterpret_cast<const __half*>(&d_bits));
78+
79+
/* Read qh (32 high bits) using byte-wise load. */
80+
uint32_t qh = (uint32_t)__ldg(&blk[2])
81+
| ((uint32_t)__ldg(&blk[3]) << 8)
82+
| ((uint32_t)__ldg(&blk[4]) << 16)
83+
| ((uint32_t)__ldg(&blk[5]) << 24);
7884

7985
const uint8_t* qs = blk + 6;
8086
int k_base = bi * Q5_0_BLOCK_SIZE;

0 commit comments

Comments
 (0)