Skip to content

[Codegen] Fix shared-memory under-allocation for kernels #158

Open
yaoyaoding wants to merge 1 commit into
mainfrom
fix/smem-workspace-underalloc
Open

[Codegen] Fix shared-memory under-allocation for kernels #158
yaoyaoding wants to merge 1 commit into
mainfrom
fix/smem-workspace-underalloc

Conversation

@yaoyaoding

Copy link
Copy Markdown
Member

Kernels that request a shared workspace (tcgen05.alloc, reduce, scan, shuffle) could under-request dynamic shared memory, causing an illegal memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py).

Root cause was two-fold:

  1. smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset but computed dynamic_smem_bytes from the (possibly unaligned) high-water mark, so the alignment padding between the arena top and the workspace start was never reserved. The workspace tail then spilled past the requested dynamic shared memory. Now the total is sized from the aligned workspace offset.

  2. The barrier allocation in mbarrier_alloc_ctx.finalize() (added in [Example] Add More Hopper Matmul Examples #154) bypassed the shared-memory allocator, manually bumping maximum_allocated by the raw barrier byte count (unaligned) and leaving the region unregistered in the free list. This left the high-water mark unaligned, exposing bug (1). Reverted to allocating the barriers through allocate_shared_tensor(), which keeps the allocator accounting self-consistent. This is a byte-for-byte no-op for kernels without a shared workspace above the barriers (e.g. the Hopper matmul examples).

Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py passes.

@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…d workspace

Kernels that request a shared workspace (tcgen05.alloc, reduce, scan,
shuffle) could under-request dynamic shared memory, causing an illegal
memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py).

Root cause was two-fold:

1. smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset
   but computed dynamic_smem_bytes from the (possibly unaligned) high-water
   mark, so the alignment padding between the arena top and the workspace
   start was never reserved. The workspace tail then spilled past the
   requested dynamic shared memory. Now the total is sized from the aligned
   workspace offset.

2. The barrier allocation in mbarrier_alloc_ctx.finalize() (added in #154)
   bypassed the shared-memory allocator, manually bumping maximum_allocated
   by the raw barrier byte count (unaligned) and leaving the region
   unregistered in the free list. This left the high-water mark unaligned,
   exposing bug (1). Reverted to allocating the barriers through
   allocate_shared_tensor(), which keeps the allocator accounting
   self-consistent. This is a byte-for-byte no-op for kernels without a
   shared workspace above the barriers (e.g. the Hopper matmul examples).

Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding yaoyaoding force-pushed the fix/smem-workspace-underalloc branch from 0f20ba6 to 94b5fd2 Compare July 2, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant