[Codegen] Fix shared-memory under-allocation for kernels #158
Open
yaoyaoding wants to merge 1 commit into
Open
[Codegen] Fix shared-memory under-allocation for kernels #158yaoyaoding wants to merge 1 commit into
yaoyaoding wants to merge 1 commit into
Conversation
…d workspace Kernels that request a shared workspace (tcgen05.alloc, reduce, scan, shuffle) could under-request dynamic shared memory, causing an illegal memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py). Root cause was two-fold: 1. smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset but computed dynamic_smem_bytes from the (possibly unaligned) high-water mark, so the alignment padding between the arena top and the workspace start was never reserved. The workspace tail then spilled past the requested dynamic shared memory. Now the total is sized from the aligned workspace offset. 2. The barrier allocation in mbarrier_alloc_ctx.finalize() (added in #154) bypassed the shared-memory allocator, manually bumping maximum_allocated by the raw barrier byte count (unaligned) and leaving the region unregistered in the free list. This left the high-water mark unaligned, exposing bug (1). Reverted to allocating the barriers through allocate_shared_tensor(), which keeps the allocator accounting self-consistent. This is a byte-for-byte no-op for kernels without a shared workspace above the barriers (e.g. the Hopper matmul examples). Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
0f20ba6 to
94b5fd2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Kernels that request a shared workspace (tcgen05.alloc, reduce, scan, shuffle) could under-request dynamic shared memory, causing an illegal memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py).
Root cause was two-fold:
smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset but computed dynamic_smem_bytes from the (possibly unaligned) high-water mark, so the alignment padding between the arena top and the workspace start was never reserved. The workspace tail then spilled past the requested dynamic shared memory. Now the total is sized from the aligned workspace offset.
The barrier allocation in mbarrier_alloc_ctx.finalize() (added in [Example] Add More Hopper Matmul Examples #154) bypassed the shared-memory allocator, manually bumping maximum_allocated by the raw barrier byte count (unaligned) and leaving the region unregistered in the free list. This left the high-water mark unaligned, exposing bug (1). Reverted to allocating the barriers through allocate_shared_tensor(), which keeps the allocator accounting self-consistent. This is a byte-for-byte no-op for kernels without a shared workspace above the barriers (e.g. the Hopper matmul examples).
Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py passes.