[Codegen] Fix shared-memory under-allocation for kernels by yaoyaoding · Pull Request #158 · NVIDIA/tilus

yaoyaoding · 2026-07-02T21:11:52Z

Kernels that request a shared workspace (tcgen05.alloc, reduce, scan, shuffle) could under-request dynamic shared memory, causing an illegal memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py).

Root cause was two-fold:

1. smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset but computed dynamic_smem_bytes from the (possibly unaligned) high-water mark, so the alignment padding between the arena top and the workspace start was never reserved. The workspace tail then spilled past the requested dynamic shared memory. Now the total is sized from the aligned workspace offset.
2. The barrier allocation in mbarrier_alloc_ctx.finalize() (added in #154, "[Example] Add More Hopper Matmul Examples") bypassed the shared-memory allocator, manually bumping maximum_allocated by the raw barrier byte count (unaligned) and leaving the region unregistered in the free list. This left the high-water mark unaligned, exposing bug (1). Reverted to allocating the barriers through allocate_shared_tensor(), which keeps the allocator accounting self-consistent. This is a byte-for-byte no-op for kernels without a shared workspace above the barriers (e.g. the Hopper matmul examples).
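The size arithmetic behind bug (1) can be illustrated with a small standalone sketch. This is not the tilus implementation: the helper names (align_up, dynamic_smem_bytes_before, dynamic_smem_bytes_after, workspace_end) and the byte counts are hypothetical, and only the 128-byte workspace alignment comes from the description above. It covers only the case where a workspace is actually requested.

```python
WORKSPACE_ALIGNMENT = 128  # workspace alignment stated in the PR description


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def workspace_end(high_water_mark: int, workspace_bytes: int) -> int:
    """Last byte (exclusive) touched by a workspace placed at the aligned offset."""
    return align_up(high_water_mark, WORKSPACE_ALIGNMENT) + workspace_bytes


def dynamic_smem_bytes_before(high_water_mark: int, workspace_bytes: int) -> int:
    """Hypothetical pre-fix sizing: total computed from the raw high-water mark."""
    return high_water_mark + workspace_bytes


def dynamic_smem_bytes_after(high_water_mark: int, workspace_bytes: int) -> int:
    """Hypothetical post-fix sizing: total computed from the aligned workspace offset."""
    workspace_offset = align_up(high_water_mark, WORKSPACE_ALIGNMENT)
    return workspace_offset + workspace_bytes


# Illustrative numbers only: 4096 bytes of tiles plus 24 raw barrier bytes
# leave the high-water mark unaligned, as described in bug (2).
high_water_mark = 4096 + 24
workspace_bytes = 256

end = workspace_end(high_water_mark, workspace_bytes)
requested_before = dynamic_smem_bytes_before(high_water_mark, workspace_bytes)
requested_after = dynamic_smem_bytes_after(high_water_mark, workspace_bytes)

print("workspace ends at:", end)
print("requested before fix:", requested_before, "overrun:", end - requested_before)
print("requested after fix:", requested_after, "overrun:", end - requested_after)
```

With these numbers the pre-fix request falls short of the workspace end by exactly the alignment padding, while an already aligned high-water mark (for example 4096) makes both computations agree, which is why the under-allocation only surfaced once the barrier bytes left the mark unaligned.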

Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py passes.

copy-pr-bot · 2026-07-02T21:11:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…d workspace

Kernels that request a shared workspace (tcgen05.alloc, reduce, scan, shuffle) could under-request dynamic shared memory, causing an illegal memory access at runtime (observed in examples/blackwell_matmul/matmul_v2.py).

Root cause was two-fold:

1. smem_alloc_ctx.finalize() placed the workspace at a 128-aligned offset but computed dynamic_smem_bytes from the (possibly unaligned) high-water mark, so the alignment padding between the arena top and the workspace start was never reserved. The workspace tail then spilled past the requested dynamic shared memory. Now the total is sized from the aligned workspace offset.
2. The barrier allocation in mbarrier_alloc_ctx.finalize() (added in #154) bypassed the shared-memory allocator, manually bumping maximum_allocated by the raw barrier byte count (unaligned) and leaving the region unregistered in the free list. This left the high-water mark unaligned, exposing bug (1). Reverted to allocating the barriers through allocate_shared_tensor(), which keeps the allocator accounting self-consistent. This is a byte-for-byte no-op for kernels without a shared workspace above the barriers (e.g. the Hopper matmul examples).

Verified matmul_v2 now runs correctly and tests/kernels/matmul/test_matmul_v2.py passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>

yaoyaoding force-pushed the fix/smem-workspace-underalloc branch from 0f20ba6 to 94b5fd2 · July 2, 2026 21:12

yaoyaoding wants to merge 1 commit into main from fix/smem-workspace-underalloc
