GGEMM+srelu kernels for MxFP8 Nemotron by sraman-rgb · Pull Request #2981 · NVIDIA/TransformerEngine

sraman-rgb · 2026-05-12T19:11:28Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ksivaman · 2026-05-12T19:20:03Z

/te-ci pytorch

ksivaman · 2026-05-12T19:20:23Z

Please sign off your commits, @sraman-rgb.

greptile-apps · 2026-05-12T19:23:56Z

Greptile Summary

This PR refactors the fused GroupedMLP kernel hierarchy into a shared base class and adds ScaledSReLU (squared-ReLU with per-row post-scaling) as a second supported activation alongside the existing GLU variants, wiring up new cuDNN FE grouped_gemm_srelu_wrapper_sm100 / grouped_gemm_dsrelu_wrapper_sm100 kernels.

New ScaledSReLU op (activation.py): standard BasicOperation with num_extra_inputs=1, implements both unfused and fused forward/backward paths.
Refactored fused forward/backward: common logic moved to abstract base classes; GLU and Unary concrete subclasses wire their respective cuDNN FE kernels.
Fusion plumbing (_common.py): fuse_grouped_mlp_ops parameterised with activation_op_types; validate_grouped_mlp_dims extended for unary activations; separate forward/backward fusion functions registered for each activation family.
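
As a reference for the semantics described above, here is a minimal pure-Python sketch of squared ReLU with per-row post-scaling and its gradients. It is not the PR's implementation: the names `scaled_srelu` and `scaled_srelu_backward` are hypothetical, nested lists stand in for tensors, and the gradient formulas are derived from the assumed definition y[i][j] = scales[i] * max(x[i][j], 0)^2.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def scaled_srelu(x: Matrix, scales: List[float]) -> Matrix:
    """Assumed definition: y[i][j] = scales[i] * max(x[i][j], 0) ** 2."""
    if len(x) != len(scales):
        raise ValueError("expected exactly one scale per row")
    return [[s * max(v, 0.0) ** 2 for v in row] for row, s in zip(x, scales)]


def scaled_srelu_backward(
    x: Matrix, scales: List[float], dy: Matrix
) -> Tuple[Matrix, List[float]]:
    """Gradients of the assumed definition with respect to x and scales."""
    # d y / d x = scales[i] * 2 * max(x, 0)
    dx = [
        [g * s * 2.0 * max(v, 0.0) for v, g in zip(row, grad_row)]
        for row, grad_row, s in zip(x, dy, scales)
    ]
    # d y / d scales[i] = max(x, 0) ** 2, summed over the row
    dscales = [
        sum(g * max(v, 0.0) ** 2 for v, g in zip(row, grad_row))
        for row, grad_row in zip(x, dy)
    ]
    return dx, dscales
```

For x = [[1.0, -2.0], [3.0, 0.5]] and scales = [2.0, 0.5], the forward result is [[2.0, 0.0], [4.5, 0.125]]. An unfused reference of this kind is what a fused kernel would typically be compared against in a unit test.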

Confidence Score: 5/5

The refactor is well-structured and the SReLU kernel wiring follows the established GLU pattern closely; the two flagged items are clarifying questions rather than confirmed failures.

The class hierarchy generalisation is clean, dscales_tensor is always an allocated tensor, the recompute-FC2-input path is guarded by multiple independent checks, and test coverage spans both unit-level ScaledSReLU and the full grouped-MLP integration.

Files worth a closer look: forward_grouped_mlp.py (prob_tensor dtype) and _common.py (_nvidia_cudnn_frontend_supports_wgrad guard)

Important Files Changed

Filename	Overview
transformer_engine/pytorch/ops/basic/activation.py	Adds ScaledSReLU with correct unfused fuser_forward/fuser_backward; dtype handling and grad accumulation look sound.
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py	Base class refactor is clean; prob_tensor dtype (BF16/FP16 vs float32 fallback and backward) is an inconsistency worth confirming against the SReLU kernel spec.
transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py	dSReLU backward kernel wiring, recompute path, and grad_scales handling are logically correct; dscales_tensor is always allocated.
transformer_engine/pytorch/ops/_common.py	validate_grouped_mlp_dims and fuse_grouped_mlp_ops generalised cleanly; _nvidia_cudnn_frontend_supports_wgrad is a thin alias with no distinct version check.
transformer_engine/pytorch/ops/fused/__init__.py	Export list updated to expose the four new concrete fused-op classes; no issues.
transformer_engine/pytorch/ops/basic/__init__.py	Adds ScaledSReLU to the public API; straightforward.
tests/pytorch/test_fusible_ops.py	New test_scaled_srelu unit test and scaled_srelu parametrize for test_grouped_mlp look correct; reference implementation matches expected SReLU*scales semantics.

Sequence Diagram

sequenceDiagram
    participant Fuser
    participant GLUFwd as ForwardGroupedMLP_CuTeGEMMGLU_MXFP8
    participant SReLUFwd as ForwardGroupedMLP_CuTeGEMMUnary_MXFP8
    participant SReLUBwd as BackwardGroupedMLP_CuTeGEMMDUnary_MXFP8
    participant cuDNN as cuDNN FE Kernels

    Fuser->>GLUFwd: fuse_forward_ops GLU pattern
    GLUFwd->>cuDNN: grouped_gemm_glu_wrapper_sm100
    cuDNN-->>GLUFwd: fc2_in scales and activation_in

    Fuser->>SReLUFwd: fuse_forward_srelu_ops SReLU pattern
    SReLUFwd->>cuDNN: grouped_gemm_srelu_wrapper_sm100
    cuDNN-->>SReLUFwd: fc2_in scales and activation_in
    Note over SReLUFwd: Save activation_in and scales
    Note over SReLUFwd: optionally skip saving fc2_x

    Fuser->>SReLUBwd: fuse_backward_srelu_ops
    SReLUBwd->>cuDNN: grouped_gemm_dsrelu_wrapper_sm100
    cuDNN-->>SReLUBwd: FC1 dy tensors and grad_scales
    cuDNN-->>SReLUBwd: optional recomputed FC2 input
    SReLUBwd->>cuDNN: grouped_gemm_wgrad for FC1 and FC2

Reviews (8): Last reviewed commit: "Address grouped MLP ScaledSReLU review c..."

timmoon10

Overall looks good, but we've gotten to the point where we need to start thinking about how to gracefully handle adding new activations. It seems that every model has a different activation function.
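
One possible direction for that concern, sketched here purely as an illustration, is a small registry that maps each activation family to the fused kernels it needs, so that adding an activation means adding one entry rather than a new code path. Everything in this sketch is an assumption except the kernel names, which are the ones quoted in the Greptile summary; the registry keys, `ActivationFusion`, `register_activation`, and `lookup_activation` are hypothetical, and the GLU backward kernel is left as `None` because this page does not name it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ActivationFusion:
    """Kernel names for one activation family (illustrative only)."""
    forward_kernel: str
    backward_kernel: Optional[str]  # None where this page names no kernel


_REGISTRY: Dict[str, ActivationFusion] = {}


def register_activation(name: str, fusion: ActivationFusion) -> None:
    """Add an activation family; refuse silent overwrites."""
    if name in _REGISTRY:
        raise ValueError(f"activation {name!r} is already registered")
    _REGISTRY[name] = fusion


def lookup_activation(name: str) -> Optional[ActivationFusion]:
    """Return the fusion entry, or None so the caller can stay unfused."""
    return _REGISTRY.get(name)


# Hypothetical keys; kernel names as quoted in the review summary.
register_activation(
    "glu", ActivationFusion("grouped_gemm_glu_wrapper_sm100", None)
)
register_activation(
    "scaled_srelu",
    ActivationFusion(
        "grouped_gemm_srelu_wrapper_sm100",
        "grouped_gemm_dsrelu_wrapper_sm100",
    ),
)
```

Returning `None` for an unknown activation, rather than raising, keeps the unfused path as the default for any model whose activation has no fused kernel yet.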

vthumbe1503

LGTM. We might want to wait for the cuDNN release and until the appropriate cuDNN version guards are added.

vthumbe1503 · 2026-05-18T19:52:56Z

+        else:
+            try:
+                validate_grouped_mlp_dims(window[0], window[1], window[2])
+            except (TypeError, ValueError):
+                matches_pattern = False


We will eventually want to disable SReLU fusion here based on the cuDNN version, before the merge.
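
The version gating requested here could be sketched as follows: check the installed version first, then run the same dimension validation as in the snippet above. This is a sketch under stated assumptions, not the code that was merged. `parse_version`, `window_matches`, and `PLACEHOLDER_MIN_VERSION` are hypothetical names, the real minimum cuDNN version is not stated anywhere in this thread, and the `validate` argument stands in for `validate_grouped_mlp_dims`.

```python
from typing import Callable, Sequence, Tuple

# Placeholder only: the real minimum version is not given in this thread.
PLACEHOLDER_MIN_VERSION: Tuple[int, ...] = (99, 0, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.14.0rc1' into (1, 14, 0), keeping leading digits per part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def window_matches(
    window: Sequence[object],
    validate: Callable[[object, object, object], None],
    installed_version: str,
    min_version: Tuple[int, ...] = PLACEHOLDER_MIN_VERSION,
) -> bool:
    """Return True only if the version is new enough and the dims validate."""
    # Skip fusion entirely on versions older than the required minimum.
    if parse_version(installed_version) < min_version:
        return False
    # Same pattern as the reviewed snippet: a failed validation means
    # "no match", not an error.
    try:
        validate(window[0], window[1], window[2])
    except (TypeError, ValueError):
        return False
    return True
```

Checking the version before validating means an unsupported install never touches the fused path, and the ops fall back to their unfused implementations.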

vthumbe1503 · 2026-05-19T02:54:44Z

+                scales.detach().to(dtype=dtype).reshape(-1, 1, 1)
+                if scales is not None
+                else torch.ones((in_shape[0], 1, 1), dtype=torch.float32, device=device)


This might be a holdover from before, right? We expect the scales passed in to never be None, so can we revert this change?

greptile-apps Bot reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

sraman-rgb force-pushed the fc1-srelu-main branch from 8373402 to 765d2e9 on May 12, 2026 20:33

greptile-apps Bot reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Outdated

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Outdated

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/basic/activation.py

vthumbe1503 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

Add MXFP8 grouped MLP SReLU fusion

43093cc

Signed-off-by: sraman-rgb <sraman@nvidia.com>

sraman-rgb force-pushed the fc1-srelu-main branch from 765d2e9 to 43093cc on May 12, 2026 22:05

timmoon10 reviewed May 12, 2026

Comment thread transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Outdated

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

Comment thread tests/pytorch/test_fusible_ops.py Outdated

timmoon10 reviewed May 18, 2026

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Siddhartha Raman S and others added 5 commits May 18, 2026 14:46

Address grouped MLP fused op review comments

e29544f

Signed-off-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>

Avoid quantizing ScaledSReLU backward in basic op

2a4d310

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

Wire ScaledSReLU recompute in grouped MLP

74a2395

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5c83920

for more information, see https://pre-commit.ci

Address grouped MLP ScaledSReLU review comments

46b3169

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

sraman-rgb force-pushed the fc1-srelu-main branch from 912b1d9 to 46b3169 on May 18, 2026 21:47

vthumbe1503 reviewed May 19, 2026
