[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion) by Fangzhou-Ai · Pull Request #1969 · SemiAnalysisAI/InferenceX

Fangzhou-Ai · 2026-07-01T08:46:56Z

Summary

Adds the four levers that bring minimaxm3-fp4-mi355x-vllm (single-node STP) to high-concurrency parity with the ATOM recipe:

1. INT4 quantized all-reduce: env knobs VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 + ..._CAST_BF16_TO_FP16=0 + ..._QUANTIZATION_MIN_SIZE_KB=256. The decode all-reduce is the single biggest decode kernel; INT4 makes it ~4× cheaper. The MIN_SIZE_KB override is required: vLLM's default INT4 gate for (bf16, TP4) is 16 MB, so without the override quantization never fires for the ~1.5 MB decode all-reduces. Works on any nightly.
2. fp8 KV cache (--kv-cache-dtype fp8). Works on any nightly.
3. Cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4). Requires vllm-project/vllm#47269 ([ROCm][MiniMax-M3] Cross-layer lightning-indexer top-k sharing).
4. Input-side all-reduce + RMSNorm fusion. Requires vllm-project/vllm#47270 ([ROCm][MiniMax-M3] Fuse input-side all-reduce + RMSNorm in the AMD decoder); automatic once merged, no flag.
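
The size gate behind the MIN_SIZE_KB override can be illustrated with a little arithmetic. This is a minimal sketch, not vLLM's implementation: `uses_quantized_path` is a hypothetical helper, the rule "quantize only when the message is at least the minimum size" is an assumption inferred from the description above, and the sizes (16 MB default, 256 KB override, ~1.5 MB decode all-reduce) are the figures quoted in this PR.

```python
KB = 1024
MB = 1024 * KB


def uses_quantized_path(message_bytes: int, min_size_bytes: int) -> bool:
    """Hypothetical gate: quantize only messages at or above the minimum size."""
    return message_bytes >= min_size_bytes


decode_allreduce = int(1.5 * MB)  # ~1.5 MB decode all-reduce (figure from the PR text)
default_gate = 16 * MB            # default INT4 gate for (bf16, TP4), per the PR text
override_gate = 256 * KB          # value passed via ..._QUANTIZATION_MIN_SIZE_KB=256

# With the default gate the decode all-reduce stays unquantized;
# with the 256 KB override it qualifies for the INT4 path.
print(uses_quantized_path(decode_allreduce, default_gate))
print(uses_quantized_path(decode_allreduce, override_gate))

# Payload ratio of 16-bit bf16 values to 4-bit INT4 values, ignoring scale metadata.
payload_ratio = 16 / 4
print(payload_ratio)
```

The payload ratio only bounds the reduction in bytes moved; the "~4× cheaper" figure above is the author's measurement-level claim and is not derived here.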

⚠️ Merge/sweep dependency

Levers 3 and 4 need the two upstream vLLM PRs in the served image:

vllm-project/vllm#47269: [ROCm][MiniMax-M3] Cross-layer lightning-indexer top-k sharing
vllm-project/vllm#47270: [ROCm][MiniMax-M3] Fuse input-side all-reduce + RMSNorm in the AMD decoder

Only after both merge should the image pin (.github/configs/amd-master.yaml → minimaxm3-fp4-mi355x-vllm.image) be bumped to a nightly that contains them and the perf swept. Until then, index_topk_freq is a harmless no-op and the AR fusion is simply absent (INT4 all-reduce + fp8 KV still apply).

Measured (local, amd/MiniMax-M3-MXFP4, MI355X gfx950, TP4, 8k1k)

Using ATOM's own benchmark_serving against both, vs ATOM's published 4×MI355 reference:

| Concurrency | vLLM (this config) | ATOM ref | Gap |
|---|---|---|---|
| 32 | 17.21 ms | 16.74 ms | +2.8% |
| 64 | 25.13 ms | 25.00 ms | +0.5% (matched) |

Improvement grows with concurrency (−12% to −17% TPOT at conc 64/128/256 vs the pre-INT4 baseline). GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe; above ATOM's 0.9363).
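
The gap column is a plain relative difference against the ATOM reference. The sketch below recomputes it from the rounded millisecond values in the table; `relative_gap_pct` is a hypothetical helper name, and a figure derived from unrounded measurements could differ in the last digit.

```python
def relative_gap_pct(measured_ms: float, reference_ms: float) -> float:
    """Percentage by which measured_ms exceeds reference_ms."""
    return (measured_ms - reference_ms) / reference_ms * 100.0


# (vLLM, ATOM ref) in milliseconds, keyed by concurrency, as listed in the table.
rows = {32: (17.21, 16.74), 64: (25.13, 25.00)}

gaps = {conc: relative_gap_pct(vllm_ms, atom_ms) for conc, (vllm_ms, atom_ms) in rows.items()}
for conc, gap in gaps.items():
    print(f"conc {conc}: {gap:+.1f}%")
```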

Duplicate check

No open InferenceX PR adds these knobs to the M3 FP4 MI355X STP recipe.

AI assistance (Claude) was used to develop and validate this change.

Add the four levers that bring the single-node MXFP4 MI355X vLLM STP recipe to parity with the ATOM recipe at high concurrency:

- INT4 quantized all-reduce (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 + CAST_BF16_TO_FP16=0 + QUANTIZATION_MIN_SIZE_KB=256). The decode all-reduce is the biggest decode kernel; INT4 makes it ~4x cheaper.
- fp8 KV cache (--kv-cache-dtype fp8).
- cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4), which needs vllm-project/vllm#47269.
- input-side all-reduce + RMSNorm fusion, which needs vllm-project/vllm#47270 (automatic once merged).

Measured (amd/MiniMax-M3-MXFP4, MI355X, TP4, 8k1k, ATOM's benchmark on both): vLLM conc32 17.21ms / conc64 25.13ms vs ATOM ref 16.74 / 25.00 (matched). GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe).

The image pin must be bumped to a nightly containing #47269 and #47270 before sweeping for the full curve.

AI assistance (Claude) was used.

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>

github-actions · 2026-07-01T08:47:04Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Fangzhou-Ai · 2026-07-01T08:49:55Z

I'll enable the sweep after the 2 vLLM PRs are merged and nightly image is ready.

claude

LGTM — self-contained benchmark recipe tuning; adds INT4 quick-reduce, fp8 KV, and index-topk-sharing knobs using the same patterns as sibling AMD MI355X recipes, with the upstream-dependent knobs staying no-ops until the image pin is bumped separately.

Extended reasoning...

Overview

The PR touches two files: (1) benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh — adds three VLLM_ROCM_QUICK_REDUCE_* env vars for INT4 all-reduce, a --kv-cache-dtype fp8 flag, an --hf-overrides JSON turning on use_index_cache/index_topk_freq=4, and a comment block explaining the four levers; and (2) perf-changelog.yaml — a corresponding entry documenting the change. No master-config pin bumps, no production code paths.
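
As a rough illustration of how the knobs listed above fit together, the following sketch assembles an environment and a `vllm serve` command line. It is not the recipe script itself. Assumptions: only the environment variable whose full name appears in the PR text is set (the two abbreviated with "..." are left out rather than guessed), the `--hf-overrides` keys are taken from the description above, and the model name and TP4 come from the PR's measurement setup. The real script may pass additional flags.

```python
import json
import shlex

# Only the knob named in full in the PR text; the abbreviated ones are omitted.
env = {"VLLM_ROCM_QUICK_REDUCE_QUANTIZATION": "INT4"}

# Keys as described in the review of the recipe's --hf-overrides JSON.
hf_overrides = {"use_index_cache": True, "index_topk_freq": 4}

argv = [
    "vllm", "serve", "amd/MiniMax-M3-MXFP4",
    "--tensor-parallel-size", "4",               # TP4, per the measurement setup
    "--kv-cache-dtype", "fp8",                   # fp8 KV cache lever
    "--hf-overrides", json.dumps(hf_overrides),  # top-k sharing lever
]

# Render a shell-style command line for inspection; nothing is executed here.
env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
command_line = f"{env_prefix} {shlex.join(argv)}"
print(command_line)
```

Building the override as a dictionary and serializing it with `json.dumps` avoids hand-quoting JSON inside a shell string, which is an easy place to introduce mistakes.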

Security risks

None. This is benchmark recipe configuration — env vars and CLI flags passed to vllm serve in a single-node perf benchmark script. No auth, crypto, permissions, network exposure, or user-input surfaces are touched.

Level of scrutiny

Low. Benchmark recipes are non-critical, single-node scripts whose only downstream effect is CI perf numbers for a specific model+precision+SKU. The VLLM_ROCM_QUICK_REDUCE_QUANTIZATION env var is already used by four sibling MI355X recipes (minimaxm3_fp8_mi355x*, gptoss_fp4_mi3*x, kimik2.5_fp4_mi355x), so the pattern is well established. The author correctly notes that the index_topk_freq override and the AR+RMSNorm fusion depend on unmerged upstream vLLM PRs (#47269, #47270) and are harmless no-ops until the served image is bumped — a separate, explicit follow-up step.

Other factors

The bug hunting system found no issues. Changes come with measured perf and accuracy data (GSM8K limit=100 = 0.95, above ATOM's 0.9363), the comment block in the recipe clearly explains why each lever is safe/needed (notably the 16 MB→256 KB MIN_SIZE_KB rationale), and the perf-changelog entry follows the established format used by adjacent entries. Small, mechanical, self-contained tuning PR — a good fit for shadow approval.
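
One way to gauge how much weight a limit=100 GSM8K run carries is a simple binomial standard error. This is a back-of-the-envelope sketch under the assumption that the 100 samples are independent; `binomial_stderr` is a hypothetical helper, and the evaluation harness may report its own error estimate differently.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


vllm_acc, n = 0.95, 100  # GSM8K limit=100 result quoted in this PR
atom_acc = 0.9363        # ATOM figure quoted in this PR

stderr = binomial_stderr(vllm_acc, n)
difference = vllm_acc - atom_acc

print(f"standard error: {stderr:.4f}")
print(f"difference vs ATOM: {difference:.4f}")
```

Under these assumptions the difference between the two scores is smaller than one standard error, so the result reads as "no measurable accuracy regression" rather than as a demonstrated improvement.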

Fangzhou-Ai requested a review from a team July 1, 2026 08:46

github-project-automation Bot added this to InferenceMAX Board Jul 1, 2026

Add perf-changelog entry for minimaxm3-fp4-mi355x-vllm QR-INT4

77c5a9d

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>

Fangzhou-Ai marked this pull request as draft July 1, 2026 08:49

Fangzhou-Ai changed the title ~~[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close high-conc gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion)~~ [AMD] MiniMax-M3 FP4 MI355X vLLM STP: close gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion) Jul 1, 2026

claude Bot reviewed Jul 1, 2026

Fangzhou-Ai wants to merge 2 commits into main from amd/minimax-m3-fp4-mi355x-vllm-qr-int4