[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion)#1969

Draft
Fangzhou-Ai wants to merge 2 commits into main from amd/minimax-m3-fp4-mi355x-vllm-qr-int4


@Fangzhou-Ai

Collaborator

Summary

Adds the four levers that bring minimaxm3-fp4-mi355x-vllm (single-node STP) to high-concurrency parity with the ATOM recipe:

  1. INT4 quantized all-reduce — env knobs VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 + ..._CAST_BF16_TO_FP16=0 + ..._QUANTIZATION_MIN_SIZE_KB=256. The decode all-reduce is the single biggest decode kernel; INT4 makes it ~4× cheaper. The MIN_SIZE_KB override is required — vLLM's default INT4 gate for (bf16, TP4) is 16 MB, so it never fires for the ~1.5 MB decode all-reduces. Works on any nightly.
  2. fp8 KV cache (--kv-cache-dtype fp8). Works on any nightly.
  3. Cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4) — requires [ROCm][MiniMax-M3] Cross-layer lightning-indexer top-k sharing vllm-project/vllm#47269.
  4. Input-side all-reduce + RMSNorm fusion, which requires [ROCm][MiniMax-M3] Fuse input-side all-reduce + RMSNorm in the AMD decoder vllm-project/vllm#47270 (automatic once merged; no flag).
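
The MIN_SIZE_KB reasoning in lever 1 can be sketched as a simple size gate. This is a minimal model, not vLLM's actual code: the function name and the "quantize at or above the threshold" rule are assumptions made for illustration, and the payload ratio only compares element widths (16-bit bf16 vs 4-bit INT4), ignoring any scale metadata.

```python
KB = 1024
MB = 1024 * KB


def uses_quantized_all_reduce(message_bytes: int, min_size_kb: int) -> bool:
    """Hypothetical size gate: quantize only messages at or above the threshold."""
    return message_bytes >= min_size_kb * KB


# Figures quoted in the summary above.
decode_message_bytes = int(1.5 * MB)  # ~1.5 MB decode all-reduce
default_gate_kb = 16 * 1024           # 16 MB default gate for (bf16, TP4)
override_gate_kb = 256                # MIN_SIZE_KB=256 override

# Width ratio between a bf16 element and an INT4 element.
payload_ratio = 16 // 4

print("default gate fires: ", uses_quantized_all_reduce(decode_message_bytes, default_gate_kb))
print("override gate fires:", uses_quantized_all_reduce(decode_message_bytes, override_gate_kb))
print("bf16/INT4 payload ratio:", payload_ratio)
```

Under these assumptions the ~1.5 MB decode message stays below the 16 MB default gate and only crosses it once the threshold is lowered to 256 KB, which is why the override is needed for the INT4 path to apply at all.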

⚠️ Merge/sweep dependency

Levers 3 and 4 need the two upstream vLLM PRs in the served image: vllm-project/vllm#47269 and vllm-project/vllm#47270.

Only after both merge should the image pin (in .github/configs/amd-master.yaml, entry minimaxm3-fp4-mi355x-vllm.image) be bumped to a nightly that contains them and the perf sweep run. Until then, index_topk_freq is a harmless no-op and the AR fusion is simply absent (INT4 all-reduce + fp8 KV still apply).

Measured (local, amd/MiniMax-M3-MXFP4, MI355X gfx950, TP4, 8k1k)

Using ATOM's own benchmark_serving against both, vs ATOM's published 4×MI355 reference:

| conc | vLLM (this config) | ATOM ref | gap |
|------|--------------------|----------|-----|
| 32 | 17.21 ms | 16.74 ms | +2.8% |
| 64 | 25.13 ms | 25.00 ms | +0.5% (matched) |
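
For readers who want to recompute the gap column, here is a minimal sketch. It assumes the gap is defined as the relative difference of the vLLM figure from the ATOM reference, (vLLM - ATOM) / ATOM; the helper name `gap_percent` is made up for this example.

```python
def gap_percent(vllm_ms: float, atom_ms: float) -> float:
    """Relative gap of the vLLM latency versus the ATOM reference, in percent."""
    return (vllm_ms - atom_ms) / atom_ms * 100.0


# (concurrency, vLLM ms, ATOM reference ms) as reported above.
measurements = [
    (32, 17.21, 16.74),
    (64, 25.13, 25.00),
]

for conc, vllm_ms, atom_ms in measurements:
    print(f"conc {conc}: {gap_percent(vllm_ms, atom_ms):+.1f}%")
```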

Improvement grows with concurrency (−12% to −17% TPOT at conc 64/128/256 vs the pre-INT4 baseline). GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe; above ATOM's 0.9363).

Duplicate check

No open InferenceX PR adds these knobs to the M3 FP4 MI355X STP recipe.

AI assistance (Claude) was used to develop and validate this change.

Add the four levers that bring the single-node MXFP4 MI355X vLLM STP recipe
to parity with the ATOM recipe at high concurrency:
  - INT4 quantized all-reduce (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 +
    CAST_BF16_TO_FP16=0 + QUANTIZATION_MIN_SIZE_KB=256). The decode all-reduce
    is the biggest decode kernel; INT4 makes it ~4x cheaper.
  - fp8 KV cache (--kv-cache-dtype fp8).
  - cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4),
    which needs vllm-project/vllm#47269.
  - input-side all-reduce + RMSNorm fusion, which needs vllm-project/vllm#47270
    (automatic once merged).

Measured (amd/MiniMax-M3-MXFP4, MI355X, TP4, 8k1k, ATOM's benchmark on both):
vLLM conc32 17.21ms / conc64 25.13ms vs ATOM ref 16.74 / 25.00 (matched).
GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe).

The image pin must be bumped to a nightly containing #47269 and #47270
before sweeping for the full curve.

AI assistance (Claude) was used.

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
@Fangzhou-Ai Fangzhou-Ai marked this pull request as draft July 1, 2026 08:49
@Fangzhou-Ai Fangzhou-Ai changed the title from "[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close high-conc gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion)" to "[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion)" on Jul 1, 2026
@Fangzhou-Ai

Collaborator Author

I'll enable the sweep after the two vLLM PRs are merged and the nightly image is ready.

@claude claude Bot left a comment

Contributor

LGTM — self-contained benchmark recipe tuning; adds INT4 quick-reduce, fp8 KV, and index-topk-sharing knobs using the same patterns as sibling AMD MI355X recipes, with the upstream-dependent knobs staying no-ops until the image pin is bumped separately.

Extended reasoning...

Overview

The PR touches two files: (1) benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh — adds three VLLM_ROCM_QUICK_REDUCE_* env vars for INT4 all-reduce, a --kv-cache-dtype fp8 flag, an --hf-overrides JSON turning on use_index_cache/index_topk_freq=4, and a comment block explaining the four levers; and (2) perf-changelog.yaml — a corresponding entry documenting the change. No master-config pin bumps, no production code paths.
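
As a rough illustration of the knobs described in this overview, the sketch below assembles the environment variables and flags into a printable command line without launching anything. It is not the contents of the recipe script: the full names of the two abbreviated env vars are assumed to share the VLLM_ROCM_QUICK_REDUCE_ prefix, `--tensor-parallel-size 4` is inferred from the TP4 setup, and the real script likely passes further flags not shown here.

```python
import json
import shlex

MODEL = "amd/MiniMax-M3-MXFP4"
PREFIX = "VLLM_ROCM_QUICK_REDUCE_"  # assumed common prefix for all three knobs

# INT4 quantized all-reduce knobs (lever 1).
env = {
    PREFIX + "QUANTIZATION": "INT4",
    PREFIX + "CAST_BF16_TO_FP16": "0",
    PREFIX + "QUANTIZATION_MIN_SIZE_KB": "256",
}

# Cross-layer indexer top-k sharing (lever 3), passed as a JSON override.
hf_overrides = {"use_index_cache": True, "index_topk_freq": 4}

argv = [
    "vllm", "serve", MODEL,
    "--tensor-parallel-size", "4",   # TP4, inferred from the measurement setup
    "--kv-cache-dtype", "fp8",       # fp8 KV cache (lever 2)
    "--hf-overrides", json.dumps(hf_overrides),
]

# Render "VAR=value ... vllm serve ..." for inspection; nothing is executed.
env_prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
command = f"{env_prefix} {shlex.join(argv)}"
print(command)
```

Building the override with `json.dumps` and quoting through `shlex` keeps the JSON intact when the line is pasted into a shell.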

Security risks

None. This is benchmark recipe configuration — env vars and CLI flags passed to vllm serve in a single-node perf benchmark script. No auth, crypto, permissions, network exposure, or user-input surfaces are touched.

Level of scrutiny

Low. Benchmark recipes are non-critical, single-node scripts whose only downstream effect is CI perf numbers for a specific model+precision+SKU. The VLLM_ROCM_QUICK_REDUCE_QUANTIZATION env var is already used by four sibling MI355X recipes (minimaxm3_fp8_mi355x*, gptoss_fp4_mi3*x, kimik2.5_fp4_mi355x), so the pattern is well established. The author correctly notes that the index_topk_freq override and the AR+RMSNorm fusion depend on unmerged upstream vLLM PRs (#47269, #47270) and are harmless no-ops until the served image is bumped — a separate, explicit follow-up step.

Other factors

The bug hunting system found no issues. Changes come with measured perf and accuracy data (GSM8K limit=100 = 0.95, above ATOM's 0.9363), the comment block in the recipe clearly explains why each lever is safe/needed (notably the 16 MB→256 KB MIN_SIZE_KB rationale), and the perf-changelog entry follows the established format used by adjacent entries. Small, mechanical, self-contained tuning PR — a good fit for shadow approval.
