[AMD] MiniMax-M3 FP4 MI355X vLLM STP: close gap vs ATOM (INT4 all-reduce + index-sharing + AR fusion) #1969
Fangzhou-Ai wants to merge 2 commits
Add the four levers that bring the single-node MXFP4 MI355X vLLM STP recipe
to parity with the ATOM recipe at high concurrency:
- INT4 quantized all-reduce (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 +
CAST_BF16_TO_FP16=0 + QUANTIZATION_MIN_SIZE_KB=256). The decode all-reduce
is the biggest decode kernel; INT4 makes it ~4x cheaper.
- fp8 KV cache (--kv-cache-dtype fp8).
- cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4),
which needs vllm-project/vllm#47269.
- input-side all-reduce + RMSNorm fusion, which needs vllm-project/vllm#47270
(automatic once merged).
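As a rough illustration, the levers that are set from the recipe itself could be assembled into a launch command as sketched below. This is not a copy of the recipe script: the fully expanded environment-variable names (the text abbreviates two of them), the `use_index_cache` key and JSON shape, and the helper names are assumptions made for the example.

```python
import json
import shlex

# Quick-reduce settings named in the PR. The last two names are expanded from
# the abbreviated forms in the text; the shared prefix is an assumption.
QUICK_REDUCE_ENV = {
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
    "VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16": "0",
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION_MIN_SIZE_KB": "256",
}


def build_serve_command(model: str, tp_size: int, index_topk_freq: int) -> list[str]:
    """Return an argv list for `vllm serve` with fp8 KV cache and the index-sharing override."""
    # Assumed JSON shape for the override; the real keys are defined by the upstream PR.
    hf_overrides = {"use_index_cache": True, "index_topk_freq": index_topk_freq}
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", "fp8",
        "--hf-overrides", json.dumps(hf_overrides),
    ]


def render(env: dict[str, str], argv: list[str]) -> str:
    """Render the env assignments and argv as a single shell-safe command line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    argv = build_serve_command("amd/MiniMax-M3-MXFP4", tp_size=4, index_topk_freq=4)
    print(render(QUICK_REDUCE_ENV, argv))
```

Keeping the environment and the argument list as data makes it easy to diff one recipe against a sibling recipe before launching anything.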
Measured (amd/MiniMax-M3-MXFP4, MI355X, TP4, 8k1k, ATOM's benchmark on both):
vLLM conc32 17.21ms / conc64 25.13ms vs ATOM ref 16.74 / 25.00 (matched).
GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe).
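To make "matched" concrete, the relative gap between the two sets of figures can be computed directly. The snippet below is a small sketch that uses only the four latency numbers quoted above; the helper name is hypothetical.

```python
def relative_gap_pct(candidate_ms: float, reference_ms: float) -> float:
    """Percent by which the candidate latency exceeds the reference (positive = slower)."""
    return (candidate_ms - reference_ms) / reference_ms * 100.0


# (vLLM ms, ATOM reference ms) per concurrency, as quoted in the commit message.
MEASURED = {32: (17.21, 16.74), 64: (25.13, 25.00)}

if __name__ == "__main__":
    for conc, (vllm_ms, atom_ms) in MEASURED.items():
        gap = relative_gap_pct(vllm_ms, atom_ms)
        print(f"conc{conc}: vLLM is {gap:.2f}% slower than the ATOM reference")
```

With these inputs the gap works out to about 2.8% at conc32 and about 0.5% at conc64.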
The image pin must be bumped to a nightly containing #47269 and #47270
before sweeping for the full curve.
AI assistance (Claude) was used.
Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.
If additional help is needed, PR authors can reach out to core maintainers over Slack.
Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
I'll enable the sweep after the 2 vLLM PRs are merged and the nightly image is ready.
LGTM — self-contained benchmark recipe tuning; adds INT4 quick-reduce, fp8 KV, and index-topk-sharing knobs using the same patterns as sibling AMD MI355X recipes, with the upstream-dependent knobs staying no-ops until the image pin is bumped separately.
Extended reasoning...
Overview
The PR touches two files: (1) benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh — adds three VLLM_ROCM_QUICK_REDUCE_* env vars for INT4 all-reduce, a --kv-cache-dtype fp8 flag, an --hf-overrides JSON turning on use_index_cache/index_topk_freq=4, and a comment block explaining the four levers; and (2) perf-changelog.yaml — a corresponding entry documenting the change. No master-config pin bumps, no production code paths.
Security risks
None. This is benchmark recipe configuration — env vars and CLI flags passed to vllm serve in a single-node perf benchmark script. No auth, crypto, permissions, network exposure, or user-input surfaces are touched.
Level of scrutiny
Low. Benchmark recipes are non-critical, single-node scripts whose only downstream effect is CI perf numbers for a specific model+precision+SKU. The VLLM_ROCM_QUICK_REDUCE_QUANTIZATION env var is already used by four sibling MI355X recipes (minimaxm3_fp8_mi355x*, gptoss_fp4_mi3*x, kimik2.5_fp4_mi355x), so the pattern is well established. The author correctly notes that the index_topk_freq override and the AR+RMSNorm fusion depend on unmerged upstream vLLM PRs (#47269, #47270) and are harmless no-ops until the served image is bumped — a separate, explicit follow-up step.
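The exact semantics of index_topk_freq are defined in vllm-project/vllm#47269 and are not spelled out in this thread. One plausible reading, stated purely as an assumption, is that the indexer recomputes its top-k selection on every N-th layer and the layers in between reuse the most recent result. The toy sketch below illustrates that reading only; the function names and the 8-layer example are hypothetical.

```python
def recompute_layers(num_layers: int, index_topk_freq: int) -> list[int]:
    """Layers that would run the indexer under the assumed every-N-th-layer rule."""
    if index_topk_freq < 1:
        raise ValueError("index_topk_freq must be >= 1")
    return [layer for layer in range(num_layers) if layer % index_topk_freq == 0]


def source_layer(layer: int, index_topk_freq: int) -> int:
    """Layer whose top-k indices `layer` would reuse under the same assumption."""
    return layer - (layer % index_topk_freq)


if __name__ == "__main__":
    layers, freq = 8, 4
    # Layers that run the indexer themselves.
    print(recompute_layers(layers, freq))
    # For each layer, the layer it would borrow indices from.
    print([source_layer(i, freq) for i in range(layers)])
```

Under this reading, a frequency of 4 would cut indexer invocations to roughly a quarter, which would explain why the setting has no effect on an image that does not implement it.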
Other factors
The bug hunting system found no issues. Changes come with measured perf and accuracy data (GSM8K limit=100 = 0.95, above ATOM's 0.9363), the comment block in the recipe clearly explains why each lever is safe/needed (notably the 16 MB→256 KB MIN_SIZE_KB rationale), and the perf-changelog entry follows the established format used by adjacent entries. Small, mechanical, self-contained tuning PR — a good fit for shadow approval.
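The MIN_SIZE_KB rationale mentioned above comes down to a size comparison. The sketch below assumes the gate is a simple "quantize only if the message is at least the minimum size" rule and that payload cost scales with bits per element, ignoring any quantization scale metadata. Both are simplifying assumptions, and the function names are hypothetical; the sizes are the ones quoted in the PR.

```python
def int4_path_fires(message_kb: float, min_size_kb: float) -> bool:
    """Assumed gate: the quantized path is used only for messages >= the minimum size."""
    return message_kb >= min_size_kb


def payload_ratio(src_bits: int, dst_bits: int) -> float:
    """How many times smaller the payload gets when each element shrinks."""
    return src_bits / dst_bits


DECODE_ALLREDUCE_KB = 1.5 * 1024  # ~1.5 MB decode all-reduce (from the PR)
DEFAULT_GATE_KB = 16 * 1024       # 16 MB default gate for (bf16, TP4) (from the PR)
OVERRIDE_GATE_KB = 256            # value set by the recipe

if __name__ == "__main__":
    print(int4_path_fires(DECODE_ALLREDUCE_KB, DEFAULT_GATE_KB))   # False
    print(int4_path_fires(DECODE_ALLREDUCE_KB, OVERRIDE_GATE_KB))  # True
    print(payload_ratio(16, 4))                                    # 4.0
```

The last line is the back-of-the-envelope source of the "~4x cheaper" figure: bf16 uses 16 bits per element and INT4 uses 4.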
Summary
Adds the four levers that bring minimaxm3-fp4-mi355x-vllm (single-node STP) to high-concurrency parity with the ATOM recipe:
1. INT4 quantized all-reduce: VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 + ..._CAST_BF16_TO_FP16=0 + ..._QUANTIZATION_MIN_SIZE_KB=256. The decode all-reduce is the single biggest decode kernel; INT4 makes it ~4× cheaper. The MIN_SIZE_KB override is required: vLLM's default INT4 gate for (bf16, TP4) is 16 MB, so it never fires for the ~1.5 MB decode all-reduces. Works on any nightly.
2. fp8 KV cache (--kv-cache-dtype fp8). Works on any nightly.
3. Cross-layer indexer top-k sharing (--hf-overrides index_topk_freq=4): requires vllm-project/vllm#47269 ([ROCm][MiniMax-M3] Cross-layer lightning-indexer top-k sharing).
4. Input-side all-reduce + RMSNorm fusion: requires vllm-project/vllm#47270 (automatic once merged).
Levers 3 and 4 need the two upstream vLLM PRs (vllm-project/vllm#47269 and vllm-project/vllm#47270) in the served image. Only after both merge should the image pin (.github/configs/amd-master.yaml → minimaxm3-fp4-mi355x-vllm.image) be bumped to a nightly that contains them and the perf swept. Until then, index_topk_freq is a harmless no-op and the AR fusion is simply absent (INT4 all-reduce + fp8 KV still apply).
Measured (local, amd/MiniMax-M3-MXFP4, MI355X gfx950, TP4, 8k1k)
Using ATOM's own benchmark_serving against both, vs ATOM's published 4×MI355 reference: vLLM conc32 17.21 ms / conc64 25.13 ms vs ATOM ref 16.74 / 25.00 (matched).
Improvement grows with concurrency (−12% to −17% TPOT at conc 64/128/256 vs the pre-INT4 baseline). GSM8K limit=100 = 0.95 (INT4 all-reduce is accuracy-safe; above ATOM's 0.9363).
Duplicate check
No open InferenceX PR adds these knobs to the M3 FP4 MI355X STP recipe.
AI assistance (Claude) was used to develop and validate this change.