[WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config #1982
Open: jasonlizhengjian wants to merge 3 commits into main from codex/add-minimaxm3-fp4-b200-dynamo-vllm
85 changes: 85 additions & 0 deletions
...marks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| name: "minimax-m3-vllm-disagg-b200-2p1d-fp4-dep2-dep8-8k1k" | ||
| | ||
| model: | ||
| path: "nvidia/MiniMax-M3-NVFP4" | ||
| container: "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41" | ||
| precision: "fp4" | ||
| | ||
| resources: | ||
| gpu_type: "b200" | ||
| gpus_per_node: 8 | ||
| prefill_nodes: 1 | ||
| decode_nodes: 1 | ||
| prefill_workers: 2 | ||
| decode_workers: 1 | ||
| gpus_per_prefill: 2 | ||
| gpus_per_decode: 8 | ||
| | ||
| dynamo: | ||
| install: true | ||
| version: 1.3.0.dev20260614 | ||
| | ||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: false | ||
| | ||
| backend: | ||
| type: vllm | ||
| connector: null | ||
| | ||
| prefill_environment: | ||
| VLLM_FLOAT32_MATMUL_PRECISION: high | ||
| UCX_TLS: "cuda_copy,rc" | ||
| | ||
| decode_environment: | ||
| VLLM_FLOAT32_MATMUL_PRECISION: high | ||
| UCX_TLS: "cuda_copy,rc" | ||
| | ||
| vllm_config: | ||
| prefill: | ||
| no-enable-flashinfer-autotune: true | ||
| tensor-parallel-size: 1 | ||
| data-parallel-size: 2 | ||
| data-parallel-rpc-port: 13345 | ||
| enable-expert-parallel: true | ||
| trust-remote-code: true | ||
| no-enable-prefix-caching: true | ||
| kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' | ||
| attention-config: '{"backend": "FLASHINFER", "use_trtllm_attention": true, "indexer_kv_dtype": "fp8"}' | ||
| kv-cache-dtype: fp8 | ||
| block-size: 128 | ||
| gpu-memory-utilization: 0.90 | ||
| max-model-len: 9472 | ||
| language-model-only: true | ||
| stream-interval: 32 | ||
| | ||
| decode: | ||
| no-enable-flashinfer-autotune: true | ||
| tensor-parallel-size: 1 | ||
| data-parallel-size: 8 | ||
| data-parallel-rpc-port: 13345 | ||
| enable-expert-parallel: true | ||
| trust-remote-code: true | ||
| no-enable-prefix-caching: true | ||
| kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' | ||
| attention-config: '{"backend": "FLASHINFER", "use_trtllm_attention": true, "indexer_kv_dtype": "fp8"}' | ||
| kv-cache-dtype: fp8 | ||
| block-size: 128 | ||
| gpu-memory-utilization: 0.90 | ||
| max-model-len: 9472 | ||
| language-model-only: true | ||
| stream-interval: 32 | ||
| max-num-seqs: 1024 # One DP8 decode worker provides 8 DP ranks. | ||
| max-num-batched-tokens: 16384 | ||
| max-cudagraph-capture-size: 4096 | ||
| | ||
| health_check: | ||
| max_attempts: 360 | ||
| interval_seconds: 10 | ||
| | ||
| benchmark: | ||
| type: "sa-bench" | ||
| isl: 8192 | ||
| osl: 1024 | ||
| concurrencies: "2048" | ||
| req_rate: "inf" | ||
🔴 The launcher's model-path alias key and the recipe's `model.path` don't match for this new b200-dgxc + minimaxm3-fp4 pairing: `runners/launch_b200-dgxc.sh:74` exports `SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"` but the new recipe at line 4 uses `path: "nvidia/MiniMax-M3-NVFP4"`, so srtctl's `model_paths` lookup misses. Every other analogous case in the tree (b300-nv minimaxm3-fp4, b200 minimaxm2.5-fp4/fp8, b200 dsv4-fp4) has the launcher prefix exactly equal to the recipe path. Fix by changing one side to match the other: for example, set `SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"` for the minimaxm3-fp4 branch on b200-dgxc (mirroring `runners/launch_b300-nv.sh:52`), or change the new recipe's `model.path` to `"minimax-m3-nvfp4"`.

Extended reasoning
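The mismatch

`runners/launch_b200-dgxc.sh:71-74` (pre-existing from PR #1932) sets up the minimaxm3-fp4 model resolution: it exports `MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"` and `SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"`.

The launcher then writes `srtslurm.yaml` (around line 157) with `model_paths: { "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4" }`.

But the new recipe at `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml:4` uses `path: "nvidia/MiniMax-M3-NVFP4"`.

srtctl looks up `"nvidia/MiniMax-M3-NVFP4"` in `model_paths`. The only registered alias is `"minimax-m3-nvfp4"`, so the lookup misses.

Why this PR is the trigger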
The pre-existing lines 71-74 were previously only exercised through the single-node code path (the `else` branch at the bottom of the launcher), which never touches `SRT_SLURM_MODEL_PREFIX`: it mounts `$MODEL_PATH` directly via `--container-mounts` and sets `MODEL=$MODEL_PATH`. So the mismatch was benign.

This PR adds the new `elif` at `launch_b200-dgxc.sh:116-121`, which is the first to route minimaxm3-fp4 through the srtctl / srt-slurm code path. That is the path that actually consumes `SRT_SLURM_MODEL_PREFIX` as an alias key in `srtslurm.yaml`, so the pre-existing but previously latent mismatch becomes load-bearing exactly at this PR.

Cross-check with every other b200/b300 case
| Case | `SRT_SLURM_MODEL_PREFIX` | `model.path` |
|---|---|---|
| b300-nv minimaxm3-fp4 (`launch_b300-nv.sh:52`) | `nvidia/MiniMax-M3-NVFP4` | `nvidia/MiniMax-M3-NVFP4` |
| b300-nv (`launch_b300-nv.sh:55`) | `MiniMaxAI/MiniMax-M3-MXFP8` | `MiniMaxAI/MiniMax-M3-MXFP8` |
| b200 minimaxm2.5-fp4 | `minimax-m2.5-nvfp4` | `minimax-m2.5-nvfp4` |
| b200 minimaxm2.5-fp8 | `minimax-m2.5-fp8` | `minimax-m2.5-fp8` |
| b200 dsv4-fp4 | `deepseek-v4-pro` | `deepseek-v4-pro` |
| b200-dgxc minimaxm3-fp4 (this PR) | `minimax-m3-nvfp4` | `nvidia/MiniMax-M3-NVFP4` |

Every other pairing in the tree matches exactly; the new b200-dgxc minimaxm3-fp4 case is the sole outlier. The b300 minimaxm3-fp4 case is particularly instructive because the new b200 recipe is a direct port of the b300 4p2d-dep2-dep8-8k1k recipe (per the PR description), so it inherits `nvidia/MiniMax-M3-NVFP4`, which matches on b300 but not on b200.

Step-by-step proof of the failure
1. The `minimaxm3-fp4-b200-dynamo-vllm` config runs on the `b200-multinode` runner.
2. `launch_b200-dgxc.sh` runs with `IS_MULTINODE=true`, `MODEL_PREFIX=minimaxm3`, `PRECISION=fp4`, `FRAMEWORK=dynamo-vllm`.
3. The launcher exports `SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"` and `MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"`.
4. The srt-slurm path uses the recipes under `recipes/vllm/minimax-m3/b200-fp4/`.
5. The `cat > srtslurm.yaml <<EOF` block writes `model_paths: { "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4" }`.
6. `srtctl apply -f $CONFIG_FILE` is invoked; srtctl parses the recipe and reads `model.path: "nvidia/MiniMax-M3-NVFP4"`.
7. srtctl searches `model_paths` for the key `"nvidia/MiniMax-M3-NVFP4"`, which is not present.
8. Outcome A: model resolution errors out in `srtctl apply`. Outcome B: srtctl treats the unmatched value as a HuggingFace hub identifier and attempts to download `nvidia/MiniMax-M3-NVFP4` from the hub on every job invocation, negating the pre-staging that the comment on `launch_b200-dgxc.sh:72` explicitly relies on.

Either outcome makes the full-sweep check fail: the job errors out at model resolution, or the HF pull fills the container filesystem or times out the runner. The PR is labeled `full-sweep-fail-fast-no-canary`, so this will surface as a fail-fast failure.

Fix
Either direction is a one-line change. The b300 side of the tree is the reference implementation, so the least surprising change is to line 74 of `runners/launch_b200-dgxc.sh`:

     elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
         # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
         export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
    -    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"
    +    export SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"

This also matches `runners/launch_b300-nv.sh:52` verbatim, keeping the two clusters consistent for the same model+precision.
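
To make the alias miss concrete, here is a minimal sketch of the lookup as this review describes it. It is not srtctl's actual code: `resolve_model` is a hypothetical helper, and the exact-match rule is an assumption taken from the behaviour described in this review. The values are the ones quoted from the launcher and the recipe.

```shell
#!/usr/bin/env bash
# Hypothetical model of the model_paths alias lookup described in this
# review. resolve_model is an illustrative helper, not part of srtctl.
set -u

# resolve_model ALIAS_KEY ALIAS_PATH RECIPE_MODEL_PATH
# Prints the pre-staged path when the recipe's model.path equals the
# registered alias key exactly; otherwise reports a miss and returns 1.
resolve_model() {
  local alias_key="$1" alias_path="$2" recipe_path="$3"
  if [[ "$recipe_path" == "$alias_key" ]]; then
    printf '%s\n' "$alias_path"
  else
    printf 'MISS: %s\n' "$recipe_path"
    return 1
  fi
}

MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
RECIPE_MODEL_PATH="nvidia/MiniMax-M3-NVFP4"

# Current launcher value: the alias key differs from the recipe's model.path.
before="$(resolve_model "minimax-m3-nvfp4" "$MODEL_PATH" "$RECIPE_MODEL_PATH" || true)"

# Proposed launcher value: the alias key equals the recipe's model.path.
after="$(resolve_model "nvidia/MiniMax-M3-NVFP4" "$MODEL_PATH" "$RECIPE_MODEL_PATH")"

echo "before: $before"
echo "after:  $after"
```

Under that assumption, the current prefix yields a miss and the proposed prefix resolves to the pre-staged checkpoint. What srtctl actually does on a miss (error out or fall back to a hub download) is not modelled here.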