208 commits
83761d0
Add CollectiveX experimental cross-vendor collective/EP benchmark
Oseltamivir Jun 23, 2026
b7ed913
CollectiveX: import container by multi-arch tag, fix CI import hang
Oseltamivir Jun 23, 2026
e6fdd84
Merge branch 'main' into collectivex
Oseltamivir Jun 23, 2026
ccfae8e
CollectiveX: copy staged results back to checkout for artifact upload
Oseltamivir Jun 23, 2026
b384171
CollectiveX: per-job summary table + address PR review findings
Oseltamivir Jun 23, 2026
f48daed
CollectiveX: render results as a GitHub Actions job summary
Oseltamivir Jun 23, 2026
be9cc91
CollectiveX: add MI355X / MoRI EP path (dispatch+combine)
Oseltamivir Jun 23, 2026
d8ee9bf
CollectiveX: run MI355X MoRI on push; align launcher with serving script
Oseltamivir Jun 23, 2026
ac3f1b9
CollectiveX: size MoRI symmetric heap (first MI355X run hit the 2 GiB…
Oseltamivir Jun 23, 2026
46208f2
CollectiveX: set MoRI heap to 6G (16 GiB failed RDMA MR registration)
Oseltamivir Jun 23, 2026
b62de99
CollectiveX: MoRI MI355X validated on hardware; fix heap/buffer/teardown
Oseltamivir Jun 23, 2026
481ef59
CollectiveX: wire rccl-tests collective primitives for MI355X (CX_BEN…
Oseltamivir Jun 23, 2026
78322de
CollectiveX: key dispatch concurrency by SKU so B200/MI355X runs don'…
Oseltamivir Jun 23, 2026
2b23573
CollectiveX: render busbw & latency vs bytes/rank sweep tables in the…
Oseltamivir Jun 23, 2026
a3a492c
CollectiveX: GB200 8-GPU multi-node MNNVL path (CX_NODES), validated …
Oseltamivir Jun 23, 2026
871086d
CollectiveX: fix multi-node build cache (MPI=0 vs MPI=1) + gate all-z…
Oseltamivir Jun 23, 2026
368cfbc
CollectiveX: EP dispatch/combine token sweep with separated timing (t…
Oseltamivir Jun 24, 2026
e2717a3
CollectiveX: make MI355X launcher CI-robust (writable lock dir + node…
Oseltamivir Jun 24, 2026
5c7b273
CollectiveX: fair-comparison EP rebuild — deterministic trace, real f…
Oseltamivir Jun 24, 2026
0052b11
CollectiveX: resource-normalized + tuned regimes for the EP comparison
Oseltamivir Jun 24, 2026
3a872a9
CollectiveX: fail-fast timeout guard + cap the MoRI push smoke (T>=32…
Oseltamivir Jun 24, 2026
5876ea0
CollectiveX: floor MoRI normalized block_num — it deadlocks at T>=32 …
Oseltamivir Jun 24, 2026
353c8ee
CollectiveX: FP8 dispatch + low-latency mode + reject-unsupported fra…
Oseltamivir Jun 24, 2026
3bc941c
CollectiveX: fix B300 warmup artifact + GHA matrix for h100-dgxc/b300…
Oseltamivir Jun 24, 2026
9f85d05
CollectiveX: fix h100-dgxc + b300 launcher slurm/storage from serving…
Oseltamivir Jun 24, 2026
c596882
CollectiveX: serialize same-SKU GHA dispatches + add 3-run reproducib…
Oseltamivir Jun 24, 2026
e71ef3c
CollectiveX: per-point clock-ramp burst (gated) — fixes MoRI wedge + …
Oseltamivir Jun 24, 2026
4e217f9
CollectiveX: MoRI repro/validation drivers pass COLLECTIVEX_IMAGE (pr…
Oseltamivir Jun 24, 2026
7a2f94f
CollectiveX: repro driver — match the T row (MoRI ramp-safe) + cap Mo…
Oseltamivir Jun 24, 2026
bbe0578
CollectiveX: dedicated MoRI repro driver (validation-exact invocation)
Oseltamivir Jun 24, 2026
f7b9d35
CollectiveX v3 measurement: explicit contracts, pooled-trial p50/p90/…
Oseltamivir Jun 25, 2026
1afd268
CollectiveX v3 workflow: capability resolver + NCCL phase-dedup + con…
Oseltamivir Jun 25, 2026
6122acb
CollectiveX v3 plotter: percentile + suite selectors, logical-payload…
Oseltamivir Jun 25, 2026
c136ec5
CollectiveX: v3 harness smoke driver (validates contracts/trials/rout…
Oseltamivir Jun 25, 2026
cf34cb3
CollectiveX: MoRI repro driver iters knob (MORI_ITERS, tighter fast-o…
Oseltamivir Jun 25, 2026
82ec864
CollectiveX: v3 re-run drivers (deepep _v3_rerun.sh + mori _v3_mori.s…
Oseltamivir Jun 25, 2026
cad380a
CollectiveX plotter: default to p50 (p99 too noisy a tail estimate at…
Oseltamivir Jun 25, 2026
81cddca
CollectiveX plotter: X-axis Log/Linear toggle (was hardcoded log)
Oseltamivir Jun 25, 2026
e97bc8b
CollectiveX plotter: auto-stitch decode range into prefill curves (co…
Oseltamivir Jun 25, 2026
6a3a185
chore: dispatch CollectiveX snapshot updates [skip ci]
Oseltamivir Jun 25, 2026
270b7b4
CollectiveX: GB300 EP8 across 2 NVL72 trays + EP-degree-aware plotter
Oseltamivir Jun 25, 2026
a6812dc
CollectiveX: routing axis (balanced/zipf) + EPLB expert-replication l…
Oseltamivir Jun 25, 2026
45c4570
CollectiveX v4 (goal Part 1 + scaffolding): workload identity, measur…
Oseltamivir Jun 25, 2026
600e909
CollectiveX: analyze_ep.py — operating-envelope analysis (skew penalt…
Oseltamivir Jun 25, 2026
171c7d1
CollectiveX: --workload-dir canonical-trace consumption + make_worklo…
Oseltamivir Jun 25, 2026
6dba193
CollectiveX: failure taxonomy (classify hang/OOM/registration/deadloc…
Oseltamivir Jun 25, 2026
8ff23bd
CollectiveX plotter: coverage table (publication status per measured …
Oseltamivir Jun 25, 2026
9e52693
CollectiveX: provenance enrichment (GitHub ref/job/artifact, image ar…
Oseltamivir Jun 25, 2026
82c6130
CollectiveX: structured placement metadata + routing locality fractio…
Oseltamivir Jun 25, 2026
e273009
CollectiveX: scaling efficiency (strong/weak from EP4/EP8) + regressi…
Oseltamivir Jun 25, 2026
978d338
CollectiveX: MI355X cross-vendor canonical-workload consume driver (D…
Oseltamivir Jun 25, 2026
a413de2
CollectiveX plotter: fix grid 'undefined' panel title (stale 'serial'…
Oseltamivir Jun 26, 2026
d799e0f
CollectiveX plotter: prefill panels show only the real prefill range …
Oseltamivir Jun 26, 2026
1622dff
CollectiveX plotter: --legacy {all,exclude,only} — v4-only main plot …
Oseltamivir Jun 26, 2026
f5df0ea
CollectiveX GHA: add routing/eplb inputs + h200/gb300 SKUs; wire CX_E…
Oseltamivir Jun 26, 2026
bb296c4
CollectiveX: launch_gb300-nv.sh — GHA launcher for GB300 (EP4 via run…
Oseltamivir Jun 26, 2026
73da67b
CollectiveX GHA: per-(SKU+config) concurrency group so a multi-config…
Oseltamivir Jun 26, 2026
0df55e8
CollectiveX: per-runner stage dir (fix concurrent-dispatch stale-hand…
Oseltamivir Jun 26, 2026
13f0a0f
CollectiveX: fix H200 GHA launcher FS (/home/sa-shared, not /mnt/nfs)
Oseltamivir Jun 26, 2026
9fb6e5d
CollectiveX: H200 partition main (not hpc-gpu-1)
Oseltamivir Jun 26, 2026
2b5e26c
CollectiveX: GB300 launcher uses docker tag, not squash path
Oseltamivir Jun 26, 2026
d2433e3
CollectiveX: pin h200 dispatch to the h200-dgxc runner pool
Oseltamivir Jun 26, 2026
156bf44
CollectiveX: GHA campaign tooling — collector + matrix dry-label fix
Oseltamivir Jun 26, 2026
59a05e0
CollectiveX: gitignore _ssh_v4_archive/ (superseded SSH result JSONs)
Oseltamivir Jun 26, 2026
a767844
CollectiveX: distribution-identity hardening + quant-combine (PR311) …
Oseltamivir Jun 26, 2026
fd23d02
CollectiveX: complete goal Part 1 + Part 2 — runtime-visible contract…
Oseltamivir Jun 26, 2026
70cfef3
CollectiveX: cohort official-membership gate (publication_status==off…
Oseltamivir Jun 26, 2026
60dec7d
CollectiveX: immediate-priority — LL fixed-kernel resource split, res…
Oseltamivir Jun 26, 2026
36d3eb6
CollectiveX: fix UnboundLocalError on EPLB canonical runs — define ro…
Oseltamivir Jun 26, 2026
ee4ffe7
CollectiveX: gitignore _seeded_archive/ (superseded seeded-runtime re…
Oseltamivir Jun 26, 2026
45fa504
CollectiveX: full-suite GHA dispatch — workflow inputs (hidden/topk/e…
Oseltamivir Jun 26, 2026
2c15d94
CollectiveX: full-suite completeness fixes — collect limit 500 (was 1…
Oseltamivir Jun 27, 2026
880f82c
CollectiveX: keep-newest cfg_key includes resource axis (resource_mod…
Oseltamivir Jun 27, 2026
ddc08e7
CollectiveX: add iters workflow input (CX_ITERS) — for the MoRI/MI355…
Oseltamivir Jun 27, 2026
8392632
CollectiveX: add trials/warmup workflow inputs (CX_TRIALS/CX_WARMUP) …
Oseltamivir Jun 27, 2026
74f52e0
CollectiveX: fix workflow_dispatch >25-input limit — consolidate iter…
Oseltamivir Jun 27, 2026
1495866
CollectiveX: add B300 to ep-nightly/ep-models/ep-routing (was missing…
Oseltamivir Jun 27, 2026
0cf9fc6
CollectiveX: DeepEP V2 build hook (CX_DEEPEP_V2 -> build NCCL-Gin V2 …
Oseltamivir Jun 27, 2026
76a3032
CollectiveX: kernel_gen (deepep v1/v2) as a distinct identity axis — …
Oseltamivir Jun 27, 2026
91c7acf
collectivex: fix DeepEP V2 build on PEP 668 images (H200/B300)
Oseltamivir Jun 27, 2026
df7fdde
collectivex: headline defaults, decision/summary/tabs UI, regression …
Oseltamivir Jun 27, 2026
803b785
collectivex: render NCCL all-reduce/all-gather (family=nccl) in plot …
Oseltamivir Jun 27, 2026
b6176a6
collectivex: collect family=nccl (all-reduce/all-gather) + uccl/flash…
Oseltamivir Jun 27, 2026
a504a3e
collectivex: model-shape selector in plot (DeepSeek-V3/V4, MiniMax-M3…
Oseltamivir Jun 27, 2026
1e21c72
collectivex: UCCL EP backend + memcpy-family collective benches (offl…
Oseltamivir Jun 27, 2026
eb6f953
collectivex: document hardware/kernel-gated items (honest blockers)
Oseltamivir Jun 27, 2026
c16f885
collectivex: fix UCCL build-check (import torch first) + capability/c…
Oseltamivir Jun 27, 2026
4c661f9
collectivex: summarize.py recognizes memcpy-family collectives (offlo…
Oseltamivir Jun 27, 2026
95137b8
collectivex: correct UCCL EP status — scaffolded, full run deferred
Oseltamivir Jun 27, 2026
645f9d5
collectivex: collect offload/copy_engine/kvcache files + robust _coll…
Oseltamivir Jun 27, 2026
f531529
collectivex: review upstream precision PRs (MoRI 311, FlashInfer 3376…
Oseltamivir Jun 27, 2026
0e54cde
collectivex: populate offload/copy-engine/kv-cache plot tabs (real data)
Oseltamivir Jun 27, 2026
71477ee
collectivex: RL mesh-to-mesh transfer benchmark (family=rl-mesh)
Oseltamivir Jun 27, 2026
e6224de
collectivex: rl-mesh passes capability pre-flight (non-EP bench passt…
Oseltamivir Jun 27, 2026
c40de99
collectivex: render RL mesh-to-mesh tab (family=rl-mesh) — final coll…
Oseltamivir Jun 27, 2026
925285d
collectivex: launchers/ contains only launch*; runtime/ + tools/ split
Oseltamivir Jun 27, 2026
ca8a505
collectivex: FlashInfer EP adapter + framework all-reduce bench (wire…
Oseltamivir Jun 27, 2026
762eb48
collectivex: direct-cast FP8 + per-token scale-layout dispatch recipes
Oseltamivir Jun 27, 2026
42eddb4
collectivex: fix fp8-variant CLI choices + allreduce-fw gate + surfac…
Oseltamivir Jun 27, 2026
ccb0b4a
collectivex: fix FlashInfer EP Mapping (tp_size=world_size for pure EP)
Oseltamivir Jun 27, 2026
9e1ac40
collectivex: FlashInfer MoeAlltoAll requires hidden_size (Mapping fix…
Oseltamivir Jun 27, 2026
91530dd
collectivex: FlashInfer MNNVL via TorchDistBackend (no MPI) — the rea…
Oseltamivir Jun 27, 2026
e150424
collectivex: FlashInfer EP combine — clone payload + payload_in_works…
Oseltamivir Jun 27, 2026
7aca33d
collectivex: FlashInfer EP — handle stateful dispatch/combine FSM
Oseltamivir Jun 27, 2026
1535869
collectivex: roundtrip-only timing for FlashInfer EP (stateful paired…
Oseltamivir Jun 27, 2026
511188e
collectivex: FlashInfer combine — pass recv as-is (source contract: s…
Oseltamivir Jun 27, 2026
2ebeba9
collectivex: FlashInfer EP correctness factor = distinct ranks per token
Oseltamivir Jun 27, 2026
04d83bf
collectivex: UCCL EP — vendor deep_ep_wrapper (group-based Buffer) + …
Oseltamivir Jun 27, 2026
5d08a93
collectivex: UCCL — pin vendored deep_ep_wrapper to the wheel's tag (…
Oseltamivir Jun 27, 2026
cfa1ec5
collectivex: UCCL EP finalize os._exit past teardown SIGSEGV (result …
Oseltamivir Jun 27, 2026
510fc17
CollectiveX: FlashInfer EP quant dispatch (fp8 e4m3 variants + mxfp8 …
Oseltamivir Jun 28, 2026
0b2753b
CollectiveX: real FlashInfer one-shot/two-shot all-reduce (trtllm_all…
Oseltamivir Jun 28, 2026
5c48dfd
CollectiveX: gate nvfp4 dispatch to Blackwell + refresh gated.md
Oseltamivir Jun 28, 2026
156e9ea
CollectiveX: render framework all-reduce in the All-reduce tab + gate…
Oseltamivir Jun 28, 2026
d8b4764
CollectiveX: document collective-suite serving-use mapping (all-reduc…
Oseltamivir Jun 28, 2026
02ef8d2
CollectiveX: DeepEP hybrid-ep branch backend (NVIDIA TMA HybridEPBuffer)
Oseltamivir Jun 28, 2026
90877fb
CollectiveX: allow AMD collective benches on the MI355X launcher (kv-…
Oseltamivir Jun 28, 2026
3850003
CollectiveX: FlashInfer quantized COMBINE output (fp8) via newer moe_…
Oseltamivir Jun 28, 2026
49dd8db
CollectiveX: fix flashinfer-combine upgrade — match cubin/jit-cache v…
Oseltamivir Jun 28, 2026
f684b37
CollectiveX: raise MI355X wall-clock guard to 1800s (slow shared clus…
Oseltamivir Jun 28, 2026
d9e0423
CollectiveX: install flashinfer from NIGHTLY index for combine output…
Oseltamivir Jun 28, 2026
c2c7feb
CollectiveX: upgrade nvidia-cutlass-dsl with the nightly flashinfer (…
Oseltamivir Jun 28, 2026
43614ad
CollectiveX: record exact upgraded FlashInfer library stack in proven…
Oseltamivir Jun 28, 2026
d4c508a
CollectiveX: build flashinfer main from source if the nightly wheel l…
Oseltamivir Jun 28, 2026
ba7c14a
CollectiveX: force JIT-from-main for combine kernel (uninstall stale …
Oseltamivir Jun 28, 2026
85273c6
CollectiveX: fix combine-quant output_scales to UE8M0 uint8 block-32 …
Oseltamivir Jun 28, 2026
4b3fe29
CollectiveX: NVFP4 quantized combine output (flashinfer fp4 path) — c…
Oseltamivir Jun 28, 2026
ddfbdf7
CollectiveX: gated.md — quant combine OUTPUT now DONE on B300 (flashi…
Oseltamivir Jun 28, 2026
2d65048
CollectiveX: add nvfp4 to harness --combine-dtype argparse choices
Oseltamivir Jun 28, 2026
0e61ac1
CollectiveX: nvfp4 combine dequant — view e4m3 scales as uint8 for e2…
Oseltamivir Jun 28, 2026
d6bf7b1
CollectiveX: gated.md — NVFP4 combine also DONE on B300 (valid, corre…
Oseltamivir Jun 28, 2026
94f03d5
CollectiveX: MXFP4 dispatch via fp4_quantize(ue8m0, swizzled=False) —…
Oseltamivir Jun 28, 2026
99e4ba0
CollectiveX: MoRI fp8 blockwise (e4m3fnuz) dispatch — the FNUZ precis…
Oseltamivir Jun 28, 2026
fe013ce
CollectiveX: NIXL via container switch — transfer bench (wired) + dev…
Oseltamivir Jun 28, 2026
a15bd8b
CollectiveX: AMD SDMA copy path — attempt the off-SM DMA engine on MI…
Oseltamivir Jun 28, 2026
f06b701
CollectiveX: direct-cast FP8 combine — output_scalar_scale-only on th…
Oseltamivir Jun 28, 2026
8405b10
CollectiveX: MoRI-IO transfer bench — the AMD RDMA p2p transfer engin…
Oseltamivir Jun 28, 2026
3ab6feb
CollectiveX: gated.md — NIXL container-switch result + direct-cast ke…
Oseltamivir Jun 28, 2026
83679b0
CollectiveX: methodology — named per-model TP-MoE handoff shapes table
Oseltamivir Jun 28, 2026
ae3032f
CollectiveX: copy-engine — add flash-attention victim for copy-vs-att…
Oseltamivir Jun 28, 2026
0078e31
CollectiveX: MoRI fp8 = fp8_direct_cast (not blockwise) — the validat…
Oseltamivir Jun 28, 2026
08a2f1e
CollectiveX: MoRI fp8_direct_cast needs non-zero-copy (use_external_i…
Oseltamivir Jun 28, 2026
e4f71c4
CollectiveX: MoRI fp8 correctness — gate against the e4m3fnuz consist…
Oseltamivir Jun 28, 2026
8eec44d
CollectiveX: gated.md — FNUZ fp8 VALIDATED (fp8_direct_cast e4m3fnuz,…
Oseltamivir Jun 28, 2026
0cbfe17
CollectiveX: NCCL/RCCL KV-cache transfer backend (p2p send/recv)
Oseltamivir Jun 28, 2026
744426a
CollectiveX: GB200 launcher — add EP multi-srun path (was nccl-only m…
Oseltamivir Jun 28, 2026
001626a
CollectiveX: MoonCake KV transfer backend — pip-import the transfer e…
Oseltamivir Jun 28, 2026
1d7e063
CollectiveX: AITER all-reduce builder (AMD framework-AR tier)
Oseltamivir Jun 28, 2026
a51018c
CollectiveX: workflow concurrency group += inputs.nodes (multi-node E…
Oseltamivir Jun 28, 2026
7a104f2
CollectiveX: gated.md — NVL72 rack-scale EP DONE up to EP64 via Flash…
Oseltamivir Jun 28, 2026
e8b5013
CollectiveX: framework all-reduce — replicate the serving distributed…
Oseltamivir Jun 28, 2026
0688f5d
CollectiveX: vLLM all-reduce via container switch (allreduce-fw-vllm …
Oseltamivir Jun 28, 2026
568b0a7
CollectiveX: AITER all-reduce via serving-init replication (like sglang)
Oseltamivir Jun 28, 2026
f8d87b4
CollectiveX: vLLM AR — enter VllmConfig context; NIXL EP — build UCX-…
Oseltamivir Jun 28, 2026
f594ab9
CollectiveX: gated.md — framework-AR (sglang/vllm/aiter) DONE; NIXL U…
Oseltamivir Jun 28, 2026
e3b1aad
CollectiveX: MI355X cross-node EP path — MoRI RDMA internode (goal 183)
Oseltamivir Jun 28, 2026
79cf2f6
CollectiveX: cross-node H100/H200 EP path — multi-node torchrun + UCC…
Oseltamivir Jun 28, 2026
22c2a12
CollectiveX: add prune_results.py — results hygiene (newest-N-valid p…
Oseltamivir Jun 28, 2026
aaf79c9
CollectiveX: cross-node EP — MASTER_ADDR = routable NodeAddr IP (fix …
Oseltamivir Jun 28, 2026
34943b1
CollectiveX: pin cross-node PG bootstrap iface for EP rendezvous
Oseltamivir Jun 28, 2026
45097ca
CollectiveX: drop superseded DeepEP capability probes
Oseltamivir Jun 28, 2026
308101a
CollectiveX: drop tools/_keep_newest.py — subsumed by prune_results.py
Oseltamivir Jun 28, 2026
53c4575
CollectiveX: xnode-net — always-on net diagnostic + missing-iproute2 …
Oseltamivir Jun 28, 2026
7b93bc0
CollectiveX: opt-in FileStore rendezvous for cross-node EP (CX_RDZV_F…
Oseltamivir Jun 28, 2026
f108874
CollectiveX: H200 cross-node EP via multi-srun + FileStore rendezvous
Oseltamivir Jun 28, 2026
344d051
CollectiveX: cross-node EP local-spawn via FileStore (no torchrun agent)
Oseltamivir Jun 28, 2026
e8d9a77
CollectiveX: add nccl-ep — NCCL/RCCL all-to-all EP (cross-node, both …
Oseltamivir Jun 28, 2026
127785d
CollectiveX: add nccl-ep to run_ep.py --backend argparse choices
Oseltamivir Jun 28, 2026
68d0e18
CollectiveX: gated.md — cross-node EP DONE via nccl-ep (rendezvous + …
Oseltamivir Jun 28, 2026
4113533
CollectiveX: allow nccl-ep on MI355X launcher (was remapped to mori)
Oseltamivir Jun 28, 2026
5a66645
CollectiveX: gated.md — goal 183 DONE, MI355X cross-node EP via nccl-…
Oseltamivir Jun 28, 2026
af2b445
CollectiveX: allow mooncake on MI355X launcher (was remapped to mori)
Oseltamivir Jun 29, 2026
3f2db08
CollectiveX: gated.md — MI355X collective backfill outcomes
Oseltamivir Jun 29, 2026
a274bdf
CollectiveX: capability — accept nccl primitives bench on AMD (rccl-t…
Oseltamivir Jun 29, 2026
ccfb3e3
CollectiveX: _gha_suite.sh — --deepep-v2 + --backend override for ful…
Oseltamivir Jun 29, 2026
680c397
CollectiveX: register b200 + gb200, un-drop gb300, thread rack-scale …
Oseltamivir Jun 29, 2026
fc76925
CollectiveX: collectivex-sweep.yml — setup -> matrix(shards) -> aggre…
Oseltamivir Jun 29, 2026
7e3380b
CollectiveX: fix sweep canonical-manifest failures (shard mode)
Oseltamivir Jun 29, 2026
593d4a4
CollectiveX: fix rack-scale EP8 sweep + b200 DeepEP-V2 arch
Oseltamivir Jun 29, 2026
c53e827
CollectiveX: fix JOB_ID race in salloc launchers (matrix concurrency)
Oseltamivir Jun 29, 2026
38890f6
CollectiveX: fix rack-scale EP8 shard-file path resolution
Oseltamivir Jun 29, 2026
1e4ab46
CollectiveX: plot_ep reads the consolidated ndjson (collapse loose re…
Oseltamivir Jun 29, 2026
40f30cd
CollectiveX: combine per-backend sweeps into ONE dispatch (backend=all)
Oseltamivir Jun 29, 2026
64a2495
CollectiveX: remove superseded tools/ SSH-orchestration scripts
Oseltamivir Jun 29, 2026
5a28f27
CollectiveX: document uccl + deepep-hybrid aarch64 GB200/GB300 wall
Oseltamivir Jun 29, 2026
5a98078
CollectiveX: plot defaults to All publication view (show the full sweep)
Oseltamivir Jun 30, 2026
c308858
CollectiveX: deepep-v2 x86-single-node only (was mislabeling V1 as v2…
Oseltamivir Jun 30, 2026
06dd4e8
CollectiveX: correct stale UCCL 'deferred/scaffold' docs — it produce…
Oseltamivir Jun 30, 2026
fd49614
CollectiveX: gb200/gb300 DeepEP V2 at EP4 (aarch64 V2 builds; only EP…
Oseltamivir Jun 30, 2026
0dfb124
CollectiveX: sweep_matrix sets explicit gb200/gb300 tray count (EP4 w…
Oseltamivir Jun 30, 2026
3c546cb
CollectiveX: gb300 EP8 rack builds V2/quant-combine once per node (pe…
Oseltamivir Jun 30, 2026
3e2eeb4
CollectiveX: gb300 EP8 deepep — force NVSHMEM off MNNVL for DeepEP LL…
Oseltamivir Jun 30, 2026
1630e0b
CollectiveX: gb300 EP8 deepep-v2 — pass allow_mnnvl=True to span tray…
Oseltamivir Jun 30, 2026
dc4e0c5
CollectiveX: gb300 EP8 deepep-v2 DONE — finalize (sweep re-enable, gb…
Oseltamivir Jun 30, 2026
dfaef9c
CollectiveX: h100 launcher gains cross-node EP path (CX_NODES>1, worl…
Oseltamivir Jun 30, 2026
b37c000
CollectiveX: correct h100 cross-node overclaim (WALLED, not 'same pat…
Oseltamivir Jun 30, 2026
81f42c9
CollectiveX sweep: add --max-nodes filter (symmetric to --min-nodes) …
Oseltamivir Jun 30, 2026
b8beb2d
CollectiveX: re-validate gb300 uccl/deepep-hybrid walls (per-backend,…
Oseltamivir Jun 30, 2026
b623948
CollectiveX: fix deepep-hybrid EP8 build-env propagation across srun …
Oseltamivir Jun 30, 2026
d7529a5
CollectiveX: deepep-hybrid build installs to site-packages (persist a…
Oseltamivir Jun 30, 2026
b1f0b4b
CollectiveX: sweep_matrix keeps mori PREFILL (capped), not decode-only
Oseltamivir Jun 30, 2026
c61961f
CollectiveX: correct deepep-hybrid gb300 EP8 — WORKS (not intranode-o…
Oseltamivir Jun 30, 2026
f0a8370
CollectiveX: correct ep_deepep_hybrid docstring/provenance (EP8 MNNVL…
Oseltamivir Jun 30, 2026
aab1172
CollectiveX: doc — deleting all runs de-registers a non-main workflow…
Oseltamivir Jul 1, 2026
6651a24
CollectiveX sweep: raise --max-cases default 14 -> 128 (eliminate chu…
Oseltamivir Jul 1, 2026
1bad711
CollectiveX sweep: drop mode/resource_mode from shard key -> 49 jobs …
Oseltamivir Jul 1, 2026
689861b
CollectiveX: from-source builds idempotent (build once per allocation…
Oseltamivir Jul 1, 2026
ffe663e
CollectiveX sweep: CX_TIME=120 for consolidated shards (up to ~74 cas…
Oseltamivir Jul 1, 2026
353 changes: 353 additions & 0 deletions .github/workflows/collectivex-experimental.yml

Large diffs are not rendered by default.

210 changes: 210 additions & 0 deletions .github/workflows/collectivex-sweep.yml
@@ -0,0 +1,210 @@
# CollectiveX Sweep — one structured run instead of thousands of dispatches.
#
# Shape (mirrors the InferenceX CI tracker): setup -> sweep (a MATRIX job = "a job with other jobs
# in it") -> aggregate (the collector "at the end"). The matrix unit is a SHARD = one allocation that
# sweeps many cases sharing (sku, backend, mode, resource) — generate_matrix's own grouping, chunked
# so no cell exceeds the job budget. Each cell emits a handful of per-case JSONs; the aggregate job
# collects every shard into ONE line-delimited file (results/aggregate/*.ndjson) so there aren't
# thousands of individual result files. Run once per backend (deepep / uccl / flashinfer /
# deepep-hybrid / nccl-ep, + deepep_v2) for full parity.
name: CollectiveX Sweep
on:
  workflow_dispatch:
    inputs:
      backend:
        description: "EP library to sweep — 'all' = every backend in ONE combined matrix run (recommended)"
        type: choice
        default: all
        options: [all, deepep, uccl, flashinfer, deepep-hybrid, nccl-ep]
      deepep_v2:
        description: DeepEP V2 from-source kernels (kernel_gen=v2; only for a single-backend deepep run — 'all' already includes a deepep-v2 variant)
        type: boolean
        default: false
      suites:
        description: "'all' or comma-list of suite names"
        type: string
        default: all
      only_sku:
        description: Restrict to one SKU (h100-dgxc|h200|b300|b200-dgxc|gb200|gb300|mi355x); blank = all
        type: string
        default: ''
      min_nodes:
        description: Keep only shards with >= this tray count (2 = rack-scale EP8 only; blank = all)
        type: string
        default: ''
      max_nodes:
        description: Keep only shards with <= this tray count (1 = single-tray EP4 only; blank = all)
        type: string
        default: ''
      max_cases:
        description: Max cases per shard cell before chunking into another GHA job (128 = no chunking for current suites)
        type: string
        default: '128'

concurrency:
  group: cx-sweep-${{ github.ref }}-${{ inputs.backend }}-${{ inputs.deepep_v2 }}-${{ inputs.only_sku }}
  cancel-in-progress: false

jobs:
  # ---- setup: resolve the suites into the shard matrix (the "pending jobs" node) ----
  setup:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.gen.outputs.matrix }}
      n: ${{ steps.gen.outputs.n }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
        with: { clean: true }
      - run: pip install --quiet pyyaml
      - id: gen
        working-directory: experimental/CollectiveX
        run: |
          set -euo pipefail
          # backend='all' or a comma-list -> ONE combined multi-backend matrix; else a single backend.
          case "${{ inputs.backend }}" in
            all|*,*) bk="--backends ${{ inputs.backend }}" ;;
            deepep) bk="" ;;
            *) bk="--backend ${{ inputs.backend }}" ;;
          esac
          v2=""; [ "${{ inputs.deepep_v2 }}" = "true" ] && v2="--deepep-v2"
          os=""; [ -n "${{ inputs.only_sku }}" ] && os="--only-sku ${{ inputs.only_sku }}"
          mn=""; [ -n "${{ inputs.min_nodes }}" ] && mn="--min-nodes ${{ inputs.min_nodes }}"
          # Review annotation (Code scanning / CodeQL, severity Medium): "Workflow does not contain permissions". Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
          xn=""; [ -n "${{ inputs.max_nodes }}" ] && xn="--max-nodes ${{ inputs.max_nodes }}"
          # full matrix (with cases) -> artifact for the cells; slim (no cases) -> the strategy output.
          python3 sweep_matrix.py --suites "${{ inputs.suites }}" --max-cases "${{ inputs.max_cases }}" $bk $v2 $os $mn $xn --out matrix_full.json >/dev/null
          SLIM=$(python3 -c "import json;m=json.load(open('matrix_full.json'));print(json.dumps({'include':[{k:v for k,v in x.items() if k!='cases'} for x in m['include']]}))")
          echo "matrix=$SLIM" >> "$GITHUB_OUTPUT"
          echo "n=$(python3 -c "import json;print(len(json.load(open('matrix_full.json'))['include']))")" >> "$GITHUB_OUTPUT"
          python3 -c "import json;m=json.load(open('matrix_full.json'));print('shard-cells:',len(m['include']),'cases:',sum(x['n'] for x in m['include']))"
      - uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
        with:
          name: cxsweep-matrix-${{ github.run_id }}
          path: experimental/CollectiveX/matrix_full.json
          if-no-files-found: error

  # ---- sweep: ONE matrix cell per shard (the parent job with child jobs) ----
  sweep:
    needs: setup
    if: ${{ fromJSON(needs.setup.outputs.n) > 0 }}
    strategy:
      fail-fast: false
      max-parallel: 10 # don't saturate the ~20-runner fleet; cells queue as slots free
      matrix: ${{ fromJSON(needs.setup.outputs.matrix) }}
    # h200 label spans two clusters; pin to the validated dgxc pool (mirrors collectivex-experimental).
    runs-on: ${{ matrix.sku == 'h200' && 'h200-dgxc' || matrix.sku }}
    timeout-minutes: 350
    env:
      CX_BENCH: ${{ matrix.backend }}
      CX_DEEPEP_V2: ${{ matrix.deepep_v2 && '1' || '' }}
      CX_NODES: ${{ matrix.nodes }}
      CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json
      COLLECTIVEX_SOURCE_SHA: ${{ github.sha }}
      # Consolidated shards run a whole build-group (up to ~74 cases) + one from-source build in ONE
      # slurm allocation, so the launcher's default 45-min --time is too short. 120 min gives headroom;
      # the allocation releases early when the shard finishes, so short shards don't waste it.
      CX_TIME: '120'
      CX_NODELIST: ${{ matrix.sku == 'mi355x' && 'mia1-p01-g10,mia1-p01-g15' || '' }}
      CX_STAGE_DIR: ${{ matrix.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
        with: { clean: true }
      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
        with:
          name: cxsweep-matrix-${{ github.run_id }}
          path: experimental/CollectiveX
      - name: Extract this shard's cases (stdlib only — no runner deps)
        working-directory: experimental/CollectiveX
        run: |
          set -euo pipefail
          python3 -c "
          import json
          m=json.load(open('matrix_full.json'))
          s=[x for x in m['include'] if x['id']=='${{ matrix.id }}']
          assert s, 'shard ${{ matrix.id }} not in matrix'
          s=s[0]
          json.dump({'id':s['id'],'sku':s['sku'],'backend':s['backend'],'nodes':s['nodes'],'deepep_v2':s['deepep_v2'],'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))
          print('shard ${{ matrix.id }}:', len(s['cases']), 'cases')
          "
      - name: Sweep shard ${{ matrix.id }} (${{ matrix.n }} cases, one allocation)
        env:
          RUNNER_NAME: ${{ runner.name }}
        run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh"
      - name: Shard summary
        if: always()
        run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" || true
      - name: Upload shard results
        if: always()
        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
        with:
          name: cxshard-${{ matrix.id }}-${{ github.run_id }}
          path: experimental/CollectiveX/results/*.json # glob skips the hidden .shard_*.json
          if-no-files-found: warn

  # ---- aggregate: collect every shard into ONE ndjson (the "result aggregator at the end") ----
  aggregate:
    needs: sweep
    if: always()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0
        with: { clean: true }
      - uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
        with:
          pattern: cxshard-*-${{ github.run_id }}
          path: _shards
          merge-multiple: true
      - name: Aggregate shards -> one ndjson
        working-directory: experimental/CollectiveX
        run: |
          set -euo pipefail
          tag="${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}"
          python3 aggregate_results.py --in-dir ../../_shards --out "results/aggregate/collectivex_${tag}_${{ github.run_id }}.ndjson"
          {
            echo "## CollectiveX sweep aggregate (${tag})"
            echo '```'
            wc -l results/aggregate/*.ndjson 2>/dev/null || echo "no ndjson"
            echo '```'
          } >> "$GITHUB_STEP_SUMMARY"
      - name: Upload aggregate
        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
        with:
          name: cxsweep-aggregate-${{ inputs.backend }}${{ inputs.deepep_v2 && '-v2' || '' }}-${{ github.run_id }}
          path: experimental/CollectiveX/results/aggregate/*.ndjson
          if-no-files-found: warn

  update-frontend-snapshot:
    name: Update InferenceX-app snapshot
    needs: aggregate
    if: always() && needs.aggregate.result == 'success'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger CollectiveX snapshot update
        env:
          FRONTEND_PAT: ${{ secrets.INFX_FRONTEND_PAT }}
        run: |
          set -euo pipefail
          tmp="$(mktemp -d)"
          trap 'rm -rf "$tmp"' EXIT
          git clone --quiet --depth 1 --branch collectivex \
            "https://x-access-token:${FRONTEND_PAT}@github.com/SemiAnalysisAI/InferenceX-app.git" \
            "$tmp/app"
          cd "$tmp/app"
          git pull --rebase origin collectivex
          mkdir -p .github
          {
            echo "source_run_id=${{ github.run_id }}"
            echo "source_sha=${{ github.sha }}"
            echo "source_workflow=${{ github.workflow }}"
            echo "source_run_url=https://github.com/SemiAnalysisAI/InferenceX/actions/runs/${{ github.run_id }}"
            echo "triggered_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
          } > .github/collectivex-source-run.env

          git config user.name "InferenceX Data Bot"
          git config user.email "actions@users.noreply.github.com"
          git add .github/collectivex-source-run.env
          if git diff --cached --quiet; then
            echo "CollectiveX source-run marker is already current."
            exit 0
          fi
          git commit -m "chore: trigger CollectiveX data update for ${{ github.run_id }}"
          git push origin HEAD:collectivex
# Review annotation (Code scanning / CodeQL, severity Medium): "Workflow does not contain permissions". Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}
22 changes: 22 additions & 0 deletions experimental/CollectiveX/.gitignore
@@ -0,0 +1,22 @@
# in-container nccl-tests build cache
.nccl-tests/
# python
__pycache__/
*.pyc
# generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs,
# so keep results out of git (CI uploads them as workflow artifacts instead).
# Sanitized headline numbers live in CONTAINERS.md.
results/*.json
results/plots/
results/raw_*.txt
results/raw_*.txt.stderr
# superseded SSH-provenance result JSONs moved aside so plot_ep's recursive glob
# won't double-load them; same hostname/UUID sensitivity as results/.
_ssh_v4_archive/
# running local-only reflection log (not a committed artifact)
notes.md
goal.md
# superseded seeded-runtime GHA results (canonical counterpart exists); kept out of the plot glob
_seeded_archive/
# newest-good-per-config kept in results/; superseded runs moved here (out of the plot glob)
_superseded/
75 changes: 75 additions & 0 deletions experimental/CollectiveX/CONTAINERS.md
@@ -0,0 +1,75 @@
# CollectiveX — container & library versions

One **multi-arch, digest-pinned** container is used for all NVIDIA SKUs, so B200
(x86_64) and GB200 (aarch64) share a single reference and the B200-vs-GB200
comparison is truly same-image. Set in `runtime/common.sh` (`cx_default_image`).

## Default container (all NVIDIA SKUs)

- **Image:** import by tag **`lmsysorg/sglang:v0.5.11-cu130`** (multi-arch OCI index). Expected index digest, recorded for provenance/verification: `sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975`.
- **Multi-arch manifest list:** linux/amd64 + linux/arm64; `enroot import` on each host pulls the matching arch.
- **Import by TAG, not digest.** enroot builds its anonymous Docker Hub token scope from the *tag* and succeeds with no credentials (same as the serving launchers). A bare `repo@sha256:` ref makes enroot prompt for a password and **hang** in non-interactive CI; a combined `tag@sha256:` ref is rejected with HTTP 400. `cx_ensure_squash` therefore imports by tag with `</dev/null`, so a missing token fails fast instead of hanging. The first import is multi-GB and takes minutes; subsequent runs reuse the staged squash.
- **Why v0.5.11-cu130 (chosen):** it's the newest cu130 release **pre-staged on BOTH clusters** — B200 `/home/sa-shared/containers/` (amd64 squash) and GB200 `/mnt/lustre01/users-public/sa-shared/` (arm64 squash), same filename — so neither side imports at all. (Shared cu130 multi-arch squashes across both clusters: v0.5.8.post1, v0.5.9, v0.5.11 — v0.5.11 is newest.) `v0.5.12-cu130` is staged on B200 but **not** GB200: its 62 layers overflow enroot's overlay-based squash creation on the GB200 kernel (`enroot-mksquashovlfs: failed to mount overlay … Invalid argument`), so it can't be the shared default.
- **DeepEP: NOT bundled** here → `run_in_container.sh` builds it via `rebuild-deepep` at job setup (CX_BENCH=deepep). The NCCL path needs no DeepEP.
- **nccl-tests build:** in-container (login nodes have no `nvcc`), `CX_NCCL_HOME=/usr` (system `nccl.h` in `/usr/include`), `CX_CUDA_HOME=/usr/local/cuda`. cu130 lineage ⇒ CUDA 13; confirm exact NCCL/torch on first run and append below.
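
The import-by-tag rule above can be sketched in Python as follows. This is an illustration only: the real logic lives in the shell helper `cx_ensure_squash`, and the function names `enroot_import_cmd` and `import_squash` are hypothetical. The sketch assumes `enroot import --output <file> docker://<image>:<tag>` is the command being wrapped.

```python
import subprocess

# Tag reference taken from this document.
DEFAULT_TAG_REF = "lmsysorg/sglang:v0.5.11-cu130"


def enroot_import_cmd(image_ref: str, squash_path: str) -> list[str]:
    """Build an `enroot import` command for a tag-only image reference."""
    if "@" in image_ref:
        # Digest refs are refused up front: per the notes above they either
        # prompt for a password or are rejected by the registry.
        raise ValueError(f"import by tag, not digest: {image_ref!r}")
    return ["enroot", "import", "--output", squash_path, f"docker://{image_ref}"]


def import_squash(image_ref: str, squash_path: str) -> None:
    """Run the import with stdin closed so a credential prompt fails fast."""
    subprocess.run(
        enroot_import_cmd(image_ref, squash_path),
        stdin=subprocess.DEVNULL,  # Python equivalent of `</dev/null`
        check=True,                # raise if enroot exits non-zero
    )
```

Closing stdin is the key design choice: a prompt that cannot read input returns an error immediately, where an open terminal-less stdin could leave the job hanging.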

## Audited reference (cu130 lineage)

Live audit of the sibling DeepSeek-V4 image `lmsysorg/sglang:deepseek-v4-grace-blackwell` (aarch64) on GB200, 2026-06-23 — the multi-arch `v0.5.11-cu130` should match closely (same cu130 base); reconfirm on first run:

| Component | Version |
|---|---|
| OS / arch | Ubuntu 24.04.3, aarch64 |
| CUDA (`nvcc`) | 13.0 (V13.0.88) |
| NCCL (system `/usr/include/nccl.h`) | 2.28.3; torch-bundled 2.27.7 |
| PyTorch | 2.9.1+cu130 |
| DeepEP | bundled in *that* image; **not** in the multi-arch default |
| NVSHMEM | `libnvshmem_host.so.3` present |
| OpenMPI / gcc / make | 4.1.6 / 13.3.0 / 4.3 |
| GPU / driver | GB200, 580.126.20 |

**Version caveat:** the nccl-tests binary links **system NCCL** (2.28.x), while torch/DeepEP use the **bundled** NCCL (2.27.x). Record both in provenance (env_capture does); don't compare an nccl-tests curve against a DeepEP run as if NCCL were identical.
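
A guard for this caveat might look like the sketch below. The `Provenance` record and its field names are hypothetical (the actual `env_capture` schema is not shown here); the only facts taken from this document are that nccl-tests uses the system NCCL and torch/DeepEP use the bundled one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical subset of a captured environment record."""
    system_nccl: str   # NCCL linked by the nccl-tests binary
    bundled_nccl: str  # NCCL shipped with torch and used by DeepEP


def nccl_in_use(prov: Provenance, bench: str) -> str:
    """Return the NCCL version a given benchmark actually exercises."""
    # The "nccl" bench runs nccl-tests (system NCCL); anything else is
    # assumed to go through torch and therefore the bundled NCCL.
    return prov.system_nccl if bench == "nccl" else prov.bundled_nccl


def same_nccl(prov_a: Provenance, bench_a: str,
              prov_b: Provenance, bench_b: str) -> bool:
    """True only when both runs exercised the same NCCL version."""
    return nccl_in_use(prov_a, bench_a) == nccl_in_use(prov_b, bench_b)
```

With the audited versions above, `same_nccl(p, "nccl", p, "deepep")` is false for `p = Provenance("2.28.3", "2.27.7")`, so a plot script could label or refuse that comparison.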

## Bundled-DeepEP reference images (not the default)

If a bundled DeepEP is needed before `rebuild-deepep` is wired on the multi-arch image, these arch-specific images bundle it (pin by digest):

- B200 (amd64): `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b` (pre-staged on B200)
- GB200 (arm64): `lmsysorg/sglang:deepseek-v4-grace-blackwell@sha256:4f583347d7ff08aef7e16dbb4985b2a7c147ff49a0c261d5e27b8f5f41719368` (staged on GB200 Lustre)

Select via `CX_IMAGE=…@sha256:…` on the launch script.

## AMD container (MI355X) — MoRI EP

AMD CDNA4 cannot run the CUDA multi-arch image; MI355X uses a ROCm image that
bundles **MoRI** (AMD's EP dispatch/combine library). Set in `cx_default_image`
for `mi355x*` (also `mi350x*`/`mi325x*`/`mi300x*`).
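
The SKU-to-image selection can be pictured with the Python sketch below. It is not the actual `cx_default_image` (which is a shell function in `runtime/common.sh`); the function names are hypothetical, and the assumption that `CX_IMAGE` takes precedence over the default is mine. Image names are the ones recorded in this document.

```python
import os

NVIDIA_IMAGE = "lmsysorg/sglang:v0.5.11-cu130"
AMD_IMAGE = "rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2"
# SKU name prefixes that this document routes to the ROCm image.
AMD_SKU_PREFIXES = ("mi355x", "mi350x", "mi325x", "mi300x")


def default_image(sku: str) -> str:
    """Pick the default container for a SKU name by prefix match."""
    if sku.lower().startswith(AMD_SKU_PREFIXES):
        return AMD_IMAGE
    return NVIDIA_IMAGE


def resolve_image(sku: str, env=None) -> str:
    """Prefer an explicit CX_IMAGE override, else fall back to the default."""
    env = os.environ if env is None else env
    return env.get("CX_IMAGE") or default_image(sku)
```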

- **Image:** `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2` (single-arch ROCm 7.2.0 runtime; from the AMD master serving config). **Not digest-pinned yet** — record the digest here and pin once validated on the runner, like the NVIDIA image.
- **MoRI:** bundled in-image (build tag `mori-0227`). `tests/ep_mori.py` follows the upstream `ROCm/mori` `tests`/`examples` dispatch+combine path; capture the exact MoRI commit (`MORI_COMMIT` env → provenance) on first run.
- **Squash is NODE-LOCAL** (`/var/lib/squash`), not a shared FS, so `launch_mi355x-amds.sh` imports via `srun` on the allocated node (the NVIDIA adapters import on the login node onto shared FS). pyxis flags `--container-writable --container-remap-root` (matches the AMD serving launcher); workspace is bind-mounted directly (no `CX_STAGE_DIR`).
- **Transport:** intra-node **XGMI** (8× MI355X). Two backends wired: `CX_BENCH=mori` (MoRI EP dispatch/combine) and `CX_BENCH=nccl` (collective primitives via **rccl-tests**, the ROCm nccl-tests fork — built in-container with `make` against `/opt/rocm`/`amdclang++`/`librccl`; same `<op>_perf` binaries + output format as nccl-tests, so `run_nccl.py` parses it unchanged).
- **Validated on MI355X** (on-node via `salloc`+`srun`, nodes `mia1-p01-g10`/`g15`): `salloc` → enroot import (anonymous auth + tag, 24 layers → ~60 GB node-local squash) → torchrun → 8-rank Gloo + MoRI shmem → `EpDispatchCombineConfig`/dispatch/combine **numerically correct** (combine within tol, `max_rel ~2e-3`, ~85 µs round-trip at the decode shape). Three ionic_rdma-fabric constraints, all handled in `tests/ep_mori.py`:
- **RDMA MR size ceiling (~4 GiB).** MoRI registers the *entire* symmetric heap as one RDMA MR at init — even single-node (no disable-RDMA knob exists; only `MORI_DISABLE_P2P`, which forces the opposite). On these ionic NICs a 6 GiB MR fails (`RegisterRdmaMemoryRegion … errno 22 EINVAL`) while 2 GiB registers. Heap is held at **`MORI_SHMEM_HEAP_SIZE=2G`** (override `CX_MORI_HEAP_SIZE`). The reference test's hardcoded `6G` is exactly why it can't run as-is here.
- **Buffer sizing.** `max_num_inp_token_per_rank` is bounded (512 at the decode shape) so dispatch/combine buffers fit the 2 GiB heap. Much larger token counts would need a heap past the MR ceiling — out of reach on this fabric for now.
- **Teardown.** MoRI's shmem teardown asserts (`CheckStatusValid` → SIGABRT) when the op is destroyed after `shmem_finalize()`; `tests/ep_mori.py`'s `finalize()` hard-exits after writing results to avoid it.
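
The heap-size constraint above can be sketched as a small resolver. This is a hedged illustration, not the code in `tests/ep_mori.py`: `parse_size` and `resolve_heap` are hypothetical names, and the suffix parsing (binary K/M/G) is an assumption about how values like `2G` should be read. The 2G default and the failing 6 GiB size come from this document.

```python
import os

_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
DEFAULT_HEAP = "2G"               # size reported here as registering
KNOWN_BAD_BYTES = 6 * (1 << 30)   # 6 GiB failed MR registration


def parse_size(text: str) -> int:
    """Convert a size like '2G' or '512M' to bytes (binary units assumed)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def resolve_heap(env=None) -> str:
    """Return the heap size to use, honoring a CX_MORI_HEAP_SIZE override."""
    env = os.environ if env is None else env
    heap = env.get("CX_MORI_HEAP_SIZE") or DEFAULT_HEAP
    if parse_size(heap) >= KNOWN_BAD_BYTES:
        # Fail before MoRI init instead of hitting EINVAL at registration.
        raise ValueError(f"heap {heap} is at or above the size known to fail")
    return heap
```

A caller would set `os.environ["MORI_SHMEM_HEAP_SIZE"] = resolve_heap()` before initializing MoRI, since the heap is registered at init.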

Still TODO: capture the exact MoRI commit + a version table (ROCm/torch/RCCL) into provenance, and digest-pin the image.

## Cluster access / QOS

- **B200** (`slurm-login-slinky`): account `benchmark`, **only `gpu-2_qos`** → partition `gpu-2` only (shared with the serving sweep). `gpu-1`/`all` (idle) need `gpu-1_qos`/`all_qos`, not associated with this account.
- **GB200** (`watchtower`): account `benchmark`, qos `normal`, partition `batch` (`AllowQos=ALL`); idle capacity available. Runner workspace is **not** compute-visible → set `CX_STAGE_DIR` to a Lustre path (the launcher rsyncs there).

## First real results (Milestone-0 spike, on the DeepSeek-V4 images)

nccl-tests (system NCCL 2.28.3), all correctness-passed, peak bus-bw:

| op | B200 8× (NVLink island, x86_64), GB/s | GB200 4× (NVL72 MNNVL, aarch64), GB/s |
|---|---|---|
| all_reduce | 835 | 689 |
| all_gather | 653 | 658 |
| reduce_scatter | 667 | 661 |
| alltoall | 638 | 666 |

(B200 and GB200 carry distinct `comparison_key`s by topology class, so they are labelled as distinct rather than silently merged. Re-run on the multi-arch default to refresh under one image.)
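
Keeping results separated by `comparison_key` could be done along the lines of the sketch below. The record layout (`comparison_key`, `op`, `busbw_gbps`) and the key strings in the usage note are hypothetical; only the idea of grouping by key before summarizing comes from this document.

```python
from collections import defaultdict


def group_by_comparison_key(records):
    """Bucket result records so different topology classes never mix."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["comparison_key"]].append(rec)
    return dict(groups)


def peak_busbw(records):
    """Peak bus bandwidth per comparison key, in the records' own units."""
    return {
        key: max(rec["busbw_gbps"] for rec in recs)
        for key, recs in group_by_comparison_key(records).items()
    }
```

For example, feeding the two all_reduce rows from the table with keys such as `"b200-nvlink"` and `"gb200-mnnvl"` yields one peak per key (835 and 689) instead of a single merged maximum.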