feat(utils.sh): add ezpz_load_modules_{aurora,sunspot,polaris} (#131)
Symmetric to ezpz_setup_xpu, but covers the full ALCF module stack
each system needs for an externally-managed Python (e.g. fresh
`uv venv`) instead of the prebuilt frameworks/conda env.
For each system, loads everything the canonical environment module
would have loaded as deps, but NOT the framework/conda module
itself — so a user can do:
```bash
ezpz_load_modules_aurora  # or _sunspot / _polaris
uv venv --python=$(which python3)
source .venv/bin/activate
uv pip install ...
```
without dragging in the prebuilt PyTorch / Python.
- aurora / sunspot: load the same oneAPI + HDF5 + PTI stack as
ezpz_setup_xpu, set the matching XPU runtime envvars (ZE flat
hierarchy, CCL launcher, oneAPI device selector). Aurora also
exports FI_MR_CACHE_MONITOR=userfaultfd for parity with
ezpz_setup_conda_aurora.
- polaris: load PrgEnv-gnu, craype-x86-milan, cray-hdf5-parallel,
cudnn, gcc-native — the same deps the polaris conda module pulls
in. Mirror the conda module's envvar set: CC/CXX, CUDA paths +
PATH/LD_LIBRARY_PATH, NCCL + TensorRT lib paths, ALCF
http(s)_proxy, MPI/JAX hints. Reverse-engineered from
/soft/modulefiles/conda/2025-09-25.lua so we stay in sync with
the canonical module's behavior.
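A minimal sketch of the Polaris loader described above. The module names come from this commit message; the compiler-wrapper values (`cc`/`CC`), the `CUDA_HOME` location, and the proxy URL are illustrative assumptions, not the actual `utils.sh` implementation:

```shell
# Sketch only — not the real utils.sh code.
ezpz_load_modules_polaris_sketch() {
    module use /soft/modulefiles
    # Same deps the polaris conda module pulls in (per the commit message).
    module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native
    # Cray compiler wrappers (assumed values).
    export CC=cc
    export CXX=CC
    # CUDA paths (this default location is an assumption).
    export CUDA_HOME="${CUDA_HOME:-/opt/nvidia/hpc_sdk/Linux_x86_64/cuda}"
    export CUDA_PATH="${CUDA_HOME}"
    export PATH="${CUDA_HOME}/bin:${PATH}"
    # ${VAR:+...} avoids appending an empty path element when unset.
    export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    # ALCF HTTP proxy for compute nodes (URL assumed).
    export http_proxy="${http_proxy:-http://proxy.alcf.anl.gov:3128}"
    export https_proxy="${https_proxy:-${http_proxy}}"
    # MPI/GPU hint mirrored from the conda module's envvar set.
    export MPICH_GPU_SUPPORT_ENABLED=1
}
```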
Reviewer's Guide

Adds three system-specific helper functions to load ALCF Aurora, Sunspot, and Polaris module stacks (mirroring their canonical frameworks/conda modules minus the actual framework/conda activation) so users can build custom Python virtualenvs atop the standard HPC environments, including matching runtime environment variables.

Sequence diagram for `ezpz_load_modules_polaris` and venv creation:

```mermaid
sequenceDiagram
    actor User
    participant Shell
    participant ModuleSystem
    participant Environment
    User->>Shell: ezpz_load_modules_polaris
    Shell->>ModuleSystem: module use /soft/modulefiles
    ModuleSystem-->>Shell: Module search path updated
    Shell->>ModuleSystem: module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native
    ModuleSystem-->>Shell: Modules loaded
    Shell->>Environment: export CC CXX
    Shell->>Environment: export CUDA_HOME CUDA_PATH CUDA_TOOLKIT_BASE
    Shell->>Environment: update PATH LD_LIBRARY_PATH for CUDA
    Shell->>Environment: set TORCH_CUDA_ARCH_LIST FLASHINFER_CUDA_ARCH_LIST
    Shell->>Environment: set NCCL and TensorRT paths
    Shell->>Environment: set http_proxy https_proxy
    Shell->>Environment: set MPICH_GPU_SUPPORT_ENABLED MPI4JAX_USE_CUDA_MPI
    Shell->>Environment: set XLA_FLAGS XLA_PYTHON_CLIENT_PREALLOCATE
    User->>Shell: uv venv --python=$(which python3)
    Shell->>ModuleSystem: resolve python3 from loaded stack
    ModuleSystem-->>Shell: /path/to/system/python3
    Shell->>Shell: create .venv with system python
    User->>Shell: source .venv/bin/activate
    Shell->>Environment: activate virtualenv
```
Flow diagram for selecting the `ezpz_load_modules` function by system:

```mermaid
flowchart TD
    Start([Start])
    S["Select target system"]
    A[Aurora]
    Su[Sunspot]
    P[Polaris]
    Start --> S
    S --> A
    S --> Su
    S --> P
    A --> A1["Call ezpz_load_modules_aurora"]
    A1 --> A2["module load oneapi/release/2025.3.1 hdf5 pti-gpu"]
    A2 --> A3["Export XPU runtime envvars"]
    A3 --> A4["Export FI_MR_CACHE_MONITOR=userfaultfd (if unset)"]
    Su --> Su1["Call ezpz_load_modules_sunspot"]
    Su1 --> Su2["module load oneapi/release/2025.3.1 hdf5 pti-gpu"]
    Su2 --> Su3["Export XPU runtime envvars"]
    P --> P1["Call ezpz_load_modules_polaris"]
    P1 --> P2["module use /soft/modulefiles"]
    P2 --> P3["module load PrgEnv-gnu craype-x86-milan cray-hdf5-parallel cudnn gcc-native"]
    P3 --> P4["Export CC CXX CUDA_* paths"]
    P4 --> P5["Export NCCL TensorRT HTTP proxy vars"]
    P5 --> P6["Export MPICH GPU JAX XLA envvars"]
    A4 --> V["uv venv --python=$(which python3) and activate"]
    Su3 --> V
    P6 --> V
    V --> End([Custom venv on system stack])
```
File-Level Changes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6f9b86c63
```bash
export CUDA_PATH="${CUDA_HOME}"
export CUDA_TOOLKIT_BASE="${CUDA_HOME}"
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"
```
Avoid exporting empty LD_LIBRARY_PATH entry
When LD_LIBRARY_PATH is unset, this assignment expands to a trailing : (empty path element), which makes the dynamic loader search the current working directory for shared libraries. In ezpz_load_modules_polaris, that can lead to unexpected library resolution (or loading unintended .so files) on clean shells before launching jobs, causing hard-to-debug runtime behavior. Build this string conditionally so no empty segment is introduced when the variable is initially absent.
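One way to build the string conditionally, as the comment suggests, is the `${VAR:+word}` expansion, which only emits the separator when the variable is already set and non-empty. A sketch (the helper name is hypothetical, not from this PR):

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH without
# ever creating an empty path element (which the dynamic loader treats
# as the current working directory) when the variable starts out unset.
ezpz_prepend_ld_library_path() {
    export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Equivalent one-liner for the quoted diff line:
#   export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

Unlike `"${LD_LIBRARY_PATH:-}"`, this form also avoids the trailing `:` when the variable was initially absent.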
Hey - I've found 1 issue, and left some high level feedback:
- Aurora and Sunspot share identical module loads and XPU-related environment exports; consider factoring this into a single helper (e.g., `ezpz_load_modules_xpu_common`) to avoid future drift between the two stacks.
- In `ezpz_load_modules_polaris`, adding `${_nccl}/include` to `PATH` is unusual for header directories and may be unintended; you may want to export it via `CPATH`/`C_INCLUDE_PATH` instead, or drop it if not needed.
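A rough sketch of the factoring suggested in the first bullet. The module list comes from this PR's flow diagram; the specific envvar values below are illustrative assumptions, not the real `utils.sh` code:

```shell
# Sketch only — shared Aurora/Sunspot stack per the reviewer's suggestion.
ezpz_load_modules_xpu_common() {
    # oneAPI + HDF5 + PTI stack both systems load (per the PR description).
    module load oneapi/release/2025.3.1 hdf5 pti-gpu
    # XPU runtime envvars; exact values here are assumed for illustration.
    export ZE_FLAT_DEVICE_HIERARCHY=FLAT
    export CCL_PROCESS_LAUNCHER=pmix
    export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
}

ezpz_load_modules_aurora() {
    ezpz_load_modules_xpu_common
    # Aurora-only addition, matching ezpz_setup_conda_aurora.
    export FI_MR_CACHE_MONITOR="${FI_MR_CACHE_MONITOR:-userfaultfd}"
}

ezpz_load_modules_sunspot() {
    ezpz_load_modules_xpu_common
}
```

With this split, any future change to the shared stack lands in one place instead of two.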
```bash
local _nccl="/soft/libraries/nccl/nccl_2.28.3-1+cuda12.9_x86_64"
local _trt="/soft/libraries/trt/TensorRT-10.13.3.9.Linux.x86_64-gnu.cuda-12.9"
export PATH="${_nccl}/include:${PATH}"
export LD_LIBRARY_PATH="${_nccl}/lib:${_trt}/lib:${LD_LIBRARY_PATH}"
```

issue (bug_risk): Guard `LD_LIBRARY_PATH` expansion to avoid failures under `set -u` (in `src/ezpz/bin/utils.sh`, around line 1015)

The last line references `LD_LIBRARY_PATH` without guarding for the unset case, unlike the earlier `"${LD_LIBRARY_PATH:-}"` pattern. Under `set -u` this will cause the function to fail. Align this line with the earlier pattern:

```bash
export LD_LIBRARY_PATH="${_nccl}/lib:${_trt}/lib:${LD_LIBRARY_PATH:-}"
```
Summary
Adds three symmetric module-loader functions to `bin/utils.sh`:
`ezpz_load_modules_aurora`, `ezpz_load_modules_sunspot`, and `ezpz_load_modules_polaris`.
Each loads everything its system's canonical environment module would
have pulled in (the framework module on Aurora/Sunspot, the conda
module on Polaris) except the framework/conda module itself — so
a user can stand up their own `uv venv` on top of the system stack
without dragging in the prebuilt PyTorch / Python.
Pairs naturally with `ezpz tar-env` + `ezpz yeet`:
```bash
ezpz_load_modules_polaris # or _aurora / _sunspot
uv venv --python=$(which python3)
source .venv/bin/activate
uv pip install ...
ezpz tar-env
ezpz yeet .venv.tar.gz
```
Implementation notes

- aurora: loads the same oneAPI + HDF5 + PTI stack as
  `ezpz_setup_xpu` and exports XPU runtime envvars + Aurora-specific
  `FI_MR_CACHE_MONITOR=userfaultfd` (matches
  `ezpz_setup_conda_aurora`).
- sunspot: same module loads and XPU runtime envvars (matches
  `ezpz_setup_conda_sunspot`).
- polaris: loads the conda module's deps
  (`PrgEnv-gnu`, `craype-x86-milan`, `cray-hdf5-parallel`, `cudnn`,
  `gcc-native`) and exports the matching envvar set (CUDA paths,
  NCCL/TensorRT lib paths, ALCF proxy, MPI/JAX hints).
  Reverse-engineered from
  `/soft/modulefiles/conda/2025-09-25.lua` to stay in sync with the
  canonical module's behavior.
Test plan
Summary by Sourcery
Add helper functions to load system module stacks for ALCF Aurora, Sunspot, and Polaris without activating their framework/conda environments, enabling custom virtualenvs on top of the canonical stacks.
New Features: