This document explains how to set up and run language model training on Olivia, Norwayβs national supercomputer.
The guide is reproducible for users at the National Library under project nn30001k and is built around Apptainer containers and SquashFS overlays for reproducible environments.
It might be useful for others as well, so this repository is public. Thanks to Magnar BjΓΈrve at NRIS for assisting with this setup.
From this repository:
wget https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/build_overlay.slurm
wget https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/train_single.slurm
wget https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/requirements_simplified.txt
wget https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/test_imports.py
wget https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/README.mdPlace all of these files in:
/cluster/work/projects/nn30001k/$USER/code/
Oliviaβs compute nodes run Apptainer containers on a shared parallel filesystem (Lustre).
Because Lustre performs poorly with many small files, we package Python environments into a compressed read-only overlay (.sqsh).
This overlay is mounted together with a base container at runtime.
| Component | Description |
|---|---|
pytorch_nvidia_25.06_arm64.sif |
Base container with CUDA + PyTorch |
myenv_<hash>_arm64.sqsh |
Squashed Python venv overlay |
build_overlay.slurm |
Builds the overlay |
train_single.slurm |
Runs training |
requirements_simplified.txt |
Python requirements |
test_imports.py |
Sanity test of the environment |
nb-gpt-posttrain |
Example training repository (Nynorsk/BokmΓ₯l SFT) |
/cluster/work/projects/nn30001k/$USER/
βββ apptainer_cache/ # Cache for pulled images
βββ containers/ # Base .sif containers
βββ overlays/ # Squashed Python venvs (.sqsh)
βββ code/ # Scripts, requirements, helper code
βββ nb-gpt-posttrain/ # Training repo (large)
βββ runs/ # Output models & checkpoints
βββ logs/ # SLURM logs
βββ hf_cache/ # Hugging Face cache
cd /cluster/work/projects/nn30001k/$USER/
git clone https://github.com/NationalLibraryOfNorway/nb-gpt-posttrain.gitcd /cluster/work/projects/nn30001k/$USER/
mkdir -p {apptainer_cache,containers,overlays,code,runs,logs,hf_cache}export MYROOT=/cluster/work/projects/nn30001k/$USER
export APPTAINER_CACHEDIR=$MYROOT/apptainer_cache
apptainer pull --arch arm64 \
$MYROOT/containers/pytorch_nvidia_25.06_arm64.sif \
docker://nvcr.io/nvidia/pytorch:25.06-py3requirements_simplified.txt (already included):
trl
datasets
sacrebleu
wandb
Submit the job:
cd /cluster/work/projects/nn30001k/$USER/code
sbatch build_overlay.slurmResult:
/cluster/work/projects/nn30001k/$USER/overlays/myenv_<hash>_arm64.sqsh
myenv_<12-char SHA256 hash of requirements_simplified.txt>_arm64.sqsh
Example:
myenv_998211f8576b_arm64.sqsh
Submit:
cd /cluster/work/projects/nn30001k/$USER/
sbatch code/train_single.slurmMonitor:
tail -f logs/nb_train_1gpu_*.{out,err}The .sqsh file is a read-only compressed filesystem containing a Python virtual environment.
Mount order:
Base container (.sif)
β
Overlay (.sqsh)
β
Merged environment
Inside the container:
/user-software/bin/activate
/user-software/lib/python3.x/site-packages/
To add packages:
- Edit
requirements_simplified.txt - Run:
cd /cluster/work/projects/nn30001k/$USER/code sbatch build_overlay.slurm
- A new
.sqshwith a new hash will be created.
Proxy settings (already inside the SLURM scripts):
export http_proxy=http://10.63.2.48:3128/
export https_proxy=http://10.63.2.48:3128/
export no_proxy="localhost,127.0.0.1,.local,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"mkdir -p ~/.huggingface
echo "<your_token_here>" > ~/.huggingface/tokenThe scripts automatically export HF_TOKEN inside the container.
| Path | Contents |
|---|---|
logs/ |
SLURM logs |
runs/ |
Training outputs |
hf_cache/ |
Hugging Face cache |
cd /cluster/work/projects/nn30001k/$USER/
git clone https://github.com/NationalLibraryOfNorway/nb-gpt-posttrain.git
mkdir -p {apptainer_cache,containers,overlays,code,runs,logs,hf_cache}
apptainer pull --arch arm64 containers/pytorch_nvidia_25.06_arm64.sif docker://nvcr.io/nvidia/pytorch:25.06-py3
wget -P code/ https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/build_overlay.slurm
wget -P code/ https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/train_single.slurm
wget -P code/ https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/requirements_simplified.txt
wget -P code/ https://raw.githubusercontent.com/NbAiLab/NB-Olivia-tutorial/main/test_imports.py
sbatch code/build_overlay.slurm
sbatch code/train_single.slurmPer Egil Kummervold
Senior Researcher β National Library of Norway
Project: nn30001k