Slurm :id=slurm

The official Slurm documentation can be found at the SchedMD site.

Why should I use Batch?

Whilst your desktop or laptop computer has a fast processor and quick access to data stored on its local hard disk or SSD, you may want to run large compute tasks that need more CPU, GPU, or memory than it can offer, or that process a large amount of data. Our compute servers are part of a Batch system that allows you to accomplish such tasks in a reasonable amount of time. Our servers also have fast access to centralised storage, come with widely used software packages and images pre-installed, and let you run these larger compute tasks without impacting your own desktop or laptop resources.

Why should I use Slurm?

Historically, SLAC has used IBM's LSF as our Batch scheduler software. However, with the addition of new hardware such as our NVIDIA GPUs, we have decided to switch to Slurm to schedule compute jobs as it is also commonly used across other academic and laboratory environments. We hope that this commonality and consistency with other facilities will enable easier usage for users, as well as simpler administration for the Science Computing team here at SLAC.

What should I know about using Batch?

The purpose of a batch system is to enable efficient sharing of the CPUs, GPUs, memory, and ephemeral storage that exist in a compute cluster. The cluster is composed of many servers, often called batch nodes. As the number of these batch nodes and their resources (such as GPUs) in our environment is finite, we need to keep account of which users consume which resources so that we can provide access to all users in a fair manner.

What is a Slurm Partition? :id=partition

A partition is a logical grouping of batch nodes. These are servers of a similar technical specification (e.g. Cascade Lake CPUs, Tesla GPUs, etc.). Examples of partition names are roma and milano.

How do I see the status of the available resources?

To view the status of the nodes on SDF from the command line use sinfo. The following produces a reasonably informative summary of all the nodes on SDF:

sinfo --Node --format="%10N %.6D %10P %10T %20E %.4c %.8z %8O %.6m %10e %.6w %.60f"

To get information on only a specific partition, use --partition=<partition>. To get more information on a specific node, use the following scontrol command:

scontrol show node <node name>

The names of the nodes can be found in the left-most column (called NODELIST) of the output of the above sinfo command.
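As a rough sketch of how these two commands fit together, the following lists the nodes of a single partition and then shows the full record of the first node returned. The partition name milano is simply one of the examples mentioned earlier, and the shortened format string is an arbitrary choice; adjust both to suit.

```shell
# Assumption: "milano" is used only because it is one of the example partitions.
partition="milano"

if command -v sinfo >/dev/null 2>&1; then
    # One line per node in the partition: name, state, CPUs, memory (MB).
    sinfo --partition="$partition" --Node --format="%10N %10T %.4c %.8m"

    # Take the first node name (header suppressed) and show its full record.
    first_node=$(sinfo --partition="$partition" --Node --noheader --format="%N" | head -n 1)
    if [ -n "$first_node" ]; then
        scontrol show node "$first_node"
    fi
else
    echo "Slurm commands not found; run this on a host where Slurm is installed." >&2
fi
```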

How do I use Slurm? :id=slurmexample

There are two ways to interact with Slurm:

  • Using command line tools on the interactive pools.
  • Using the ondemand web interface.

Common actions that you may want to perform are:

| Action | Command | Description |
|---|---|---|
| Submit a job | srun or sbatch | Request a job to be run, e.g. an interactive terminal with srun or longer job(s) with sbatch |
| Show information about a job | scontrol show job <jobid> | Shows detailed information about the state, resources requested, etc. for a job |
| Cancel or terminate a job | scancel <jobid> | Cancels a job; you can also use --signal=INT to send a Unix signal to the job to terminate it cleanly |
| Show position in the queue | sprio | Shows the fairshare calculations that determine your place in line for the job to start |
| Show running statistics about a job | sstat | Shows job usage details |
| Modify accounts/add users to partitions etc. | sacctmgr | Manages associations |

How do I submit a Batch Job?

In order to submit a batch job, you have to:

  1. create a text file containing some Slurm directives (lines starting with #SBATCH) and a list of commands/programs that you wish to run. This is called a batch script.
  2. submit this batch script to the cluster using the sbatch command
  3. monitor the job using scontrol show job

Create a Batch Script

Create a job submission script (text file) script.sh (or whatever filename you wish):

#!/bin/bash
 
#SBATCH --partition=ampere
#
#SBATCH --job-name=test
#SBATCH --output=output-%j.txt
#SBATCH --error=output-%j.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1g
#
#SBATCH --time=0-00:10:00
#
#SBATCH --gpus 1
 
<commands here>

In the above example, we write a batch script for a job named 'test' (using --job-name). You can give the job whatever name you wish so that you can quickly identify it yourself. Both stdout and stderr will be written to the same file, output-%j.txt, in the current working directory (using the --output and --error options); %j will be replaced with the Slurm job id. We request a single task (think of it as an MPI rank), and that single task requests 12 CPUs, each of which will be allocated 1GB of RAM, for a total of 12GB. By default, --ntasks will be equal to the number of nodes (servers) asked for. In order to aid scheduling (and potentially prioritising the job), we limit the duration of the job to 10 minutes. The format of the time limit field is D-HH:MM:SS. We also request a single GPU for the job. This will be exposed via CUDA_VISIBLE_DEVICES.
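To make the arithmetic behind the request explicit, this small sketch recomputes the totals from the directives in the script. The variable names are purely illustrative and are not something Slurm itself defines.

```shell
# Values copied from the #SBATCH directives in the example script.
ntasks=1            # --ntasks=1
cpus_per_task=12    # --cpus-per-task=12
mem_per_cpu_gb=1    # --mem-per-cpu=1g

# Memory is requested per CPU, so it scales with the CPU count.
total_cpus=$((ntasks * cpus_per_task))
total_mem_gb=$((total_cpus * mem_per_cpu_gb))

echo "CPUs requested:   $total_cpus"
echo "Memory requested: ${total_mem_gb}GB"
```

Doubling --cpus-per-task while leaving --mem-per-cpu unchanged would therefore also double the memory that the job asks for.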

?> TIP: only lines starting with #SBATCH will be processed by Slurm. As the script itself is just a bash script, bash treats any line beginning with # as a comment and ignores it. As such, you may also comment out Slurm directives by using something like ##SBATCH

We can define where the job will run using the --partition option.

?> You can think of the batch script as a shell script with some extra directives that only Slurm understands. As such, you can also just run the same script on the command line to check that your job will work; i.e. sh script.sh will run the same set of commands, but on the local host. This also means that if you already have a shell script that runs your code, you can 'slurmify' it by adding Slurm directives with #SBATCH. Please note, however, that if your code uses GPUs, the login node may not have any GPUs and hence your local run will fail.

?> add something about submitting sbatch commands directly without using a batch script, i.e. --wrap
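A minimal sketch of submitting without a batch script, using sbatch's --wrap option, which wraps the given command string in a simple shell script for you. The job name wrap-test, the 5 minute limit and the hostname payload are placeholders chosen for illustration; depending on how your account is set up you may also need to pass an account with -A.

```shell
# The command to run on the compute node; "hostname" is only a placeholder workload.
payload="hostname"

if command -v sbatch >/dev/null 2>&1; then
    # Options given here play the same role as #SBATCH lines in a batch script.
    sbatch --partition=milano \
           --job-name=wrap-test \
           --output=output-%j.txt \
           --time=0-00:05:00 \
           --wrap="$payload"
else
    echo "sbatch not found; would have submitted: $payload" >&2
fi
```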

Specifying how long the job should run for :id=time

It is important that you specify a meaningful duration for which you expect your job to run. This allows the Slurm scheduler to appropriately prioritise your job against other jobs that are competing for the limited resources in the cluster. The duration of a job may depend upon many different factors, such as the type of hardware that you constrain your job to run on, how well your code/application scales across multiple nodes, and the speed of memory and disk access.

You can specify the expected duration with the --time option. Valid time formats are:

M (M minutes)
M:S (M minutes, S seconds)
H:M:S (H hours, M minutes, S seconds)
D-H (D days, H hours)
D-H:M (D days, H hours, M minutes)
D-H:M:S (D days, H hours, M minutes, S seconds)
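To illustrate how these formats are read, here is a small helper that converts a --time value into seconds. It is only a sketch for checking your own values (the function name slurm_time_to_seconds is made up here) and it does not validate its input the way Slurm does.

```shell
# Convert a Slurm --time value (M, M:S, H:M:S, D-H, D-H:M, D-H:M:S) to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        spec = $0
        days = 0; h = 0; m = 0; s = 0

        # A dash separates the day count from the rest of the value.
        dash = index(spec, "-")
        if (dash > 0) {
            days = substr(spec, 1, dash - 1)
            spec = substr(spec, dash + 1)
        }

        n = split(spec, f, ":")
        if (dash > 0) {
            # With days, the fields read left to right as hours, minutes, seconds.
            h = f[1]
            m = (n >= 2) ? f[2] : 0
            s = (n >= 3) ? f[3] : 0
        } else if (n == 3) {
            h = f[1]; m = f[2]; s = f[3]
        } else if (n == 2) {
            m = f[1]; s = f[2]
        } else {
            # A bare number is a count of minutes.
            m = f[1]
        }

        total = ((days * 24 + h) * 60 + m) * 60 + s
        print total
    }'
}

slurm_time_to_seconds "0-00:10:00"
slurm_time_to_seconds "1-12"
```

For example, 0-00:10:00 and 10 both come to 600 seconds, while 1-12 is one and a half days, or 129600 seconds.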

Once the job exceeds the specified time limit, it will be terminated. Unless you checkpoint your application as it progresses, this may result in wasted cycles and the need to submit the job again with a longer duration.

Specifying CPU requirements :id=cpu

!> TODO

Specifying memory requirements :id=memory

!> TODO

Specifying nodes with specific resources (constraints) :id=constraints

!> TODO

Specifying local scratch space :id=scratch

!> TODO

Notification of job status :id=notification

Changing the working directory :id=workingdir

!> TODO: difference between cd'ing in the script and --chdir

Specifying a reservation :id=reservation

Submit the job :id=submit-batch-script

?> note stuff about working directories etc.

After you have created a batch script, you then need to tell Slurm to queue it so that it may run. The command to use is sbatch, which is analogous to the bsub command in LSF. Therefore, to submit the script script.sh we simply run

sbatch script.sh

If successful, it should provide you with the job id that the script will run as. You can use this job id to monitor your job progress.
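On success sbatch prints a line of the form "Submitted batch job <jobid>". The sketch below pulls the job id out of that message so it can be reused in later commands; the helper name job_id_from_sbatch_output is invented for this example, and the sample id 12345 is only there so that the parsing can be tried on a machine without Slurm.

```shell
# Extract the numeric job id from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch_output() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

if command -v sbatch >/dev/null 2>&1; then
    submit_output=$(sbatch script.sh)
    job_id=$(job_id_from_sbatch_output "$submit_output")
    if [ -n "$job_id" ]; then
        echo "Submitted as job $job_id"
        scontrol show job "$job_id"
    fi
else
    # Without Slurm installed, demonstrate the parsing on a sample message.
    job_id_from_sbatch_output "Submitted batch job 12345"
fi
```

If you would rather avoid parsing the message, sbatch also accepts --parsable, which prints just the job id (followed by a semicolon and the cluster name where one applies).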

?> TIP: you can also supply the Slurm directives directly on the command line rather than within the batch script. When given as arguments to srun or sbatch, they take precedence over the same directives specified in the batch script. E.g. if you run sbatch --partition ml script.sh and your script.sh contains a directive to use the shared partition, your job will be submitted to the ml partition.

Monitor job progress :id=monitor-job

You can then use the following command to monitor your job's progress:

squeue

And you can cancel the job with:

scancel <jobid>
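Run on its own, squeue lists jobs from every user, which can be a long list on a busy cluster. A hedged sketch of narrowing it down to your own jobs follows; the column selection in the format string is just one reasonable choice.

```shell
# Look up the current username rather than relying on $USER being set.
me=$(id -un)

if command -v squeue >/dev/null 2>&1; then
    # Columns: job id, partition, job name, state (e.g. PD = pending, R = running),
    # elapsed time, and the reason for waiting or the nodes in use.
    squeue --user="$me" --format="%.18i %.9P %.20j %.2t %.10M %R"
else
    echo "squeue not found; run this on a host where Slurm is installed." >&2
fi
```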

How can I request GPUs?

You can use the --gpus option to request GPUs for your jobs. Using a number will request that many GPUs of whatever type is available on the partition that you choose. The type of GPU you get will depend upon the partition (cluster) that you request.

# request single gpu
srun -A <account name> -p ampere -n 1 --gpus 1 --pty /bin/bash
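The same request can be written as a batch script. This is a sketch only: <account name> must be replaced with your own account, the 10 minute limit is arbitrary, and the commented-out line shows the type:count form of --gpus using the a100 type that appears in the GRES column for the ampere partition.

```shell
#!/bin/bash
#SBATCH --partition=ampere
#SBATCH --account=<account name>
#SBATCH --ntasks=1
#SBATCH --gpus=1
##SBATCH --gpus=a100:1
#SBATCH --time=0-00:10:00

# Slurm exports CUDA_VISIBLE_DEVICES for the GPUs allocated to the job.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: $visible_gpus"

# nvidia-smi is normally only present on GPU nodes, so guard the call.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --list-gpus
fi
```

Running the script locally with sh on a machine without GPUs simply reports "none", which is a quick way to check the script before submitting it.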

How can I see what GPUs are available?

# sinfo -o "%12P %5D %14F %7z %7m %10d %11l %42G %38N %f"
PARTITION    NODES NODES(A/I/O/T) S:C:T   MEMORY  TMP_DISK   TIMELIMIT   GRES                                       NODELIST                               AVAIL_FEATURES
roma*        61    56/0/5/61      2:64:1  512000  0          10-00:00:00 (null)                                     sdfrome[003-063]                       CPU_GEN:RME,CPU_SKU:7702,CPU_FRQ:2.00GHz
milano       135   52/78/5/135    2:64:1  512000  0          10-00:00:00 (null)                                     sdfmilan[001-072,101-131,201-232]      CPU_GEN:RME,CPU_SKU:7713,CPU_FRQ:2.00GHz
ampere       23    23/0/0/23      2:64:2  1029344 0          10-00:00:00 gpu:a100:4                                 sdfampere[001-023]                     CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
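The GRES column reports GPUs per node, so the total for a partition is the node count multiplied by that figure. The sketch below does that multiplication on a reduced sinfo listing; the helper name count_gpus is invented here, and it assumes a single gpu entry per GRES value in the gpu:<type>:<count> or gpu:<count> form.

```shell
# Sum GPUs per partition from "partition nodes gres" lines,
# e.g. "ampere 23 gpu:a100:4" means 23 nodes with 4 GPUs each.
count_gpus() {
    awk '$3 ~ /^gpu:/ {
        n = split($3, parts, ":")
        # gpu:<type>:<count> has the count in field 3, gpu:<count> in field 2.
        count = (n >= 3) ? parts[3] : parts[2]
        total[$1] += $2 * count
    }
    END { for (p in total) print p, total[p] }'
}

if command -v sinfo >/dev/null 2>&1; then
    sinfo --noheader -o "%P %D %G" | count_gpus
else
    # Sample rows reduced from the listing in this section.
    printf '%s\n' "roma* 61 (null)" "milano 135 (null)" "ampere 23 gpu:a100:4" | count_gpus
fi
```

For the sample rows this prints "ampere 92", i.e. 23 nodes with 4 A100s each; roma and milano have no GRES entry and so do not appear.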

Why is my job taking a long time to start?

This is often due to limited resources. The simplest remedy is to request fewer CPUs (--cpus-per-task) or less memory (--mem) for your job. However, this will also likely increase the amount of time that the job needs to complete. Note that perfect scaling is often very difficult (i.e. using 16 CPUs will not necessarily run twice as fast as 8 CPUs, just as using 4 nodes via MPI will not necessarily run twice as fast as 2 nodes), so it may be beneficial to submit many smaller jobs if your code allows it. You can also set the --time option to specify that your job will only run up to that amount of time so that the scheduler can better fit your job in.

You can also make use of the Scavenger QoS so that your job may run on any available resources at SLAC. This, however, has the disadvantage that, should higher-priority jobs need the same resources, your job may be terminated (preempted), possibly before it has completed.