Organizing your Slurm batch scripts

About

Note

We expect users of the Discoverer CPU and GPU HPC clusters to be familiar with the Slurm batch script syntax and the basic concepts of resource allocation. There are plenty of HPC courses and initiatives through which this type of knowledge can be acquired.

The purpose of this document is to provide users with specific instructions for preparing Slurm jobs that are acceptable for submission to the Slurm queue manager running on the Discoverer and Discoverer+ HPC clusters. Therefore, here we show which options to include, and which to avoid, in the Slurm batch scripts used on Discoverer and Discoverer+.

Discoverer CPU Cluster

Before starting, be sure you have read the Resource Overview document to understand which node partitions we support for your jobs and which resource allocators are available.

Important

Use the ‘cn’ partition for your jobs, unless you have a specific reason to use a different partition mentioned in Resource Overview. Do not select the ‘ALL’ partition for your jobs. Jobs submitted to it are subject to administrative termination at any given moment.

Given below is a simple Slurm batch script that illustrates how to allocate resources for running a job on the Discoverer CPU cluster:

#!/bin/bash

#SBATCH --partition=cn
#SBATCH --job-name=sponge
#SBATCH --account=my_account_name
#SBATCH --qos=my_qos_name
#SBATCH --time=48:00:00

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=251G

#SBATCH --exclusive

#SBATCH -o job.%j.out
#SBATCH -e job.%j.err

#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email-address@domain.edu

module purge
module load gromacs/2023/2023.1-intel-fftw3-openblas-nogpu-openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=false
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun gmx_mpi mdrun \
       -ntomp ${SLURM_CPUS_PER_TASK} \
       -v -s sponge_GMO_PGL_5_5-equil_20.tpr \
       -deffnm sponge_GMO_PGL_5_5-equil_20

Below we comment on the Discoverer CPU cluster-specific options:

--partition

The --partition option may take the name of any partition mentioned in Resource Overview. However, we suggest using the ‘cn’ partition for your jobs, unless you have a specific reason to use a different partition listed there.

--mem

The --mem option is used to specify the memory required for the job. For CPU cluster jobs, the --mem option is mandatory. The maximum memory allocation per node for the cn partition is 251GB, which means all allocated cores on a node may collectively use at most 251GB of memory on that node. The smartest approach is to request the amount of memory that matches your application’s actual memory requirements, rather than insisting on the maximum every time; this requires some knowledge of your application’s memory footprint. Note that our fat nodes have 512GB of memory, but their number is limited (see Resource Overview).
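For example, if you have established that your application needs about 120GB per node (the 120G figure below is purely illustrative, not a recommendation), a request like the following avoids reserving memory you will not use:

#SBATCH --partition=cn
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --mem=120G   # request only what the application actually needs (illustrative value)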

--ntasks-per-node

The --ntasks-per-node option is used to specify the number of tasks to be executed per node. For CPU cluster jobs, the --ntasks-per-node option is mandatory and must match the number of processes that will be executed on each node. If you are running MPI parallel jobs, this option must be set to the number of MPI ranks per node. For example, if you are running a job with 128 MPI ranks per node, you should set --ntasks-per-node=128.
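As a sketch, a pure MPI job spanning 4 nodes with 128 ranks on each of them (the 4-node figure and the executable name are illustrative) could be requested as follows:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128   # 128 MPI ranks per node
#SBATCH --ntasks-per-core=1     # one rank per physical core

# mpirun inherits the task layout from Slurm: 4 x 128 = 512 ranks in total
mpirun ./my_mpi_application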

--ntasks-per-core

The --ntasks-per-core option is used to specify the number of tasks to be executed per CPU physical core. For CPU cluster jobs, the --ntasks-per-core option must match the number of processes that will be executed on each CPU physical core. If you are running MPI parallel jobs, this option should normally be set to 1. For example, if you are running a job with 128 MPI ranks per node, you should set --ntasks-per-core=1.

Before making any decision about the value of this option, you should understand the implications of hyperthreading on the performance of your job. You may need to set it to 2 if your job is memory-bound and you are using hyperthreading, but be sure you realise the impact of this decision on the performance of your job.
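For instance, placing two tasks on each physical core (so that both hardware threads of the core are used) would look like the sketch below; whether this helps or hurts performance depends entirely on your application:

#SBATCH --ntasks-per-node=256   # twice the number of physical cores on the node
#SBATCH --ntasks-per-core=2     # two tasks share each physical core (hyperthreading)
#SBATCH --cpus-per-task=1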

--cpus-per-task

The --cpus-per-task option is used to specify the number of CPUs per task. For CPU cluster jobs, the --cpus-per-task option must match the number of CPU threads that will be executed within each task. On the Discoverer CPU cluster each node has 128 CPU cores and each core provides 2 hardware threads, so the maximum value of --cpus-per-task you may request is 256, in case you have --ntasks-per-node=1. That means you may run only one task per node, requesting the use of all 256 CPU threads available on the node. If you are running hybrid MPI/OpenMP parallelised jobs, this option must be set to the number of CPU threads per MPI rank. For example, if you are running a job with 128 MPI ranks per node that exploits hyperthreading, you should set --ntasks-per-node=128, --ntasks-per-core=1 and --cpus-per-task=2. The latter corresponds to maximum utilisation of the node’s CPU resources.
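As an alternative hybrid layout (purely illustrative, assuming your application scales well across OpenMP threads; the executable name is a placeholder), 32 MPI ranks per node with 8 CPUs per rank also fills all 256 hardware threads of a node:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32   # 32 MPI ranks per node
#SBATCH --cpus-per-task=8      # 8 hardware threads per rank (32 x 8 = 256)

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
mpirun ./my_hybrid_application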

--exclusive

This option means your job will run alone on the allocated nodes; no other jobs will run on the same nodes. This is a good way to avoid resource contention and to ensure that your job runs efficiently. But exclusivity has its downsides: your account will be charged for a full node allocation even if you use fewer than 128 CPU cores and 256 CPU threads. So be careful when using this option, or you will be overcharged for your job.
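If your job needs only a fraction of a node, it may be cheaper to drop --exclusive and request just the cores and memory you actually need, for example (illustrative values):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16   # only 16 cores requested
#SBATCH --mem=32G              # only 32GB of memory requested
# no --exclusive: the remaining cores and memory of the node stay available to other jobs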

export UCX_NET_DEVICES=mlx5_0:1

When your job involves inter-node communication, you should specify the UCX network devices to be used for that communication. Our recommendation is to always specify the InfiniBand network device for the communication.
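The export goes into the job script before the mpirun invocation. If you want to check which UCX devices are visible on the allocated node, the ucx_info utility (part of the UCX installation, assuming it is available in your environment) can list them:

export UCX_NET_DEVICES=mlx5_0:1   # route UCX traffic through the InfiniBand HCA

# optional sanity check: list the devices UCX detects on the node
ucx_info -d | grep -i device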

Discoverer+ GPU Cluster

Before starting, be sure you have read the Resource Overview document to understand which node partitions we support for your jobs and which resource allocators are available.

Important

Use the ‘common’ partition for your GPU jobs. The --gres=gpu:1 directive is mandatory for all jobs utilising GPU devices. CPU-only jobs may be run on Discoverer+ for the sake of software installation, using a special QoS.

Given below is a simple Slurm batch script that illustrates how to allocate resources for running a job on the Discoverer+ GPU cluster:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=gpu_job
#SBATCH --account=my_account_name
#SBATCH --qos=my_qos_name
#SBATCH --time=48:00:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G

#SBATCH -o job.%j.out
#SBATCH -e job.%j.err

#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email-address@domain.edu

module purge
module load cuda/12.1
module load python/3.9

cd $SLURM_SUBMIT_DIR

python my_gpu_script.py

Below we comment on the Discoverer+ GPU cluster-specific options:

--partition

The --partition option must be set to ‘common’. This partition contains only nodes with GPU resources.

--gres

The --gres option is mandatory if your job needs to utilise the NVIDIA H200 GPU accelerators installed on the nodes. It is designed to specify the type and amount of GPU resources required for your job. On the Discoverer+ GPU cluster you do not need to specify the type of the GPU, but you must specify the number of GPUs per job and per node. For example, if you are running a job with 1 GPU per job and 1 GPU per node, you should set --gres=gpu:1. If you are running a job with 2 GPUs per job and 2 GPUs per node, you should set --gres=gpu:2. Note that each node is equipped with 8 NVIDIA H200 GPUs.
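For instance, a single-node job that drives two of the eight H200 devices could request the following (a sketch; the CPU and memory figures are illustrative and should be adjusted to your workload):

#SBATCH --partition=common
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2   # two H200 GPUs on the node
#SBATCH --mem=64G

nvidia-smi -L          # optional: list the GPUs Slurm made visible to the job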

--mem

The --mem option is used to specify the memory required for the job. For GPU cluster jobs, the --mem option is not mandatory. The memory allocation should be appropriate for your GPU application. Typical values range from 16GB to 64GB, depending on your application’s requirements. Be sure you know those values for your application; otherwise you may be overcharged for your job.

--ntasks-per-node

The --ntasks-per-node option is used to specify the number of tasks to be executed per node. For GPU cluster jobs, the --ntasks-per-node option is mandatory and must match the number of host processes that support the execution of the code on the GPU accelerators. If you are running MPI parallel jobs, this option must be set to the number of MPI ranks per node. For example, if you are running a job with 1 MPI rank per node, you should set --ntasks-per-node=1.

--ntasks-per-core

The --ntasks-per-core option is used to specify the number of tasks to be executed per CPU physical core. For GPU cluster jobs, the --ntasks-per-core option must match the number of processes that will be executed on each CPU physical core. If you are running MPI parallel jobs, this option must be set to 1. For example, if you are running a job with 1 MPI rank per node, you should set --ntasks-per-core=1.
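A common multi-GPU pattern is one MPI rank per GPU. A sketch for four GPUs on a single node (assuming your application binds one rank to one GPU; the CPU figure and executable name are illustrative) might look like:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # one MPI rank per GPU
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=8     # host CPUs supporting each rank
#SBATCH --gres=gpu:4

mpirun ./my_multi_gpu_application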

--cpus-per-task

The --cpus-per-task option specifies the number of CPU cores per task. For GPU jobs, you typically need fewer CPU cores than for CPU-only jobs, as the main computation happens on the GPU. Typical values range from 4 to 16 CPU cores per GPU.
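For example, a single-GPU Python job whose input pipeline benefits from several host threads (e.g. data-loading or pre-processing workers; the value of 8 below is only an illustration) could request:

#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8   # host cores for data loading / pre-processing

python my_gpu_script.py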

--nodes

In most cases, you need to set --nodes=1. For multi-node GPU applications you may need to set --nodes=2, but that is rare, since a single job seldom requires the use of 16 NVIDIA H200 GPUs.