Organizing your Slurm batch scripts

About

Users of the Discoverer HPC cluster should consult the documentation of the batch submission tool sbatch to gain a thorough understanding of all control options and resource allocation mechanisms that can be employed in Slurm batch scripts:

https://slurm.schedmd.com/sbatch.html

This document gives directions on how to adapt existing Slurm batch scripts, or write them from scratch, so that they fit the Discoverer HPC compute node architecture and the implemented computational resource accounting.

Common resource allocators in Slurm batch scripts

Given below is a simple Slurm batch script that illustrates how to allocate resources for running a job on the Discoverer cluster:

#!/bin/bash

#SBATCH --partition=pm6-isw2,pm9-isw0,pm11-isw2
#SBATCH --job-name=sponge
#SBATCH --account=my_account_name
#SBATCH --qos=my_qos_name
#SBATCH --time=48:00:00

#SBATCH --nodes           2
#SBATCH --ntasks          256
#SBATCH --ntasks-per-core 1
#SBATCH --cpus-per-task   2

#SBATCH -o job.out
#SBATCH -e job.err

#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email-address@domain.edu

module purge
module load gromacs/2023/2023.1-intel-fftw3-openblas-nogpu-openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=false
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun gmx_mpi mdrun \
       -ntomp ${SLURM_CPUS_PER_TASK} \
       -v -s sponge_GMO_PGL_5_5-equil_20.tpr \
       -deffnm sponge_GMO_PGL_5_5-equil_20

Warning

In this document we discuss only those lines in the script that start with #SBATCH. Everything below those lines is application-specific!

That code has to be saved as a file and then submitted to the queue.
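For example, if the script is saved in a file named job.sbatch (the file name is purely illustrative), it can be submitted like this:

sbatch job.sbatch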

--partition

Discoverer HPC compute nodes are organized into partitions of nodes. Those partitions are described in Resource Overview (see “Partitions (of nodes)”).

One rack hosts 96 nodes and 4 InfiniBand switches. Within each rack, 24 nodes are connected to the same InfiniBand switch. This means that if your job needs very intensive and fast communication between the parallel tasks running on multiple nodes, you should select a partition whose nodes are connected, if not to the same switch, then at least to switches within the same rack.

Warning

Do not select the ‘ALL’ partition for your jobs. Jobs submitted to it may be terminated administratively at any moment.

To run your job only on nodes in partition pm6-isw2, i.e. nodes in rack #6 connected to InfiniBand switch #2, specify that partition alone:

#SBATCH --partition=pm6-isw2

Occasionally, the requested number of nodes in the selected partition might already be occupied by previously submitted jobs, and a job submitted to that partition will then be held in the queue until the requested number of nodes becomes available. To avoid that delay, you may select several partitions; the job will be executed on the first of them where the requested number of nodes becomes available:

#SBATCH --partition=pm6-isw2,pm9-isw0,pm11-isw2

In this particular example, Slurm will first attempt to execute the submitted job on the nodes located in pm6-isw2. If the number of nodes needed to run the job is not currently available in pm6-isw2, Slurm will try to run the job on nodes in pm9-isw0. If there are not enough free nodes in pm9-isw0 either, a new attempt to execute the job will be made on the nodes in pm11-isw2. If none of the selected partitions has enough free nodes to accommodate the job, it will be kept in the queue until that number of nodes becomes available.
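Before choosing partitions, you may want to check how many nodes in them are currently free. One way to do that is to ask sinfo for a summary of the node states in the partitions of interest (the partition names below are just the ones from the example above):

sinfo --partition=pm6-isw2,pm9-isw0,pm11-isw2 --summarize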

--job-name

This is the name under which the job will be known to Slurm. Naming the jobs helps you find them in the long list of submitted jobs. The person who submits the job chooses the job name.

It is noteworthy that the job name can be passed as a command line argument (-J) to the Slurm batch script:

sbatch -J my_job_name job.sbatch
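Once the job is submitted, you can locate it in the queue by its name, for instance with squeue (my_job_name is the illustrative name used above):

squeue --user=$USER --name=my_job_name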

--account

Providing the name of the Slurm account is mandatory. See Computational resources allocation and accounting.

Note

If you do not know your project’s Slurm account name, ask the Support (see Getting help).

--qos

Providing the name of the Slurm QoS is mandatory. See Computational resources allocation and accounting.

Note

If you do not know the Slurm QoS name, ask the Support (see Getting help).
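Provided that sacctmgr queries are permitted for regular users on the login nodes, the Slurm account and QoS associations of your user can usually be listed like this (a sketch; the fields actually displayed may differ):

sacctmgr show associations user=$USER format=account,qos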

--time

This is a rough estimate of the wall time of the job. It also acts as a run-time limit: if the job exceeds it, Slurm terminates the job. Always provide that value within the batch script. The preferred time format is hours:minutes:seconds.
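For example, a 48-hour limit can be declared in either of these two equivalent forms accepted by sbatch (the second one uses the days-hours:minutes:seconds notation):

#SBATCH --time=48:00:00
#SBATCH --time=2-00:00:00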

--ntasks

The number of tasks that the job will create and execute. When you run code in parallel, the number of tasks matches the number of processes (each with its own PID) running at the same time.

Important

Handling tasks may occasionally be challenging because of hyperthreading. See the comments on --ntasks-per-core below.

--nodes

This declaration defines how many nodes are needed to host and execute the number of tasks requested. The primary function of this declaration is to evenly distribute the N tasks across M nodes (M ≤ N). It should be noted that the distribution of tasks over numerous nodes may be restricted for your account (see Computational resources allocation and accounting).
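For instance, the declarations below (taken from the example script above) request 256 tasks distributed over 2 nodes, which places 128 tasks on each node:

#SBATCH --nodes  2
#SBATCH --ntasks 256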

Note

Ask the Support (see Getting help) if you are uncertain about how to estimate the right number of nodes for your job.

--ntasks-per-core and --cpus-per-task

The parameter --ntasks-per-core binds tasks to processor cores. It is useful for controlling the performance of parallel jobs that benefit from the hyperthreading enabled on our processors. A good example of such applications are those adopting a hybrid MPI/OpenMP model of parallel execution, where each MPI task is bound to a processor core.

--cpus-per-task defines the number of CPUs (processor threads) allocated to each running task, and hence the number of threads that task can run.

For instance, to run 128 MPI tasks on one of our compute nodes, and 2 OpenMP threads on top of each MPI task:

#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=2

But because each MPI task carries 2 OpenMP threads, all 256 processor threads on the compute node will become occupied/reserved.

If an MPI application does not implement a threading model, then a declaration like this one:

#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1

will run 128 MPI tasks and bind each of them to a processor core. Note, however, that binding 128 MPI tasks to 128 processor cores will fully reserve those cores, which in turn means that all 256 corresponding processor threads will be counted as reserved by Slurm for that particular job.

On the other hand, a declaration like this one (note the missing --ntasks-per-core):

#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1

will utilize 128 processor threads on one node (not 128 processor cores). The remaining 128 threads will be available for other jobs.
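To verify how many CPUs (processor threads) Slurm has actually reserved for a submitted job, you may inspect the job record, for example (replace <jobid> with the actual job ID):

scontrol show job <jobid> | grep -iE 'numcpus|cpus/task'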

Getting help

See Getting help