NUMA-aware SLURM job execution

Scope

This document addresses the correct configuration and execution of tightly-coupled, single-node, shared-memory parallel jobs on the Discoverer Petascale Supercomputer. The workload class in question is characterised by a single multi-threaded process that spawns a modest number of POSIX threads or OpenMP threads — typically between 16 and 64 — all executing within one operating system process and sharing a common address space. Such jobs are referred to in HPC practice as SMP-parallel or shared-memory parallel workloads, as distinct from distributed-memory parallel jobs that employ MPI across multiple nodes.

The document is concerned specifically with NUMA domain binding under SLURM workload manager control. This is a materially different operational context from running the same application directly on a dedicated workstation or an interactively accessed server, where the user has unrestricted access to the full machine and may apply numactl without any interaction with a batch scheduler. On Discoverer, compute node access is mediated exclusively through SLURM: the scheduler controls which CPUs are assigned to a job, enforces resource limits, and constructs the CPU affinity mask that the process inherits at launch. Correct NUMA binding therefore requires that both the SLURM directives and the numactl invocation are configured in a mutually consistent manner, as described in the sections that follow.

The principles described here apply to any shared-memory parallel application whose performance is sensitive to memory access latency and NUMA locality, including but not limited to applications based on OpenMP, Intel TBB, or native POSIX thread pools.

The document does not address MPI jobs, GPU-accelerated workloads, jobs that span more than one compute node, or interactive use of numactl outside the SLURM environment.

Hardware topology

The compute nodes of the Discoverer CPU cluster each carry two AMD EPYC 7H12 64-Core Processors (Rome microarchitecture, Zen2). The SLURM node declaration, on a per-node basis, has the following pattern (check /etc/slurm/partition.conf on the login or compute nodes):

NodeName=cn0738 Sockets=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257700 MemSpecLimit=20616

Each physical socket contains 64 cores. With SMT enabled, each socket presents 128 logical CPUs, giving 256 logical CPUs per node in total. The SLURM field Sockets=8 does not refer to physical CPU packages, of which there are only two. It refers instead to NUMA domains: AMD EPYC Rome implements NPS4 (NUMA Per Socket = 4) by default, subdividing each physical socket into four NUMA domains, one per die quadrant (CCD group). With two sockets, each presenting four NUMA domains, the operating system and SLURM observe eight NUMA nodes in total. SLURM maps these onto its Sockets field, and CoresPerSocket=16 reflects the 16 physical cores belonging to each NUMA domain.
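
The NPS4 layout is straightforward to confirm on a node. The following check uses lscpu; the precise labels vary slightly between util-linux versions, but the NUMA-related fields follow this pattern:

lscpu | grep -Ei 'socket|numa|thread'
# Expected (abridged):
#   Thread(s) per core:  2
#   Socket(s):           2
#   NUMA node(s):        8
#   NUMA node0 CPU(s):   0-15,128-143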

The memory subsystem consists of 16 × 16 GiB DDR4 DIMMs running at 3200 MT/s, distributed evenly across the eight NUMA nodes, yielding approximately 32 GiB of locally attached DRAM per NUMA node. The fat nodes (fn[01-18]) carry 16 × 64 GiB DIMMs under the same CPU configuration, giving approximately 128 GiB per NUMA node, but are otherwise topologically identical.

The resulting CPU topology, as reported by numactl --hardware, is as follows:

NUMA node 0 — cpus   0–15,  128–143   (socket 0, quadrant 0)
NUMA node 1 — cpus  16–31,  144–159   (socket 0, quadrant 1)
NUMA node 2 — cpus  32–47,  160–175   (socket 0, quadrant 2)
NUMA node 3 — cpus  48–63,  176–191   (socket 0, quadrant 3)
NUMA node 4 — cpus  64–79,  192–207   (socket 1, quadrant 0)
NUMA node 5 — cpus  80–95,  208–223   (socket 1, quadrant 1)
NUMA node 6 — cpus  96–111, 224–239   (socket 1, quadrant 2)
NUMA node 7 — cpus 112–127, 240–255   (socket 1, quadrant 3)

Each NUMA node owns 16 physical cores together with their 16 SMT sibling threads, giving 32 logical CPUs per node.

The inter-node distance matrix (from the same numactl view) is:

node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

The distance values are nominal relative latencies, with local access as the baseline of 10. By this metric, a cross-socket access (distance 32) carries roughly 3.2 times the latency of a local access, whereas intra-socket cross-NUMA accesses (distance 12) are only marginally slower than local. For memory-bandwidth-bound workloads it is therefore essential to keep both CPU execution and memory allocation within a single NUMA node.
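
The practical impact of the distance values can be demonstrated by pinning a short memory-bound test run first to local and then to cross-socket DRAM. The commands below are purely illustrative (<application> and <test_args> are placeholders) and presume an allocation that spans both sockets, for example an exclusive test node:

# Local placement: execution and memory both on NUMA node 0
numactl --cpunodebind=0 --membind=0 ./<application> <test_args>
# Remote placement: execution on node 0, memory forced onto cross-socket node 4
numactl --cpunodebind=0 --membind=4 ./<application> <test_args>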

SLURM resource accounting configuration

The Discoverer CPU cluster uses the following selection plugin configuration (check /etc/slurm/slurm.conf on the login or compute nodes):

SelectTypeParameters = CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK

CR_CORE_MEMORY specifies that SLURM accounts CPU resources in physical cores rather than logical CPUs, and additionally treats memory as a consumable resource alongside cores. The consequences of this setting are as follows.

The allocatable unit on each node is one physical core, not one SMT thread. Requesting --cpus-per-task=32 with --threads-per-core=2 is therefore interpreted by SLURM as a request for 16 cores, and the node’s remaining cores stay available to concurrent jobs, which share the node’s memory bandwidth and interconnect with the requesting job. There is no user-side mechanism to request individual SMT threads under this configuration; thread-level allocation would require CR_CPU_MEMORY, which is a cluster-wide administrative setting.

CR_CORE_DEFAULT_DIST_BLOCK sets the default CPU distribution to block mode, causing SLURM to fill cores on one socket before advancing to the next. This interacts with topology placement directives and can, in certain SLURM versions, override --sockets-per-node constraints rather than respecting them, resulting in the scheduler ignoring locality requests and packing cores wherever capacity is found.

The correct sbatch directives

Given the hardware topology and SLURM accounting model described above, the correct directives to request one full NUMA node’s worth of physical cores are:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --threads-per-core=1
#SBATCH --cpus-per-task=16

Description of each directive

--nodes=1 ensures the entire allocation is placed on a single physical node, eliminating any inter-node communication overhead.

--ntasks-per-node=1 declares a single task on the node. In conjunction with --cpus-per-task, this is the mechanism by which SLURM constructs the task’s CPU affinity mask.

--sockets-per-node=1 instructs SLURM’s placement logic to confine the allocation to one of the eight logical sockets, which corresponds to one NUMA node on this hardware. Without this directive, SLURM may distribute the requested cores across multiple NUMA nodes, resulting in remote memory accesses for any data that does not reside on the node where a given thread is scheduled.

--cores-per-socket=16 requests all 16 physical cores of the chosen NUMA node. In conjunction with --sockets-per-node=1, this fully specifies the locality constraint.

--threads-per-core=1 is the most consequential directive under CR_CORE_MEMORY. It instructs SLURM to count only one logical CPU per physical core when satisfying --cpus-per-task and, critically, to avoid placing the job on cores whose SMT sibling is already occupied by another job. Without this directive, SLURM may assign cores that are already partially loaded by concurrent tenants, producing CPU contention that is indistinguishable in symptom from a misconfigured affinity mask.

--cpus-per-task=16 specifies the number of CPUs allocated to the task. With --threads-per-core=1, each requested CPU corresponds to one physical core, so the request covers the 16 cores of the chosen NUMA domain. The value also populates the environment variable SLURM_CPUS_PER_TASK, which application launchers typically consult when setting their thread counts.

What SLURM actually reserves

Although --threads-per-core=1 causes SLURM_CPUS_PER_TASK to report only 16 CPUs, SLURM under CR_CORE_MEMORY reserves the full physical core, meaning both SMT threads are blocked from allocation by other jobs. The environment variable SLURM_JOB_CPUS_PER_NODE consequently reports 32 (16 cores multiplied by 2 threads), and, as the job-script section below shows, the task’s affinity mask likewise contains all 32 logical CPUs. SLURM_CPUS_PER_TASK reports 16, reflecting the number of CPUs counted towards the request.
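
Whether the full cores have in fact been reserved can be checked from inside a running job. The following is a hedged example; the exact field layout of the scontrol output varies between SLURM versions:

scontrol -d show job "${SLURM_JOB_ID}" | grep -E 'NumCPUs|CPU_IDs'
# NumCPUs is expected to match SLURM_JOB_CPUS_PER_NODE (32 for this request);
# CPU_IDs on the detail line lists SLURM's core indices for the node.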

Why NUMA binding remains necessary

SLURM’s CPU affinity mask governs which logical CPUs a process may be scheduled upon. It does not govern where memory is allocated. Under the Linux kernel’s default first-touch policy, each page is placed on the NUMA node local to the CPU that first writes it; when that node’s free memory is exhausted, for instance by the memory consumption of other tenants on a shared node, the kernel silently falls back to whichever node can satisfy the request, which may lie on a different socket entirely from the CPUs executing the job’s threads.

For shared-memory parallel workloads with large per-thread working sets — comprising buffers, arrays, and auxiliary data structures — the working set may occupy tens to hundreds of megabytes per worker thread. If those pages are allocated on a remote NUMA node (distance 32), every cache miss that propagates to DRAM incurs the full cross-socket latency penalty. In practice this reduces effective memory bandwidth by a factor of two to three, which translates directly into reduced throughput for memory-bound computations.

The numactl utility addresses both concerns simultaneously through two distinct mechanisms:

  • --cpunodebind=N restricts CPU scheduling to the specified node (functionally redundant with SLURM’s affinity mask, but explicit and harmless).

  • --membind=N forces all subsequent memory allocations to the local DRAM of node N.

The --membind flag is the operationally significant one. Removing the numactl wrapper whilst retaining SLURM’s affinity mask leaves the CPU binding intact but loses the memory binding, producing a degraded but non-obvious performance state: CPU utilisation appears correct whilst throughput remains well below the expected figure.
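
Both mechanisms are normally combined in a single invocation. The node number below is purely illustrative; in practice it is determined at run time by wrapper.sh rather than hard-coded:

numactl --cpunodebind=5 --membind=5 ./<application> <application_arguments>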

The wrapper.sh script

The wrapper encapsulates the NUMA binding logic so that the job script requires no hard-coded NUMA node numbers, which vary between allocations and between nodes. The script is maintained in the Discoverer GitLab repository and should be obtained from there:

https://gitlab.discoverer.bg/vkolev/slurm/-/blob/main/discoverer/numa/wrapper.sh

The affinity mask is read from taskset rather than from SLURM environment variables. This is intentional: SLURM_CPU_BIND_LIST is not populated unless --cpu-bind is explicitly set in the sbatch directives, as confirmed during testing. The affinity mask as reported by the kernel via taskset is the authoritative source of which CPUs are available to the process.

The intersection between the affinity mask and each NUMA node’s CPU list is computed with awk rather than comm. The comm utility requires its input to be sorted in collation order (the sort order of the current locale), whereas the CPU numbers produced by expand_cpulist are sorted numerically. As soon as CPU numbers of ten or greater are present these orderings diverge (128 collates before 16, for example), causing comm to produce silently incorrect intersection results.

No minimum CPU threshold is enforced, in contrast to the original run_on_first_fitting_numa.sh, which required at least N CPUs to be present on a node before binding. Any non-zero intersection between the affinity mask and a NUMA node’s CPU list triggers binding to that node. This eliminates the failure mode where a partially subscribed node causes the script to exit with an error even though a valid binding exists.

Two binding strategies are implemented. When all affinity CPUs reside on a single NUMA node, the expected outcome under correct sbatch directives, binding is applied to that node alone. When CPUs span multiple nodes — indicating that SLURM distributed the allocation across NUMA boundaries — binding covers all involved nodes, keeping memory resident on the nodes where the threads execute. This remains substantially preferable to unbound allocation, which permits the kernel to place pages arbitrarily.
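
For orientation, a minimal sketch of the logic described above is shown below. It is illustrative only, omits the error handling of the maintained script, and is not a substitute for the version in the GitLab repository:

#!/bin/bash
# Illustrative sketch only; the maintained wrapper.sh in the Discoverer
# GitLab repository is the authoritative implementation.
set -euo pipefail

# Affinity mask granted by SLURM, read from the kernel (e.g. "80-95,208-223").
AFFINITY=$(taskset -cp $$ | sed 's/.*: //')

# Expand a cpulist such as "80-95,208-223" into one CPU number per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | \
        awk -F- '{ if (NF == 2) { for (i = $1; i <= $2; i++) print i } else print $1 }'
}

AFF_CPUS=$(expand_cpulist "$AFFINITY")

# Collect every NUMA node whose CPU list intersects the affinity mask.
# The expected case is a single node; if SLURM spread the allocation,
# all involved nodes are included so that memory stays with the threads.
NODES=""
for dir in /sys/devices/system/node/node[0-9]*; do
    node=${dir##*node}
    NODE_CPUS=$(expand_cpulist "$(cat "$dir/cpulist")")
    # Numeric-safe intersection computed with awk (not comm; see above).
    common=$(printf '%s\n%s\n' "$AFF_CPUS" "$NODE_CPUS" | awk 'seen[$0]++' | wc -l)
    if [ "$common" -gt 0 ]; then
        NODES="${NODES:+$NODES,}$node"
    fi
done

# Bind CPU scheduling and memory allocation, then hand over to the application.
exec numactl --cpunodebind="$NODES" --membind="$NODES" "$@"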

The job script

The following is a generic template for a shared-memory parallel job on the cn partition. The application invocation on the final line should be replaced with the actual programme and its arguments. The use of wrapper.sh and the derivation of THREADS from taskset apply regardless of the application.

#!/bin/bash
#SBATCH --job-name=my_smp_job
#SBATCH --account=<account>
#SBATCH --partition=cn
#SBATCH --qos=<qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --threads-per-core=1
#SBATCH --cpus-per-task=16
#SBATCH --time=<time_limit>
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err

set -euo pipefail

module purge || exit
module load <application_module> || exit

cd "${SLURM_SUBMIT_DIR}"

# Compute the thread count from the actual affinity mask rather than from
# SLURM_CPUS_PER_TASK. Under CR_CORE_MEMORY with --threads-per-core=1,
# SLURM_CPUS_PER_TASK reports the requested count (16), but the affinity
# mask contains 32 entries (both SMT threads of every allocated core).
# The value derived from taskset is the true count of CPUs that numactl
# will bind to and is therefore the correct figure to pass to the application.
THREADS=$(taskset -cp $$ | sed 's/.*: //' | \
  tr ',' '\n' | awk -F- '{if(NF==2) sum+=$2-$1+1; else sum+=1} END{print sum}')

echo "THREADS=$THREADS"

# Supply $THREADS to the application in whatever form it expects (for
# example a thread-count argument or an environment variable).
./wrapper.sh <application> <application_arguments>

SLURM_CPUS_PER_TASK is set to 16 by the --cpus-per-task=16 directive. However, because SLURM under CR_CORE_MEMORY reserves complete physical cores, both SMT threads of each allocated core are blocked from other jobs. The actual affinity mask therefore contains 32 logical CPU entries, as confirmed empirically by taskset reporting 80-95,208-223 for a node 5 allocation. Using SLURM_CPUS_PER_TASK directly to set the application thread count would leave half of the available logical CPUs idle. Deriving THREADS from taskset gives the count of CPUs that numactl will bind to, which is the correct value to pass to the application.
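
For an OpenMP application, as one common example, the derived value would be exported before the wrapper is invoked; OMP_PROC_BIND is optional and shown only as a frequently used companion setting:

export OMP_NUM_THREADS="$THREADS"
export OMP_PROC_BIND=true   # ask the OpenMP runtime to pin its threads
./wrapper.sh <application> <application_arguments>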

Common failure modes and diagnostics

Job dispatched to a fully subscribed node

Symptom: CPU utilisation is approximately 100% rather than the expected 1600%.

Cause: One or more of --threads-per-core=1 or --sockets-per-node=1 are absent from the sbatch directives, permitting SLURM to place the job on cores already occupied by concurrent tenants.

Resolution: Apply the complete set of sbatch directives described in the section on the correct sbatch directives.

Correct CPU count but degraded throughput

Symptom: CPU utilisation is at the expected level, but job runtime is longer than the baseline established on an exclusively allocated node.

Cause: wrapper.sh is not being invoked, or the application is launched through a numactl invocation that omits --membind. Memory pages are then allocated on remote NUMA nodes under the kernel’s first-touch policy. This may be confirmed by running numastat -p <pid> and observing that a substantial share of the process’s memory resides on a NUMA node other than the one executing its threads, or, at the system level, by a rising numa_miss or other_node count in the plain numastat output.

Resolution: Ensure the application is invoked through wrapper.sh.
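
The placement can be inspected while the job is running. The pgrep pattern below is illustrative; any means of obtaining the application’s PID will do:

numastat -p "$(pgrep -n <application>)"
# The per-node totals should be concentrated on the NUMA node whose CPUs
# appear in the taskset affinity list.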

wrapper.sh: taskset returned an empty affinity mask

Cause: The script is executing outside a SLURM job step, or the job step was launched without a CPU affinity assignment.

Resolution: Confirm that --cpus-per-task is present in the sbatch directives and that the job is submitted via sbatch.

SLURM_CPU_BIND_LIST: unbound variable

Cause: The --cpu-bind directive was not specified. This variable is only populated when --cpu-bind appears explicitly in the sbatch directives.

Impact: None. wrapper.sh does not consult this variable; it reads the affinity mask directly from the kernel via taskset.
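
If a user-maintained script nevertheless references this variable under set -u, a default parameter expansion avoids the error, for example:

echo "SLURM_CPU_BIND_LIST: ${SLURM_CPU_BIND_LIST:-unset}"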

Verifying an allocation before the application executes

The following diagnostic lines, placed at the head of the job script, produce a complete picture of the allocation state prior to any computation:

echo "[ SLURM allocation ]"
echo "SLURM_CPUS_PER_TASK:     $SLURM_CPUS_PER_TASK"
echo "SLURM_JOB_CPUS_PER_NODE: $SLURM_JOB_CPUS_PER_NODE"
echo "[ taskset affinity ]"
taskset -cp $$
echo "[ numactl hardware ]"
numactl --hardware
echo "[ node CPU load    ]"
mpstat -P ALL 1 1

A correct allocation will present SLURM_JOB_CPUS_PER_NODE=32, a taskset affinity list of 32 CPUs all within the CPU range of a single NUMA node, and numactl --hardware confirming eight nodes with the expected CPU groupings and distance matrix.

Summary

The combination of the hardware topology (AMD EPYC 7H12, Zen2 NPS4, eight NUMA nodes per dual-socket server), the SLURM accounting configuration (CR_CORE_MEMORY), and the memory-access characteristics of shared-memory parallel workloads means that three conditions must hold simultaneously for correct and efficient execution.

First, SLURM must allocate cores from a single NUMA node. This is achieved with --sockets-per-node=1 and --cores-per-socket=16.

Second, SLURM must not permit SMT contention from concurrently executing jobs. This is achieved with --threads-per-core=1.

Third, memory must be allocated on the same NUMA node as the CPUs executing the job’s threads. This is achieved by invoking the application through wrapper.sh, which applies numactl --membind to the process before execution begins.

The failure of any one of these three conditions independently is sufficient to produce a measurable performance regression. All three conditions satisfied together yield a clean, exclusively bound execution environment in which the expected CPU utilisation and throughput figures are reliably achieved.