NUMA-aware SLURM job execution
==============================

.. contents:: Table of contents
   :depth: 3

Scope
-----

This document addresses the correct configuration and execution of tightly-coupled, single-node, shared-memory parallel jobs on the Discoverer Petascale Supercomputer. The workload class in question is characterised by a single multi-threaded process that spawns a modest number of POSIX threads or OpenMP threads — typically between 16 and 64 — all executing within one operating system process and sharing a common address space. Such jobs are referred to in HPC practice as SMP-parallel or shared-memory parallel workloads, as distinct from distributed-memory parallel jobs that employ MPI across multiple nodes.

The document is concerned specifically with NUMA domain binding under the control of the SLURM workload manager. This is a materially different operational context from running the same application directly on a dedicated workstation or an interactively accessed server, where the user has unrestricted access to the full machine and may apply ``numactl`` without any interaction with a batch scheduler. On Discoverer, compute node access is mediated exclusively through SLURM: the scheduler controls which CPUs are assigned to a job, enforces resource limits, and constructs the CPU affinity mask that the process inherits at launch. Correct NUMA binding therefore requires that both the SLURM directives and the ``numactl`` invocation are configured in a mutually consistent manner, as described in the sections that follow.

The principles described here apply to any shared-memory parallel application whose performance is sensitive to memory access latency and NUMA locality, including but not limited to applications based on OpenMP, Intel TBB, or native POSIX thread pools. The document does not address MPI jobs, GPU-accelerated workloads, jobs that span more than one compute node, or interactive use of ``numactl`` outside the SLURM environment.
Hardware topology
-----------------

The compute nodes of the Discoverer CPU cluster each carry two AMD EPYC 7H12 64-Core Processors (Rome microarchitecture, Zen 2). The SLURM node declaration, on a per-node basis, has the following pattern (check ``/etc/slurm/partition.conf`` on the login or compute nodes):

::

   NodeName=cn0738 Sockets=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257700 MemSpecLimit=20616

Each physical socket contains 64 cores. With SMT enabled, each socket presents 128 logical CPUs, giving 256 logical CPUs per node in total.

The SLURM field ``Sockets=8`` does not refer to physical CPU packages, of which there are only two. It refers instead to NUMA domains: the nodes run AMD EPYC Rome in NPS4 mode (NUMA Per Socket = 4), which subdivides each physical socket into four NUMA domains, one per die quadrant (CCD group). With two sockets, each presenting four NUMA domains, the operating system and SLURM observe eight NUMA nodes in total. SLURM maps these onto its ``Sockets`` field, and ``CoresPerSocket=16`` reflects the 16 physical cores belonging to each NUMA domain.

The memory subsystem consists of 16 × 16 GiB DDR4 DIMMs running at 3200 MT/s, distributed evenly across the eight NUMA nodes, yielding approximately 32 GiB of locally attached DRAM per NUMA node. The fat nodes (``fn[01-18]``) carry 16 × 64 GiB DIMMs under the same CPU configuration, giving approximately 128 GiB per NUMA node, but are otherwise topologically identical.
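The figures above can be cross-checked with simple arithmetic. The following sketch uses only the values stated in this section and requires no access to the hardware:

.. code:: bash

   # Derive the per-node totals from the stated topology figures.
   sockets=2               # physical CPU packages
   numa_per_socket=4       # NPS4: four NUMA domains per socket
   cores_per_numa=16       # physical cores per NUMA domain
   threads_per_core=2      # SMT enabled
   dimm_gib=16             # capacity of each of the 16 DIMMs, in GiB

   numa_nodes=$((sockets * numa_per_socket))
   logical_per_numa=$((cores_per_numa * threads_per_core))
   logical_per_node=$((numa_nodes * logical_per_numa))
   dram_per_numa=$((16 * dimm_gib / numa_nodes))

   echo "NUMA nodes:            $numa_nodes"        # 8
   echo "Logical CPUs per NUMA: $logical_per_numa"  # 32
   echo "Logical CPUs total:    $logical_per_node"  # 256
   echo "DRAM per NUMA (GiB):   $dram_per_numa"     # 32

These derived values (8 NUMA nodes, 32 logical CPUs per node, 256 logical CPUs and 256 GiB per server) are the ones the rest of this document relies upon.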
The resulting CPU topology, as reported by ``numactl --hardware``, is as follows:

::

   NUMA node 0 — cpus 0–15, 128–143     (socket 0, quadrant 0)
   NUMA node 1 — cpus 16–31, 144–159    (socket 0, quadrant 1)
   NUMA node 2 — cpus 32–47, 160–175    (socket 0, quadrant 2)
   NUMA node 3 — cpus 48–63, 176–191    (socket 0, quadrant 3)
   NUMA node 4 — cpus 64–79, 192–207    (socket 1, quadrant 0)
   NUMA node 5 — cpus 80–95, 208–223    (socket 1, quadrant 1)
   NUMA node 6 — cpus 96–111, 224–239   (socket 1, quadrant 2)
   NUMA node 7 — cpus 112–127, 240–255  (socket 1, quadrant 3)

Each NUMA node owns 16 physical cores together with their 16 SMT sibling threads, giving 32 logical CPUs per node. The inter-node distance matrix (from the same ``numactl`` view) is:

::

   node   0   1   2   3   4   5   6   7
     0:  10  12  12  12  32  32  32  32
     1:  12  10  12  12  32  32  32  32
     2:  12  12  10  12  32  32  32  32
     3:  12  12  12  10  32  32  32  32
     4:  32  32  32  32  10  12  12  12
     5:  32  32  32  32  12  10  12  12
     6:  32  32  32  32  12  12  10  12
     7:  32  32  32  32  12  12  12  10

The distance values are proportional to memory access latency relative to the local baseline of 10. A local access (distance 10) is approximately 3.2 times faster than a cross-socket access (distance 32). Intra-socket cross-NUMA accesses (distance 12) are only marginally slower than local. For memory-bandwidth-bound workloads it is therefore essential to keep both CPU execution and memory allocation within a single NUMA node.

SLURM resource accounting configuration
---------------------------------------

The Discoverer CPU cluster uses the following selection plugin configuration (check ``/etc/slurm/slurm.conf`` on the login or compute nodes):

::

   SelectTypeParameters=CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK

``CR_CORE_MEMORY`` specifies that SLURM accounts resources in physical cores rather than logical CPUs. The consequences of this setting are as follows. The allocatable unit on each node is one physical core, not one SMT thread.
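The core-based accounting can be illustrated with a short calculation. This is a sketch of the arithmetic only, not SLURM's internal scheduling logic; the request values are hypothetical:

.. code:: bash

   # Under CR_CORE_MEMORY the schedulable unit is the physical core:
   # a request expressed in logical CPUs is charged in whole cores.
   cpus_per_task=32     # hypothetical request: --cpus-per-task=32
   threads_per_core=2   # hypothetical request: --threads-per-core=2

   # Cores charged = ceil(logical CPUs / threads per core)
   cores_charged=$(( (cpus_per_task + threads_per_core - 1) / threads_per_core ))
   echo "cores charged: $cores_charged"   # 16 physical cores reserved

A request for 32 logical CPUs at two threads per core is therefore booked against the job as 16 physical cores, which is exactly the situation described next.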
Requesting ``--cpus-per-task=32`` with ``--threads-per-core=2`` is interpreted by SLURM as a request for 16 cores, leaving the node’s remaining cores available for concurrent jobs — including their SMT siblings, which may share execution resources with the requesting job’s threads. There is no user-side mechanism to exclusively reserve both SMT threads of a core under this configuration; that would require ``CR_CPU_MEMORY``, which is a cluster-wide administrative setting.

``CR_CORE_DEFAULT_DIST_BLOCK`` sets the default CPU distribution to block mode, causing SLURM to fill cores on one socket before advancing to the next. This interacts with topology placement directives and can, in certain SLURM versions, override ``--sockets-per-node`` constraints rather than respecting them, resulting in the scheduler ignoring locality requests and packing cores wherever capacity is found.

The correct ``sbatch`` directives
---------------------------------

Given the hardware topology and SLURM accounting model described above, the correct directives to request one full NUMA node’s worth of physical cores are:

.. code:: bash

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --sockets-per-node=1
   #SBATCH --cores-per-socket=16
   #SBATCH --threads-per-core=1
   #SBATCH --cpus-per-task=16

Description of each directive
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--nodes=1`` ensures the entire allocation is placed on a single physical node, eliminating any inter-node communication overhead.

``--ntasks-per-node=1`` declares a single task on the node. In conjunction with ``--cpus-per-task``, this is the mechanism by which SLURM constructs the task’s CPU affinity mask.

``--sockets-per-node=1`` instructs SLURM’s placement logic to confine the allocation to one of the eight logical sockets, which corresponds to one NUMA node on this hardware.
Without this directive, SLURM may distribute the requested cores across multiple NUMA nodes, resulting in remote memory accesses for any data that does not reside on the node where a given thread is scheduled.

``--cores-per-socket=16`` requests all 16 physical cores of the chosen NUMA node. In conjunction with ``--sockets-per-node=1``, this fully specifies the locality constraint.

``--threads-per-core=1`` is the most consequential directive under ``CR_CORE_MEMORY``. It instructs SLURM to expose only one logical CPU per physical core to the job and, critically, to avoid placing the job on cores whose SMT sibling is already occupied by another job. Without this directive, SLURM may assign cores that are already partially loaded by concurrent tenants, producing CPU contention whose symptoms are indistinguishable from those of a misconfigured affinity mask.

``--cpus-per-task=16`` specifies the number of logical CPUs to include in the task’s affinity mask. With ``--threads-per-core=1`` this equals the number of physical cores. This value also populates the environment variable ``SLURM_CPUS_PER_TASK``, which application launchers typically consult when setting their thread counts.

What SLURM actually reserves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although ``--threads-per-core=1`` causes the affinity mask to present only 16 CPUs to the task, SLURM under ``CR_CORE_MEMORY`` reserves the full physical core, meaning both SMT threads are blocked from allocation to other jobs. The environment variable ``SLURM_JOB_CPUS_PER_NODE`` will consequently report 32 (16 cores multiplied by 2 threads), confirming that both SMT threads are held exclusively. The variable ``SLURM_CPUS_PER_TASK`` will report 16, reflecting what the task itself observes in its affinity mask.

Why NUMA binding remains necessary
----------------------------------

SLURM’s CPU affinity mask governs which logical CPUs a process may be scheduled upon. It does not govern where memory is allocated.
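This separation is visible on any Linux system: the kernel tracks permitted CPUs and permitted memory nodes as distinct process attributes. The following quick illustration is not specific to SLURM, and its output varies by machine:

.. code:: bash

   # Cpus_allowed_list: logical CPUs the scheduler may run this process on
   # (the attribute that SLURM's affinity mask constrains).
   # Mems_allowed_list: NUMA nodes the kernel may allocate this process's
   # memory from (the outer limit within which a numactl --membind policy acts).
   grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status

On an unconstrained shell both lists span the whole machine; inside a SLURM task the CPU list shrinks to the allocated CPUs, while memory placement still depends on the policy the process runs under.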
On a shared node, the Linux kernel’s default first-touch policy allocates each page on the NUMA node local to the CPU that first touches it; when that node is short of free memory, allocations fall back to whichever node currently presents the most available memory, which may be on a different socket entirely from the CPUs executing the job’s threads. For shared-memory parallel workloads with large per-thread working sets — comprising buffers, arrays, and auxiliary data structures — the working set may occupy tens to hundreds of megabytes per worker thread. If those pages are allocated on a remote NUMA node (distance 32), every cache miss that propagates to DRAM incurs the full cross-socket latency penalty. In practice this reduces effective memory bandwidth by a factor of two to three, which translates directly into reduced throughput for memory-bound computations.

The ``numactl`` utility addresses both concerns simultaneously through two distinct mechanisms:

- ``--cpunodebind=N`` restricts CPU scheduling to the CPUs of the specified node (functionally redundant with SLURM’s affinity mask, but explicit and harmless).
- ``--membind=N`` forces all subsequent memory allocations to the local DRAM of node N.

The ``--membind`` flag is the operationally significant one. Removing the ``numactl`` wrapper whilst retaining SLURM’s affinity mask leaves the CPU binding intact but loses the memory binding, producing a degraded but non-obvious performance state: CPU utilisation appears correct whilst throughput remains well below the expected figure.

The ``wrapper.sh`` script
-------------------------

The wrapper encapsulates the NUMA binding logic so that the job script requires no hard-coded NUMA node numbers, which vary between allocations and between nodes. The script is maintained in the Discoverer GitLab repository and should be obtained from there:

https://gitlab.discoverer.bg/vkolev/slurm/-/blob/main/discoverer/numa/wrapper.sh

The affinity mask is read from ``taskset`` rather than from SLURM environment variables.
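Expanding a ``taskset``-style CPU list into individual CPU numbers can be sketched as follows. The list value is a hypothetical example, and the ``awk`` expansion mirrors the kind of parsing the wrapper performs rather than reproducing the repository script verbatim:

.. code:: bash

   # Expand a CPU list such as "80-95,208-223" (ranges and singletons,
   # comma-separated) into one CPU number per line, then count them.
   cpulist="80-95,208-223"

   expanded=$(echo "$cpulist" | tr ',' '\n' | \
       awk -F- '{ if (NF == 2) { for (i = $1; i <= $2; i++) print i } else print $1 }')

   count=$(echo "$expanded" | wc -l)
   echo "CPUs in mask: $count"   # 32

Because the expansion is purely numeric, intersecting it with a NUMA node's CPU list is also done numerically in ``awk``, avoiding the locale-dependent sorting assumptions of ``comm``.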
This is intentional: ``SLURM_CPU_BIND_LIST`` is not populated unless ``--cpu-bind`` is explicitly set in the sbatch directives, as confirmed during testing. The affinity mask as reported by the kernel via ``taskset`` is the authoritative source of which CPUs are available to the process.

The intersection between the affinity mask and each NUMA node’s CPU list is computed with ``awk`` rather than ``comm``. The ``comm`` utility requires its input to be sorted in locale collation order; however, CPU numbers produced by ``expand_cpulist`` are sorted numerically. For CPU numbers of ten or greater these orderings diverge in many locales, causing ``comm`` to produce silently incorrect intersection results.

No minimum CPU threshold is enforced, in contrast to the original ``run_on_first_fitting_numa.sh``, which required at least N CPUs to be present on a node before binding. Any non-zero intersection between the affinity mask and a NUMA node’s CPU list triggers binding to that node. This eliminates the failure mode where a partially subscribed node causes the script to exit with an error even though a valid binding exists.

Two binding strategies are implemented. When all affinity CPUs reside on a single NUMA node (the expected outcome under correct sbatch directives), binding is applied to that node alone. When CPUs span multiple nodes, indicating that SLURM distributed the allocation across NUMA boundaries, binding covers all involved nodes, keeping memory resident on the nodes where the threads execute. This remains substantially preferable to unbound allocation, which permits the kernel to place pages arbitrarily.

The job script
--------------

The following is a generic template for a shared-memory parallel job on the ``cn`` partition. The application invocation on the final line of the script should be replaced with the actual programme and its arguments. The use of ``wrapper.sh`` and the derivation of ``THREADS`` from ``taskset`` apply regardless of the application.

.. code:: bash

   #!/bin/bash

   #SBATCH --job-name=my_smp_job
   #SBATCH --account=
   #SBATCH --partition=cn
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --sockets-per-node=1
   #SBATCH --cores-per-socket=16
   #SBATCH --threads-per-core=1
   #SBATCH --cpus-per-task=16
   #SBATCH --time=
   #SBATCH --output=job_%j.out
   #SBATCH --error=job_%j.err

   set -euo pipefail

   module purge || exit
   module load || exit

   cd "$SLURM_SUBMIT_DIR"

   # Compute the thread count from the actual affinity mask rather than from
   # SLURM_CPUS_PER_TASK. Under CR_CORE_MEMORY with --threads-per-core=1,
   # SLURM_CPUS_PER_TASK reflects the number of visible logical CPUs (16),
   # but the affinity mask contains 32 entries (both SMT threads per core).
   # The value derived from taskset is the true count of CPUs that numactl
   # will bind to and is therefore the correct figure to pass to the application.
   THREADS=$(taskset -cp $$ | sed 's/.*: //' | \
             tr ',' '\n' | awk -F- '{if(NF==2) sum+=$2-$1+1; else sum+=1} END{print sum}')
   echo "THREADS=$THREADS"

   ./wrapper.sh

``SLURM_CPUS_PER_TASK`` is set to 16 by the ``--cpus-per-task=16`` directive. However, because SLURM under ``CR_CORE_MEMORY`` reserves complete physical cores, both SMT threads of each allocated core are blocked from other jobs. The actual affinity mask therefore contains 32 logical CPU entries, as confirmed empirically by ``taskset`` reporting ``80-95,208-223`` for a node 5 allocation. Using ``SLURM_CPUS_PER_TASK`` directly to set the application thread count would leave half of the available logical CPUs idle. Deriving ``THREADS`` from ``taskset`` gives the count of CPUs that ``numactl`` will bind to, which is the correct value to pass to the application.

Common failure modes and diagnostics
------------------------------------

Job dispatched to a fully subscribed node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Symptom: CPU utilisation is approximately 100% rather than the expected 1600%.
Cause: One or more of ``--threads-per-core=1`` or ``--sockets-per-node=1`` is absent from the sbatch directives, permitting SLURM to place the job on cores already occupied by concurrent tenants.

Resolution: Apply the complete set of sbatch directives described in the section on the correct ``sbatch`` directives.

Correct CPU count but degraded throughput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Symptom: CPU utilisation is at the expected level, but job runtime is longer than the baseline established on an exclusively allocated node.

Cause: ``wrapper.sh`` is not being invoked, or is invoked without ``--membind``. Memory pages are allocated on remote NUMA nodes by the kernel’s first-touch policy. This may be confirmed by running ``numastat -p <pid>`` and observing a high ``numa_miss`` or ``other_node`` count.

Resolution: Ensure the application is invoked through ``wrapper.sh``.

``wrapper.sh: taskset returned an empty affinity mask``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cause: The script is executing outside a SLURM job step, or the job step was launched without a CPU affinity assignment.

Resolution: Confirm that ``--cpus-per-task`` is present in the sbatch directives and that the job is submitted via ``sbatch``.

``SLURM_CPU_BIND_LIST: unbound variable``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cause: The ``--cpu-bind`` directive was not specified. This variable is only populated when ``--cpu-bind`` appears explicitly in the sbatch directives.

Impact: None. ``wrapper.sh`` does not consult this variable; it reads the affinity mask directly from the kernel via ``taskset``.

Verifying an allocation before the application executes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagnostic lines, placed at the head of the job script, produce a complete picture of the allocation state prior to any computation:

.. code:: bash

   echo "[ SLURM allocation ]"
   echo "SLURM_CPUS_PER_TASK:     $SLURM_CPUS_PER_TASK"
   echo "SLURM_JOB_CPUS_PER_NODE: $SLURM_JOB_CPUS_PER_NODE"

   echo "[ taskset affinity ]"
   taskset -cp $$

   echo "[ numactl hardware ]"
   numactl --hardware

   echo "[ node CPU load ]"
   mpstat -P ALL 1 1

A correct allocation will present ``SLURM_JOB_CPUS_PER_NODE=32``, a ``taskset`` affinity list of 32 CPUs all within the CPU range of a single NUMA node, and ``numactl --hardware`` confirming eight nodes with the expected CPU groupings and distance matrix.

Summary
-------

The combination of the hardware topology (AMD EPYC 7H12, Zen 2, NPS4, eight NUMA nodes per dual-socket server), the SLURM accounting configuration (``CR_CORE_MEMORY``), and the memory-access characteristics of shared-memory parallel workloads means that three conditions must hold simultaneously for correct and efficient execution.

First, SLURM must allocate cores from a single NUMA node. This is achieved with ``--sockets-per-node=1`` and ``--cores-per-socket=16``.

Second, SLURM must not permit SMT contention from concurrently executing jobs. This is achieved with ``--threads-per-core=1``.

Third, memory must be allocated on the same NUMA node as the CPUs executing the job’s threads. This is achieved by invoking the application through ``wrapper.sh``, which applies ``numactl --membind`` to the process before execution begins.

The failure of any one of these three conditions is independently sufficient to produce a measurable performance regression. When all three are satisfied together, the job runs in a cleanly and exclusively bound execution environment in which the expected CPU utilisation and throughput figures are reliably achieved.