NUMA-aware SLURM job execution
==============================

.. contents:: Table of contents
   :depth: 3

Scope
-----

This document addresses the correct configuration and execution of tightly-coupled, single-node, shared-memory parallel jobs on the Discoverer Petascale Supercomputer. The workload class in question is characterised by a single multi-threaded process that spawns a modest number of POSIX threads or OpenMP threads — typically between 16 and 64 — all executing within one operating system process and sharing a common address space. Such jobs are referred to in HPC practice as SMP-parallel or shared-memory parallel workloads, as distinct from distributed-memory parallel jobs that employ MPI across multiple nodes.

The document is concerned specifically with NUMA domain binding under the control of the SLURM workload manager. This is a materially different operational context from running the same application directly on a dedicated workstation or an interactively accessed server, where the user has unrestricted access to the full machine and may apply ``numactl`` without any interaction with a batch scheduler. On Discoverer, compute node access is mediated exclusively through SLURM: the scheduler controls which CPUs are assigned to a job, enforces resource limits, and constructs the CPU affinity mask that the process inherits at launch. Correct NUMA binding therefore requires that both the SLURM directives and the ``numactl`` invocation are configured in a mutually consistent manner, as described in the sections that follow.

The principles described here apply to any shared-memory parallel application whose performance is sensitive to memory access latency and NUMA locality, including but not limited to applications based on OpenMP, Intel TBB, or native POSIX thread pools. The document does not address MPI jobs, GPU-accelerated workloads, jobs that span more than one compute node, or interactive use of ``numactl`` outside the SLURM environment.
Hardware topology
-----------------

The compute nodes of the Discoverer CPU cluster each carry two AMD EPYC 7H12 64-Core Processors (Rome microarchitecture, Zen 2). The SLURM node declaration, on a per-node basis, has the following pattern (check ``/etc/slurm/partition.conf`` on the login or compute nodes):

::

   NodeName=cn0738 Sockets=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257700 MemSpecLimit=20616

Each physical socket contains 64 cores. With SMT enabled, each socket presents 128 logical CPUs, giving 256 logical CPUs per node in total.

The SLURM field ``Sockets=8`` does not refer to physical CPU packages, of which there are only two. It refers instead to NUMA domains: the nodes run AMD EPYC Rome in NPS4 mode (NUMA Per Socket = 4), which subdivides each physical socket into four NUMA domains, one per die quadrant (CCD group). With two sockets, each presenting four NUMA domains, the operating system and SLURM observe eight NUMA nodes in total. SLURM maps these onto its ``Sockets`` field, and ``CoresPerSocket=16`` reflects the 16 physical cores belonging to each NUMA domain.

The memory subsystem consists of 16 × 16 GiB DDR4 DIMMs running at 3200 MT/s, distributed evenly across the eight NUMA nodes, yielding approximately 32 GiB of locally attached DRAM per NUMA node. The fat nodes (``fn[01-18]``) carry 16 × 64 GiB DIMMs under the same CPU configuration, giving approximately 128 GiB per NUMA node, but are otherwise topologically identical.
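The figures above can be cross-checked with simple arithmetic. The following sketch uses only the values stated in this section and requires no access to the hardware:

.. code:: bash

   # Derive the per-node totals from the stated topology figures.
   sockets=2               # physical CPU packages
   numa_per_socket=4       # NPS4: four NUMA domains per socket
   cores_per_numa=16       # physical cores per NUMA domain
   threads_per_core=2      # SMT enabled
   dimm_gib=16             # capacity of each of the 16 DIMMs, in GiB

   numa_nodes=$((sockets * numa_per_socket))
   logical_per_numa=$((cores_per_numa * threads_per_core))
   logical_per_node=$((numa_nodes * logical_per_numa))
   dram_per_numa=$((16 * dimm_gib / numa_nodes))

   echo "NUMA nodes:            $numa_nodes"        # 8
   echo "Logical CPUs per NUMA: $logical_per_numa"  # 32
   echo "Logical CPUs total:    $logical_per_node"  # 256
   echo "DRAM per NUMA (GiB):   $dram_per_numa"     # 32

These derived values (8 NUMA nodes, 32 logical CPUs per node, 256 logical CPUs and 256 GiB per server) are the ones the rest of this document relies upon.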
The resulting CPU topology, as reported by ``numactl --hardware``, is as follows:

::

   NUMA node 0 — cpus 0–15, 128–143     (socket 0, quadrant 0)
   NUMA node 1 — cpus 16–31, 144–159    (socket 0, quadrant 1)
   NUMA node 2 — cpus 32–47, 160–175    (socket 0, quadrant 2)
   NUMA node 3 — cpus 48–63, 176–191    (socket 0, quadrant 3)
   NUMA node 4 — cpus 64–79, 192–207    (socket 1, quadrant 0)
   NUMA node 5 — cpus 80–95, 208–223    (socket 1, quadrant 1)
   NUMA node 6 — cpus 96–111, 224–239   (socket 1, quadrant 2)
   NUMA node 7 — cpus 112–127, 240–255  (socket 1, quadrant 3)

Each NUMA node owns 16 physical cores together with their 16 SMT sibling threads, giving 32 logical CPUs per node. The inter-node distance matrix (from the same ``numactl`` view) is:

::

   node   0   1   2   3   4   5   6   7
     0:  10  12  12  12  32  32  32  32
     1:  12  10  12  12  32  32  32  32
     2:  12  12  10  12  32  32  32  32
     3:  12  12  12  10  32  32  32  32
     4:  32  32  32  32  10  12  12  12
     5:  32  32  32  32  12  10  12  12
     6:  32  32  32  32  12  12  10  12
     7:  32  32  32  32  12  12  12  10

The distance values are proportional to memory access latency relative to the local baseline of 10. A local access (distance 10) is approximately 3.2 times faster than a cross-socket access (distance 32). Intra-socket cross-NUMA accesses (distance 12) are only marginally slower than local. For memory-bandwidth-bound workloads it is therefore essential to keep both CPU execution and memory allocation within a single NUMA node.

SLURM resource accounting configuration
---------------------------------------

The Discoverer CPU cluster uses the following selection plugin configuration (check ``/etc/slurm/slurm.conf`` on the login or compute nodes):

::

   SelectTypeParameters=CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK

``CR_CORE_MEMORY`` specifies that SLURM accounts resources in physical cores rather than logical CPUs. The consequences of this setting are as follows. The allocatable unit on each node is one physical core, not one SMT thread.
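The core-based accounting can be illustrated with a short calculation. This is a sketch of the arithmetic only, not SLURM's internal scheduling logic; the request values are hypothetical:

.. code:: bash

   # Under CR_CORE_MEMORY the schedulable unit is the physical core:
   # a request expressed in logical CPUs is charged in whole cores.
   cpus_per_task=32     # hypothetical request: --cpus-per-task=32
   threads_per_core=2   # hypothetical request: --threads-per-core=2

   # Cores charged = ceil(logical CPUs / threads per core)
   cores_charged=$(( (cpus_per_task + threads_per_core - 1) / threads_per_core ))
   echo "cores charged: $cores_charged"   # 16 physical cores reserved

A request for 32 logical CPUs at two threads per core is therefore booked against the job as 16 physical cores, which is exactly the situation described next.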
Requesting ``--cpus-per-task=32`` with ``--threads-per-core=2`` is interpreted by SLURM as a request for 16 cores, leaving the node’s remaining cores available for concurrent jobs — including their SMT siblings, which may share execution resources with the requesting job’s threads. There is no user-side mechanism to exclusively reserve both SMT threads of a core under this configuration; that would require ``CR_CPU_MEMORY``, which is a cluster-wide administrative setting.

``CR_CORE_DEFAULT_DIST_BLOCK`` sets the default CPU distribution to block mode, causing SLURM to fill cores on one socket before advancing to the next. This interacts with topology placement directives and can, in certain SLURM versions, override ``--sockets-per-node`` constraints rather than respecting them, resulting in the scheduler ignoring locality requests and packing cores wherever capacity is found.

The correct ``sbatch`` directives
---------------------------------

Given the hardware topology and SLURM accounting model described above, the correct directives to request one full NUMA node’s worth of physical cores are:

.. code:: bash

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --sockets-per-node=1
   #SBATCH --cores-per-socket=16
   #SBATCH --threads-per-core=1
   #SBATCH --cpus-per-task=16

Description of each directive
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--nodes=1`` ensures the entire allocation is placed on a single physical node, eliminating any inter-node communication overhead.

``--ntasks-per-node=1`` declares a single task on the node. In conjunction with ``--cpus-per-task``, this is the mechanism by which SLURM constructs the task’s CPU affinity mask.

``--sockets-per-node=1`` instructs SLURM’s placement logic to confine the allocation to one of the eight logical sockets, which corresponds to one NUMA node on this hardware.
Without this directive, SLURM may distribute the requested cores across multiple NUMA nodes, resulting in remote memory accesses for any data that does not reside on the node where a given thread is scheduled.

``--cores-per-socket=16`` requests all 16 physical cores of the chosen NUMA node. In conjunction with ``--sockets-per-node=1``, this fully specifies the locality constraint.

``--threads-per-core=1`` is the most consequential directive under ``CR_CORE_MEMORY``. It instructs SLURM to expose only one logical CPU per physical core to the job and, critically, to avoid placing the job on cores whose SMT sibling is already occupied by another job. Without this directive, SLURM may assign cores that are already partially loaded by concurrent tenants, producing CPU contention whose symptoms are indistinguishable from those of a misconfigured affinity mask.

``--cpus-per-task=16`` specifies the number of logical CPUs to include in the task’s affinity mask. With ``--threads-per-core=1`` this equals the number of physical cores. This value also populates the environment variable ``SLURM_CPUS_PER_TASK``, which application launchers typically consult when setting their thread counts.

What SLURM actually reserves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although ``--threads-per-core=1`` causes the affinity mask to present only 16 CPUs to the task, SLURM under ``CR_CORE_MEMORY`` reserves the full physical core, meaning both SMT threads are blocked from allocation to other jobs. The environment variable ``SLURM_JOB_CPUS_PER_NODE`` will consequently report 32 (16 cores multiplied by 2 threads), confirming that both SMT threads are held exclusively. The variable ``SLURM_CPUS_PER_TASK`` will report 16, reflecting what the task itself observes in its affinity mask.

Why NUMA binding remains necessary
----------------------------------

SLURM’s CPU affinity mask governs which logical CPUs a process may be scheduled upon. It does not govern where memory is allocated.
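This separation is visible on any Linux system: the kernel tracks permitted CPUs and permitted memory nodes as distinct process attributes. The following quick illustration is not specific to SLURM, and its output varies by machine:

.. code:: bash

   # Cpus_allowed_list: logical CPUs the scheduler may run this process on
   # (the attribute that SLURM's affinity mask constrains).
   # Mems_allowed_list: NUMA nodes the kernel may allocate this process's
   # memory from (the outer limit within which a numactl --membind policy acts).
   grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status

On an unconstrained shell both lists span the whole machine; inside a SLURM task the CPU list shrinks to the allocated CPUs, while memory placement still depends on the policy the process runs under.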
On a shared node, the Linux kernel’s default first-touch policy allocates each page on the NUMA node local to the CPU that first touches it; when that node is short of free memory, allocations fall back to whichever node currently presents the most available memory, which may be on a different socket entirely from the CPUs executing the job’s threads. For shared-memory parallel workloads with large per-thread working sets — comprising buffers, arrays, and auxiliary data structures — the working set may occupy tens to hundreds of megabytes per worker thread. If those pages are allocated on a remote NUMA node (distance 32), every cache miss that propagates to DRAM incurs the full cross-socket latency penalty. In practice this reduces effective memory bandwidth by a factor of two to three, which translates directly into reduced throughput for memory-bound computations.

The ``numactl`` utility addresses both concerns simultaneously through two distinct mechanisms:

- ``--cpunodebind=N`` restricts CPU scheduling to the CPUs of the specified node (functionally redundant with SLURM’s affinity mask, but explicit and harmless).
- ``--membind=N`` forces all subsequent memory allocations to the local DRAM of node N.

The ``--membind`` flag is the operationally significant one. Removing the ``numactl`` wrapper whilst retaining SLURM’s affinity mask leaves the CPU binding intact but loses the memory binding, producing a degraded but non-obvious performance state: CPU utilisation appears correct whilst throughput remains well below the expected figure.

The ``wrapper.sh`` script
-------------------------

The wrapper encapsulates the NUMA binding logic so that the job script requires no hard-coded NUMA node numbers, which vary between allocations and between nodes. The script is maintained in the Discoverer GitLab repository and should be obtained from there:

https://gitlab.discoverer.bg/vkolev/slurm/-/blob/main/discoverer/numa/wrapper.sh

The affinity mask is read from ``taskset`` rather than from SLURM environment variables.
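Expanding a ``taskset``-style CPU list into individual CPU numbers can be sketched as follows. The list value is a hypothetical example, and the ``awk`` expansion mirrors the kind of parsing the wrapper performs rather than reproducing the repository script verbatim:

.. code:: bash

   # Expand a CPU list such as "80-95,208-223" (ranges and singletons,
   # comma-separated) into one CPU number per line, then count them.
   cpulist="80-95,208-223"

   expanded=$(echo "$cpulist" | tr ',' '\n' | \
       awk -F- '{ if (NF == 2) { for (i = $1; i <= $2; i++) print i } else print $1 }')

   count=$(echo "$expanded" | wc -l)
   echo "CPUs in mask: $count"   # 32

Because the expansion is purely numeric, intersecting it with a NUMA node's CPU list is also done numerically in ``awk``, avoiding the locale-dependent sorting assumptions of ``comm``.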
This is intentional: ``SLURM_CPU_BIND_LIST`` is not populated unless ``--cpu-bind`` is explicitly set in the sbatch directives, as confirmed during testing. The affinity mask as reported by the kernel via ``taskset`` is the authoritative source of which CPUs are available to the process.

The intersection between the affinity mask and each NUMA node’s CPU list is computed with ``awk`` rather than ``comm``. The ``comm`` utility requires its input to be sorted in locale collation order; however, CPU numbers produced by ``expand_cpulist`` are sorted numerically. For CPU numbers of ten or greater these orderings diverge in many locales, causing ``comm`` to produce silently incorrect intersection results.

No minimum CPU threshold is enforced, in contrast to the original ``run_on_first_fitting_numa.sh``, which required at least N CPUs to be present on a node before binding. Any non-zero intersection between the affinity mask and a NUMA node’s CPU list triggers binding to that node. This eliminates the failure mode where a partially subscribed node causes the script to exit with an error even though a valid binding exists.

Two binding strategies are implemented. When all affinity CPUs reside on a single NUMA node (the expected outcome under correct sbatch directives), binding is applied to that node alone. When CPUs span multiple nodes, indicating that SLURM distributed the allocation across NUMA boundaries, binding covers all involved nodes, keeping memory resident on the nodes where the threads execute. This remains substantially preferable to unbound allocation, which permits the kernel to place pages arbitrarily.

The job script
--------------

The following is a generic template for a shared-memory parallel job on the ``cn`` partition. The application invocation on the final line of the script should be replaced with the actual programme and its arguments. The use of ``wrapper.sh`` and the derivation of ``THREADS`` from ``taskset`` apply regardless of the application.

.. code:: bash

   #!/bin/bash

   #SBATCH --job-name=my_smp_job
   #SBATCH --account=
   #SBATCH --partition=cn
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --sockets-per-node=1
   #SBATCH --cores-per-socket=16
   #SBATCH --threads-per-core=1
   #SBATCH --cpus-per-task=16
   #SBATCH --time=
   #SBATCH --output=job_%j.out
   #SBATCH --error=job_%j.err

   set -euo pipefail

   module purge || exit
   module load || exit

   cd "$SLURM_SUBMIT_DIR"

   # Compute the thread count from the actual affinity mask rather than from
   # SLURM_CPUS_PER_TASK. Under CR_CORE_MEMORY with --threads-per-core=1,
   # SLURM_CPUS_PER_TASK reflects the number of visible logical CPUs (16),
   # but the affinity mask contains 32 entries (both SMT threads per core).
   # The value derived from taskset is the true count of CPUs that numactl
   # will bind to and is therefore the correct figure to pass to the application.
   THREADS=$(taskset -cp $$ | sed 's/.*: //' | \
             tr ',' '\n' | awk -F- '{if(NF==2) sum+=$2-$1+1; else sum+=1} END{print sum}')
   echo "THREADS=$THREADS"

   ./wrapper.sh

``SLURM_CPUS_PER_TASK`` is set to 16 by the ``--cpus-per-task=16`` directive. However, because SLURM under ``CR_CORE_MEMORY`` reserves complete physical cores, both SMT threads of each allocated core are blocked from other jobs. The actual affinity mask therefore contains 32 logical CPU entries, as confirmed empirically by ``taskset`` reporting ``80-95,208-223`` for a node 5 allocation. Using ``SLURM_CPUS_PER_TASK`` directly to set the application thread count would leave half of the available logical CPUs idle. Deriving ``THREADS`` from ``taskset`` gives the count of CPUs that ``numactl`` will bind to, which is the correct value to pass to the application.

Common failure modes and diagnostics
------------------------------------

Job dispatched to a fully subscribed node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Symptom: CPU utilisation is approximately 100% rather than the expected 1600%.
Cause: One or more of ``--threads-per-core=1`` or ``--sockets-per-node=1`` is absent from the sbatch directives, permitting SLURM to place the job on cores already occupied by concurrent tenants.

Resolution: Apply the complete set of sbatch directives described in the section on the correct ``sbatch`` directives.

Correct CPU count but degraded throughput
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Symptom: CPU utilisation is at the expected level, but job runtime is longer than the baseline established on an exclusively allocated node.

Cause: ``wrapper.sh`` is not being invoked, or is invoked without ``--membind``. Memory pages are allocated on remote NUMA nodes by the kernel’s first-touch policy. This may be confirmed by running ``numastat -p <pid>`` and observing a high ``numa_miss`` or ``other_node`` count.

Resolution: Ensure the application is invoked through ``wrapper.sh``.

``wrapper.sh: taskset returned an empty affinity mask``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cause: The script is executing outside a SLURM job step, or the job step was launched without a CPU affinity assignment.

Resolution: Confirm that ``--cpus-per-task`` is present in the sbatch directives and that the job is submitted via ``sbatch``.

``SLURM_CPU_BIND_LIST: unbound variable``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cause: The ``--cpu-bind`` directive was not specified. This variable is only populated when ``--cpu-bind`` appears explicitly in the sbatch directives.

Impact: None. ``wrapper.sh`` does not consult this variable; it reads the affinity mask directly from the kernel via ``taskset``.

Verifying an allocation before the application executes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagnostic lines, placed at the head of the job script, produce a complete picture of the allocation state prior to any computation:

.. code:: bash

   echo "[ SLURM allocation ]"
   echo "SLURM_CPUS_PER_TASK:     $SLURM_CPUS_PER_TASK"
   echo "SLURM_JOB_CPUS_PER_NODE: $SLURM_JOB_CPUS_PER_NODE"

   echo "[ taskset affinity ]"
   taskset -cp $$

   echo "[ numactl hardware ]"
   numactl --hardware

   echo "[ node CPU load ]"
   mpstat -P ALL 1 1

A correct allocation will present ``SLURM_JOB_CPUS_PER_NODE=32``, a ``taskset`` affinity list of 32 CPUs all within the CPU range of a single NUMA node, and ``numactl --hardware`` confirming eight nodes with the expected CPU groupings and distance matrix.

Summary
-------

The combination of the hardware topology (AMD EPYC 7H12, Zen 2, NPS4, eight NUMA nodes per dual-socket server), the SLURM accounting configuration (``CR_CORE_MEMORY``), and the memory-access characteristics of shared-memory parallel workloads means that three conditions must hold simultaneously for correct and efficient execution.

First, SLURM must allocate cores from a single NUMA node. This is achieved with ``--sockets-per-node=1`` and ``--cores-per-socket=16``.

Second, SLURM must not permit SMT contention from concurrently executing jobs. This is achieved with ``--threads-per-core=1``.

Third, memory must be allocated on the same NUMA node as the CPUs executing the job’s threads. This is achieved by invoking the application through ``wrapper.sh``, which applies ``numactl --membind`` to the process before execution begins.

The failure of any one of these three conditions is independently sufficient to produce a measurable performance regression. When all three are satisfied together, the job runs in a cleanly and exclusively bound execution environment in which the expected CPU utilisation and throughput figures are reliably achieved.