SLURM GPU billing
=================

.. contents:: Table of contents
   :depth: 3


GPU hours allocation
--------------------

.. note :: Each project granted access to the Discoverer+ GPU cluster — whether through EuroHPC, a national allocation, or another programme — receives an allocation of *GPU hours*. GPU hours represent the total GPU compute time available to the project.

On Discoverer+, the H200 GPUs are managed by SLURM as *Generic RESources (GRES)*. SLURM does not natively schedule GPUs as a first-class resource type; GPUs are described and tracked as GRES. For this reason, a “GPU hour” in project documentation corresponds to a *GRES hour* in SLURM terms. The two are equivalent on this cluster.

The exact allocation size is specified in the *project documentation*. The figures used throughout this document — 5,000 GPU-hours and the corresponding billing limits — are *examples only* and do not represent any default allocation. The actual values must be taken from the project agreement.

SLURM tracks two resource counters independently for each project:

+-----------------------+-----------------------+-----------------------+
| Counter               | Example limit         | Description           |
+=======================+=======================+=======================+
| ``gres/gpu``          | 300,000 GPU-minutes   | Raw GPU time consumed |
|                       | (= **5,000            |                       |
|                       | GPU-hours**)          |                       |
+-----------------------+-----------------------+-----------------------+
| ``billing``           | 19,500,000            | Weighted sum of all   |
|                       | billing-minutes       | host resources        |
|                       |                       | consumed              |
+-----------------------+-----------------------+-----------------------+

The ``gres/gpu`` counter represents the actual GPU allocation. The ``billing`` counter is a fairness mechanism that accounts for the host CPU and memory consumed alongside the GPUs. Both are described in detail in the sections below.

--------------

Node resources
--------------

Each DGX node is equipped with 2 × 56-core processors, giving 112 physical cores in total. With *simultaneous multithreading (SMT) enabled*, each physical core exposes 2 logical CPU threads to the operating system. Linux and SLURM therefore both operate with *224 logical CPU threads* per node, not physical cores.

All SLURM options that allocate CPUs (``--cpus-per-task``, ``--ntasks``, and related options) count *logical CPU threads*. A request of ``--cpus-per-task=8`` allocates 8 threads, which may correspond to as few as 4 physical cores depending on thread placement. This is generally transparent to applications, but it matters for workloads sensitive to NUMA topology or last-level cache sharing.

Not all 224 threads are available to user jobs. *8 physical cores (16 logical threads; WEKA binds its threads to these cores)* are permanently reserved for the *WEKA storage client*, which runs as a container and provides access to the cluster’s high-performance parallel filesystem. These cores are excluded from the SLURM scheduling pool via the ``CpuSpecList`` parameter in the node configuration.

The WEKA container cores are shared with the Linux kernel and general system tasks. Under normal operating conditions this has no effect on user jobs. In cases of unusually high parallel I/O load, the kernel scheduler may transiently borrow a small number of threads from a running job to service system work. This is a rare occurrence.

The resulting resource availability per node is as follows:

+-------------+-------------+-------------+-------------+-------------+
| Resource    | Physical    | Logical     | Reserved    | Available   |
|             |             | (OS/SLURM)  |             | to jobs     |
+=============+=============+=============+=============+=============+
| CPU cores / | 112 cores   | 224 threads | 16 threads  | 208         |
| threads     |             |             | (WEKA)      | threads     |
+-------------+-------------+-------------+-------------+-------------+
| Memory      | 2,063,425   | —           | 5,000 MB    | ~2,058      |
|             | MB          |             | (system)    | GB          |
+-------------+-------------+-------------+-------------+-------------+
| GPUs        | 8           | —           | —           | 8           |
+-------------+-------------+-------------+-------------+-------------+

--------------

Billing model
-------------

For every minute a job runs, SLURM computes a billing score as the weighted sum of all resources allocated to that job:

::

   billing/min = (CPU_threads × 0.035714) + (MemoryGB × 0.25) + (GPUs × 1.0)

``CPU_threads`` is the number of logical CPU threads allocated,
``MemoryGB`` is the host RAM allocated in gigabytes, and ``GPUs`` is the
number of GPU devices requested.

.. note :: Memory units: SLURM measures memory in megabytes internally. The weight ``Mem=0.25G`` specifies 0.25 billing units per gigabyte (equivalent to 0.25 ÷ 1024 per megabyte).

On DGX nodes, *memory is the dominant billing term*. Each node carries ~2 TB of host RAM in direct support of its 8 H200 GPUs. A job that exhausts the node’s host memory prevents other jobs from being scheduled on the remaining GPUs, regardless of how many GPUs that job itself uses. The billing weights reflect this: at 0.25 billing units per GB, a job holding 4 GB of host RAM is billed as much for that memory as for one GPU, so memory over-allocation is penalised on the same scale as GPU over-allocation.
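The billing formula can be evaluated for a planned resource request before a job is submitted. The following is a minimal sketch, assuming ``awk`` is available on the login node. The function name ``billing_rate`` is illustrative and not part of SLURM; the weights are copied from the formula above.

.. code:: bash

   # billing_rate THREADS MEM_GB GPUS
   # Prints the billing units charged per minute for one job, using the
   # weights from the billing formula (illustrative helper, not a SLURM tool).
   billing_rate() {
       awk -v cpu="$1" -v mem="$2" -v gpu="$3" 'BEGIN {
           printf "%.2f\n", cpu * 0.035714 + mem * 0.25 + gpu * 1.0
       }'
   }

   billing_rate 8 64 1    # 8 threads, 64 GB, 1 GPU: prints 17.29

In this example the memory term (64 × 0.25 = 16) accounts for almost the entire rate, which illustrates why the memory request deserves the most attention.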

--------------

Fair-share of host resources per GPU
------------------------------------

With 8 GPUs sharing 208 CPU threads and ~2,058 GB of RAM, the proportional 1/8th-node share per GPU is:

=================== ================= =====================
Resource            Available to jobs Per GPU (1/8th share)
=================== ================= =====================
Logical CPU threads 208               26
Memory              ~2,058 GB         ~257 GB
GPUs                8                 1
=================== ================= =====================

These figures define the *reference billing rate* — the rate at which a project consumes its billing budget when host resources are used in proportion to GPU allocation. A job allocated exactly this share produces:

::

   CPU:   26  × 0.035714 =  0.93 billing units/min
   Mem:  257  × 0.25     = 64.25 billing units/min
   GPU:    1  × 1.0      =  1.00 billing units/min
   ─────────────────────────────────────────────
   Total:                  66.18 billing units/min

At this rate, a 5,000 GPU-hour project (300,000 GPU-minutes) would consume:

::

   300,000 × 66.18 = 19,854,000 billing-minutes

This calculation is the basis for deriving the billing cap; see `Billing limit derivation`_.
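The fair-share arithmetic above can be reproduced in a few lines of shell. This is a sketch under the node figures given in this document (208 threads, ~2,058 GB, 8 GPUs); the helper name ``fair_share_budget`` is hypothetical, and the rate is rounded to two decimals before multiplying so that the result matches the hand calculation.

.. code:: bash

   # fair_share_budget GPU_MINUTES
   # Derives the per-GPU share of an 8-GPU node and the billing consumed when
   # a whole allocation runs at that share (illustrative helper).
   fair_share_budget() {
       awk -v gpu_min="$1" 'BEGIN {
           threads = 208 / 8                  # 26 threads per GPU
           mem_gb  = int(2058 / 8)            # 257 GB per GPU, rounded down
           rate = sprintf("%.2f", threads * 0.035714 + mem_gb * 0.25 + 1.0)
           printf "rate=%s total=%.0f\n", rate, rate * gpu_min
       }'
   }

   fair_share_budget 300000    # prints rate=66.18 total=19854000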

The fair-share figures are a *project-level target for the average across all jobs*, not a per-job requirement. Individual jobs will legitimately deviate in both directions: a preprocessing job may need 2 CPU threads and 32 GB of RAM per GPU, while a large-batch training job may require 40 threads and 400 GB. Both are acceptable. What determines whether a project can reach its full GPU-hours allocation is whether the *average* resource consumption across all submitted jobs remains close to the fair-share values. A project whose jobs consistently over-allocate CPU threads or memory relative to GPUs will exhaust the billing budget before the GPU-hours are spent, and the remaining GPU-hours become permanently unreachable for that allocation period.
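Because the target applies to the project-wide average, it can help to estimate that average for a planned mix of jobs. The sketch below weights each job by its GPU-minutes; the helper name ``average_rate`` and the two 600-minute job durations are assumptions made for illustration, while the thread and memory figures come from the paragraph above.

.. code:: bash

   # average_rate reads lines "THREADS MEM_GB GPUS MINUTES" on stdin and
   # prints the average billing units charged per GPU-minute across all jobs.
   average_rate() {
       awk '{
           billing += ($1 * 0.035714 + $2 * 0.25 + $3 * 1.0) * $4
           gpu_min += $3 * $4
       } END {
           if (gpu_min > 0) printf "%.2f\n", billing / gpu_min
       }'
   }

   # One light and one heavy single-GPU job of equal length.
   printf '%s\n' '2 32 1 600' '40 400 1 600' | average_rate    # prints 55.75

Although the second job alone is billed at roughly 102 units per minute, the pair averages about 56, which is below the fair-share rate of 66.18.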

--------------

Allocation limits
-----------------

The ``gres/gpu`` limit
~~~~~~~~~~~~~~~~~~~~~~

The ``gres/gpu`` limit enforces the project’s GPU allocation ceiling. Once the counter is exhausted, no further GPU jobs can be submitted or run under the project until the allocation period resets or is extended.

The ``billing`` limit
~~~~~~~~~~~~~~~~~~~~~

The ``billing`` limit is a fairness control. It penalises jobs that consume disproportionate host resources — CPU threads or memory — relative to the number of GPUs they request.

A job requesting 1 GPU but all 208 available CPU threads and all available memory produces the following billing rate:

::

   CPU:  208  × 0.035714 =   7.43 billing units/min
   Mem: 2058  × 0.25     = 514.50 billing units/min
   GPU:    1  × 1.0      =   1.00 billing units/min
   ─────────────────────────────────────────────────
   Total:                  522.93 billing units/min

This is approximately 8× the fair-share rate. Such a job exhausts the billing budget 8× faster than expected, while simultaneously occupying all node CPU and memory resources and preventing the remaining 7 GPUs from being assigned to any other job.

The billing limit stops this pattern before the full GPU allocation is consumed. The billing budget is not replenished while unspent ``gres/gpu`` allocation remains, so a project that exhausts billing first permanently loses the remaining GPU-hours for that allocation period.

--------------

Billing limit derivation
------------------------

For a project to be able to consume its full GPU-hours allocation under fair-share usage, the billing cap must be set no lower than the total billing cost of that usage:

::

   billing cap = gres/gpu minutes × billing rate at fair share
               = 300,000 × 66.18
               = 19,854,000 billing-minutes

The recommended configuration for a 5,000 GPU-hour project is therefore:

::

   GrpTRESMins=billing=19500000,gres/gpu=300000

The billing cap is rounded down from the theoretical value of 19,854,000, which makes the limit slightly stricter than the exact fair-share rate: 19,500,000 ÷ 300,000 = 65.0 billing units per GPU-minute, against 66.18 at fair share.

With this configuration, a project whose average billing rate stays at or below 65.0 billing units per GPU-minute will exhaust ``gres/gpu`` first and reach all 5,000 GPU-hours. A project running at exactly the fair-share rate of 66.18 exhausts ``billing`` after about 4,911 GPU-hours. A project that consistently over-allocates CPU threads or memory relative to GPU count will exhaust ``billing`` much earlier, with the remaining GPU-hours becoming unreachable.
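The number of GPU-hours a project can reach follows directly from the two caps and the project's average billing rate. The following sketch assumes the example limits used in this document as defaults; the function name ``reachable_gpu_hours`` is illustrative, and real values must be taken from the project agreement.

.. code:: bash

   # reachable_gpu_hours RATE [BILLING_CAP] [GPU_CAP_MINUTES]
   # RATE is the average billing units charged per GPU-minute.
   reachable_gpu_hours() {
       awk -v rate="$1" -v bcap="${2:-19500000}" -v gcap="${3:-300000}" 'BEGIN {
           minutes = bcap / rate                # GPU-minutes the billing cap allows
           if (minutes > gcap) minutes = gcap   # gres/gpu cap is reached first
           printf "%.0f\n", minutes / 60
       }'
   }

   reachable_gpu_hours 65      # prints 5000
   reachable_gpu_hours 66.18   # prints 4911
   reachable_gpu_hours 128     # prints 2539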

--------------

Job resource estimation
-----------------------

Requesting more resources than a job will use is not a safe practice. SLURM bills on *allocated* resources, not on actual consumption. A job allocated 500 GB of RAM that uses only 80 GB is billed for 500 GB for the full duration of the job. Over-allocation of memory in particular — given that memory dominates the billing formula — is the fastest route to premature exhaustion of the billing budget.
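The cost of the over-allocation described above can be quantified. The sketch below computes the memory term of the bill only, assuming a hypothetical 24-hour job; ``mem_billing`` is an illustrative helper name.

.. code:: bash

   # mem_billing MEM_GB HOURS
   # Billing units charged for the memory term alone over the job duration.
   mem_billing() {
       awk -v mem="$1" -v hours="$2" 'BEGIN {
           printf "%.0f\n", mem * 0.25 * hours * 60
       }'
   }

   allocated=$(mem_billing 500 24)   # billed for the 500 GB requested
   needed=$(mem_billing 80 24)       # what an 80 GB request would have cost
   echo "allocated=$allocated needed=$needed wasted=$((allocated - needed))"
   # prints allocated=180000 needed=28800 wasted=151200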

CPU thread estimation
~~~~~~~~~~~~~~~~~~~~~

The number of CPU threads a job should request depends on the actual parallelism of the workload. The node exposes 208 logical threads to jobs; each physical core contributes 2 threads, so a request of 26 threads corresponds to 13 physical cores. The correct value is determined by the application: frameworks vary in how many threads they spawn for data loading, preprocessing, and compute. The actual thread utilisation should be measured using ``htop``, ``sstat``, or equivalent tools during an initial test run, and subsequent submissions adjusted accordingly.

Memory estimation
~~~~~~~~~~~~~~~~~

Host RAM (the memory specified via ``--mem`` or ``--mem-per-cpu``) is distinct from GPU VRAM. Host RAM is used for data staging, CPU-side preprocessing, framework runtime overhead, and any tensors or buffers that reside on the host. A baseline estimate for a single-GPU job is the size of the model checkpoint held in host memory, plus the size of one data batch in host memory, plus approximately 20% overhead. For recurring workloads, the peak resident set size from a completed job can be retrieved with:

.. code:: bash

   sacct -j <jobid> --format=JobID,MaxRSS

and used to calibrate the memory request for subsequent runs.
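One way to turn a measured peak into a request is to add a fixed headroom and round up. The helper below is a sketch, not a site-provided tool: it assumes the ``MaxRSS`` value carries a ``K``, ``M`` or ``G`` suffix, reuses the 20% overhead figure from the baseline estimate, and uses the hypothetical name ``suggest_mem``. ``MaxRSS`` is reported per job step, so the largest step value is the one to pass in.

.. code:: bash

   # suggest_mem MAXRSS
   # Converts a MaxRSS value such as "83886080K" into a --mem request in
   # whole gigabytes with 20% headroom.
   suggest_mem() {
       awk -v rss="$1" 'BEGIN {
           unit  = substr(rss, length(rss))             # trailing K, M or G
           value = substr(rss, 1, length(rss) - 1) + 0
           if (unit == "K")      gb = value / 1024 / 1024
           else if (unit == "M") gb = value / 1024
           else if (unit == "G") gb = value
           else { print "unsupported unit: " rss > "/dev/stderr"; exit 1 }
           want = gb * 120 / 100                        # add 20% headroom
           mem = int(want); if (mem < want) mem++       # round up to a whole GB
           printf "--mem=%dG\n", mem
       }'
   }

   suggest_mem 83886080K    # 80 GB peak: prints --mem=96G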

Input-dependent scaling
~~~~~~~~~~~~~~~~~~~~~~~

Resource requirements frequently scale with input size. A job processing a 10 GB dataset and a job processing a 500 GB dataset may require substantially different amounts of host memory and CPU threads. Submission scripts should parameterise these values rather than using a single fixed request for all input sizes.
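A parameterised submission could look like the sketch below. The sizing rule (16 GB base plus 0.5 GB of host RAM per GB of input, and one thread per 16 GB of RAM) is a placeholder invented for illustration, and ``size_request`` is a hypothetical helper; both constants should be calibrated from measured runs of the actual workload.

.. code:: bash

   # size_request INPUT_GB
   # Prints sbatch options scaled to the input size (placeholder rule).
   size_request() {
       local input_gb=$1
       local mem_gb=$(( 16 + (input_gb + 1) / 2 ))     # base + 0.5 GB per GB of input
       local threads=$(( (mem_gb + 15) / 16 ))         # one thread per 16 GB, rounded up
       echo "--mem=${mem_gb}G --cpus-per-task=${threads}"
   }

   size_request 10     # prints --mem=21G --cpus-per-task=2
   size_request 500    # prints --mem=266G --cpus-per-task=17

Options given on the ``sbatch`` command line take precedence over ``#SBATCH`` lines in the script, so the output can be passed as ``sbatch $(size_request "$input_gb") job.sh``, where ``job.sh`` stands for the project's own script.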

Example job scripts
~~~~~~~~~~~~~~~~~~~

Single-GPU job with light host resource requirements (e.g. fine-tuning a compact model on a small dataset):

.. code:: bash

   #!/bin/bash
   #SBATCH --account=ehpc-dev-XXXXXX-YY
   #SBATCH --partition=common
   #SBATCH --nodes=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=8
   #SBATCH --mem=64G
   #SBATCH --time=02:00:00

   # workload

Single-GPU job with heavy host resource requirements (e.g. large-batch inference with parallel data preprocessing):

.. code:: bash

   #!/bin/bash
   #SBATCH --account=ehpc-dev-XXXXXX-YY
   #SBATCH --partition=common
   #SBATCH --nodes=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=32
   #SBATCH --mem=384G
   #SBATCH --time=02:00:00

   # workload

Both configurations are valid. The resource values must reflect the measured or estimated requirements of the specific workload and input, not an arbitrary upper bound.
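The billing rate implied by a job script can be estimated from its ``#SBATCH`` lines before submission. The helper below is a sketch that only understands the option forms used in the scripts above (``--gres=gpu:N``, ``--cpus-per-task=N``, ``--mem=NG``); the name ``script_billing_rate`` is illustrative, and the weights are those of the billing formula in this document.

.. code:: bash

   # script_billing_rate FILE
   # Estimates billing units per minute from the #SBATCH lines of a job script.
   script_billing_rate() {
       awk '
           /^#SBATCH/ {
               split($2, kv, "=")
               if (kv[1] == "--cpus-per-task") cpu = kv[2]
               if (kv[1] == "--mem")  { mem = kv[2]; sub(/G$/, "", mem) }
               if (kv[1] == "--gres") { gpu = kv[2]; sub(/^gpu:/, "", gpu) }
           }
           END { printf "%.2f\n", cpu * 0.035714 + mem * 0.25 + gpu * 1.0 }
       ' "$1"
   }

Applied to the two scripts above, this yields 17.29 and 98.14 billing units per minute respectively. Run for equal durations, the two jobs average about 57.7, below the fair-share rate of 66.18.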

--------------

Summary
-------

+-------------------------------+------------------------+------------------------+
| Scenario                      | Billing rate           | GPU-hours reachable    |
+===============================+========================+========================+
| Fair share (1 GPU, 26         | ~66 billing units/min  | ~4,910 (all 5,000 at   |
| threads, 257 GB)              |                        | 65 units/min or less)  |
+-------------------------------+------------------------+------------------------+
| Moderate overuse (1 GPU, 56   | ~128 billing units/min | ~2,540                 |
| threads, 500 GB)              |                        |                        |
+-------------------------------+------------------------+------------------------+
| Severe overuse (1 GPU, 208    | ~523 billing units/min | ~620                   |
| threads, 2,058 GB)            |                        |                        |
+-------------------------------+------------------------+------------------------+

Reachable GPU-hours are computed for the example limits as 19,500,000 ÷ (billing rate × 60), capped at 5,000.

The billing mechanism does not penalise efficient GPU use. It penalises the allocation of node-wide host resources by a single job while a fraction of the node’s GPUs remain idle or unschedulable.