SLURM GPU billing

GPU hours allocation

Note

Each project granted access to the Discoverer+ GPU cluster — whether through EuroHPC, a national allocation, or another programme — receives an allocation of GPU hours. GPU hours represent the total GPU compute time available to the project.

On Discoverer+, the H200 GPUs are managed by SLURM as Generic RESources (GRES). SLURM does not natively schedule GPUs as a first-class resource type; GPUs are described and tracked as GRES. For this reason, a “GPU hour” in project documentation corresponds to a GRES hour in SLURM terms. The two are equivalent on this cluster.

The exact allocation size is specified in the project documentation. The figures used throughout this document — 5,000 GPU-hours and the corresponding billing limits — are examples only and do not represent any default allocation. The actual values must be taken from the project agreement.

SLURM tracks two resource counters independently for each project:

Counter     Example limit                              Description
gres/gpu    300,000 GPU-minutes (= 5,000 GPU-hours)    Raw GPU time consumed
billing     19,500,000 billing-minutes                 Weighted sum of all host resources consumed

The gres/gpu counter represents the actual GPU allocation. The billing counter is a fairness mechanism that accounts for the host CPU and memory consumed alongside the GPUs. Both are described in detail in the sections below.


Node resources

Each DGX node is equipped with 2 × 56-core processors, giving 112 physical cores in total. With simultaneous multithreading (SMT) enabled, each physical core exposes 2 logical CPU threads to the operating system. Linux and SLURM therefore both operate with 224 logical CPU threads per node, not physical cores.

All SLURM parameters that reference CPUs — --cpus-per-task, --ntasks, and related options — refer to logical CPU threads. A request of --cpus-per-task=8 allocates 8 threads, which may correspond to as few as 4 physical cores depending on thread placement. This is generally transparent to applications, but relevant for workloads sensitive to NUMA topology or last-level cache sharing.

Not all 224 threads are available to user jobs. 8 physical cores (16 logical threads) are permanently reserved for the WEKA storage client, which runs as a container and provides access to the cluster’s high-performance parallel filesystem. These cores are excluded from the SLURM scheduling pool via the CpuSpecList parameter in the node configuration.

The WEKA container cores are shared with the Linux kernel and general system tasks. Under normal operating conditions this has no effect on user jobs. In cases of unusually high parallel I/O load, the kernel scheduler may transiently borrow a small number of threads from a running job to service system work. This is a rare occurrence.

The resulting resource availability per node is as follows:

Resource               Physical        Logical (OS/SLURM)   Reserved             Available to jobs
CPU cores / threads    112 cores       224 threads          16 threads (WEKA)    208 threads
Memory                 2,063,425 MB    n/a                  5,000 MB (system)    ~2,058 GB
GPUs                   8               n/a                  0                    8

Billing model

For every minute a job runs, SLURM computes a billing score as the weighted sum of all resources allocated to that job:

billing/min = (CPU_threads × 0.035714) + (MemoryGB × 0.25) + (GPUs × 1.0)

CPU_threads is the number of logical CPU threads allocated, MemoryGB is the host RAM allocated in gigabytes, and GPUs is the number of GPU devices requested.

Memory units: SLURM measures memory in megabytes internally. The weight Mem=0.25G specifies 0.25 billing units per gigabyte (equivalent to 0.25 ÷ 1024 per megabyte).
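As a quick sanity check, the formula can be expressed as a small Python helper (the function is ours, not part of SLURM; the weights are the ones given above):

```python
# Hypothetical helper mirroring the billing formula above.
# Weights taken from this page: 0.035714 per logical CPU thread,
# 0.25 per GB of host RAM, 1.0 per GPU.
def billing_per_min(cpu_threads: int, mem_gb: float, gpus: int) -> float:
    """Billing units consumed per minute by a single job."""
    return cpu_threads * 0.035714 + mem_gb * 0.25 + gpus * 1.0

# Fair-share job (1 GPU, 26 threads, 257 GB):
print(round(billing_per_min(26, 257, 1), 2))   # 66.18
```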

On DGX nodes, memory is the dominant billing term. Each node carries ~2 TB of host RAM in direct support of its 8 H200 GPUs. A job that exhausts the node’s host memory prevents other jobs from being scheduled on the remaining GPUs, regardless of how many GPUs that job itself uses. The billing weights reflect this: memory over-allocation is penalised at the same scale as GPU over-allocation.


Fair-share of host resources per GPU

With 8 GPUs sharing 208 CPU threads and ~2,058 GB of RAM, the proportional 1/8th-node share per GPU is:

Resource               Available to jobs    Per GPU (1/8th share)
Logical CPU threads    208                  26
Memory                 ~2,058 GB            ~257 GB
GPUs                   8                    1

These figures define the reference billing rate — the rate at which a project consumes its billing budget when host resources are used in proportion to GPU allocation. A job allocated exactly this share produces:

CPU:   26  × 0.035714 =  0.93 billing units/min
Mem:  257  × 0.25     = 64.25 billing units/min
GPU:    1  × 1.0      =  1.00 billing units/min
─────────────────────────────────────────────
Total:                  66.18 billing units/min

At this rate, a 5,000 GPU-hour project (300,000 GPU-minutes) would consume:

300,000 × 66.18 = 19,854,000 billing-minutes

This calculation is the basis for deriving the correct billing cap — see Billing limit derivation.

The fair-share figures are a project-level target for the average across all jobs, not a per-job requirement. Individual jobs will legitimately deviate in both directions: a preprocessing job may need 2 CPU threads and 32 GB of RAM per GPU, while a large-batch training job may require 40 threads and 400 GB. Both are acceptable. What determines whether a project can reach its full GPU-hours allocation is whether the average resource consumption across all submitted jobs remains close to the fair-share values. A project whose jobs consistently over-allocate CPU threads or memory relative to GPUs will exhaust the billing budget before the GPU-hours are spent, and the remaining GPU-hours become permanently unreachable for that allocation period.
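To illustrate the project-level average, the following sketch (our own helper, reusing the two per-GPU job profiles mentioned above) shows how a light job and a heavy job can average out below the fair-share rate:

```python
def billing_per_min(cpu_threads, mem_gb, gpus):
    # Weights from the billing formula earlier on this page.
    return cpu_threads * 0.035714 + mem_gb * 0.25 + gpus * 1.0

light = billing_per_min(2, 32, 1)     # preprocessing-style job: ~9.07 units/min
heavy = billing_per_min(40, 400, 1)   # large-batch training job: ~102.43 units/min

# If both jobs accumulate the same number of GPU-minutes, the project's
# average rate is the plain mean. Here it stays below the 66.18 fair-share
# rate, so the billing budget still stretches to the full GPU allocation.
average = (light + heavy) / 2
print(round(average, 2))   # 55.75
```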


Allocation limits

The gres/gpu limit

The gres/gpu limit enforces the project’s GPU allocation ceiling. Once the counter is exhausted, no further GPU jobs can be submitted or run under the project until the allocation period resets or is extended.

The billing limit

The billing limit is a fairness control. It penalises jobs that consume disproportionate host resources — CPU threads or memory — relative to the number of GPUs they request.

A job requesting 1 GPU but all 208 available CPU threads and all available memory produces the following billing rate:

CPU:  208  × 0.035714 =   7.43 billing units/min
Mem: 2058  × 0.25     = 514.50 billing units/min
GPU:    1  × 1.0      =   1.00 billing units/min
─────────────────────────────────────────────────
Total:                  522.93 billing units/min

This is approximately 8× the fair-share rate. Such a job exhausts the billing budget 8× faster than expected, while simultaneously occupying all node CPU and memory resources and preventing the remaining 7 GPUs from being assigned to any other job.

The billing limit stops this pattern before the full GPU allocation is consumed. Billing cannot be refilled until the gres/gpu allocation is also exhausted — over-consuming billing therefore results in permanent loss of the remaining GPU-hours for that allocation period.


Billing limit derivation

For a project to be able to consume its full GPU-hours allocation under fair-share usage, the billing cap must be set no lower than the total billing cost of that usage:

billing cap = gres/gpu minutes × billing rate at fair share
            = 300,000 × 66.18
            = 19,854,000 billing-minutes

The recommended configuration for a 5,000 GPU-hour project is therefore:

GrpTRESMins=billing=19500000,gres/gpu=300000

The billing cap is rounded down from the theoretical maximum, leaving a margin of roughly 1.8%. In practice this means the project's average consumption must sit at or slightly below the fair-share rate for the full GPU-hours allocation to remain reachable.
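The derivation and the size of the margin can be checked numerically (a sketch; the 19,500,000 figure is the example cap from this page):

```python
GPU_MINUTES = 300_000    # 5,000 GPU-hours
FAIR_RATE = 66.18        # billing units/min at the 1/8th-node share

cap_theoretical = GPU_MINUTES * FAIR_RATE   # 19,854,000 billing-minutes
cap_configured = 19_500_000                 # example cap from this page

# Fraction of the theoretical budget given up by rounding the cap down.
margin = 1 - cap_configured / cap_theoretical
print(round(cap_theoretical), round(margin * 100, 1))   # 19854000 1.8
```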

With this configuration, a project consuming resources at or near the fair-share rate will exhaust gres/gpu first and reach all 5,000 GPU-hours. A project that consistently over-allocates CPU threads or memory relative to GPU count will exhaust billing first, with the remaining GPU-hours becoming unreachable.


Job resource estimation

Requesting more resources than a job will actually use is not a harmless safety margin. SLURM bills on allocated resources, not on actual consumption. A job allocated 500 GB of RAM that uses only 80 GB is billed for 500 GB for the full duration of the job. Over-allocation of memory in particular — given that memory dominates the billing formula — is the fastest route to premature exhaustion of the billing budget.

CPU thread estimation

The number of CPU threads a job should request depends on the actual parallelism of the workload. The node exposes 208 logical threads to jobs; each physical core contributes 2 threads, so a request of 26 threads corresponds to 13 physical cores. The correct value is determined by the application: frameworks vary in how many threads they spawn for data loading, preprocessing, and compute. The actual thread utilisation should be measured using htop, sstat, or equivalent tools during an initial test run, and subsequent submissions adjusted accordingly.

Memory estimation

Host RAM (the memory specified via --mem or --mem-per-cpu) is distinct from GPU VRAM. Host RAM is used for data staging, CPU-side preprocessing, framework runtime overhead, and any tensors or buffers that reside on the host. A baseline estimate for a single-GPU job is: size of the model checkpoint on CPU plus the size of one data batch in host memory plus approximately 20% overhead. For recurring workloads, the peak resident set size from a completed job can be retrieved with:

sacct -j <jobid> --format=JobID,MaxRSS

and used to calibrate the memory request for subsequent runs.
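A small helper (our own, not part of SLURM) can turn a MaxRSS value, which sacct reports with a K/M/G/T unit suffix, into a --mem request with roughly 20% headroom:

```python
import math

def mem_request_from_maxrss(maxrss: str, headroom: float = 0.2) -> str:
    """Convert a sacct MaxRSS string (e.g. '78.5G' or '81234K') into
    a --mem value in whole gigabytes with headroom on top."""
    units = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}
    value, unit = float(maxrss[:-1]), maxrss[-1].upper()
    gb = value * units[unit]
    # Round up to whole GB after adding the headroom factor.
    return f"{math.ceil(gb * (1 + headroom))}G"

print(mem_request_from_maxrss("78.5G"))    # 95G
```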

Input-dependent scaling

Resource requirements frequently scale with input size. A job processing a 10 GB dataset and a job processing a 500 GB dataset may require substantially different amounts of host memory and CPU threads. Submission scripts should parameterise these values rather than using a single fixed request for all input sizes.
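One way to parameterise a submission is to generate the #SBATCH header from the input size. The sketch below uses made-up sizing heuristics (8 threads and 64 GB per GPU as a floor, plus room to stage the dataset with 20% overhead); it illustrates the idea and is not a site policy:

```python
import math

def sbatch_header(account: str, gpus: int, dataset_gb: float) -> str:
    """Emit an #SBATCH header whose CPU and memory requests scale
    with the size of the input dataset (illustrative heuristic)."""
    threads = 8 * gpus                                    # heuristic floor
    mem_gb = max(64 * gpus, math.ceil(dataset_gb * 1.2))  # staging + 20%
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        "#SBATCH --partition=common",
        "#SBATCH --nodes=1",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={threads}",
        f"#SBATCH --mem={mem_gb}G",
        "#SBATCH --time=02:00:00",
    ])

print(sbatch_header("ehpc-dev-XXXXXX-YY", 1, 500))
```

For a 500 GB input this yields --mem=600G; a 10 GB input falls back to the 64 GB per-GPU floor.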

Example job scripts

Single-GPU job with light host resource requirements (e.g. fine-tuning a compact model on a small dataset):

#!/bin/bash
#SBATCH --account=ehpc-dev-XXXXXX-YY
#SBATCH --partition=common
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=02:00:00

# workload

Single-GPU job with heavy host resource requirements (e.g. large-batch inference with parallel data preprocessing):

#!/bin/bash
#SBATCH --account=ehpc-dev-XXXXXX-YY
#SBATCH --partition=common
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=32
#SBATCH --mem=384G
#SBATCH --time=02:00:00

# workload

Both configurations are valid. The resource values must reflect the measured or estimated requirements of the specific workload and input, not an arbitrary upper bound.


Summary

Scenario                                          Billing rate              GPU-hours reachable
Fair share (1 GPU, 26 threads, 257 GB)            ~66 billing units/min     All 5,000
Moderate overuse (1 GPU, 56 threads, 500 GB)      ~128 billing units/min    ~2,540
Severe overuse (1 GPU, 208 threads, 2,058 GB)     ~523 billing units/min    ~620

The billing mechanism does not penalise efficient GPU use. It penalises the allocation of node-wide host resources by a single job while a fraction of the node’s GPUs remain idle or unschedulable.