SLURM GPU billing¶
GPU hours allocation¶
Note
Each project granted access to the Discoverer+ GPU cluster — whether through EuroHPC, a national allocation, or another programme — receives an allocation of GPU hours. GPU hours represent the total GPU compute time available to the project.
On Discoverer+, the H200 GPUs are managed by SLURM as Generic RESources (GRES). SLURM does not natively schedule GPUs as a first-class resource type; GPUs are described and tracked as GRES. For this reason, a “GPU hour” in project documentation corresponds to a GRES hour in SLURM terms. The two are equivalent on this cluster.
The exact allocation size is specified in the project documentation. The figures used throughout this document — 5,000 GPU-hours and the corresponding billing limits — are examples only and do not represent any default allocation. The actual values must be taken from the project agreement.
SLURM tracks two resource counters independently for each project:
| Counter | Example limit | Description |
|---|---|---|
| gres/gpu | 300,000 GPU-minutes (= 5,000 GPU-hours) | Raw GPU time consumed |
| billing | 19,500,000 billing-minutes | Weighted sum of all host resources consumed |
The gres/gpu counter represents the actual GPU allocation. The billing counter is a fairness mechanism that accounts for the host CPU and memory consumed alongside the GPUs. Both are described in detail in the sections below.
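Both counters appear together in the association's GrpTRESMins string in SLURM accounting. As a sketch, a small helper can split such a string into its per-counter limits (the limit values below are the example figures from the table above):

```shell
#!/bin/sh
# Extract a single counter's limit from a GrpTRESMins-style string,
# e.g. as reported by `sacctmgr show assoc format=GrpTRESMins`.
tres_limit() {  # $1 = TRES string, $2 = counter name
  printf '%s\n' "$1" | tr ',' '\n' | awk -F'=' -v k="$2" '
    { key = $1; for (i = 2; i < NF; i++) key = key "=" $i
      if (key == k) print $NF }'
}

limits="billing=19500000,gres/gpu=300000"
tres_limit "$limits" "gres/gpu"   # prints 300000
tres_limit "$limits" "billing"    # prints 19500000
```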
Node resources¶
Each DGX node is equipped with 2 × 56-core processors, giving 112 physical cores in total. With simultaneous multithreading (SMT) enabled, each physical core exposes 2 logical CPU threads to the operating system. Linux and SLURM therefore both operate with 224 logical CPU threads per node, not physical cores.
All SLURM parameters that reference CPUs — --cpus-per-task, --ntasks, and related options — refer to logical CPU threads. A request of --cpus-per-task=8 allocates 8 threads, which may correspond to as few as 4 physical cores depending on thread placement. This is generally transparent to applications, but relevant for workloads sensitive to NUMA topology or last-level cache sharing.
Not all 224 threads are available to user jobs. 8 physical cores (16 logical threads) are permanently reserved for the WEKA storage client, which runs as a container and provides access to the cluster’s high-performance parallel filesystem. These cores are excluded from the SLURM scheduling pool via the CpuSpecList parameter in the node configuration.
The WEKA container cores are shared with the Linux kernel and general system tasks. Under normal operating conditions this has no effect on user jobs. In cases of unusually high parallel I/O load, the kernel scheduler may transiently borrow a small number of threads from a running job to service system work. This is a rare occurrence.
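In slurm.conf terms, such a reservation is expressed through the node's CpuSpecList. A hypothetical node definition might look like the following; the node name, the specific CPU IDs, and the GRES string are illustrative, not the actual Discoverer+ configuration:

```
NodeName=dgx01 CPUs=224 Sockets=2 CoresPerSocket=56 ThreadsPerCore=2 \
    RealMemory=2063425 Gres=gpu:h200:8 CpuSpecList=208-223
```

CpuSpecList takes abstract CPU IDs; listing 16 of them removes 16 logical threads (8 physical cores) from the pool SLURM offers to jobs.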
The resulting resource availability per node is as follows:
| Resource | Physical | Logical (OS/SLURM) | Reserved | Available to jobs |
|---|---|---|---|---|
| CPU cores / threads | 112 cores | 224 threads | 16 threads (WEKA) | 208 threads |
| Memory | 2,063,425 MB | — | 5,000 MB (system) | ~2,058 GB |
| GPUs | 8 | — | — | 8 |
Billing model¶
For every minute a job runs, SLURM computes a billing score as the weighted sum of all resources allocated to that job:
billing/min = (CPU_threads × 0.035714) + (MemoryGB × 0.25) + (GPUs × 1.0)
where CPU_threads is the number of logical CPU threads allocated, MemoryGB is the host RAM allocated in gigabytes, and GPUs is the number of GPU devices requested.
Memory units: SLURM measures memory in megabytes internally. The memory weight Mem=0.25G (as written in TRESBillingWeights) specifies 0.25 billing units per gigabyte, equivalent to 0.25 ÷ 1024 per megabyte.
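The formula can be reproduced with a small helper. The weights are those given above; the sample inputs are the fair-share slice and the severe-overuse case used later in this document:

```shell
#!/bin/sh
# billing/min = CPU_threads * 0.035714 + MemoryGB * 0.25 + GPUs * 1.0
billing_rate() {  # $1 = CPU threads, $2 = host RAM in GB, $3 = GPUs
  awk -v cpu="$1" -v mem="$2" -v gpu="$3" \
    'BEGIN { printf "%.2f\n", cpu * 0.035714 + mem * 0.25 + gpu * 1.0 }'
}

billing_rate 26 257 1     # fair-share slice of a node: prints 66.18
billing_rate 208 2058 1   # 1 GPU plus the whole node's CPU and RAM: prints 522.93
```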
On DGX nodes, memory is the dominant billing term. Each node carries ~2 TB of host RAM in direct support of its 8 H200 GPUs. A job that exhausts the node’s host memory prevents other jobs from being scheduled on the remaining GPUs, regardless of how many GPUs that job itself uses. The billing weights reflect this: memory over-allocation is penalised at the same scale as GPU over-allocation.
Allocation limits¶
The gres/gpu limit¶
The gres/gpu limit enforces the project’s GPU allocation ceiling. Once the counter is exhausted, no further GPU jobs can be submitted or run under the project until the allocation period resets or is extended.
The billing limit¶
The billing limit is a fairness control. It penalises jobs that consume disproportionate host resources — CPU threads or memory — relative to the number of GPUs they request.
A job requesting 1 GPU but all 208 available CPU threads and all available memory produces the following billing rate:
CPU: 208 × 0.035714 = 7.43 billing units/min
Mem: 2058 × 0.25 = 514.50 billing units/min
GPU: 1 × 1.0 = 1.00 billing units/min
─────────────────────────────────────────────────
Total: 522.93 billing units/min
This is approximately 8× the fair-share rate. Such a job exhausts the billing budget 8× faster than expected, while simultaneously occupying all node CPU and memory resources and preventing the remaining 7 GPUs from being assigned to any other job.
The billing limit stops this pattern before the full GPU allocation is consumed. The billing budget is not replenished within an allocation period: once billing is exhausted, any GPU-hours still remaining under gres/gpu become unreachable for the rest of the period. Over-consuming billing therefore results in permanent loss of the remaining GPU-hours.
Billing limit derivation¶
For a project to be able to consume its full GPU-hours allocation under fair-share usage, the billing cap must be set no lower than the total billing cost of that usage. Fair-share usage here means a proportional per-GPU slice of the node: 1 GPU with 26 threads (208 ÷ 8) and 257 GB of RAM (2,058 ÷ 8), which bills at 26 × 0.035714 + 257 × 0.25 + 1 × 1.0 ≈ 66.18 units per minute:
billing cap = gres/gpu minutes × billing rate at fair share
            = 300,000 × 66.18
            = 19,854,000 billing-minutes
The recommended configuration for a 5,000 GPU-hour project is therefore:
GrpTRESMins=billing=19500000,gres/gpu=300000
The billing cap is rounded slightly below the theoretical maximum to provide a conservative margin for natural variation in job resource requests.
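The derivation can be checked numerically, using the rounded fair-share rate of 66.18 and the example allocation of 300,000 GPU-minutes:

```shell
#!/bin/sh
# billing cap = GPU-minute allocation x fair-share billing rate
awk 'BEGIN {
  gpu_minutes = 300000    # 5,000 GPU-hours
  fair_rate   = 66.18     # billing units per GPU-minute at fair share
  printf "%.0f\n", gpu_minutes * fair_rate   # prints 19854000
}'
```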
With this configuration, a project consuming resources at or near the fair-share rate will exhaust gres/gpu first and reach all 5,000 GPU-hours. A project that consistently over-allocates CPU threads or memory relative to GPU count will exhaust billing first, with the remaining GPU-hours becoming unreachable.
Job resource estimation¶
Requesting more resources than a job will use is not a harmless safety margin. SLURM bills on allocated resources, not on actual consumption. A job allocated 500 GB of RAM that uses only 80 GB is billed for 500 GB for the full duration of the job. Over-allocation of memory in particular, given that memory dominates the billing formula, is the fastest route to premature exhaustion of the billing budget.
CPU thread estimation¶
The number of CPU threads a job should request depends on the actual parallelism of the workload. The node exposes 208 logical threads to jobs; each physical core contributes 2 threads, so a request of 26 threads corresponds to 13 physical cores. The correct value is determined by the application: frameworks vary in how many threads they spawn for data loading, preprocessing, and compute. The actual thread utilisation should be measured using htop, sstat, or equivalent tools during an initial test run, and subsequent submissions adjusted accordingly.
Memory estimation¶
Host RAM (the memory specified via --mem or --mem-per-cpu) is distinct from GPU VRAM. Host RAM is used for data staging, CPU-side preprocessing, framework runtime overhead, and any tensors or buffers that reside on the host. A baseline estimate for a single-GPU job is: size of the model checkpoint on CPU plus the size of one data batch in host memory plus approximately 20% overhead. For recurring workloads, the peak resident set size from a completed job can be retrieved with:
sacct -j <jobid> --format=JobID,MaxRSS
and used to calibrate the memory request for subsequent runs.
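As a sketch, the MaxRSS figure can be converted into a suggested --mem value mechanically. The K/M/G suffix handling below is an assumption about how sacct formats MaxRSS on this system, and the 20% headroom matches the baseline estimate above:

```shell
#!/bin/sh
# Turn a MaxRSS figure from sacct (e.g. "83886080K") into a suggested
# --mem request in GB with ~20% headroom, rounded up.
suggest_mem() {  # $1 = MaxRSS string from sacct
  awk -v rss="$1" 'BEGIN {
    n = rss + 0                       # numeric prefix
    unit = substr(rss, length(rss))   # trailing unit letter, if any
    if      (unit == "K") gb = n / 1048576
    else if (unit == "M") gb = n / 1024
    else if (unit == "G") gb = n
    else                  gb = n / 1073741824   # plain bytes
    printf "%dG\n", int(gb * 6 / 5) + 1         # +20% headroom
  }'
}

suggest_mem 83886080K   # 80 GB peak RSS -> prints 97G
```

The result can be passed directly as the --mem value in the next submission.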
Input-dependent scaling¶
Resource requirements frequently scale with input size. A job processing a 10 GB dataset and a job processing a 500 GB dataset may require substantially different amounts of host memory and CPU threads. Submission scripts should parameterise these values rather than using a single fixed request for all input sizes.
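One way to parameterise the request is to derive --mem from the input size at submission time. A minimal sketch, in which the headroom factor, the fixed base, and the job.sh wrapper are all illustrative values rather than site defaults:

```shell
#!/bin/bash
# Map a dataset size in GB to a --mem value in GB: +20% headroom plus
# a fixed base for framework runtime overhead (both values illustrative).
mem_for_gb() {
  echo $(( $1 * 6 / 5 + 8 ))
}

# In a submission wrapper (job.sh and the du invocation are placeholders):
#   size_gb=$(du -sBG "$dataset" | cut -f1 | tr -d 'G')
#   sbatch --mem="$(mem_for_gb "$size_gb")G" --gres=gpu:1 job.sh

mem_for_gb 10    # 10 GB input  -> prints 20
mem_for_gb 500   # 500 GB input -> prints 608
```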
Example job scripts¶
Single-GPU job with light host resource requirements (e.g. fine-tuning a compact model on a small dataset):
#!/bin/bash
#SBATCH --account=ehpc-dev-XXXXXX-YY
#SBATCH --partition=common
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=02:00:00
# workload
Single-GPU job with heavy host resource requirements (e.g. large-batch inference with parallel data preprocessing):
#!/bin/bash
#SBATCH --account=ehpc-dev-XXXXXX-YY
#SBATCH --partition=common
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=32
#SBATCH --mem=384G
#SBATCH --time=02:00:00
# workload
Both configurations are valid. The resource values must reflect the measured or estimated requirements of the specific workload and input, not an arbitrary upper bound.
Summary¶
| Scenario | Billing rate | GPU-hours reachable |
|---|---|---|
| Fair share (1 GPU, 26 threads, 257 GB) | ~66 billing units/min | All 5,000 |
| Moderate overuse (1 GPU, 56 threads, 500 GB) | ~128 billing units/min | ~2,540 |
| Severe overuse (1 GPU, 208 threads, 2,058 GB) | ~523 billing units/min | ~620 |
The billing mechanism does not penalise efficient GPU use. It penalises the allocation of node-wide host resources by a single job while a fraction of the node’s GPUs remain idle or unschedulable.
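The reachable-hours column follows from dividing the billing cap by the per-minute billing rate. A quick check, assuming the example cap of 19,500,000 billing-minutes and a single-GPU job:

```shell
#!/bin/sh
# GPU-hours reachable when the billing cap binds:
# cap / (billing rate per minute) = minutes of 1-GPU runtime; /60 -> hours.
reachable_hours() {  # $1 = billing rate in units/min
  awk -v cap=19500000 -v rate="$1" 'BEGIN { printf "%d\n", cap / rate / 60 }'
}

reachable_hours 522.93   # rate from the severe-overuse example above
```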