Unified memory on DGX H200 with Sapphire Rapids: optimisation, optimal use, and programming models

Table of contents

Overview

This document is intended for users of the Discoverer+ HPC cluster who are developing or optimising GPU-accelerated workloads. It provides guidance on memory management strategies when considering whether to use CUDA unified memory or explicit memory transfers during GPU-driven computations.

The DGX H200 nodes on Discoverer+ use Intel Sapphire Rapids CPUs paired with NVIDIA H200 GPUs. This configuration does not provide true hardware-coherent unified memory between CPU and GPU. Unlike the NVIDIA GH200 Grace Hopper Superchip (which uses NVLink-C2C for a coherent shared memory space), the H200 in the Discoverer+ DGX system connects to the CPU via PCIe Gen5, requiring explicit or software-managed data movement between CPU DDR5 and GPU HBM3e memory pools.

Understanding these architectural constraints is essential for achieving optimal performance on Discoverer+.

Hardware architecture: DGX H200 vs GH200

DGX H200 on Discoverer+

Component	Specification
CPU	Dual Intel Xeon Platinum (Sapphire Rapids)
CPU memory	2 TB DDR5
GPU	8x NVIDIA H200 SXM5
GPU memory	141 GB HBM3e per GPU (1.128 TB total)
GPU memory bandwidth	4.8 TB/s per GPU
CPU-GPU interconnect	PCIe Gen5 (128 GB/s bidirectional)
GPU-GPU interconnect	NVLink 4 (900 GB/s per GPU)

The CPU and GPU memory are physically separate. Data must traverse PCIe to move between them.

GH200 Grace Hopper (for comparison)

Component	Specification
CPU	NVIDIA Grace (72 Arm cores)
CPU memory	480 GB LPDDR5X
GPU	Hopper H100/H200
GPU memory	96-144 GB HBM3/HBM3e
CPU-GPU interconnect	NVLink-C2C (900 GB/s bidirectional)
Memory model	Hardware-coherent unified address space

The GH200 provides a true unified memory pool of up to 624 GB accessible by both CPU and GPU without explicit data copies.

Memory interconnect bandwidth comparison

Interconnect	Bandwidth	Relative speed
H200 HBM3e (on-chip)	4,800 GB/s	1x (baseline)
NVLink 4 (GPU-GPU)	900 GB/s	0.19x
NVLink-C2C (GH200 CPU-GPU)	900 GB/s	0.19x
PCIe Gen5 x16 (DGX H200 CPU-GPU)	128 GB/s	0.027x
DDR5-4800 (8 channels)	~307 GB/s	0.064x

The PCIe Gen5 link between Sapphire Rapids and H200 is 7x slower than NVLink-C2C. This is the fundamental architectural difference that affects unified memory viability.

Programming models for H200 memory management

Option 1: explicit memory management

This approach is recommended for performance-critical workloads. Use cudaMalloc for GPU memory and cudaMallocHost for pinned CPU memory with explicit cudaMemcpyAsync transfers.

// Allocate pinned host memory (required for async transfers)
float *h_data;
cudaMallocHost(&h_data, size);

// Allocate device memory
float *d_data;
cudaMalloc(&d_data, size);

// Async transfer with stream overlap
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
kernel<<<blocks, threads, 0, stream>>>(d_data);
cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);

Advantages:

Deterministic performance

Full control over data movement

No page-fault overhead

Maximum PCIe bandwidth utilisation

Disadvantages:

More complex code

Manual memory management

Must track data residency

Option 2: CUDA managed memory

Use cudaMallocManaged to allocate memory accessible from both CPU and GPU code.

float *data;
cudaMallocManaged(&data, size);

// CPU can access
for (int i = 0; i < n; i++) data[i] = i;

// GPU can access (data migrates on demand)
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

// CPU reads back (data migrates back)
printf("%f\n", data[0]);

Performance tuning with hints:

// Pre-fetch to GPU before kernel launch
cudaMemPrefetchAsync(data, size, deviceId, stream);

// Mark as read-mostly (creates read-only copies on both)
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId);

// Set preferred location
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId);

// Allow direct access from GPU to CPU memory (zero-copy)
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId);

Advantages:

Simpler programming model

No explicit copy management

Good for rapid prototyping

Disadvantages:

Page faults cause latency spikes

On-demand migration over PCIe is slow

Performance unpredictable without hints

Option 3: zero-copy

Map pinned host memory directly into GPU address space.

float *h_data;
cudaHostAlloc(&h_data, size, cudaHostAllocMapped);

float *d_ptr;
cudaHostGetDevicePointer(&d_ptr, h_data, 0);

// GPU accesses CPU memory directly over PCIe
kernel<<<blocks, threads>>>(d_ptr);

Suitable use cases:

Data accessed only once (no benefit from caching in HBM)

Working set exceeds GPU memory

Streaming workloads

Limitation: bandwidth is constrained to PCIe (128 GB/s), which is 37x slower than HBM3e.

Complete C++ example with compilation options

The following example demonstrates unified memory with prefetching and memory hints. Save this as unified_memory_demo.cu.

// Created by Veselin Kolev <v.kolev@discoverer.bg>
// 18 November 2025
// Licence: GPLv2

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>

#define CHECK_CUDA(call)                                                       \
    do {                                                                       \
        cudaError_t err = call;                                                \
        if (err != cudaSuccess) {                                              \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                                  \
            exit(EXIT_FAILURE);                                                \
        }                                                                      \
    } while (0)

__global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

void run_with_explicit_memory(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    // Host pinned memory
    float *h_a, *h_b, *h_c;
    CHECK_CUDA(cudaMallocHost(&h_a, bytes));
    CHECK_CUDA(cudaMallocHost(&h_b, bytes));
    CHECK_CUDA(cudaMallocHost(&h_c, bytes));

    // Device memory
    float *d_a, *d_b, *d_c;
    CHECK_CUDA(cudaMalloc(&d_a, bytes));
    CHECK_CUDA(cudaMalloc(&d_b, bytes));
    CHECK_CUDA(cudaMalloc(&d_c, bytes));

    // Initialise on host
    for (size_t i = 0; i < n; i++) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = static_cast<float>(i * 2);
    }

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));

    auto start = std::chrono::high_resolution_clock::now();

    // Explicit transfers
    CHECK_CUDA(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream));
    CHECK_CUDA(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads, 0, stream>>>(d_a, d_b, d_c, n);

    CHECK_CUDA(cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Explicit memory:    %.3f ms\n", ms);

    // Cleanup
    CHECK_CUDA(cudaStreamDestroy(stream));
    CHECK_CUDA(cudaFree(d_a));
    CHECK_CUDA(cudaFree(d_b));
    CHECK_CUDA(cudaFree(d_c));
    CHECK_CUDA(cudaFreeHost(h_a));
    CHECK_CUDA(cudaFreeHost(h_b));
    CHECK_CUDA(cudaFreeHost(h_c));
}

void run_with_unified_memory_naive(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    CHECK_CUDA(cudaMallocManaged(&a, bytes));
    CHECK_CUDA(cudaMallocManaged(&b, bytes));
    CHECK_CUDA(cudaMallocManaged(&c, bytes));

    // Initialise on host (data resides on CPU initially)
    for (size_t i = 0; i < n; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i * 2);
    }

    auto start = std::chrono::high_resolution_clock::now();

    // No prefetch - relies on demand paging
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(a, b, c, n);
    CHECK_CUDA(cudaDeviceSynchronize());

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Unified (naive):    %.3f ms\n", ms);

    CHECK_CUDA(cudaFree(a));
    CHECK_CUDA(cudaFree(b));
    CHECK_CUDA(cudaFree(c));
}

void run_with_unified_memory_optimised(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    CHECK_CUDA(cudaMallocManaged(&a, bytes));
    CHECK_CUDA(cudaMallocManaged(&b, bytes));
    CHECK_CUDA(cudaMallocManaged(&c, bytes));

    // Set preferred location to GPU
    CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, device));
    CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, device));
    CHECK_CUDA(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device));

    // Mark inputs as read-mostly (allows read replicas)
    CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device));
    CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device));

    // Initialise on host
    for (size_t i = 0; i < n; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i * 2);
    }

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));

    auto start = std::chrono::high_resolution_clock::now();

    // Prefetch to GPU before kernel
    CHECK_CUDA(cudaMemPrefetchAsync(a, bytes, device, stream));
    CHECK_CUDA(cudaMemPrefetchAsync(b, bytes, device, stream));
    CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, device, stream));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads, 0, stream>>>(a, b, c, n);

    // Prefetch result back to CPU if needed
    CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Unified (optimised): %.3f ms\n", ms);

    CHECK_CUDA(cudaStreamDestroy(stream));
    CHECK_CUDA(cudaFree(a));
    CHECK_CUDA(cudaFree(b));
    CHECK_CUDA(cudaFree(c));
}

int main(int argc, char** argv) {
    int device = 0;
    if (argc > 1) {
        device = atoi(argv[1]);
    }

    CHECK_CUDA(cudaSetDevice(device));

    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Concurrent managed access: %s\n",
           prop.concurrentManagedAccess ? "yes" : "no");
    printf("Pageable memory access: %s\n",
           prop.pageableMemoryAccess ? "yes" : "no");
    printf("\n");

    // 256 MB of data (64M floats)
    size_t n = 64 * 1024 * 1024;
    printf("Vector size: %zu elements (%.1f MB)\n\n", n, n * sizeof(float) / 1e6);

    // Warm-up
    run_with_explicit_memory(n, device);

    // Benchmark runs
    printf("\nBenchmark results:\n");
    run_with_explicit_memory(n, device);
    run_with_unified_memory_naive(n, device);
    run_with_unified_memory_optimised(n, device);

    return 0;
}

Compilation options

Warning

Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

For H200 (Hopper architecture, compute capability 9.0):

# Basic compilation
nvcc -arch=sm_90 -O3 -o unified_memory_demo unified_memory_demo.cu

# With additional optimisations
nvcc -arch=sm_90 -O3 --use_fast_math -Xcompiler "-O3 -march=native" \
     -o unified_memory_demo unified_memory_demo.cu

# Debug build with line info (for profiling with nsys/ncu)
nvcc -arch=sm_90 -O2 -g -lineinfo -o unified_memory_demo_debug unified_memory_demo.cu

Running the example

Warning

Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

# Run on default GPU (device 0)
./unified_memory_demo

# Run on specific GPU
./unified_memory_demo 2

# Profile with Nsight Systems
nsys profile --stats=true ./unified_memory_demo

# Profile with Nsight Compute (kernel analysis)
ncu --set full -o unified_memory_report ./unified_memory_demo

Expected output

On an H200 with the 256 MB test vector, typical results:

Device: NVIDIA H200
Compute capability: 9.0
Concurrent managed access: yes
Pageable memory access: yes

Vector size: 67108864 elements (256.0 MB)

Benchmark results:
Explicit memory:    12.5 ms
Unified (naive):    45.2 ms
Unified (optimised): 14.1 ms

The naive unified memory approach incurs significant page-fault overhead. With prefetching and hints, performance approaches explicit memory management whilst retaining simpler code.

Key compilation flags explained

Flag	Purpose
`-arch=sm_90`	Target H200/Hopper architecture (compute capability 9.0)
`-O3`	Maximum optimisation level
`--use_fast_math`	Enable fast maths (slightly reduced precision)
`-Xcompiler "-O3"`	Pass flags to host compiler
`-g -lineinfo`	Include debug info for profiler source correlation

For multi-GPU DGX H200 systems, you may also use:

# Generate code for multiple architectures (if sharing binary)
nvcc -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -O3 -o unified_memory_demo unified_memory_demo.cu

Selecting the appropriate approach

Explicit memory management

Use when:

Training large models with predictable data access patterns
Inference serving with strict latency SLAs (p95/p99 targets)
Memory-bound HPC kernels where every GB/s matters
Production workloads requiring deterministic performance

Managed memory

Use when:

Prototyping and exploratory development
Complex data structures with unpredictable access (graphs, sparse matrices)
Oversubscription scenarios where data exceeds 141 GB per GPU
Code portability across different NVIDIA platforms is required

Zero-copy

Use when:

Data is accessed once and discarded
CPU preprocessing feeds directly into GPU computation
Results stream back to CPU for I/O immediately
Working set is larger than GPU memory but access is sequential

AI workload considerations

Large language model inference

The H200’s 141 GB HBM3e is specifically designed to hold:

Model weights for 70B+ parameter models without sharding

Large KV caches for long-context inference (128K+ tokens)

For LLM inference on DGX H200:

Keep model weights resident in HBM (explicit allocation)

KV cache should remain in HBM if possible

Only use managed memory for KV cache offload when context exceeds HBM capacity

Performance impact of memory location (Llama-70B inference):

Configuration	Per-token latency
H200 single GPU (model fits in HBM)	~75 ms
H100 single GPU (requires quantisation)	~142 ms

Source: WhiteFiber benchmark analysis. The 1.9x improvement stems from eliminating memory bottlenecks when the full model resides in HBM3e without quantisation or multi-GPU sharding.

Training workloads

Gradient accumulation, optimiser states, activations should all reside in HBM
Use gradient checkpointing to reduce memory footprint before resorting to CPU offload
If using CPU offload (e.g., DeepSpeed ZeRO-Offload), explicit async transfers outperform managed memory

Discoverer+ configuration

Given the 1.128 TB of total HBM3e across 8 GPUs with NVLink interconnect:

Keep working data in GPU memory
Use NVLink for GPU-GPU communication (not CPU-GPU)
Treat CPU DDR as staging/preprocessing area, not working memory
Use explicit transfers with pinned memory when CPU interaction is required

Sapphire Rapids CXL capabilities

Intel Sapphire Rapids supports CXL 1.1, which provides:

Memory expansion via CXL.mem protocol

Cache coherency for CXL-attached accelerators

CXL 1.1 limitations relevant to H200:

NVIDIA H200 does not use CXL (uses PCIe/NVLink)

CXL peer-to-peer between GPU and CXL memory requires CPU mediation

CXL memory bandwidth is lower than DDR5 (single PCIe lane equivalent)

CXL in your system is relevant for:

Memory capacity expansion beyond 2 TB DDR5

Future CXL-enabled accelerators

Memory tiering (hot data in DDR5, cold data in CXL expanders)

It does not provide unified memory with H200 GPUs.

DGX H200 memory hierarchy

+-------------------------------------------------------------+
|                        DGX H200 system                      |
+-------------------------------------------------------------+
|                                                             |
|  +-------------------------------------------------------+  |
|  |              GPU domain (8x H200)                     |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  | 141 GB  | | 141 GB  | | 141 GB  | | 141 GB  | ...  |  |
|  |  | HBM3e   | | HBM3e   | | HBM3e   | | HBM3e   |      |  |
|  |  | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s |      |  |
|  |  +----+----+ +----+----+ +----+----+ +----+----+      |  |
|  |       +----------+----+----+----------+               |  |
|  |                    NVLink 4                           |  |
|  |                   900 GB/s                            |  |
|  +---------------------------+---------------------------+  |
|                              |                              |
|                        PCIe Gen5                            |
|                        128 GB/s                             |
|                              |                              |
|  +---------------------------+---------------------------+  |
|  |              CPU domain (Sapphire Rapids)             |  |
|  |  +------------------------------------------------+   |  |
|  |  |              2 TB DDR5 memory                  |   |  |
|  |  |              ~307 GB/s bandwidth               |   |  |
|  |  +------------------------------------------------+   |  |
|  |  +------------------------------------------------+   |  |
|  |  |         CXL 1.1 (optional expansion)           |   |  |
|  |  +------------------------------------------------+   |  |
|  +-------------------------------------------------------+  |
|                                                             |
+-------------------------------------------------------------+

Performance tiers:
  Tier 1: HBM3e      -> 4,800 GB/s  (keep hot data here)
  Tier 2: NVLink     ->   900 GB/s  (GPU-GPU sharing)
  Tier 3: DDR5       ->   307 GB/s  (CPU preprocessing)
  Tier 4: PCIe       ->   128 GB/s  (CPU-GPU transfers)
  Tier 5: CXL        ->    32 GB/s  (cold storage expansion)

Usage guidelines

Default to explicit memory management for production AI workloads
Use managed memory with prefetch hints when:
- Porting existing CPU code quickly
- Working set slightly exceeds GPU memory
- Access patterns are irregular
Avoid relying on demand paging over PCIe; the 128 GB/s limit creates severe bottlenecks
For models exceeding single-GPU memory, use tensor parallelism over NVLink rather than CPU offload
Profile with nsys and ncu to identify memory movement bottlenecks before optimising
Consider GH200-based systems if your workloads fundamentally require CPU-GPU memory coherency

References

NVIDIA CUDA Programming Guide: Unified Memory
NVIDIA H200 Product Brief
NVIDIA DGX H200 Specifications
NVIDIA Grace Hopper Superchip Architecture Whitepaper
Intel Xeon Scalable Processor Max Series Documentation
CXL 4.0 Infrastructure Planning Guide