Unified memory on DGX H200 with Sapphire Rapids: optimisation, optimal use, and programming models

Table of contents

  1. Overview

  2. Hardware architecture: DGX H200 vs GH200

  3. Memory interconnect bandwidth comparison

  4. Programming models for H200 memory management

  5. Complete C++ example with compilation options

  6. Selecting the appropriate approach

  7. AI workload considerations

  8. Sapphire Rapids CXL capabilities

  9. DGX H200 memory hierarchy

  10. Usage guidelines

  11. References

Overview

This document is intended for users of the Discoverer+ HPC cluster who are developing or optimising GPU-accelerated workloads. It provides guidance on memory management strategies when considering whether to use CUDA unified memory or explicit memory transfers during GPU-driven computations.

The DGX H200 nodes on Discoverer+ use Intel Sapphire Rapids CPUs paired with NVIDIA H200 GPUs. This configuration does not provide true hardware-coherent unified memory between CPU and GPU. Unlike the NVIDIA GH200 Grace Hopper Superchip (which uses NVLink-C2C for a coherent shared memory space), the H200 in the Discoverer+ DGX system connects to the CPU via PCIe Gen5, requiring explicit or software-managed data movement between CPU DDR5 and GPU HBM3e memory pools.

Understanding these architectural constraints is essential for achieving optimal performance on Discoverer+.

Hardware architecture: DGX H200 vs GH200

DGX H200 on Discoverer+

Component

Specification

CPU

Dual Intel Xeon Platinum (Sapphire Rapids)

CPU memory

2 TB DDR5

GPU

8x NVIDIA H200 SXM5

GPU memory

141 GB HBM3e per GPU (1.128 TB total)

GPU memory bandwidth

4.8 TB/s per GPU

CPU-GPU interconnect

PCIe Gen5 (128 GB/s bidirectional)

GPU-GPU interconnect

NVLink 4 (900 GB/s per GPU)

The CPU and GPU memory are physically separate. Data must traverse PCIe to move between them.

GH200 Grace Hopper (for comparison)

Component

Specification

CPU

NVIDIA Grace (72 Arm cores)

CPU memory

480 GB LPDDR5X

GPU

Hopper H100/H200

GPU memory

96-144 GB HBM3/HBM3e

CPU-GPU interconnect

NVLink-C2C (900 GB/s bidirectional)

Memory model

Hardware-coherent unified address space

The GH200 provides a true unified memory pool of up to 624 GB accessible by both CPU and GPU without explicit data copies.

Memory interconnect bandwidth comparison

Interconnect

Bandwidth

Relative speed

H200 HBM3e (on-chip)

4,800 GB/s

1x (baseline)

NVLink 4 (GPU-GPU)

900 GB/s

0.19x

NVLink-C2C (GH200 CPU-GPU)

900 GB/s

0.19x

PCIe Gen5 x16 (DGX H200 CPU-GPU)

128 GB/s

0.027x

DDR5-4800 (8 channels)

~307 GB/s

0.064x

The PCIe Gen5 link between Sapphire Rapids and H200 is 7x slower than NVLink-C2C. This is the fundamental architectural difference that affects unified memory viability.

Programming models for H200 memory management

Option 1: explicit memory management

This approach is recommended for performance-critical workloads. Use cudaMalloc for GPU memory and cudaMallocHost for pinned CPU memory with explicit cudaMemcpyAsync transfers.

// Allocate pinned host memory (required for async transfers)
float *h_data;
cudaMallocHost(&h_data, size);

// Allocate device memory
float *d_data;
cudaMalloc(&d_data, size);

// Async transfer with stream overlap
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
kernel<<<blocks, threads, 0, stream>>>(d_data);
cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);

Advantages:

  • Deterministic performance

  • Full control over data movement

  • No page-fault overhead

  • Maximum PCIe bandwidth utilisation

Disadvantages:

  • More complex code

  • Manual memory management

  • Must track data residency

Option 2: CUDA managed memory

Use cudaMallocManaged to allocate memory accessible from both CPU and GPU code.

float *data;
cudaMallocManaged(&data, size);

// CPU can access
for (int i = 0; i < n; i++) data[i] = i;

// GPU can access (data migrates on demand)
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

// CPU reads back (data migrates back)
printf("%f\n", data[0]);

Performance tuning with hints:

// Pre-fetch to GPU before kernel launch
cudaMemPrefetchAsync(data, size, deviceId, stream);

// Mark as read-mostly (creates read-only copies on both)
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId);

// Set preferred location
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId);

// Allow direct access from GPU to CPU memory (zero-copy)
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId);

Advantages:

  • Simpler programming model

  • No explicit copy management

  • Good for rapid prototyping

Disadvantages:

  • Page faults cause latency spikes

  • On-demand migration over PCIe is slow

  • Performance unpredictable without hints

Option 3: zero-copy

Map pinned host memory directly into GPU address space.

float *h_data;
cudaHostAlloc(&h_data, size, cudaHostAllocMapped);

float *d_ptr;
cudaHostGetDevicePointer(&d_ptr, h_data, 0);

// GPU accesses CPU memory directly over PCIe
kernel<<<blocks, threads>>>(d_ptr);

Suitable use cases:

  • Data accessed only once (no benefit from caching in HBM)

  • Working set exceeds GPU memory

  • Streaming workloads

Limitation: bandwidth is constrained to PCIe (128 GB/s), which is 37x slower than HBM3e.

Complete C++ example with compilation options

The following example demonstrates unified memory with prefetching and memory hints. Save this as unified_memory_demo.cu.

// Created by Veselin Kolev <v.kolev@discoverer.bg>
// 18 November 2025
// Licence: GPLv2

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>

#define CHECK_CUDA(call)                                                       \
    do {                                                                       \
        cudaError_t err = call;                                                \
        if (err != cudaSuccess) {                                              \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                                  \
            exit(EXIT_FAILURE);                                                \
        }                                                                      \
    } while (0)

__global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

void run_with_explicit_memory(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    // Host pinned memory
    float *h_a, *h_b, *h_c;
    CHECK_CUDA(cudaMallocHost(&h_a, bytes));
    CHECK_CUDA(cudaMallocHost(&h_b, bytes));
    CHECK_CUDA(cudaMallocHost(&h_c, bytes));

    // Device memory
    float *d_a, *d_b, *d_c;
    CHECK_CUDA(cudaMalloc(&d_a, bytes));
    CHECK_CUDA(cudaMalloc(&d_b, bytes));
    CHECK_CUDA(cudaMalloc(&d_c, bytes));

    // Initialise on host
    for (size_t i = 0; i < n; i++) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = static_cast<float>(i * 2);
    }

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));

    auto start = std::chrono::high_resolution_clock::now();

    // Explicit transfers
    CHECK_CUDA(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream));
    CHECK_CUDA(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads, 0, stream>>>(d_a, d_b, d_c, n);

    CHECK_CUDA(cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Explicit memory:    %.3f ms\n", ms);

    // Cleanup
    CHECK_CUDA(cudaStreamDestroy(stream));
    CHECK_CUDA(cudaFree(d_a));
    CHECK_CUDA(cudaFree(d_b));
    CHECK_CUDA(cudaFree(d_c));
    CHECK_CUDA(cudaFreeHost(h_a));
    CHECK_CUDA(cudaFreeHost(h_b));
    CHECK_CUDA(cudaFreeHost(h_c));
}

void run_with_unified_memory_naive(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    CHECK_CUDA(cudaMallocManaged(&a, bytes));
    CHECK_CUDA(cudaMallocManaged(&b, bytes));
    CHECK_CUDA(cudaMallocManaged(&c, bytes));

    // Initialise on host (data resides on CPU initially)
    for (size_t i = 0; i < n; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i * 2);
    }

    auto start = std::chrono::high_resolution_clock::now();

    // No prefetch - relies on demand paging
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(a, b, c, n);
    CHECK_CUDA(cudaDeviceSynchronize());

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Unified (naive):    %.3f ms\n", ms);

    CHECK_CUDA(cudaFree(a));
    CHECK_CUDA(cudaFree(b));
    CHECK_CUDA(cudaFree(c));
}

void run_with_unified_memory_optimised(size_t n, int device) {
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    CHECK_CUDA(cudaMallocManaged(&a, bytes));
    CHECK_CUDA(cudaMallocManaged(&b, bytes));
    CHECK_CUDA(cudaMallocManaged(&c, bytes));

    // Set preferred location to GPU
    CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, device));
    CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, device));
    CHECK_CUDA(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device));

    // Mark inputs as read-mostly (allows read replicas)
    CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device));
    CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device));

    // Initialise on host
    for (size_t i = 0; i < n; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i * 2);
    }

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));

    auto start = std::chrono::high_resolution_clock::now();

    // Prefetch to GPU before kernel
    CHECK_CUDA(cudaMemPrefetchAsync(a, bytes, device, stream));
    CHECK_CUDA(cudaMemPrefetchAsync(b, bytes, device, stream));
    CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, device, stream));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads, 0, stream>>>(a, b, c, n);

    // Prefetch result back to CPU if needed
    CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));

    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    printf("Unified (optimised): %.3f ms\n", ms);

    CHECK_CUDA(cudaStreamDestroy(stream));
    CHECK_CUDA(cudaFree(a));
    CHECK_CUDA(cudaFree(b));
    CHECK_CUDA(cudaFree(c));
}

int main(int argc, char** argv) {
    int device = 0;
    if (argc > 1) {
        device = atoi(argv[1]);
    }

    CHECK_CUDA(cudaSetDevice(device));

    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Concurrent managed access: %s\n",
           prop.concurrentManagedAccess ? "yes" : "no");
    printf("Pageable memory access: %s\n",
           prop.pageableMemoryAccess ? "yes" : "no");
    printf("\n");

    // 256 MB of data (64M floats)
    size_t n = 64 * 1024 * 1024;
    printf("Vector size: %zu elements (%.1f MB)\n\n", n, n * sizeof(float) / 1e6);

    // Warm-up
    run_with_explicit_memory(n, device);

    // Benchmark runs
    printf("\nBenchmark results:\n");
    run_with_explicit_memory(n, device);
    run_with_unified_memory_naive(n, device);
    run_with_unified_memory_optimised(n, device);

    return 0;
}

Compilation options

Warning

Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

For H200 (Hopper architecture, compute capability 9.0):

# Basic compilation
nvcc -arch=sm_90 -O3 -o unified_memory_demo unified_memory_demo.cu

# With additional optimisations
nvcc -arch=sm_90 -O3 --use_fast_math -Xcompiler "-O3 -march=native" \
     -o unified_memory_demo unified_memory_demo.cu

# Debug build with line info (for profiling with nsys/ncu)
nvcc -arch=sm_90 -O2 -g -lineinfo -o unified_memory_demo_debug unified_memory_demo.cu

Running the example

Warning

Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

# Run on default GPU (device 0)
./unified_memory_demo

# Run on specific GPU
./unified_memory_demo 2

# Profile with Nsight Systems
nsys profile --stats=true ./unified_memory_demo

# Profile with Nsight Compute (kernel analysis)
ncu --set full -o unified_memory_report ./unified_memory_demo

Expected output

On an H200 with the 256 MB test vector, typical results:

Device: NVIDIA H200
Compute capability: 9.0
Concurrent managed access: yes
Pageable memory access: yes

Vector size: 67108864 elements (256.0 MB)

Benchmark results:
Explicit memory:    12.5 ms
Unified (naive):    45.2 ms
Unified (optimised): 14.1 ms

The naive unified memory approach incurs significant page-fault overhead. With prefetching and hints, performance approaches explicit memory management whilst retaining simpler code.

Key compilation flags explained

Flag

Purpose

-arch=sm_90

Target H200/Hopper architecture (compute capability 9.0)

-O3

Maximum optimisation level

--use_fast_math

Enable fast maths (slightly reduced precision)

-Xcompiler "-O3"

Pass flags to host compiler

-g -lineinfo

Include debug info for profiler source correlation

For multi-GPU DGX H200 systems, you may also use:

# Generate code for multiple architectures (if sharing binary)
nvcc -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -O3 -o unified_memory_demo unified_memory_demo.cu

Selecting the appropriate approach

Explicit memory management

Use when:

  1. Training large models with predictable data access patterns

  2. Inference serving with strict latency SLAs (p95/p99 targets)

  3. Memory-bound HPC kernels where every GB/s matters

  4. Production workloads requiring deterministic performance

Managed memory

Use when:

  1. Prototyping and exploratory development

  2. Complex data structures with unpredictable access (graphs, sparse matrices)

  3. Oversubscription scenarios where data exceeds 141 GB per GPU

  4. Code portability across different NVIDIA platforms is required

Zero-copy

Use when:

  1. Data is accessed once and discarded

  2. CPU preprocessing feeds directly into GPU computation

  3. Results stream back to CPU for I/O immediately

  4. Working set is larger than GPU memory but access is sequential

AI workload considerations

Large language model inference

The H200’s 141 GB HBM3e is specifically designed to hold:

  • Model weights for 70B+ parameter models without sharding

  • Large KV caches for long-context inference (128K+ tokens)

For LLM inference on DGX H200:

  • Keep model weights resident in HBM (explicit allocation)

  • KV cache should remain in HBM if possible

  • Only use managed memory for KV cache offload when context exceeds HBM capacity

Performance impact of memory location (Llama-70B inference):

Configuration

Per-token latency

H200 single GPU (model fits in HBM)

~75 ms

H100 single GPU (requires quantisation)

~142 ms

Source: WhiteFiber benchmark analysis. The 1.9x improvement stems from eliminating memory bottlenecks when the full model resides in HBM3e without quantisation or multi-GPU sharding.

Training workloads

  • Gradient accumulation, optimiser states, activations should all reside in HBM

  • Use gradient checkpointing to reduce memory footprint before resorting to CPU offload

  • If using CPU offload (e.g., DeepSpeed ZeRO-Offload), explicit async transfers outperform managed memory

Discoverer+ configuration

Given the 1.128 TB of total HBM3e across 8 GPUs with NVLink interconnect:

  1. Keep working data in GPU memory

  2. Use NVLink for GPU-GPU communication (not CPU-GPU)

  3. Treat CPU DDR as staging/preprocessing area, not working memory

  4. Use explicit transfers with pinned memory when CPU interaction is required

Sapphire Rapids CXL capabilities

Intel Sapphire Rapids supports CXL 1.1, which provides:

  • Memory expansion via CXL.mem protocol

  • Cache coherency for CXL-attached accelerators

CXL 1.1 limitations relevant to H200:

  • NVIDIA H200 does not use CXL (uses PCIe/NVLink)

  • CXL peer-to-peer between GPU and CXL memory requires CPU mediation

  • CXL memory bandwidth is lower than DDR5 (single PCIe lane equivalent)

CXL in your system is relevant for:

  • Memory capacity expansion beyond 2 TB DDR5

  • Future CXL-enabled accelerators

  • Memory tiering (hot data in DDR5, cold data in CXL expanders)

It does not provide unified memory with H200 GPUs.

DGX H200 memory hierarchy

+-------------------------------------------------------------+
|                        DGX H200 system                      |
+-------------------------------------------------------------+
|                                                             |
|  +-------------------------------------------------------+  |
|  |              GPU domain (8x H200)                     |  |
|  |  +---------+ +---------+ +---------+ +---------+      |  |
|  |  | 141 GB  | | 141 GB  | | 141 GB  | | 141 GB  | ...  |  |
|  |  | HBM3e   | | HBM3e   | | HBM3e   | | HBM3e   |      |  |
|  |  | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s |      |  |
|  |  +----+----+ +----+----+ +----+----+ +----+----+      |  |
|  |       +----------+----+----+----------+               |  |
|  |                    NVLink 4                           |  |
|  |                   900 GB/s                            |  |
|  +---------------------------+---------------------------+  |
|                              |                              |
|                        PCIe Gen5                            |
|                        128 GB/s                             |
|                              |                              |
|  +---------------------------+---------------------------+  |
|  |              CPU domain (Sapphire Rapids)             |  |
|  |  +------------------------------------------------+   |  |
|  |  |              2 TB DDR5 memory                  |   |  |
|  |  |              ~307 GB/s bandwidth               |   |  |
|  |  +------------------------------------------------+   |  |
|  |  +------------------------------------------------+   |  |
|  |  |         CXL 1.1 (optional expansion)           |   |  |
|  |  +------------------------------------------------+   |  |
|  +-------------------------------------------------------+  |
|                                                             |
+-------------------------------------------------------------+

Performance tiers:
  Tier 1: HBM3e      -> 4,800 GB/s  (keep hot data here)
  Tier 2: NVLink     ->   900 GB/s  (GPU-GPU sharing)
  Tier 3: DDR5       ->   307 GB/s  (CPU preprocessing)
  Tier 4: PCIe       ->   128 GB/s  (CPU-GPU transfers)
  Tier 5: CXL        ->    32 GB/s  (cold storage expansion)

Usage guidelines

  1. Default to explicit memory management for production AI workloads

  2. Use managed memory with prefetch hints when:

    • Porting existing CPU code quickly

    • Working set slightly exceeds GPU memory

    • Access patterns are irregular

  3. Avoid relying on demand paging over PCIe; the 128 GB/s limit creates severe bottlenecks

  4. For models exceeding single-GPU memory, use tensor parallelism over NVLink rather than CPU offload

  5. Profile with nsys and ncu to identify memory movement bottlenecks before optimising

  6. Consider GH200-based systems if your workloads fundamentally require CPU-GPU memory coherency

References

  • NVIDIA CUDA Programming Guide: Unified Memory

  • NVIDIA H200 Product Brief

  • NVIDIA DGX H200 Specifications

  • NVIDIA Grace Hopper Superchip Architecture Whitepaper

  • Intel Xeon Scalable Processor Max Series Documentation

  • CXL 4.0 Infrastructure Planning Guide