Unified memory on DGX H200 with Sapphire Rapids: optimisation, optimal use, and programming models
Table of contents
Overview
This document is intended for users of the Discoverer+ HPC cluster who are developing or optimising GPU-accelerated workloads. It provides guidance on memory management strategies when considering whether to use CUDA unified memory or explicit memory transfers during GPU-driven computations.
The DGX H200 nodes on Discoverer+ use Intel Sapphire Rapids CPUs paired with NVIDIA H200 GPUs. This configuration does not provide true hardware-coherent unified memory between CPU and GPU. Unlike the NVIDIA GH200 Grace Hopper Superchip (which uses NVLink-C2C for a coherent shared memory space), the H200 in the Discoverer+ DGX system connects to the CPU via PCIe Gen5, requiring explicit or software-managed data movement between CPU DDR5 and GPU HBM3e memory pools.
Understanding these architectural constraints is essential for achieving optimal performance on Discoverer+.
Hardware architecture: DGX H200 vs GH200
DGX H200 on Discoverer+
Component |
Specification |
|---|---|
CPU |
Dual Intel Xeon Platinum (Sapphire Rapids) |
CPU memory |
2 TB DDR5 |
GPU |
8x NVIDIA H200 SXM5 |
GPU memory |
141 GB HBM3e per GPU (1.128 TB total) |
GPU memory bandwidth |
4.8 TB/s per GPU |
CPU-GPU interconnect |
PCIe Gen5 (128 GB/s bidirectional) |
GPU-GPU interconnect |
NVLink 4 (900 GB/s per GPU) |
The CPU and GPU memory are physically separate. Data must traverse PCIe to move between them.
GH200 Grace Hopper (for comparison)
Component |
Specification |
|---|---|
CPU |
NVIDIA Grace (72 Arm cores) |
CPU memory |
480 GB LPDDR5X |
GPU |
Hopper H100/H200 |
GPU memory |
96-144 GB HBM3/HBM3e |
CPU-GPU interconnect |
NVLink-C2C (900 GB/s bidirectional) |
Memory model |
Hardware-coherent unified address space |
The GH200 provides a true unified memory pool of up to 624 GB accessible by both CPU and GPU without explicit data copies.
Memory interconnect bandwidth comparison
Interconnect |
Bandwidth |
Relative speed |
|---|---|---|
H200 HBM3e (on-chip) |
4,800 GB/s |
1x (baseline) |
NVLink 4 (GPU-GPU) |
900 GB/s |
0.19x |
NVLink-C2C (GH200 CPU-GPU) |
900 GB/s |
0.19x |
PCIe Gen5 x16 (DGX H200 CPU-GPU) |
128 GB/s |
0.027x |
DDR5-4800 (8 channels) |
~307 GB/s |
0.064x |
The PCIe Gen5 link between Sapphire Rapids and H200 is 7x slower than NVLink-C2C. This is the fundamental architectural difference that affects unified memory viability.
Programming models for H200 memory management
Option 1: explicit memory management
This approach is recommended for performance-critical workloads. Use
cudaMalloc for GPU memory and cudaMallocHost for pinned CPU
memory with explicit cudaMemcpyAsync transfers.
// Allocate pinned host memory (required for async transfers)
float *h_data;
cudaMallocHost(&h_data, size);
// Allocate device memory
float *d_data;
cudaMalloc(&d_data, size);
// Async transfer with stream overlap
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
kernel<<<blocks, threads, 0, stream>>>(d_data);
cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);
Advantages:
Deterministic performance
Full control over data movement
No page-fault overhead
Maximum PCIe bandwidth utilisation
Disadvantages:
More complex code
Manual memory management
Must track data residency
Option 2: CUDA managed memory
Use cudaMallocManaged to allocate memory accessible from both CPU
and GPU code.
float *data;
cudaMallocManaged(&data, size);
// CPU can access
for (int i = 0; i < n; i++) data[i] = i;
// GPU can access (data migrates on demand)
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();
// CPU reads back (data migrates back)
printf("%f\n", data[0]);
Performance tuning with hints:
// Pre-fetch to GPU before kernel launch
cudaMemPrefetchAsync(data, size, deviceId, stream);
// Mark as read-mostly (creates read-only copies on both)
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId);
// Set preferred location
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId);
// Allow direct access from GPU to CPU memory (zero-copy)
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId);
Advantages:
Simpler programming model
No explicit copy management
Good for rapid prototyping
Disadvantages:
Page faults cause latency spikes
On-demand migration over PCIe is slow
Performance unpredictable without hints
Option 3: zero-copy
Map pinned host memory directly into GPU address space.
float *h_data;
cudaHostAlloc(&h_data, size, cudaHostAllocMapped);
float *d_ptr;
cudaHostGetDevicePointer(&d_ptr, h_data, 0);
// GPU accesses CPU memory directly over PCIe
kernel<<<blocks, threads>>>(d_ptr);
Suitable use cases:
Data accessed only once (no benefit from caching in HBM)
Working set exceeds GPU memory
Streaming workloads
Limitation: bandwidth is constrained to PCIe (128 GB/s), which is 37x slower than HBM3e.
Complete C++ example with compilation options
The following example demonstrates unified memory with prefetching and
memory hints. Save this as unified_memory_demo.cu.
// Created by Veselin Kolev <v.kolev@discoverer.bg>
// 18 November 2025
// Licence: GPLv2
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>
#define CHECK_CUDA(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
__global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
void run_with_explicit_memory(size_t n, int device) {
size_t bytes = n * sizeof(float);
// Host pinned memory
float *h_a, *h_b, *h_c;
CHECK_CUDA(cudaMallocHost(&h_a, bytes));
CHECK_CUDA(cudaMallocHost(&h_b, bytes));
CHECK_CUDA(cudaMallocHost(&h_c, bytes));
// Device memory
float *d_a, *d_b, *d_c;
CHECK_CUDA(cudaMalloc(&d_a, bytes));
CHECK_CUDA(cudaMalloc(&d_b, bytes));
CHECK_CUDA(cudaMalloc(&d_c, bytes));
// Initialise on host
for (size_t i = 0; i < n; i++) {
h_a[i] = static_cast<float>(i);
h_b[i] = static_cast<float>(i * 2);
}
cudaStream_t stream;
CHECK_CUDA(cudaStreamCreate(&stream));
auto start = std::chrono::high_resolution_clock::now();
// Explicit transfers
CHECK_CUDA(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream));
CHECK_CUDA(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream));
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads, 0, stream>>>(d_a, d_b, d_c, n);
CHECK_CUDA(cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream));
CHECK_CUDA(cudaStreamSynchronize(stream));
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
printf("Explicit memory: %.3f ms\n", ms);
// Cleanup
CHECK_CUDA(cudaStreamDestroy(stream));
CHECK_CUDA(cudaFree(d_a));
CHECK_CUDA(cudaFree(d_b));
CHECK_CUDA(cudaFree(d_c));
CHECK_CUDA(cudaFreeHost(h_a));
CHECK_CUDA(cudaFreeHost(h_b));
CHECK_CUDA(cudaFreeHost(h_c));
}
void run_with_unified_memory_naive(size_t n, int device) {
size_t bytes = n * sizeof(float);
float *a, *b, *c;
CHECK_CUDA(cudaMallocManaged(&a, bytes));
CHECK_CUDA(cudaMallocManaged(&b, bytes));
CHECK_CUDA(cudaMallocManaged(&c, bytes));
// Initialise on host (data resides on CPU initially)
for (size_t i = 0; i < n; i++) {
a[i] = static_cast<float>(i);
b[i] = static_cast<float>(i * 2);
}
auto start = std::chrono::high_resolution_clock::now();
// No prefetch - relies on demand paging
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(a, b, c, n);
CHECK_CUDA(cudaDeviceSynchronize());
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
printf("Unified (naive): %.3f ms\n", ms);
CHECK_CUDA(cudaFree(a));
CHECK_CUDA(cudaFree(b));
CHECK_CUDA(cudaFree(c));
}
void run_with_unified_memory_optimised(size_t n, int device) {
size_t bytes = n * sizeof(float);
float *a, *b, *c;
CHECK_CUDA(cudaMallocManaged(&a, bytes));
CHECK_CUDA(cudaMallocManaged(&b, bytes));
CHECK_CUDA(cudaMallocManaged(&c, bytes));
// Set preferred location to GPU
CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, device));
CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, device));
CHECK_CUDA(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device));
// Mark inputs as read-mostly (allows read replicas)
CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device));
CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device));
// Initialise on host
for (size_t i = 0; i < n; i++) {
a[i] = static_cast<float>(i);
b[i] = static_cast<float>(i * 2);
}
cudaStream_t stream;
CHECK_CUDA(cudaStreamCreate(&stream));
auto start = std::chrono::high_resolution_clock::now();
// Prefetch to GPU before kernel
CHECK_CUDA(cudaMemPrefetchAsync(a, bytes, device, stream));
CHECK_CUDA(cudaMemPrefetchAsync(b, bytes, device, stream));
CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, device, stream));
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads, 0, stream>>>(a, b, c, n);
// Prefetch result back to CPU if needed
CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream));
CHECK_CUDA(cudaStreamSynchronize(stream));
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
printf("Unified (optimised): %.3f ms\n", ms);
CHECK_CUDA(cudaStreamDestroy(stream));
CHECK_CUDA(cudaFree(a));
CHECK_CUDA(cudaFree(b));
CHECK_CUDA(cudaFree(c));
}
int main(int argc, char** argv) {
int device = 0;
if (argc > 1) {
device = atoi(argv[1]);
}
CHECK_CUDA(cudaSetDevice(device));
cudaDeviceProp prop;
CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
printf("Device: %s\n", prop.name);
printf("Compute capability: %d.%d\n", prop.major, prop.minor);
printf("Concurrent managed access: %s\n",
prop.concurrentManagedAccess ? "yes" : "no");
printf("Pageable memory access: %s\n",
prop.pageableMemoryAccess ? "yes" : "no");
printf("\n");
// 256 MB of data (64M floats)
size_t n = 64 * 1024 * 1024;
printf("Vector size: %zu elements (%.1f MB)\n\n", n, n * sizeof(float) / 1e6);
// Warm-up
run_with_explicit_memory(n, device);
// Benchmark runs
printf("\nBenchmark results:\n");
run_with_explicit_memory(n, device);
run_with_unified_memory_naive(n, device);
run_with_unified_memory_optimised(n, device);
return 0;
}
Compilation options
Warning
Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.
For H200 (Hopper architecture, compute capability 9.0):
# Basic compilation
nvcc -arch=sm_90 -O3 -o unified_memory_demo unified_memory_demo.cu
# With additional optimisations
nvcc -arch=sm_90 -O3 --use_fast_math -Xcompiler "-O3 -march=native" \
-o unified_memory_demo unified_memory_demo.cu
# Debug build with line info (for profiling with nsys/ncu)
nvcc -arch=sm_90 -O2 -g -lineinfo -o unified_memory_demo_debug unified_memory_demo.cu
Running the example
Warning
Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.
# Run on default GPU (device 0)
./unified_memory_demo
# Run on specific GPU
./unified_memory_demo 2
# Profile with Nsight Systems
nsys profile --stats=true ./unified_memory_demo
# Profile with Nsight Compute (kernel analysis)
ncu --set full -o unified_memory_report ./unified_memory_demo
Expected output
On an H200 with the 256 MB test vector, typical results:
Device: NVIDIA H200
Compute capability: 9.0
Concurrent managed access: yes
Pageable memory access: yes
Vector size: 67108864 elements (256.0 MB)
Benchmark results:
Explicit memory: 12.5 ms
Unified (naive): 45.2 ms
Unified (optimised): 14.1 ms
The naive unified memory approach incurs significant page-fault overhead. With prefetching and hints, performance approaches explicit memory management whilst retaining simpler code.
Key compilation flags explained
Flag |
Purpose |
|---|---|
|
Target H200/Hopper architecture (compute capability 9.0) |
|
Maximum optimisation level |
|
Enable fast maths (slightly reduced precision) |
|
Pass flags to host compiler |
|
Include debug info for profiler source correlation |
For multi-GPU DGX H200 systems, you may also use:
# Generate code for multiple architectures (if sharing binary)
nvcc -gencode arch=compute_90,code=sm_90 \
-gencode arch=compute_90,code=compute_90 \
-O3 -o unified_memory_demo unified_memory_demo.cu
Selecting the appropriate approach
Explicit memory management
Use when:
Training large models with predictable data access patterns
Inference serving with strict latency SLAs (p95/p99 targets)
Memory-bound HPC kernels where every GB/s matters
Production workloads requiring deterministic performance
Managed memory
Use when:
Prototyping and exploratory development
Complex data structures with unpredictable access (graphs, sparse matrices)
Oversubscription scenarios where data exceeds 141 GB per GPU
Code portability across different NVIDIA platforms is required
Zero-copy
Use when:
Data is accessed once and discarded
CPU preprocessing feeds directly into GPU computation
Results stream back to CPU for I/O immediately
Working set is larger than GPU memory but access is sequential
AI workload considerations
Large language model inference
The H200’s 141 GB HBM3e is specifically designed to hold:
Model weights for 70B+ parameter models without sharding
Large KV caches for long-context inference (128K+ tokens)
For LLM inference on DGX H200:
Keep model weights resident in HBM (explicit allocation)
KV cache should remain in HBM if possible
Only use managed memory for KV cache offload when context exceeds HBM capacity
Performance impact of memory location (Llama-70B inference):
Configuration |
Per-token latency |
|---|---|
H200 single GPU (model fits in HBM) |
~75 ms |
H100 single GPU (requires quantisation) |
~142 ms |
Source: WhiteFiber benchmark analysis. The 1.9x improvement stems from eliminating memory bottlenecks when the full model resides in HBM3e without quantisation or multi-GPU sharding.
Training workloads
Gradient accumulation, optimiser states, activations should all reside in HBM
Use gradient checkpointing to reduce memory footprint before resorting to CPU offload
If using CPU offload (e.g., DeepSpeed ZeRO-Offload), explicit async transfers outperform managed memory
Discoverer+ configuration
Given the 1.128 TB of total HBM3e across 8 GPUs with NVLink interconnect:
Keep working data in GPU memory
Use NVLink for GPU-GPU communication (not CPU-GPU)
Treat CPU DDR as staging/preprocessing area, not working memory
Use explicit transfers with pinned memory when CPU interaction is required
Sapphire Rapids CXL capabilities
Intel Sapphire Rapids supports CXL 1.1, which provides:
Memory expansion via CXL.mem protocol
Cache coherency for CXL-attached accelerators
CXL 1.1 limitations relevant to H200:
NVIDIA H200 does not use CXL (uses PCIe/NVLink)
CXL peer-to-peer between GPU and CXL memory requires CPU mediation
CXL memory bandwidth is lower than DDR5 (single PCIe lane equivalent)
CXL in your system is relevant for:
Memory capacity expansion beyond 2 TB DDR5
Future CXL-enabled accelerators
Memory tiering (hot data in DDR5, cold data in CXL expanders)
It does not provide unified memory with H200 GPUs.
DGX H200 memory hierarchy
+-------------------------------------------------------------+
| DGX H200 system |
+-------------------------------------------------------------+
| |
| +-------------------------------------------------------+ |
| | GPU domain (8x H200) | |
| | +---------+ +---------+ +---------+ +---------+ | |
| | | 141 GB | | 141 GB | | 141 GB | | 141 GB | ... | |
| | | HBM3e | | HBM3e | | HBM3e | | HBM3e | | |
| | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | |
| | +----+----+ +----+----+ +----+----+ +----+----+ | |
| | +----------+----+----+----------+ | |
| | NVLink 4 | |
| | 900 GB/s | |
| +---------------------------+---------------------------+ |
| | |
| PCIe Gen5 |
| 128 GB/s |
| | |
| +---------------------------+---------------------------+ |
| | CPU domain (Sapphire Rapids) | |
| | +------------------------------------------------+ | |
| | | 2 TB DDR5 memory | | |
| | | ~307 GB/s bandwidth | | |
| | +------------------------------------------------+ | |
| | +------------------------------------------------+ | |
| | | CXL 1.1 (optional expansion) | | |
| | +------------------------------------------------+ | |
| +-------------------------------------------------------+ |
| |
+-------------------------------------------------------------+
Performance tiers:
Tier 1: HBM3e -> 4,800 GB/s (keep hot data here)
Tier 2: NVLink -> 900 GB/s (GPU-GPU sharing)
Tier 3: DDR5 -> 307 GB/s (CPU preprocessing)
Tier 4: PCIe -> 128 GB/s (CPU-GPU transfers)
Tier 5: CXL -> 32 GB/s (cold storage expansion)
Usage guidelines
Default to explicit memory management for production AI workloads
Use managed memory with prefetch hints when:
Porting existing CPU code quickly
Working set slightly exceeds GPU memory
Access patterns are irregular
Avoid relying on demand paging over PCIe; the 128 GB/s limit creates severe bottlenecks
For models exceeding single-GPU memory, use tensor parallelism over NVLink rather than CPU offload
Profile with
nsysandncuto identify memory movement bottlenecks before optimisingConsider GH200-based systems if your workloads fundamentally require CPU-GPU memory coherency
References
NVIDIA CUDA Programming Guide: Unified Memory
NVIDIA H200 Product Brief
NVIDIA DGX H200 Specifications
NVIDIA Grace Hopper Superchip Architecture Whitepaper
Intel Xeon Scalable Processor Max Series Documentation
CXL 4.0 Infrastructure Planning Guide