Unified memory on DGX H200 with Sapphire Rapids: optimisation, optimal use, and programming models ================================================================================================== Table of contents ----------------- 1. `Overview <#overview>`__ 2. `Hardware architecture: DGX H200 vs GH200 <#hardware-architecture-dgx-h200-vs-gh200>`__ - `DGX H200 on Discoverer+ <#dgx-h200-on-discoverer>`__ - `GH200 Grace Hopper (for comparison) <#gh200-grace-hopper-for-comparison>`__ 3. `Memory interconnect bandwidth comparison <#memory-interconnect-bandwidth-comparison>`__ 4. `Programming models for H200 memory management <#programming-models-for-h200-memory-management>`__ - `Option 1: explicit memory management <#option-1-explicit-memory-management>`__ - `Option 2: CUDA managed memory <#option-2-cuda-managed-memory>`__ - `Option 3: zero-copy <#option-3-zero-copy>`__ 5. `Complete C++ example with compilation options <#complete-c-example-with-compilation-options>`__ - `Compilation options <#compilation-options>`__ - `Running the example <#running-the-example>`__ - `Expected output <#expected-output>`__ - `Key compilation flags explained <#key-compilation-flags-explained>`__ 6. `Selecting the appropriate approach <#selecting-the-appropriate-approach>`__ - `Explicit memory management <#explicit-memory-management>`__ - `Managed memory <#managed-memory>`__ - `Zero-copy <#zero-copy>`__ 7. `AI workload considerations <#ai-workload-considerations>`__ - `Large language model inference <#large-language-model-inference>`__ - `Training workloads <#training-workloads>`__ - `Discoverer+ configuration <#discoverer-configuration>`__ 8. `Sapphire Rapids CXL capabilities <#sapphire-rapids-cxl-capabilities>`__ 9. `DGX H200 memory hierarchy <#dgx-h200-memory-hierarchy>`__ 10. `Usage guidelines <#usage-guidelines>`__ 11. `References <#references>`__ Overview -------- This document is intended for users of the Discoverer+ HPC cluster who are developing or optimising GPU-accelerated workloads. It provides guidance on memory management strategies when considering whether to use CUDA unified memory or explicit memory transfers during GPU-driven computations. The DGX H200 nodes on Discoverer+ use Intel Sapphire Rapids CPUs paired with NVIDIA H200 GPUs. This configuration does not provide true hardware-coherent unified memory between CPU and GPU. Unlike the NVIDIA GH200 Grace Hopper Superchip (which uses NVLink-C2C for a coherent shared memory space), the H200 in the Discoverer+ DGX system connects to the CPU via PCIe Gen5, requiring explicit or software-managed data movement between CPU DDR5 and GPU HBM3e memory pools. Understanding these architectural constraints is essential for achieving optimal performance on Discoverer+. Hardware architecture: DGX H200 vs GH200 ---------------------------------------- DGX H200 on Discoverer+ ~~~~~~~~~~~~~~~~~~~~~~~ ==================== ========================================== Component Specification ==================== ========================================== CPU Dual Intel Xeon Platinum (Sapphire Rapids) CPU memory 2 TB DDR5 GPU 8x NVIDIA H200 SXM5 GPU memory 141 GB HBM3e per GPU (1.128 TB total) GPU memory bandwidth 4.8 TB/s per GPU CPU-GPU interconnect PCIe Gen5 (128 GB/s bidirectional) GPU-GPU interconnect NVLink 4 (900 GB/s per GPU) ==================== ========================================== The CPU and GPU memory are physically separate. Data must traverse PCIe to move between them. GH200 Grace Hopper (for comparison) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ==================== ======================================= Component Specification ==================== ======================================= CPU NVIDIA Grace (72 Arm cores) CPU memory 480 GB LPDDR5X GPU Hopper H100/H200 GPU memory 96-144 GB HBM3/HBM3e CPU-GPU interconnect NVLink-C2C (900 GB/s bidirectional) Memory model Hardware-coherent unified address space ==================== ======================================= The GH200 provides a true unified memory pool of up to 624 GB accessible by both CPU and GPU without explicit data copies. Memory interconnect bandwidth comparison ---------------------------------------- ================================ ========== ============== Interconnect Bandwidth Relative speed ================================ ========== ============== H200 HBM3e (on-chip) 4,800 GB/s 1x (baseline) NVLink 4 (GPU-GPU) 900 GB/s 0.19x NVLink-C2C (GH200 CPU-GPU) 900 GB/s 0.19x PCIe Gen5 x16 (DGX H200 CPU-GPU) 128 GB/s 0.027x DDR5-4800 (8 channels) ~307 GB/s 0.064x ================================ ========== ============== The PCIe Gen5 link between Sapphire Rapids and H200 is 7x slower than NVLink-C2C. This is the fundamental architectural difference that affects unified memory viability. Programming models for H200 memory management --------------------------------------------- Option 1: explicit memory management ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This approach is recommended for performance-critical workloads. Use ``cudaMalloc`` for GPU memory and ``cudaMallocHost`` for pinned CPU memory with explicit ``cudaMemcpyAsync`` transfers. .. code:: cuda // Allocate pinned host memory (required for async transfers) float *h_data; cudaMallocHost(&h_data, size); // Allocate device memory float *d_data; cudaMalloc(&d_data, size); // Async transfer with stream overlap cudaStream_t stream; cudaStreamCreate(&stream); cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream); kernel<<>>(d_data); cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream); Advantages: - Deterministic performance - Full control over data movement - No page-fault overhead - Maximum PCIe bandwidth utilisation Disadvantages: - More complex code - Manual memory management - Must track data residency Option 2: CUDA managed memory ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use ``cudaMallocManaged`` to allocate memory accessible from both CPU and GPU code. .. code:: cuda float *data; cudaMallocManaged(&data, size); // CPU can access for (int i = 0; i < n; i++) data[i] = i; // GPU can access (data migrates on demand) kernel<<>>(data); cudaDeviceSynchronize(); // CPU reads back (data migrates back) printf("%f\n", data[0]); Performance tuning with hints: .. code:: cuda // Pre-fetch to GPU before kernel launch cudaMemPrefetchAsync(data, size, deviceId, stream); // Mark as read-mostly (creates read-only copies on both) cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId); // Set preferred location cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId); // Allow direct access from GPU to CPU memory (zero-copy) cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId); Advantages: - Simpler programming model - No explicit copy management - Good for rapid prototyping Disadvantages: - Page faults cause latency spikes - On-demand migration over PCIe is slow - Performance unpredictable without hints Option 3: zero-copy ~~~~~~~~~~~~~~~~~~~ Map pinned host memory directly into GPU address space. .. code:: cuda float *h_data; cudaHostAlloc(&h_data, size, cudaHostAllocMapped); float *d_ptr; cudaHostGetDevicePointer(&d_ptr, h_data, 0); // GPU accesses CPU memory directly over PCIe kernel<<>>(d_ptr); Suitable use cases: - Data accessed only once (no benefit from caching in HBM) - Working set exceeds GPU memory - Streaming workloads Limitation: bandwidth is constrained to PCIe (128 GB/s), which is 37x slower than HBM3e. Complete C++ example with compilation options --------------------------------------------- The following example demonstrates unified memory with prefetching and memory hints. Save this as ``unified_memory_demo.cu``. .. code:: cpp // Created by Veselin Kolev // 18 November 2025 // Licence: GPLv2 #include #include #include #include #define CHECK_CUDA(call) \ do { \ cudaError_t err = call; \ if (err != cudaSuccess) { \ fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \ cudaGetErrorString(err)); \ exit(EXIT_FAILURE); \ } \ } while (0) __global__ void vector_add(const float* a, const float* b, float* c, size_t n) { size_t idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < n) { c[idx] = a[idx] + b[idx]; } } void run_with_explicit_memory(size_t n, int device) { size_t bytes = n * sizeof(float); // Host pinned memory float *h_a, *h_b, *h_c; CHECK_CUDA(cudaMallocHost(&h_a, bytes)); CHECK_CUDA(cudaMallocHost(&h_b, bytes)); CHECK_CUDA(cudaMallocHost(&h_c, bytes)); // Device memory float *d_a, *d_b, *d_c; CHECK_CUDA(cudaMalloc(&d_a, bytes)); CHECK_CUDA(cudaMalloc(&d_b, bytes)); CHECK_CUDA(cudaMalloc(&d_c, bytes)); // Initialise on host for (size_t i = 0; i < n; i++) { h_a[i] = static_cast(i); h_b[i] = static_cast(i * 2); } cudaStream_t stream; CHECK_CUDA(cudaStreamCreate(&stream)); auto start = std::chrono::high_resolution_clock::now(); // Explicit transfers CHECK_CUDA(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream)); CHECK_CUDA(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream)); int threads = 256; int blocks = (n + threads - 1) / threads; vector_add<<>>(d_a, d_b, d_c, n); CHECK_CUDA(cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream)); CHECK_CUDA(cudaStreamSynchronize(stream)); auto end = std::chrono::high_resolution_clock::now(); double ms = std::chrono::duration(end - start).count(); printf("Explicit memory: %.3f ms\n", ms); // Cleanup CHECK_CUDA(cudaStreamDestroy(stream)); CHECK_CUDA(cudaFree(d_a)); CHECK_CUDA(cudaFree(d_b)); CHECK_CUDA(cudaFree(d_c)); CHECK_CUDA(cudaFreeHost(h_a)); CHECK_CUDA(cudaFreeHost(h_b)); CHECK_CUDA(cudaFreeHost(h_c)); } void run_with_unified_memory_naive(size_t n, int device) { size_t bytes = n * sizeof(float); float *a, *b, *c; CHECK_CUDA(cudaMallocManaged(&a, bytes)); CHECK_CUDA(cudaMallocManaged(&b, bytes)); CHECK_CUDA(cudaMallocManaged(&c, bytes)); // Initialise on host (data resides on CPU initially) for (size_t i = 0; i < n; i++) { a[i] = static_cast(i); b[i] = static_cast(i * 2); } auto start = std::chrono::high_resolution_clock::now(); // No prefetch - relies on demand paging int threads = 256; int blocks = (n + threads - 1) / threads; vector_add<<>>(a, b, c, n); CHECK_CUDA(cudaDeviceSynchronize()); auto end = std::chrono::high_resolution_clock::now(); double ms = std::chrono::duration(end - start).count(); printf("Unified (naive): %.3f ms\n", ms); CHECK_CUDA(cudaFree(a)); CHECK_CUDA(cudaFree(b)); CHECK_CUDA(cudaFree(c)); } void run_with_unified_memory_optimised(size_t n, int device) { size_t bytes = n * sizeof(float); float *a, *b, *c; CHECK_CUDA(cudaMallocManaged(&a, bytes)); CHECK_CUDA(cudaMallocManaged(&b, bytes)); CHECK_CUDA(cudaMallocManaged(&c, bytes)); // Set preferred location to GPU CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, device)); CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, device)); CHECK_CUDA(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device)); // Mark inputs as read-mostly (allows read replicas) CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device)); CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device)); // Initialise on host for (size_t i = 0; i < n; i++) { a[i] = static_cast(i); b[i] = static_cast(i * 2); } cudaStream_t stream; CHECK_CUDA(cudaStreamCreate(&stream)); auto start = std::chrono::high_resolution_clock::now(); // Prefetch to GPU before kernel CHECK_CUDA(cudaMemPrefetchAsync(a, bytes, device, stream)); CHECK_CUDA(cudaMemPrefetchAsync(b, bytes, device, stream)); CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, device, stream)); int threads = 256; int blocks = (n + threads - 1) / threads; vector_add<<>>(a, b, c, n); // Prefetch result back to CPU if needed CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream)); CHECK_CUDA(cudaStreamSynchronize(stream)); auto end = std::chrono::high_resolution_clock::now(); double ms = std::chrono::duration(end - start).count(); printf("Unified (optimised): %.3f ms\n", ms); CHECK_CUDA(cudaStreamDestroy(stream)); CHECK_CUDA(cudaFree(a)); CHECK_CUDA(cudaFree(b)); CHECK_CUDA(cudaFree(c)); } int main(int argc, char** argv) { int device = 0; if (argc > 1) { device = atoi(argv[1]); } CHECK_CUDA(cudaSetDevice(device)); cudaDeviceProp prop; CHECK_CUDA(cudaGetDeviceProperties(&prop, device)); printf("Device: %s\n", prop.name); printf("Compute capability: %d.%d\n", prop.major, prop.minor); printf("Concurrent managed access: %s\n", prop.concurrentManagedAccess ? "yes" : "no"); printf("Pageable memory access: %s\n", prop.pageableMemoryAccess ? "yes" : "no"); printf("\n"); // 256 MB of data (64M floats) size_t n = 64 * 1024 * 1024; printf("Vector size: %zu elements (%.1f MB)\n\n", n, n * sizeof(float) / 1e6); // Warm-up run_with_explicit_memory(n, device); // Benchmark runs printf("\nBenchmark results:\n"); run_with_explicit_memory(n, device); run_with_unified_memory_naive(n, device); run_with_unified_memory_optimised(n, device); return 0; } Compilation options ~~~~~~~~~~~~~~~~~~~ .. warning:: Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code. For H200 (Hopper architecture, compute capability 9.0): .. code:: bash # Basic compilation nvcc -arch=sm_90 -O3 -o unified_memory_demo unified_memory_demo.cu # With additional optimisations nvcc -arch=sm_90 -O3 --use_fast_math -Xcompiler "-O3 -march=native" \ -o unified_memory_demo unified_memory_demo.cu # Debug build with line info (for profiling with nsys/ncu) nvcc -arch=sm_90 -O2 -g -lineinfo -o unified_memory_demo_debug unified_memory_demo.cu Running the example ~~~~~~~~~~~~~~~~~~~ .. warning:: Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code. .. code:: bash # Run on default GPU (device 0) ./unified_memory_demo # Run on specific GPU ./unified_memory_demo 2 # Profile with Nsight Systems nsys profile --stats=true ./unified_memory_demo # Profile with Nsight Compute (kernel analysis) ncu --set full -o unified_memory_report ./unified_memory_demo Expected output ~~~~~~~~~~~~~~~ On an H200 with the 256 MB test vector, typical results: :: Device: NVIDIA H200 Compute capability: 9.0 Concurrent managed access: yes Pageable memory access: yes Vector size: 67108864 elements (256.0 MB) Benchmark results: Explicit memory: 12.5 ms Unified (naive): 45.2 ms Unified (optimised): 14.1 ms The naive unified memory approach incurs significant page-fault overhead. With prefetching and hints, performance approaches explicit memory management whilst retaining simpler code. Key compilation flags explained ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +---------------------------+------------------------------------------+ | Flag | Purpose | +===========================+==========================================+ | ``-arch=sm_90`` | Target H200/Hopper architecture (compute | | | capability 9.0) | +---------------------------+------------------------------------------+ | ``-O3`` | Maximum optimisation level | +---------------------------+------------------------------------------+ | ``--use_fast_math`` | Enable fast maths (slightly reduced | | | precision) | +---------------------------+------------------------------------------+ | ``-Xcompiler "-O3"`` | Pass flags to host compiler | +---------------------------+------------------------------------------+ | ``-g -lineinfo`` | Include debug info for profiler source | | | correlation | +---------------------------+------------------------------------------+ For multi-GPU DGX H200 systems, you may also use: .. code:: bash # Generate code for multiple architectures (if sharing binary) nvcc -gencode arch=compute_90,code=sm_90 \ -gencode arch=compute_90,code=compute_90 \ -O3 -o unified_memory_demo unified_memory_demo.cu Selecting the appropriate approach ---------------------------------- Explicit memory management ~~~~~~~~~~~~~~~~~~~~~~~~~~ Use when: 1. Training large models with predictable data access patterns 2. Inference serving with strict latency SLAs (p95/p99 targets) 3. Memory-bound HPC kernels where every GB/s matters 4. Production workloads requiring deterministic performance Managed memory ~~~~~~~~~~~~~~ Use when: 1. Prototyping and exploratory development 2. Complex data structures with unpredictable access (graphs, sparse matrices) 3. Oversubscription scenarios where data exceeds 141 GB per GPU 4. Code portability across different NVIDIA platforms is required Zero-copy ~~~~~~~~~ Use when: 1. Data is accessed once and discarded 2. CPU preprocessing feeds directly into GPU computation 3. Results stream back to CPU for I/O immediately 4. Working set is larger than GPU memory but access is sequential AI workload considerations -------------------------- Large language model inference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The H200’s 141 GB HBM3e is specifically designed to hold: - Model weights for 70B+ parameter models without sharding - Large KV caches for long-context inference (128K+ tokens) For LLM inference on DGX H200: - Keep model weights resident in HBM (explicit allocation) - KV cache should remain in HBM if possible - Only use managed memory for KV cache offload when context exceeds HBM capacity Performance impact of memory location (Llama-70B inference): ======================================= ================= Configuration Per-token latency ======================================= ================= H200 single GPU (model fits in HBM) ~75 ms H100 single GPU (requires quantisation) ~142 ms ======================================= ================= Source: WhiteFiber benchmark analysis. The 1.9x improvement stems from eliminating memory bottlenecks when the full model resides in HBM3e without quantisation or multi-GPU sharding. Training workloads ~~~~~~~~~~~~~~~~~~ - Gradient accumulation, optimiser states, activations should all reside in HBM - Use gradient checkpointing to reduce memory footprint before resorting to CPU offload - If using CPU offload (e.g., DeepSpeed ZeRO-Offload), explicit async transfers outperform managed memory Discoverer+ configuration ~~~~~~~~~~~~~~~~~~~~~~~~~ Given the 1.128 TB of total HBM3e across 8 GPUs with NVLink interconnect: 1. Keep working data in GPU memory 2. Use NVLink for GPU-GPU communication (not CPU-GPU) 3. Treat CPU DDR as staging/preprocessing area, not working memory 4. Use explicit transfers with pinned memory when CPU interaction is required Sapphire Rapids CXL capabilities -------------------------------- Intel Sapphire Rapids supports CXL 1.1, which provides: - Memory expansion via CXL.mem protocol - Cache coherency for CXL-attached accelerators CXL 1.1 limitations relevant to H200: - NVIDIA H200 does not use CXL (uses PCIe/NVLink) - CXL peer-to-peer between GPU and CXL memory requires CPU mediation - CXL memory bandwidth is lower than DDR5 (single PCIe lane equivalent) CXL in your system is relevant for: - Memory capacity expansion beyond 2 TB DDR5 - Future CXL-enabled accelerators - Memory tiering (hot data in DDR5, cold data in CXL expanders) It does not provide unified memory with H200 GPUs. DGX H200 memory hierarchy ------------------------- :: +-------------------------------------------------------------+ | DGX H200 system | +-------------------------------------------------------------+ | | | +-------------------------------------------------------+ | | | GPU domain (8x H200) | | | | +---------+ +---------+ +---------+ +---------+ | | | | | 141 GB | | 141 GB | | 141 GB | | 141 GB | ... | | | | | HBM3e | | HBM3e | | HBM3e | | HBM3e | | | | | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | | | | +----+----+ +----+----+ +----+----+ +----+----+ | | | | +----------+----+----+----------+ | | | | NVLink 4 | | | | 900 GB/s | | | +---------------------------+---------------------------+ | | | | | PCIe Gen5 | | 128 GB/s | | | | | +---------------------------+---------------------------+ | | | CPU domain (Sapphire Rapids) | | | | +------------------------------------------------+ | | | | | 2 TB DDR5 memory | | | | | | ~307 GB/s bandwidth | | | | | +------------------------------------------------+ | | | | +------------------------------------------------+ | | | | | CXL 1.1 (optional expansion) | | | | | +------------------------------------------------+ | | | +-------------------------------------------------------+ | | | +-------------------------------------------------------------+ Performance tiers: Tier 1: HBM3e -> 4,800 GB/s (keep hot data here) Tier 2: NVLink -> 900 GB/s (GPU-GPU sharing) Tier 3: DDR5 -> 307 GB/s (CPU preprocessing) Tier 4: PCIe -> 128 GB/s (CPU-GPU transfers) Tier 5: CXL -> 32 GB/s (cold storage expansion) Usage guidelines ---------------- 1. Default to explicit memory management for production AI workloads 2. Use managed memory with prefetch hints when: - Porting existing CPU code quickly - Working set slightly exceeds GPU memory - Access patterns are irregular 3. Avoid relying on demand paging over PCIe; the 128 GB/s limit creates severe bottlenecks 4. For models exceeding single-GPU memory, use tensor parallelism over NVLink rather than CPU offload 5. Profile with ``nsys`` and ``ncu`` to identify memory movement bottlenecks before optimising 6. Consider GH200-based systems if your workloads fundamentally require CPU-GPU memory coherency References ---------- - NVIDIA CUDA Programming Guide: Unified Memory - NVIDIA H200 Product Brief - NVIDIA DGX H200 Specifications - NVIDIA Grace Hopper Superchip Architecture Whitepaper - Intel Xeon Scalable Processor Max Series Documentation - CXL 4.0 Infrastructure Planning Guide