Unified memory on DGX H200 with Sapphire Rapids: optimisation, optimal use, and programming models
==================================================================================================

Table of contents
-----------------

1.  `Overview <#overview>`__
2.  `Hardware architecture: DGX H200 vs
    GH200 <#hardware-architecture-dgx-h200-vs-gh200>`__

    -  `DGX H200 on Discoverer+ <#dgx-h200-on-discoverer>`__
    -  `GH200 Grace Hopper (for
       comparison) <#gh200-grace-hopper-for-comparison>`__

3.  `Memory interconnect bandwidth
    comparison <#memory-interconnect-bandwidth-comparison>`__
4.  `Programming models for H200 memory
    management <#programming-models-for-h200-memory-management>`__

    -  `Option 1: explicit memory
       management <#option-1-explicit-memory-management>`__
    -  `Option 2: CUDA managed memory <#option-2-cuda-managed-memory>`__
    -  `Option 3: zero-copy <#option-3-zero-copy>`__

5.  `Complete C++ example with compilation
    options <#complete-c-example-with-compilation-options>`__

    -  `Compilation options <#compilation-options>`__
    -  `Running the example <#running-the-example>`__
    -  `Expected output <#expected-output>`__
    -  `Key compilation flags
       explained <#key-compilation-flags-explained>`__

6.  `Selecting the appropriate
    approach <#selecting-the-appropriate-approach>`__

    -  `Explicit memory management <#explicit-memory-management>`__
    -  `Managed memory <#managed-memory>`__
    -  `Zero-copy <#zero-copy>`__

7.  `AI workload considerations <#ai-workload-considerations>`__

    -  `Large language model
       inference <#large-language-model-inference>`__
    -  `Training workloads <#training-workloads>`__
    -  `Discoverer+ configuration <#discoverer-configuration>`__

8.  `Sapphire Rapids CXL
    capabilities <#sapphire-rapids-cxl-capabilities>`__
9.  `DGX H200 memory hierarchy <#dgx-h200-memory-hierarchy>`__
10. `Usage guidelines <#usage-guidelines>`__
11. `References <#references>`__


Overview
--------

This document is intended for users of the Discoverer+ HPC cluster who
are developing or optimising GPU-accelerated workloads. It provides
guidance on memory management strategies when considering whether to use
CUDA unified memory or explicit memory transfers during GPU-driven
computations.

The DGX H200 nodes on Discoverer+ use Intel Sapphire Rapids CPUs paired
with NVIDIA H200 GPUs. This configuration does not provide true
hardware-coherent unified memory between CPU and GPU. Unlike the NVIDIA
GH200 Grace Hopper Superchip (which uses NVLink-C2C for a coherent
shared memory space), the H200 in the Discoverer+ DGX system connects to
the CPU via PCIe Gen5, requiring explicit or software-managed data
movement between CPU DDR5 and GPU HBM3e memory pools.

Understanding these architectural constraints is essential for achieving
optimal performance on Discoverer+.


Hardware architecture: DGX H200 vs GH200
----------------------------------------


DGX H200 on Discoverer+
~~~~~~~~~~~~~~~~~~~~~~~

==================== ==========================================
Component            Specification
==================== ==========================================
CPU                  Dual Intel Xeon Platinum (Sapphire Rapids)
CPU memory           2 TB DDR5
GPU                  8x NVIDIA H200 SXM5
GPU memory           141 GB HBM3e per GPU (1.128 TB total)
GPU memory bandwidth 4.8 TB/s per GPU
CPU-GPU interconnect PCIe Gen5 (128 GB/s bidirectional)
GPU-GPU interconnect NVLink 4 (900 GB/s per GPU)
==================== ==========================================

The CPU and GPU memory are physically separate. Data must traverse PCIe
to move between them.


GH200 Grace Hopper (for comparison)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

==================== =======================================
Component            Specification
==================== =======================================
CPU                  NVIDIA Grace (72 Arm cores)
CPU memory           480 GB LPDDR5X
GPU                  Hopper H100/H200
GPU memory           96-144 GB HBM3/HBM3e
CPU-GPU interconnect NVLink-C2C (900 GB/s bidirectional)
Memory model         Hardware-coherent unified address space
==================== =======================================

The GH200 provides a true unified memory pool of up to 624 GB accessible
by both CPU and GPU without explicit data copies.


Memory interconnect bandwidth comparison
----------------------------------------

================================ ========== ==============
Interconnect                     Bandwidth  Relative speed
================================ ========== ==============
H200 HBM3e (on-chip)             4,800 GB/s 1x (baseline)
NVLink 4 (GPU-GPU)               900 GB/s   0.19x
NVLink-C2C (GH200 CPU-GPU)       900 GB/s   0.19x
PCIe Gen5 x16 (DGX H200 CPU-GPU) 128 GB/s   0.027x
DDR5-4800 (8 channels)           ~307 GB/s  0.064x
================================ ========== ==============

The PCIe Gen5 link between Sapphire Rapids and H200 is 7x slower than
NVLink-C2C. This is the fundamental architectural difference that
affects unified memory viability.


Programming models for H200 memory management
---------------------------------------------


Option 1: explicit memory management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This approach is recommended for performance-critical workloads. Use
``cudaMalloc`` for GPU memory and ``cudaMallocHost`` for pinned CPU
memory with explicit ``cudaMemcpyAsync`` transfers.

.. code:: cuda

   // Allocate pinned host memory (required for async transfers)
   float *h_data;
   cudaMallocHost(&h_data, size);

   // Allocate device memory
   float *d_data;
   cudaMalloc(&d_data, size);

   // Async transfer with stream overlap
   cudaStream_t stream;
   cudaStreamCreate(&stream);
   cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
   kernel<<<blocks, threads, 0, stream>>>(d_data);
   cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);

Advantages:

  - Deterministic performance
  - Full control over data movement
  - No page-fault overhead
  - Maximum PCIe bandwidth utilisation

Disadvantages:

  - More complex code
  - Manual memory management
  - Must track data residency


Option 2: CUDA managed memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``cudaMallocManaged`` to allocate memory accessible from both CPU
and GPU code.

.. code:: cuda

   float *data;
   cudaMallocManaged(&data, size);

   // CPU can access
   for (int i = 0; i < n; i++) data[i] = i;

   // GPU can access (data migrates on demand)
   kernel<<<blocks, threads>>>(data);
   cudaDeviceSynchronize();

   // CPU reads back (data migrates back)
   printf("%f\n", data[0]);

Performance tuning with hints:

.. code:: cuda

   // Pre-fetch to GPU before kernel launch
   cudaMemPrefetchAsync(data, size, deviceId, stream);

   // Mark as read-mostly (creates read-only copies on both)
   cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId);

   // Set preferred location
   cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId);

   // Allow direct access from GPU to CPU memory (zero-copy)
   cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, deviceId);

Advantages:

  - Simpler programming model
  - No explicit copy management
  - Good for rapid prototyping

Disadvantages:

  - Page faults cause latency spikes
  - On-demand migration over PCIe is slow
  - Performance unpredictable without hints


Option 3: zero-copy
~~~~~~~~~~~~~~~~~~~

Map pinned host memory directly into GPU address space.

.. code:: cuda

   float *h_data;
   cudaHostAlloc(&h_data, size, cudaHostAllocMapped);

   float *d_ptr;
   cudaHostGetDevicePointer(&d_ptr, h_data, 0);

   // GPU accesses CPU memory directly over PCIe
   kernel<<<blocks, threads>>>(d_ptr);

Suitable use cases:

  - Data accessed only once (no benefit from caching in HBM)
  - Working set exceeds GPU memory
  - Streaming workloads

Limitation: bandwidth is constrained to PCIe (128 GB/s), which is 37x
slower than HBM3e.


Complete C++ example with compilation options
---------------------------------------------

The following example demonstrates unified memory with prefetching and
memory hints. Save this as ``unified_memory_demo.cu``.

.. code:: cpp

   // Created by Veselin Kolev <v.kolev@discoverer.bg>
   // 18 November 2025
   // Licence: GPLv2

   #include <cuda_runtime.h>
   #include <cstdio>
   #include <cstdlib>
   #include <chrono>

   #define CHECK_CUDA(call)                                                       \
       do {                                                                       \
           cudaError_t err = call;                                                \
           if (err != cudaSuccess) {                                              \
               fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,   \
                       cudaGetErrorString(err));                                  \
               exit(EXIT_FAILURE);                                                \
           }                                                                      \
       } while (0)

   __global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
       size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
       if (idx < n) {
           c[idx] = a[idx] + b[idx];
       }
   }

   void run_with_explicit_memory(size_t n, int device) {
       size_t bytes = n * sizeof(float);
       
       // Host pinned memory
       float *h_a, *h_b, *h_c;
       CHECK_CUDA(cudaMallocHost(&h_a, bytes));
       CHECK_CUDA(cudaMallocHost(&h_b, bytes));
       CHECK_CUDA(cudaMallocHost(&h_c, bytes));
       
       // Device memory
       float *d_a, *d_b, *d_c;
       CHECK_CUDA(cudaMalloc(&d_a, bytes));
       CHECK_CUDA(cudaMalloc(&d_b, bytes));
       CHECK_CUDA(cudaMalloc(&d_c, bytes));
       
       // Initialise on host
       for (size_t i = 0; i < n; i++) {
           h_a[i] = static_cast<float>(i);
           h_b[i] = static_cast<float>(i * 2);
       }
       
       cudaStream_t stream;
       CHECK_CUDA(cudaStreamCreate(&stream));
       
       auto start = std::chrono::high_resolution_clock::now();
       
       // Explicit transfers
       CHECK_CUDA(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream));
       CHECK_CUDA(cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream));
       
       int threads = 256;
       int blocks = (n + threads - 1) / threads;
       vector_add<<<blocks, threads, 0, stream>>>(d_a, d_b, d_c, n);
       
       CHECK_CUDA(cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream));
       CHECK_CUDA(cudaStreamSynchronize(stream));
       
       auto end = std::chrono::high_resolution_clock::now();
       double ms = std::chrono::duration<double, std::milli>(end - start).count();
       
       printf("Explicit memory:    %.3f ms\n", ms);
       
       // Cleanup
       CHECK_CUDA(cudaStreamDestroy(stream));
       CHECK_CUDA(cudaFree(d_a));
       CHECK_CUDA(cudaFree(d_b));
       CHECK_CUDA(cudaFree(d_c));
       CHECK_CUDA(cudaFreeHost(h_a));
       CHECK_CUDA(cudaFreeHost(h_b));
       CHECK_CUDA(cudaFreeHost(h_c));
   }

   void run_with_unified_memory_naive(size_t n, int device) {
       size_t bytes = n * sizeof(float);
       
       float *a, *b, *c;
       CHECK_CUDA(cudaMallocManaged(&a, bytes));
       CHECK_CUDA(cudaMallocManaged(&b, bytes));
       CHECK_CUDA(cudaMallocManaged(&c, bytes));
       
       // Initialise on host (data resides on CPU initially)
       for (size_t i = 0; i < n; i++) {
           a[i] = static_cast<float>(i);
           b[i] = static_cast<float>(i * 2);
       }
       
       auto start = std::chrono::high_resolution_clock::now();
       
       // No prefetch - relies on demand paging
       int threads = 256;
       int blocks = (n + threads - 1) / threads;
       vector_add<<<blocks, threads>>>(a, b, c, n);
       CHECK_CUDA(cudaDeviceSynchronize());
       
       auto end = std::chrono::high_resolution_clock::now();
       double ms = std::chrono::duration<double, std::milli>(end - start).count();
       
       printf("Unified (naive):    %.3f ms\n", ms);
       
       CHECK_CUDA(cudaFree(a));
       CHECK_CUDA(cudaFree(b));
       CHECK_CUDA(cudaFree(c));
   }

   void run_with_unified_memory_optimised(size_t n, int device) {
       size_t bytes = n * sizeof(float);
       
       float *a, *b, *c;
       CHECK_CUDA(cudaMallocManaged(&a, bytes));
       CHECK_CUDA(cudaMallocManaged(&b, bytes));
       CHECK_CUDA(cudaMallocManaged(&c, bytes));
       
       // Set preferred location to GPU
       CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, device));
       CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetPreferredLocation, device));
       CHECK_CUDA(cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device));
       
       // Mark inputs as read-mostly (allows read replicas)
       CHECK_CUDA(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device));
       CHECK_CUDA(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device));
       
       // Initialise on host
       for (size_t i = 0; i < n; i++) {
           a[i] = static_cast<float>(i);
           b[i] = static_cast<float>(i * 2);
       }
       
       cudaStream_t stream;
       CHECK_CUDA(cudaStreamCreate(&stream));
       
       auto start = std::chrono::high_resolution_clock::now();
       
       // Prefetch to GPU before kernel
       CHECK_CUDA(cudaMemPrefetchAsync(a, bytes, device, stream));
       CHECK_CUDA(cudaMemPrefetchAsync(b, bytes, device, stream));
       CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, device, stream));
       
       int threads = 256;
       int blocks = (n + threads - 1) / threads;
       vector_add<<<blocks, threads, 0, stream>>>(a, b, c, n);
       
       // Prefetch result back to CPU if needed
       CHECK_CUDA(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream));
       CHECK_CUDA(cudaStreamSynchronize(stream));
       
       auto end = std::chrono::high_resolution_clock::now();
       double ms = std::chrono::duration<double, std::milli>(end - start).count();
       
       printf("Unified (optimised): %.3f ms\n", ms);
       
       CHECK_CUDA(cudaStreamDestroy(stream));
       CHECK_CUDA(cudaFree(a));
       CHECK_CUDA(cudaFree(b));
       CHECK_CUDA(cudaFree(c));
   }

   int main(int argc, char** argv) {
       int device = 0;
       if (argc > 1) {
           device = atoi(argv[1]);
       }
       
       CHECK_CUDA(cudaSetDevice(device));
       
       cudaDeviceProp prop;
       CHECK_CUDA(cudaGetDeviceProperties(&prop, device));
       printf("Device: %s\n", prop.name);
       printf("Compute capability: %d.%d\n", prop.major, prop.minor);
       printf("Concurrent managed access: %s\n",
              prop.concurrentManagedAccess ? "yes" : "no");
       printf("Pageable memory access: %s\n",
              prop.pageableMemoryAccess ? "yes" : "no");
       printf("\n");
       
       // 256 MB of data (64M floats)
       size_t n = 64 * 1024 * 1024;
       printf("Vector size: %zu elements (%.1f MB)\n\n", n, n * sizeof(float) / 1e6);
       
       // Warm-up
       run_with_explicit_memory(n, device);
       
       // Benchmark runs
       printf("\nBenchmark results:\n");
       run_with_explicit_memory(n, device);
       run_with_unified_memory_naive(n, device);
       run_with_unified_memory_optimised(n, device);
       
       return 0;
   }


Compilation options
~~~~~~~~~~~~~~~~~~~

.. warning:: Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

For H200 (Hopper architecture, compute capability 9.0):

.. code:: bash

   # Basic compilation
   nvcc -arch=sm_90 -O3 -o unified_memory_demo unified_memory_demo.cu

   # With additional optimisations
   nvcc -arch=sm_90 -O3 --use_fast_math -Xcompiler "-O3 -march=native" \
        -o unified_memory_demo unified_memory_demo.cu

   # Debug build with line info (for profiling with nsys/ncu)
   nvcc -arch=sm_90 -O2 -g -lineinfo -o unified_memory_demo_debug unified_memory_demo.cu


Running the example
~~~~~~~~~~~~~~~~~~~

.. warning:: Do not compile directly on the login node. Use SLURM job to run the compilation and the execution of the compiled code.

.. code:: bash

   # Run on default GPU (device 0)
   ./unified_memory_demo

   # Run on specific GPU
   ./unified_memory_demo 2

   # Profile with Nsight Systems
   nsys profile --stats=true ./unified_memory_demo

   # Profile with Nsight Compute (kernel analysis)
   ncu --set full -o unified_memory_report ./unified_memory_demo


Expected output
~~~~~~~~~~~~~~~

On an H200 with the 256 MB test vector, typical results:

::

   Device: NVIDIA H200
   Compute capability: 9.0
   Concurrent managed access: yes
   Pageable memory access: yes

   Vector size: 67108864 elements (256.0 MB)

   Benchmark results:
   Explicit memory:    12.5 ms
   Unified (naive):    45.2 ms
   Unified (optimised): 14.1 ms

The naive unified memory approach incurs significant page-fault
overhead. With prefetching and hints, performance approaches explicit
memory management whilst retaining simpler code.

Key compilation flags explained
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+---------------------------+------------------------------------------+
| Flag                      | Purpose                                  |
+===========================+==========================================+
| ``-arch=sm_90``           | Target H200/Hopper architecture (compute |
|                           | capability 9.0)                          |
+---------------------------+------------------------------------------+
| ``-O3``                   | Maximum optimisation level               |
+---------------------------+------------------------------------------+
| ``--use_fast_math``       | Enable fast maths (slightly reduced      |
|                           | precision)                               |
+---------------------------+------------------------------------------+
| ``-Xcompiler "-O3"``      | Pass flags to host compiler              |
+---------------------------+------------------------------------------+
| ``-g -lineinfo``          | Include debug info for profiler source   |
|                           | correlation                              |
+---------------------------+------------------------------------------+

For multi-GPU DGX H200 systems, you may also use:

.. code:: bash

   # Generate code for multiple architectures (if sharing binary)
   nvcc -gencode arch=compute_90,code=sm_90 \
        -gencode arch=compute_90,code=compute_90 \
        -O3 -o unified_memory_demo unified_memory_demo.cu


Selecting the appropriate approach
----------------------------------


Explicit memory management
~~~~~~~~~~~~~~~~~~~~~~~~~~

Use when:

1. Training large models with predictable data access patterns
2. Inference serving with strict latency SLAs (p95/p99 targets)
3. Memory-bound HPC kernels where every GB/s matters
4. Production workloads requiring deterministic performance


Managed memory
~~~~~~~~~~~~~~

Use when:

1. Prototyping and exploratory development
2. Complex data structures with unpredictable access (graphs, sparse
   matrices)
3. Oversubscription scenarios where data exceeds 141 GB per GPU
4. Code portability across different NVIDIA platforms is required


Zero-copy
~~~~~~~~~

Use when:

1. Data is accessed once and discarded
2. CPU preprocessing feeds directly into GPU computation
3. Results stream back to CPU for I/O immediately
4. Working set is larger than GPU memory but access is sequential


AI workload considerations
--------------------------


Large language model inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The H200’s 141 GB HBM3e is specifically designed to hold:

  - Model weights for 70B+ parameter models without sharding
  - Large KV caches for long-context inference (128K+ tokens)

For LLM inference on DGX H200:

  - Keep model weights resident in HBM (explicit allocation)
  - KV cache should remain in HBM if possible
  - Only use managed memory for KV cache offload when context exceeds HBM capacity

Performance impact of memory location (Llama-70B inference):

======================================= =================
Configuration                           Per-token latency
======================================= =================
H200 single GPU (model fits in HBM)     ~75 ms
H100 single GPU (requires quantisation) ~142 ms
======================================= =================

Source: WhiteFiber benchmark analysis. The 1.9x improvement stems from
eliminating memory bottlenecks when the full model resides in HBM3e
without quantisation or multi-GPU sharding.


Training workloads
~~~~~~~~~~~~~~~~~~

-  Gradient accumulation, optimiser states, activations should all
   reside in HBM
-  Use gradient checkpointing to reduce memory footprint before
   resorting to CPU offload
-  If using CPU offload (e.g., DeepSpeed ZeRO-Offload), explicit async
   transfers outperform managed memory

Discoverer+ configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

Given the 1.128 TB of total HBM3e across 8 GPUs with NVLink
interconnect:

1. Keep working data in GPU memory
2. Use NVLink for GPU-GPU communication (not CPU-GPU)
3. Treat CPU DDR as staging/preprocessing area, not working memory
4. Use explicit transfers with pinned memory when CPU interaction is
   required


Sapphire Rapids CXL capabilities
--------------------------------

Intel Sapphire Rapids supports CXL 1.1, which provides:

  - Memory expansion via CXL.mem protocol
  - Cache coherency for CXL-attached accelerators

CXL 1.1 limitations relevant to H200:

  - NVIDIA H200 does not use CXL (uses PCIe/NVLink)
  - CXL peer-to-peer between GPU and CXL memory requires CPU mediation
  - CXL memory bandwidth is lower than DDR5 (single PCIe lane equivalent)

CXL in your system is relevant for:

  - Memory capacity expansion beyond 2 TB DDR5
  - Future CXL-enabled accelerators
  - Memory tiering (hot data in DDR5, cold data in CXL expanders)

It does not provide unified memory with H200 GPUs.


DGX H200 memory hierarchy
-------------------------

::

   +-------------------------------------------------------------+
   |                        DGX H200 system                      |
   +-------------------------------------------------------------+
   |                                                             |
   |  +-------------------------------------------------------+  |
   |  |              GPU domain (8x H200)                     |  |
   |  |  +---------+ +---------+ +---------+ +---------+      |  |
   |  |  | 141 GB  | | 141 GB  | | 141 GB  | | 141 GB  | ...  |  |
   |  |  | HBM3e   | | HBM3e   | | HBM3e   | | HBM3e   |      |  |
   |  |  | 4.8TB/s | | 4.8TB/s | | 4.8TB/s | | 4.8TB/s |      |  |
   |  |  +----+----+ +----+----+ +----+----+ +----+----+      |  |
   |  |       +----------+----+----+----------+               |  |
   |  |                    NVLink 4                           |  |
   |  |                   900 GB/s                            |  |
   |  +---------------------------+---------------------------+  |
   |                              |                              |
   |                        PCIe Gen5                            |
   |                        128 GB/s                             |
   |                              |                              |
   |  +---------------------------+---------------------------+  |
   |  |              CPU domain (Sapphire Rapids)             |  |
   |  |  +------------------------------------------------+   |  |
   |  |  |              2 TB DDR5 memory                  |   |  |
   |  |  |              ~307 GB/s bandwidth               |   |  |
   |  |  +------------------------------------------------+   |  |
   |  |  +------------------------------------------------+   |  |
   |  |  |         CXL 1.1 (optional expansion)           |   |  |
   |  |  +------------------------------------------------+   |  |
   |  +-------------------------------------------------------+  |
   |                                                             |
   +-------------------------------------------------------------+

   Performance tiers:
     Tier 1: HBM3e      -> 4,800 GB/s  (keep hot data here)
     Tier 2: NVLink     ->   900 GB/s  (GPU-GPU sharing)
     Tier 3: DDR5       ->   307 GB/s  (CPU preprocessing)
     Tier 4: PCIe       ->   128 GB/s  (CPU-GPU transfers)
     Tier 5: CXL        ->    32 GB/s  (cold storage expansion)


Usage guidelines
----------------

1. Default to explicit memory management for production AI workloads

2. Use managed memory with prefetch hints when:

   -  Porting existing CPU code quickly
   -  Working set slightly exceeds GPU memory
   -  Access patterns are irregular

3. Avoid relying on demand paging over PCIe; the 128 GB/s limit creates
   severe bottlenecks

4. For models exceeding single-GPU memory, use tensor parallelism over
   NVLink rather than CPU offload

5. Profile with ``nsys`` and ``ncu`` to identify memory movement
   bottlenecks before optimising

6. Consider GH200-based systems if your workloads fundamentally require
   CPU-GPU memory coherency


References
----------

-  NVIDIA CUDA Programming Guide: Unified Memory
-  NVIDIA H200 Product Brief
-  NVIDIA DGX H200 Specifications
-  NVIDIA Grace Hopper Superchip Architecture Whitepaper
-  Intel Xeon Scalable Processor Max Series Documentation
-  CXL 4.0 Infrastructure Planning Guide