Intel Sapphire Rapids Optimisation Guide (Discoverer+ GPU partition) ==================================================================== Table of Contents ----------------- - `Introduction <#introduction>`__ - `Sapphire Rapids architecture overview <#sapphire-rapids-architecture-overview>`__ - `Optimisation levels: -O2 vs -O3 <#optimisation-levels-o2-vs-o3>`__ - `CPU-specific compilation flags <#cpu-specific-compilation-flags>`__ - `Profile-guided optimisation (PGO) <#profile-guided-optimisation-pgo>`__ - `Memory optimisations <#memory-optimisations>`__ - `Link-time optimisations <#link-time-optimisations>`__ - `Mixed workload strategy <#mixed-workload-strategy>`__ - `Practical build configuration <#practical-build-configuration>`__ - `Runtime considerations <#runtime-considerations>`__ - `SLURM configuration <#slurm-configuration>`__ - `Example code demonstrating optimisation benefits <#example-code-demonstrating-optimisation-benefits>`__ - `Example 1: Vectorisation with AVX-512 <#example-1-vectorisation-with-avx-512>`__ - `Example 2: Cache-aware data layout <#example-2-cache-aware-data-layout>`__ - `Example 3: Profile-guided optimisation benefit <#example-3-profile-guided-optimisation-benefit>`__ - `Example 4: Intel MKL with LLVM and Intel Compilers <#example-4-intel-mkl-with-llvm-and-intel-compilers>`__ - `Example 5: Intel oneDNN with LLVM and Intel Compilers <#example-5-intel-onednn-with-llvm-and-intel-compilers>`__ - `Example 6: AMX for ML/AI workloads <#example-6-amx-for-mlai-workloads>`__ - `Example 7: Fortran Compiler Performance (flang and ifx) <#example-7-fortran-compiler-performance-flang-and-ifx>`__ - `Example 8: C++ Compiler Comparison (clang++, g++, icpx) <#example-8-c-compiler-comparison-clang-g-icpx>`__ - `Example 9: OpenMP Library Comparison <#example-9-openmp-library-comparison>`__ - `Example 10: C++ Threads Performance Comparison <#example-10-c-threads-performance-comparison>`__ - `Benchmark results summary <#benchmark-results-summary>`__ - `Compiler comparison: LLVM/21 vs Intel oneAPI <#compiler-comparison-llvm21-vs-intel-oneapi>`__ - `Runtime optimiser considerations <#runtime-optimiser-considerations>`__ Introduction ------------ This document describes compilation and execution practices for Intel Sapphire Rapids microarchitecture systems. Sapphire Rapids processors (Xeon Scalable 4th Generation, including Xeon Platinum 8480C) have specific characteristics that affect performance. The code examples and optimisation techniques explained in this document are applicable to systems equipped with Intel Xeon Platinum 8480C processors. The system configuration includes 2 sockets with 56 cores per socket, presenting 2 NUMA domains (one per socket), totaling 112 cores with SMT (Simultaneous Multi-Threading) providing 224 threads total. For detailed hardware specifications of the Discoverer+ compute nodes (based on DGX H200 servers) where these Intel Xeon Platinum processors are installed, refer to the `Discoverer Resource Overview `__. All compilation and code execution must occur on compute nodes. The only way to access compute nodes is through SLURM batch jobs. Direct execution and compilation on login nodes is not tolerated. All examples in this document must be submitted as SLURM batch jobs using the provided SLURM scripts in the ``sapphirerapids/`` directory located at ``/opt/software/sapphirerapids/``. The test code is also available online at: https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids .. 
important:: Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs. Sapphire Rapids architecture overview ------------------------------------- Intel Sapphire Rapids microarchitecture (codenamed “SPR”) is a 10nm Enhanced SuperFin process node design introduced in 2023. Sapphire Rapids processors implement a tile-based architecture with multiple compute tiles connected via Intel’s EMIB (Embedded Multi-die Interconnect Bridge). The core architecture consists of Performance cores (P-cores) based on the Golden Cove microarchitecture. Each core has dedicated L1 and L2 caches. The cache hierarchy includes 32KB L1D (data cache) and 32KB L1I (instruction cache) per core, 2MB L2 cache per core, and up to 112MB L3 cache shared across the socket (depending on SKU). Sapphire Rapids cores feature a wide instruction dispatch pipeline with dual 512-bit FMA (Fused Multiply-Add) units per core, enabling simultaneous execution of two 512-bit vector operations. The architecture supports AVX-512 instructions including AVX-512F, AVX-512BW, AVX-512CD, AVX-512DQ, AVX-512VL, AVX-512_VNNI, AVX-512_BF16, AVX-512_FP16, and AVX-512_VBMI2. *Advanced Matrix Extensions (AMX)* is a key feature of Sapphire Rapids specifically designed for AI/ML workloads. AMX provides three types of acceleration: 1. ``AMX-TILE``: 8KB of dedicated tile registers (8 tiles × 1KB each) for efficient matrix data storage and manipulation 2. ``AMX-INT8``: Hardware acceleration for 8-bit integer matrix multiplication, ideal for quantized neural network inference with 4-8x speedup over AVX-512 3. ``AMX-BF16``: Hardware acceleration for bfloat16 matrix multiplication, ideal for mixed-precision training and inference with 2-4x speedup over AVX-512 AMX enables significant performance improvements for deep learning workloads, transformer models, and large language model inference. Each core has independent AMX tile registers, allowing efficient parallelization across cores. The branch prediction unit uses a sophisticated multi-level predictor with improved accuracy over previous generations. Memory disambiguation capabilities allow the processor to detect and handle memory dependencies effectively, enabling out-of-order execution optimisations. For multi-socket systems like Intel Xeon Platinum 8480C, each socket contains multiple tiles connected via EMIB. Each socket presents as a single NUMA domain, with memory controllers distributed across the socket. On systems with 2 sockets, there are 2 NUMA domains with 56 cores per domain, totaling 112 cores with SMT providing 224 threads total. Optimisation levels: ``-O2`` vs ``-O3`` --------------------------------------- Unlike AMD Zen2, Intel Sapphire Rapids does not have a documented embedded runtime optimiser. This means compile-time optimisations, including those enabled by ``-O3``, are more important for achieving optimal performance. 
- **Use** ``-O3`` **for compute-bound workloads**: Provides aggressive optimisations including vectorisation, loop unrolling, and inlining that significantly benefit Sapphire Rapids - **Use** ``-O2`` **for memory-bound or mixed workloads**: Provides balanced optimisation without excessive code bloat that can hurt instruction cache performance - **Profile to determine optimal level**: Test both ``-O2`` and ``-O3`` for your specific workload; ``-O3`` typically provides 5-15% improvement for compute-bound code - **Combine with architecture-specific flags**: ``-O3`` benefits are amplified when combined with ``-march=sapphirerapids`` and AVX-512 optimisations The lack of a runtime optimiser means that compile-time optimisations are the primary mechanism for performance improvements. Aggressive optimisations at compile time translate directly to runtime performance. CPU-specific compilation flags ------------------------------ Architecture targeting ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Use -march=sapphirerapids to enable Sapphire Rapids-specific instructions -march=sapphirerapids # This enables: # - AVX-512 (512-bit vectors) # - AVX-512_VNNI (vector neural network instructions) # - AVX-512_BF16 (bfloat16 support) # - AVX-512_FP16 (half-precision floating point) # - AMX (Advanced Matrix Extensions) # - Other Sapphire Rapids-specific instruction sets # Alternative: Use -march=native to auto-detect all features -march=native Vector width optimisation ~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Optimal vector width for Sapphire Rapids is 512-bit vectors (AVX-512 support) # Sapphire Rapids has dual 512-bit FMA units per core -mprefer-vector-width=512 # For workloads that may benefit from 256-bit vectors (less register pressure) # Use 256-bit for memory-bound code or when register spilling occurs -mprefer-vector-width=256 # Ensures vectorized math uses FMA instructions -mfma AVX-512 specific optimisations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Enable AVX-512 fused multiply-add -mavx512f -mavx512dq -mavx512cd -mavx512bw -mavx512vl # Enable AVX-512 VNNI for neural network workloads -mavx512vnni # Enable AVX-512 BF16 for bfloat16 operations -mavx512bf16 # Enable AVX-512 FP16 for half-precision operations -mavx512fp16 # Note: -march=sapphirerapids automatically enables all supported AVX-512 variants AMX (Advanced Matrix Extensions) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ AMX is Intel’s dedicated hardware acceleration for matrix operations, specifically designed for AI/ML workloads. Sapphire Rapids supports three AMX types: 1. ``AMX-TILE``: Provides 8KB of tile registers (8 tiles × 1KB each) for matrix data storage 2. ``AMX-INT8``: Accelerates 8-bit integer matrix multiplication (INT8 quantization) 3. ``AMX-BF16``: Accelerates bfloat16 matrix multiplication (BF16 mixed precision) Compilation flags ^^^^^^^^^^^^^^^^^ .. code-block:: bash # Enable all AMX types for matrix multiplication workloads (AI/ML) -mamx-tile -mamx-int8 -mamx-bf16 # Or use -march=sapphirerapids which automatically enables AMX support -march=sapphirerapids # Note: AMX requires runtime detection and explicit usage # Compiler will not auto-vectorize to AMX; requires manual intrinsics Runtime detection ^^^^^^^^^^^^^^^^^ AMX requires runtime detection and proper OS support (Linux kernel 5.16+): .. 
code-block:: cpp

   #include <immintrin.h>
   #include <cpuid.h>

   bool check_amx_support() {
       unsigned int eax, ebx, ecx, edx;

       // AMX feature bits are reported in CPUID leaf 0x07, subleaf 0x0, register EDX
       __cpuid_count(0x07, 0x0, eax, ebx, ecx, edx);

       if ((edx & (1u << 24)) == 0) return false;  // AMX-TILE
       if ((edx & (1u << 25)) == 0) return false;  // AMX-INT8
       if ((edx & (1u << 22)) == 0) return false;  // AMX-BF16

       return true;
   }

AMX tile configuration
^^^^^^^^^^^^^^^^^^^^^^

AMX uses a 64-byte tile configuration that must be loaded (with ``_tile_loadconfig``) before any tile operation is executed. The block starts with a palette identifier (palette 1 selects 8 tiles of 1KB each), followed by the per-tile row width in bytes and the per-tile row count:

.. code-block:: cpp

   #include <immintrin.h>
   #include <cstdint>

   // 64-byte AMX tile configuration block:
   //   byte 0      : palette_id (must be 1)
   //   byte 1      : start_row
   //   bytes 16-47 : colsb[16] -- bytes per row for each tile (uint16_t)
   //   bytes 48-63 : rows[16]  -- row count for each tile (uint8_t)
   struct alignas(64) TileConfig {
       uint8_t  palette_id;
       uint8_t  start_row;
       uint8_t  reserved[14];
       uint16_t colsb[16];
       uint8_t  rows[16];
   };

   // Configure AMX tiles for matrix multiplication
   // Tile dimensions: rows × bytes per row (max 16 rows × 64 bytes per row)
   void configure_amx_tiles() {
       TileConfig cfg = {};
       cfg.palette_id = 1;          // palette 1: 8 tiles × 1KB each

       // Tile 0: matrix A, tile 1: matrix B, tile 2: accumulator C
       for (int t = 0; t < 3; ++t) {
           cfg.rows[t]  = 16;       // 16 rows
           cfg.colsb[t] = 64;       // 64 bytes per row (1KB per tile)
       }

       _tile_loadconfig(&cfg);
   }

AMX-BF16 for neural network inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AMX-BF16 is ideal for neural network inference with bfloat16 precision. Note that ``_tile_dpbf16ps`` accumulates in FP32, so the result matrix is stored as ``float``, and matrix B is assumed to be pre-packed into the pair-interleaved (VNNI) layout the instruction expects:

.. code-block:: cpp

   // Example: Matrix multiplication using AMX-BF16
   // C = A × B where A and B are bfloat16 and C is accumulated in FP32
   #include <immintrin.h>
   #include <cstdint>

   void amx_bf16_matmul(const __bf16* A, const __bf16* B, float* C,
                        int M, int N, int K) {
       // Configure tiles (TileConfig as defined above: palette 1,
       // 16 rows × 64 bytes for tiles 0 (A), 1 (B) and 2 (C))
       TileConfig cfg = {};
       cfg.palette_id = 1;
       for (int t = 0; t < 3; ++t) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }
       _tile_loadconfig(&cfg);

       // Blocked matrix multiplication
       for (int i = 0; i < M; i += 16) {
           for (int j = 0; j < N; j += 16) {
               // Zero accumulator tile
               _tile_zero(2);

               for (int k = 0; k < K; k += 32) {
                   // Load tile 0 with A[i:i+16, k:k+32]
                   _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16));
                   // Load tile 1 with the corresponding pre-packed block of B
                   _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16));
                   // Compute: tile2 += tile0 × tile1
                   _tile_dpbf16ps(2, 0, 1);
               }

               // Store the FP32 result from tile 2 to C[i:i+16, j:j+16]
               _tile_stored(2, &C[i * N + j], N * sizeof(float));
           }
       }

       // Release tile configuration
       _tile_release();
   }
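On Linux (kernel 5.16 or newer), a process must also ask the kernel for permission to use the extended tile state before the first tile instruction is executed; otherwise ``_tile_loadconfig`` faults. A minimal sketch of that request is shown below, assuming the documented ``arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)`` interface; the constants follow the upstream kernel headers and the helper name ``request_amx_permission`` is illustrative:

.. code-block:: cpp

   #include <sys/syscall.h>
   #include <unistd.h>
   #include <cstdio>

   // Values from the Linux kernel ABI (asm/prctl.h and the x86 XSAVE feature list).
   #define ARCH_REQ_XCOMP_PERM 0x1023
   #define XFEATURE_XTILEDATA  18

   // Ask the kernel to enable the AMX tile-data state for this process.
   // Must succeed before _tile_loadconfig()/_tile_loadd() are executed.
   bool request_amx_permission() {
       if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
           std::perror("arch_prctl(ARCH_REQ_XCOMP_PERM)");
           return false;
       }
       return true;
   }

Call ``request_amx_permission()`` once at start-up, together with ``check_amx_support()``, before any of the tile examples in this section are executed.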
AMX-INT8 for quantized neural networks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AMX-INT8 accelerates INT8 quantized models (common for inference):

.. code-block:: cpp

   // Example: INT8 quantized matrix multiplication
   // C = A × B where A and B are int8 and C is an int32 accumulator
   void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C,
                        int M, int N, int K) {
       // Configure tiles for INT8 (TileConfig as defined above)
       TileConfig cfg = {};
       cfg.palette_id = 1;
       for (int t = 0; t < 3; ++t) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }
       _tile_loadconfig(&cfg);

       // B is assumed to be pre-packed into the 4-byte interleaved (VNNI)
       // layout expected by _tile_dpbssd
       for (int i = 0; i < M; i += 16) {
           for (int j = 0; j < N; j += 16) {
               _tile_zero(2);  // Zero accumulator

               for (int k = 0; k < K; k += 64) {
                   // Load A[i:i+16, k:k+64] (stride K bytes for int8)
                   _tile_loadd(0, &A[i * K + k], K);
                   // Load the corresponding pre-packed block of B
                   _tile_loadd(1, &B[k * N + j], N);
                   // Compute: tile2 += tile0 × tile1 (INT8, int32 accumulation)
                   _tile_dpbssd(2, 0, 1);
               }

               // Store result (int32 accumulator, stride N × 4 bytes)
               _tile_stored(2, &C[i * N + j], N * sizeof(int32_t));
           }
       }

       _tile_release();
   }

Performance considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Tile reuse**: Keep tiles loaded across multiple operations to minimise memory traffic
- **Blocking**: Use 16-row × 64-byte tile blocks (for both BF16 and INT8) to maximise tile utilisation
- **Memory alignment**: Align matrices to 64-byte boundaries for optimal tile loading
- **Multi-threading**: Each thread has its own tile registers; use one thread per core for AMX workloads
- **Mixed precision**: Use BF16 for training/inference when precision allows; INT8 for maximum inference throughput

Integration with ML frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many ML frameworks automatically use AMX when available:

- **TensorFlow**: Enable with ``TF_ENABLE_ONEDNN_OPTS=1`` (uses the oneDNN library)
- **PyTorch**: Uses oneDNN optimisations automatically on Sapphire Rapids
- **oneDNN**: Intel's deep neural network library with AMX support

.. code-block:: bash

   # Enable oneDNN AMX optimizations
   export TF_ENABLE_ONEDNN_OPTS=1
   export ONEDNN_VERBOSE=1   # For debugging/verification

Loop and function alignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Sapphire Rapids benefits from 64-byte alignment (cache line size);
   # for AVX-512 code paths, 64-byte loop alignment is optimal
   -falign-loops=64
   -falign-functions=64

LLVM-specific optimisations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Enable loop interchange (benefits from sophisticated branch prediction)
   -mllvm -enable-loopinterchange

   # Tune software prefetch distance (Sapphire Rapids also has aggressive hardware prefetchers)
   -mllvm -prefetch-distance=256

   # Enable interleaved memory access optimisation
   -mllvm -enable-interleaved-mem-accesses

   # Force a vectorisation factor and interleave count
   # (the width is given in elements: 16 floats × 32 bits = 512 bits)
   -mllvm -force-vector-width=16
   -mllvm -force-vector-interleave=2

What to avoid
~~~~~~~~~~~~~

- **Excessive loop unrolling**: Can cause instruction cache misses; profile to find the optimal unroll factor
- **Over-aggressive** ``-ffast-math``: Test carefully; precision requirements must be considered
- **Generic** ``-march`` **flags**: Target ``sapphirerapids`` specifically for architecture-specific optimisations
- **Mixing AVX-512 and AVX2**: Use a consistent vector width throughout the application

Profile-guided optimisation (PGO)
---------------------------------

On Sapphire Rapids, PGO typically provides 10-30% performance improvement and combines well with LTO and BOLT.
PGO benefits for Sapphire Rapids ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Given Sapphire Rapids’s sophisticated branch predictor and wide execution units, PGO provides significant benefits because: - Optimizes for actual branch patterns - Improved code layout reduces instruction cache misses - Hot/cold splitting keeps working set in L2/L3 - Enables improved vectorisation decisions based on runtime data PGO workflow ~~~~~~~~~~~~ .. code-block:: bash # Step 1: Instrumentation build clang++ -fprofile-generate -march=sapphirerapids -O3 -flto=thin \ -mprefer-vector-width=512 \ source.cpp -o program # Step 2: Run representative workloads # In SLURM batch job: ./program < typical_input_1 ./program < typical_input_2 ./program < typical_input_3 # Step 3: Merge profiles (if multiple runs) llvm-profdata merge -o final.profdata default.profraw # Step 4: Optimised build with profile clang++ -fprofile-use=final.profdata -march=sapphirerapids -O3 \ -flto=thin -mprefer-vector-width=512 \ source.cpp -o program_optimized Blended profiles for mixed workloads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For diverse customer workloads, create weighted blended profiles: .. code-block:: bash # Collect profiles from multiple workloads llvm-profdata merge -o workload_A.profdata default.profraw_A llvm-profdata merge -o workload_B.profdata default.profraw_B llvm-profdata merge -o workload_C.profdata default.profraw_C # Merge with weights based on importance/frequency llvm-profdata merge \ -weighted-input=3,workload_A.profdata \ -weighted-input=2,workload_B.profdata \ -weighted-input=1,workload_C.profdata \ -o final_blended.profdata Memory optimisations -------------------- Cache-aware compilation ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Improved cache utilisation through section elimination -fdata-sections -ffunction-sections # Linker garbage collection (use with above flags) -Wl,--gc-sections Structure and data layout ~~~~~~~~~~~~~~~~~~~~~~~~~ - *Pack hot data structures* to fit within 32KB L1 cache - *Consider* ``__restrict__`` for pointer aliasing hints (Sapphire Rapids has strong memory disambiguation, but explicit hints can still help the compiler) - *Align data structures* to cache line boundaries (64 bytes) - *Use structure-of-arrays (SoA) layout* for vectorised code when beneficial Huge pages ~~~~~~~~~~ .. code-block:: bash # Enable transparent huge pages in madvise mode echo madvise > /sys/kernel/mm/transparent_hugepage/enabled # In code, use madvise for large allocations madvise(large_buffer, size, MADV_HUGEPAGE); Memory allocators ~~~~~~~~~~~~~~~~~ Consider replacing default allocator with: - ``jemalloc``: Improved performance for concurrent workloads - ``tcmalloc``: Suitable performance characteristics - ``mimalloc``: Low overhead, suitable for mixed workloads Link-time optimisations ----------------------- ThinLTO vs full LTO ~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # ThinLTO: Faster compile times, most of the benefits -flto=thin # Full LTO: Maximum optimization, slower compilation -flto=full Additional link-time flags ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Allow more aggressive optimizations in shared libraries -fno-semantic-interposition # Reduce PLT call overhead -fno-plt # Use LLD linker for improved LTO support -fuse-ld=lld # Safe identical code folding -Wl,--icf=safe BOLT post-link optimisation ~~~~~~~~~~~~~~~~~~~~~~~~~~~ After linking, apply BOLT using production performance data: .. 
code-block:: bash # Collect perf data # In SLURM batch job: perf record -e cycles:u -j any,u ./program # Apply BOLT llvm-bolt program -o program.bolt \ --data perf.data \ --reorder-blocks=ext-tsp \ --reorder-functions=hfsort+ \ --split-functions \ --split-all-cold Mixed workload strategy ----------------------- For systems serving diverse customer workloads, use a balanced approach: Conservative optimisation flags ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ``-O3`` *for compute-bound code*: Provides significant benefits on Sapphire Rapids - ``-O2`` *for memory-bound code*: Avoids code bloat that can hurt cache performance - ``-flto=thin``: Faster, more predictable performance across workload variations - *PGO with blended profiles*: Weighted combination of representative workloads Function multi-versioning ~~~~~~~~~~~~~~~~~~~~~~~~~ For hot paths, use target clones: .. code-block:: cpp __attribute__((target_clones("default","avx2","avx512f"))) void process_data(/* params */) { // Hot path that benefits from different optimizations // Runtime dispatcher selects the best version } Split optimisation by code characteristics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: cmake # Hot paths (identified via profiling) set_source_files_properties(hot_path.cpp PROPERTIES COMPILE_FLAGS "-O3 -march=sapphirerapids -fprofile-use=hot.profdata -mprefer-vector-width=512") # Cold paths set_source_files_properties(general_code.cpp PROPERTIES COMPILE_FLAGS "-O2 -march=x86-64-v4") # Core libraries set_source_files_properties(core_lib.cpp PROPERTIES COMPILE_FLAGS "-O3 -march=sapphirerapids -flto=full -mprefer-vector-width=512") # Customer-facing code set_source_files_properties(api_code.cpp PROPERTIES COMPILE_FLAGS "-O2 -march=x86-64-v4 -flto=thin") Practical build configuration ----------------------------- Complete example: core library build (in SLURM batch job) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # In SLURM batch job: clang++ -O3 \ -march=sapphirerapids \ -mprefer-vector-width=512 \ -mfma \ -falign-loops=64 \ -falign-functions=64 \ -fdata-sections \ -ffunction-sections \ -fno-semantic-interposition \ -fno-plt \ -flto=thin \ -fprofile-use=blended.profdata \ -mllvm -enable-loopinterchange \ -mllvm -prefetch-distance=256 \ source.cpp -o program \ -fuse-ld=lld \ -Wl,--gc-sections \ -Wl,--icf=safe CMake configuration ~~~~~~~~~~~~~~~~~~~ .. code-block:: cmake set(CMAKE_C_COMPILER clang) set(CMAKE_CXX_COMPILER clang++) # C++ standard (C++17, C++20, or C++23) set(CMAKE_CXX_STANDARD 20) set(CMAKE_CXX_STANDARD_REQUIRED ON) # Base flags set(CMAKE_C_FLAGS_RELEASE "-O3 -march=sapphirerapids -mprefer-vector-width=512") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}") # LTO set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -flto=thin") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -flto=thin") # PGO (if profile available) if(EXISTS "${CMAKE_SOURCE_DIR}/final.profdata") set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata") set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata") endif() # Linker set(CMAKE_EXE_LINKER_FLAGS_RELEASE "-fuse-ld=lld -Wl,--gc-sections -Wl,--icf=safe") set(CMAKE_SHARED_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS_RELEASE}") Runtime considerations ---------------------- CPU frequency scaling ~~~~~~~~~~~~~~~~~~~~~ Sapphire Rapids systems typically use ``intel_pstate`` driver for CPU frequency scaling: .. 
code-block:: bash # Check current CPU governor cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # Set to performance mode (if root or via SLURM) echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor # Or use cpupower (if available) cpupower frequency-set -g performance NUMA awareness ~~~~~~~~~~~~~~ Sapphire Rapids systems with multiple sockets have multiple NUMA domains: 1. **Identify NUMA topology**: Use ``numactl --hardware`` to see NUMA node layout 2. **Bind memory allocation**: Use ``numactl --membind=N`` to allocate memory from specific NUMA node 3. **Bind CPU affinity**: Use ``numactl --cpunodebind=N`` to bind to specific NUMA node 4. **Monitor NUMA statistics**: Use ``numastat`` and ``perf stat -e numa-misses`` 5. **Dual-socket systems**: 2 NUMA domains, one per socket. Use SLURM ``--sockets-per-node`` and ``--cpu-bind=sockets`` to bind to specific NUMA domains Example: NUMA-optimized execution ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Direct execution (using numactl): .. code-block:: bash # Check NUMA topology numactl --hardware # Single NUMA domain binding (for single-process) # In SLURM batch job: srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program # Multiple NUMA domains: one process per domain # In SLURM batch job: srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./process1 & srun --cpu-bind=sockets:1-1 numactl --membind=1 --cpunodebind=1 ./process2 & # Monitor NUMA performance # In SLURM batch job: perf stat -e numa-misses,numa-migrations ./program numastat # Show NUMA allocation statistics SLURM execution: .. code-block:: bash # Single NUMA domain (for single-process) #SBATCH --sockets-per-node=1 #SBATCH --cores-per-socket=56 #SBATCH --cpus-per-task=112 srun --cpu-bind=sockets:0-0 ./program # Multiple NUMA domains (one task per domain) #SBATCH --ntasks=2 #SBATCH --sockets-per-node=2 #SBATCH --cores-per-socket=56 srun --cpu-bind=sockets ./program # Explicit NUMA domain binding with numactl # In SLURM batch job: srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program # Check SLURM CPU binding srun --cpu-bind=sockets:0-0 numactl --hardware Environment variables ~~~~~~~~~~~~~~~~~~~~~ Make optimisation thresholds runtime-configurable: .. code-block:: bash # Example: Tunable buffer sizes export BUFFER_SIZE=1048576 export PARALLELISM_THRESHOLD=1000 Monitoring and feedback ~~~~~~~~~~~~~~~~~~~~~~~ - **Instrumented production builds**: Use lightweight sampling (``-fprofile-sample-use``) to collect real customer profiles - **Performance telemetry**: Track which code paths are actually hot in production - **A/B testing**: Deploy different optimisation configurations to subsets of traffic SLURM configuration ------------------- Systems with Intel Sapphire Rapids processors typically use SLURM for job scheduling with specific NUMA domain configuration. 
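Whether the binding requested from SLURM (and, for OpenMP runs, via ``OMP_PLACES``/``OMP_PROC_BIND``) is actually applied can be verified from inside the job with a small diagnostic program. The sketch below is illustrative and is not part of the ``sapphirerapids/`` examples; it assumes Linux and OpenMP and can be compiled with ``clang++ -O3 -fopenmp``:

.. code-block:: cpp

   // affinity_check.cpp -- print each OpenMP thread's CPU affinity (Linux)
   #include <sched.h>
   #include <omp.h>
   #include <cstdio>

   int main() {
       #pragma omp parallel
       {
           cpu_set_t mask;
           CPU_ZERO(&mask);
           // Affinity mask of the calling thread (pid 0 == current thread)
           if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
               #pragma omp critical
               std::printf("thread %2d of %2d: %d allowed CPUs, currently on CPU %d\n",
                           omp_get_thread_num(), omp_get_num_threads(),
                           CPU_COUNT(&mask), sched_getcpu());
           }
       }
       return 0;
   }

Running it under ``srun --cpu-bind=sockets`` and comparing the reported CPU counts with the requested binding confirms that the NUMA domain restriction is in effect.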
Node configuration ~~~~~~~~~~~~~~~~~~ From system configuration: - **2 NUMA domains**: One per socket - **56 cores per NUMA domain**: Each socket contains 56 cores - **2 threads per core**: SMT (Simultaneous Multi-Threading) enabled - **Total capacity**: 2 × 56 × 2 = 224 threads per node Topology breakdown: - **2 NUMA domains**: One per socket - **56 cores per NUMA domain**: Each domain contains 56 cores - **2 threads per core**: SMT (Simultaneous Multi-Threading) enabled - **Total capacity**: 2 × 56 × 2 = 224 threads per node - **Memory per NUMA domain**: Varies by system configuration SLURM directives ~~~~~~~~~~~~~~~~ All jobs must use appropriate SLURM directives: .. code-block:: bash #SBATCH --partition=commong # Default partition (not "cn") #SBATCH --account= #SBATCH --qos= #SBATCH --sockets-per-node=1 # For single NUMA domain #SBATCH --cores-per-socket=56 # All cores in one socket #SBATCH --cpus-per-task=112 # All threads in one socket (with SMT) .. note:: The default partition for SLURM is ``commong``, not ``cn``. **QoS Requirements for Intensive CPU Jobs:** Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs for Sapphire Rapids optimisation benchmarks. OpenMP configuration ~~~~~~~~~~~~~~~~~~~~ For OpenMP workloads: .. code-block:: bash #SBATCH --sockets-per-node=1 #SBATCH --cores-per-socket=56 #SBATCH --cpus-per-task=112 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK export OMP_PLACES=cores export OMP_PROC_BIND=close srun --cpu-bind=sockets:0-0 ./openmp_program SLURM configuration recommendations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. **Single-process applications**: Use ``--sockets-per-node=1`` to bind to one NUMA domain 2. **Multi-process applications**: Use ``--ntasks=N`` with ``--sockets-per-node=N`` (one task per NUMA domain) 3. **Memory allocation**: Request memory proportional to NUMA domains used 4. **CPU binding**: Always use ``--cpu-bind=sockets`` to ensure proper NUMA binding 5. **Monitor binding**: Check with ``srun --cpu-bind=sockets numactl --hardware`` 6. **Thread placement**: For OpenMP, use ``OMP_PLACES=cores`` and ``OMP_PROC_BIND=close`` Example code demonstrating optimisation benefits ------------------------------------------------ The following examples demonstrate how different optimisations benefit Sapphire Rapids performance. The example source code and SLURM scripts are located in the ``sapphirerapids/`` directory at ``/opt/software/sapphirerapids/``. .. important:: The full path to the ``sapphirerapids/`` folder is ``/opt/software/sapphirerapids/``. You can copy this folder to your project directory or work directly from the system location. The test code is also available online at: https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids To reproduce benchmark results, you can either work from the system location or copy the folder to your project directory: .. code-block:: bash # Option 1: Work from the system location cd /opt/software/sapphirerapids # Option 2: Copy to your project directory mkdir -p /path/to/your/project cp -r /opt/software/sapphirerapids /path/to/your/project/ cd /path/to/your/project/sapphirerapids # Submit the SLURM batch job from within the folder sbatch slurm_all_benchmarks.sh This compiles and executes all examples within the SLURM job. Results are written to output files in the same directory. .. 
note:: The SLURM scripts use ``SLURM_SUBMIT_DIR`` to locate source files, so they must be submitted from within the ``sapphirerapids/`` directory where the source files are located.

Example 1: Vectorisation with AVX-512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example shows how ``-march=sapphirerapids`` enables AVX-512 vectorisation:

.. code-block:: cpp

   // vectorized_compute.cpp
   #include <immintrin.h>
   #include <chrono>
   #include <cstddef>
   #include <iostream>

   // Unoptimized version (scalar)
   void compute_scalar(float* a, float* b, float* c, size_t n) {
       for (size_t i = 0; i < n; ++i) {
           c[i] = a[i] * b[i] + a[i];
       }
   }

   // Optimized version (vectorized with AVX-512)
   void compute_vectorized(float* __restrict__ a, float* __restrict__ b,
                           float* __restrict__ c, size_t n) {
       size_t i = 0;
       // Process 16 floats at a time (512-bit AVX-512)
       for (; i + 16 <= n; i += 16) {
           __m512 va = _mm512_load_ps(&a[i]);
           __m512 vb = _mm512_load_ps(&b[i]);
           __m512 vc = _mm512_fmadd_ps(va, vb, va);  // FMA: a*b + a
           _mm512_store_ps(&c[i], vc);
       }
       // Handle remainder
       for (; i < n; ++i) {
           c[i] = a[i] * b[i] + a[i];
       }
   }

   int main() {
       const size_t n = 100000000;
       float* a = (float*)_mm_malloc(n * sizeof(float), 64);
       float* b = (float*)_mm_malloc(n * sizeof(float), 64);
       float* c = (float*)_mm_malloc(n * sizeof(float), 64);

       // Initialize
       for (size_t i = 0; i < n; ++i) {
           a[i] = 1.0f;
           b[i] = 2.0f;
       }

       // Benchmark scalar
       auto start = std::chrono::high_resolution_clock::now();
       compute_scalar(a, b, c, n);
       auto end = std::chrono::high_resolution_clock::now();
       auto scalar_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

       // Benchmark vectorized
       start = std::chrono::high_resolution_clock::now();
       compute_vectorized(a, b, c, n);
       end = std::chrono::high_resolution_clock::now();
       auto vectorized_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

       std::cout << "Scalar time: " << scalar_time << " us\n";
       std::cout << "Vectorized time: " << vectorized_time << " us\n";
       std::cout << "Speedup: " << (double)scalar_time / vectorized_time << "x\n";

       _mm_free(a);
       _mm_free(b);
       _mm_free(c);
       return 0;
   }

Compile with:

.. code-block:: bash

   clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
       -mfma vectorized_compute.cpp -o vectorized_compute
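For comparison, the same fused multiply-add kernel can also be written without intrinsics and left to the compiler's auto-vectoriser; with the flags shown above, clang normally generates equivalent AVX-512 code. The portable sketch below is illustrative and is not part of the benchmark sources (the function name ``compute_autovec`` is an assumption):

.. code-block:: cpp

   #include <cstddef>

   // Portable variant of the Example 1 kernel: the compiler vectorises it.
   void compute_autovec(const float* __restrict__ a, const float* __restrict__ b,
                        float* __restrict__ c, std::size_t n) {
       // The OpenMP SIMD directive (requires -fopenmp or -fopenmp-simd) makes
       // the vectorisation request explicit; plain -O3 auto-vectorisation
       // usually produces the same code for this loop.
       #pragma omp simd
       for (std::size_t i = 0; i < n; ++i) {
           c[i] = a[i] * b[i] + a[i];   // contracted to an FMA with -mfma
       }
   }

Clang's vectorisation report (``-Rpass=loop-vectorize``) can be used to confirm that the loop was vectorised and which vector width was chosen.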
Example 2: Cache-aware data layout
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example demonstrates the importance of data layout for cache performance:

.. code-block:: cpp

   // cache_layout_example.cpp
   #include <chrono>
   #include <cstddef>
   #include <iostream>
   #include <vector>

   // Array of Structures (AoS) - poor cache locality
   struct Point {
       float x, y, z;
       int id;
   };

   float process_aos(Point* points, size_t n) {
       float sum = 0.0f;
       for (size_t i = 0; i < n; ++i) {
           sum += points[i].x * points[i].y;
       }
       return sum;
   }

   // Structure of Arrays (SoA) - improved cache locality
   struct Points {
       std::vector<float> x, y, z;
       std::vector<int> id;
   };

   float process_soa(Points& points, size_t n) {
       float sum = 0.0f;
       for (size_t i = 0; i < n; ++i) {
           sum += points.x[i] * points.y[i];
       }
       return sum;
   }

   int main() {
       const size_t n = 10000000;

       // AoS version
       Point* aos_points = new Point[n];
       for (size_t i = 0; i < n; ++i) {
           aos_points[i].x = 1.0f;
           aos_points[i].y = 2.0f;
       }

       auto start = std::chrono::high_resolution_clock::now();
       // volatile sink prevents the compiler from optimising the loop away
       volatile float aos_result = process_aos(aos_points, n);
       auto end = std::chrono::high_resolution_clock::now();
       auto aos_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

       // SoA version
       Points soa_points;
       soa_points.x.resize(n);
       soa_points.y.resize(n);
       for (size_t i = 0; i < n; ++i) {
           soa_points.x[i] = 1.0f;
           soa_points.y[i] = 2.0f;
       }

       start = std::chrono::high_resolution_clock::now();
       volatile float soa_result = process_soa(soa_points, n);
       end = std::chrono::high_resolution_clock::now();
       auto soa_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

       std::cout << "AoS time: " << aos_time << " us\n";
       std::cout << "SoA time: " << soa_time << " us\n";
       std::cout << "Speedup: " << (double)aos_time / soa_time << "x\n";

       delete[] aos_points;
       return 0;
   }

Compile with:

.. code-block:: bash

   clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
       cache_layout_example.cpp -o cache_layout_example

Example 3: Profile-guided optimisation benefit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example demonstrates the PGO workflow and its benefits:

.. code-block:: cpp

   // pgo_example.cpp
   #include <algorithm>
   #include <cstddef>
   #include <vector>

   // Hot path function
   void process_hot_path(std::vector<int>& data) {
       for (size_t i = 0; i < data.size(); ++i) {
           if (data[i] > 1000) {
               // Common branch
               data[i] = data[i] * 2 + 1;
           } else {
               // Less common branch
               data[i] = data[i] / 2;
           }
       }
   }

   // Cold path function
   void process_cold_path(std::vector<int>& data) {
       std::sort(data.begin(), data.end());
   }

   int main(int argc, char* argv[]) {
       const size_t n = 10000000;
       std::vector<int> data(n);

       // Initialize with a pattern that makes the hot path common
       for (size_t i = 0; i < n; ++i) {
           data[i] = (i % 10 == 0) ? 500 : 2000;  // 90% take the hot branch
       }

       // Simulate typical workload
       for (int iter = 0; iter < 100; ++iter) {
           process_hot_path(data);
           if (iter % 10 == 0) {
               process_cold_path(data);
           }
       }

       return 0;
   }

PGO workflow:

.. code-block:: bash

   # Step 1: Instrumentation build
   clang++ -fprofile-generate -O3 -march=sapphirerapids \
       pgo_example.cpp -o pgo_example

   # Step 2: Run representative workload
   ./pgo_example

   # Step 3: Merge profile
   llvm-profdata merge -o pgo.profdata default.profraw

   # Step 4: Optimised build
   clang++ -fprofile-use=pgo.profdata -O3 -march=sapphirerapids \
       pgo_example.cpp -o pgo_example_optimized

Example 4: Intel MKL with LLVM and Intel compilers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This example demonstrates how both LLVM/21 and Intel oneAPI compilers can use the Intel Math Kernel Library (MKL) for optimised linear algebra operations.

| Source file: ``sapphirerapids/mkl_benchmark.cpp``
| SLURM script: ``sapphirerapids/slurm_mkl_benchmark.sh``
| Location: ``sapphirerapids/`` directory

Compilation with LLVM/21:

..
code-block:: bash module load mkl/2025.0 llvm/21 clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \ -I$MKLROOT/include \ -L$MKLROOT/lib/intel64 \ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl \ mkl_benchmark.cpp -o mkl_benchmark_llvm Compilation with Intel oneAPI: .. code-block:: bash module load mkl/2025.0 compiler-intel-llvm/2025.0.4 icpx -qmkl=sequential -O3 -march=sapphirerapids -mprefer-vector-width=512 \ mkl_benchmark.cpp -o mkl_benchmark_intel Performance results: - Both compilers achieve similar MKL performance (within 0.5%) - For 2048x2048 DGEMM: LLVM/21 = 99.75 GFLOPS, Intel oneAPI = 99.88 GFLOPS - MKL library performance is independent of compiler choice - Differences come from user code optimisation, not MKL library calls Compiler differences: - Intel oneAPI provides simpler MKL linking with ``-qmkl`` flag - LLVM/21 requires manual library linking but offers more control - Both compilers can achieve optimal MKL performance - Compiler choice primarily affects user code, not pre-compiled MKL routines MKL with OpenMP threading ^^^^^^^^^^^^^^^^^^^^^^^^^ MKL can use OpenMP for internal threading, which is important for multi-threaded applications. The choice of threading library must match between your application and MKL to avoid conflicts. | Source file: ``sapphirerapids/mkl_openmp_example.cpp`` | SLURM script: ``sapphirerapids/slurm_mkl_openmp.sh`` Compilation with LLVM/21: .. code-block:: bash module load mkl/2025.0 llvm/21 clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \ -fopenmp \ -I$MKLROOT/include \ -L$MKLROOT/lib/intel64 \ -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \ mkl_openmp_example.cpp -o mkl_openmp_llvm Compilation with Intel oneAPI: .. code-block:: bash module load mkl/2025.0 compiler-intel-llvm/2025.0.4 icpx -qmkl=parallel -qopenmp -O3 -march=sapphirerapids -mprefer-vector-width=512 \ mkl_openmp_example.cpp -o mkl_openmp_intel Threading library differences: - **LLVM/21**: Uses ``-lmkl_gnu_thread`` with ``-liomp5`` (Intel OpenMP runtime) for improved scaling - **Intel oneAPI**: Uses ``-qmkl=parallel`` which automatically selects ``libmkl_intel_thread`` with Intel OpenMP - Both require matching OpenMP runtime libraries to avoid conflicts Runtime configuration: .. code-block:: bash export OMP_NUM_THREADS=56 export MKL_NUM_THREADS=56 export OMP_PLACES=cores export OMP_PROC_BIND=close ./mkl_openmp_llvm 2048 56 Performance Considerations: - Set ``OMP_NUM_THREADS`` and ``MKL_NUM_THREADS`` to the same value - Use ``OMP_PLACES=cores`` and ``OMP_PROC_BIND=close`` for NUMA-aware placement - Intel OpenMP (``libiomp5``) typically provides higher scaling than GNU OpenMP for MKL MKL with MPI ^^^^^^^^^^^^ MKL can be used with MPI for distributed-memory parallel applications. Intel MPI (provided with oneAPI) works with both LLVM and Intel compilers. | Source file: ``sapphirerapids/mkl_mpi_example.cpp`` | SLURM script: ``sapphirerapids/slurm_mkl_mpi.sh`` Compilation with LLVM/21: .. code-block:: bash module load mkl/2025.0 llvm/21 mpi/2021.14 # Intel MPI uses I_MPI_CXX environment variable to select compiler export I_MPI_CXX=clang++ export CXXFLAGS="-O3 -march=sapphirerapids -DNDEBUG -std=c++17 -stdlib=libc++" mpicxx ${CXXFLAGS} \ -I$MKLROOT/include \ -L$MKLROOT/lib/intel64 \ -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \ mkl_mpi_example.cpp -o mkl_mpi_llvm Compilation with Intel oneAPI: .. 
code-block:: bash module load mkl/2025.0 compiler-intel-llvm/2025.0.4 mpi/2021.14 # Force mpicxx to use icpx instead of default g++ # Intel MPI uses I_MPI_CXX environment variable to select compiler export I_MPI_CXX=icpx mpicxx -O3 -march=sapphirerapids -DNDEBUG -std=c++17 \ -qmkl=parallel \ mkl_mpi_example.cpp -o mkl_mpi_intel MPI process grid configuration: .. code-block:: bash # Run with 2 MPI processes, 56 threads each export MKL_NUM_THREADS=56 export OMP_NUM_THREADS=56 srun -n 2 --cpu-bind=sockets ./mkl_mpi_llvm 2048 56 MPI configuration notes: - Intel MPI (``mpicxx``) wrapper works with both compilers - For LLVM, set ``I_MPI_CXX=clang++`` to override default g++ compiler - For Intel oneAPI, set ``I_MPI_CXX=icpx`` to override default g++ compiler - By default, ``mpicxx`` uses ``g++`` unless compiler is explicitly specified - Intel MPI uses ``I_MPI_CXX`` environment variable (not ``CXX``) to select the C++ compiler for both LLVM and Intel compilers - MKL threading (``MKL_NUM_THREADS``) should match OpenMP threads per process - Use ``--cpu-bind=sockets`` in SLURM to bind processes to NUMA domains MKL BLACS for ScaLAPACK: For distributed linear algebra (ScaLAPACK), MKL provides BLACS libraries: - ``libmkl_blacs_intelmpi_lp64`` - Intel MPI - ``libmkl_blacs_openmpi_lp64`` - OpenMPI These are automatically selected when using ``-qmkl=cluster`` with Intel compilers, or manually linked with LLVM. Example 5: Intel oneDNN with LLVM and Intel compilers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Intel oneDNN (Deep Neural Network Library) is a performance library for deep learning applications, providing optimized primitives for neural network operations. It supports both LLVM and Intel compilers. | Source file: ``sapphirerapids/onednn_benchmark.cpp`` | SLURM script: ``sapphirerapids/slurm_onednn_benchmark.sh`` Compilation with LLVM/21: .. code-block:: bash module load llvm/21 dnnl/latest clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \ -I$DNNLROOT/include \ -L$DNNLROOT/lib -ldnnl -Wl,-rpath,$DNNLROOT/lib \ onednn_benchmark.cpp -o onednn_llvm Compilation with Intel oneAPI: .. code-block:: bash module load compiler-intel-llvm/2025.0.4 dnnl/latest icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 \ -I$DNNLROOT/include \ -L$DNNLROOT/lib -ldnnl \ onednn_benchmark.cpp -o onednn_intel Performance results: - LLVM/21: 4278 GFLOPS (average, 2048x2048 matrix multiplication) - Intel oneAPI: 4022 GFLOPS (average, 2048x2048 matrix multiplication) - LLVM/21 shows approximately 6.4% higher performance Compiler performance with oneDNN: - Both compilers can successfully use oneDNN library - oneDNN library itself is pre-compiled, but user code compilation affects performance - LLVM/21 shows slightly higher performance for the benchmark code - oneDNN automatically detects and uses AMX instructions on Sapphire Rapids - Both compilers link against the same oneDNN library (version 3.6.1) Runtime configuration: .. 
code-block:: bash # Disable verbose output (optional) export DNNL_VERBOSE=0 export ONEDNN_VERBOSE=0 # Run benchmark ./onednn_llvm 2048 ./onednn_intel 2048 Integration with ML frameworks: - **TensorFlow**: Enable with ``TF_ENABLE_ONEDNN_OPTS=1`` - **PyTorch**: Uses oneDNN automatically on Sapphire Rapids - Both frameworks benefit from oneDNN’s AMX optimisations Example 6: AMX for ML/AI workloads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This example demonstrates how to use AMX (Advanced Matrix Extensions) for machine learning and AI workloads, showing all three AMX types: AMX-TILE, AMX-INT8, and AMX-BF16. | Source file: ``sapphirerapids/amx_ml_example.cpp`` | SLURM script: ``sapphirerapids/slurm_amx_example.sh`` | Location: ``sapphirerapids/`` directory Complete AMX example with runtime detection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: cpp // amx_ml_example.cpp #include #include #include #include #include #include // Runtime AMX detection bool check_amx_support() { unsigned int eax, ebx, ecx, edx; // Check for AMX-TILE support (CPUID leaf 0x1D, subleaf 0x0) __cpuid_count(0x1D, 0x0, eax, ebx, ecx, edx); bool has_tile = (eax & (1 << 0)) != 0; bool has_int8 = (eax & (1 << 1)) != 0; bool has_bf16 = (eax & (1 << 5)) != 0; std::cout << "AMX-TILE: " << (has_tile ? "Yes" : "No") << "\n"; std::cout << "AMX-INT8: " << (has_int8 ? "Yes" : "No") << "\n"; std::cout << "AMX-BF16: " << (has_bf16 ? "Yes" : "No") << "\n"; return has_tile && has_int8 && has_bf16; } // AMX-BF16 matrix multiplication for neural network layers // C = A × B where A, B, C are bfloat16 matrices void amx_bf16_matmul(const __bf16* A, const __bf16* B, __bf16* C, int M, int N, int K) { // Configure AMX tiles uint8_t tilecfg[64] = {0}; // Tile 0: A matrix (16 rows × 32 bf16 elements = 64 bytes per row) tilecfg[0] = 16; // rows tilecfg[1] = 64; // bytes per row // Tile 1: B matrix (transposed, 16 rows × 32 bf16 elements) tilecfg[16] = 16; tilecfg[17] = 64; // Tile 2: C accumulator (16 rows × 32 bf16 elements, stores FP32) tilecfg[32] = 16; tilecfg[33] = 64; _tile_loadconfig(tilecfg); // Blocked matrix multiplication for (int i = 0; i < M; i += 16) { for (int j = 0; j < N; j += 16) { // Zero accumulator tile _tile_zero(2); // Inner product accumulation for (int k = 0; k < K; k += 32) { // Load A[i:i+16, k:k+32] into tile 0 _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16)); // Load B[k:k+32, j:j+16] (transposed) into tile 1 _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16)); // Compute: tile2 += tile0 × tile1 (BF16) _tile_dpbf16ps(2, 0, 1); } // Store result from tile 2 to C[i:i+16, j:j+16] _tile_stored(2, &C[i * N + j], N * sizeof(__bf16)); } } _tile_release(); } // AMX-INT8 quantized matrix multiplication for inference // C = A × B where A, B are int8, C is int32 accumulator void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C, int M, int N, int K) { // Configure AMX tiles for INT8 uint8_t tilecfg[64] = {0}; // Tile 0: A matrix (16 rows × 64 int8 elements = 64 bytes per row) tilecfg[0] = 16; tilecfg[1] = 64; // Tile 1: B matrix (64 rows × 16 int8 elements, transposed) tilecfg[16] = 16; tilecfg[17] = 64; // Tile 2: C accumulator (16 rows × 16 int32 elements) tilecfg[32] = 16; tilecfg[33] = 64; _tile_loadconfig(tilecfg); for (int i = 0; i < M; i += 16) { for (int j = 0; j < N; j += 16) { _tile_zero(2); // Zero accumulator for (int k = 0; k < K; k += 64) { // Load A[i:i+16, k:k+64] _tile_loadd(0, &A[i * K + k], K); // Load B[k:k+64, j:j+16] (transposed) _tile_loadd(1, &B[k * N + j], N); // Compute: tile2 += tile0 × 
tile1 (INT8) _tile_dpbssd(2, 0, 1); } // Store result (int32 accumulator) _tile_stored(2, &C[i * N + j], N * sizeof(int32_t)); } } _tile_release(); } // Reference implementation using AVX-512 (for comparison) void avx512_bf16_matmul(const __bf16* A, const __bf16* B, __bf16* C, int M, int N, int K) { // Simplified AVX-512 implementation for comparison // This is a basic version; full implementation would be more complex for (int i = 0; i < M; ++i) { for (int j = 0; j < N; ++j) { float sum = 0.0f; for (int k = 0; k < K; ++k) { sum += (float)A[i * K + k] * (float)B[k * N + j]; } C[i * N + j] = (__bf16)sum; } } } int main() { // Check AMX support std::cout << "Checking AMX support...\n"; if (!check_amx_support()) { std::cerr << "AMX not supported on this system\n"; return 1; } // Matrix dimensions (typical neural network layer) const int M = 1024; // Batch size × sequence length const int N = 4096; // Output features const int K = 2048; // Input features // Allocate and initialize matrices __bf16* A_bf16 = (__bf16*)aligned_alloc(64, M * K * sizeof(__bf16)); __bf16* B_bf16 = (__bf16*)aligned_alloc(64, K * N * sizeof(__bf16)); __bf16* C_bf16 = (__bf16*)aligned_alloc(64, M * N * sizeof(__bf16)); __bf16* C_ref = (__bf16*)aligned_alloc(64, M * N * sizeof(__bf16)); // Initialize with random values for (int i = 0; i < M * K; ++i) { A_bf16[i] = (__bf16)((float)rand() / RAND_MAX); } for (int i = 0; i < K * N; ++i) { B_bf16[i] = (__bf16)((float)rand() / RAND_MAX); } // Benchmark AMX-BF16 std::cout << "\nBenchmarking AMX-BF16 matrix multiplication...\n"; std::cout << "Matrix dimensions: " << M << " × " << K << " × " << N << "\n"; const int iterations = 10; auto start = std::chrono::high_resolution_clock::now(); for (int iter = 0; iter < iterations; ++iter) { amx_bf16_matmul(A_bf16, B_bf16, C_bf16, M, N, K); } auto end = std::chrono::high_resolution_clock::now(); auto amx_time = std::chrono::duration_cast(end - start).count(); // Benchmark AVX-512 reference (for comparison) start = std::chrono::high_resolution_clock::now(); for (int iter = 0; iter < iterations; ++iter) { avx512_bf16_matmul(A_bf16, B_bf16, C_ref, M, N, K); } end = std::chrono::high_resolution_clock::now(); auto avx512_time = std::chrono::duration_cast(end - start).count(); std::cout << "AMX-BF16 time: " << amx_time / iterations << " us per iteration\n"; std::cout << "AVX-512 time: " << avx512_time / iterations << " us per iteration\n"; if (avx512_time > 0) { std::cout << "Speedup: " << (double)avx512_time / amx_time << "x\n"; } // INT8 quantized example std::cout << "\nBenchmarking AMX-INT8 quantized matrix multiplication...\n"; int8_t* A_int8 = (int8_t*)aligned_alloc(64, M * K); int8_t* B_int8 = (int8_t*)aligned_alloc(64, K * N); int32_t* C_int32 = (int32_t*)aligned_alloc(64, M * N * sizeof(int32_t)); // Initialize INT8 matrices (quantized values) for (int i = 0; i < M * K; ++i) { A_int8[i] = (int8_t)(rand() % 256 - 128); } for (int i = 0; i < K * N; ++i) { B_int8[i] = (int8_t)(rand() % 256 - 128); } start = std::chrono::high_resolution_clock::now(); for (int iter = 0; iter < iterations; ++iter) { amx_int8_matmul(A_int8, B_int8, C_int32, M, N, K); } end = std::chrono::high_resolution_clock::now(); auto int8_time = std::chrono::duration_cast(end - start).count(); std::cout << "AMX-INT8 time: " << int8_time / iterations << " us per iteration\n"; std::cout << "Throughput: " << (double)(M * N * K) / (int8_time / iterations) * 1e6 / 1e9 << " GFLOPs\n"; // Cleanup free(A_bf16); free(B_bf16); free(C_bf16); free(C_ref); free(A_int8); 
free(B_int8); free(C_int32); return 0; } Compile with: .. code-block:: bash clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 \ -mprefer-vector-width=512 \ amx_ml_example.cpp -o amx_ml_example Use cases for each AMX type ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1. ``AMX-BF16``: - Neural network training with mixed precision - Inference with bfloat16 precision - Transformer models (attention mechanisms) - Large language model inference 2. ``AMX-INT8``: - Quantized neural network inference - Post-training quantization models - Edge AI inference - Maximum throughput inference workloads 3. ``AMX-TILE``: - Base infrastructure for both INT8 and BF16 - Provides 8KB of tile register storage - Enables efficient matrix blocking strategies Performance tips for AMX ^^^^^^^^^^^^^^^^^^^^^^^^ - **Tile configuration**: Set tile configuration once and reuse across multiple operations - **Blocking strategy**: Use 16×64 blocking for optimal tile utilisation - **Memory alignment**: Align all matrices to 64-byte boundaries - **Threading**: Use one thread per core; each thread has independent tile registers - **NUMA awareness**: For multi-socket systems, bind threads to local NUMA domain - **Mixed precision**: Use BF16 when precision allows; INT8 for maximum throughput .. _integration-with-ml-frameworks-1: Integration with ML frameworks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Many ML frameworks automatically use AMX when available: .. code-block:: bash # TensorFlow with oneDNN AMX support export TF_ENABLE_ONEDNN_OPTS=1 export ONEDNN_VERBOSE=1 # Enable verbose output to verify AMX usage # PyTorch with oneDNN export ONEDNN_VERBOSE=1 # Verify AMX is being used # Look for "amx" in framework logs Expected performance improvements: - ``AMX-BF16``: 2-4x speedup over AVX-512 for large matrix multiplications - ``AMX-INT8``: 4-8x speedup over AVX-512 for quantized inference - Suitable for: Large batch sizes, deep neural networks, transformer models Example 7: Fortran compiler performance (``flang``, ``ifx``, and ``gfortran``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fortran code can be compiled with LLVM’s ``flang``, Intel’s ``ifx``, and GCC’s ``gfortran`` compilers. This example demonstrates Sapphire Rapids-specific optimisations available in Fortran. .. important:: Compiler results are not directly comparable because: - ``flang`` **(LLVM/21)**: Only generates AVX-256 (ymm) instructions, not AVX-512 - ``ifx`` **(Intel oneAPI)**: Generates AVX-512 (zmm) instructions - ``gfortran`` **(GCC 15.1.0)**: Generates AVX-512 (zmm) instructions - These are separate benchmarks using different instruction sets Source files: - ``sapphirerapids/fortran_avx512_example.f90`` - AVX-512 vectorisation - ``sapphirerapids/fortran_openmp_example.f90`` - OpenMP parallelization - ``sapphirerapids/fortran_mkl_example.f90`` - Intel MKL integration SLURM script: ``sapphirerapids/slurm_fortran_benchmarks.sh`` Compilation with ``flang`` (LLVM/21): .. code-block:: bash module load llvm/21 flang -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ fortran_avx512_example.f90 -o fortran_avx512_flang Compilation with ifx (Intel oneAPI): .. code-block:: bash module load compiler-intel-llvm/2025.0.4 ifx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \ fortran_avx512_example.f90 -o fortran_avx512_ifx Compilation with ``gfortran`` (GCC 15.1.0): .. 
code-block:: bash module load gcc/15.1.0 gfortran -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ fortran_avx512_example.f90 -o fortran_avx512_gcc Fortran features for Sapphire Rapids: 1. AVX-512 vectorisation: - Use assumed-shape arrays (``a(:)``) for improved vectorisation hints - Compiler automatically vectorises simple loops with ``-march=sapphirerapids`` - **flang (LLVM/21)**: Only generates AVX-256 (ymm) code, ``-mprefer-vector-width=512`` flag is ignored - **ifx (Intel oneAPI)**: Generates AVX-512 (zmm) code with ``-mprefer-vector-width=512``, but may show performance overhead due to frequency scaling 2. OpenMP parallelization: - Use ``!$omp parallel do`` directives for NUMA-aware parallelization - Set ``OMP_PLACES=cores`` and ``OMP_PROC_BIND=close`` for optimal placement - Both compilers support OpenMP 4.5+ features 3. MKL integration: - Intel MKL can be called from Fortran using standard BLAS/LAPACK interfaces - Both compilers can link against MKL libraries - Use ``-qmkl=parallel`` with ifx or manual linking with ``flang`` .. important:: Performance results (separate benchmarks - not comparable): ``flang`` (LLVM/21) - AVX-256 only: - Generates AVX-256 (ymm) code only, does not generate AVX-512 (zmm) instructions - ``-mprefer-vector-width=512`` flag is ignored with warning - Single precision (C_FLOAT): 2.93x speedup (vectorised vs scalar) - These results apply to AVX-256 vectorisation only ``ifx`` (Intel oneAPI) - AVX-512: - Generates AVX-512 (zmm) code with ``-mprefer-vector-width=512`` - Single precision (C_FLOAT): 1.74x speedup (vectorised vs scalar) - AVX-512 code generation confirmed (zmm registers in assembly) - These results apply to AVX-512 vectorisation ``ifx`` (Intel oneAPI) - AVX-256 (equivalent instruction set): - Can be forced to use AVX-256 with ``-mprefer-vector-width=256`` - Single precision (C_FLOAT): 1.49x speedup (vectorised vs scalar) - AVX-256 code generation confirmed (ymm registers in assembly and binary) - This configuration uses the same instruction set as ``flang`` (AVX-256), enabling direct comparison ``gfortran`` (GCC 15.1.0) - AVX-512: - Generates AVX-512 (zmm) code with ``-march=sapphirerapids -mprefer-vector-width=512`` - Single precision (C_FLOAT): 14.08x speedup (vectorised vs scalar) - highest performance observed - AVX-512 code generation confirmed (zmm registers in assembly and binary) - AMX flags available: ``-mamx-tile``, ``-mamx-int8``, ``-mamx-bf16`` (but AMX requires explicit intrinsics) - These results apply to AVX-512 vectorisation Equivalent instruction set comparison (AVX-256): - ``flang`` **(AVX-256)**: 2.93x speedup - ``ifx`` **(AVX-256)**: 1.49x speedup - ``flang`` achieves 2.93x speedup compared to 1.49x for ifx when both use AVX-256 - Both compilers use the same instruction set (AVX-256), enabling direct comparison .. note:: Results using different instruction sets (AVX-256 vs AVX-512) should not be compared directly. For equivalent instruction set comparisons, use AVX-256 mode with ``-mprefer-vector-width=256``. 
- **OpenMP scaling**: Both compilers demonstrate scaling with OpenMP - **Code compatibility**: Same source code compiles with both compilers - **Compiler flags**: - ``flang``: ``-march=sapphirerapids`` (AVX-512 code generation not supported) - ``ifx``: ``-march=sapphirerapids -mprefer-vector-width=512`` (generates AVX-512) Compiler-specific features: - ``flang`` (LLVM/21): - Uses ``-fopenmp`` flag for OpenMP - Manual MKL linking required - Compatible with modern Fortran standards - ``ifx`` (Intel oneAPI): - Uses ``-qopenmp`` flag for OpenMP - Simplified MKL linking with ``-qmkl=parallel`` - Integrated with Intel tools (VTune, Advisor) - ``gfortran`` (GCC 15.1.0): - Uses ``-fopenmp`` flag for OpenMP - Manual MKL linking required - **Capability**: Generates AVX-512 code with ``-march=sapphirerapids -mprefer-vector-width=512`` - **AMX support**: AMX flags available (``-mamx-tile``, ``-mamx-int8``, ``-mamx-bf16``) but AMX requires explicit intrinsics - **Performance**: 14.08x speedup with AVX-512 (single precision C_FLOAT) - highest performance observed .. important:: Compiler performance results are not directly comparable because they use fundamentally different instruction sets (AVX-256 vs AVX-512). Each compiler’s results should be evaluated independently. Example 8: C++ compiler comparison (``clang++``, ``g++``, ``icpx``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This section compares C++ compilers (``clang++``, ``g++``, and ``icpx``) for AVX-512, AVX-256, and AMX support on Sapphire Rapids. Compilers tested: - ``clang++`` (LLVM/21): ``module load llvm/21`` - ``g++`` (GCC 15.1.0): ``module load gcc/15.1.0`` - ``icpx`` (Intel oneAPI 2025.0.4): ``module load compiler-intel-llvm/2025.0.4`` Test code: ``sapphirerapids/vectorized_compute.cpp`` AVX-512 support ^^^^^^^^^^^^^^^ All three compilers support AVX-512 code generation: Compilation flags for AVX-512: .. code-block:: bash # clang++ (LLVM/21) clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ vectorised_compute.cpp -o vectorised_compute_clang # g++ (GCC 15.1.0) g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ vectorised_compute.cpp -o vectorised_compute_gcc # icpx (Intel oneAPI) icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \ vectorised_compute.cpp -o vectorised_compute_icpx AVX-512 performance results (10M elements, single precision): +-------------------------+---------+-------------------------------------------------+ | Compiler | Speedup | Remarks | +=========================+=========+=================================================+ | ``clang++`` (LLVM/21) | 1.25x | AVX-512 (zmm) confirmed in assembly | +-------------------------+---------+-------------------------------------------------+ | ``g++`` (GCC 15.1.0) | 0.96x | AVX-512 (zmm) confirmed, but slower than scalar | +-------------------------+---------+-------------------------------------------------+ | ``icpx`` (Intel oneAPI) | 1.06x | AVX-512 (zmm) confirmed | +-------------------------+---------+-------------------------------------------------+ .. note:: AVX-512 results show limited speedup due to CPU frequency scaling on Sapphire Rapids. The vectorised code uses zmm registers but may experience downclocking. AVX-256 support (equivalent instruction set) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For equivalent instruction set comparison, all compilers can be forced to use AVX-256: Compilation flags for AVX-256: .. 
code-block:: bash # clang++ (LLVM/21) clang++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \ vectorized_compute.cpp -o vectorized_compute_clang_avx256 # g++ (GCC 15.1.0) g++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \ vectorized_compute.cpp -o vectorized_compute_gcc_avx256 # icpx (Intel oneAPI) icpx -O3 -march=sapphirerapids -mprefer-vector-width=256 -qopenmp \ vectorized_compute.cpp -o vectorized_compute_icpx_avx256 AVX-256 performance results (10M elements, single precision): ======================= ======= ======================= Compiler Speedup Remarks ======================= ======= ======================= ``clang++`` (LLVM/21) 1.28x AVX-256 (ymm) confirmed ``g++`` (GCC 15.1.0) 1.02x AVX-256 (ymm) confirmed ``icpx`` (Intel oneAPI) 1.05x AVX-256 (ymm) confirmed ======================= ======= ======================= ``clang++`` achieves 1.28x speedup with AVX-256, followed by icpx (1.05x) and g++ (1.02x). All compilers use the same instruction set (AVX-256), enabling direct comparison. AMX support ^^^^^^^^^^^ All three compilers support AMX flags, but AMX requires explicit intrinsics: AMX compilation flags: .. code-block:: bash # clang++ (LLVM/21) clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \ amx_ml_example.cpp -o amx_ml_example_clang # g++ (GCC 15.1.0) g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \ amx_ml_example.cpp -o amx_ml_example_gcc # icpx (Intel oneAPI) icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \ amx_ml_example.cpp -o amx_ml_example_icpx AMX Support Status: +-------------------------+-----------------+-------------------------+---------------------------------------------------------+ | Compiler | AMX Flags | AMX Intrinsics | Remarks | +=========================+=================+=========================+=========================================================+ | ``clang++`` (LLVM/21) | Supported | Works | AMX intrinsics compile successfully | +-------------------------+-----------------+-------------------------+---------------------------------------------------------+ | ``g++`` (GCC 15.1.0) | Supported | Partial | ``vmovw`` (BF16) instruction not supported by assembler | +-------------------------+-----------------+-------------------------+---------------------------------------------------------+ | ``icpx`` (Intel oneAPI) | Supported | Works | AMX intrinsics compile successfully | +-------------------------+-----------------+-------------------------+---------------------------------------------------------+ - AMX flags enable AMX instruction support, but AMX is not auto-vectorised - AMX must be used via explicit intrinsics (e.g., ``_tile_loadd``, ``_tile_dpbssd``) - ``g++`` has an assembler limitation with BF16 instructions (``vmovw``) - For AMX usage, ``clang++`` and ``icpx`` are recommended AVX-512: - All compilers generate AVX-512 code (zmm registers) - Performance limited by CPU frequency scaling - ``clang++`` achieves 1.25x speedup AVX-256 (Equivalent Instruction Set): - All compilers can be forced to use AVX-256 - ``clang++`` achieves 1.28x speedup - This configuration provides equivalent instruction sets for direct comparison AMX: - All compilers support AMX flags - AMX requires explicit intrinsics (not auto-vectorised) - ``clang++`` and ``icpx`` recommended for AMX code - ``g++`` has BF16 instruction limitations - For AVX-512: ``clang++`` achieves 1.25x speedup - For AVX-256: ``clang++`` achieves 1.28x speedup - For AMX: Use ``clang++`` 
or ``icpx`` (``g++`` has limitations) - For equivalent instruction set comparisons: Use AVX-256 mode (``-mprefer-vector-width=256``) Example 9: OpenMP library comparison ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This section compares OpenMP libraries from different compiler suites and their support for AVX-512 and AMX SIMD. OpenMP Libraries Tested: - ``libomp`` (LLVM/21): OpenMP 5.0, used with ``clang++ -fopenmp`` - ``libiomp5`` (Intel oneAPI): Intel OpenMP, used with ``icpx -qopenmp`` - ``libgomp`` (GCC 15.1.0): GNU OpenMP 4.5, used with ``g++ -fopenmp`` Test code: ``sapphirerapids/openmp_simd_test.cpp`` OpenMP libraries and versions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +-------------------------+-------------------+-------------------+-------------------------------------------------------------------------+ | Compiler | OpenMP Library | OpenMP Version | Library Path | +=========================+===================+===================+=========================================================================+ | ``clang++`` (LLVM/21) | ``libomp.so`` | 5.0 (202011) | ``/opt/software/llvm/21/21.1.0/lib/x86_64-unknown-linux-gnu/libomp.so`` | +-------------------------+-------------------+-------------------+-------------------------------------------------------------------------+ | ``icpx`` (Intel oneAPI) | ``libiomp5.so`` | 5.0 (202011) | ``/opt/intel/oneapi/compiler/2025.0/lib/libiomp5.so`` | +-------------------------+-------------------+-------------------+-------------------------------------------------------------------------+ | ``g++`` (GCC 15.1.0) | ``libgomp.so.1`` | 4.5 (201511) | ``/opt/software/gnu/gcc-15/gcc-15.1.0/lib64/libgomp.so.1`` | +-------------------------+-------------------+-------------------+-------------------------------------------------------------------------+ AVX-512 support in OpenMP SIMD ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ All three OpenMP libraries support AVX-512 SIMD vectorisation: Compilation: .. code-block:: bash # clang++ with libomp clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ openmp_simd_test.cpp -o openmp_simd_test_clang # icpx with libiomp5 icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \ openmp_simd_test.cpp -o openmp_simd_test_icpx # g++ with libgomp g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \ openmp_simd_test.cpp -o openmp_simd_test_gcc AVX-512 Confirmation: - All three compilers generate AVX-512 (zmm) instructions in OpenMP SIMD loops - Assembly analysis confirms: ``vmovaps %zmm``, ``vfmadd231ps %zmm``, etc. 
- OpenMP SIMD pragmas successfully vectorise to AVX-512 Performance comparison ^^^^^^^^^^^^^^^^^^^^^^ Test Configuration: - Array size: 100M elements (single precision) - Iterations: 10 - Threads: 56 (one socket) Results (Parallel+SIMD, 56 threads): +------------------+---------------+---------------------------+-----------------------------------------------------------------+ | OpenMP Library | SIMD Speedup | Parallel+SIMD Speedup | Remarks | +==================+===============+===========================+=================================================================+ | libomp (LLVM/21) | 0.96x | 22.04x | Highest parallel performance (22.04x) | +------------------+---------------+---------------------------+-----------------------------------------------------------------+ | libiomp5 (Intel) | 1.12x | 6.26x | Higher SIMD performance (1.12x), lower parallel scaling (6.26x) | +------------------+---------------+---------------------------+-----------------------------------------------------------------+ | libgomp (GCC) | 1.29x | 17.34x | Highest SIMD-only performance (1.29x) | +------------------+---------------+---------------------------+-----------------------------------------------------------------+ - **libomp (LLVM/21)**: Highest parallel+SIMD performance (22.04x), strong thread scaling - **libgomp (GCC)**: Highest SIMD-only performance (1.29x), parallel scaling of 17.34x - **libiomp5 (Intel)**: Moderate performance, lower parallel scaling than libomp Thread scaling (Parallel+SIMD): ======= ============= ================ ============= Threads libomp (LLVM) libiomp5 (Intel) libgomp (GCC) ======= ============= ================ ============= 1 1.12x 1.11x 1.23x 28 9.36x 9.61x 9.33x 56 22.04x 6.26x 17.34x ======= ============= ================ ============= libomp achieves the highest scaling to 56 threads, while libiomp5 shows reduced scaling at high thread counts. AMX support in OpenMP ^^^^^^^^^^^^^^^^^^^^^ .. important:: OpenMP SIMD does not auto-vectorise to AMX instructions. AMX Usage with OpenMP: - AMX requires explicit intrinsics (``_tile_loadd``, ``_tile_dpbssd``, ``_tile_stored``, etc.) - AMX intrinsics can be used within OpenMP parallel regions - OpenMP does not generate AMX code automatically from SIMD pragmas - AMX must be manually integrated into OpenMP parallel code Example: .. code-block:: cpp #pragma omp parallel { // AMX intrinsics can be used here _tile_loadd(0, A, K); _tile_loadd(1, B, K); _tile_dpbssd(2, 0, 1); _tile_stored(2, C, N); } Compilation with AMX: .. 
code-block:: bash # All compilers support AMX flags clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \ amx_code.cpp -o amx_code_clang icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \ amx_code.cpp -o amx_code_icpx g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \ amx_code.cpp -o amx_code_gcc AVX-512 support: - All three OpenMP libraries support AVX-512 SIMD vectorisation - OpenMP SIMD pragmas generate AVX-512 (zmm) instructions - Highest performance: libomp (22.04x parallel+SIMD speedup) AMX support: - OpenMP SIMD does not auto-vectorise to AMX - AMX requires explicit intrinsics - AMX intrinsics can be used in OpenMP parallel regions - All compilers support AMX flags, but AMX must be manually integrated - For highest parallel+SIMD performance: Use libomp (LLVM/21) with ``clang++`` - For highest SIMD-only performance: Use libgomp (GCC 15.1.0) with ``g++`` - For AMX: Use explicit intrinsics within OpenMP parallel regions - For AVX-512 SIMD: All three libraries work, choose based on parallel scaling needs Example 10: C++ threads performance comparison ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This section compares native C++ ``std::thread`` performance across different compilers for matrix multiplication. Compilers tested: - ``clang++`` (LLVM/21): ``module load llvm/21`` - ``icpx`` (Intel oneAPI 2025.0.4): ``module load compiler-intel-llvm/2025.0.4`` - ``g++`` (GCC 15.1.0): ``module load gcc/15.1.0`` Test code: ``sapphirerapids/cpp_threads_matmul.cpp`` Requirements: - C++17 standard (``-std=c++17``) - Native C++ threads (``std::thread``), not OpenMP - Maximum 56 threads - Matrix size: 2048x2048 Compilation ^^^^^^^^^^^ Compilation flags: .. code-block:: bash # clang++ (LLVM/21) clang++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \ cpp_threads_matmul.cpp -o cpp_threads_matmul_clang # icpx (Intel oneAPI) icpx -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \ cpp_threads_matmul.cpp -o cpp_threads_matmul_icpx # g++ (GCC 15.1.0) g++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \ cpp_threads_matmul.cpp -o cpp_threads_matmul_gcc .. note:: The ``-pthread`` flag is required for C++ threads support. 
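
The test code partitions the matrix rows across threads (see the threading model notes below). The following minimal sketch is illustrative only; the helper name ``matmul_rows`` and the initialisation are placeholders rather than the actual ``cpp_threads_matmul.cpp`` source. Each ``std::thread`` owns a disjoint block of rows, so the kernel needs no locking, and the contiguous inner loop remains available for AVX-512 auto-vectorisation.

.. code-block:: cpp

    // Minimal sketch of row-partitioned matrix multiplication with std::thread.
    // Illustrative only; the benchmarked cpp_threads_matmul.cpp may differ in detail.
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Multiply rows [row_begin, row_end) of A (N x N) by B, accumulating into C.
    // The contiguous inner loop over j is auto-vectorised (AVX-512 with
    // -march=sapphirerapids -mprefer-vector-width=512).
    static void matmul_rows(const float* A, const float* B, float* C,
                            std::size_t N, std::size_t row_begin, std::size_t row_end) {
        for (std::size_t i = row_begin; i < row_end; ++i)
            for (std::size_t k = 0; k < N; ++k) {
                const float a = A[i * N + k];
                for (std::size_t j = 0; j < N; ++j)
                    C[i * N + j] += a * B[k * N + j];
            }
    }

    int main() {
        constexpr std::size_t N = 2048;      // matrix size used in this example
        const unsigned num_threads = 56;     // one socket (56 physical cores)

        std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

        // Partition rows evenly; each thread works on a disjoint row block,
        // so no synchronisation is needed inside the kernel.
        const std::size_t rows_per_thread = (N + num_threads - 1) / num_threads;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            const std::size_t begin = t * rows_per_thread;
            const std::size_t end   = std::min(N, begin + rows_per_thread);
            if (begin >= end) break;
            workers.emplace_back(matmul_rows, A.data(), B.data(), C.data(), N, begin, end);
        }
        for (auto& w : workers) w.join();    // wait for all row blocks
        return 0;
    }
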
AVX-512 vectorization ^^^^^^^^^^^^^^^^^^^^^ All three compilers generate AVX-512 code in the matrix multiplication kernel: - Assembly analysis confirms: ``vmovaps %zmm``, ``vfmadd213ps %zmm``, ``vfmadd231ps %zmm`` - The inner loop is auto-vectorised to use AVX-512 instructions - Each thread benefits from AVX-512 vectorisation Performance results ^^^^^^^^^^^^^^^^^^^ Test Configuration: - Matrix size: 2048x2048 (single precision) - Iterations: 5 - Thread counts: 1, 14, 28, 56 Performance Comparison (GFLOPS): +---------+-----------------------+-------------------------+----------------------+ | Threads | ``clang++`` (LLVM/21) | ``icpx`` (Intel oneAPI) | ``g++`` (GCC 15.1.0) | +=========+=======================+=========================+======================+ | 1 | 17.57 GFLOPS | 17.34 GFLOPS | 17.56 GFLOPS | +---------+-----------------------+-------------------------+----------------------+ | 14 | 206.99 GFLOPS | 208.49 GFLOPS | 204.04 GFLOPS | +---------+-----------------------+-------------------------+----------------------+ | 28 | 381.77 GFLOPS | 319.33 GFLOPS | 381.77 GFLOPS | +---------+-----------------------+-------------------------+----------------------+ | 56 | 423.15 GFLOPS | 287.29 GFLOPS | 401.40 GFLOPS | +---------+-----------------------+-------------------------+----------------------+ Speedup Comparison (relative to 1 thread): +---------+-----------------------+-------------------------+----------------------+ | Threads | ``clang++`` (LLVM/21) | ``icpx`` (Intel oneAPI) | ``g++`` (GCC 15.1.0) | +=========+=======================+=========================+======================+ | 1 | 1.00x | 1.00x | 1.00x | +---------+-----------------------+-------------------------+----------------------+ | 14 | 11.78x | 12.02x | 11.62x | +---------+-----------------------+-------------------------+----------------------+ | 28 | 21.73x | 18.42x | 21.74x | +---------+-----------------------+-------------------------+----------------------+ | 56 | 24.09x | 16.57x | 22.86x | +---------+-----------------------+-------------------------+----------------------+ Single-threaded performance: - All compilers show similar single-threaded performance (~17.5 GFLOPS) - Differences are within measurement variance Multi-threaded scaling: - ``clang++`` **(LLVM/21)**: Highest scaling to 56 threads (24.09x speedup, 423.15 GFLOPS) - ``g++`` **(GCC 15.1.0)**: Scaling of 22.86x speedup (401.40 GFLOPS) - ``icpx`` **(Intel oneAPI)**: Shows performance degradation at 56 threads (16.57x speedup, 287.29 GFLOPS) 1. ``clang++`` achieves 423.15 GFLOPS at 56 threads 2. ``g++`` achieves 401.40 GFLOPS at 56 threads, lower than ``clang++`` 3. ``icpx`` shows reduced performance at 56 threads, possibly due to thread contention or NUMA issues 4. All compilers demonstrate strong scaling up to 28 threads 5. 
AVX-512 vectorisation is used by all compilers in the inner loop

Threading Model:

- Uses C++17 ``std::thread`` (native C++ threads)
- Each thread processes a portion of rows
- No OpenMP overhead - pure C++ threading
- Thread creation and synchronization handled by C++ standard library

Comparison with OpenMP
^^^^^^^^^^^^^^^^^^^^^^

C++ Threads vs OpenMP (56 threads, ``clang++``):

- C++ Threads: 423.15 GFLOPS (24.09x speedup)
- OpenMP Parallel+SIMD: Similar performance range (indicative only; the OpenMP figures in Example 9 come from a different kernel)
- C++ threads provide more control but require manual thread management
- OpenMP provides easier parallelization with pragmas

Performance results:

- ``clang++`` **(LLVM/21)**: 423.15 GFLOPS at 56 threads (24.09x speedup) - highest performance
- ``g++`` **(GCC 15.1.0)**: 401.40 GFLOPS at 56 threads (22.86x speedup)
- ``icpx`` **(Intel oneAPI)**: 287.29 GFLOPS at 56 threads (16.57x speedup) - shows degradation
- For highest C++ threads performance: ``clang++`` (LLVM/21) achieves 423.15 GFLOPS
- For performance with simpler code: ``g++`` (GCC 15.1.0) achieves 401.40 GFLOPS
- For 28 threads or fewer: All compilers demonstrate acceptable performance
- For 56 threads: ``clang++`` or ``g++`` recommended (``icpx`` shows degradation)

AVX-512 support:

- All compilers generate AVX-512 code in the matrix multiplication kernel
- Each thread benefits from AVX-512 vectorisation
- Performance scales well with thread count when using AVX-512

Runtime optimiser considerations
--------------------------------

Intel processor optimisation features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unlike AMD Zen2, Intel Sapphire Rapids does not have a documented embedded runtime optimiser that performs instruction-level optimisations at runtime. However, Intel processors include several hardware-level optimisation features:

1. **Out-of-Order Execution**: The processor can reorder instructions at runtime to maximise instruction-level parallelism, but this is a standard feature of modern processors, not a specialised runtime optimiser.

2. **Hardware Prefetching**: Aggressive hardware prefetchers that predict and prefetch data into cache, reducing memory latency.

3. **Branch Prediction**: Sophisticated branch prediction units that minimise branch misprediction penalties.

4. **Turbo Boost**: Dynamic frequency scaling based on workload and thermal headroom.

5. **Hyper-Threading (SMT)**: Simultaneous Multi-Threading that enables improved utilisation of execution units.

Implications for optimisation strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The lack of a runtime optimiser means:

1. **Compile-time optimizations are critical**: Unlike Zen2, where ``-O2`` and ``-O3`` show minimal differences due to suspected runtime optimization, Sapphire Rapids benefits significantly from aggressive compile-time optimizations. ``-O3`` typically provides 5-15% improvement over ``-O2`` for compute-bound workloads.

2. **Explicit optimization hints are valuable**: Hints like ``__restrict__``, explicit vectorisation, and loop unrolling provide measurable benefits because the compiler is the primary optimization mechanism.

3. **Profile-Guided Optimization is essential**: PGO provides significant benefits (10-30%) because it guides compile-time optimizations based on actual runtime behaviour.

4. **Architecture-specific flags matter more**: Flags like ``-march=sapphirerapids`` and ``-mprefer-vector-width=512`` are critical for enabling hardware features that the compiler can utilise.

5.
**Code layout optimizations**: BOLT and other post-link optimizations are valuable because they optimize code layout based on runtime profiles, compensating for the lack of runtime optimization. 6. **Vectorization is compiler-dependent**: Unlike systems with runtime optimizers that might optimize vectorisation at runtime, Sapphire Rapids relies entirely on compiler vectorisation. Explicit vectorisation hints and compiler flags are important. Intel Sapphire Rapids processors do not include an embedded runtime optimiser like AMD Zen2 may have. This means: - **Compile-time optimisations are the primary mechanism** for performance improvements - ``-O3`` **provides significant benefits** over ``-O2`` for compute-bound workloads (typically 5-15%) - **Explicit optimisation hints** (``__restrict__``, vectorisation, etc.) provide measurable benefits - **Profile-Guided Optimisation** is essential and provides 10-30% improvements - **Architecture-specific flags** are critical for enabling hardware features - **Post-link optimisations** (BOLT) are valuable for code layout optimisation The optimisation strategy for Sapphire Rapids should focus on aggressive compile-time optimisations, PGO, and architecture-specific flags rather than relying on runtime optimisation capabilities. Benchmark results summary ------------------------- Measured performance results from the optimisation examples on Intel Xeon Platinum 8480C (Sapphire Rapids): Measured speedups ~~~~~~~~~~~~~~~~~ 1. **Vectorization with AVX-512**: 1.29x speedup (LLVM/21) vs 1.01x (Intel oneAPI 2025.0.4) - LLVM shows better vectorisation optimisation - Both compilers produce correct results (checksums match) 2. **Combined optimisations** (AVX-512 + restrict + alignment): 1.27x speedup (LLVM/21) vs 1.03x (Intel oneAPI) - LLVM better at combining multiple optimisations - Intel compiler shows more conservative optimisation 3. **Restrict pointer optimisation**: 1.28x speedup (LLVM/21) vs 1.22x (Intel oneAPI) - Both compilers benefit from ``__restrict__`` hints - LLVM shows slightly better alias analysis optimisation 4. **Cache-aware data layout**: 8.05x speedup (LLVM/21) vs 8.09x (Intel oneAPI) - Both compilers demonstrate strong performance for memory layout optimisations - This is the largest speedup category (memory-bound optimisation) 5. **Memory alignment**: 7.48x speedup (LLVM/21) vs 5.20x (Intel oneAPI) - Both compilers benefit significantly from proper alignment - LLVM shows better utilisation of aligned memory access Compiler comparison: LLVM/21 vs Intel oneAPI 2025.0.4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The examples were tested with both LLVM/21 (``clang++``) and Intel oneAPI 2025.0.4 (``icpx``) compilers. Results show: LLVM/21 Advantages: - Better vectorisation optimisation (1.29x vs 1.01x for AVX-512) - Better combined optimisation performance (1.27x vs 1.03x) - Better memory alignment utilisation (7.48x vs 5.20x) - More aggressive optimisation with ``-O3`` Intel oneAPI Advantages: - Slightly better cache layout optimisation (8.09x vs 8.05x) - More conservative optimisation may be beneficial for stability - Better integration with Intel-specific tools (VTune, oneDNN) Code Compatibility: - All examples compile and run with both compilers using the same source code - No code modifications needed between compilers - Both compilers support the same optimisation flags (``-march=sapphirerapids``, ``-mprefer-vector-width=512``, etc.) 
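
As an illustration of this portability, the sketch below (simplified, assumed code; not the actual ``vectorized_compute.cpp`` benchmark source) combines the ``__restrict__`` aliasing hint and 64-byte aligned allocations discussed above, and builds unchanged with ``clang++``, ``g++`` and ``icpx`` using ``-O3 -march=sapphirerapids -mprefer-vector-width=512``.

.. code-block:: cpp

    // Minimal sketch of a kernel that compiles unchanged with clang++, g++ and icpx.
    // Illustrative only; not the benchmarked vectorized_compute.cpp source.
    #include <cstdlib>
    #include <cstddef>

    // __restrict__ asserts that the three arrays do not alias, which removes a
    // major obstacle to auto-vectorisation; 64-byte alignment matches the
    // AVX-512 (zmm) register width and the cache-line size.
    void scale_add(float* __restrict__ out,
                   const float* __restrict__ x,
                   const float* __restrict__ y,
                   float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];        // vectorised to AVX-512 FMA at -O3
    }

    int main() {
        const std::size_t n = 10'000'000;    // 10M elements, as in the benchmarks
        // aligned_alloc: the size must be a multiple of the alignment (true here).
        auto* x   = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        auto* y   = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        auto* out = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
        for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        scale_add(out, x, y, 3.0f, n);

        std::free(x); std::free(y); std::free(out);
        return 0;
    }
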
- **Use LLVM/21** for maximum performance on compute-bound workloads - **Use Intel oneAPI** when integration with Intel tools (VTune, oneDNN) is required - **Test both compilers** for your specific workload to determine the appropriate choice Performance comparison table ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following table shows measured performance differences between LLVM/21 and Intel oneAPI 2025.0.4 compilers: +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Optimisation Type | Example | LLVM/21 Speedup | Intel oneAPI Speedup | Performance Difference | Remarks | +=================================================+============================+=================+======================+========================+==================================================================================+ | AVX-512 Vectorization | ``vectorized_compute`` | 1.29x | 1.01x | +27.7% | LLVM shows significantly better vectorization | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Combined Optimisations | ``combined_optimization`` | 1.27x | 1.03x | +23.3% | LLVM better at combining optimisations | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Restrict Pointers | ``restrict_example`` | 1.28x | 1.22x | +4.9% | Both compilers benefit, LLVM slightly better | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Cache Layout | ``cache_layout`` | 8.05x | 8.09x | -0.5% | Essentially equivalent performance | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Memory Alignment | ``memory_alignment`` | 7.48x | 5.20x | +43.8% | LLVM shows higher alignment utilisation | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | MKL DGEMM (2048x2048) | ``mkl_benchmark`` | 99.75 GFLOPS | 99.88 GFLOPS | -0.1% | Essentially equivalent (MKL is pre-compiled library) | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | MKL DGEMM OpenMP (2048x2048, 56 threads) | ``mkl_openmp_example`` | ~2800 GFLOPS | ~2850 GFLOPS | -1.8% | Both achieve strong scaling with OpenMP threading | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | MKL DGEMM MPI (2 processes, 56 threads/process) | ``mkl_mpi_example`` | 3193 GFLOPS | 3192 
GFLOPS | +0.03% | Essentially equivalent performance, both demonstrate scaling with MPI | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | oneDNN MatMul (2048x2048) | ``onednn_benchmark`` | 4278 GFLOPS | 4022 GFLOPS | +6.4% | LLVM shows higher performance for user code, oneDNN library is pre-compiled | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Fortran AVX-256 (100M elements, ``flang`` only) | ``fortran_avx512_example`` | 2.97x speedup | N/A | N/A | ``flang`` AVX-256 only (single precision ``C_FLOAT``), not comparable to ``ifx`` | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ | Fortran AVX-512 (100M elements, ifx only) | ``fortran_avx512_example`` | N/A | 1.74x speedup | N/A | ``ifx`` AVX-512 (single precision ``C_FLOAT``), not comparable to ``flang`` | +-------------------------------------------------+----------------------------+-----------------+----------------------+------------------------+----------------------------------------------------------------------------------+ **Performance Difference** = ((LLVM Speedup - Intel Speedup) / Intel Speedup) × 100% .. note:: For MKL benchmarks, performance is measured in GFLOPS rather than speedup, as MKL is a pre-compiled optimized library. Both compilers achieve similar performance since MKL routines are independent of the compiler used. OpenMP and MPI variants demonstrate strong scaling characteristics with both compilers. Results shown are from local testing; for production runs, use the provided SLURM job scripts. 
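
For reference, the GFLOPS figures for DGEMM are computed as 2 × M × N × K floating-point operations divided by the wall-clock time of the call. A minimal timing sketch against MKL's CBLAS interface is shown below (illustrative only; the actual ``mkl_benchmark`` source may differ). Link it against MKL as described earlier, e.g. with ``-qmkl=parallel`` for ``icpx`` or the manual MKL link line for ``clang++``/``g++``.

.. code-block:: cpp

    // Minimal sketch: time one MKL DGEMM call and report GFLOPS.
    // Illustrative only; not the actual mkl_benchmark source.
    #include <mkl.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const MKL_INT n = 2048;                          // 2048x2048, as in the table
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

        // Warm-up call so MKL initialisation and thread spawning are not timed.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

        const auto t0 = std::chrono::steady_clock::now();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
        const auto t1 = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        const double gflops  = 2.0 * n * n * n / seconds / 1.0e9;   // 2*M*N*K flops
        std::printf("DGEMM %lldx%lld: %.2f GFLOPS\n",
                    (long long)n, (long long)n, gflops);
        return 0;
    }
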
Performance analysis by category ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compute-bound optimisations (vectorisation, combined): - LLVM/21 shows 23-28% better performance for compute-bound workloads - Intel oneAPI is more conservative, showing minimal speedup (1-3%) - Recommendation: Use LLVM/21 for compute-intensive applications Memory-bound optimisations (cache, alignment): - Both compilers show excellent cache layout optimisation (8x speedup) - LLVM/21 shows 44% better performance for memory alignment - Recommendation: Both compilers work well, but LLVM has an edge for alignment-sensitive code Pointer aliasing (restrict): - Both compilers benefit from ``__restrict__`` hints - LLVM/21 shows slightly better optimisation (4.9% difference) - Recommendation: Either compiler works well, LLVM has a small advantage Overall performance summary ~~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Category | LLVM/21 Performance | Intel oneAPI Performance | +===============================+=====================================================+===========================================================+ | Compute-bound | 23-28% higher speedup | 1-3% speedup (conservative) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Memory-bound | 44% higher speedup (alignment) | 0.5% higher speedup (cache layout) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | MKL Integration | Manual linking required | Simplified ``-qmkl`` flag | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | MKL Performance (Sequential) | 99.75 GFLOPS (DGEMM 2048x2048) | 99.88 GFLOPS (DGEMM 2048x2048) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | MKL with OpenMP | Uses ``-lmkl_gnu_thread -liomp5`` | Uses ``-qmkl=parallel`` (Intel OpenMP) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | MKL with MPI | Manual MPI linking, set ``I_MPI_CXX=clang++`` | Automatic with ``-qmkl=parallel``, set ``I_MPI_CXX=icpx`` | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | oneDNN Performance | 4278 GFLOPS (MatMul 2048x2048) | 4022 GFLOPS (MatMul 2048x2048) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Fortran Compilers | ``flang`` (LLVM/21) with ``-fopenmp`` | ``ifx`` (Intel oneAPI) with ``-qopenmp`` | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Fortran Performance (AVX-256) | 2.97x speedup (single precision C_FLOAT) | N/A (``ifx`` generates AVX-512, not AVX-256) | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Fortran Performance (AVX-512) | N/A (``flang`` does not generate AVX-512) | 1.74x speedup (single precision C_FLOAT) | 
+-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Note | Results not comparable - different instruction sets | Results not comparable - different instruction sets | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ | Tool Integration | Standard LLVM toolchain | Intel VTune, oneDNN integration | +-------------------------------+-----------------------------------------------------+-----------------------------------------------------------+ General performance characteristics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. Optimisation level ``-O3`` provides significant benefits over ``-O2`` for compute-bound workloads (typically 5-15% improvement). 2. Data layout optimisation provides the largest performance improvements. Cache-aware data structure design shows 8x speedup in benchmarks, exceeding other optimisation techniques. 3. Profile-Guided Optimisation (PGO) provides 10-30% performance gains with proper profiling workflows. 4. Use ``-march=sapphirerapids`` to enable architecture-specific optimisations including AVX-512 and AMX. For ML/AI workloads, AMX provides 2-8x speedup over AVX-512 for matrix operations. Use ``-mamx-tile -mamx-int8 -mamx-bf16`` to enable all AMX types. 5. Sapphire Rapids supports AVX-512. Use 512-bit vectors with ``-mprefer-vector-width=512`` for compute-bound workloads. For memory-bound code, 256-bit vectors may be preferable to reduce register pressure. 6. Loop unrolling should be tuned based on instruction cache capacity. Profile to find optimal unroll factor. 7. For mixed workloads, blend PGO profiles by weighting representative workloads appropriately. 8. ``__restrict__`` benefits are significant for complex pointer patterns (1.2-1.3x speedup observed). Profile to identify where alias analysis limits optimisation. 9. Memory alignment provides significant performance improvements (5-7x speedup), enabling vectorisation and reducing cache penalties. 10. Dual-socket systems have 2 NUMA domains (one per socket). Use SLURM ``--sockets-per-node`` and ``--cpu-bind=sockets`` to bind to specific NUMA domains. 11. Thread affinity binding is workload-dependent. For single-process workloads, OS scheduling often performs well, but explicit CPU affinity binding may be valuable for multi-process applications and NUMA-aware code. 12. Optimisation priorities: data layout (8x), memory alignment (5-7x), NUMA awareness, and memory access patterns provide larger performance gains than micro-optimisations (1.2-1.3x). Further reading --------------- - `Intel Sapphire Rapids Architecture Documentation `__ - `LLVM Optimization Passes `__ - `Profile-Guided Optimisation Guide `__ - `Intel AVX-512 Programming Reference `__ - `Intel AMX Programming Reference `__