Intel Sapphire Rapids Optimisation Guide (Discoverer+ GPU partition)

Introduction

This document describes compilation and execution practices for systems based on the Intel Sapphire Rapids microarchitecture. Sapphire Rapids processors (4th Generation Intel Xeon Scalable, including the Xeon Platinum 8480C) have specific characteristics that affect performance.

The code examples and optimisation techniques explained in this document apply to systems equipped with Intel Xeon Platinum 8480C processors. The system configuration comprises 2 sockets with 56 cores per socket (one NUMA domain per socket), for a total of 112 cores and, with SMT (Simultaneous Multi-Threading), 224 hardware threads.

For detailed hardware specifications of the Discoverer+ compute nodes (based on DGX H200 servers) where these Intel Xeon Platinum processors are installed, refer to the Discoverer Resource Overview.

All compilation and code execution must take place on compute nodes. The only way to access compute nodes is through SLURM batch jobs; direct execution and compilation on login nodes are not permitted. All examples in this document must be submitted as SLURM batch jobs using the provided SLURM scripts in the sapphirerapids/ directory located at /opt/software/sapphirerapids/. The test code is also available online at:

https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids

Important

Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs.

Sapphire Rapids architecture overview

Intel Sapphire Rapids microarchitecture (codenamed “SPR”) is built on the Intel 7 process (formerly 10 nm Enhanced SuperFin) and was introduced in 2023. Sapphire Rapids processors implement a tile-based architecture with multiple compute tiles connected via Intel’s EMIB (Embedded Multi-die Interconnect Bridge).

The core architecture consists of Performance cores (P-cores) based on the Golden Cove microarchitecture. Each core has dedicated L1 and L2 caches. The cache hierarchy includes a 48KB L1D (data cache) and 32KB L1I (instruction cache) per core, 2MB of L2 cache per core, and up to 112.5MB of L3 cache shared across the socket (depending on SKU).
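
These sizes can be confirmed at runtime on the compute nodes; below is a minimal sketch using glibc's sysconf cache constants (a glibc extension; other C libraries may report 0 for these values):

// cache_info.cpp -- print cache geometry as reported by glibc's sysconf
#include <unistd.h>
#include <cstdio>

int main() {
    // The _SC_LEVEL* constants are a glibc extension; a value of 0 means "unknown".
    std::printf("L1D:  %ld KB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    std::printf("L1I:  %ld KB\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
    std::printf("L2:   %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
    std::printf("L3:   %ld MB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / (1024 * 1024));
    std::printf("Line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}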

Sapphire Rapids cores feature a wide instruction dispatch pipeline with dual 512-bit FMA (Fused Multiply-Add) units per core, enabling simultaneous execution of two 512-bit vector operations. The architecture supports AVX-512 instructions including AVX-512F, AVX-512BW, AVX-512CD, AVX-512DQ, AVX-512VL, AVX-512_VNNI, AVX-512_BF16, AVX-512_FP16, and AVX-512_VBMI2.

Advanced Matrix Extensions (AMX) is a key feature of Sapphire Rapids specifically designed for AI/ML workloads. AMX provides three types of acceleration:

  1. AMX-TILE: 8KB of dedicated tile registers (8 tiles × 1KB each) for efficient matrix data storage and manipulation
  2. AMX-INT8: Hardware acceleration for 8-bit integer matrix multiplication, ideal for quantized neural network inference with 4-8x speedup over AVX-512
  3. AMX-BF16: Hardware acceleration for bfloat16 matrix multiplication, ideal for mixed-precision training and inference with 2-4x speedup over AVX-512

AMX enables significant performance improvements for deep learning workloads, transformer models, and large language model inference. Each core has independent AMX tile registers, allowing efficient parallelization across cores.

The branch prediction unit uses a sophisticated multi-level predictor with improved accuracy over previous generations. Memory disambiguation capabilities allow the processor to detect and handle memory dependencies effectively, enabling out-of-order execution optimisations.

For multi-socket systems such as those based on the Intel Xeon Platinum 8480C, each socket contains multiple tiles connected via EMIB. Each socket presents as a single NUMA domain, with memory controllers distributed across the socket. On a 2-socket system this yields 2 NUMA domains with 56 cores per domain, i.e. 112 cores in total (224 hardware threads with SMT).

Optimisation levels: -O2 vs -O3

Unlike AMD Zen2, Intel Sapphire Rapids does not have a documented embedded runtime optimiser. This means compile-time optimisations, including those enabled by -O3, are more important for achieving optimal performance.

  • Use -O3 for compute-bound workloads: Provides aggressive optimisations including vectorisation, loop unrolling, and inlining that significantly benefit Sapphire Rapids
  • Use -O2 for memory-bound or mixed workloads: Provides balanced optimisation without excessive code bloat that can hurt instruction cache performance
  • Profile to determine optimal level: Test both -O2 and -O3 for your specific workload; -O3 typically provides 5-15% improvement for compute-bound code
  • Combine with architecture-specific flags: -O3 benefits are amplified when combined with -march=sapphirerapids and AVX-512 optimisations

The lack of a runtime optimiser means that compile-time optimisations are the primary mechanism for performance improvements. Aggressive optimisations at compile time translate directly to runtime performance.

CPU-specific compilation flags

Architecture targeting

# Use -march=sapphirerapids to enable Sapphire Rapids-specific instructions
-march=sapphirerapids

# This enables:
# - AVX-512 (512-bit vectors)
# - AVX-512_VNNI (vector neural network instructions)
# - AVX-512_BF16 (bfloat16 support)
# - AVX-512_FP16 (half-precision floating point)
# - AMX (Advanced Matrix Extensions)
# - Other Sapphire Rapids-specific instruction sets

# Alternative: Use -march=native to auto-detect all features
-march=native

Vector width optimisation

# Optimal vector width for Sapphire Rapids is 512-bit vectors (AVX-512 support)
# Sapphire Rapids has dual 512-bit FMA units per core
-mprefer-vector-width=512

# For workloads that may benefit from 256-bit vectors (less register pressure)
# Use 256-bit for memory-bound code or when register spilling occurs
-mprefer-vector-width=256

# Ensures vectorized math uses FMA instructions
-mfma

AVX-512 specific optimisations

# Enable the AVX-512 foundation and common subsets
-mavx512f -mavx512dq -mavx512cd -mavx512bw -mavx512vl

# Enable AVX-512 VNNI for neural network workloads
-mavx512vnni

# Enable AVX-512 BF16 for bfloat16 operations
-mavx512bf16

# Enable AVX-512 FP16 for half-precision operations
-mavx512fp16

# Note: -march=sapphirerapids automatically enables all supported AVX-512 variants

AMX (Advanced Matrix Extensions)

AMX is Intel’s dedicated hardware acceleration for matrix operations, specifically designed for AI/ML workloads. Sapphire Rapids supports three AMX types:

  1. AMX-TILE: Provides 8KB of tile registers (8 tiles × 1KB each) for matrix data storage
  2. AMX-INT8: Accelerates 8-bit integer matrix multiplication (INT8 quantization)
  3. AMX-BF16: Accelerates bfloat16 matrix multiplication (BF16 mixed precision)

Compilation flags

# Enable all AMX types for matrix multiplication workloads (AI/ML)
-mamx-tile -mamx-int8 -mamx-bf16

# Or use -march=sapphirerapids which automatically enables AMX support
-march=sapphirerapids

# Note: AMX requires runtime detection and explicit usage
# Compiler will not auto-vectorize to AMX; requires manual intrinsics

Runtime detection

AMX requires runtime detection and proper OS support (Linux kernel 5.16+):

#include <cpuid.h>
#include <immintrin.h>

bool check_amx_support() {
    unsigned int eax, ebx, ecx, edx;

    // AMX feature bits are reported in CPUID leaf 0x7, subleaf 0x0, EDX
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);

    if ((edx & (1u << 24)) == 0) return false;  // AMX-TILE (bit 24)
    if ((edx & (1u << 25)) == 0) return false;  // AMX-INT8 (bit 25)
    if ((edx & (1u << 22)) == 0) return false;  // AMX-BF16 (bit 22)

    return true;
}
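
On Linux, CPUID detection alone is not sufficient: each process must also request permission to use the AMX tile data state before executing tile instructions, otherwise they fault. A minimal sketch of that request via arch_prctl (assuming kernel 5.16+; the constants match the kernel uapi headers and are defined here only as a fallback):

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Constants from the Linux uapi headers (asm/prctl.h), defined as a fallback
// for older toolchain headers.
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILEDATA 18

// Ask the kernel to enable the AMX tile data state for this process.
bool request_amx_permission() {
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
        std::perror("ARCH_REQ_XCOMP_PERM");
        return false;
    }
    return true;
}

Call this once at start-up, before the first _tile_loadconfig.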

AMX tile configuration

AMX uses a tile configuration that must be set before using tile operations:

#include <immintrin.h>
#include <stdint.h>

// Configure AMX tiles for matrix multiplication.
// The 64-byte configuration blob has a fixed layout: byte 0 holds the palette
// id (must be 1), bytes 16-47 hold the bytes-per-row of tiles 0-15 (uint16
// each), and bytes 48-63 hold the row counts of tiles 0-15 (uint8 each).
// Tile dimensions: max 16 rows × 64 bytes per row.
void configure_amx_tiles() {
    // Tile 0: matrix A, Tile 1: matrix B, Tile 2: matrix C (accumulator),
    // each 16 rows × 64 bytes (1024 bytes)
    uint8_t tilecfg[64] = {0};

    tilecfg[0] = 1;      // palette 1

    // Bytes per row (uint16 at offset 16 + 2*tile)
    tilecfg[16] = 64;    // tile 0
    tilecfg[18] = 64;    // tile 1
    tilecfg[20] = 64;    // tile 2

    // Rows (uint8 at offset 48 + tile)
    tilecfg[48] = 16;    // tile 0
    tilecfg[49] = 16;    // tile 1
    tilecfg[50] = 16;    // tile 2

    _tile_loadconfig(tilecfg);
}

AMX-BF16 for neural network inference

AMX-BF16 is ideal for neural network inference with bfloat16 precision:

// Example: Matrix multiplication using AMX-BF16
// C = A × B where A and B are bfloat16 matrices; C accumulates in FP32
// (assumes B is pre-packed into the VNNI pair-interleaved layout expected
// by _tile_dpbf16ps)
#include <immintrin.h>
#include <stdint.h>

void amx_bf16_matmul(const __bf16* A, const __bf16* B, float* C,
                     int M, int N, int K) {
    // Configure tiles (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;                       // palette 1
    tilecfg[16] = 64; tilecfg[48] = 16;   // Tile 0: A block (16×32 bf16 elements)
    tilecfg[18] = 64; tilecfg[49] = 16;   // Tile 1: B block (VNNI-packed)
    tilecfg[20] = 64; tilecfg[50] = 16;   // Tile 2: C accumulator (16×16 FP32)
    _tile_loadconfig(tilecfg);

    // Load matrices into tiles and perform multiplication
    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            // Zero accumulator tile
            _tile_zero(2);

            for (int k = 0; k < K; k += 32) {
                // Load tile 0 with A[i:i+16, k:k+32]
                _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16));

                // Load tile 1 with the corresponding VNNI-packed block of B
                _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16));

                // Compute: tile2 += tile0 × tile1
                _tile_dpbf16ps(2, 0, 1);
            }

            // Store the FP32 result from tile 2 to C[i:i+16, j:j+16]
            _tile_stored(2, &C[i * N + j], N * sizeof(float));
        }
    }

    // Release tile configuration
    _tile_release();
}

AMX-INT8 for quantized neural networks

AMX-INT8 accelerates INT8 quantized models (common for inference):

// Example: INT8 quantized matrix multiplication
void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C,
                     int M, int N, int K) {
    // Configure tiles for INT8 (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;                       // palette 1
    tilecfg[16] = 64; tilecfg[48] = 16;   // Tile 0: A block (16×64 int8 elements)
    tilecfg[18] = 64; tilecfg[49] = 16;   // Tile 1: B block (VNNI-packed)
    tilecfg[20] = 64; tilecfg[50] = 16;   // Tile 2: C accumulator (16×16 int32)
    _tile_loadconfig(tilecfg);

    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            _tile_zero(2);  // Zero accumulator

            for (int k = 0; k < K; k += 64) {
                // Load A[i:i+16, k:k+64]
                _tile_loadd(0, &A[i * K + k], K);

                // Load B[k:k+64, j:j+16] (transposed)
                _tile_loadd(1, &B[k * N + j], N);

                // Compute: tile2 += tile0 × tile1 (INT8)
                _tile_dpbssd(2, 0, 1);
            }

            // Store result (int32 accumulator)
            _tile_stored(2, &C[i * N + j], N * sizeof(int32_t));
        }
    }

    _tile_release();
}

Performance considerations

  • Tile reuse: Keep tiles loaded across multiple operations to minimize memory traffic
  • Blocking: Use appropriate block sizes (16×64 for BF16, 16×64 for INT8) to maximise tile utilisation
  • Memory alignment: Align matrices to 64-byte boundaries for optimal tile loading
  • Multi-threading: Each thread has its own tile registers; use one thread per core for AMX workloads
  • Mixed precision: Use BF16 for training/inference when precision allows; INT8 for maximum throughput in inference
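
Since the tile configuration is per-thread hardware state, each OpenMP thread must load its own configuration before issuing tile instructions. A minimal sketch of the multi-threading recommendation above (assuming AMX permission has already been requested as described earlier and the file is compiled with -mamx-tile -fopenmp):

#include <immintrin.h>
#include <omp.h>
#include <stdint.h>

void parallel_amx_region() {
    #pragma omp parallel
    {
        // Each thread owns its TILECFG state and must configure it itself.
        uint8_t tilecfg[64] = {0};
        tilecfg[0]  = 1;    // palette 1
        tilecfg[16] = 64;   // tile 0: 64 bytes per row
        tilecfg[48] = 16;   // tile 0: 16 rows
        _tile_loadconfig(tilecfg);

        // ... per-thread AMX work on this thread's block of the matrix ...

        _tile_release();    // drop tile state before the thread goes idle
    }
}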

Integration with ML frameworks

Many ML frameworks automatically use AMX when available:

  • TensorFlow: Enable with TF_ENABLE_ONEDNN_OPTS=1 (uses oneDNN library)
  • PyTorch: Uses oneDNN optimisations automatically on Sapphire Rapids
  • oneDNN: Intel’s deep neural network library with AMX support

# Enable oneDNN AMX optimizations
export TF_ENABLE_ONEDNN_OPTS=1
export ONEDNN_VERBOSE=1  # For debugging/verification

Loop and alignment

# Sapphire Rapids benefits from 64-byte alignment (the cache-line size,
# which is also the optimal alignment for AVX-512)
-falign-loops=64
-falign-functions=64

LLVM-specific optimisations

# Enable loop interchange (benefits from sophisticated branch prediction)
-mllvm -enable-loopinterchange

# Tune prefetch distance (Sapphire Rapids has aggressive prefetchers)
-mllvm -prefetch-distance=256

# Enable interleaved memory access optimization
-mllvm -enable-interleaved-mem-accesses

# Note: -enable-npm selects LLVM's new pass manager (the default in recent
# releases); it is not a NUMA-placement option
-mllvm -enable-npm=true

# Force vectorisation width and interleaving
# (-force-vector-width takes a lane count, e.g. 16 floats per 512-bit vector,
# not a width in bits)
-mllvm -force-vector-width=16
-mllvm -force-vector-interleave=2

What to avoid

  • Excessive loop unrolling: Can cause instruction cache misses; profile to find optimal unroll factor
  • Over-aggressive -ffast-math: Test carefully; precision requirements must be considered
  • Generic -march flags: Target sapphirerapids specifically for architecture-specific optimisations
  • Mixing AVX-512 and AVX2: Use consistent vector width throughout the application

Profile-guided optimisation (PGO)

On Sapphire Rapids, PGO typically provides 10-30% performance improvements for branch-heavy code, and it can be combined with LTO and BOLT.

PGO benefits for Sapphire Rapids

Given Sapphire Rapids’ sophisticated branch predictor and wide execution units, PGO provides significant benefits because it:

  • Optimises for the branch patterns actually observed at runtime
  • Improves code layout, reducing instruction cache misses
  • Splits hot and cold code, keeping the working set in L2/L3
  • Enables better vectorisation decisions based on runtime data

PGO workflow

# Step 1: Instrumentation build
clang++ -fprofile-generate -march=sapphirerapids -O3 -flto=thin \
        -mprefer-vector-width=512 \
        source.cpp -o program

# Step 2: Run representative workloads
# In SLURM batch job:
./program < typical_input_1
./program < typical_input_2
./program < typical_input_3

# Step 3: Merge profiles (if multiple runs)
llvm-profdata merge -o final.profdata default.profraw

# Step 4: Optimised build with profile
clang++ -fprofile-use=final.profdata -march=sapphirerapids -O3 \
        -flto=thin -mprefer-vector-width=512 \
        source.cpp -o program_optimized

Blended profiles for mixed workloads

For diverse customer workloads, create weighted blended profiles:

# Collect profiles from multiple workloads
llvm-profdata merge -o workload_A.profdata default.profraw_A
llvm-profdata merge -o workload_B.profdata default.profraw_B
llvm-profdata merge -o workload_C.profdata default.profraw_C

# Merge with weights based on importance/frequency
llvm-profdata merge \
    -weighted-input=3,workload_A.profdata \
    -weighted-input=2,workload_B.profdata \
    -weighted-input=1,workload_C.profdata \
    -o final_blended.profdata

Memory optimisations

Cache-aware compilation

# Improved cache utilisation through section elimination
-fdata-sections -ffunction-sections

# Linker garbage collection (use with above flags)
-Wl,--gc-sections

Structure and data layout

  • Pack hot data structures to fit within 32KB L1 cache
  • Consider __restrict__ for pointer aliasing hints (Sapphire Rapids has strong memory disambiguation, but explicit hints can still help the compiler)
  • Align data structures to cache line boundaries (64 bytes)
  • Use structure-of-arrays (SoA) layout for vectorised code when beneficial
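
For illustration, a minimal sketch of these layout hints (cache-line alignment, __restrict__ and an SoA container); the type and function names are hypothetical:

#include <cstddef>
#include <vector>

// Hot data aligned to a 64-byte cache line.
struct alignas(64) Accumulator {
    double sum;
    long   count;
};

// Structure-of-arrays layout: each field is contiguous, which vectorises well.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// __restrict__ promises the compiler that a, b and out do not alias.
void axpy(const float* __restrict__ a, const float* __restrict__ b,
          float* __restrict__ out, std::size_t n, float alpha) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = alpha * a[i] + b[i];
}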

Huge pages

# Enable transparent huge pages in madvise mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# In code, use madvise for large allocations
madvise(large_buffer, size, MADV_HUGEPAGE);
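
A complete, self-contained sketch of the madvise call above (assumes Linux with transparent huge pages in madvise mode; the 2 MiB alignment matches the x86-64 huge page size):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t size = 1ULL << 30;            // 1 GiB working buffer
    const std::size_t huge_page = 2 * 1024 * 1024;  // 2 MiB huge page size on x86-64

    // Align the allocation to the huge page size so madvise can back it fully.
    void* large_buffer = std::aligned_alloc(huge_page, size);
    if (!large_buffer) return 1;

    // Request transparent huge pages for this range (advisory only).
    if (madvise(large_buffer, size, MADV_HUGEPAGE) != 0)
        std::perror("madvise(MADV_HUGEPAGE)");

    // ... use large_buffer ...

    std::free(large_buffer);
    return 0;
}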

Memory allocators

Consider replacing default allocator with:

  • jemalloc: Improved performance for concurrent workloads
  • tcmalloc: Fast thread-caching allocator with good multi-threaded performance
  • mimalloc: Low overhead, suitable for mixed workloads

Mixed workload strategy

For systems serving diverse customer workloads, use a balanced approach:

Conservative optimisation flags

  • -O3 for compute-bound code: Provides significant benefits on Sapphire Rapids
  • -O2 for memory-bound code: Avoids code bloat that can hurt cache performance
  • -flto=thin: Faster, more predictable performance across workload variations
  • PGO with blended profiles: Weighted combination of representative workloads

Function multi-versioning

For hot paths, use target clones:

__attribute__((target_clones("default","avx2","avx512f")))
void process_data(/* params */) {
    // Hot path that benefits from different optimizations
    // Runtime dispatcher selects the best version
}

Split optimisation by code characteristics

# Hot paths (identified via profiling)
set_source_files_properties(hot_path.cpp PROPERTIES
    COMPILE_FLAGS "-O3 -march=sapphirerapids -fprofile-use=hot.profdata -mprefer-vector-width=512")

# Cold paths
set_source_files_properties(general_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v4")

# Core libraries
set_source_files_properties(core_lib.cpp PROPERTIES
    COMPILE_FLAGS "-O3 -march=sapphirerapids -flto=full -mprefer-vector-width=512")

# Customer-facing code
set_source_files_properties(api_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v4 -flto=thin")

Practical build configuration

Complete example: core library build (in SLURM batch job)

# In SLURM batch job:
clang++ -O3 \
        -march=sapphirerapids \
        -mprefer-vector-width=512 \
        -mfma \
        -falign-loops=64 \
        -falign-functions=64 \
        -fdata-sections \
        -ffunction-sections \
        -fno-semantic-interposition \
        -fno-plt \
        -flto=thin \
        -fprofile-use=blended.profdata \
        -mllvm -enable-loopinterchange \
        -mllvm -prefetch-distance=256 \
        source.cpp -o program \
        -fuse-ld=lld \
        -Wl,--gc-sections \
        -Wl,--icf=safe

CMake configuration

set(CMAKE_C_COMPILER clang)
set(CMAKE_CXX_COMPILER clang++)

# C++ standard (C++17, C++20, or C++23)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Base flags
set(CMAKE_C_FLAGS_RELEASE "-O3 -march=sapphirerapids -mprefer-vector-width=512")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")

# LTO
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -flto=thin")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -flto=thin")

# PGO (if profile available)
if(EXISTS "${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
endif()

# Linker
set(CMAKE_EXE_LINKER_FLAGS_RELEASE "-fuse-ld=lld -Wl,--gc-sections -Wl,--icf=safe")
set(CMAKE_SHARED_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS_RELEASE}")

Runtime considerations

CPU frequency scaling

Sapphire Rapids systems typically use intel_pstate driver for CPU frequency scaling:

# Check current CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set to performance mode (if root or via SLURM)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Or use cpupower (if available)
cpupower frequency-set -g performance

NUMA awareness

Sapphire Rapids systems with multiple sockets have multiple NUMA domains:

  1. Identify NUMA topology: Use numactl --hardware to see NUMA node layout
  2. Bind memory allocation: Use numactl --membind=N to allocate memory from specific NUMA node
  3. Bind CPU affinity: Use numactl --cpunodebind=N to bind to specific NUMA node
  4. Monitor NUMA statistics: Use numastat and perf stat -e numa-misses
  5. Dual-socket systems: 2 NUMA domains, one per socket. Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains
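
Memory can also be bound from inside the application with libnuma (link with -lnuma); a minimal sketch, assuming libnuma is available on the compute nodes:

#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    const std::size_t size = 1ULL << 30;  // 1 GiB

    // Allocate memory on NUMA node 0; keep the accessing threads on the same
    // socket (e.g. via OMP_PLACES/OMP_PROC_BIND or numactl) for local access.
    double* buf = static_cast<double*>(numa_alloc_onnode(size, 0));
    if (!buf) return 1;

    // ... compute on buf ...

    numa_free(buf, size);
    return 0;
}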

Example: NUMA-optimized execution

Direct execution (using numactl):

# Check NUMA topology
numactl --hardware

# Single NUMA domain binding (for single-process)
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Multiple NUMA domains: one process per domain
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./process1 &
srun --cpu-bind=sockets:1-1 numactl --membind=1 --cpunodebind=1 ./process2 &

# Monitor NUMA performance
# In SLURM batch job:
perf stat -e numa-misses,numa-migrations ./program
numastat  # Show NUMA allocation statistics

SLURM execution:

# Single NUMA domain (for single-process)
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=56
#SBATCH --cpus-per-task=112
srun --cpu-bind=sockets:0-0 ./program

# Multiple NUMA domains (one task per domain)
#SBATCH --ntasks=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=56
srun --cpu-bind=sockets ./program

# Explicit NUMA domain binding with numactl
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Check SLURM CPU binding
srun --cpu-bind=sockets:0-0 numactl --hardware

Environment variables

Make optimisation thresholds runtime-configurable:

# Example: Tunable buffer sizes
export BUFFER_SIZE=1048576
export PARALLELISM_THRESHOLD=1000
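
A minimal sketch of consuming these tunables at start-up (the variable names match the example above; the defaults are arbitrary):

#include <cstddef>
#include <cstdlib>
#include <string>

// Read an environment variable, falling back to a default when unset.
static std::size_t env_or_default(const char* name, std::size_t fallback) {
    const char* value = std::getenv(name);
    return value ? std::stoull(value) : fallback;
}

int main() {
    const std::size_t buffer_size = env_or_default("BUFFER_SIZE", 1 << 20);
    const std::size_t parallelism_threshold =
        env_or_default("PARALLELISM_THRESHOLD", 1000);

    // ... size buffers / switch between serial and parallel paths accordingly ...
    (void)buffer_size;
    (void)parallelism_threshold;
    return 0;
}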

Monitoring and feedback

  • Instrumented production builds: Use lightweight sampling (-fprofile-sample-use) to collect real customer profiles
  • Performance telemetry: Track which code paths are actually hot in production
  • A/B testing: Deploy different optimisation configurations to subsets of traffic

SLURM configuration

Systems with Intel Sapphire Rapids processors typically use SLURM for job scheduling with specific NUMA domain configuration.

Node configuration

From system configuration:

  • 2 NUMA domains: One per socket
  • 56 cores per NUMA domain: Each socket contains 56 cores
  • 2 threads per core: SMT (Simultaneous Multi-Threading) enabled
  • Total capacity: 2 × 56 × 2 = 224 threads per node
  • Memory per NUMA domain: Varies by system configuration

SLURM directives

All jobs must use appropriate SLURM directives:

#SBATCH --partition=commong      # Default partition (not "cn")
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --sockets-per-node=1    # For single NUMA domain
#SBATCH --cores-per-socket=56   # All cores in one socket
#SBATCH --cpus-per-task=112     # All threads in one socket (with SMT)

Note

The default partition for SLURM is commong, not cn.

QoS Requirements for Intensive CPU Jobs:

Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs for Sapphire Rapids optimisation benchmarks.

OpenMP configuration

For OpenMP workloads:

#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=56
#SBATCH --cpus-per-task=112

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun --cpu-bind=sockets:0-0 ./openmp_program
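
To verify that the placement settings took effect, a small check program can report where each OpenMP thread actually runs (sched_getcpu() is Linux/glibc-specific); compile it with -fopenmp:

#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Report the logical CPU each OpenMP thread is pinned to.
        #pragma omp critical
        std::printf("thread %3d of %3d on CPU %3d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}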

SLURM configuration recommendations

  1. Single-process applications: Use --sockets-per-node=1 to bind to one NUMA domain
  2. Multi-process applications: Use --ntasks=N with --sockets-per-node=N (one task per NUMA domain)
  3. Memory allocation: Request memory proportional to NUMA domains used
  4. CPU binding: Always use --cpu-bind=sockets to ensure proper NUMA binding
  5. Monitor binding: Check with srun --cpu-bind=sockets numactl --hardware
  6. Thread placement: For OpenMP, use OMP_PLACES=cores and OMP_PROC_BIND=close

Example code demonstrating optimisation benefits

The following examples demonstrate how different optimisations benefit Sapphire Rapids performance. The example source code and SLURM scripts are located in the sapphirerapids/ directory at /opt/software/sapphirerapids/.

Important

The full path to the sapphirerapids/ folder is /opt/software/sapphirerapids/. You can copy this folder to your project directory or work directly from the system location.

The test code is also available online at: https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids

To reproduce benchmark results, you can either work from the system location or copy the folder to your project directory:

# Option 1: Work from the system location
cd /opt/software/sapphirerapids

# Option 2: Copy to your project directory
mkdir -p /path/to/your/project
cp -r /opt/software/sapphirerapids /path/to/your/project/
cd /path/to/your/project/sapphirerapids

# Submit the SLURM batch job from within the folder
sbatch slurm_all_benchmarks.sh

This compiles and executes all examples within the SLURM job. Results are written to output files in the same directory.

Note

The SLURM scripts use SLURM_SUBMIT_DIR to locate the source files, so they must be submitted from within the sapphirerapids/ directory where the source files reside.

Example 1: Vectorisation with AVX-512

This example shows how -march=sapphirerapids enables AVX-512 vectorisation:

// vectorized_compute.cpp
#include <immintrin.h>
#include <chrono>
#include <iostream>

// Unoptimized version (scalar)
void compute_scalar(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

// Optimized version (vectorized with AVX-512)
void compute_vectorized(float* __restrict__ a, float* __restrict__ b,
                        float* __restrict__ c, size_t n) {
    size_t i = 0;
    // Process 16 floats at a time (512-bit AVX-512)
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_load_ps(&a[i]);
        __m512 vb = _mm512_load_ps(&b[i]);
        __m512 vc = _mm512_fmadd_ps(va, vb, va); // FMA: a*b + a
        _mm512_store_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

int main() {
    const size_t n = 100000000;
    float* a = (float*)_mm_malloc(n * sizeof(float), 64);
    float* b = (float*)_mm_malloc(n * sizeof(float), 64);
    float* c = (float*)_mm_malloc(n * sizeof(float), 64);

    // Initialize
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Benchmark scalar
    auto start = std::chrono::high_resolution_clock::now();
    compute_scalar(a, b, c, n);
    auto end = std::chrono::high_resolution_clock::now();
    auto scalar_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // Benchmark vectorized
    start = std::chrono::high_resolution_clock::now();
    compute_vectorized(a, b, c, n);
    end = std::chrono::high_resolution_clock::now();
    auto vectorised_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "Scalar time: " << scalar_time << " us\n";
    std::cout << "Vectorized time: " << vectorised_time << " us\n";
    std::cout << "Speedup: " << (double)scalar_time / vectorised_time << "x\n";

    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
        -mfma vectorized_compute.cpp -o vectorized_compute

Example 2: Cache-aware data layout

This example demonstrates the importance of data layout for cache performance:

// cache_layout_example.cpp
#include <chrono>
#include <iostream>
#include <vector>

// Array of Structures (AoS) - poor cache locality
struct Point {
    float x, y, z;
    int id;
};

// Return the sum so the compiler cannot optimise the loop away at -O3
float process_aos(Point* points, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += points[i].x * points[i].y;
    }
    return sum;
}

// Structure of Arrays (SoA) - improved cache locality
struct Points {
    std::vector<float> x, y, z;
    std::vector<int> id;
};

float process_soa(Points& points, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += points.x[i] * points.y[i];
    }
    return sum;
}

int main() {
    const size_t n = 10000000;

    // AoS version
    Point* aos_points = new Point[n];
    for (size_t i = 0; i < n; ++i) {
        aos_points[i].x = 1.0f;
        aos_points[i].y = 2.0f;
    }

    auto start = std::chrono::high_resolution_clock::now();
    float aos_sum = process_aos(aos_points, n);
    auto end = std::chrono::high_resolution_clock::now();
    auto aos_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // SoA version
    Points soa_points;
    soa_points.x.resize(n);
    soa_points.y.resize(n);
    for (size_t i = 0; i < n; ++i) {
        soa_points.x[i] = 1.0f;
        soa_points.y[i] = 2.0f;
    }

    start = std::chrono::high_resolution_clock::now();
    float soa_sum = process_soa(soa_points, n);
    end = std::chrono::high_resolution_clock::now();
    auto soa_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AoS time: " << aos_time << " us\n";
    std::cout << "SoA time: " << soa_time << " us\n";
    std::cout << "Speedup: " << (double)aos_time / soa_time << "x\n";

    delete[] aos_points;
    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
        cache_layout_example.cpp -o cache_layout_example

Example 3: Profile-guided optimisation benefit

This example demonstrates PGO workflow and benefits:

// pgo_example.cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

// Hot path function
void process_hot_path(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] > 1000) {  // Common branch
            data[i] = data[i] * 2 + 1;
        } else {  // Less common branch
            data[i] = data[i] / 2;
        }
    }
}

// Cold path function
void process_cold_path(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

int main(int argc, char* argv[]) {
    const size_t n = 10000000;
    std::vector<int> data(n);

    // Initialize with pattern that makes hot path common
    for (size_t i = 0; i < n; ++i) {
        data[i] = (i % 10 == 0) ? 500 : 2000;  // 90% go to hot path
    }

    // Simulate typical workload
    for (int iter = 0; iter < 100; ++iter) {
        process_hot_path(data);
        if (iter % 10 == 0) {
            process_cold_path(data);
        }
    }

    return 0;
}

PGO workflow:

# Step 1: Instrumentation build
clang++ -fprofile-generate -O3 -march=sapphirerapids \
        pgo_example.cpp -o pgo_example

# Step 2: Run representative workload
./pgo_example

# Step 3: Merge profile
llvm-profdata merge -o pgo.profdata default.profraw

# Step 4: Optimised build
clang++ -fprofile-use=pgo.profdata -O3 -march=sapphirerapids \
        pgo_example.cpp -o pgo_example_optimized

Example 4: Intel MKL with LLVM and Intel compilers

This example demonstrates how both LLVM/21 and Intel oneAPI compilers can use Intel Math Kernel Library (MKL) for optimized linear algebra operations.

Source file: sapphirerapids/mkl_benchmark.cpp
SLURM script: sapphirerapids/slurm_mkl_benchmark.sh
Location: sapphirerapids/ directory

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -I$MKLROOT/include \
        -L$MKLROOT/lib/intel64 \
        -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl \
        mkl_benchmark.cpp -o mkl_benchmark_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4
icpx -qmkl=sequential -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     mkl_benchmark.cpp -o mkl_benchmark_intel

Performance results:

  • Both compilers achieve similar MKL performance (within 0.5%)
  • For 2048x2048 DGEMM: LLVM/21 = 99.75 GFLOPS, Intel oneAPI = 99.88 GFLOPS
  • MKL library performance is independent of compiler choice
  • Differences come from user code optimisation, not MKL library calls

Compiler differences:

  • Intel oneAPI provides simpler MKL linking with -qmkl flag
  • LLVM/21 requires manual library linking but offers more control
  • Both compilers can achieve optimal MKL performance
  • Compiler choice primarily affects user code, not pre-compiled MKL routines

MKL with OpenMP threading

MKL can use OpenMP for internal threading, which is important for multi-threaded applications. The choice of threading library must match between your application and MKL to avoid conflicts.

Source file: sapphirerapids/mkl_openmp_example.cpp
SLURM script: sapphirerapids/slurm_mkl_openmp.sh

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -fopenmp \
        -I$MKLROOT/include \
        -L$MKLROOT/lib/intel64 \
        -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \
        mkl_openmp_example.cpp -o mkl_openmp_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4
icpx -qmkl=parallel -qopenmp -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     mkl_openmp_example.cpp -o mkl_openmp_intel

Threading library differences:

  • LLVM/21: Uses -lmkl_gnu_thread with -liomp5 (Intel OpenMP runtime) for improved scaling
  • Intel oneAPI: Uses -qmkl=parallel which automatically selects libmkl_intel_thread with Intel OpenMP
  • Both require matching OpenMP runtime libraries to avoid conflicts

Runtime configuration:

export OMP_NUM_THREADS=56
export MKL_NUM_THREADS=56
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./mkl_openmp_llvm 2048 56

Performance Considerations:

  • Set OMP_NUM_THREADS and MKL_NUM_THREADS to the same value
  • Use OMP_PLACES=cores and OMP_PROC_BIND=close for NUMA-aware placement
  • Intel OpenMP (libiomp5) typically provides higher scaling than GNU OpenMP for MKL
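
The thread counts can also be set programmatically rather than through environment variables; a minimal sketch, assuming the MKL headers are on the include path:

#include <mkl.h>
#include <omp.h>

// Keep the OpenMP and MKL thread counts consistent, e.g. one thread per core
// on a single 56-core socket.
void configure_threading(int threads_per_socket) {
    omp_set_num_threads(threads_per_socket);
    mkl_set_num_threads(threads_per_socket);
}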

MKL with MPI

MKL can be used with MPI for distributed-memory parallel applications. Intel MPI (provided with oneAPI) works with both LLVM and Intel compilers.

Source file: sapphirerapids/mkl_mpi_example.cpp
SLURM script: sapphirerapids/slurm_mkl_mpi.sh

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21 mpi/2021.14
# Intel MPI uses I_MPI_CXX environment variable to select compiler
export I_MPI_CXX=clang++
export CXXFLAGS="-O3 -march=sapphirerapids -DNDEBUG -std=c++17 -stdlib=libc++"
mpicxx ${CXXFLAGS} \
       -I$MKLROOT/include \
       -L$MKLROOT/lib/intel64 \
       -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \
       mkl_mpi_example.cpp -o mkl_mpi_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4 mpi/2021.14
# Force mpicxx to use icpx instead of default g++
# Intel MPI uses I_MPI_CXX environment variable to select compiler
export I_MPI_CXX=icpx
mpicxx -O3 -march=sapphirerapids -DNDEBUG -std=c++17 \
       -qmkl=parallel \
       mkl_mpi_example.cpp -o mkl_mpi_intel

MPI process grid configuration:

# Run with 2 MPI processes, 56 threads each
export MKL_NUM_THREADS=56
export OMP_NUM_THREADS=56
srun -n 2 --cpu-bind=sockets ./mkl_mpi_llvm 2048 56

MPI configuration notes:

  • Intel MPI (mpicxx) wrapper works with both compilers
  • For LLVM, set I_MPI_CXX=clang++ to override default g++ compiler
  • For Intel oneAPI, set I_MPI_CXX=icpx to override default g++ compiler
  • By default, mpicxx uses g++ unless compiler is explicitly specified
  • Intel MPI uses I_MPI_CXX environment variable (not CXX) to select the C++ compiler for both LLVM and Intel compilers
  • MKL threading (MKL_NUM_THREADS) should match OpenMP threads per process
  • Use --cpu-bind=sockets in SLURM to bind processes to NUMA domains

MKL BLACS for ScaLAPACK:

For distributed linear algebra (ScaLAPACK), MKL provides BLACS libraries:

  • libmkl_blacs_intelmpi_lp64 - Intel MPI
  • libmkl_blacs_openmpi_lp64 - OpenMPI

These are automatically selected when using -qmkl=cluster with Intel compilers, or manually linked with LLVM.

Example 5: Intel oneDNN with LLVM and Intel compilers

Intel oneDNN (Deep Neural Network Library) is a performance library for deep learning applications, providing optimized primitives for neural network operations. It supports both LLVM and Intel compilers.

Source file: sapphirerapids/onednn_benchmark.cpp
SLURM script: sapphirerapids/slurm_onednn_benchmark.sh

Compilation with LLVM/21:

module load llvm/21 dnnl/latest
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -I$DNNLROOT/include \
        -L$DNNLROOT/lib -ldnnl -Wl,-rpath,$DNNLROOT/lib \
        onednn_benchmark.cpp -o onednn_llvm

Compilation with Intel oneAPI:

module load compiler-intel-llvm/2025.0.4 dnnl/latest
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     -I$DNNLROOT/include \
     -L$DNNLROOT/lib -ldnnl \
     onednn_benchmark.cpp -o onednn_intel

Performance results:

  • LLVM/21: 4278 GFLOPS (average, 2048x2048 matrix multiplication)
  • Intel oneAPI: 4022 GFLOPS (average, 2048x2048 matrix multiplication)
  • LLVM/21 shows approximately 6.4% higher performance

Compiler performance with oneDNN:

  • Both compilers can successfully use oneDNN library
  • oneDNN library itself is pre-compiled, but user code compilation affects performance
  • LLVM/21 shows slightly higher performance for the benchmark code
  • oneDNN automatically detects and uses AMX instructions on Sapphire Rapids
  • Both compilers link against the same oneDNN library (version 3.6.1)

Runtime configuration:

# Disable verbose output (optional)
export DNNL_VERBOSE=0
export ONEDNN_VERBOSE=0

# Run benchmark
./onednn_llvm 2048
./onednn_intel 2048

Integration with ML frameworks:

  • TensorFlow: Enable with TF_ENABLE_ONEDNN_OPTS=1
  • PyTorch: Uses oneDNN automatically on Sapphire Rapids
  • Both frameworks benefit from oneDNN’s AMX optimisations

Example 6: AMX for ML/AI workloads

This example demonstrates how to use AMX (Advanced Matrix Extensions) for machine learning and AI workloads, showing all three AMX types: AMX-TILE, AMX-INT8, and AMX-BF16.

Source file: sapphirerapids/amx_ml_example.cpp
SLURM script: sapphirerapids/slurm_amx_example.sh
Location: sapphirerapids/ directory

Complete AMX example with runtime detection

// amx_ml_example.cpp
#include <immintrin.h>
#include <stdint.h>
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdlib>

// Runtime AMX detection
bool check_amx_support() {
    unsigned int eax, ebx, ecx, edx;

    // AMX feature bits are reported in CPUID leaf 0x7, subleaf 0x0, EDX
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);
    bool has_tile = (edx & (1u << 24)) != 0;
    bool has_int8 = (edx & (1u << 25)) != 0;
    bool has_bf16 = (edx & (1u << 22)) != 0;

    std::cout << "AMX-TILE: " << (has_tile ? "Yes" : "No") << "\n";
    std::cout << "AMX-INT8: " << (has_int8 ? "Yes" : "No") << "\n";
    std::cout << "AMX-BF16: " << (has_bf16 ? "Yes" : "No") << "\n";

    return has_tile && has_int8 && has_bf16;
}

// AMX-BF16 matrix multiplication for neural network layers
// C = A × B where A and B are bfloat16; C accumulates in FP32
// (assumes B is pre-packed into the VNNI pair-interleaved layout expected
// by _tile_dpbf16ps)
void amx_bf16_matmul(const __bf16* A, const __bf16* B, float* C,
                     int M, int N, int K) {
    // Configure AMX tiles (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;    // palette 1

    // Tile 0: A block (16 rows × 64 bytes = 16×32 bf16 elements)
    tilecfg[16] = 64;  // bytes per row
    tilecfg[48] = 16;  // rows

    // Tile 1: B block (VNNI-packed, 16 rows × 64 bytes)
    tilecfg[18] = 64;
    tilecfg[49] = 16;

    // Tile 2: C accumulator (16 rows × 16 FP32 elements = 64 bytes per row)
    tilecfg[20] = 64;
    tilecfg[50] = 16;

    _tile_loadconfig(tilecfg);

    // Blocked matrix multiplication
    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            // Zero accumulator tile
            _tile_zero(2);

            // Inner product accumulation
            for (int k = 0; k < K; k += 32) {
                // Load A[i:i+16, k:k+32] into tile 0
                _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16));

                // Load B[k:k+32, j:j+16] (transposed) into tile 1
                _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16));

                // Compute: tile2 += tile0 × tile1 (BF16)
                _tile_dpbf16ps(2, 0, 1);
            }

            // Store the FP32 result from tile 2 to C[i:i+16, j:j+16]
            _tile_stored(2, &C[i * N + j], N * sizeof(float));
        }
    }

    _tile_release();
}

// AMX-INT8 quantized matrix multiplication for inference
// C = A × B where A, B are int8 and C is an int32 accumulator
// (assumes B is pre-packed into the VNNI layout expected by _tile_dpbssd)
void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C,
                     int M, int N, int K) {
    // Configure AMX tiles for INT8 (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;    // palette 1

    // Tile 0: A block (16 rows × 64 int8 elements = 64 bytes per row)
    tilecfg[16] = 64;
    tilecfg[48] = 16;

    // Tile 1: B block (VNNI-packed, 16 rows × 64 bytes)
    tilecfg[18] = 64;
    tilecfg[49] = 16;

    // Tile 2: C accumulator (16 rows × 16 int32 elements = 64 bytes per row)
    tilecfg[20] = 64;
    tilecfg[50] = 16;

    _tile_loadconfig(tilecfg);

    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            _tile_zero(2);  // Zero accumulator

            for (int k = 0; k < K; k += 64) {
                // Load A[i:i+16, k:k+64]
                _tile_loadd(0, &A[i * K + k], K);

                // Load B[k:k+64, j:j+16] (transposed)
                _tile_loadd(1, &B[k * N + j], N);

                // Compute: tile2 += tile0 × tile1 (INT8)
                _tile_dpbssd(2, 0, 1);
            }

            // Store result (int32 accumulator)
            _tile_stored(2, &C[i * N + j], N * sizeof(int32_t));
        }
    }

    _tile_release();
}

// Reference implementation (for comparison)
void avx512_bf16_matmul(const __bf16* A, const __bf16* B, __bf16* C,
                        int M, int N, int K) {
    // Naive scalar triple loop kept simple for comparison; with
    // -O3 -march=sapphirerapids the compiler may auto-vectorise it to AVX-512
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                sum += (float)A[i * K + k] * (float)B[k * N + j];
            }
            C[i * N + j] = (__bf16)sum;
        }
    }
}

int main() {
    // Check AMX support
    std::cout << "Checking AMX support...\n";
    if (!check_amx_support()) {
        std::cerr << "AMX not supported on this system\n";
        return 1;
    }

    // Matrix dimensions (typical neural network layer)
    const int M = 1024;  // Batch size × sequence length
    const int N = 4096;  // Output features
    const int K = 2048;  // Input features

    // Allocate and initialize matrices
    __bf16* A_bf16 = (__bf16*)aligned_alloc(64, M * K * sizeof(__bf16));
    __bf16* B_bf16 = (__bf16*)aligned_alloc(64, K * N * sizeof(__bf16));
    float*  C_f32  = (float*)aligned_alloc(64, M * N * sizeof(float));    // AMX result (FP32 accumulator)
    __bf16* C_ref  = (__bf16*)aligned_alloc(64, M * N * sizeof(__bf16));  // reference result

    // Initialize with random values
    for (int i = 0; i < M * K; ++i) {
        A_bf16[i] = (__bf16)((float)rand() / RAND_MAX);
    }
    for (int i = 0; i < K * N; ++i) {
        B_bf16[i] = (__bf16)((float)rand() / RAND_MAX);
    }

    // Benchmark AMX-BF16
    std::cout << "\nBenchmarking AMX-BF16 matrix multiplication...\n";
    std::cout << "Matrix dimensions: " << M << " × " << K << " × " << N << "\n";

    const int iterations = 10;
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        amx_bf16_matmul(A_bf16, B_bf16, C_f32, M, N, K);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto amx_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // Benchmark AVX-512 reference (for comparison)
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        avx512_bf16_matmul(A_bf16, B_bf16, C_ref, M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    auto avx512_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AMX-BF16 time: " << amx_time / iterations << " us per iteration\n";
    std::cout << "AVX-512 time: " << avx512_time / iterations << " us per iteration\n";
    if (avx512_time > 0) {
        std::cout << "Speedup: " << (double)avx512_time / amx_time << "x\n";
    }

    // INT8 quantized example
    std::cout << "\nBenchmarking AMX-INT8 quantized matrix multiplication...\n";

    int8_t* A_int8 = (int8_t*)aligned_alloc(64, M * K);
    int8_t* B_int8 = (int8_t*)aligned_alloc(64, K * N);
    int32_t* C_int32 = (int32_t*)aligned_alloc(64, M * N * sizeof(int32_t));

    // Initialize INT8 matrices (quantized values)
    for (int i = 0; i < M * K; ++i) {
        A_int8[i] = (int8_t)(rand() % 256 - 128);
    }
    for (int i = 0; i < K * N; ++i) {
        B_int8[i] = (int8_t)(rand() % 256 - 128);
    }

    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        amx_int8_matmul(A_int8, B_int8, C_int32, M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    auto int8_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AMX-INT8 time: " << int8_time / iterations << " us per iteration\n";
    std::cout << "Throughput: " << (double)(M * N * K) / (int8_time / iterations) * 1e6 / 1e9
              << " GFLOPs\n";

    // Cleanup
    free(A_bf16);
    free(B_bf16);
    free(C_f32);
    free(C_ref);
    free(A_int8);
    free(B_int8);
    free(C_int32);

    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 \
        -mprefer-vector-width=512 \
        amx_ml_example.cpp -o amx_ml_example

Use cases for each AMX type

  1. AMX-BF16:
    • Neural network training with mixed precision
    • Inference with bfloat16 precision
    • Transformer models (attention mechanisms)
    • Large language model inference
  2. AMX-INT8:
    • Quantized neural network inference
    • Post-training quantization models
    • Edge AI inference
    • Maximum throughput inference workloads
  3. AMX-TILE:
    • Base infrastructure for both INT8 and BF16
    • Provides 8KB of tile register storage
    • Enables efficient matrix blocking strategies

Performance tips for AMX

  • Tile configuration: Set tile configuration once and reuse across multiple operations
  • Blocking strategy: Use 16×64 blocking for optimal tile utilisation
  • Memory alignment: Align all matrices to 64-byte boundaries
  • Threading: Use one thread per core; each thread has independent tile registers
  • NUMA awareness: For multi-socket systems, bind threads to local NUMA domain
  • Mixed precision: Use BF16 when precision allows; INT8 for maximum throughput

Integration with ML frameworks

Many ML frameworks automatically use AMX when available:

# TensorFlow with oneDNN AMX support
export TF_ENABLE_ONEDNN_OPTS=1
export ONEDNN_VERBOSE=1  # Enable verbose output to verify AMX usage

# PyTorch with oneDNN
export ONEDNN_VERBOSE=1

# Verify AMX is being used
# Look for "amx" in framework logs

Expected performance improvements:

  • AMX-BF16: 2-4x speedup over AVX-512 for large matrix multiplications
  • AMX-INT8: 4-8x speedup over AVX-512 for quantized inference
  • Suitable for: Large batch sizes, deep neural networks, transformer models

Example 7: Fortran compiler performance (flang, ifx, and gfortran)

Fortran code can be compiled with LLVM’s flang, Intel’s ifx, and GCC’s gfortran compilers. This example demonstrates Sapphire Rapids-specific optimisations available in Fortran.

Important

Compiler results are not directly comparable because:

  • flang (LLVM/21): Only generates AVX-256 (ymm) instructions, not AVX-512
  • ifx (Intel oneAPI): Generates AVX-512 (zmm) instructions
  • gfortran (GCC 15.1.0): Generates AVX-512 (zmm) instructions
  • These are separate benchmarks using different instruction sets

Source files:

  • sapphirerapids/fortran_avx512_example.f90 - AVX-512 vectorisation
  • sapphirerapids/fortran_openmp_example.f90 - OpenMP parallelization
  • sapphirerapids/fortran_mkl_example.f90 - Intel MKL integration

SLURM script: sapphirerapids/slurm_fortran_benchmarks.sh

Compilation with flang (LLVM/21):

module load llvm/21
flang -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
      fortran_avx512_example.f90 -o fortran_avx512_flang

Compilation with ifx (Intel oneAPI):

module load compiler-intel-llvm/2025.0.4
ifx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
    fortran_avx512_example.f90 -o fortran_avx512_ifx

Compilation with gfortran (GCC 15.1.0):

module load gcc/15.1.0
gfortran -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
         fortran_avx512_example.f90 -o fortran_avx512_gcc

Fortran features for Sapphire Rapids:

  1. AVX-512 vectorisation:
    • Use assumed-shape arrays (a(:)) for improved vectorisation hints
    • Compiler automatically vectorises simple loops with -march=sapphirerapids
    • flang (LLVM/21): Only generates AVX-256 (ymm) code, -mprefer-vector-width=512 flag is ignored
    • ifx (Intel oneAPI): Generates AVX-512 (zmm) code with -mprefer-vector-width=512, but may show performance overhead due to frequency scaling
  2. OpenMP parallelization:
    • Use !$omp parallel do directives for NUMA-aware parallelization
    • Set OMP_PLACES=cores and OMP_PROC_BIND=close for optimal placement
    • Both compilers support OpenMP 4.5+ features
  3. MKL integration:
    • Intel MKL can be called from Fortran using standard BLAS/LAPACK interfaces
    • Both compilers can link against MKL libraries
    • Use -qmkl=parallel with ifx or manual linking with flang

Important

Performance results (separate benchmarks - not comparable):

flang (LLVM/21) - AVX-256 only:

  • Generates AVX-256 (ymm) code only, does not generate AVX-512 (zmm) instructions
  • -mprefer-vector-width=512 flag is ignored with warning
  • Single precision (C_FLOAT): 2.93x speedup (vectorised vs scalar)
  • These results apply to AVX-256 vectorisation only

ifx (Intel oneAPI) - AVX-512:

  • Generates AVX-512 (zmm) code with -mprefer-vector-width=512
  • Single precision (C_FLOAT): 1.74x speedup (vectorised vs scalar)
  • AVX-512 code generation confirmed (zmm registers in assembly)
  • These results apply to AVX-512 vectorisation

ifx (Intel oneAPI) - AVX-256 (equivalent instruction set):

  • Can be forced to use AVX-256 with -mprefer-vector-width=256
  • Single precision (C_FLOAT): 1.49x speedup (vectorised vs scalar)
  • AVX-256 code generation confirmed (ymm registers in assembly and binary)
  • This configuration uses the same instruction set as flang (AVX-256), enabling direct comparison

gfortran (GCC 15.1.0) - AVX-512:

  • Generates AVX-512 (zmm) code with -march=sapphirerapids -mprefer-vector-width=512
  • Single precision (C_FLOAT): 14.08x speedup (vectorised vs scalar) - highest performance observed
  • AVX-512 code generation confirmed (zmm registers in assembly and binary)
  • AMX flags available: -mamx-tile, -mamx-int8, -mamx-bf16 (but AMX requires explicit intrinsics)
  • These results apply to AVX-512 vectorisation

Equivalent instruction set comparison (AVX-256):

  • flang (AVX-256): 2.93x speedup
  • ifx (AVX-256): 1.49x speedup
  • flang achieves 2.93x speedup compared to 1.49x for ifx when both use AVX-256
  • Both compilers use the same instruction set (AVX-256), enabling direct comparison

Note

Results using different instruction sets (AVX-256 vs AVX-512) should not be compared directly. For equivalent instruction set comparisons, use AVX-256 mode with -mprefer-vector-width=256.

  • OpenMP scaling: Both compilers demonstrate scaling with OpenMP
  • Code compatibility: The same source code compiles with both compilers
  • Compiler flags:
    • flang: -march=sapphirerapids (AVX-512 code generation not supported)
    • ifx: -march=sapphirerapids -mprefer-vector-width=512 (generates AVX-512)

Compiler-specific features:

  • flang (LLVM/21):
    • Uses -fopenmp flag for OpenMP
    • Manual MKL linking required
    • Compatible with modern Fortran standards
  • ifx (Intel oneAPI):
    • Uses -qopenmp flag for OpenMP
    • Simplified MKL linking with -qmkl=parallel
    • Integrated with Intel tools (VTune, Advisor)
  • gfortran (GCC 15.1.0):
    • Uses -fopenmp flag for OpenMP
    • Manual MKL linking required
    • Capability: Generates AVX-512 code with -march=sapphirerapids -mprefer-vector-width=512
    • AMX support: AMX flags available (-mamx-tile, -mamx-int8, -mamx-bf16) but AMX requires explicit intrinsics
    • Performance: 14.08x speedup with AVX-512 (single precision C_FLOAT) - highest performance observed

Important

Compiler performance results are not directly comparable because they use fundamentally different instruction sets (AVX-256 vs AVX-512). Each compiler’s results should be evaluated independently.

Example 8: C++ compiler comparison (clang++, g++, icpx)

This section compares C++ compilers (clang++, g++, and icpx) for AVX-512, AVX-256, and AMX support on Sapphire Rapids.

Compilers tested:

  • clang++ (LLVM/21): module load llvm/21
  • g++ (GCC 15.1.0): module load gcc/15.1.0
  • icpx (Intel oneAPI 2025.0.4): module load compiler-intel-llvm/2025.0.4

Test code: sapphirerapids/vectorized_compute.cpp

AVX-512 support

All three compilers support AVX-512 code generation:

Compilation flags for AVX-512:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
        vectorised_compute.cpp -o vectorised_compute_clang

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
    vectorised_compute.cpp -o vectorised_compute_gcc

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
     vectorised_compute.cpp -o vectorised_compute_icpx

AVX-512 performance results (10M elements, single precision):

Compiler              Speedup   Remarks
clang++ (LLVM/21)     1.25x     AVX-512 (zmm) confirmed in assembly
g++ (GCC 15.1.0)      0.96x     AVX-512 (zmm) confirmed, but slower than scalar
icpx (Intel oneAPI)   1.06x     AVX-512 (zmm) confirmed

Note

AVX-512 results show limited speedup due to CPU frequency scaling on Sapphire Rapids. The vectorised code uses zmm registers but may experience downclocking.

AVX-256 support (equivalent instruction set)

For equivalent instruction set comparison, all compilers can be forced to use AVX-256:

Compilation flags for AVX-256:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \
        vectorized_compute.cpp -o vectorized_compute_clang_avx256

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \
    vectorized_compute.cpp -o vectorized_compute_gcc_avx256

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mprefer-vector-width=256 -qopenmp \
     vectorized_compute.cpp -o vectorized_compute_icpx_avx256

AVX-256 performance results (10M elements, single precision):

Compiler              Speedup   Remarks
clang++ (LLVM/21)     1.28x     AVX-256 (ymm) confirmed
g++ (GCC 15.1.0)      1.02x     AVX-256 (ymm) confirmed
icpx (Intel oneAPI)   1.05x     AVX-256 (ymm) confirmed

clang++ achieves 1.28x speedup with AVX-256, followed by icpx (1.05x) and g++ (1.02x). All compilers use the same instruction set (AVX-256), enabling direct comparison.

AMX support

All three compilers support AMX flags, but AMX requires explicit intrinsics:

AMX compilation flags:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
        amx_ml_example.cpp -o amx_ml_example_clang

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
    amx_ml_example.cpp -o amx_ml_example_gcc

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \
     amx_ml_example.cpp -o amx_ml_example_icpx

AMX Support Status:

Compiler              AMX Flags   AMX Intrinsics   Remarks
clang++ (LLVM/21)     Supported   Works            AMX intrinsics compile successfully
g++ (GCC 15.1.0)      Supported   Partial          vmovw (BF16) instruction not supported by assembler
icpx (Intel oneAPI)   Supported   Works            AMX intrinsics compile successfully

  • AMX flags enable AMX instruction support, but AMX is not auto-vectorised
  • AMX must be used via explicit intrinsics (e.g., _tile_loadd, _tile_dpbssd)
  • g++ has an assembler limitation with BF16 instructions (vmovw)
  • For AMX usage, clang++ and icpx are recommended

AVX-512:

  • All compilers generate AVX-512 code (zmm registers)
  • Performance limited by CPU frequency scaling
  • clang++ achieves 1.25x speedup

AVX-256 (Equivalent Instruction Set):

  • All compilers can be forced to use AVX-256
  • clang++ achieves 1.28x speedup
  • This configuration provides equivalent instruction sets for direct comparison

AMX:

  • All compilers support AMX flags
  • AMX requires explicit intrinsics (not auto-vectorised)
  • clang++ and icpx recommended for AMX code
  • g++ has BF16 instruction limitations

Recommendations:

  • For AVX-512: clang++ achieves 1.25x speedup
  • For AVX-256: clang++ achieves 1.28x speedup
  • For AMX: Use clang++ or icpx (g++ has limitations)
  • For equivalent instruction set comparisons: Use AVX-256 mode (-mprefer-vector-width=256)

Example 9: OpenMP library comparison

This section compares OpenMP libraries from different compiler suites and their support for AVX-512 and AMX SIMD.

OpenMP Libraries Tested:

  • libomp (LLVM/21): OpenMP 5.0, used with clang++ -fopenmp
  • libiomp5 (Intel oneAPI): Intel OpenMP, used with icpx -qopenmp
  • libgomp (GCC 15.1.0): GNU OpenMP 4.5, used with g++ -fopenmp

Test code: sapphirerapids/openmp_simd_test.cpp
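
The test source is not reproduced here; a hypothetical kernel of the kind this test measures (illustrative names, no timing code) could look like:

#include <cstddef>
#include <vector>

// SIMD-only variant: the pragma asks the compiler to vectorise the loop;
// with -mprefer-vector-width=512 the body maps to zmm instructions.
void saxpy_simd(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = 2.0f * a[i] + b[i];
}

// Parallel+SIMD variant: threads split the iteration space and each
// thread's chunk is vectorised.
void saxpy_parallel_simd(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = 2.0f * a[i] + b[i];
}

int main()
{
    const std::size_t n = 100'000'000;   // 100M elements, as in the test configuration
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    saxpy_simd(a.data(), b.data(), c.data(), n);
    saxpy_parallel_simd(a.data(), b.data(), c.data(), n);
    return 0;
}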

OpenMP libraries and versions

Compiler              OpenMP Library   OpenMP Version   Library Path
clang++ (LLVM/21)     libomp.so        5.0 (202011)     /opt/software/llvm/21/21.1.0/lib/x86_64-unknown-linux-gnu/libomp.so
icpx (Intel oneAPI)   libiomp5.so      5.0 (202011)     /opt/intel/oneapi/compiler/2025.0/lib/libiomp5.so
g++ (GCC 15.1.0)      libgomp.so.1     4.5 (201511)     /opt/software/gnu/gcc-15/gcc-15.1.0/lib64/libgomp.so.1

AVX-512 support in OpenMP SIMD

All three OpenMP libraries support AVX-512 SIMD vectorisation:

Compilation:

# clang++ with libomp
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
        openmp_simd_test.cpp -o openmp_simd_test_clang

# icpx with libiomp5
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
     openmp_simd_test.cpp -o openmp_simd_test_icpx

# g++ with libgomp
g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
    openmp_simd_test.cpp -o openmp_simd_test_gcc

AVX-512 Confirmation:

  • All three compilers generate AVX-512 (zmm) instructions in OpenMP SIMD loops
  • Assembly analysis confirms: vmovaps %zmm, vfmadd231ps %zmm, etc.
  • OpenMP SIMD pragmas successfully vectorise to AVX-512

Performance comparison

Test Configuration:

  • Array size: 100M elements (single precision)
  • Iterations: 10
  • Threads: 56 (one socket)

Results (Parallel+SIMD, 56 threads):

OpenMP Library     SIMD Speedup   Parallel+SIMD Speedup   Remarks
libomp (LLVM/21)   0.96x          22.04x                  Highest parallel performance (22.04x)
libiomp5 (Intel)   1.12x          6.26x                   Higher SIMD performance (1.12x), lower parallel scaling (6.26x)
libgomp (GCC)      1.29x          17.34x                  Highest SIMD-only performance (1.29x)

Key observations:

  • libomp (LLVM/21): Highest parallel+SIMD performance (22.04x), strong thread scaling
  • libgomp (GCC): Highest SIMD-only performance (1.29x), parallel scaling of 17.34x
  • libiomp5 (Intel): Moderate performance, lower parallel scaling than libomp

Thread scaling (Parallel+SIMD):

Threads   libomp (LLVM)   libiomp5 (Intel)   libgomp (GCC)
1         1.12x           1.11x              1.23x
28        9.36x           9.61x              9.33x
56        22.04x          6.26x              17.34x

libomp achieves the highest scaling to 56 threads, while libiomp5 shows reduced scaling at high thread counts.

AMX support in OpenMP

Important

OpenMP SIMD does not auto-vectorise to AMX instructions.

AMX Usage with OpenMP:

  • AMX requires explicit intrinsics (_tile_loadd, _tile_dpbssd, _tile_stored, etc.)
  • AMX intrinsics can be used within OpenMP parallel regions
  • OpenMP does not generate AMX code automatically from SIMD pragmas
  • AMX must be manually integrated into OpenMP parallel code

Example:

#pragma omp parallel
{
    // AMX intrinsics can be used here. Each hardware thread has its own
    // tile state, so a tile configuration (_tile_loadconfig) must be loaded
    // in every thread before the tile instructions are issued. The third
    // argument of _tile_loadd/_tile_stored is a stride in bytes; this
    // assumes int8 operands in A and B and an int32 accumulator in C.
    _tile_loadd(0, A, K * sizeof(int8_t));    // load tile 0 from A
    _tile_loadd(1, B, K * sizeof(int8_t));    // load tile 1 from B (packed layout)
    _tile_dpbssd(2, 0, 1);                    // tile 2 += tile 0 * tile 1 (int8 dot products)
    _tile_stored(2, C, N * sizeof(int32_t));  // store the int32 accumulator tile to C
}

Compilation with AMX:

# All compilers support AMX flags
clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
        amx_code.cpp -o amx_code_clang

icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \
     amx_code.cpp -o amx_code_icpx

g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
    amx_code.cpp -o amx_code_gcc

AVX-512 support:

  • All three OpenMP libraries support AVX-512 SIMD vectorisation
  • OpenMP SIMD pragmas generate AVX-512 (zmm) instructions
  • Highest performance: libomp (22.04x parallel+SIMD speedup)

AMX support:

  • OpenMP SIMD does not auto-vectorise to AMX
  • AMX requires explicit intrinsics
  • AMX intrinsics can be used in OpenMP parallel regions
  • All compilers support AMX flags, but AMX must be manually integrated

Recommendations:

  • For highest parallel+SIMD performance: Use libomp (LLVM/21) with clang++
  • For highest SIMD-only performance: Use libgomp (GCC 15.1.0) with g++
  • For AMX: Use explicit intrinsics within OpenMP parallel regions
  • For AVX-512 SIMD: All three libraries work, choose based on parallel scaling needs

Example 10: C++ threads performance comparison

This section compares native C++ std::thread performance across different compilers for matrix multiplication.

Compilers tested:

  • clang++ (LLVM/21): module load llvm/21
  • icpx (Intel oneAPI 2025.0.4): module load compiler-intel-llvm/2025.0.4
  • g++ (GCC 15.1.0): module load gcc/15.1.0

Test code: sapphirerapids/cpp_threads_matmul.cpp
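
The benchmark source is not reproduced here. The following hedged sketch only illustrates the general approach it takes (a 2048x2048 single-precision matrix multiplication partitioned by rows across up to 56 native C++ threads); names are illustrative and timing/verification code is omitted:

#include <cstddef>
#include <thread>
#include <vector>

// Each thread computes a contiguous block of rows of C = A * B.
// The inner loop over j is written so that the compiler can auto-vectorise
// it with AVX-512 when -mprefer-vector-width=512 is used.
void matmul_rows(const float* A, const float* B, float* C,
                 std::size_t N, std::size_t row_begin, std::size_t row_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}

int main()
{
    const std::size_t N = 2048;          // matrix size used in the test
    const std::size_t num_threads = 56;  // maximum thread count used
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    std::vector<std::thread> workers;
    const std::size_t rows_per_thread = N / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * rows_per_thread;
        const std::size_t end   = (t + 1 == num_threads) ? N : begin + rows_per_thread;
        workers.emplace_back(matmul_rows, A.data(), B.data(), C.data(), N, begin, end);
    }
    for (auto& w : workers) w.join();
    return 0;
}

Compile the sketch with the flags shown in the Compilation subsection below (note the -pthread flag).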

Requirements:

  • C++17 standard (-std=c++17)
  • Native C++ threads (std::thread), not OpenMP
  • Maximum 56 threads
  • Matrix size: 2048x2048

Compilation

Compilation flags:

# clang++ (LLVM/21)
clang++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
        cpp_threads_matmul.cpp -o cpp_threads_matmul_clang

# icpx (Intel oneAPI)
icpx -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
     cpp_threads_matmul.cpp -o cpp_threads_matmul_icpx

# g++ (GCC 15.1.0)
g++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
    cpp_threads_matmul.cpp -o cpp_threads_matmul_gcc

Note

The -pthread flag is required for C++ threads support.

AVX-512 vectorization

All three compilers generate AVX-512 code in the matrix multiplication kernel:

  • Assembly analysis confirms: vmovaps %zmm, vfmadd213ps %zmm, vfmadd231ps %zmm
  • The inner loop is auto-vectorised to use AVX-512 instructions
  • Each thread benefits from AVX-512 vectorisation

Performance results

Test Configuration:

  • Matrix size: 2048x2048 (single precision)
  • Iterations: 5
  • Thread counts: 1, 14, 28, 56

Performance Comparison (GFLOPS):

Threads   clang++ (LLVM/21)   icpx (Intel oneAPI)   g++ (GCC 15.1.0)
1         17.57 GFLOPS        17.34 GFLOPS          17.56 GFLOPS
14        206.99 GFLOPS       208.49 GFLOPS         204.04 GFLOPS
28        381.77 GFLOPS       319.33 GFLOPS         381.77 GFLOPS
56        423.15 GFLOPS       287.29 GFLOPS         401.40 GFLOPS

Speedup Comparison (relative to 1 thread):

Threads   clang++ (LLVM/21)   icpx (Intel oneAPI)   g++ (GCC 15.1.0)
1         1.00x               1.00x                 1.00x
14        11.78x              12.02x                11.62x
28        21.73x              18.42x                21.74x
56        24.09x              16.57x                22.86x

Single-threaded performance:

  • All compilers show similar single-threaded performance (~17.5 GFLOPS)
  • Differences are within measurement variance

Multi-threaded scaling:

  • clang++ (LLVM/21): Highest scaling to 56 threads (24.09x speedup, 423.15 GFLOPS)
  • g++ (GCC 15.1.0): Scaling of 22.86x speedup (401.40 GFLOPS)
  • icpx (Intel oneAPI): Shows performance degradation at 56 threads (16.57x speedup, 287.29 GFLOPS)

Key findings:

  1. clang++ achieves 423.15 GFLOPS at 56 threads
  2. g++ achieves 401.40 GFLOPS at 56 threads, lower than clang++
  3. icpx shows reduced performance at 56 threads, possibly due to thread contention or NUMA issues
  4. All compilers demonstrate strong scaling up to 28 threads
  5. AVX-512 vectorisation is used by all compilers in the inner loop

Threading Model:

  • Uses C++17 std::thread (native C++ threads)
  • Each thread processes a portion of rows
  • No OpenMP overhead - pure C++ threading
  • Thread creation and synchronization handled by C++ standard library

Comparison with OpenMP

C++ Threads vs OpenMP (56 threads, clang++):

  • C++ Threads: 423.15 GFLOPS (24.09x speedup)
  • OpenMP Parallel+SIMD: Similar performance range
  • C++ threads provide more control but require manual thread management
  • OpenMP provides easier parallelization with pragmas

Performance results:

  • clang++ (LLVM/21): 423.15 GFLOPS at 56 threads (24.09x speedup) - highest performance
  • g++ (GCC 15.1.0): 401.40 GFLOPS at 56 threads (22.86x speedup)
  • icpx (Intel oneAPI): 287.29 GFLOPS at 56 threads (16.57x speedup) - shows degradation

Recommendations:

  • For highest C++ threads performance: clang++ (LLVM/21) achieves 423.15 GFLOPS
  • For performance with simpler code: g++ (GCC 15.1.0) achieves 401.40 GFLOPS
  • For 28 threads or fewer: All compilers demonstrate acceptable performance
  • For 56 threads: clang++ or g++ recommended (icpx shows degradation)

AVX-512 support:

  • All compilers generate AVX-512 code in the matrix multiplication kernel
  • Each thread benefits from AVX-512 vectorisation
  • Performance scales well with thread count when using AVX-512

Runtime optimiser considerations

Intel processor optimisation features

Unlike AMD Zen2, where a runtime optimiser is suspected, Intel Sapphire Rapids has no documented embedded runtime optimiser that performs instruction-level optimisations at runtime. However, Intel processors include several hardware-level optimisation features:

  1. Out-of-Order Execution: The processor can reorder instructions at runtime to maximise instruction-level parallelism, but this is a standard feature of modern processors, not a specialised runtime optimiser.
  2. Hardware Prefetching: Aggressive hardware prefetchers that predict and prefetch data into cache, reducing memory latency.
  3. Branch Prediction: Sophisticated branch prediction units that minimize branch misprediction penalties.
  4. Turbo Boost: Dynamic frequency scaling based on workload and thermal headroom.
  5. Hyper-Threading (SMT): Simultaneous Multi-Threading that enables improved utilisation of execution units.

Implications for optimisation strategy

The lack of a runtime optimizer means:

  1. Compile-time optimizations are critical: Unlike Zen2, where -O2 and -O3 show minimal differences due to suspected runtime optimization, Sapphire Rapids benefits significantly from aggressive compile-time optimizations. -O3 typically provides 5-15% improvement over -O2 for compute-bound workloads.
  2. Explicit optimisation hints are valuable: Hints like __restrict__, explicit vectorisation, and loop unrolling provide measurable benefits because the compiler is the primary optimisation mechanism (see the sketch after this list).
  3. Profile-Guided Optimization is essential: PGO provides significant benefits (10-30%) because it guides compile-time optimizations based on actual runtime behaviour.
  4. Architecture-specific flags matter more: Flags like -march=sapphirerapids and -mprefer-vector-width=512 are critical for enabling hardware features that the compiler can utilise.
  5. Code layout optimizations: BOLT and other post-link optimizations are valuable because they optimize code layout based on runtime profiles, compensating for the lack of runtime optimization.
  6. Vectorization is compiler-dependent: Unlike systems with runtime optimizers that might optimize vectorisation at runtime, Sapphire Rapids relies entirely on compiler vectorisation. Explicit vectorisation hints and compiler flags are important.
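
As an illustration of point 2, a short hypothetical kernel using __restrict__ and explicit 64-byte alignment is shown below (not taken from the provided test codes):

#include <cstddef>

// Without __restrict__ the compiler must assume that 'out' could alias
// 'in_a' or 'in_b', which can block vectorisation of the loop. The
// qualifier is a promise by the programmer that the arrays do not overlap.
void scale_add(float* __restrict__ out,
               const float* __restrict__ in_a,
               const float* __restrict__ in_b,
               std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 3.0f * in_a[i] + in_b[i];
}

int main()
{
    constexpr std::size_t n = 1 << 20;
    // 64-byte alignment (cache-line and zmm width) is another explicit hint.
    alignas(64) static float a[n], b[n], c[n];
    scale_add(c, a, b, n);
    return 0;
}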

Intel Sapphire Rapids processors do not include an embedded runtime optimiser of the kind AMD Zen2 is suspected to have. This means:

  • Compile-time optimisations are the primary mechanism for performance improvements
  • -O3 provides significant benefits over -O2 for compute-bound workloads (typically 5-15%)
  • Explicit optimisation hints (__restrict__, vectorisation, etc.) provide measurable benefits
  • Profile-Guided Optimisation is essential and provides 10-30% improvements
  • Architecture-specific flags are critical for enabling hardware features
  • Post-link optimisations (BOLT) are valuable for code layout optimisation

The optimisation strategy for Sapphire Rapids should focus on aggressive compile-time optimisations, PGO, and architecture-specific flags rather than relying on runtime optimisation capabilities.

Benchmark results summary

Measured performance results from the optimisation examples on Intel Xeon Platinum 8480C (Sapphire Rapids):

Measured speedups

  1. Vectorization with AVX-512: 1.29x speedup (LLVM/21) vs 1.01x (Intel oneAPI 2025.0.4)
    • LLVM shows better vectorisation optimisation
    • Both compilers produce correct results (checksums match)
  2. Combined optimisations (AVX-512 + restrict + alignment): 1.27x speedup (LLVM/21) vs 1.03x (Intel oneAPI)
    • LLVM better at combining multiple optimisations
    • Intel compiler shows more conservative optimisation
  3. Restrict pointer optimisation: 1.28x speedup (LLVM/21) vs 1.22x (Intel oneAPI)
    • Both compilers benefit from __restrict__ hints
    • LLVM shows slightly better alias analysis optimisation
  4. Cache-aware data layout: 8.05x speedup (LLVM/21) vs 8.09x (Intel oneAPI); see the structure-of-arrays sketch after this list
    • Both compilers demonstrate strong performance for memory layout optimisations
    • This is the largest speedup category (memory-bound optimisation)
  5. Memory alignment: 7.48x speedup (LLVM/21) vs 5.20x (Intel oneAPI)
    • Both compilers benefit significantly from proper alignment
    • LLVM shows better utilisation of aligned memory access
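
The cache-layout gains in item 4 come from arranging data so that hot loops stream over contiguous memory. A hypothetical sketch contrasting an array-of-structures layout with a structure-of-arrays layout (not the cache_layout test source) is shown below:

#include <cstddef>
#include <vector>

// Array of structures (AoS): fields of different elements are interleaved,
// so a loop that touches only 'x' still drags the other fields into cache.
struct ParticleAoS { float x, y, z, w; };

float sum_x_aos(const std::vector<ParticleAoS>& p)
{
    float s = 0.0f;
    for (const auto& e : p) s += e.x;
    return s;
}

// Structure of arrays (SoA): each field is contiguous, so streaming over
// 'x' uses every byte of every cache line and vectorises cleanly.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
};

float sum_x_soa(const ParticlesSoA& p)
{
    float s = 0.0f;
    for (float v : p.x) s += v;
    return s;
}

int main()
{
    const std::size_t n = 10'000'000;
    std::vector<ParticleAoS> aos(n);
    ParticlesSoA soa;
    soa.x.assign(n, 0.0f); soa.y.assign(n, 0.0f);
    soa.z.assign(n, 0.0f); soa.w.assign(n, 0.0f);
    volatile float sink = sum_x_aos(aos) + sum_x_soa(soa);
    (void)sink;
    return 0;
}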

Compiler comparison: LLVM/21 vs Intel oneAPI 2025.0.4

The examples were tested with both LLVM/21 (clang++) and Intel oneAPI 2025.0.4 (icpx) compilers. Results show:

LLVM/21 Advantages:

  • Better vectorisation optimisation (1.29x vs 1.01x for AVX-512)
  • Better combined optimisation performance (1.27x vs 1.03x)
  • Better memory alignment utilisation (7.48x vs 5.20x)
  • More aggressive optimisation with -O3

Intel oneAPI Advantages:

  • Slightly better cache layout optimisation (8.09x vs 8.05x)
  • More conservative optimisation may be beneficial for stability
  • Better integration with Intel-specific tools (VTune, oneDNN)

Code Compatibility:

  • All examples compile and run with both compilers using the same source code
  • No code modifications needed between compilers
  • Both compilers support the same optimisation flags (-march=sapphirerapids, -mprefer-vector-width=512, etc.)

Recommendations:

  • Use LLVM/21 for maximum performance on compute-bound workloads
  • Use Intel oneAPI when integration with Intel tools (VTune, oneDNN) is required
  • Test both compilers for your specific workload to determine the appropriate choice

Performance comparison table

The following table shows measured performance differences between LLVM/21 and Intel oneAPI 2025.0.4 compilers:

Optimisation Type                                 Example                  LLVM/21 Speedup   Intel oneAPI Speedup   Performance Difference   Remarks
AVX-512 Vectorization                             vectorized_compute       1.29x             1.01x                  +27.7%                   LLVM shows significantly better vectorization
Combined Optimisations                            combined_optimization    1.27x             1.03x                  +23.3%                   LLVM better at combining optimisations
Restrict Pointers                                 restrict_example         1.28x             1.22x                  +4.9%                    Both compilers benefit, LLVM slightly better
Cache Layout                                      cache_layout             8.05x             8.09x                  -0.5%                    Essentially equivalent performance
Memory Alignment                                  memory_alignment         7.48x             5.20x                  +43.8%                   LLVM shows higher alignment utilisation
MKL DGEMM (2048x2048)                             mkl_benchmark            99.75 GFLOPS      99.88 GFLOPS           -0.1%                    Essentially equivalent (MKL is pre-compiled library)
MKL DGEMM OpenMP (2048x2048, 56 threads)          mkl_openmp_example       ~2800 GFLOPS      ~2850 GFLOPS           -1.8%                    Both achieve strong scaling with OpenMP threading
MKL DGEMM MPI (2 processes, 56 threads/process)   mkl_mpi_example          3193 GFLOPS       3192 GFLOPS            +0.03%                   Essentially equivalent performance, both demonstrate scaling with MPI
oneDNN MatMul (2048x2048)                         onednn_benchmark         4278 GFLOPS       4022 GFLOPS            +6.4%                    LLVM shows higher performance for user code, oneDNN library is pre-compiled
Fortran AVX-256 (100M elements, flang only)       fortran_avx512_example   2.97x speedup     N/A                    N/A                      flang AVX-256 only (single precision C_FLOAT), not comparable to ifx
Fortran AVX-512 (100M elements, ifx only)         fortran_avx512_example   N/A               1.74x speedup          N/A                      ifx AVX-512 (single precision C_FLOAT), not comparable to flang

Performance Difference = ((LLVM Speedup - Intel Speedup) / Intel Speedup) × 100%

Note

For MKL benchmarks, performance is measured in GFLOPS rather than speedup, as MKL is a pre-compiled optimized library. Both compilers achieve similar performance since MKL routines are independent of the compiler used. OpenMP and MPI variants demonstrate strong scaling characteristics with both compilers. Results shown are from local testing; for production runs, use the provided SLURM job scripts.

Performance analysis by category

Compute-bound optimisations (vectorisation, combined):

  • LLVM/21 shows 23-28% better performance for compute-bound workloads
  • Intel oneAPI is more conservative, showing minimal speedup (1-3%)
  • Recommendation: Use LLVM/21 for compute-intensive applications

Memory-bound optimisations (cache, alignment):

  • Both compilers show excellent cache layout optimisation (8x speedup)
  • LLVM/21 shows 44% better performance for memory alignment
  • Recommendation: Both compilers work well, but LLVM has an edge for alignment-sensitive code

Pointer aliasing (restrict):

  • Both compilers benefit from __restrict__ hints
  • LLVM/21 shows slightly better optimisation (4.9% difference)
  • Recommendation: Either compiler works well, LLVM has a small advantage

Overall performance summary

Category                        LLVM/21 Performance                                   Intel oneAPI Performance
Compute-bound                   23-28% higher speedup                                 1-3% speedup (conservative)
Memory-bound                    44% higher speedup (alignment)                        0.5% higher speedup (cache layout)
MKL Integration                 Manual linking required                               Simplified -qmkl flag
MKL Performance (Sequential)    99.75 GFLOPS (DGEMM 2048x2048)                        99.88 GFLOPS (DGEMM 2048x2048)
MKL with OpenMP                 Uses -lmkl_gnu_thread -liomp5                         Uses -qmkl=parallel (Intel OpenMP)
MKL with MPI                    Manual MPI linking, set I_MPI_CXX=clang++             Automatic with -qmkl=parallel, set I_MPI_CXX=icpx
oneDNN Performance              4278 GFLOPS (MatMul 2048x2048)                        4022 GFLOPS (MatMul 2048x2048)
Fortran Compilers               flang (LLVM/21) with -fopenmp                         ifx (Intel oneAPI) with -qopenmp
Fortran Performance (AVX-256)   2.97x speedup (single precision C_FLOAT)              N/A (ifx generates AVX-512, not AVX-256)
Fortran Performance (AVX-512)   N/A (flang does not generate AVX-512)                 1.74x speedup (single precision C_FLOAT)
Note                            Results not comparable - different instruction sets   Results not comparable - different instruction sets
Tool Integration                Standard LLVM toolchain                               Intel VTune, oneDNN integration

General performance characteristics

  1. Optimisation level -O3 provides significant benefits over -O2 for compute-bound workloads (typically 5-15% improvement).
  2. Data layout optimisation provides the largest performance improvements. Cache-aware data structure design shows 8x speedup in benchmarks, exceeding other optimisation techniques.
  3. Profile-Guided Optimisation (PGO) provides 10-30% performance gains with proper profiling workflows.
  4. Use -march=sapphirerapids to enable architecture-specific optimisations including AVX-512 and AMX. For ML/AI workloads, AMX provides 2-8x speedup over AVX-512 for matrix operations. Use -mamx-tile -mamx-int8 -mamx-bf16 to enable all AMX types.
  5. Sapphire Rapids supports AVX-512. Use 512-bit vectors with -mprefer-vector-width=512 for compute-bound workloads. For memory-bound code, 256-bit vectors may be preferable to reduce register pressure.
  6. Loop unrolling should be tuned based on instruction cache capacity. Profile to find optimal unroll factor.
  7. For mixed workloads, blend PGO profiles by weighting representative workloads appropriately.
  8. __restrict__ benefits are significant for complex pointer patterns (1.2-1.3x speedup observed). Profile to identify where alias analysis limits optimisation.
  9. Memory alignment provides significant performance improvements (5-7x speedup), enabling vectorisation and reducing cache penalties (see the aligned-allocation sketch after this list).
  10. Dual-socket systems have 2 NUMA domains (one per socket). Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains.
  11. Thread affinity binding is workload-dependent. For single-process workloads, OS scheduling often performs well, but explicit CPU affinity binding may be valuable for multi-process applications and NUMA-aware code.
  12. Optimisation priorities: data layout (8x), memory alignment (5-7x), NUMA awareness, and memory access patterns provide larger performance gains than micro-optimisations (1.2-1.3x).
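
As an illustration of item 9, a minimal sketch of 64-byte aligned allocation with C++17 std::aligned_alloc is shown below (hypothetical, not one of the provided test codes):

#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free (C++17)

int main()
{
    const std::size_t n = 1 << 24;                // 16M floats
    const std::size_t bytes = n * sizeof(float);  // 64 MiB, a multiple of the 64-byte alignment

    // 64-byte alignment matches both the cache-line size and the width of a
    // zmm register, so the compiler can emit aligned AVX-512 loads/stores.
    float* a = static_cast<float*>(std::aligned_alloc(64, bytes));
    float* b = static_cast<float*>(std::aligned_alloc(64, bytes));
    if (!a || !b) return 1;

    for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<float>(i);
    for (std::size_t i = 0; i < n; ++i) b[i] = 2.0f * a[i];   // streaming, vectorisable loop

    std::free(a);
    std::free(b);
    return 0;
}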