Intel Sapphire Rapids Optimisation Guide (Discoverer+ GPU partition)

Introduction

This document describes compilation and execution practices for systems based on the Intel Sapphire Rapids microarchitecture. Sapphire Rapids processors (4th Generation Intel Xeon Scalable, including the Xeon Platinum 8480C) have specific characteristics that affect performance.

The code examples and optimisation techniques explained in this document apply to systems equipped with Intel Xeon Platinum 8480C processors. The system configuration comprises 2 sockets with 56 cores per socket (one NUMA domain per socket), for a total of 112 cores and, with SMT (Simultaneous Multi-Threading), 224 hardware threads.

For detailed hardware specifications of the Discoverer+ compute nodes (based on DGX H200 servers) where these Intel Xeon Platinum processors are installed, refer to the Discoverer Resource Overview.

All compilation and code execution must take place on compute nodes. The only way to access compute nodes is through SLURM batch jobs; direct execution and compilation on login nodes are not permitted. All examples in this document must be submitted as SLURM batch jobs using the provided SLURM scripts in the sapphirerapids/ directory located at /opt/software/sapphirerapids/. The test code is also available online at:

https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids

Important

Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs.

Sapphire Rapids architecture overview

Intel Sapphire Rapids microarchitecture (codenamed “SPR”) is built on the Intel 7 process (formerly 10 nm Enhanced SuperFin) and was introduced in 2023. Sapphire Rapids processors implement a tile-based architecture with multiple compute tiles connected via Intel’s EMIB (Embedded Multi-die Interconnect Bridge).

The core architecture consists of Performance cores (P-cores) based on the Golden Cove microarchitecture. Each core has dedicated L1 and L2 caches. The cache hierarchy includes a 48KB L1D (data cache) and 32KB L1I (instruction cache) per core, 2MB of L2 cache per core, and up to 112.5MB of L3 cache shared across the socket (depending on SKU).
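
These sizes can be confirmed at runtime on the compute nodes; below is a minimal sketch using glibc's sysconf cache constants (a glibc extension; other C libraries may report 0 for these values):

// cache_info.cpp -- print cache geometry as reported by glibc's sysconf
#include <unistd.h>
#include <cstdio>

int main() {
    // The _SC_LEVEL* constants are a glibc extension; a value of 0 means "unknown".
    std::printf("L1D:  %ld KB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    std::printf("L1I:  %ld KB\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
    std::printf("L2:   %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
    std::printf("L3:   %ld MB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / (1024 * 1024));
    std::printf("Line: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}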

Sapphire Rapids cores feature a wide instruction dispatch pipeline with dual 512-bit FMA (Fused Multiply-Add) units per core, enabling simultaneous execution of two 512-bit vector operations. The architecture supports AVX-512 instructions including AVX-512F, AVX-512BW, AVX-512CD, AVX-512DQ, AVX-512VL, AVX-512_VNNI, AVX-512_BF16, AVX-512_FP16, and AVX-512_VBMI2.

Advanced Matrix Extensions (AMX) is a key feature of Sapphire Rapids specifically designed for AI/ML workloads. AMX provides three types of acceleration:

  1. AMX-TILE: 8KB of dedicated tile registers (8 tiles × 1KB each) for efficient matrix data storage and manipulation
  2. AMX-INT8: Hardware acceleration for 8-bit integer matrix multiplication, ideal for quantized neural network inference with 4-8x speedup over AVX-512
  3. AMX-BF16: Hardware acceleration for bfloat16 matrix multiplication, ideal for mixed-precision training and inference with 2-4x speedup over AVX-512

AMX enables significant performance improvements for deep learning workloads, transformer models, and large language model inference. Each core has independent AMX tile registers, allowing efficient parallelization across cores.

The branch prediction unit uses a sophisticated multi-level predictor with improved accuracy over previous generations. Memory disambiguation capabilities allow the processor to detect and handle memory dependencies effectively, enabling out-of-order execution optimisations.

For multi-socket systems such as those based on the Intel Xeon Platinum 8480C, each socket contains multiple tiles connected via EMIB. Each socket presents as a single NUMA domain, with memory controllers distributed across the socket. On a 2-socket system this yields 2 NUMA domains with 56 cores per domain, i.e. 112 cores in total (224 hardware threads with SMT).

Optimisation levels: -O2 vs -O3

Unlike AMD Zen2, Intel Sapphire Rapids does not have a documented embedded runtime optimiser. This means compile-time optimisations, including those enabled by -O3, are more important for achieving optimal performance.

  • Use -O3 for compute-bound workloads: Provides aggressive optimisations including vectorisation, loop unrolling, and inlining that significantly benefit Sapphire Rapids
  • Use -O2 for memory-bound or mixed workloads: Provides balanced optimisation without excessive code bloat that can hurt instruction cache performance
  • Profile to determine optimal level: Test both -O2 and -O3 for your specific workload; -O3 typically provides 5-15% improvement for compute-bound code
  • Combine with architecture-specific flags: -O3 benefits are amplified when combined with -march=sapphirerapids and AVX-512 optimisations

The lack of a runtime optimiser means that compile-time optimisations are the primary mechanism for performance improvements. Aggressive optimisations at compile time translate directly to runtime performance.

CPU-specific compilation flags

Architecture targeting

# Use -march=sapphirerapids to enable Sapphire Rapids-specific instructions
-march=sapphirerapids

# This enables:
# - AVX-512 (512-bit vectors)
# - AVX-512_VNNI (vector neural network instructions)
# - AVX-512_BF16 (bfloat16 support)
# - AVX-512_FP16 (half-precision floating point)
# - AMX (Advanced Matrix Extensions)
# - Other Sapphire Rapids-specific instruction sets

# Alternative: Use -march=native to auto-detect all features
-march=native

Vector width optimisation

# Optimal vector width for Sapphire Rapids is 512-bit vectors (AVX-512 support)
# Sapphire Rapids has dual 512-bit FMA units per core
-mprefer-vector-width=512

# For workloads that may benefit from 256-bit vectors (less register pressure)
# Use 256-bit for memory-bound code or when register spilling occurs
-mprefer-vector-width=256

# Ensures vectorized math uses FMA instructions
-mfma

AVX-512 specific optimisations

# Enable the AVX-512 foundation and common subsets
-mavx512f -mavx512dq -mavx512cd -mavx512bw -mavx512vl

# Enable AVX-512 VNNI for neural network workloads
-mavx512vnni

# Enable AVX-512 BF16 for bfloat16 operations
-mavx512bf16

# Enable AVX-512 FP16 for half-precision operations
-mavx512fp16

# Note: -march=sapphirerapids automatically enables all supported AVX-512 variants

AMX (Advanced Matrix Extensions)

AMX is Intel’s dedicated hardware acceleration for matrix operations, specifically designed for AI/ML workloads. Sapphire Rapids supports three AMX types:

  1. AMX-TILE: Provides 8KB of tile registers (8 tiles × 1KB each) for matrix data storage
  2. AMX-INT8: Accelerates 8-bit integer matrix multiplication (INT8 quantization)
  3. AMX-BF16: Accelerates bfloat16 matrix multiplication (BF16 mixed precision)

Compilation flags

# Enable all AMX types for matrix multiplication workloads (AI/ML)
-mamx-tile -mamx-int8 -mamx-bf16

# Or use -march=sapphirerapids which automatically enables AMX support
-march=sapphirerapids

# Note: AMX requires runtime detection and explicit usage
# Compiler will not auto-vectorize to AMX; requires manual intrinsics

Runtime detection

AMX requires runtime detection and proper OS support (Linux kernel 5.16+):

#include <cpuid.h>
#include <immintrin.h>

bool check_amx_support() {
    unsigned int eax, ebx, ecx, edx;

    // AMX feature bits are reported in CPUID leaf 0x7, subleaf 0x0, EDX
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);

    if ((edx & (1u << 24)) == 0) return false;  // AMX-TILE (bit 24)
    if ((edx & (1u << 25)) == 0) return false;  // AMX-INT8 (bit 25)
    if ((edx & (1u << 22)) == 0) return false;  // AMX-BF16 (bit 22)

    return true;
}
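
On Linux, CPUID detection alone is not sufficient: each process must also request permission to use the AMX tile data state before executing tile instructions, otherwise they fault. A minimal sketch of that request via arch_prctl (assuming kernel 5.16+; the constants match the kernel uapi headers and are defined here only as a fallback):

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Constants from the Linux uapi headers (asm/prctl.h), defined as a fallback
// for older toolchain headers.
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILEDATA 18

// Ask the kernel to enable the AMX tile data state for this process.
bool request_amx_permission() {
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
        std::perror("ARCH_REQ_XCOMP_PERM");
        return false;
    }
    return true;
}

Call this once at start-up, before the first _tile_loadconfig.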

AMX tile configuration

AMX uses a tile configuration that must be set before using tile operations:

#include <immintrin.h>
#include <stdint.h>

// Configure AMX tiles for matrix multiplication.
// The 64-byte configuration blob has a fixed layout: byte 0 holds the palette
// id (must be 1), bytes 16-47 hold the bytes-per-row of tiles 0-15 (uint16
// each), and bytes 48-63 hold the row counts of tiles 0-15 (uint8 each).
// Tile dimensions: max 16 rows × 64 bytes per row.
void configure_amx_tiles() {
    // Tile 0: matrix A, Tile 1: matrix B, Tile 2: matrix C (accumulator),
    // each 16 rows × 64 bytes (1024 bytes)
    uint8_t tilecfg[64] = {0};

    tilecfg[0] = 1;      // palette 1

    // Bytes per row (uint16 at offset 16 + 2*tile)
    tilecfg[16] = 64;    // tile 0
    tilecfg[18] = 64;    // tile 1
    tilecfg[20] = 64;    // tile 2

    // Rows (uint8 at offset 48 + tile)
    tilecfg[48] = 16;    // tile 0
    tilecfg[49] = 16;    // tile 1
    tilecfg[50] = 16;    // tile 2

    _tile_loadconfig(tilecfg);
}

AMX-BF16 for neural network inference

AMX-BF16 is ideal for neural network inference with bfloat16 precision:

// Example: Matrix multiplication using AMX-BF16
// C = A × B where A and B are bfloat16 matrices; C accumulates in FP32
// (assumes B is pre-packed into the VNNI pair-interleaved layout expected
// by _tile_dpbf16ps)
#include <immintrin.h>
#include <stdint.h>

void amx_bf16_matmul(const __bf16* A, const __bf16* B, float* C,
                     int M, int N, int K) {
    // Configure tiles (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;                       // palette 1
    tilecfg[16] = 64; tilecfg[48] = 16;   // Tile 0: A block (16×32 bf16 elements)
    tilecfg[18] = 64; tilecfg[49] = 16;   // Tile 1: B block (VNNI-packed)
    tilecfg[20] = 64; tilecfg[50] = 16;   // Tile 2: C accumulator (16×16 FP32)
    _tile_loadconfig(tilecfg);

    // Load matrices into tiles and perform multiplication
    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            // Zero accumulator tile
            _tile_zero(2);

            for (int k = 0; k < K; k += 32) {
                // Load tile 0 with A[i:i+16, k:k+32]
                _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16));

                // Load tile 1 with the corresponding VNNI-packed block of B
                _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16));

                // Compute: tile2 += tile0 × tile1
                _tile_dpbf16ps(2, 0, 1);
            }

            // Store the FP32 result from tile 2 to C[i:i+16, j:j+16]
            _tile_stored(2, &C[i * N + j], N * sizeof(float));
        }
    }

    // Release tile configuration
    _tile_release();
}

AMX-INT8 for quantized neural networks

AMX-INT8 accelerates INT8 quantized models (common for inference):

// Example: INT8 quantized matrix multiplication
void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C,
                     int M, int N, int K) {
    // Configure tiles for INT8 (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;                       // palette 1
    tilecfg[16] = 64; tilecfg[48] = 16;   // Tile 0: A block (16×64 int8 elements)
    tilecfg[18] = 64; tilecfg[49] = 16;   // Tile 1: B block (VNNI-packed)
    tilecfg[20] = 64; tilecfg[50] = 16;   // Tile 2: C accumulator (16×16 int32)
    _tile_loadconfig(tilecfg);

    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            _tile_zero(2);  // Zero accumulator

            for (int k = 0; k < K; k += 64) {
                // Load A[i:i+16, k:k+64]
                _tile_loadd(0, &A[i * K + k], K);

                // Load B[k:k+64, j:j+16] (transposed)
                _tile_loadd(1, &B[k * N + j], N);

                // Compute: tile2 += tile0 × tile1 (INT8)
                _tile_dpbssd(2, 0, 1);
            }

            // Store result (int32 accumulator)
            _tile_stored(2, &C[i * N + j], N * sizeof(int32_t));
        }
    }

    _tile_release();
}

Performance considerations

  • Tile reuse: Keep tiles loaded across multiple operations to minimize memory traffic
  • Blocking: Use appropriate block sizes (16×64 for BF16, 16×64 for INT8) to maximise tile utilisation
  • Memory alignment: Align matrices to 64-byte boundaries for optimal tile loading
  • Multi-threading: Each thread has its own tile registers; use one thread per core for AMX workloads
  • Mixed precision: Use BF16 for training/inference when precision allows; INT8 for maximum throughput in inference
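
Since the tile configuration is per-thread hardware state, each OpenMP thread must load its own configuration before issuing tile instructions. A minimal sketch of the multi-threading recommendation above (assuming AMX permission has already been requested as described earlier and the file is compiled with -mamx-tile -fopenmp):

#include <immintrin.h>
#include <omp.h>
#include <stdint.h>

void parallel_amx_region() {
    #pragma omp parallel
    {
        // Each thread owns its TILECFG state and must configure it itself.
        uint8_t tilecfg[64] = {0};
        tilecfg[0]  = 1;    // palette 1
        tilecfg[16] = 64;   // tile 0: 64 bytes per row
        tilecfg[48] = 16;   // tile 0: 16 rows
        _tile_loadconfig(tilecfg);

        // ... per-thread AMX work on this thread's block of the matrix ...

        _tile_release();    // drop tile state before the thread goes idle
    }
}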

Integration with ML frameworks

Many ML frameworks automatically use AMX when available:

  • TensorFlow: Enable with TF_ENABLE_ONEDNN_OPTS=1 (uses oneDNN library)
  • PyTorch: Uses oneDNN optimisations automatically on Sapphire Rapids
  • oneDNN: Intel’s deep neural network library with AMX support

# Enable oneDNN AMX optimizations
export TF_ENABLE_ONEDNN_OPTS=1
export ONEDNN_VERBOSE=1  # For debugging/verification

Loop and alignment

# Sapphire Rapids benefits from 64-byte alignment (the cache-line size,
# which is also the optimal alignment for AVX-512)
-falign-loops=64
-falign-functions=64

LLVM-specific optimisations

# Enable loop interchange (benefits from sophisticated branch prediction)
-mllvm -enable-loopinterchange

# Tune prefetch distance (Sapphire Rapids has aggressive prefetchers)
-mllvm -prefetch-distance=256

# Enable interleaved memory access optimization
-mllvm -enable-interleaved-mem-accesses

# Note: -enable-npm selects LLVM's new pass manager (the default in recent
# releases); it is not a NUMA-placement option
-mllvm -enable-npm=true

# Force vectorisation width and interleaving
# (-force-vector-width takes a lane count, e.g. 16 floats per 512-bit vector,
# not a width in bits)
-mllvm -force-vector-width=16
-mllvm -force-vector-interleave=2

What to avoid

  • Excessive loop unrolling: Can cause instruction cache misses; profile to find optimal unroll factor
  • Over-aggressive -ffast-math: Test carefully; precision requirements must be considered
  • Generic -march flags: Target sapphirerapids specifically for architecture-specific optimisations
  • Mixing AVX-512 and AVX2: Use consistent vector width throughout the application

Profile-guided optimisation (PGO)

On Sapphire Rapids, PGO typically provides 10-30% performance improvements for branch-heavy code, and it can be combined with LTO and BOLT.

PGO benefits for Sapphire Rapids

Given Sapphire Rapids’ sophisticated branch predictor and wide execution units, PGO provides significant benefits because it:

  • Optimises for the branch patterns actually observed at runtime
  • Improves code layout, reducing instruction cache misses
  • Splits hot and cold code, keeping the working set in L2/L3
  • Enables better vectorisation decisions based on runtime data

PGO workflow

# Step 1: Instrumentation build
clang++ -fprofile-generate -march=sapphirerapids -O3 -flto=thin \
        -mprefer-vector-width=512 \
        source.cpp -o program

# Step 2: Run representative workloads
# In SLURM batch job:
./program < typical_input_1
./program < typical_input_2
./program < typical_input_3

# Step 3: Merge profiles (if multiple runs)
llvm-profdata merge -o final.profdata default.profraw

# Step 4: Optimised build with profile
clang++ -fprofile-use=final.profdata -march=sapphirerapids -O3 \
        -flto=thin -mprefer-vector-width=512 \
        source.cpp -o program_optimized

Blended profiles for mixed workloads

For diverse customer workloads, create weighted blended profiles:

# Collect profiles from multiple workloads
llvm-profdata merge -o workload_A.profdata default.profraw_A
llvm-profdata merge -o workload_B.profdata default.profraw_B
llvm-profdata merge -o workload_C.profdata default.profraw_C

# Merge with weights based on importance/frequency
llvm-profdata merge \
    -weighted-input=3,workload_A.profdata \
    -weighted-input=2,workload_B.profdata \
    -weighted-input=1,workload_C.profdata \
    -o final_blended.profdata

Memory optimisations

Cache-aware compilation

# Improved cache utilisation through section elimination
-fdata-sections -ffunction-sections

# Linker garbage collection (use with above flags)
-Wl,--gc-sections

Structure and data layout

  • Pack hot data structures to fit within 32KB L1 cache
  • Consider __restrict__ for pointer aliasing hints (Sapphire Rapids has strong memory disambiguation, but explicit hints can still help the compiler)
  • Align data structures to cache line boundaries (64 bytes)
  • Use structure-of-arrays (SoA) layout for vectorised code when beneficial
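
For illustration, a minimal sketch of these layout hints (cache-line alignment, __restrict__ and an SoA container); the type and function names are hypothetical:

#include <cstddef>
#include <vector>

// Hot data aligned to a 64-byte cache line.
struct alignas(64) Accumulator {
    double sum;
    long   count;
};

// Structure-of-arrays layout: each field is contiguous, which vectorises well.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// __restrict__ promises the compiler that a, b and out do not alias.
void axpy(const float* __restrict__ a, const float* __restrict__ b,
          float* __restrict__ out, std::size_t n, float alpha) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = alpha * a[i] + b[i];
}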

Huge pages

# Enable transparent huge pages in madvise mode
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# In code, use madvise for large allocations
madvise(large_buffer, size, MADV_HUGEPAGE);
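
A complete, self-contained sketch of the madvise call above (assumes Linux with transparent huge pages in madvise mode; the 2 MiB alignment matches the x86-64 huge page size):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t size = 1ULL << 30;            // 1 GiB working buffer
    const std::size_t huge_page = 2 * 1024 * 1024;  // 2 MiB huge page size on x86-64

    // Align the allocation to the huge page size so madvise can back it fully.
    void* large_buffer = std::aligned_alloc(huge_page, size);
    if (!large_buffer) return 1;

    // Request transparent huge pages for this range (advisory only).
    if (madvise(large_buffer, size, MADV_HUGEPAGE) != 0)
        std::perror("madvise(MADV_HUGEPAGE)");

    // ... use large_buffer ...

    std::free(large_buffer);
    return 0;
}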

Memory allocators

Consider replacing default allocator with:

  • jemalloc: Improved performance for concurrent workloads
  • tcmalloc: Fast thread-caching allocator with good multi-threaded performance
  • mimalloc: Low overhead, suitable for mixed workloads

Mixed workload strategy

For systems serving diverse customer workloads, use a balanced approach:

Conservative optimisation flags

  • -O3 for compute-bound code: Provides significant benefits on Sapphire Rapids
  • -O2 for memory-bound code: Avoids code bloat that can hurt cache performance
  • -flto=thin: Faster, more predictable performance across workload variations
  • PGO with blended profiles: Weighted combination of representative workloads

Function multi-versioning

For hot paths, use target clones:

__attribute__((target_clones("default","avx2","avx512f")))
void process_data(/* params */) {
    // Hot path that benefits from different optimizations
    // Runtime dispatcher selects the best version
}

Split optimisation by code characteristics

# Hot paths (identified via profiling)
set_source_files_properties(hot_path.cpp PROPERTIES
    COMPILE_FLAGS "-O3 -march=sapphirerapids -fprofile-use=hot.profdata -mprefer-vector-width=512")

# Cold paths
set_source_files_properties(general_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v4")

# Core libraries
set_source_files_properties(core_lib.cpp PROPERTIES
    COMPILE_FLAGS "-O3 -march=sapphirerapids -flto=full -mprefer-vector-width=512")

# Customer-facing code
set_source_files_properties(api_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v4 -flto=thin")

Practical build configuration

Complete example: core library build (in SLURM batch job)

# In SLURM batch job:
clang++ -O3 \
        -march=sapphirerapids \
        -mprefer-vector-width=512 \
        -mfma \
        -falign-loops=64 \
        -falign-functions=64 \
        -fdata-sections \
        -ffunction-sections \
        -fno-semantic-interposition \
        -fno-plt \
        -flto=thin \
        -fprofile-use=blended.profdata \
        -mllvm -enable-loopinterchange \
        -mllvm -prefetch-distance=256 \
        source.cpp -o program \
        -fuse-ld=lld \
        -Wl,--gc-sections \
        -Wl,--icf=safe

CMake configuration

set(CMAKE_C_COMPILER clang)
set(CMAKE_CXX_COMPILER clang++)

# C++ standard (C++17, C++20, or C++23)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Base flags
set(CMAKE_C_FLAGS_RELEASE "-O3 -march=sapphirerapids -mprefer-vector-width=512")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")

# LTO
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -flto=thin")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -flto=thin")

# PGO (if profile available)
if(EXISTS "${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
endif()

# Linker
set(CMAKE_EXE_LINKER_FLAGS_RELEASE "-fuse-ld=lld -Wl,--gc-sections -Wl,--icf=safe")
set(CMAKE_SHARED_LINKER_FLAGS_RELEASE "${CMAKE_EXE_LINKER_FLAGS_RELEASE}")

Runtime considerations

CPU frequency scaling

Sapphire Rapids systems typically use intel_pstate driver for CPU frequency scaling:

# Check current CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set to performance mode (if root or via SLURM)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Or use cpupower (if available)
cpupower frequency-set -g performance

NUMA awareness

Sapphire Rapids systems with multiple sockets have multiple NUMA domains:

  1. Identify NUMA topology: Use numactl --hardware to see NUMA node layout
  2. Bind memory allocation: Use numactl --membind=N to allocate memory from specific NUMA node
  3. Bind CPU affinity: Use numactl --cpunodebind=N to bind to specific NUMA node
  4. Monitor NUMA statistics: Use numastat and perf stat -e numa-misses
  5. Dual-socket systems: 2 NUMA domains, one per socket. Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains
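
Memory can also be bound from inside the application with libnuma (link with -lnuma); a minimal sketch, assuming libnuma is available on the compute nodes:

#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    const std::size_t size = 1ULL << 30;  // 1 GiB

    // Allocate memory on NUMA node 0; keep the accessing threads on the same
    // socket (e.g. via OMP_PLACES/OMP_PROC_BIND or numactl) for local access.
    double* buf = static_cast<double*>(numa_alloc_onnode(size, 0));
    if (!buf) return 1;

    // ... compute on buf ...

    numa_free(buf, size);
    return 0;
}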

Example: NUMA-optimized execution

Direct execution (using numactl):

# Check NUMA topology
numactl --hardware

# Single NUMA domain binding (for single-process)
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Multiple NUMA domains: one process per domain
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./process1 &
srun --cpu-bind=sockets:1-1 numactl --membind=1 --cpunodebind=1 ./process2 &

# Monitor NUMA performance
# In SLURM batch job:
perf stat -e numa-misses,numa-migrations ./program
numastat  # Show NUMA allocation statistics

SLURM execution:

# Single NUMA domain (for single-process)
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=56
#SBATCH --cpus-per-task=112
srun --cpu-bind=sockets:0-0 ./program

# Multiple NUMA domains (one task per domain)
#SBATCH --ntasks=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=56
srun --cpu-bind=sockets ./program

# Explicit NUMA domain binding with numactl
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Check SLURM CPU binding
srun --cpu-bind=sockets:0-0 numactl --hardware

Environment variables

Make optimisation thresholds runtime-configurable:

# Example: Tunable buffer sizes
export BUFFER_SIZE=1048576
export PARALLELISM_THRESHOLD=1000
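
A minimal sketch of consuming these tunables at start-up (the variable names match the example above; the defaults are arbitrary):

#include <cstddef>
#include <cstdlib>
#include <string>

// Read an environment variable, falling back to a default when unset.
static std::size_t env_or_default(const char* name, std::size_t fallback) {
    const char* value = std::getenv(name);
    return value ? std::stoull(value) : fallback;
}

int main() {
    const std::size_t buffer_size = env_or_default("BUFFER_SIZE", 1 << 20);
    const std::size_t parallelism_threshold =
        env_or_default("PARALLELISM_THRESHOLD", 1000);

    // ... size buffers / switch between serial and parallel paths accordingly ...
    (void)buffer_size;
    (void)parallelism_threshold;
    return 0;
}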

Monitoring and feedback

  • Instrumented production builds: Use lightweight sampling (-fprofile-sample-use) to collect real customer profiles
  • Performance telemetry: Track which code paths are actually hot in production
  • A/B testing: Deploy different optimisation configurations to subsets of traffic

SLURM configuration

Systems with Intel Sapphire Rapids processors typically use SLURM for job scheduling with specific NUMA domain configuration.

Node configuration

From system configuration:

  • 2 NUMA domains: One per socket
  • 56 cores per NUMA domain: Each socket contains 56 cores
  • 2 threads per core: SMT (Simultaneous Multi-Threading) enabled
  • Total capacity: 2 × 56 × 2 = 224 threads per node
  • Memory per NUMA domain: Varies by system configuration

SLURM directives

All jobs must use appropriate SLURM directives:

#SBATCH --partition=commong      # Default partition (not "cn")
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --sockets-per-node=1    # For single NUMA domain
#SBATCH --cores-per-socket=56   # All cores in one socket
#SBATCH --cpus-per-task=112     # All threads in one socket (with SMT)

Note

The default partition for SLURM is commong, not cn.

QoS Requirements for Intensive CPU Jobs:

Users must ensure they have a QoS (Quality of Service) that allows intensive CPU jobs. The Discoverer+ cluster policy prioritises GPU workloads over intensive CPU workloads. Verify that your QoS configuration permits CPU-intensive jobs before submitting SLURM batch jobs for Sapphire Rapids optimisation benchmarks.

OpenMP configuration

For OpenMP workloads:

#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=56
#SBATCH --cpus-per-task=112

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun --cpu-bind=sockets:0-0 ./openmp_program
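
To verify that the placement settings took effect, a small check program can report where each OpenMP thread actually runs (sched_getcpu() is Linux/glibc-specific); compile it with -fopenmp:

#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Report the logical CPU each OpenMP thread is pinned to.
        #pragma omp critical
        std::printf("thread %3d of %3d on CPU %3d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}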

SLURM configuration recommendations

  1. Single-process applications: Use --sockets-per-node=1 to bind to one NUMA domain
  2. Multi-process applications: Use --ntasks=N with --sockets-per-node=N (one task per NUMA domain)
  3. Memory allocation: Request memory proportional to NUMA domains used
  4. CPU binding: Always use --cpu-bind=sockets to ensure proper NUMA binding
  5. Monitor binding: Check with srun --cpu-bind=sockets numactl --hardware
  6. Thread placement: For OpenMP, use OMP_PLACES=cores and OMP_PROC_BIND=close

Example code demonstrating optimisation benefits

The following examples demonstrate how different optimisations benefit Sapphire Rapids performance. The example source code and SLURM scripts are located in the sapphirerapids/ directory at /opt/software/sapphirerapids/.

Important

The full path to the sapphirerapids/ folder is /opt/software/sapphirerapids/. You can copy this folder to your project directory or work directly from the system location.

The test code is also available online at: https://gitlab.discoverer.bg/vkolev/snippets/-/blob/main/sapphirerapids

To reproduce benchmark results, you can either work from the system location or copy the folder to your project directory:

# Option 1: Work from the system location
cd /opt/software/sapphirerapids

# Option 2: Copy to your project directory
mkdir -p /path/to/your/project
cp -r /opt/software/sapphirerapids /path/to/your/project/
cd /path/to/your/project/sapphirerapids

# Submit the SLURM batch job from within the folder
sbatch slurm_all_benchmarks.sh

This compiles and executes all examples within the SLURM job. Results are written to output files in the same directory.

Note

The SLURM scripts use SLURM_SUBMIT_DIR to locate the source files, so they must be submitted from within the sapphirerapids/ directory where the source files reside.

Example 1: Vectorisation with AVX-512

This example shows how -march=sapphirerapids enables AVX-512 vectorisation:

// vectorized_compute.cpp
#include <immintrin.h>
#include <chrono>
#include <iostream>

// Unoptimized version (scalar)
void compute_scalar(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

// Optimized version (vectorized with AVX-512)
void compute_vectorized(float* __restrict__ a, float* __restrict__ b,
                        float* __restrict__ c, size_t n) {
    size_t i = 0;
    // Process 16 floats at a time (512-bit AVX-512)
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_load_ps(&a[i]);
        __m512 vb = _mm512_load_ps(&b[i]);
        __m512 vc = _mm512_fmadd_ps(va, vb, va); // FMA: a*b + a
        _mm512_store_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

int main() {
    const size_t n = 100000000;
    float* a = (float*)_mm_malloc(n * sizeof(float), 64);
    float* b = (float*)_mm_malloc(n * sizeof(float), 64);
    float* c = (float*)_mm_malloc(n * sizeof(float), 64);

    // Initialize
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Benchmark scalar
    auto start = std::chrono::high_resolution_clock::now();
    compute_scalar(a, b, c, n);
    auto end = std::chrono::high_resolution_clock::now();
    auto scalar_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // Benchmark vectorized
    start = std::chrono::high_resolution_clock::now();
    compute_vectorized(a, b, c, n);
    end = std::chrono::high_resolution_clock::now();
    auto vectorised_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "Scalar time: " << scalar_time << " us\n";
    std::cout << "Vectorized time: " << vectorised_time << " us\n";
    std::cout << "Speedup: " << (double)scalar_time / vectorised_time << "x\n";

    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
        -mfma vectorized_compute.cpp -o vectorized_compute

Example 2: Cache-aware data layout

This example demonstrates the importance of data layout for cache performance:

// cache_layout_example.cpp
#include <chrono>
#include <iostream>
#include <vector>

// Array of Structures (AoS) - poor cache locality
struct Point {
    float x, y, z;
    int id;
};

// Return the sum so the compiler cannot optimise the loop away at -O3
float process_aos(Point* points, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += points[i].x * points[i].y;
    }
    return sum;
}

// Structure of Arrays (SoA) - improved cache locality
struct Points {
    std::vector<float> x, y, z;
    std::vector<int> id;
};

float process_soa(Points& points, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += points.x[i] * points.y[i];
    }
    return sum;
}

int main() {
    const size_t n = 10000000;

    // AoS version
    Point* aos_points = new Point[n];
    for (size_t i = 0; i < n; ++i) {
        aos_points[i].x = 1.0f;
        aos_points[i].y = 2.0f;
    }

    auto start = std::chrono::high_resolution_clock::now();
    float aos_sum = process_aos(aos_points, n);
    auto end = std::chrono::high_resolution_clock::now();
    auto aos_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // SoA version
    Points soa_points;
    soa_points.x.resize(n);
    soa_points.y.resize(n);
    for (size_t i = 0; i < n; ++i) {
        soa_points.x[i] = 1.0f;
        soa_points.y[i] = 2.0f;
    }

    start = std::chrono::high_resolution_clock::now();
    float soa_sum = process_soa(soa_points, n);
    end = std::chrono::high_resolution_clock::now();
    auto soa_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AoS time: " << aos_time << " us\n";
    std::cout << "SoA time: " << soa_time << " us\n";
    std::cout << "Speedup: " << (double)aos_time / soa_time << "x\n";

    delete[] aos_points;
    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 \
        cache_layout_example.cpp -o cache_layout_example

Example 3: Profile-guided optimisation benefit

This example demonstrates PGO workflow and benefits:

// pgo_example.cpp
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

// Hot path function
void process_hot_path(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        if (data[i] > 1000) {  // Common branch
            data[i] = data[i] * 2 + 1;
        } else {  // Less common branch
            data[i] = data[i] / 2;
        }
    }
}

// Cold path function
void process_cold_path(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

int main(int argc, char* argv[]) {
    const size_t n = 10000000;
    std::vector<int> data(n);

    // Initialize with pattern that makes hot path common
    for (size_t i = 0; i < n; ++i) {
        data[i] = (i % 10 == 0) ? 500 : 2000;  // 90% go to hot path
    }

    // Simulate typical workload
    for (int iter = 0; iter < 100; ++iter) {
        process_hot_path(data);
        if (iter % 10 == 0) {
            process_cold_path(data);
        }
    }

    return 0;
}

PGO workflow:

# Step 1: Instrumentation build
clang++ -fprofile-generate -O3 -march=sapphirerapids \
        pgo_example.cpp -o pgo_example

# Step 2: Run representative workload
./pgo_example

# Step 3: Merge profile
llvm-profdata merge -o pgo.profdata default.profraw

# Step 4: Optimised build
clang++ -fprofile-use=pgo.profdata -O3 -march=sapphirerapids \
        pgo_example.cpp -o pgo_example_optimized

Example 4: Intel MKL with LLVM and Intel compilers

This example demonstrates how both LLVM/21 and Intel oneAPI compilers can use Intel Math Kernel Library (MKL) for optimized linear algebra operations.

Source file: sapphirerapids/mkl_benchmark.cpp
SLURM script: sapphirerapids/slurm_mkl_benchmark.sh
Location: sapphirerapids/ directory

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -I$MKLROOT/include \
        -L$MKLROOT/lib/intel64 \
        -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl \
        mkl_benchmark.cpp -o mkl_benchmark_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4
icpx -qmkl=sequential -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     mkl_benchmark.cpp -o mkl_benchmark_intel

Performance results:

  • Both compilers achieve similar MKL performance (within 0.5%)
  • For 2048x2048 DGEMM: LLVM/21 = 99.75 GFLOPS, Intel oneAPI = 99.88 GFLOPS
  • MKL library performance is independent of compiler choice
  • Differences come from user code optimisation, not MKL library calls

Compiler differences:

  • Intel oneAPI provides simpler MKL linking with -qmkl flag
  • LLVM/21 requires manual library linking but offers more control
  • Both compilers can achieve optimal MKL performance
  • Compiler choice primarily affects user code, not pre-compiled MKL routines

MKL with OpenMP threading

MKL can use OpenMP for internal threading, which is important for multi-threaded applications. The choice of threading library must match between your application and MKL to avoid conflicts.

Source file: sapphirerapids/mkl_openmp_example.cpp
SLURM script: sapphirerapids/slurm_mkl_openmp.sh

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -fopenmp \
        -I$MKLROOT/include \
        -L$MKLROOT/lib/intel64 \
        -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \
        mkl_openmp_example.cpp -o mkl_openmp_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4
icpx -qmkl=parallel -qopenmp -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     mkl_openmp_example.cpp -o mkl_openmp_intel

Threading library differences:

  • LLVM/21: Uses -lmkl_gnu_thread with -liomp5 (Intel OpenMP runtime) for improved scaling
  • Intel oneAPI: Uses -qmkl=parallel which automatically selects libmkl_intel_thread with Intel OpenMP
  • Both require matching OpenMP runtime libraries to avoid conflicts

Runtime configuration:

export OMP_NUM_THREADS=56
export MKL_NUM_THREADS=56
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./mkl_openmp_llvm 2048 56

Performance Considerations:

  • Set OMP_NUM_THREADS and MKL_NUM_THREADS to the same value
  • Use OMP_PLACES=cores and OMP_PROC_BIND=close for NUMA-aware placement
  • Intel OpenMP (libiomp5) typically provides higher scaling than GNU OpenMP for MKL
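
The thread counts can also be set programmatically rather than through environment variables; a minimal sketch, assuming the MKL headers are on the include path:

#include <mkl.h>
#include <omp.h>

// Keep the OpenMP and MKL thread counts consistent, e.g. one thread per core
// on a single 56-core socket.
void configure_threading(int threads_per_socket) {
    omp_set_num_threads(threads_per_socket);
    mkl_set_num_threads(threads_per_socket);
}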

MKL with MPI

MKL can be used with MPI for distributed-memory parallel applications. Intel MPI (provided with oneAPI) works with both LLVM and Intel compilers.

Source file: sapphirerapids/mkl_mpi_example.cpp
SLURM script: sapphirerapids/slurm_mkl_mpi.sh

Compilation with LLVM/21:

module load mkl/2025.0 llvm/21 mpi/2021.14
# Intel MPI uses I_MPI_CXX environment variable to select compiler
export I_MPI_CXX=clang++
export CXXFLAGS="-O3 -march=sapphirerapids -DNDEBUG -std=c++17 -stdlib=libc++"
mpicxx ${CXXFLAGS} \
       -I$MKLROOT/include \
       -L$MKLROOT/lib/intel64 \
       -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -liomp5 -lpthread -lm -ldl \
       mkl_mpi_example.cpp -o mkl_mpi_llvm

Compilation with Intel oneAPI:

module load mkl/2025.0 compiler-intel-llvm/2025.0.4 mpi/2021.14
# Force mpicxx to use icpx instead of default g++
# Intel MPI uses I_MPI_CXX environment variable to select compiler
export I_MPI_CXX=icpx
mpicxx -O3 -march=sapphirerapids -DNDEBUG -std=c++17 \
       -qmkl=parallel \
       mkl_mpi_example.cpp -o mkl_mpi_intel

MPI process grid configuration:

# Run with 2 MPI processes, 56 threads each
export MKL_NUM_THREADS=56
export OMP_NUM_THREADS=56
srun -n 2 --cpu-bind=sockets ./mkl_mpi_llvm 2048 56

MPI configuration notes:

  • Intel MPI (mpicxx) wrapper works with both compilers
  • For LLVM, set I_MPI_CXX=clang++ to override default g++ compiler
  • For Intel oneAPI, set I_MPI_CXX=icpx to override default g++ compiler
  • By default, mpicxx uses g++ unless compiler is explicitly specified
  • Intel MPI uses I_MPI_CXX environment variable (not CXX) to select the C++ compiler for both LLVM and Intel compilers
  • MKL threading (MKL_NUM_THREADS) should match OpenMP threads per process
  • Use --cpu-bind=sockets in SLURM to bind processes to NUMA domains

MKL BLACS for ScaLAPACK:

For distributed linear algebra (ScaLAPACK), MKL provides BLACS libraries:

  • libmkl_blacs_intelmpi_lp64 - Intel MPI
  • libmkl_blacs_openmpi_lp64 - OpenMPI

These are automatically selected when using -qmkl=cluster with Intel compilers, or manually linked with LLVM.

Example 5: Intel oneDNN with LLVM and Intel compilers

Intel oneDNN (Deep Neural Network Library) is a performance library for deep learning applications, providing optimized primitives for neural network operations. It supports both LLVM and Intel compilers.

Source file: sapphirerapids/onednn_benchmark.cpp
SLURM script: sapphirerapids/slurm_onednn_benchmark.sh

Compilation with LLVM/21:

module load llvm/21 dnnl/latest
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -stdlib=libc++ \
        -I$DNNLROOT/include \
        -L$DNNLROOT/lib -ldnnl -Wl,-rpath,$DNNLROOT/lib \
        onednn_benchmark.cpp -o onednn_llvm

Compilation with Intel oneAPI:

module load compiler-intel-llvm/2025.0.4 dnnl/latest
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 \
     -I$DNNLROOT/include \
     -L$DNNLROOT/lib -ldnnl \
     onednn_benchmark.cpp -o onednn_intel

Performance results:

  • LLVM/21: 4278 GFLOPS (average, 2048x2048 matrix multiplication)
  • Intel oneAPI: 4022 GFLOPS (average, 2048x2048 matrix multiplication)
  • LLVM/21 shows approximately 6.4% higher performance

Compiler performance with oneDNN:

  • Both compilers can successfully use oneDNN library
  • oneDNN library itself is pre-compiled, but user code compilation affects performance
  • LLVM/21 shows slightly higher performance for the benchmark code
  • oneDNN automatically detects and uses AMX instructions on Sapphire Rapids
  • Both compilers link against the same oneDNN library (version 3.6.1)

Runtime configuration:

# Disable verbose output (optional)
export DNNL_VERBOSE=0
export ONEDNN_VERBOSE=0

# Run benchmark
./onednn_llvm 2048
./onednn_intel 2048

Integration with ML frameworks:

  • TensorFlow: Enable with TF_ENABLE_ONEDNN_OPTS=1
  • PyTorch: Uses oneDNN automatically on Sapphire Rapids
  • Both frameworks benefit from oneDNN’s AMX optimisations

Example 6: AMX for ML/AI workloads

This example demonstrates how to use AMX (Advanced Matrix Extensions) for machine learning and AI workloads, showing all three AMX types: AMX-TILE, AMX-INT8, and AMX-BF16.

Source file: sapphirerapids/amx_ml_example.cpp
SLURM script: sapphirerapids/slurm_amx_example.sh
Location: sapphirerapids/ directory

Complete AMX example with runtime detection

// amx_ml_example.cpp
#include <immintrin.h>
#include <stdint.h>
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdlib>

// Runtime AMX detection
bool check_amx_support() {
    unsigned int eax, ebx, ecx, edx;

    // AMX feature bits are reported in CPUID leaf 0x7, subleaf 0x0, EDX
    __cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);
    bool has_tile = (edx & (1u << 24)) != 0;
    bool has_int8 = (edx & (1u << 25)) != 0;
    bool has_bf16 = (edx & (1u << 22)) != 0;

    std::cout << "AMX-TILE: " << (has_tile ? "Yes" : "No") << "\n";
    std::cout << "AMX-INT8: " << (has_int8 ? "Yes" : "No") << "\n";
    std::cout << "AMX-BF16: " << (has_bf16 ? "Yes" : "No") << "\n";

    return has_tile && has_int8 && has_bf16;
}

// AMX-BF16 matrix multiplication for neural network layers
// C = A × B where A and B are bfloat16; C accumulates in FP32
// (assumes B is pre-packed into the VNNI pair-interleaved layout expected
// by _tile_dpbf16ps)
void amx_bf16_matmul(const __bf16* A, const __bf16* B, float* C,
                     int M, int N, int K) {
    // Configure AMX tiles (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;    // palette 1

    // Tile 0: A block (16 rows × 64 bytes = 16×32 bf16 elements)
    tilecfg[16] = 64;  // bytes per row
    tilecfg[48] = 16;  // rows

    // Tile 1: B block (VNNI-packed, 16 rows × 64 bytes)
    tilecfg[18] = 64;
    tilecfg[49] = 16;

    // Tile 2: C accumulator (16 rows × 16 FP32 elements = 64 bytes per row)
    tilecfg[20] = 64;
    tilecfg[50] = 16;

    _tile_loadconfig(tilecfg);

    // Blocked matrix multiplication
    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            // Zero accumulator tile
            _tile_zero(2);

            // Inner product accumulation
            for (int k = 0; k < K; k += 32) {
                // Load A[i:i+16, k:k+32] into tile 0
                _tile_loadd(0, &A[i * K + k], K * sizeof(__bf16));

                // Load B[k:k+32, j:j+16] (transposed) into tile 1
                _tile_loadd(1, &B[k * N + j], N * sizeof(__bf16));

                // Compute: tile2 += tile0 × tile1 (BF16)
                _tile_dpbf16ps(2, 0, 1);
            }

            // Store the FP32 result from tile 2 to C[i:i+16, j:j+16]
            _tile_stored(2, &C[i * N + j], N * sizeof(float));
        }
    }

    _tile_release();
}

// AMX-INT8 quantized matrix multiplication for inference
// C = A × B where A, B are int8 and C is an int32 accumulator
// (assumes B is pre-packed into the VNNI layout expected by _tile_dpbssd)
void amx_int8_matmul(const int8_t* A, const int8_t* B, int32_t* C,
                     int M, int N, int K) {
    // Configure AMX tiles for INT8 (palette 1; bytes-per-row at offset 16, rows at offset 48)
    uint8_t tilecfg[64] = {0};
    tilecfg[0] = 1;    // palette 1

    // Tile 0: A block (16 rows × 64 int8 elements = 64 bytes per row)
    tilecfg[16] = 64;
    tilecfg[48] = 16;

    // Tile 1: B block (VNNI-packed, 16 rows × 64 bytes)
    tilecfg[18] = 64;
    tilecfg[49] = 16;

    // Tile 2: C accumulator (16 rows × 16 int32 elements = 64 bytes per row)
    tilecfg[20] = 64;
    tilecfg[50] = 16;

    _tile_loadconfig(tilecfg);

    for (int i = 0; i < M; i += 16) {
        for (int j = 0; j < N; j += 16) {
            _tile_zero(2);  // Zero accumulator

            for (int k = 0; k < K; k += 64) {
                // Load A[i:i+16, k:k+64]
                _tile_loadd(0, &A[i * K + k], K);

                // Load B[k:k+64, j:j+16] (transposed)
                _tile_loadd(1, &B[k * N + j], N);

                // Compute: tile2 += tile0 × tile1 (INT8)
                _tile_dpbssd(2, 0, 1);
            }

            // Store result (int32 accumulator)
            _tile_stored(2, &C[i * N + j], N * sizeof(int32_t));
        }
    }

    _tile_release();
}

// Reference implementation (for comparison)
void avx512_bf16_matmul(const __bf16* A, const __bf16* B, __bf16* C,
                        int M, int N, int K) {
    // Naive scalar triple loop kept simple for comparison; with
    // -O3 -march=sapphirerapids the compiler may auto-vectorise it to AVX-512
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                sum += (float)A[i * K + k] * (float)B[k * N + j];
            }
            C[i * N + j] = (__bf16)sum;
        }
    }
}

int main() {
    // Check AMX support
    std::cout << "Checking AMX support...\n";
    if (!check_amx_support()) {
        std::cerr << "AMX not supported on this system\n";
        return 1;
    }

    // Matrix dimensions (typical neural network layer)
    const int M = 1024;  // Batch size × sequence length
    const int N = 4096;  // Output features
    const int K = 2048;  // Input features

    // Allocate and initialize matrices
    __bf16* A_bf16 = (__bf16*)aligned_alloc(64, M * K * sizeof(__bf16));
    __bf16* B_bf16 = (__bf16*)aligned_alloc(64, K * N * sizeof(__bf16));
    float*  C_f32  = (float*)aligned_alloc(64, M * N * sizeof(float));    // AMX result (FP32 accumulator)
    __bf16* C_ref  = (__bf16*)aligned_alloc(64, M * N * sizeof(__bf16));  // reference result

    // Initialize with random values
    for (int i = 0; i < M * K; ++i) {
        A_bf16[i] = (__bf16)((float)rand() / RAND_MAX);
    }
    for (int i = 0; i < K * N; ++i) {
        B_bf16[i] = (__bf16)((float)rand() / RAND_MAX);
    }

    // Benchmark AMX-BF16
    std::cout << "\nBenchmarking AMX-BF16 matrix multiplication...\n";
    std::cout << "Matrix dimensions: " << M << " × " << K << " × " << N << "\n";

    const int iterations = 10;
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        amx_bf16_matmul(A_bf16, B_bf16, C_f32, M, N, K);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto amx_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    // Benchmark AVX-512 reference (for comparison)
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        avx512_bf16_matmul(A_bf16, B_bf16, C_ref, M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    auto avx512_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AMX-BF16 time: " << amx_time / iterations << " us per iteration\n";
    std::cout << "AVX-512 time: " << avx512_time / iterations << " us per iteration\n";
    if (avx512_time > 0) {
        std::cout << "Speedup: " << (double)avx512_time / amx_time << "x\n";
    }

    // INT8 quantized example
    std::cout << "\nBenchmarking AMX-INT8 quantized matrix multiplication...\n";

    int8_t* A_int8 = (int8_t*)aligned_alloc(64, M * K);
    int8_t* B_int8 = (int8_t*)aligned_alloc(64, K * N);
    int32_t* C_int32 = (int32_t*)aligned_alloc(64, M * N * sizeof(int32_t));

    // Initialize INT8 matrices (quantized values)
    for (int i = 0; i < M * K; ++i) {
        A_int8[i] = (int8_t)(rand() % 256 - 128);
    }
    for (int i = 0; i < K * N; ++i) {
        B_int8[i] = (int8_t)(rand() % 256 - 128);
    }

    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < iterations; ++iter) {
        amx_int8_matmul(A_int8, B_int8, C_int32, M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    auto int8_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout << "AMX-INT8 time: " << int8_time / iterations << " us per iteration\n";
    std::cout << "Throughput: " << (double)(M * N * K) / (int8_time / iterations) * 1e6 / 1e9
              << " GFLOPs\n";

    // Cleanup
    free(A_bf16);
    free(B_bf16);
    free(C_f32);
    free(C_ref);
    free(A_int8);
    free(B_int8);
    free(C_int32);

    return 0;
}

Compile with:

clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 \
        -mprefer-vector-width=512 \
        amx_ml_example.cpp -o amx_ml_example

Use cases for each AMX type

  1. AMX-BF16:
    • Neural network training with mixed precision
    • Inference with bfloat16 precision
    • Transformer models (attention mechanisms)
    • Large language model inference
  2. AMX-INT8:
    • Quantized neural network inference
    • Post-training quantization models
    • Edge AI inference
    • Maximum throughput inference workloads
  3. AMX-TILE:
    • Base infrastructure for both INT8 and BF16
    • Provides 8KB of tile register storage
    • Enables efficient matrix blocking strategies

Performance tips for AMX

  • Tile configuration: Set tile configuration once and reuse across multiple operations
  • Blocking strategy: Use 16×64 blocking for optimal tile utilisation
  • Memory alignment: Align all matrices to 64-byte boundaries
  • Threading: Use one thread per core; each thread has independent tile registers
  • NUMA awareness: For multi-socket systems, bind threads to local NUMA domain
  • Mixed precision: Use BF16 when precision allows; INT8 for maximum throughput

Integration with ML frameworks

Many ML frameworks automatically use AMX when available:

# TensorFlow with oneDNN AMX support
export TF_ENABLE_ONEDNN_OPTS=1
export ONEDNN_VERBOSE=1  # Enable verbose output to verify AMX usage

# PyTorch with oneDNN
export ONEDNN_VERBOSE=1

# Verify AMX is being used
# Look for "amx" in framework logs

Expected performance improvements:

  • AMX-BF16: 2-4x speedup over AVX-512 for large matrix multiplications
  • AMX-INT8: 4-8x speedup over AVX-512 for quantized inference
  • Suitable for: Large batch sizes, deep neural networks, transformer models

Example 7: Fortran compiler performance (flang, ifx, and gfortran)

Fortran code can be compiled with LLVM’s flang, Intel’s ifx, and GCC’s gfortran compilers. This example demonstrates Sapphire Rapids-specific optimisations available in Fortran.

Important

Compiler results are not directly comparable because:

  • flang (LLVM/21): Only generates AVX-256 (ymm) instructions, not AVX-512
  • ifx (Intel oneAPI): Generates AVX-512 (zmm) instructions
  • gfortran (GCC 15.1.0): Generates AVX-512 (zmm) instructions
  • These are separate benchmarks using different instruction sets

Source files:

  • sapphirerapids/fortran_avx512_example.f90 - AVX-512 vectorisation
  • sapphirerapids/fortran_openmp_example.f90 - OpenMP parallelization
  • sapphirerapids/fortran_mkl_example.f90 - Intel MKL integration

SLURM script: sapphirerapids/slurm_fortran_benchmarks.sh

Compilation with flang (LLVM/21):

module load llvm/21
flang -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
      fortran_avx512_example.f90 -o fortran_avx512_flang

Compilation with ifx (Intel oneAPI):

module load compiler-intel-llvm/2025.0.4
ifx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
    fortran_avx512_example.f90 -o fortran_avx512_ifx

Compilation with gfortran (GCC 15.1.0):

module load gcc/15.1.0
gfortran -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
         fortran_avx512_example.f90 -o fortran_avx512_gcc

Fortran features for Sapphire Rapids:

  1. AVX-512 vectorisation:
    • Use assumed-shape arrays (a(:)) for improved vectorisation hints
    • Compiler automatically vectorises simple loops with -march=sapphirerapids
    • flang (LLVM/21): Only generates AVX-256 (ymm) code, -mprefer-vector-width=512 flag is ignored
    • ifx (Intel oneAPI): Generates AVX-512 (zmm) code with -mprefer-vector-width=512, but may show performance overhead due to frequency scaling
  2. OpenMP parallelization:
    • Use !$omp parallel do directives for NUMA-aware parallelization
    • Set OMP_PLACES=cores and OMP_PROC_BIND=close for optimal placement
    • Both compilers support OpenMP 4.5+ features
  3. MKL integration:
    • Intel MKL can be called from Fortran using standard BLAS/LAPACK interfaces
    • Both compilers can link against MKL libraries
    • Use -qmkl=parallel with ifx or manual linking with flang

Important

Performance results (separate benchmarks - not comparable):

flang (LLVM/21) - AVX-256 only:

  • Generates AVX-256 (ymm) code only, does not generate AVX-512 (zmm) instructions
  • -mprefer-vector-width=512 flag is ignored with warning
  • Single precision (C_FLOAT): 2.93x speedup (vectorised vs scalar)
  • These results apply to AVX-256 vectorisation only

ifx (Intel oneAPI) - AVX-512:

  • Generates AVX-512 (zmm) code with -mprefer-vector-width=512
  • Single precision (C_FLOAT): 1.74x speedup (vectorised vs scalar)
  • AVX-512 code generation confirmed (zmm registers in assembly)
  • These results apply to AVX-512 vectorisation

ifx (Intel oneAPI) - AVX-256 (equivalent instruction set):

  • Can be forced to use AVX-256 with -mprefer-vector-width=256
  • Single precision (C_FLOAT): 1.49x speedup (vectorised vs scalar)
  • AVX-256 code generation confirmed (ymm registers in assembly and binary)
  • This configuration uses the same instruction set as flang (AVX-256), enabling direct comparison

gfortran (GCC 15.1.0) - AVX-512:

  • Generates AVX-512 (zmm) code with -march=sapphirerapids -mprefer-vector-width=512
  • Single precision (C_FLOAT): 14.08x speedup (vectorised vs scalar) - highest performance observed
  • AVX-512 code generation confirmed (zmm registers in assembly and binary)
  • AMX flags available: -mamx-tile, -mamx-int8, -mamx-bf16 (but AMX requires explicit intrinsics)
  • These results apply to AVX-512 vectorisation

Equivalent instruction set comparison (AVX-256):

  • flang (AVX-256): 2.93x speedup
  • ifx (AVX-256): 1.49x speedup
  • flang achieves 2.93x speedup compared to 1.49x for ifx when both use AVX-256
  • Both compilers use the same instruction set (AVX-256), enabling direct comparison

Note

Results using different instruction sets (AVX-256 vs AVX-512) should not be compared directly. For equivalent instruction set comparisons, use AVX-256 mode with -mprefer-vector-width=256.

  • OpenMP scaling: Both compilers demonstrate scaling with OpenMP
  • Code compatibility: The same source code compiles with both compilers
  • Compiler flags:
    • flang: -march=sapphirerapids (AVX-512 code generation not supported)
    • ifx: -march=sapphirerapids -mprefer-vector-width=512 (generates AVX-512)

Compiler-specific features:

  • flang (LLVM/21):
    • Uses -fopenmp flag for OpenMP
    • Manual MKL linking required
    • Compatible with modern Fortran standards
  • ifx (Intel oneAPI):
    • Uses -qopenmp flag for OpenMP
    • Simplified MKL linking with -qmkl=parallel
    • Integrated with Intel tools (VTune, Advisor)
  • gfortran (GCC 15.1.0):
    • Uses -fopenmp flag for OpenMP
    • Manual MKL linking required
    • Capability: Generates AVX-512 code with -march=sapphirerapids -mprefer-vector-width=512
    • AMX support: AMX flags available (-mamx-tile, -mamx-int8, -mamx-bf16) but AMX requires explicit intrinsics
    • Performance: 14.08x speedup with AVX-512 (single precision C_FLOAT) - highest performance observed

Important

Compiler performance results are not directly comparable because they use fundamentally different instruction sets (AVX-256 vs AVX-512). Each compiler’s results should be evaluated independently.

Example 8: C++ compiler comparison (clang++, g++, icpx)

This section compares C++ compilers (clang++, g++, and icpx) for AVX-512, AVX-256, and AMX support on Sapphire Rapids.

Compilers tested:

  • clang++ (LLVM/21): module load llvm/21
  • g++ (GCC 15.1.0): module load gcc/15.1.0
  • icpx (Intel oneAPI 2025.0.4): module load compiler-intel-llvm/2025.0.4

Test code: sapphirerapids/vectorized_compute.cpp

AVX-512 support

All three compilers support AVX-512 code generation:

Compilation flags for AVX-512:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
        vectorised_compute.cpp -o vectorised_compute_clang

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
    vectorised_compute.cpp -o vectorised_compute_gcc

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
     vectorised_compute.cpp -o vectorised_compute_icpx

AVX-512 performance results (10M elements, single precision):

Compiler              Speedup   Remarks
clang++ (LLVM/21)     1.25x     AVX-512 (zmm) confirmed in assembly
g++ (GCC 15.1.0)      0.96x     AVX-512 (zmm) confirmed, but slower than scalar
icpx (Intel oneAPI)   1.06x     AVX-512 (zmm) confirmed

Note

AVX-512 results show limited speedup due to CPU frequency scaling on Sapphire Rapids. The vectorised code uses zmm registers but may experience downclocking.

AVX-256 support (equivalent instruction set)

For equivalent instruction set comparison, all compilers can be forced to use AVX-256:

Compilation flags for AVX-256:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \
        vectorized_compute.cpp -o vectorized_compute_clang_avx256

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mprefer-vector-width=256 -fopenmp \
    vectorized_compute.cpp -o vectorized_compute_gcc_avx256

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mprefer-vector-width=256 -qopenmp \
     vectorized_compute.cpp -o vectorized_compute_icpx_avx256

AVX-256 performance results (10M elements, single precision):

Compiler              Speedup   Remarks
clang++ (LLVM/21)     1.28x     AVX-256 (ymm) confirmed
g++ (GCC 15.1.0)      1.02x     AVX-256 (ymm) confirmed
icpx (Intel oneAPI)   1.05x     AVX-256 (ymm) confirmed

clang++ achieves 1.28x speedup with AVX-256, followed by icpx (1.05x) and g++ (1.02x). All compilers use the same instruction set (AVX-256), enabling direct comparison.

AMX support

All three compilers support AMX flags, but AMX requires explicit intrinsics:

AMX compilation flags:

# clang++ (LLVM/21)
clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
        amx_ml_example.cpp -o amx_ml_example_clang

# g++ (GCC 15.1.0)
g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
    amx_ml_example.cpp -o amx_ml_example_gcc

# icpx (Intel oneAPI)
icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \
     amx_ml_example.cpp -o amx_ml_example_icpx

AMX Support Status:

Compiler              AMX Flags   AMX Intrinsics   Remarks
clang++ (LLVM/21)     Supported   Works            AMX intrinsics compile successfully
g++ (GCC 15.1.0)      Supported   Partial          vmovw (BF16) instruction not supported by assembler
icpx (Intel oneAPI)   Supported   Works            AMX intrinsics compile successfully

  • AMX flags enable AMX instruction support, but AMX is not auto-vectorised
  • AMX must be used via explicit intrinsics (e.g., _tile_loadd, _tile_dpbssd)
  • g++ has an assembler limitation with BF16 instructions (vmovw)
  • For AMX usage, clang++ and icpx are recommended

AVX-512:

  • All compilers generate AVX-512 code (zmm registers)
  • Performance limited by CPU frequency scaling
  • clang++ achieves 1.25x speedup

AVX-256 (Equivalent Instruction Set):

  • All compilers can be forced to use AVX-256
  • clang++ achieves 1.28x speedup
  • This configuration provides equivalent instruction sets for direct comparison

AMX:

  • All compilers support AMX flags
  • AMX requires explicit intrinsics (not auto-vectorised)
  • clang++ and icpx recommended for AMX code
  • g++ has BF16 instruction limitations

Recommendations:

  • For AVX-512: clang++ achieves 1.25x speedup
  • For AVX-256: clang++ achieves 1.28x speedup
  • For AMX: Use clang++ or icpx (g++ has limitations)
  • For equivalent instruction set comparisons: Use AVX-256 mode (-mprefer-vector-width=256)

Example 9: OpenMP library comparison

This section compares OpenMP libraries from different compiler suites and their support for AVX-512 and AMX SIMD.

OpenMP Libraries Tested:

  • libomp (LLVM/21): OpenMP 5.0, used with clang++ -fopenmp
  • libiomp5 (Intel oneAPI): Intel OpenMP, used with icpx -qopenmp
  • libgomp (GCC 15.1.0): GNU OpenMP 4.5, used with g++ -fopenmp

Test code: sapphirerapids/openmp_simd_test.cpp
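
The test source is not reproduced here; a hypothetical kernel of the kind this test measures (illustrative names, no timing code) could look like:

#include <cstddef>
#include <vector>

// SIMD-only variant: the pragma asks the compiler to vectorise the loop;
// with -mprefer-vector-width=512 the body maps to zmm instructions.
void saxpy_simd(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = 2.0f * a[i] + b[i];
}

// Parallel+SIMD variant: threads split the iteration space and each
// thread's chunk is vectorised.
void saxpy_parallel_simd(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = 2.0f * a[i] + b[i];
}

int main()
{
    const std::size_t n = 100'000'000;   // 100M elements, as in the test configuration
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    saxpy_simd(a.data(), b.data(), c.data(), n);
    saxpy_parallel_simd(a.data(), b.data(), c.data(), n);
    return 0;
}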

OpenMP libraries and versions

Compiler              OpenMP Library   OpenMP Version   Library Path
clang++ (LLVM/21)     libomp.so        5.0 (202011)     /opt/software/llvm/21/21.1.0/lib/x86_64-unknown-linux-gnu/libomp.so
icpx (Intel oneAPI)   libiomp5.so      5.0 (202011)     /opt/intel/oneapi/compiler/2025.0/lib/libiomp5.so
g++ (GCC 15.1.0)      libgomp.so.1     4.5 (201511)     /opt/software/gnu/gcc-15/gcc-15.1.0/lib64/libgomp.so.1

AVX-512 support in OpenMP SIMD

All three OpenMP libraries support AVX-512 SIMD vectorisation:

Compilation:

# clang++ with libomp
clang++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
        openmp_simd_test.cpp -o openmp_simd_test_clang

# icpx with libiomp5
icpx -O3 -march=sapphirerapids -mprefer-vector-width=512 -qopenmp \
     openmp_simd_test.cpp -o openmp_simd_test_icpx

# g++ with libgomp
g++ -O3 -march=sapphirerapids -mprefer-vector-width=512 -fopenmp \
    openmp_simd_test.cpp -o openmp_simd_test_gcc

AVX-512 Confirmation:

  • All three compilers generate AVX-512 (zmm) instructions in OpenMP SIMD loops
  • Assembly analysis confirms: vmovaps %zmm, vfmadd231ps %zmm, etc.
  • OpenMP SIMD pragmas successfully vectorise to AVX-512

Performance comparison

Test Configuration:

  • Array size: 100M elements (single precision)
  • Iterations: 10
  • Threads: 56 (one socket)

Results (Parallel+SIMD, 56 threads):

OpenMP Library     SIMD Speedup   Parallel+SIMD Speedup   Remarks
libomp (LLVM/21)   0.96x          22.04x                  Highest parallel performance (22.04x)
libiomp5 (Intel)   1.12x          6.26x                   Higher SIMD performance (1.12x), lower parallel scaling (6.26x)
libgomp (GCC)      1.29x          17.34x                  Highest SIMD-only performance (1.29x)

Key observations:

  • libomp (LLVM/21): Highest parallel+SIMD performance (22.04x), strong thread scaling
  • libgomp (GCC): Highest SIMD-only performance (1.29x), parallel scaling of 17.34x
  • libiomp5 (Intel): Moderate performance, lower parallel scaling than libomp

Thread scaling (Parallel+SIMD):

Threads   libomp (LLVM)   libiomp5 (Intel)   libgomp (GCC)
1         1.12x           1.11x              1.23x
28        9.36x           9.61x              9.33x
56        22.04x          6.26x              17.34x

libomp achieves the highest scaling to 56 threads, while libiomp5 shows reduced scaling at high thread counts.

AMX support in OpenMP

Important

OpenMP SIMD does not auto-vectorise to AMX instructions.

AMX Usage with OpenMP:

  • AMX requires explicit intrinsics (_tile_loadd, _tile_dpbssd, _tile_stored, etc.)
  • AMX intrinsics can be used within OpenMP parallel regions
  • OpenMP does not generate AMX code automatically from SIMD pragmas
  • AMX must be manually integrated into OpenMP parallel code

Example:

#pragma omp parallel
{
    // AMX intrinsics can be used here. Each hardware thread has its own
    // tile state, so a tile configuration (_tile_loadconfig) must be loaded
    // in every thread before the tile instructions are issued. The third
    // argument of _tile_loadd/_tile_stored is a stride in bytes; this
    // assumes int8 operands in A and B and an int32 accumulator in C.
    _tile_loadd(0, A, K * sizeof(int8_t));    // load tile 0 from A
    _tile_loadd(1, B, K * sizeof(int8_t));    // load tile 1 from B (packed layout)
    _tile_dpbssd(2, 0, 1);                    // tile 2 += tile 0 * tile 1 (int8 dot products)
    _tile_stored(2, C, N * sizeof(int32_t));  // store the int32 accumulator tile to C
}

Compilation with AMX:

# All compilers support AMX flags
clang++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
        amx_code.cpp -o amx_code_clang

icpx -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -qopenmp \
     amx_code.cpp -o amx_code_icpx

g++ -O3 -march=sapphirerapids -mamx-tile -mamx-int8 -mamx-bf16 -fopenmp \
    amx_code.cpp -o amx_code_gcc

AVX-512 support:

  • All three OpenMP libraries support AVX-512 SIMD vectorisation
  • OpenMP SIMD pragmas generate AVX-512 (zmm) instructions
  • Highest performance: libomp (22.04x parallel+SIMD speedup)

AMX support:

  • OpenMP SIMD does not auto-vectorise to AMX
  • AMX requires explicit intrinsics
  • AMX intrinsics can be used in OpenMP parallel regions
  • All compilers support AMX flags, but AMX must be manually integrated

Recommendations:

  • For highest parallel+SIMD performance: Use libomp (LLVM/21) with clang++
  • For highest SIMD-only performance: Use libgomp (GCC 15.1.0) with g++
  • For AMX: Use explicit intrinsics within OpenMP parallel regions
  • For AVX-512 SIMD: All three libraries work, choose based on parallel scaling needs

Example 10: C++ threads performance comparison

This section compares native C++ std::thread performance across different compilers for matrix multiplication.

Compilers tested:

  • clang++ (LLVM/21): module load llvm/21
  • icpx (Intel oneAPI 2025.0.4): module load compiler-intel-llvm/2025.0.4
  • g++ (GCC 15.1.0): module load gcc/15.1.0

Test code: sapphirerapids/cpp_threads_matmul.cpp
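
The benchmark source is not reproduced here. The following hedged sketch only illustrates the general approach it takes (a 2048x2048 single-precision matrix multiplication partitioned by rows across up to 56 native C++ threads); names are illustrative and timing/verification code is omitted:

#include <cstddef>
#include <thread>
#include <vector>

// Each thread computes a contiguous block of rows of C = A * B.
// The inner loop over j is written so that the compiler can auto-vectorise
// it with AVX-512 when -mprefer-vector-width=512 is used.
void matmul_rows(const float* A, const float* B, float* C,
                 std::size_t N, std::size_t row_begin, std::size_t row_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}

int main()
{
    const std::size_t N = 2048;          // matrix size used in the test
    const std::size_t num_threads = 56;  // maximum thread count used
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    std::vector<std::thread> workers;
    const std::size_t rows_per_thread = N / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * rows_per_thread;
        const std::size_t end   = (t + 1 == num_threads) ? N : begin + rows_per_thread;
        workers.emplace_back(matmul_rows, A.data(), B.data(), C.data(), N, begin, end);
    }
    for (auto& w : workers) w.join();
    return 0;
}

Compile the sketch with the flags shown in the Compilation subsection below (note the -pthread flag).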

Requirements:

  • C++17 standard (-std=c++17)
  • Native C++ threads (std::thread), not OpenMP
  • Maximum 56 threads
  • Matrix size: 2048x2048

Compilation

Compilation flags:

# clang++ (LLVM/21)
clang++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
        cpp_threads_matmul.cpp -o cpp_threads_matmul_clang

# icpx (Intel oneAPI)
icpx -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
     cpp_threads_matmul.cpp -o cpp_threads_matmul_icpx

# g++ (GCC 15.1.0)
g++ -std=c++17 -O3 -march=sapphirerapids -mprefer-vector-width=512 -pthread \
    cpp_threads_matmul.cpp -o cpp_threads_matmul_gcc

Note

The -pthread flag is required for C++ threads support.

AVX-512 vectorization

All three compilers generate AVX-512 code in the matrix multiplication kernel:

  • Assembly analysis confirms: vmovaps %zmm, vfmadd213ps %zmm, vfmadd231ps %zmm
  • The inner loop is auto-vectorised to use AVX-512 instructions
  • Each thread benefits from AVX-512 vectorisation

Performance results

Test Configuration:

  • Matrix size: 2048x2048 (single precision)
  • Iterations: 5
  • Thread counts: 1, 14, 28, 56

Performance Comparison (GFLOPS):

Threads   clang++ (LLVM/21)   icpx (Intel oneAPI)   g++ (GCC 15.1.0)
1         17.57 GFLOPS        17.34 GFLOPS          17.56 GFLOPS
14        206.99 GFLOPS       208.49 GFLOPS         204.04 GFLOPS
28        381.77 GFLOPS       319.33 GFLOPS         381.77 GFLOPS
56        423.15 GFLOPS       287.29 GFLOPS         401.40 GFLOPS

Speedup Comparison (relative to 1 thread):

Threads   clang++ (LLVM/21)   icpx (Intel oneAPI)   g++ (GCC 15.1.0)
1         1.00x               1.00x                 1.00x
14        11.78x              12.02x                11.62x
28        21.73x              18.42x                21.74x
56        24.09x              16.57x                22.86x

Single-threaded performance:

  • All compilers show similar single-threaded performance (~17.5 GFLOPS)
  • Differences are within measurement variance

Multi-threaded scaling:

  • clang++ (LLVM/21): Highest scaling to 56 threads (24.09x speedup, 423.15 GFLOPS)
  • g++ (GCC 15.1.0): Scaling of 22.86x speedup (401.40 GFLOPS)
  • icpx (Intel oneAPI): Shows performance degradation at 56 threads (16.57x speedup, 287.29 GFLOPS)

Key findings:

  1. clang++ achieves 423.15 GFLOPS at 56 threads
  2. g++ achieves 401.40 GFLOPS at 56 threads, lower than clang++
  3. icpx shows reduced performance at 56 threads, possibly due to thread contention or NUMA issues
  4. All compilers demonstrate strong scaling up to 28 threads
  5. AVX-512 vectorisation is used by all compilers in the inner loop

Threading Model:

  • Uses C++17 std::thread (native C++ threads)
  • Each thread processes a portion of rows
  • No OpenMP overhead - pure C++ threading
  • Thread creation and synchronization handled by C++ standard library

Comparison with OpenMP

C++ Threads vs OpenMP (56 threads, clang++):

  • C++ Threads: 423.15 GFLOPS (24.09x speedup)
  • OpenMP Parallel+SIMD: Similar performance range
  • C++ threads provide more control but require manual thread management
  • OpenMP provides easier parallelization with pragmas

Performance results:

  • clang++ (LLVM/21): 423.15 GFLOPS at 56 threads (24.09x speedup) - highest performance
  • g++ (GCC 15.1.0): 401.40 GFLOPS at 56 threads (22.86x speedup)
  • icpx (Intel oneAPI): 287.29 GFLOPS at 56 threads (16.57x speedup) - shows degradation

Recommendations:

  • For highest C++ threads performance: clang++ (LLVM/21) achieves 423.15 GFLOPS
  • For performance with simpler code: g++ (GCC 15.1.0) achieves 401.40 GFLOPS
  • For 28 threads or fewer: All compilers demonstrate acceptable performance
  • For 56 threads: clang++ or g++ recommended (icpx shows degradation)

AVX-512 support:

  • All compilers generate AVX-512 code in the matrix multiplication kernel
  • Each thread benefits from AVX-512 vectorisation
  • Performance scales well with thread count when using AVX-512

Runtime optimiser considerations

Intel processor optimisation features

Unlike AMD Zen2, where a runtime optimiser is suspected, Intel Sapphire Rapids has no documented embedded runtime optimiser that performs instruction-level optimisations at runtime. However, Intel processors include several hardware-level optimisation features:

  1. Out-of-Order Execution: The processor can reorder instructions at runtime to maximise instruction-level parallelism, but this is a standard feature of modern processors, not a specialised runtime optimiser.
  2. Hardware Prefetching: Aggressive hardware prefetchers that predict and prefetch data into cache, reducing memory latency.
  3. Branch Prediction: Sophisticated branch prediction units that minimize branch misprediction penalties.
  4. Turbo Boost: Dynamic frequency scaling based on workload and thermal headroom.
  5. Hyper-Threading (SMT): Simultaneous Multi-Threading that enables improved utilisation of execution units.

Implications for optimisation strategy

The lack of a runtime optimizer means:

  1. Compile-time optimizations are critical: Unlike Zen2, where -O2 and -O3 show minimal differences due to suspected runtime optimization, Sapphire Rapids benefits significantly from aggressive compile-time optimizations. -O3 typically provides 5-15% improvement over -O2 for compute-bound workloads.
  2. Explicit optimisation hints are valuable: Hints like __restrict__, explicit vectorisation, and loop unrolling provide measurable benefits because the compiler is the primary optimisation mechanism (see the sketch after this list).
  3. Profile-Guided Optimization is essential: PGO provides significant benefits (10-30%) because it guides compile-time optimizations based on actual runtime behaviour.
  4. Architecture-specific flags matter more: Flags like -march=sapphirerapids and -mprefer-vector-width=512 are critical for enabling hardware features that the compiler can utilise.
  5. Code layout optimizations: BOLT and other post-link optimizations are valuable because they optimize code layout based on runtime profiles, compensating for the lack of runtime optimization.
  6. Vectorization is compiler-dependent: Unlike systems with runtime optimizers that might optimize vectorisation at runtime, Sapphire Rapids relies entirely on compiler vectorisation. Explicit vectorisation hints and compiler flags are important.
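
As an illustration of point 2, a short hypothetical kernel using __restrict__ and explicit 64-byte alignment is shown below (not taken from the provided test codes):

#include <cstddef>

// Without __restrict__ the compiler must assume that 'out' could alias
// 'in_a' or 'in_b', which can block vectorisation of the loop. The
// qualifier is a promise by the programmer that the arrays do not overlap.
void scale_add(float* __restrict__ out,
               const float* __restrict__ in_a,
               const float* __restrict__ in_b,
               std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = 3.0f * in_a[i] + in_b[i];
}

int main()
{
    constexpr std::size_t n = 1 << 20;
    // 64-byte alignment (cache-line and zmm width) is another explicit hint.
    alignas(64) static float a[n], b[n], c[n];
    scale_add(c, a, b, n);
    return 0;
}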

Intel Sapphire Rapids processors do not include an embedded runtime optimiser of the kind AMD Zen2 is suspected to have. This means:

  • Compile-time optimisations are the primary mechanism for performance improvements
  • -O3 provides significant benefits over -O2 for compute-bound workloads (typically 5-15%)
  • Explicit optimisation hints (__restrict__, vectorisation, etc.) provide measurable benefits
  • Profile-Guided Optimisation is essential and provides 10-30% improvements
  • Architecture-specific flags are critical for enabling hardware features
  • Post-link optimisations (BOLT) are valuable for code layout optimisation

The optimisation strategy for Sapphire Rapids should focus on aggressive compile-time optimisations, PGO, and architecture-specific flags rather than relying on runtime optimisation capabilities.

Benchmark results summary

Measured performance results from the optimisation examples on Intel Xeon Platinum 8480C (Sapphire Rapids):

Measured speedups

  1. Vectorization with AVX-512: 1.29x speedup (LLVM/21) vs 1.01x (Intel oneAPI 2025.0.4)
    • LLVM shows better vectorisation optimisation
    • Both compilers produce correct results (checksums match)
  2. Combined optimisations (AVX-512 + restrict + alignment): 1.27x speedup (LLVM/21) vs 1.03x (Intel oneAPI)
    • LLVM better at combining multiple optimisations
    • Intel compiler shows more conservative optimisation
  3. Restrict pointer optimisation: 1.28x speedup (LLVM/21) vs 1.22x (Intel oneAPI)
    • Both compilers benefit from __restrict__ hints
    • LLVM shows slightly better alias analysis optimisation
  4. Cache-aware data layout: 8.05x speedup (LLVM/21) vs 8.09x (Intel oneAPI); see the structure-of-arrays sketch after this list
    • Both compilers demonstrate strong performance for memory layout optimisations
    • This is the largest speedup category (memory-bound optimisation)
  5. Memory alignment: 7.48x speedup (LLVM/21) vs 5.20x (Intel oneAPI)
    • Both compilers benefit significantly from proper alignment
    • LLVM shows better utilisation of aligned memory access
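
The cache-layout gains in item 4 come from arranging data so that hot loops stream over contiguous memory. A hypothetical sketch contrasting an array-of-structures layout with a structure-of-arrays layout (not the cache_layout test source) is shown below:

#include <cstddef>
#include <vector>

// Array of structures (AoS): fields of different elements are interleaved,
// so a loop that touches only 'x' still drags the other fields into cache.
struct ParticleAoS { float x, y, z, w; };

float sum_x_aos(const std::vector<ParticleAoS>& p)
{
    float s = 0.0f;
    for (const auto& e : p) s += e.x;
    return s;
}

// Structure of arrays (SoA): each field is contiguous, so streaming over
// 'x' uses every byte of every cache line and vectorises cleanly.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
};

float sum_x_soa(const ParticlesSoA& p)
{
    float s = 0.0f;
    for (float v : p.x) s += v;
    return s;
}

int main()
{
    const std::size_t n = 10'000'000;
    std::vector<ParticleAoS> aos(n);
    ParticlesSoA soa;
    soa.x.assign(n, 0.0f); soa.y.assign(n, 0.0f);
    soa.z.assign(n, 0.0f); soa.w.assign(n, 0.0f);
    volatile float sink = sum_x_aos(aos) + sum_x_soa(soa);
    (void)sink;
    return 0;
}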

Compiler comparison: LLVM/21 vs Intel oneAPI 2025.0.4

The examples were tested with both LLVM/21 (clang++) and Intel oneAPI 2025.0.4 (icpx) compilers. Results show:

LLVM/21 Advantages:

  • Better vectorisation optimisation (1.29x vs 1.01x for AVX-512)
  • Better combined optimisation performance (1.27x vs 1.03x)
  • Better memory alignment utilisation (7.48x vs 5.20x)
  • More aggressive optimisation with -O3

Intel oneAPI Advantages:

  • Slightly better cache layout optimisation (8.09x vs 8.05x)
  • More conservative optimisation may be beneficial for stability
  • Better integration with Intel-specific tools (VTune, oneDNN)

Code Compatibility:

  • All examples compile and run with both compilers using the same source code
  • No code modifications needed between compilers
  • Both compilers support the same optimisation flags (-march=sapphirerapids, -mprefer-vector-width=512, etc.)

Recommendations:

  • Use LLVM/21 for maximum performance on compute-bound workloads
  • Use Intel oneAPI when integration with Intel tools (VTune, oneDNN) is required
  • Test both compilers for your specific workload to determine the appropriate choice

Performance comparison table

The following table shows measured performance differences between LLVM/21 and Intel oneAPI 2025.0.4 compilers:

Optimisation Type                                 Example                  LLVM/21 Speedup   Intel oneAPI Speedup   Performance Difference   Remarks
AVX-512 Vectorization                             vectorized_compute       1.29x             1.01x                  +27.7%                   LLVM shows significantly better vectorization
Combined Optimisations                            combined_optimization    1.27x             1.03x                  +23.3%                   LLVM better at combining optimisations
Restrict Pointers                                 restrict_example         1.28x             1.22x                  +4.9%                    Both compilers benefit, LLVM slightly better
Cache Layout                                      cache_layout             8.05x             8.09x                  -0.5%                    Essentially equivalent performance
Memory Alignment                                  memory_alignment         7.48x             5.20x                  +43.8%                   LLVM shows higher alignment utilisation
MKL DGEMM (2048x2048)                             mkl_benchmark            99.75 GFLOPS      99.88 GFLOPS           -0.1%                    Essentially equivalent (MKL is pre-compiled library)
MKL DGEMM OpenMP (2048x2048, 56 threads)          mkl_openmp_example       ~2800 GFLOPS      ~2850 GFLOPS           -1.8%                    Both achieve strong scaling with OpenMP threading
MKL DGEMM MPI (2 processes, 56 threads/process)   mkl_mpi_example          3193 GFLOPS       3192 GFLOPS            +0.03%                   Essentially equivalent performance, both demonstrate scaling with MPI
oneDNN MatMul (2048x2048)                         onednn_benchmark         4278 GFLOPS       4022 GFLOPS            +6.4%                    LLVM shows higher performance for user code, oneDNN library is pre-compiled
Fortran AVX-256 (100M elements, flang only)       fortran_avx512_example   2.97x speedup     N/A                    N/A                      flang AVX-256 only (single precision C_FLOAT), not comparable to ifx
Fortran AVX-512 (100M elements, ifx only)         fortran_avx512_example   N/A               1.74x speedup          N/A                      ifx AVX-512 (single precision C_FLOAT), not comparable to flang

Performance Difference = ((LLVM Speedup - Intel Speedup) / Intel Speedup) × 100%

Note

For MKL benchmarks, performance is measured in GFLOPS rather than speedup, as MKL is a pre-compiled optimized library. Both compilers achieve similar performance since MKL routines are independent of the compiler used. OpenMP and MPI variants demonstrate strong scaling characteristics with both compilers. Results shown are from local testing; for production runs, use the provided SLURM job scripts.

Performance analysis by category

Compute-bound optimisations (vectorisation, combined):

  • LLVM/21 shows 23-28% better performance for compute-bound workloads
  • Intel oneAPI is more conservative, showing minimal speedup (1-3%)
  • Recommendation: Use LLVM/21 for compute-intensive applications

Memory-bound optimisations (cache, alignment):

  • Both compilers show excellent cache layout optimisation (8x speedup)
  • LLVM/21 shows 44% better performance for memory alignment
  • Recommendation: Both compilers work well, but LLVM has an edge for alignment-sensitive code

Pointer aliasing (restrict):

  • Both compilers benefit from __restrict__ hints
  • LLVM/21 shows slightly better optimisation (4.9% difference)
  • Recommendation: Either compiler works well, LLVM has a small advantage

Overall performance summary

Category                        LLVM/21 Performance                                   Intel oneAPI Performance
Compute-bound                   23-28% higher speedup                                 1-3% speedup (conservative)
Memory-bound                    44% higher speedup (alignment)                        0.5% higher speedup (cache layout)
MKL Integration                 Manual linking required                               Simplified -qmkl flag
MKL Performance (Sequential)    99.75 GFLOPS (DGEMM 2048x2048)                        99.88 GFLOPS (DGEMM 2048x2048)
MKL with OpenMP                 Uses -lmkl_gnu_thread -liomp5                         Uses -qmkl=parallel (Intel OpenMP)
MKL with MPI                    Manual MPI linking, set I_MPI_CXX=clang++             Automatic with -qmkl=parallel, set I_MPI_CXX=icpx
oneDNN Performance              4278 GFLOPS (MatMul 2048x2048)                        4022 GFLOPS (MatMul 2048x2048)
Fortran Compilers               flang (LLVM/21) with -fopenmp                         ifx (Intel oneAPI) with -qopenmp
Fortran Performance (AVX-256)   2.97x speedup (single precision C_FLOAT)              N/A (ifx generates AVX-512, not AVX-256)
Fortran Performance (AVX-512)   N/A (flang does not generate AVX-512)                 1.74x speedup (single precision C_FLOAT)
Note                            Results not comparable - different instruction sets   Results not comparable - different instruction sets
Tool Integration                Standard LLVM toolchain                               Intel VTune, oneDNN integration

General performance characteristics

  1. Optimisation level -O3 provides significant benefits over -O2 for compute-bound workloads (typically 5-15% improvement).
  2. Data layout optimisation provides the largest performance improvements. Cache-aware data structure design shows 8x speedup in benchmarks, exceeding other optimisation techniques.
  3. Profile-Guided Optimisation (PGO) provides 10-30% performance gains with proper profiling workflows.
  4. Use -march=sapphirerapids to enable architecture-specific optimisations including AVX-512 and AMX. For ML/AI workloads, AMX provides 2-8x speedup over AVX-512 for matrix operations. Use -mamx-tile -mamx-int8 -mamx-bf16 to enable all AMX types.
  5. Sapphire Rapids supports AVX-512. Use 512-bit vectors with -mprefer-vector-width=512 for compute-bound workloads. For memory-bound code, 256-bit vectors may be preferable to reduce register pressure.
  6. Loop unrolling should be tuned based on instruction cache capacity. Profile to find optimal unroll factor.
  7. For mixed workloads, blend PGO profiles by weighting representative workloads appropriately.
  8. __restrict__ benefits are significant for complex pointer patterns (1.2-1.3x speedup observed). Profile to identify where alias analysis limits optimisation.
  9. Memory alignment provides significant performance improvements (5-7x speedup), enabling vectorisation and reducing cache penalties (see the aligned-allocation sketch after this list).
  10. Dual-socket systems have 2 NUMA domains (one per socket). Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains.
  11. Thread affinity binding is workload-dependent. For single-process workloads, OS scheduling often performs well, but explicit CPU affinity binding may be valuable for multi-process applications and NUMA-aware code.
  12. Optimisation priorities: data layout (8x), memory alignment (5-7x), NUMA awareness, and memory access patterns provide larger performance gains than micro-optimisations (1.2-1.3x).
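
As an illustration of item 9, a minimal sketch of 64-byte aligned allocation with C++17 std::aligned_alloc is shown below (hypothetical, not one of the provided test codes):

#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free (C++17)

int main()
{
    const std::size_t n = 1 << 24;                // 16M floats
    const std::size_t bytes = n * sizeof(float);  // 64 MiB, a multiple of the 64-byte alignment

    // 64-byte alignment matches both the cache-line size and the width of a
    // zmm register, so the compiler can emit aligned AVX-512 loads/stores.
    float* a = static_cast<float*>(std::aligned_alloc(64, bytes));
    float* b = static_cast<float*>(std::aligned_alloc(64, bytes));
    if (!a || !b) return 1;

    for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<float>(i);
    for (std::size_t i = 0; i < n; ++i) b[i] = 2.0f * a[i];   // streaming, vectorisable loop

    std::free(a);
    std::free(b);
    return 0;
}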