AMD Zen2 Optimization Guide (Discoverer CPU partition)

Introduction

This document describes compilation and execution practices for AMD Zen2 microarchitecture systems. Zen2 processors (EPYC 7002 series, Ryzen 3000 series) have specific characteristics that affect performance.

The code examples and optimization techniques explained in this document are applicable to the Discoverer Petascale Supercomputer (CPU partition). Discoverer compute nodes are equipped with 2 × AMD EPYC 7H12 64-Core Processors, presenting 8 NUMA domains per node with 16 cores per domain, totaling 128 cores per node with SMT (Simultaneous Multi-Threading) providing 256 threads per node. For detailed hardware information about Discoverer, see the Resource Overview documentation.

All compilation and code execution must occur on compute nodes, and the only way to access compute nodes is through SLURM batch jobs. Compiling or running code directly on the login nodes is not permitted. All examples in this document must be submitted as SLURM batch jobs using the provided SLURM scripts in the zen2/ directory. The zen2/ folder is located at /opt/software/optimisations/zen2/ on Discoverer and is also available online at https://gitlab.discoverer.bg/vkolev/snippets/-/tree/main/zen2. To reproduce the benchmark results, copy this folder to a project directory and run the SLURM batch scripts from within the copied folder. The compilation commands shown in the examples are executed within SLURM batch jobs, not on login nodes.

Zen2 architecture overview

AMD Zen2 microarchitecture (codenamed “Rome” for EPYC and “Matisse” for Ryzen) is a 7nm process node design introduced in 2019. Zen2 processors implement a chiplet-based architecture with multiple Core Complex Dies (CCDs) connected via Infinity Fabric.

The core architecture consists of Core Complexes (CCX), where each CCX contains 4 CPU cores sharing a 16MB L3 cache. Each core has dedicated L1 and L2 caches. The cache hierarchy includes 32KB L1D (data cache) and 32KB L1I (instruction cache) per core, 512KB L2 cache per core, and 16MB L3 cache shared among the 4 cores within a CCX.

Zen2 cores feature a 7-wide instruction dispatch pipeline with dual 256-bit FMA (Fused Multiply-Add) units per core, enabling simultaneous execution of two 256-bit vector operations. The architecture supports AVX2 instructions but does not support AVX-512. Each core includes a 4K-entry µOP cache that stores decoded micro-operations to reduce decode latency for frequently executed code paths.

The branch prediction unit uses a sophisticated multi-level predictor with TAGE (Tagged Geometric History Length) algorithm, providing high accuracy for branch prediction. Memory disambiguation capabilities allow the processor to detect and handle memory dependencies effectively, enabling out-of-order execution optimizations.

For multi-socket systems like AMD EPYC 7002 series, each socket contains multiple CCDs connected via Infinity Fabric. Each socket presents as multiple NUMA domains, with memory controllers distributed across the socket. On Discoverer compute nodes, there are 8 NUMA domains per node with 16 cores per domain, totaling 128 cores per node with SMT (Simultaneous Multi-Threading) providing 256 threads per node.

Optimization levels: -O2 vs -O3

Note

While AMD does not officially document this, there is strong suspicion that the Zen2 microarchitecture has an embedded runtime optimizer. This makes the performance difference between -O2 and -O3 almost insignificant for most code.

Recommendations

  • Use -O2 for most production builds: Provides balanced optimization without excessive code bloat
  • Reserve -O3 for specific hot paths: Only when profiling shows clear benefits
  • Focus on other optimizations: PGO, LTO, and architecture-specific flags provide more benefit than -O3 alone

The runtime optimizer handles many optimizations that -O3 performs at compile time, making aggressive compile-time optimizations less necessary.

CPU-specific compilation flags

Architecture targeting

# Use -march=znver2 (not just -mtune) to enable Zen2-specific instructions
-march=znver2

# This enables:
# - AVX2 (256-bit vectors)
# - BMI2 (bit manipulation)
# - CLZERO (cache line zero)
# - Other Zen2-specific instruction sets

Vector width optimization

# Optimal vector width for Zen2 is 256-bit vectors (no AVX-512 support)
# Note: Explicit vectorization may provide minimal benefit when the compiler
# already optimizes effectively at -O2 (see Example 1 benchmark results)
-mprefer-vector-width=256

# Ensures vectorized math uses FMA instructions
-mfma

Loop and alignment

# Zen2 fetches 32-byte aligned chunks
-falign-loops=32
-falign-functions=32

LLVM-specific optimizations

# Enable loop interchange (benefits from excellent branch prediction)
-mllvm -enable-loopinterchange

# Tune prefetch distance (Zen2 has aggressive prefetchers)
-mllvm -prefetch-distance=128

# Enable LLVM's new pass manager (note: -enable-npm selects the pass manager,
# not NUMA placement; recent LLVM releases already use it by default)
-mllvm -enable-npm=true

What to avoid

  • Excessive loop unrolling: Zen2’s µOP cache is 4K entries; over-unrolling exceeds cache capacity and degrades performance
  • Over-aggressive -ffast-math: Test carefully; Zen2’s FP units are strong but precision must be considered
  • Generic -march flags: Target znver2 specifically for architecture-specific optimizations

Profile-guided optimization (PGO)

On Zen2, PGO typically yields 15-25% performance improvements for branch-heavy workloads, and it composes well with LTO and BOLT.

PGO benefits for Zen2

Given Zen2’s sophisticated branch predictor and µOP cache, PGO provides outsized benefits because:

  • Optimizes for actual branch patterns
  • Better code layout reduces µOP cache thrashing
  • Hot/cold splitting keeps working set in L2/L3

PGO workflow

# Step 1: Instrumentation build
clang++ -fprofile-generate -march=znver2 -O2 -flto=thin \
        -mprefer-vector-width=256 \
        source.cpp -o program

# Step 2: Run representative workloads
# In SLURM batch job:
./program < typical_input_1
./program < typical_input_2
./program < typical_input_3

# Step 3: Merge profiles (if multiple runs)
llvm-profdata merge -o final.profdata default.profraw

# Step 4: Optimized build with profile
clang++ -fprofile-use=final.profdata -march=znver2 -O2 \
        -flto=thin -mprefer-vector-width=256 \
        source.cpp -o program_optimized

Blended profiles for mixed workloads

For diverse customer workloads, create weighted blended profiles:

# Collect profiles from multiple workloads
llvm-profdata merge -o workload_A.profdata default.profraw_A
llvm-profdata merge -o workload_B.profdata default.profraw_B
llvm-profdata merge -o workload_C.profdata default.profraw_C

# Merge with weights based on importance/frequency
llvm-profdata merge \
    -weighted-input=3,workload_A.profdata \
    -weighted-input=2,workload_B.profdata \
    -weighted-input=1,workload_C.profdata \
    -o final_blended.profdata

Memory optimizations

Cache-aware compilation

# Better cache utilization through section elimination
-fdata-sections -ffunction-sections

# Linker garbage collection (use with above flags)
-Wl,--gc-sections

Structure and data layout

  • Pack hot data structures to fit within 32KB L1 cache
  • Consider __restrict__ for pointer aliasing hints (Zen2 has strong memory disambiguation, but modern compilers with -O2 already perform effective alias analysis, so explicit __restrict__ may provide minimal benefit - see Example 4)
  • Align data structures to cache line boundaries (64 bytes); a short sketch of these layout ideas follows this list
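
A minimal sketch of these layout ideas (the struct and field names are hypothetical): hot fields are packed into a small struct that streams well through the 32KB L1D, cold metadata lives in a separate array, and a per-thread accumulator is aligned to a 64-byte cache line to avoid false sharing.

// layout_sketch.cpp -- illustrative only; struct and field names are hypothetical
#include <cstddef>
#include <vector>

// Hot fields packed together: 4 floats = 16 bytes, so 4 structs fit in one 64-byte cache line
struct ParticleHot {
    float x, y, z;
    float mass;
};

// Rarely used metadata kept in a separate ("cold") array
struct ParticleCold {
    long long id;
    char label[48];
};

// Align per-thread accumulators to a cache line (64 bytes) so two threads
// never write to the same line (avoids false sharing between cores)
struct alignas(64) PaddedSum {
    double value = 0.0;
};

double total_mass(const std::vector<ParticleHot>& hot) {
    double sum = 0.0;
    // Streaming over the packed hot array touches only the data the loop needs,
    // keeping the working set close to the 32 KB L1D per core
    for (const auto& p : hot) sum += p.mass;
    return sum;
}

int main() {
    std::vector<ParticleHot> hot(1'000'000, ParticleHot{1.f, 2.f, 3.f, 0.5f});
    std::vector<ParticleCold> cold(1'000'000); // untouched by the hot loop
    (void)cold;
    PaddedSum s;
    s.value = total_mass(hot);
    return s.value > 0 ? 0 : 1;
}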

Huge pages

# Enable transparent huge pages in madvise mode (requires root;
# on shared systems such as Discoverer this is an administrator setting)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# In code, use madvise for large allocations
madvise(large_buffer, size, MADV_HUGEPAGE);
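
A self-contained sketch of the madvise approach (Linux-specific; the 1 GiB size is arbitrary): the buffer is obtained with mmap, which returns page-aligned memory the kernel can back with 2 MiB huge pages once MADV_HUGEPAGE has been requested.

// thp_madvise_sketch.cpp -- illustrative sketch of MADV_HUGEPAGE usage (Linux)
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    const std::size_t size = 1ull << 30; // 1 GiB, large enough to benefit from 2 MiB pages
    // Anonymous mapping; mmap returns page-aligned memory suitable for THP
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Ask the kernel to back this range with transparent huge pages
    // (effective when transparent_hugepage/enabled is set to 'madvise' or 'always')
    if (madvise(buf, size, MADV_HUGEPAGE) != 0) {
        std::perror("madvise(MADV_HUGEPAGE)"); // non-fatal: falls back to 4 KiB pages
    }

    std::memset(buf, 0, size); // touch the pages so they are actually allocated
    munmap(buf, size);
    return 0;
}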

Memory allocators

Consider replacing default allocator with:

  • jemalloc: Improved performance for concurrent workloads
  • tcmalloc: Thread-caching allocator with good throughput for allocation-heavy multithreaded code
  • mimalloc: Low overhead, suitable for mixed workloads

Mixed workload strategy

For systems serving diverse customer workloads, use a balanced approach:

Conservative optimization flags

  • -O2 instead of -O3: More balanced, avoids excessive code bloat
  • -flto=thin: Faster, more predictable performance across workload variations
  • PGO with blended profiles: Weighted combination of representative workloads

Function multi-versioning

For hot paths, use target clones:

__attribute__((target_clones("default","avx2","bmi2")))
void process_data(/* params */) {
    // Hot path that benefits from different optimizations
    // Runtime dispatcher selects the best version
}
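
A runnable sketch of the same technique (function and data names are illustrative, and only the default and avx2 targets are listed here): the compiler emits one clone per listed target plus a resolver that picks the best clone at load time, so a single binary runs correctly on generic x86-64 hosts and uses AVX2 on Zen2 nodes.

// multiversion_sketch.cpp -- requires GCC or Clang with target_clones support on x86-64
#include <cstddef>
#include <cstdio>
#include <vector>

// One clone is built per target; the dynamic resolver selects the AVX2 clone on Zen2
__attribute__((target_clones("default", "avx2")))
void scale_add(float* dst, const float* src, float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = dst[i] * factor + src[i];   // auto-vectorized in the avx2 clone
}

int main() {
    std::vector<float> dst(1 << 20, 1.0f), src(1 << 20, 2.0f);
    scale_add(dst.data(), src.data(), 1.5f, dst.size());
    std::printf("dst[0] = %f\n", dst[0]);    // 1.0 * 1.5 + 2.0 = 3.5
    return 0;
}

Compiling with a plain clang++ -O2 is sufficient here; each clone carries its own target features, so no global -march flag is required for the dispatched function.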

Split optimization by code characteristics

# Hot paths (identified via profiling)
set_source_files_properties(hot_path.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=znver2 -fprofile-use=hot.profdata")

# Cold paths
set_source_files_properties(general_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v3")

# Core libraries
set_source_files_properties(core_lib.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=znver2 -flto=full")

# Customer-facing code
set_source_files_properties(api_code.cpp PROPERTIES
    COMPILE_FLAGS "-O2 -march=x86-64-v3 -flto=thin")

Practical build configuration

Complete example: core library build (in SLURM batch job)

# In SLURM batch job:
clang++ -O2 \
        -march=znver2 \
        -mprefer-vector-width=256 \
        -mfma \
        -falign-loops=32 \
        -falign-functions=32 \
        -fdata-sections \
        -ffunction-sections \
        -fno-semantic-interposition \
        -fno-plt \
        -flto=thin \
        -fprofile-use=blended.profdata \
        -mllvm -enable-loopinterchange \
        -mllvm -prefetch-distance=128 \
        source.cpp -o program \
        -fuse-ld=lld \
        -Wl,--gc-sections \
        -Wl,--icf=safe

CMake configuration

set(CMAKE_C_COMPILER clang)
set(CMAKE_CXX_COMPILER clang++)

# C++ standard (C++17, C++20, or C++23)
# Note: Standard version has minimal impact on performance for equivalent code
# C++20 ranges may have overhead compared to traditional loops
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Base flags
set(CMAKE_C_FLAGS_RELEASE "-O2 -march=znver2 -mprefer-vector-width=256")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")

# LTO
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -flto=thin")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -flto=thin")

# PGO (if profile available)
if(EXISTS "${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -fprofile-use=${CMAKE_SOURCE_DIR}/final.profdata")
endif()

# Linker
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -fuse-ld=lld")

Runtime considerations

NUMA awareness

For multi-CCX or multi-socket Zen2 systems, NUMA awareness affects performance:

Understanding Zen2 NUMA topology

  • Each CCX (Core Complex) contains 4 cores sharing 16MB L3 cache
  • Multiple CCXs are grouped into NUMA domains (each NUMA domain contains 4 CCXs on Discoverer)
  • Remote memory access (cross-CCX or cross-NUMA domain) has higher latency than local access
  • Discoverer compute nodes configuration (from SLURM):
    • 8 NUMA domains (SLURM refers to these as “sockets”)
    • 16 cores per NUMA domain
    • 2 threads per core (SMT enabled)
    • Total: 8 × 16 × 2 = 256 threads per node
  • Each NUMA domain contains 4 CCXs (16 cores = 4 CCXs × 4 cores)
  • For best locality, bind processes and their memory to specific NUMA domains

NUMA binding strategies

# Bind to specific NUMA node (memory and CPU)
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Bind to specific CCX (4 cores)
# In SLURM batch job:
srun --cpu-bind=map_cpu:0,1,2,3 ./program   # First CCX
srun --cpu-bind=map_cpu:4,5,6,7 ./program   # Second CCX

# Interleave memory across all NUMA nodes (for large datasets)
# In SLURM batch job:
srun numactl --interleave=all ./program

# Prefer specific NUMA node but allow fallback
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --preferred=0 ./program

NUMA configuration for Zen2

  1. Bind threads to cores within the same CCX: Reduces L3 cache contention (4 cores per CCX)
  2. Allocate memory on local NUMA node: Use SLURM --cpu-bind=sockets with numactl --membind or mbind() in code (a first-touch sketch follows this list)
  3. For dual-socket systems: Bind processes to specific sockets to avoid remote memory access
  4. For multi-process applications: Use SLURM --ntasks=N with --sockets-per-node=N (one process per socket) with local memory allocation
  5. Monitor NUMA statistics: Use numastat and perf stat with the node-loads/node-load-misses events
  6. Discoverer compute nodes: 8 NUMA domains, 16 cores per domain. Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains
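
A minimal first-touch sketch (thread count and array size are illustrative): under Linux's default memory policy, a page is placed on the NUMA node of the thread that first writes it, so initializing each chunk from the pinned thread that will later process it keeps accesses local without calling mbind() explicitly.

// first_touch_sketch.cpp -- illustrative first-touch NUMA placement (Linux, compile with -pthread)
#include <sched.h>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

// Pin the calling thread to one CPU (same approach as Example 12)
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

int main() {
    const std::size_t n = 64'000'000;                  // ~512 MB of doubles
    const int nthreads = 4;                            // e.g. one CCX
    // Allocate without initializing: the pages are not placed on any NUMA node yet
    std::unique_ptr<double[]> data(new double[n]);

    // First touch: each pinned thread writes its own chunk, so those pages are
    // allocated on that thread's local NUMA node under Linux's default policy
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            pin_to_cpu(t);                             // CPUs 0-3 share one CCX / NUMA domain
            const std::size_t lo = t * (n / nthreads);
            const std::size_t hi = (t == nthreads - 1) ? n : (t + 1) * (n / nthreads);
            for (std::size_t i = lo; i < hi; ++i) data[i] = 0.0;
        });
    }
    for (auto& w : workers) w.join();

    // Later compute phases should keep the same thread-to-chunk mapping
    std::printf("initialized %zu doubles with first-touch placement\n", n);
    return 0;
}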

Example: NUMA-optimized execution

Execution with numactl (from within a SLURM batch job):

# Check NUMA topology
numactl --hardware

# Single NUMA domain binding (for single-process)
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Single CCX binding (for small workloads)
# In SLURM batch job:
srun --cpu-bind=map_cpu:0,1,2,3 numactl --membind=0 --cpunodebind=0 ./program

# Multiple NUMA domains: one process per domain
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./process1 &
srun --cpu-bind=sockets:1-1 numactl --membind=1 --cpunodebind=1 ./process2 &
wait   # wait for both background job steps to finish

# Monitor NUMA performance
# In SLURM batch job:
perf stat -e node-loads,node-load-misses ./program
numastat  # Show NUMA allocation statistics

SLURM execution (Discoverer compute nodes):

# Single NUMA domain (for single-process)
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --cpus-per-task=32
srun --cpu-bind=sockets:0-0 ./program

# Multiple NUMA domains (one task per domain)
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --ntasks=2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=16
srun --cpu-bind=sockets ./program

# Explicit NUMA domain binding with numactl
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

# Check SLURM CPU binding
srun --cpu-bind=sockets:0-0 numactl --hardware

Environment variables

Make optimization thresholds runtime-configurable:

# Example: Tunable buffer sizes
export BUFFER_SIZE=1048576
export PARALLELISM_THRESHOLD=1000
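
A short sketch of how an application might read such tunables at startup (the variable names follow the example above; the helper function is hypothetical):

// tunables_sketch.cpp -- read runtime-tunable thresholds from the environment
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Return the value of an environment variable as size_t, or a default if unset/invalid
static std::size_t env_or_default(const char* name, std::size_t fallback) {
    const char* value = std::getenv(name);
    if (!value) return fallback;
    try {
        return static_cast<std::size_t>(std::stoull(value));
    } catch (...) {
        return fallback;
    }
}

int main() {
    const std::size_t buffer_size = env_or_default("BUFFER_SIZE", 1 << 20);
    const std::size_t parallelism_threshold = env_or_default("PARALLELISM_THRESHOLD", 1000);
    std::printf("buffer size: %zu bytes, parallelism threshold: %zu elements\n",
                buffer_size, parallelism_threshold);
    // ... choose buffer sizes / switch between serial and parallel code paths here ...
    return 0;
}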

Monitoring and feedback

  • Sampled production profiles: Collect lightweight sample profiles in production (e.g. with perf) and feed them back at build time via -fprofile-sample-use
  • Performance telemetry: Track which code paths are actually hot in production
  • A/B testing: Deploy different optimization configurations to subsets of traffic

SLURM configuration for Discoverer

Discoverer compute nodes use SLURM for job scheduling with specific NUMA domain configuration.

Discoverer node configuration

From SLURM configuration:

NodeName=cn0806
Sockets=8          # NUMA domains (SLURM terminology)
CoresPerSocket=16  # 16 cores per NUMA domain
ThreadsPerCore=2   # SMT enabled
RealMemory=257700  # Total memory in MB

All jobs on Discoverer must use:

  • --partition=cn (required for compute nodes)
  • --account=<your_project_slurm_account> (required)
  • --qos=<your_qos_here> (required)

Topology breakdown:

  • 8 NUMA domains: SLURM refers to NUMA domains as “sockets”
  • 16 cores per NUMA domain: Each domain contains 4 CCXs (4 cores per CCX)
  • 2 threads per core: SMT (Simultaneous Multi-Threading) enabled
  • Total capacity: 8 × 16 × 2 = 256 threads per node
  • Memory per NUMA domain: Approximately 32GB per domain (257700 MB / 8)

SLURM NUMA binding directives

Single NUMA domain (for single-process applications)

#!/bin/bash
#SBATCH --job-name=zen2_single_numa
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --sockets-per-node=1      # Use 1 NUMA domain
#SBATCH --cores-per-socket=16      # All 16 cores in that domain
#SBATCH --threads-per-core=2       # Use SMT
#SBATCH --cpus-per-task=32         # 16 cores × 2 threads
#SBATCH --mem=32G                  # Memory for one NUMA domain

# Bind to NUMA domain 0
srun --cpu-bind=sockets:0-0 ./program

Multiple NUMA domains (one task per domain)

#!/bin/bash
#SBATCH --job-name=zen2_multi_numa
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --nodes=1
#SBATCH --ntasks=2                 # One task per NUMA domain
#SBATCH --sockets-per-node=2        # Use 2 NUMA domains
#SBATCH --cores-per-socket=16       # 16 cores per domain
#SBATCH --cpus-per-task=32         # 16 cores × 2 threads per task
#SBATCH --mem=64G                   # Memory for 2 NUMA domains

# SLURM automatically binds each task to its assigned NUMA domain
srun --cpu-bind=sockets ./program

Explicit NUMA domain selection

#!/bin/bash
#SBATCH --job-name=zen2_explicit_numa
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --cpus-per-task=32

# Bind to specific NUMA domain (0-7 available)
srun --cpu-bind=sockets:2-2 ./program  # Use NUMA domain 2

Combining SLURM with numactl

For additional control, combine SLURM binding with numactl:

#!/bin/bash
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --cpus-per-task=32

# Explicit memory and CPU binding
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./program

SLURM environment variables

SLURM provides environment variables for NUMA-aware programming:

# In SLURM scripts or programs
echo "SLURM_CPUS_PER_TASK: $SLURM_CPUS_PER_TASK"      # CPUs allocated to task
echo "SLURM_SOCKETS_PER_NODE: $SLURM_SOCKETS_PER_NODE" # NUMA domains allocated
echo "SLURM_CORES_PER_SOCKET: $SLURM_CORES_PER_SOCKET" # Cores per NUMA domain
echo "SLURM_CPUS_ON_NODE: $SLURM_CPUS_ON_NODE"        # Total CPUs on node

OpenMP with SLURM NUMA binding

For OpenMP applications:

#!/bin/bash
#SBATCH --partition=cn
#SBATCH --account=<your_project_slurm_account>
#SBATCH --qos=<your_qos_here>
#SBATCH --sockets-per-node=1
#SBATCH --cores-per-socket=16
#SBATCH --cpus-per-task=32

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun --cpu-bind=sockets:0-0 ./openmp_program

SLURM usage recommendations for Discoverer

  1. Partition and account: Always use --partition=cn --account=<your_project_slurm_account> --qos=<your_qos_here> for Discoverer compute nodes
  2. Single-process applications: Use --sockets-per-node=1 to bind to one NUMA domain
  3. Multi-process applications: Use --ntasks=N with --sockets-per-node=N (one task per NUMA domain)
  4. Memory allocation: Request memory proportional to NUMA domains used (32GB per domain)
  5. CPU binding: Always use --cpu-bind=sockets to ensure proper NUMA binding
  6. Monitor binding: Check with srun --cpu-bind=sockets numactl --hardware
  7. Thread placement: For OpenMP, use OMP_PLACES=cores and OMP_PROC_BIND=close

Example SLURM scripts

SLURM scripts are provided in the zen2/ directory:

Primary script (runs all examples):

  • slurm_all_benchmarks.sh - Compiles and runs all 11 C++ optimization examples via SLURM. Use this for comprehensive testing.

Individual performance scripts (optional, for specific tests):

  • slurm_numa_performance.sh - Runs only numa_performance_example.cpp
  • slurm_dual_socket_performance.sh - Runs only dual_socket_numa_example.cpp
  • slurm_thread_affinity_performance.sh - Runs only thread_affinity_example.cpp
  • slurm_fortran_example.sh - Compiles and runs zen2_fortran_example.f90 with flang
  • slurm_cpp_standard_comparison.sh - Compares C++17, C++20, and C++23 performance

Documentation scripts (show command examples, do not run code):

  • slurm_numa_example.sh - Shows SLURM NUMA binding command examples
  • slurm_dual_numa_example.sh - Shows dual-NUMA domain command examples
  • slurm_thread_affinity_example.sh - Shows thread affinity command examples

All SLURM scripts include the required Discoverer directives (--partition=cn --account=<your_project_slurm_account> --qos=<your_qos_here>).

Note

All compilation and execution must occur within SLURM batch jobs. Do not compile or run code on login nodes.

Example code demonstrating optimization benefits

The following examples demonstrate how different optimizations benefit Zen2 performance. The example source code and SLURM scripts are located in the zen2/ directory.

Important: The zen2/ folder is located at /opt/software/optimisations/zen2/. All files are also available online at https://gitlab.discoverer.bg/vkolev/snippets/-/tree/main/zen2.

To reproduce the benchmark results, copy this folder to a project directory and run the SLURM batch scripts from within the copied folder:

# Copy the zen2 folder to your project directory
cp -r /opt/software/optimisations/zen2/ /path/to/your/project/

# Navigate to the copied folder
cd /path/to/your/project/zen2

# Submit the SLURM batch job from within the copied folder
sbatch slurm_all_benchmarks.sh

This compiles and executes all examples within the SLURM job. Results are written to zen2_all_benchmarks.<jobid>.out in the same directory.

Note

The SLURM scripts use SLURM_SUBMIT_DIR to locate source files, so they must be run from within the zen2/ directory (or a copy of it) where the source files are located.

Example 1: Vectorization with AVX2

Source file: zen2/vectorized_compute.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

This example shows how -march=znver2 enables AVX2 vectorization:

// vectorized_compute.cpp
#include <immintrin.h>
#include <chrono>
#include <iostream>

// Unoptimized version (scalar)
void compute_scalar(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

// Optimized version (vectorized with AVX2)
void compute_vectorized(float* __restrict__ a, float* __restrict__ b,
                        float* __restrict__ c, size_t n) {
    size_t i = 0;
    // Process 8 floats at a time (256-bit AVX2)
    for (; i + 8 <= n; i += 8) {
        // Use unaligned loads/stores: new[] does not guarantee 32-byte alignment.
        // On Zen2 they run at full speed when the data happens to be aligned.
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_fmadd_ps(va, vb, va); // FMA: a*b + a
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}

int main() {
    const size_t n = 10000000;
    float* a = new float[n];
    float* b = new float[n];
    float* c = new float[n];

    // Initialize
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1.5f;
        b[i] = 2.0f;
    }

    // Benchmark scalar
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_scalar(a, b, c, n);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto scalar_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Benchmark vectorized
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_vectorized(a, b, c, n);
    }
    end = std::chrono::high_resolution_clock::now();
    auto vectorized_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Scalar time: " << scalar_time.count() << " microseconds\n";
    std::cout << "Vectorized time: " << vectorized_time.count() << " microseconds\n";
    std::cout << "Speedup: " << (double)scalar_time.count() / vectorized_time.count() << "x\n";

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flags:

# Baseline build (AVX2 and FMA must still be enabled because the source uses intrinsics)
clang++ -O2 -mavx2 -mfma vectorized_compute.cpp -o vectorized_compute

# With Zen2 optimizations
clang++ -O2 -march=znver2 -mprefer-vector-width=256 -mfma \
        vectorized_compute.cpp -o vectorized_compute_optimized

Example 1a: Auto-vectorization with -mprefer-vector-width=256

Source file: zen2/auto_vectorization_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

This example demonstrates how -mprefer-vector-width=256 affects compiler auto-vectorization decisions. Unlike Example 1 which uses explicit intrinsics, this example uses plain C++ code that relies on the compiler to automatically vectorize the loops.

// auto_vectorization_example.cpp
#include <chrono>
#include <iostream>
#include <algorithm>

// Plain C++ code - compiler will auto-vectorize this
void compute_auto_vectorized(float* __restrict__ a, float* __restrict__ b,
                             float* __restrict__ c, size_t n) {
    // This loop can be auto-vectorized by the compiler
    // With -mprefer-vector-width=256, compiler will use 256-bit AVX2 vectors (8 floats)
    // Without it, compiler might use smaller vectors or be less aggressive
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i] * 0.5f;
    }
}

// More complex computation that benefits from wider vectors
void compute_complex(float* __restrict__ a, float* __restrict__ b,
                     float* __restrict__ c, float* __restrict__ d, size_t n) {
    // Multiple operations that can be vectorized together
    for (size_t i = 0; i < n; ++i) {
        float temp = a[i] * b[i];
        c[i] = temp + d[i];
        d[i] = temp * 0.75f + c[i];
    }
}

int main() {
    const size_t n = 10000000;
    float* a = new float[n];
    float* b = new float[n];
    float* c = new float[n];
    float* d = new float[n];

    // Initialize with aligned memory for better vectorization
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1.5f + (i % 100) * 0.01f;
        b[i] = 2.0f + (i % 50) * 0.02f;
        c[i] = 0.0f;
        d[i] = 0.5f;
    }

    // Warmup
    compute_auto_vectorized(a, b, c, n);
    compute_complex(a, b, c, d, n);

    // Benchmark simple computation
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_auto_vectorized(a, b, c, n);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto simple_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Benchmark complex computation
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_complex(a, b, c, d, n);
    }
    end = std::chrono::high_resolution_clock::now();
    auto complex_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Simple computation time: " << simple_time.count() << " microseconds\n";
    std::cout << "Complex computation time: " << complex_time.count() << " microseconds\n";
    std::cout << "Total time: " << (simple_time + complex_time).count() << " microseconds\n";

    // Verify results (prevent optimization away)
    volatile float sum = 0.0f;
    for (size_t i = 0; i < n; i += 1000) {
        sum += c[i] + d[i];
    }
    std::cout << "Result checksum: " << sum << "\n";

    delete[] a;
    delete[] b;
    delete[] c;
    delete[] d;
    return 0;
}

Compile and run via SLURM:

This example should be compiled twice to compare the effect of -mprefer-vector-width=256:

# Version 1: Without -mprefer-vector-width=256 (default behavior)
# Compiler may use smaller vectors or be less aggressive with vectorization
clang++ -O2 -march=znver2 -mfma \
        auto_vectorization_example.cpp -o auto_vectorization_default

# Version 2: With -mprefer-vector-width=256 (preferred for Zen2)
# Compiler will prefer 256-bit AVX2 vectors (8 floats per vector)
clang++ -O2 -march=znver2 -mprefer-vector-width=256 -mfma \
        auto_vectorization_example.cpp -o auto_vectorization_preferred

# Compare performance (in SLURM batch job):
./auto_vectorization_default
./auto_vectorization_preferred

To verify vectorization:

You can check what vectorization the compiler performed using:

# Generate assembly to see vectorization
clang++ -O2 -march=znver2 -mprefer-vector-width=256 -mfma -S \
        auto_vectorization_example.cpp -o auto_vectorization_preferred.s

# Look for AVX2 instructions (vmovaps, vfmadd213ps, etc.)
grep -E "(vmovaps|vfmadd|vaddps|vmulps)" auto_vectorization_preferred.s

Expected benefits:

  • With -mprefer-vector-width=256: Compiler uses 256-bit AVX2 vectors, processing 8 floats per iteration. This matches Zen2’s dual 256-bit FMA units per core.
  • Without the flag: Compiler may use smaller 128-bit vectors (4 floats) or be less aggressive, potentially missing optimization opportunities.

The performance difference depends on the workload, but -mprefer-vector-width=256 ensures the compiler generates code that fully utilizes Zen2’s vector execution units.

Example 2: Cache-aware data layout

Source file: zen2/cache_layout.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

This demonstrates how data layout affects cache performance on Zen2:

// cache_layout.cpp
#include <chrono>
#include <iostream>
#include <random>

// Poor layout: data scattered (cache-unfriendly)
struct PoorLayout {
    double value;
    char padding[56];  // Spreads data across cache lines
};

// Good layout: data packed (cache-friendly)
struct GoodLayout {
    double value;
    // No padding - fits 8 values per cache line
};

void process_poor_layout(PoorLayout* data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i].value = data[i].value * 1.5 + 0.1;
    }
}

void process_good_layout(GoodLayout* data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i].value = data[i].value * 1.5 + 0.1;
    }
}

int main() {
    const size_t n = 1000000;

    PoorLayout* poor_data = new PoorLayout[n];
    GoodLayout* good_data = new GoodLayout[n];

    // Initialize
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dis(0.0, 100.0);
    for (size_t i = 0; i < n; ++i) {
        double val = dis(gen);
        poor_data[i].value = val;
        good_data[i].value = val;
    }

    // Benchmark poor layout
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 1000; ++iter) {
        process_poor_layout(poor_data, n);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto poor_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Benchmark good layout
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 1000; ++iter) {
        process_good_layout(good_data, n);
    }
    end = std::chrono::high_resolution_clock::now();
    auto good_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Poor layout time: " << poor_time.count() << " microseconds\n";
    std::cout << "Good layout time: " << good_time.count() << " microseconds\n";
    std::cout << "Speedup: " << (double)poor_time.count() / good_time.count() << "x\n";

    delete[] poor_data;
    delete[] good_data;
    return 0;
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flag:

clang++ -O2 -march=znver2 cache_layout.cpp -o cache_layout

Example 3: Profile-guided optimization benefit

Source file: zen2/pgo_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

This example demonstrates how Profile-Guided Optimization (PGO) optimizes branch prediction and code layout. The example uses a branch that is taken 90% of the time, allowing PGO to optimize the hot path.

What PGO optimizes in this example:

  • Branch prediction: PGO learns that data[i] > threshold is true 90% of the time, so it optimizes the hot path
  • Code layout: PGO places the frequently-executed code (hot path) in a more cache-friendly location
  • Function ordering: PGO can reorder functions based on execution frequency
// pgo_example.cpp
#include <chrono>
#include <iostream>
#include <random>

// Function with predictable branch pattern (benefits from PGO)
// The branch 'data[i] > threshold' is taken 90% of the time
int process_data(int* data, size_t n, int threshold) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] > threshold) {  // Hot path: executed 90% of iterations
            sum += data[i] * 2;      // This path benefits from PGO optimization
        } else {  // Cold path: executed 10% of iterations
            sum += data[i];         // This path is optimized for size, not speed
        }
    }
    return sum;
}

int main() {
    const size_t n = 10000000;
    int* data = new int[n];

    // Simulate real workload: 90% above threshold (100-200), 10% below (0-99)
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> hot_dis(100, 200);  // 90% of values
    std::uniform_int_distribution<int> cold_dis(0, 99);   // 10% of values

    for (size_t i = 0; i < n; ++i) {
        if (i % 10 < 9) {
            data[i] = hot_dis(gen);  // 90% of values are in hot range
        } else {
            data[i] = cold_dis(gen); // 10% of values are in cold range
        }
    }

    int threshold = 50;

    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        volatile int result = process_data(data, n, threshold);
        (void)result;
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Processing time: " << time.count() << " microseconds\n";
    std::cout << "Result checksum: " << process_data(data, n, threshold) << "\n";

    delete[] data;
    return 0;
}

Complete PGO workflow (in SLURM batch job):

# Step 1: Build instrumented version (adds profiling code)
clang++ -O2 -march=znver2 -fprofile-generate pgo_example.cpp -o pgo_example_instrumented

# Step 2: Run instrumented version to collect profile data
# This generates default.profraw
./pgo_example_instrumented

# Step 3: Convert profile data to usable format
llvm-profdata merge -o default.profdata default.profraw

# Step 4: Build optimized version using the profile
clang++ -O2 -march=znver2 -fprofile-use=default.profdata pgo_example.cpp -o pgo_example_optimized

# Step 5: Compare performance
echo "=== Without PGO (baseline) ==="
clang++ -O2 -march=znver2 pgo_example.cpp -o pgo_example_baseline
./pgo_example_baseline

echo "=== With PGO (optimized) ==="
./pgo_example_optimized

What to expect:

  • Without PGO: The compiler treats both branches equally, using default branch prediction heuristics
  • With PGO: The compiler knows the hot path is taken 90% of the time and optimizes accordingly:
    • Places hot path code in a more cache-friendly location
    • Optimizes hot path for speed (may unroll, inline, etc.)
    • Optimizes cold path for code size (may move it out of the way)
    • Improves branch prediction by placing the likely-taken branch first

Expected improvement: 10-20% performance gain from PGO, depending on the workload. The improvement comes from:

  1. Better branch prediction (reduced misprediction penalties)
  2. Better code layout (hot path stays in instruction cache)
  3. More aggressive optimization of hot paths
  4. Reduced code size of cold paths (better instruction cache utilization)

Example 4: Restrict pointer optimization

Source file: zen2/restrict_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

Demonstrates how __restrict__ helps Zen2’s memory disambiguation:

// restrict_example.cpp
#include <chrono>
#include <iostream>

// Without restrict: compiler must assume aliasing
void compute_no_restrict(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * c[i] + a[i];  // Compiler can't optimize due to potential aliasing
    }
}

// With restrict: compiler knows no aliasing, can optimize aggressively
void compute_with_restrict(float* __restrict__ a,
                           float* __restrict__ b,
                           float* __restrict__ c,
                           size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * c[i] + a[i];  // Can use FMA and reorder
    }
}

int main() {
    const size_t n = 10000000;
    float* a1 = new float[n];
    float* b1 = new float[n];
    float* c1 = new float[n];

    float* a2 = new float[n];
    float* b2 = new float[n];
    float* c2 = new float[n];

    // Initialize
    for (size_t i = 0; i < n; ++i) {
        a1[i] = a2[i] = 1.0f;
        b1[i] = b2[i] = 2.0f;
        c1[i] = c2[i] = 3.0f;
    }

    // Benchmark without restrict
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_no_restrict(a1, b1, c1, n);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto no_restrict_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Benchmark with restrict
    start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 100; ++iter) {
        compute_with_restrict(a2, b2, c2, n);
    }
    end = std::chrono::high_resolution_clock::now();
    auto restrict_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "No restrict time: " << no_restrict_time.count() << " microseconds\n";
    std::cout << "With restrict time: " << restrict_time.count() << " microseconds\n";
    std::cout << "Speedup: " << (double)no_restrict_time.count() / restrict_time.count() << "x\n";

    delete[] a1; delete[] b1; delete[] c1;
    delete[] a2; delete[] b2; delete[] c2;
    return 0;
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flag:

clang++ -O2 -march=znver2 -mfma restrict_example.cpp -o restrict_example

Example 5: Loop unrolling optimization

Source file: zen2/loop_unroll_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

Demonstrates loop unrolling for Zen2’s 4K µOP cache:

// loop_unroll_example.cpp
// Moderate unrolling (respects µOP cache)
void process_moderate_unroll(int* data, size_t n) {
    size_t i = 0;
    // Unroll by 4 - fits well in µOP cache
    for (; i + 4 <= n; i += 4) {
        data[i] = data[i] * 2 + 1;
        data[i+1] = data[i+1] * 2 + 1;
        data[i+2] = data[i+2] * 2 + 1;
        data[i+3] = data[i+3] * 2 + 1;
    }
    // Handle remainder
    for (; i < n; ++i) {
        data[i] = data[i] * 2 + 1;
    }
}

// Excessive unrolling (pollutes µOP cache)
void process_excessive_unroll(int* data, size_t n) {
    size_t i = 0;
    // Unroll by 32 - exceeds µOP cache capacity
    for (; i + 32 <= n; i += 32) {
        // 32 iterations of unrolled code...
    }
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flag:

clang++ -O2 -march=znver2 loop_unroll_example.cpp -o loop_unroll_example

Example 6: Memory alignment optimization

Source file: zen2/memory_alignment_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

Demonstrates impact of memory alignment on Zen2 performance:

// memory_alignment_example.cpp
#include <immintrin.h>
#include <cstddef>   // size_t

// Misaligned access (causes performance penalty)
void process_misaligned(float* data, size_t n) {
    // Start from offset 1 to create misalignment
    for (size_t i = 1; i < n - 1; ++i) {
        data[i] = data[i-1] + data[i] + data[i+1];
    }
}

// Aligned access (for cache lines and vectorization)
__attribute__((target("avx2,fma")))
void process_aligned(float* __restrict__ data, size_t n) {
    // Process aligned chunks with vectorization
    size_t i = 0;
    for (; i + 8 <= n - 8; i += 8) {
        __m256 va = _mm256_load_ps(&data[i]);    // base assumed 32-byte aligned (e.g. std::aligned_alloc)
        __m256 vb = _mm256_loadu_ps(&data[i+1]); // windows shifted by 1 and 2 floats are necessarily
        __m256 vc = _mm256_loadu_ps(&data[i+2]); // unaligned, so they require unaligned loads
        __m256 vsum = _mm256_add_ps(_mm256_add_ps(va, vb), vc);
        _mm256_store_ps(&data[i], vsum);
    }
    // Handle remainder...
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flags:

clang++ -O2 -march=znver2 -mavx2 -mfma memory_alignment_example.cpp -o memory_alignment_example

Example 7: Combined optimizations

Source file: zen2/combined_optimization_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

Demonstrates combining multiple optimization techniques:

// combined_optimization_example.cpp
#include <immintrin.h>
#include <algorithm>   // std::min
#include <cstddef>     // size_t

// Combined: vectorization, alignment, restrict, cache blocking
// (a, b and c are assumed to be 32-byte aligned, e.g. allocated with std::aligned_alloc)
__attribute__((target("avx2,fma")))
void compute_optimized(float* __restrict__ a, float* __restrict__ b,
                       float* __restrict__ c, size_t n) {
    const size_t block_size = 64;  // Cache block size

    // Process in blocks for cache efficiency
    for (size_t block = 0; block < n; block += block_size) {
        size_t block_end = std::min(block + block_size, n);
        size_t i = block;

        // Vectorized inner loop
        for (; i + 8 <= block_end; i += 8) {
            __m256 va = _mm256_load_ps(&a[i]);
            __m256 vb = _mm256_load_ps(&b[i]);
            __m256 vhalf = _mm256_set1_ps(0.5f);
            __m256 vc = _mm256_fmadd_ps(va, vb, _mm256_mul_ps(va, vhalf));
            _mm256_store_ps(&c[i], vc);
        }
        // Scalar remainder...
    }
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. Compilation flags:

clang++ -O2 -march=znver2 -mavx2 -mfma -mprefer-vector-width=256 \
        combined_optimization_example.cpp -o combined_optimization_example

Example 8: Matrix multiplication benchmark

Source file: zen2/matrix_multiply_benchmark.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory

A comprehensive benchmark that tests multiple optimization techniques:

// matrix_multiply_benchmark.cpp
#include <chrono>
#include <iostream>
#include <immintrin.h>
#include <random>
#include <algorithm>

// Matrix multiplication - benefits from vectorization and cache optimization
void matmul_optimized(float* __restrict__ A, float* __restrict__ B,
                     float* __restrict__ C, int n) {
    // Blocked for cache efficiency (64x64 blocks fit in L1 cache)
    const int block_size = 64;

    for (int ii = 0; ii < n; ii += block_size) {
        for (int jj = 0; jj < n; jj += block_size) {
            for (int kk = 0; kk < n; kk += block_size) {
                int i_end = std::min(ii + block_size, n);
                int j_end = std::min(jj + block_size, n);
                int k_end = std::min(kk + block_size, n);

                for (int i = ii; i < i_end; ++i) {
                    for (int j = jj; j < j_end; ++j) {
                        float sum = C[i * n + j];
                        // Vectorized inner loop (8 floats at a time)
                        int k = kk;
                        for (; k + 8 <= k_end; k += 8) {
                            __m256 va = _mm256_load_ps(&A[i * n + k]);
                            __m256 vb = _mm256_load_ps(&B[j * n + k]);
                            __m256 vprod = _mm256_mul_ps(va, vb);
                            // Horizontal sum of 8 floats
                            __m128 hi = _mm256_extractf128_ps(vprod, 1);
                            __m128 lo = _mm256_extractf128_ps(vprod, 0);
                            __m128 sum128 = _mm_add_ps(hi, lo);
                            sum128 = _mm_hadd_ps(sum128, sum128);
                            sum128 = _mm_hadd_ps(sum128, sum128);
                            sum += _mm_cvtss_f32(sum128);
                        }
                        // Scalar remainder
                        for (; k < k_end; ++k) {
                            sum += A[i * n + k] * B[j * n + k];
                        }
                        C[i * n + j] = sum;
                    }
                }
            }
        }
    }
}

int main() {
    const int n = 512;
    float* A = new float[n * n];
    float* B = new float[n * n];
    float* C = new float[n * n];

    // Initialize matrices
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dis(0.0f, 1.0f);
    for (int i = 0; i < n * n; ++i) {
        A[i] = dis(gen);
        B[i] = dis(gen);
        C[i] = 0.0f;
    }

    // Warmup
    matmul_optimized(A, B, C, n);

    // Benchmark
    auto start = std::chrono::high_resolution_clock::now();
    for (int iter = 0; iter < 10; ++iter) {
        matmul_optimized(A, B, C, n);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "Matrix multiplication (512x512): "
              << time.count() / 10.0 << " ms per iteration\n";
    std::cout << "GFLOPS: " << (2.0 * n * n * n) / (time.count() / 10.0 / 1000.0) / 1e9 << "\n";

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

Compile with full optimizations:

# Full optimization stack
clang++ -O2 -march=znver2 -mprefer-vector-width=256 -mfma \
        -flto=thin -fdata-sections -ffunction-sections \
        matrix_multiply_benchmark.cpp -o matrix_multiply_benchmark \
        -fuse-ld=lld -Wl,--gc-sections -stdlib=libc++

# In SLURM batch job:
./matrix_multiply_benchmark

Benchmark results:

  • Matrix size: 512×512 (262,144 elements per matrix)
  • Time per iteration: 288.8 milliseconds
  • Performance: 0.929 GFLOPS (Giga Floating Point Operations Per Second)
  • Theoretical peak: A 64-core EPYC 7H12 with dual 256-bit FMA units per core can execute 32 single-precision FLOPs per core per cycle, putting the theoretical single-precision peak on the order of 5 TFLOPS per socket at the 2.6 GHz base clock. This single-threaded implementation, whose inner loop does not use FMA, achieves ~0.93 GFLOPS, which is reasonable for a blocked, cache-optimized implementation.

Analysis:

This example demonstrates a comprehensive optimization combining:

  1. Blocked matrix multiplication: The 64×64 block size is chosen to fit in L1 cache (32KB data cache per core), minimizing cache misses.
  2. AVX2 vectorization: Uses 256-bit vectors to process 8 floats simultaneously, matching Zen2’s dual 256-bit FMA units.
  3. Cache-aware access pattern: The blocked algorithm ensures good temporal and spatial locality, reducing memory bandwidth requirements.
  4. Full optimization stack: Includes LTO, section elimination, and linker optimizations to minimize code size and improve instruction cache utilization.

Performance characteristics:

  • Memory-bound nature: Matrix multiplication is inherently memory-intensive. For 512×512 matrices, each iteration reads 2 matrices (A and B) and writes 1 matrix (C), totaling ~3 MB of data access per iteration.
  • Cache efficiency: The blocked approach significantly reduces cache misses compared to naive triple-loop implementation, which would have poor cache locality.
  • Single-threaded limitation: This implementation is single-threaded. For multi-threaded performance, parallelize across blocks or use OpenMP/threading libraries. On a 64-core system, proper parallelization could achieve 10-50x speedup depending on memory bandwidth saturation.

Observations:

  • Blocking is critical: The blocked algorithm is essential for cache efficiency. Without blocking, performance would be significantly worse due to cache misses.
  • Vectorization helps: AVX2 vectorization provides measurable benefit, though the primary optimization is the cache-aware blocking.
  • Full optimization stack matters: The combination of LTO, section elimination, and linker optimizations contributes to overall performance, though the impact is smaller than algorithmic optimizations (blocking).

Recommendations for matrix multiplication on Zen2:

  • Use blocking: Always use blocked/tiled matrix multiplication for matrices larger than ~100×100.
  • Block size: 64×64 blocks work well for Zen2’s 32KB L1 cache. For larger caches or different workloads, experiment with block sizes (32×32 to 128×128).
  • Parallelize: For large matrices, use OpenMP or threading to parallelize across blocks; bind threads to cores within the same CCX (4 cores) for best cache sharing (see the sketch after this list).
  • NUMA awareness: For very large matrices that don’t fit in cache, use NUMA-aware memory allocation to ensure data is local to the computing cores.
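
A sketch of the parallelization idea using the same blocking scheme as Example 8 (a scalar inner kernel is used here for clarity; the point is how the (ii, jj) tiles are distributed across OpenMP threads so that each tile of C is written by exactly one thread):

// parallel_blocked_matmul_sketch.cpp -- compile with: clang++ -O2 -march=znver2 -fopenmp
#include <algorithm>
#include <cstdio>
#include <vector>

// C += A * B^T on n x n matrices, blocked as in Example 8 but with a scalar kernel.
// The two outer block loops are collapsed and shared across OpenMP threads; each
// (ii, jj) tile of C belongs to exactly one thread, so no synchronization is needed.
void matmul_blocked_parallel(const float* A, const float* B, float* C, int n) {
    const int bs = 64; // block fits comfortably in the 32 KB L1D per core
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += bs) {
        for (int jj = 0; jj < n; jj += bs) {
            for (int kk = 0; kk < n; kk += bs) {
                const int ie = std::min(ii + bs, n);
                const int je = std::min(jj + bs, n);
                const int ke = std::min(kk + bs, n);
                for (int i = ii; i < ie; ++i)
                    for (int j = jj; j < je; ++j) {
                        float sum = C[i * n + j];
                        for (int k = kk; k < ke; ++k)
                            sum += A[i * n + k] * B[j * n + k];
                        C[i * n + j] = sum;
                    }
            }
        }
    }
}

int main() {
    const int n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    matmul_blocked_parallel(A.data(), B.data(), C.data(), n);
    std::printf("C[0] = %f (expected %d)\n", C[0], 2 * n); // 512 * (1 * 2) = 1024
    return 0;
}

Run with OMP_PLACES=cores and OMP_PROC_BIND=close (as in the OpenMP SLURM example above) so that threads working on neighbouring tiles share L3 within a CCX.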

Example 9: NUMA optimization

Source file: zen2/numa_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_numa_performance.sh (runs only this example)
Location: zen2/ directory

Demonstrates NUMA-aware optimizations for Zen2 multi-CCX systems:

// numa_example.cpp
#include <cstddef>   // size_t
#include <thread>
#include <vector>

// Multi-threaded workload with NUMA awareness
void parallel_workload_no_numa(double* data, size_t n, int num_threads, int iterations) {
    std::vector<std::thread> threads;
    size_t chunk_size = n / num_threads;

    for (int t = 0; t < num_threads; ++t) {
        size_t start = t * chunk_size;
        size_t end = (t == num_threads - 1) ? n : (t + 1) * chunk_size;

        threads.emplace_back([=]() {
            for (int iter = 0; iter < iterations; ++iter) {
                for (size_t i = start; i < end; ++i) {
                    data[i] = data[i] * 1.5 + 0.1;
                }
            }
        });
    }

    for (auto& t : threads) {
        t.join();
    }
}

Compile and run via SLURM:

The example is compiled and executed within slurm_all_benchmarks.sh. For NUMA binding, use SLURM directives:

# In SLURM batch job:
clang++ -O2 -march=znver2 -stdlib=libc++ -pthread numa_example.cpp -o numa_example
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./numa_example 4 10

Example 10: NUMA performance impact

Source file: zen2/numa_performance_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_numa_performance.sh (runs only this example)
Location: zen2/ directory

Demonstrates performance difference between NUMA-local and NUMA-remote memory access:

// numa_performance_example.cpp
// Memory-intensive computation sensitive to NUMA placement
void memory_intensive_work(double* data, size_t n, int iterations) {
    for (int iter = 0; iter < iterations; ++iter) {
        // Sequential access
        for (size_t i = 0; i < n; ++i) {
            data[i] = data[i] * 1.234 + 0.567;
        }
        // Random access (more NUMA-sensitive)
        for (size_t i = 0; i < n / 10; ++i) {
            size_t idx = (i * 17) % n;
            data[idx] = data[idx] * 0.9 + 0.1;
        }
    }
}

Compile:

clang++ -O2 -march=znver2 -stdlib=libc++ -pthread numa_performance_example.cpp -o numa_performance_example
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 ./numa_performance_example 50000000 5

NUMA binding strategies:

# Bind to specific NUMA node
numactl --membind=0 --cpunodebind=0 ./numa_performance_example

# Bind to specific CCX (4 cores)
numactl --physcpubind=0-3 ./numa_performance_example

# Interleave across all NUMA nodes
numactl --interleave=all ./numa_performance_example

Example 11: Dual-socket NUMA optimization

Source file: zen2/dual_socket_numa_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_dual_socket_performance.sh (runs only this example)
Location: zen2/ directory

Demonstrates NUMA optimization for multi-NUMA domain AMD EPYC Zen2 systems (Discoverer compute nodes have 8 NUMA domains per node):

// dual_socket_numa_example.cpp
// Test single socket (local memory access)
void test_single_socket(size_t n, int iterations, int threads_per_socket) {
    // Bind all threads to socket 0
    // All memory access is local to socket 0
    // For single-socket workloads
}

// Test dual socket (has remote memory access)
void test_dual_socket(size_t n, int iterations, int total_threads) {
    // Threads distributed across both sockets
    // May have remote memory access penalties
}

Compile and run via SLURM:

Use slurm_dual_socket_performance.sh or compile and execute within a SLURM batch job:

# In SLURM batch job:
clang++ -O2 -march=znver2 -stdlib=libc++ -pthread dual_socket_numa_example.cpp -o dual_socket_numa_example
srun --cpu-bind=sockets:0-0 numactl --membind=0 --cpunodebind=0 ./dual_socket_numa_example 50000000 5 16

Example 12: Thread affinity and CPU binding

Source file: zen2/thread_affinity_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_thread_affinity_performance.sh (runs only this example)
Location: zen2/ directory

Demonstrates thread CPU affinity binding:

// thread_affinity_example.cpp
#include <sched.h>

// Set CPU affinity for thread
bool set_cpu_affinity(int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    return sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0;
}

// Thread work with explicit CPU binding
void thread_work_bound(double* data, size_t start, size_t end, int cpu_id) {
    set_cpu_affinity(cpu_id);  // Bind to specific CPU
    // ... computation ...
}

Compile:

clang++ -O2 -march=znver2 -stdlib=libc++ -pthread thread_affinity_example.cpp -o thread_affinity_example
# In SLURM batch job:
srun --cpu-bind=sockets:0-0 ./thread_affinity_example 50000000 5

Thread binding strategies:

# Bind to same CCX (4 cores)
numactl --membind=0 --physcpubind=0-3 ./thread_affinity_example

# Bind across sockets (may have remote access)
numactl --physcpubind=0,1,64,65 ./thread_affinity_example

Example 13: Fortran optimization with flang

Source file: zen2/zen2_fortran_example.f90
SLURM script: zen2/slurm_fortran_example.sh
Location: zen2/ directory

This example demonstrates how to compile Fortran code with flang (LLVM Fortran compiler) using Zen2-specific optimizations. The example shows vectorization, memory alignment, and OpenMP parallelization optimized for Zen2.

! zen2_fortran_example.f90
! Fortran example demonstrating Zen2 optimizations with flang
program zen2_fortran_example
    use omp_lib
    implicit none

    integer, parameter :: n = 10000000
    integer, parameter :: iterations = 100
    real(8), allocatable, dimension(:) :: a, b, c
    integer :: i, iter
    real(8) :: start_time, end_time, elapsed_time
    real(8) :: checksum

    ! Allocate aligned arrays (helps with vectorization)
    allocate(a(n), b(n), c(n))

    ! Initialize arrays
    do i = 1, n
        a(i) = 1.5d0 + mod(i, 100) * 0.01d0
        b(i) = 2.0d0 + mod(i, 50) * 0.02d0
        c(i) = 0.0d0
    end do

    ! Warmup
    call compute_vectorized(a, b, c, n)

    ! Benchmark vectorized computation
    start_time = omp_get_wtime()
    do iter = 1, iterations
        call compute_vectorized(a, b, c, n)
    end do
    end_time = omp_get_wtime()
    elapsed_time = (end_time - start_time) * 1000.0d0  ! Convert to milliseconds

    ! Calculate checksum to prevent optimization away
    checksum = 0.0d0
    do i = 1, n, 1000
        checksum = checksum + c(i)
    end do

    write(*,'(A,F12.2,A)') 'Vectorized computation time: ', elapsed_time, ' milliseconds'
    write(*,'(A,F20.2)') 'Checksum: ', checksum

    ! Benchmark OpenMP parallel version
    call omp_set_num_threads(4)
    start_time = omp_get_wtime()
    do iter = 1, iterations
        call compute_parallel(a, b, c, n)
    end do
    end_time = omp_get_wtime()
    elapsed_time = (end_time - start_time) * 1000.0d0

    write(*,'(A,F12.2,A)') 'OpenMP parallel time (4 threads): ', elapsed_time, ' milliseconds'

    deallocate(a, b, c)

contains

    ! Vectorized computation - compiler will auto-vectorize this
    ! The 'contiguous' attribute helps with vectorization
    subroutine compute_vectorized(a, b, c, n)
        integer, intent(in) :: n
        real(8), intent(in), contiguous :: a(:), b(:)
        real(8), intent(inout), contiguous :: c(:)
        integer :: i

        ! This loop will be vectorized with AVX2 (4 doubles per vector)
        ! flang with -march=znver2 will use 256-bit vectors
        do i = 1, n
            c(i) = a(i) * b(i) + a(i) * 0.5d0
        end do
    end subroutine compute_vectorized

    ! OpenMP parallel version - demonstrates NUMA-aware parallelization
    subroutine compute_parallel(a, b, c, n)
        integer, intent(in) :: n
        real(8), intent(in), contiguous :: a(:), b(:)
        real(8), intent(inout), contiguous :: c(:)
        integer :: i

        !$omp parallel do default(none) shared(a, b, c, n) private(i)
        do i = 1, n
            c(i) = a(i) * b(i) + a(i) * 0.5d0
        end do
        !$omp end parallel do
    end subroutine compute_parallel

end program zen2_fortran_example

Compile and run via SLURM:

The example is compiled and executed within slurm_fortran_example.sh. To run it:

cd zen2
sbatch slurm_fortran_example.sh

The script compiles three versions:

  1. Basic version: Single-threaded with Zen2 optimizations
  2. OpenMP version: Multi-threaded with OpenMP support
  3. Full optimization version: Complete optimization stack with LTO

Manual compilation (if needed):

# Basic compilation with Zen2 optimizations
# Note: -mfma is included automatically with -march=znver2 in flang
# Note: Code uses OpenMP runtime functions (omp_get_wtime) so -fopenmp is required
# Note: -mprefer-vector-width=256 is not supported by flang (use -march=znver2 instead)
flang -O2 -march=znver2 \
      -fopenmp \
      zen2_fortran_example.f90 -o zen2_fortran_example

# With OpenMP support (same as basic since code requires OpenMP runtime)
flang -O2 -march=znver2 \
      -fopenmp zen2_fortran_example.f90 -o zen2_fortran_example_omp

# Full optimization stack (recommended for production)
# Note: flang has limited support for some optimization flags compared to clang
# Flags like -falign-loops, -fdata-sections, -ffunction-sections are not supported
# The -march=znver2 flag provides the main Zen2 optimizations
flang -O2 \
      -march=znver2 \
      -fopenmp \
      zen2_fortran_example.f90 -o zen2_fortran_example_full

# In SLURM batch job:
srun --cpu-bind=sockets:0-0 ./zen2_fortran_example_full

Fortran-specific optimizations for Zen2:

  1. contiguous attribute: Helps the compiler vectorize array operations by guaranteeing contiguous memory layout.

  2. Array syntax: Fortran’s array syntax (a(:)) is naturally vectorizable. The compiler can vectorize these operations effectively with Zen2 flags.

  3. OpenMP thread binding: For NUMA-aware execution:

    export OMP_NUM_THREADS=4
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    srun --cpu-bind=sockets:0-0 ./zen2_fortran_example_omp
    
  4. Alignment directives: For critical arrays, use compiler-specific alignment:

    ! Intel Fortran directive (not supported by flang or gfortran; shown for reference)
    !DIR$ ATTRIBUTES ALIGN : 64 :: a, b, c
    

Verifying vectorization:

# Generate assembly to check vectorization
# Note: -mfma is included automatically with -march=znver2
# Note: -mprefer-vector-width=256 is not supported by flang
flang -O2 -march=znver2 -S \
      zen2_fortran_example.f90 -o zen2_fortran_example.s

# Look for AVX2 instructions
grep -E "(vmovapd|vfmadd|vaddpd|vmulpd)" zen2_fortran_example.s

Profile-Guided Optimization (PGO) for Fortran:

The script slurm_fortran_example.sh includes a complete PGO workflow:

# Step 1: Compile instrumented version
flang -O2 -march=znver2 -fprofile-generate -fopenmp \
      zen2_fortran_example.f90 -o zen2_fortran_example_pgo_instrumented

# Step 2: Run to collect profile data
./zen2_fortran_example_pgo_instrumented

# Step 3: Merge profile data (if multiple runs)
llvm-profdata merge -o default.profdata default*.profraw

# Step 4: Compile optimized version with profile
flang -O2 -march=znver2 -fprofile-use=default.profdata -fopenmp \
      zen2_fortran_example.f90 -o zen2_fortran_example_pgo

# Step 5: Compare performance
./zen2_fortran_example_full      # Without PGO
./zen2_fortran_example_pgo       # With PGO

Important Note on flang PGO Support:

flang PGO support is limited or incomplete compared to clang. Testing with flang 21.1.2 shows that:

  • The instrumented binary compiles successfully with -fprofile-generate
  • The instrumented binary runs successfully
  • However, no profile data files (.profraw) are created
  • This indicates that flang 21.1.2 does not fully support PGO profile generation

If you see “Skipping PGO-optimized build (no profile data)”:

  • This is expected behavior for flang 21.1.2 and many other flang versions
  • flang PGO support is incomplete compared to clang
  • The script will continue and run other benchmarks

Alternatives for PGO with Fortran:

  1. Use a newer flang version (if available) that may have better PGO support
  2. Use gfortran with -fprofile-generate and -fprofile-use (if gfortran PGO works)
  3. Use clang with C++ code for reliable PGO demonstrations (see Example 3)
  4. Focus on other optimizations that work well with flang: -O2 -march=znver2 provides excellent performance

To verify flang has PGO support:

# Check if profile files are created
flang -O2 -fprofile-generate test.f90 -o test_instrumented
./test_instrumented
ls -la *.profraw  # Should show profile files if PGO is working

PGO benefits for Fortran (when supported):

  • Branch prediction optimization: PGO learns which branches are taken frequently and optimizes accordingly
  • Code layout: Frequently-executed code paths are placed in cache-friendly locations
  • Function ordering: Hot functions are reordered to improve instruction cache locality
  • Loop optimization: PGO can optimize loops based on actual iteration counts and branch patterns

Expected PGO improvement: 15-25% performance gain for Fortran code (when PGO is supported), similar to C++ code. Note: Specific examples may show 10-20% improvement depending on workload characteristics.

Fortran-specific considerations for Zen2:

  • Array operations: Fortran’s array syntax naturally benefits from vectorization. The compiler can vectorize a(:) = b(:) * c(:) effectively.
  • Memory layout: Fortran’s column-major layout can affect cache performance. Consider blocking for large matrices.
  • OpenMP: Use OMP_PLACES=cores and OMP_PROC_BIND=close for NUMA-aware thread placement on Zen2 systems.
  • Compiler flags: flang supports core Zen2 optimization flags (-O2, -march=znver2, -fopenmp), but has limited support for some advanced flags compared to clang. Flags like -mprefer-vector-width=256, -falign-loops, -fdata-sections, -ffunction-sections, and -flto=thin are not supported by flang. However, -march=znver2 automatically enables FMA and appropriate vectorization for flang.

Example 14: C++ standard version comparison (C++17 vs C++20 vs C++23)

Source file: zen2/cpp_standard_comparison.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_cpp_standard_comparison.sh (runs only this example)
Location: zen2/ directory

This example demonstrates how different C++ standard versions compile and perform on Zen2. The example compares C++17, C++20, and C++23 to show that the standard version has minimal impact on performance when using equivalent code patterns.

// cpp_standard_comparison.cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

// C++17: Traditional loop (baseline)
void process_cpp17(std::vector<double>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 1.5 + 0.1;
    }
}

// C++20: Traditional loop (same as C++17, just compiled with -std=c++20)
void process_cpp20_traditional(std::vector<double>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 1.5 + 0.1;
    }
}

#if __cplusplus >= 202002L
#include <ranges>

// C++20: Using ranges (may have iterator overhead)
void process_cpp20_ranges(std::vector<double>& data) {
    auto result = data | std::views::transform([](double x) { return x * 1.5 + 0.1; });
    std::ranges::copy(result, data.begin());
}
#endif

int main() {
    const size_t n = 10000000;
    const int iterations = 100;

    std::vector<double> data(n);
    std::iota(data.begin(), data.end(), 1.0);

    // Warmup
    process_cpp17(data);

    // Benchmark C++17-style loop
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        process_cpp17(data);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto cpp17_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "C++17 time: " << cpp17_time.count() << " milliseconds\n";

#if __cplusplus >= 202002L
    // Benchmark C++20 traditional loop
    std::iota(data.begin(), data.end(), 1.0);
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        process_cpp20_traditional(data);
    }
    end = std::chrono::high_resolution_clock::now();
    auto cpp20_traditional_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "C++20 (traditional) time: " << cpp20_traditional_time.count() << " milliseconds\n";

    // Benchmark C++20 ranges
    std::iota(data.begin(), data.end(), 1.0);
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        process_cpp20_ranges(data);
    }
    end = std::chrono::high_resolution_clock::now();
    auto cpp20_ranges_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << "C++20 (ranges) time: " << cpp20_ranges_time.count() << " milliseconds\n";
#endif

    // Verify results
    volatile double sum = std::accumulate(data.begin(), data.end(), 0.0);
    std::cout << "Result checksum: " << sum << "\n";

    return 0;
}

Compile and run via SLURM:

# Compile with C++17
# Note: Use -stdlib=libc++ to avoid conflicts with system GCC headers
clang++ -std=c++17 -stdlib=libc++ -O2 -march=znver2 \
        cpp_standard_comparison.cpp -o cpp_standard_comparison_cpp17

# Compile with C++20
clang++ -std=c++20 -stdlib=libc++ -O2 -march=znver2 \
        cpp_standard_comparison.cpp -o cpp_standard_comparison_cpp20

# Compile with C++23 (if supported by compiler)
clang++ -std=c++23 -stdlib=libc++ -O2 -march=znver2 \
        cpp_standard_comparison.cpp -o cpp_standard_comparison_cpp23

# Compare performance (in SLURM batch job)
./cpp_standard_comparison_cpp17
./cpp_standard_comparison_cpp20
./cpp_standard_comparison_cpp23

C++ Standard version impact on Zen2:

C++17:

  • Performance: Excellent - mature standard with well-optimized implementations
  • Features: Structured bindings, if constexpr, parallel algorithms
  • Zen2 optimization: Full support, no overhead from newer standard features
  • Recommendation: Safe choice for production code, excellent performance

C++20:

  • Performance: Generally good, but some features may have overhead:
    • Traditional loops: Same performance as C++17
    • Ranges: May have iterator overhead compared to traditional loops (5-15% slower in some cases)
    • Concepts: Compile-time only, no runtime overhead
    • Coroutines: May have overhead for simple cases
  • Zen2 optimization: Full support, but prefer traditional loops over ranges for hot paths
  • Recommendation: Use C++20 features judiciously - traditional loops often perform better

C++23:

  • Performance: Similar to C++20 for equivalent code patterns
  • Features: std::mdspan, improved ranges, more compile-time features
  • Zen2 optimization: Full support, potentially better compile-time optimizations
  • Recommendation: Use if available, but performance similar to C++20 for equivalent code

Observations:

  1. Standard version has minimal impact on performance for equivalent code patterns. The same algorithm written in C++17, C++20, or C++23 will perform similarly when compiled with -O2 -march=znver2.
  2. C++20 ranges may have overhead: The ranges library uses iterators and may have 5-15% more overhead than traditional loops. For performance-critical code, prefer traditional loops.
  3. Concepts are compile-time only: C++20 concepts don’t add runtime overhead and can help with optimization by providing better type information to the compiler.
  4. Compiler optimizations matter more: The -O2 -march=znver2 flags have much more impact on performance than the C++ standard version.
  5. Use the latest standard you can: Newer standards may enable better compile-time optimizations, but runtime performance is similar for equivalent code.

Recommendations:

  • For production code: Use C++17 or C++20 (whichever the codebase supports)
  • For performance-critical loops: Prefer traditional loops over C++20 ranges
  • For new code: Use C++20 or C++23 if available, but write hot paths with traditional loops
  • Focus on compiler flags: -O2 -march=znver2 matters more than the C++ standard version

The example is compiled and executed within slurm_all_benchmarks.sh or slurm_cpp_standard_comparison.sh. See the compilation commands shown above in the “Compile and run via SLURM” section.

Benchmark results

Results from running the example code on a Zen2 system with LLVM 21.1.2:

Example 1: Vectorization with AVX2

Source file: zen2/vectorized_compute.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Scalar time: 5618 milliseconds
  • Vectorized time: 5587 milliseconds
  • Speedup: 1.00555x

Analysis:

Minimal speedup observed. The scalar version is optimized at -O2, and the workload does not benefit significantly from explicit vectorization in this case. This supports the observation that Zen2’s runtime optimizer handles many optimizations that would normally require explicit vectorization. The 0.55% improvement is within measurement variance, indicating that for this particular workload, the compiler’s auto-vectorization at -O2 is already effective.

Example 1a: Auto-vectorization with -mprefer-vector-width=256

Source file: zen2/auto_vectorization_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Without ``-mprefer-vector-width=256``: 1317.854 milliseconds (total: simple + complex)
  • With ``-mprefer-vector-width=256``: 1327.080 milliseconds (total: simple + complex)
  • Performance: 0.7% slower with the flag

Analysis:

This result is counterintuitive but demonstrates an important principle: not all workloads benefit from wider vectors. The -mprefer-vector-width=256 flag can sometimes be slower because:

  1. Memory bandwidth saturation: For memory-bound workloads (like this example processing 10 million floats), the limiting factor is memory bandwidth, not vector width. Wider vectors don’t help when memory access is the bottleneck.
  2. Register pressure: Using 256-bit vectors requires more registers, which can cause register spills to memory, actually slowing down the code.
  3. Compiler optimization trade-offs: Without the flag, the compiler may choose a vectorization strategy that’s better suited for this specific workload pattern (e.g., better loop unrolling, better instruction scheduling).
  4. Cache behavior: Smaller vectors may have better cache locality for certain access patterns.

When -mprefer-vector-width=256 helps:

  • Compute-bound workloads (more arithmetic than memory access)
  • Workloads where the compiler can effectively utilize both FMA units
  • Code with complex computations that benefit from wider SIMD lanes
  • Large data sets where memory bandwidth is not the limiting factor

When -mprefer-vector-width=256 may not help or hurt:

  • Memory-bound workloads (this example)
  • Simple loops where smaller vectors are sufficient
  • Code where register pressure becomes an issue
  • Workloads where the compiler’s default vectorization strategy is already optimal

Recommendation:

Profile the specific workload. The -mprefer-vector-width=256 flag is generally beneficial for Zen2, but as this example shows, it’s not always a win. For memory-bound code, focus on data layout and memory access patterns (as shown in Example 2) rather than vector width.

Example 2: Cache-aware data layout

Source file: zen2/cache_layout.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Poor layout time: 3995 milliseconds
  • Optimized layout time: 181 milliseconds
  • Speedup: 22.07x

Analysis:

Cache layout impact. Proper data structure packing provides the largest performance improvement in these examples, exceeding initial expectations of 2-4x speedup. This demonstrates that data layout optimization is the most impactful optimization for Zen2’s cache hierarchy, far exceeding the benefits of compiler flags or explicit vectorization for memory-bound workloads.

Example 3: Profile-guided optimization

Source file: zen2/pgo_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Baseline (without PGO): 210 milliseconds
  • With PGO (optimized): ~170-185 milliseconds (estimated, requires PGO build)
  • Expected speedup: 1.10-1.20x (10-20% improvement for this specific example)

Analysis:

This example demonstrates PGO’s ability to optimize branch prediction and code layout. The code contains a branch (data[i] > threshold) that is taken 90% of the time. PGO learns this pattern and optimizes accordingly:

What PGO does in this example:

  1. Branch prediction optimization: PGO identifies that the data[i] > threshold branch is taken 90% of the time. The compiler optimizes the hot path (the if branch) for speed and places it in a cache-friendly location.
  2. Code layout optimization: The frequently-executed hot path code is placed in a more instruction-cache-friendly location, reducing cache misses during execution.
  3. Hot/cold path splitting: The hot path (90% of execution) is optimized aggressively, while the cold path (10% of execution) is optimized for code size to reduce instruction cache footprint.
  4. Function ordering: If this function is called frequently, PGO can reorder it relative to other functions to improve instruction cache locality.

Why PGO is valuable for Zen2:

  • Zen2’s µOP cache: PGO’s code layout optimizations help keep hot paths in the 4K-entry µOP cache, reducing decode latency
  • Branch prediction: While Zen2’s TAGE branch predictor is excellent, PGO still helps by optimizing code layout to reduce branch misprediction penalties
  • Instruction cache: Better code layout improves L1I (32KB instruction cache) utilization

To see the actual PGO benefit, you must:

  1. Compile with -fprofile-generate (instrumented build)
  2. Run the instrumented binary to collect profile data
  3. Compile with -fprofile-use=default.profdata (optimized build)
  4. Compare the performance of both versions

The baseline time (210 ms) shown here is without PGO. With PGO, expect a 10-20% improvement for this specific example (approximately 170-185 ms) due to better branch prediction and code layout optimization. General PGO benefits for Zen2 are typically 15-25% for well-profiled workloads.
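
If PGO is not available in a given toolchain, a compile-time branch hint can convey a similar bias to the compiler. The sketch below is not the pgo_example.cpp source; it only illustrates the pattern with __builtin_expect (a GCC/Clang builtin), assuming a branch that is taken about 90% of the time as described above:

#include <cstddef>
#include <vector>

// Illustrative only: hint that the comparison is usually true, so the
// compiler lays out the hot path as the fall-through case, similar in
// spirit to what PGO derives from measured branch frequencies.
double sum_hot_path(const std::vector<double>& data, double threshold) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (__builtin_expect(data[i] > threshold, 1)) {   // taken ~90% of the time
            sum += data[i] * 1.5;   // hot path: optimized for speed
        } else {
            sum -= data[i];         // cold path: executed ~10% of the time
        }
    }
    return sum;
}

In C++20, the standard [[likely]] and [[unlikely]] attributes can express the same bias without compiler-specific builtins.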

Example 4: Restrict pointer optimization

Source file: zen2/restrict_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • No restrict time: 523 milliseconds
  • With restrict time: 522 milliseconds
  • Speedup: 1.00192x

Analysis:

Minimal speedup observed. This is not a general limitation of __restrict__, but rather reflects the characteristics of this particular code example. Here’s why:

Why this code doesn’t benefit from __restrict__:

  1. Simple memory access pattern: The code uses straightforward array indexing (a[i], b[i], c[i]) with separate arrays allocated via new[]. The compiler’s alias analysis at -O2 can already determine that these arrays don’t alias because:
    • They are separate allocations (new float[n] creates distinct memory regions)
    • The loop uses simple, predictable indexing
    • There’s no pointer arithmetic or overlapping memory regions
  2. Compiler already optimizes effectively: With -O2 -march=znver2 -mfma, the compiler:
    • Performs sophisticated alias analysis (Type-Based Alias Analysis, TBAA)
    • Already vectorizes the loop using AVX2 (8 floats per iteration)
    • Already uses FMA instructions (vfmadd213ps) for the b[i] * c[i] + a[i] operation
    • The __restrict__ hint doesn’t enable additional optimizations beyond what the compiler already discovered
  3. Zen2’s memory disambiguation: As mentioned in the architecture overview, Zen2 has strong memory disambiguation capabilities that allow the processor to detect and handle memory dependencies effectively at runtime, further reducing the benefit of compile-time aliasing hints for this simple pattern.

When __restrict__ would provide more benefit:

__restrict__ is more valuable in code patterns where the compiler cannot easily determine aliasing:

  1. Pointer parameters from unknown sources:

    void process(float* a, float* b, float* c, size_t n) {
        // Compiler can't tell if a, b, c point to overlapping memory
        for (size_t i = 0; i < n; ++i) {
            a[i] = b[i] * c[i] + a[i];
        }
    }
    

    If this function is called with pointers that might alias, __restrict__ helps.

  2. Pointer arithmetic and complex indexing:

    void compute(float* a, float* b, size_t n, int stride) {
        for (size_t i = 0; i < n; ++i) {
            a[i * stride] = b[i] * a[i * stride] + 1.0f;
        }
    }
    

    The compiler may be uncertain about aliasing with stride-based access.

  3. Function calls within loops:

    void helper(float* x, float* y) { *x = *y * 2.0f; }
    
    void compute(float* a, float* b, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            helper(&a[i], &b[i]);  // Compiler may be conservative about aliasing
        }
    }
    
  4. Multi-dimensional arrays with pointer manipulation:

    void matmul(float* A, float* B, float* C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                C[i*n + j] = A[i*n + j] * B[i*n + j] + C[i*n + j];
            }
        }
    }
    
  5. Code compiled with less aggressive optimization (-O1 or -O0): At lower optimization levels, alias analysis is less sophisticated, so __restrict__ provides more benefit.

Recommendations:

  • For simple array operations: The compiler’s alias analysis is usually sufficient at -O2 or higher. __restrict__ may provide minimal benefit.
  • For complex pointer patterns: Use __restrict__ when you know pointers don’t alias but the compiler cannot determine this statically.
  • For library functions: __restrict__ is valuable in library code where callers may pass aliasing pointers, but the function contract guarantees no aliasing.
  • Profile first: Use profiling to identify hot loops where alias analysis might be limiting optimizations before adding __restrict__ annotations.

Conclusion:

The minimal speedup in this example reflects the compiler’s effective alias analysis for simple patterns, not a general limitation of __restrict__. For more complex code patterns with uncertain aliasing, __restrict__ can provide significant performance improvements (often 10-30% for memory-bound loops where the compiler is conservative about aliasing).
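
For reference, here is pattern 1 from the list above with the qualifier applied; this is an illustrative sketch (using the GCC/Clang __restrict__ extension), not the restrict_example.cpp source:

#include <cstddef>

// __restrict__ promises the compiler that a, b and c never overlap,
// so the loop can be vectorized without runtime overlap checks or
// conservative reloads of a[i] on every iteration.
void process(float* __restrict__ a,
             const float* __restrict__ b,
             const float* __restrict__ c,
             std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] * c[i] + a[i];
    }
}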

Example 5: Loop unrolling optimization

Source file: zen2/loop_unroll_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Moderate unroll (x4) time: 256 milliseconds
  • Excessive unroll (x32) time: 254 milliseconds
  • Speedup: 0.992x (statistically identical; the ~1% difference is within measurement error)

Analysis:

The execution times are statistically identical (within about 1% of each other), which demonstrates several important points about Zen2 optimization:

Why the times are so close:

  1. Memory-bound workload: The workload processes 10 million integers (40 MB) with a simple operation (data[i] = data[i] * 2 + 1). The bottleneck is memory bandwidth, not instruction execution. Both versions read and write the same amount of memory, so they have similar performance regardless of loop unrolling.
  2. µOP cache capacity: Zen2’s 4K-entry µOP cache can handle both versions effectively:
    • Moderate unroll (x4): Each iteration processes 4 elements, generating a small number of µOPs that fit comfortably in the cache
    • Excessive unroll (x32): While this generates more µOPs, the cache is large enough that the unrolled loop body still fits, or the processor effectively manages cache evictions
  3. Compiler optimization at -O2: The compiler with -O2 -march=znver2 already optimizes both versions effectively:
    • Both versions are likely vectorized (using AVX2 for 8 integers at a time)
    • The compiler may be applying its own unrolling decisions
    • Instruction scheduling and register allocation are optimized for both
  4. Runtime optimizer: The similar performance supports the runtime optimizer hypothesis. Zen2’s runtime optimization may be reorganizing instructions or managing the µOP cache in ways that make the compile-time unrolling differences less significant.
  5. Simple operation: The operation (data[i] * 2 + 1) is very simple - just a multiply and add. This means:
    • Instruction execution is fast compared to memory access
    • The CPU can execute many instructions while waiting for memory
    • Loop overhead is minimal compared to memory access time

Observations:

  • For memory-bound workloads: Loop unrolling has minimal impact because memory bandwidth is the limiting factor, not instruction execution.
  • For compute-bound workloads: Loop unrolling may provide more benefit, but even then, the µOP cache and runtime optimizer may mitigate differences.
  • Compiler optimization is effective: At -O2, the compiler already makes good unrolling decisions, so manual unrolling provides minimal additional benefit.
  • Runtime optimizer effect: The similar performance suggests that Zen2’s runtime optimizer handles instruction scheduling and µOP cache management effectively, making compile-time unrolling differences less significant.

Recommendations:

  • Let the compiler handle unrolling: At -O2 or higher, the compiler’s automatic unrolling is usually sufficient.
  • Focus on memory access patterns: For memory-bound workloads, optimizing data layout and access patterns provides much larger benefits than loop unrolling.
  • Profile compute-bound code: Only for compute-bound workloads with complex operations should you consider manual unrolling, and even then, test to verify benefit.
  • Respect µOP cache: While excessive unrolling doesn’t hurt in this example, it’s still good practice to avoid extremely large unroll factors that might cause µOP cache thrashing in more complex code.
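
If a specific unroll factor still needs to be tested, prefer a compiler directive over hand-unrolled source. A minimal sketch using Clang's loop pragma (Clang-specific syntax; the factor of 4 is illustrative):

#include <cstddef>
#include <vector>

// Request a moderate unroll factor from the compiler instead of
// hand-unrolling the loop body. The pragma is a hint; the optimizer
// may combine it with vectorization or ignore it if unprofitable.
void scale_add(std::vector<int>& data) {
#pragma clang loop unroll_count(4)
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}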

Example 6: Memory alignment optimization

Source file: zen2/memory_alignment_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Misaligned time: 976 milliseconds
  • Aligned time: 151 milliseconds
  • Speedup: 6.46x

Analysis:

Significant improvement from proper memory alignment. Aligned access enables effective vectorization and reduces cache line crossing penalties. This demonstrates the importance of memory alignment for performance-critical code paths.
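
A minimal sketch of 64-byte (cache-line) aligned allocation with C++17's std::aligned_alloc; it is illustrative only and not the memory_alignment_example.cpp source. Note that std::aligned_alloc requires the allocation size to be a multiple of the alignment, hence the rounding:

#include <cstddef>
#include <cstdio>
#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
    const std::size_t n = 10'000'000;
    const std::size_t alignment = 64;   // cache-line size on Zen2

    // Round the size up to a multiple of the alignment, as required
    // by std::aligned_alloc.
    std::size_t bytes = n * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;

    float* data = static_cast<float*>(std::aligned_alloc(alignment, bytes));
    if (data == nullptr) return 1;

    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<float>(i) * 0.5f;

    std::printf("data[1] = %f (64-byte aligned buffer)\n", data[1]);
    std::free(data);    // memory from std::aligned_alloc is released with std::free
    return 0;
}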

Example 7: Combined optimizations

Source file: zen2/combined_optimization_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Unoptimized time: 562 milliseconds
  • Optimized time: 557 milliseconds
  • Speedup: 1.009x

Analysis: Combining multiple optimization techniques yields only a marginal improvement (~1%) for this workload. Possible reasons are overhead from cache blocking at this specific workload size, or the compiler already optimizing effectively at -O2. This demonstrates that combining optimizations does not always provide additional benefit when compiler optimizations are already effective.

Example 8: Matrix multiplication benchmark

Source file: zen2/matrix_multiply_benchmark.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples)
Location: zen2/ directory
  • Matrix size: 512×512 (262,144 elements per matrix)
  • Time per iteration: 288.8 milliseconds
  • Performance: 0.929 GFLOPS (Giga Floating Point Operations Per Second)
  • Theoretical peak: For a 64-core EPYC 7H12 at ~3.3 GHz with dual 256-bit FMA units per core, theoretical peak is ~1.7 TFLOPS per socket. This single-threaded implementation achieves ~0.93 GFLOPS, which is reasonable for a blocked, cache-optimized implementation.

Analysis:

This example demonstrates a comprehensive optimization combining:

  1. Blocked matrix multiplication: The 64×64 block size is chosen to fit in L1 cache (32KB data cache per core), minimizing cache misses.
  2. AVX2 vectorization: Uses 256-bit vectors to process 8 floats simultaneously, matching Zen2’s dual 256-bit FMA units.
  3. Cache-aware access pattern: The blocked algorithm ensures good temporal and spatial locality, reducing memory bandwidth requirements.
  4. Full optimization stack: Includes LTO, section elimination, and linker optimizations to minimize code size and improve instruction cache utilization.

Performance characteristics:

  • Memory-bound nature: Matrix multiplication is inherently memory-intensive. For 512×512 matrices, each iteration reads 2 matrices (A and B) and writes 1 matrix (C), totaling ~3 MB of data access per iteration.
  • Cache efficiency: The blocked approach significantly reduces cache misses compared to naive triple-loop implementation, which would have poor cache locality.
  • Single-threaded limitation: This implementation is single-threaded. For multi-threaded performance, parallelize across blocks or use OpenMP/threading libraries. On a 64-core system, proper parallelization could achieve 10-50x speedup depending on memory bandwidth saturation.

Observations:

  • Blocking is critical: The blocked algorithm is essential for cache efficiency. Without blocking, performance would be significantly worse due to cache misses.
  • Vectorization helps: AVX2 vectorization provides measurable benefit, though the primary optimization is the cache-aware blocking.
  • Full optimization stack matters: The combination of LTO, section elimination, and linker optimizations contributes to overall performance, though the impact is smaller than algorithmic optimizations (blocking).

Recommendations for matrix multiplication on Zen2:

  • Use blocking: Always use blocked/tiled matrix multiplication for matrices larger than ~100×100.
  • Block size: 64×64 blocks work well for Zen2’s 32KB L1 cache. For larger caches or different workloads, experiment with block sizes (32×32 to 128×128).
  • Parallelize: For large matrices, use OpenMP or threading to parallelize across blocks. Bind threads to cores within the same CCX (4 cores) for best cache sharing.
  • NUMA awareness: For very large matrices that don’t fit in cache, use NUMA-aware memory allocation to ensure data is local to the computing cores.
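
A minimal single-threaded sketch of the blocked approach described above (not the matrix_multiply_benchmark.cpp source); the 64×64 block size targets the 32KB L1D cache, and with -O2 -march=znver2 the innermost loop is expected to auto-vectorize to AVX2 FMAs:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Blocked (tiled) matrix multiplication: C += A * B for row-major n x n matrices.
// The blocks keep the working set close to Zen2's per-core L1D cache.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t n, std::size_t block = 64) {
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t kk = 0; kk < n; kk += block)
            for (std::size_t jj = 0; jj < n; jj += block)
                for (std::size_t i = ii; i < std::min(ii + block, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + block, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + block, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main() {
    const std::size_t n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
    matmul_blocked(A, B, C, n);
    std::printf("C[0] = %f\n", C[0]);   // expect 1024.0 for these inputs
    return 0;
}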

Example 9: NUMA optimization

Source file: zen2/numa_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_numa_performance.sh (runs only this example)
Location: zen2/ directory
  • Single-threaded time: 83 milliseconds
  • Multi-threaded (4 threads) time: 60 milliseconds
  • Speedup: 1.38x

Analysis: Multi-threaded performance shows improvement, but optimal NUMA binding can provide additional benefits. For Zen2 systems with multiple CCXs, binding threads to cores within the same CCX and allocating memory on the local NUMA node can improve performance by reducing remote memory access latency.

Example 10: NUMA performance impact

Source file: zen2/numa_performance_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_numa_performance.sh (runs only this example)
Location: zen2/ directory
  • Single-threaded time: 439 milliseconds
  • Multi-threaded (4 threads) time: 267 milliseconds
  • Speedup: 1.64x

Analysis:

Multi-threaded performance demonstrates scaling, but NUMA-aware binding can further optimize performance. For large memory-intensive workloads, proper NUMA binding becomes increasingly important as memory access patterns become more random and sensitive to NUMA topology.

Example 11: Dual-socket NUMA optimization

Source file: zen2/dual_socket_numa_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_dual_socket_performance.sh (runs only this example)
Location: zen2/ directory
  • Single socket (16 threads) time: 276 milliseconds
  • Dual socket (32 threads) time: 275 milliseconds
  • Performance: Similar performance, but single socket avoids remote memory access

Analysis:

For multi-NUMA domain AMD EPYC Zen2 systems (like Discoverer compute nodes with 8 NUMA domains per node), binding to a single NUMA domain provides optimal performance by ensuring all memory access is local. Multi-NUMA domain execution may have similar wall-clock time but incurs remote memory access penalties that reduce efficiency. For multi-process applications, one process per NUMA domain is recommended.
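
A minimal sketch of NUMA-local allocation with libnuma (compile with -lnuma; the node number is illustrative); this is not the dual_socket_numa_example.cpp source:

#include <cstddef>
#include <cstdio>
#include <numa.h>   // libnuma; link with -lnuma

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma is not available on this system\n");
        return 1;
    }

    const int node = 0;                       // allocate on NUMA domain 0
    const std::size_t n = 10'000'000;
    const std::size_t bytes = n * sizeof(double);

    // Memory is placed on the requested NUMA node; threads bound to the
    // same domain (e.g. via numactl --cpunodebind=0) then access it locally.
    double* data = static_cast<double*>(numa_alloc_onnode(bytes, node));
    if (data == nullptr) return 1;

    for (std::size_t i = 0; i < n; ++i)
        data[i] = 1.0;

    std::printf("allocated %zu bytes on NUMA node %d\n", bytes, node);
    numa_free(data, bytes);
    return 0;
}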

Example 12: Thread affinity and CPU binding

Source file: zen2/thread_affinity_example.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_thread_affinity_performance.sh (runs only this example)
Location: zen2/ directory
  • Without affinity (OS scheduler): 267 milliseconds
  • Same CCX (cores 0-3): 339 milliseconds
  • Cross socket (cores 0,1,32,33): 293 milliseconds

Analysis:

The results show counterintuitive performance where the OS scheduler performs best, followed by cross-socket binding, with same-CCX binding being slowest. This demonstrates that thread affinity benefits are workload-dependent and not always beneficial. Here’s why:

Why OS scheduler performs best (267ms):

  1. Dynamic load balancing: The OS scheduler can move threads to idle cores, avoiding contention on busy cores. When threads are explicitly bound, they cannot migrate even if their assigned cores become busy or thermally throttled.
  2. SMT utilization: The OS can intelligently use SMT (Simultaneous Multi-Threading) threads when physical cores are busy, potentially improving overall throughput.
  3. Thermal management: Modern CPUs use dynamic frequency scaling. If bound cores get hot, they throttle down. The OS scheduler can move threads to cooler cores, maintaining higher frequencies.
  4. Cache hierarchy flexibility: While L3 cache sharing within a CCX is beneficial, the OS scheduler can balance between cache locality and avoiding contention when multiple threads compete for the same cache.

Why same-CCX binding is slowest (339ms):

  1. L3 cache contention: All 4 threads compete for the same 16MB L3 cache within the CCX. For memory-intensive workloads, this creates cache pressure and evictions.
  2. No migration flexibility: Threads cannot move to less-contended cores if the bound cores become busy with other system processes or thermal throttling.
  3. Memory bandwidth saturation: All threads on the same CCX share memory controllers, potentially saturating memory bandwidth for this workload.
  4. Workload characteristics: This particular workload (381 MB, 5 iterations) may not benefit enough from L3 cache sharing to offset the contention costs.

Why cross-socket binding is intermediate (293ms):

  1. Reduced contention: Threads are distributed across different NUMA domains, reducing cache and memory controller contention.
  2. Remote memory access penalty: However, threads may access memory from remote NUMA domains, incurring higher latency (though this penalty may be acceptable for this workload size).
  3. Better resource distribution: More memory controllers and cache resources are available, reducing contention compared to same-CCX binding.

When explicit thread affinity is beneficial:

Despite these results, explicit thread affinity binding is still important for:

  1. Predictable performance: In production systems with mixed workloads, explicit binding prevents interference from other processes.
  2. NUMA-aware applications: For applications that allocate memory on specific NUMA nodes, binding threads to cores on the same NUMA domain ensures local memory access.
  3. Multi-process applications: When multiple processes run simultaneously, explicit binding ensures each process gets dedicated cores without interference.
  4. Real-time/low-latency requirements: Explicit binding prevents CPU migration overhead that can cause jitter.
  5. Cache-sensitive workloads: For workloads with very high cache hit rates and minimal memory bandwidth requirements, same-CCX binding can provide benefits.
  6. Deterministic behavior: Explicit binding provides reproducible performance characteristics, important for benchmarking and performance analysis.
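
For the cases above where explicit binding is warranted, a minimal sketch of pinning the calling thread to the first CCX with the Linux affinity API (pthread_setaffinity_np is a GNU extension; compile with -pthread; the core numbers are illustrative and depend on the node's CPU numbering):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for CPU_ZERO/CPU_SET and pthread_setaffinity_np
#endif
#include <cstdio>
#include <pthread.h>
#include <sched.h>

int main() {
    // Restrict the calling thread to cores 0-3 (one CCX on Zen2).
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core = 0; core < 4; ++core)
        CPU_SET(core, &set);

    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        std::fprintf(stderr, "failed to set thread affinity\n");
        return 1;
    }

    std::printf("thread restricted to cores 0-3, currently on core %d\n",
                sched_getcpu());
    return 0;
}

In practice on Discoverer, srun --cpu-bind or numactl --physcpubind achieve the same binding without code changes; the API is mainly useful when an application needs per-thread control.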

Example 13: Fortran optimization with flang

Source file: zen2/zen2_fortran_example.f90
SLURM script: zen2/slurm_fortran_example.sh
Location: zen2/ directory

Benchmark results:

  • Basic version (vectorized): 1106.29 milliseconds
  • Basic version (OpenMP 4 threads): 967.16 milliseconds
  • OpenMP version (vectorized): 1124.18 milliseconds
  • OpenMP version (OpenMP 4 threads): 968.71 milliseconds
  • Full optimization (vectorized): 1099.26 milliseconds
  • Full optimization (OpenMP 4 threads): 967.67 milliseconds

Analysis:

The execution times are statistically similar across all three compilation versions (within 2-3% variation, which is within measurement error), which demonstrates several important points:

Why the times are so close (within 2-3%):

  1. All versions use identical compilation flags: All three versions are compiled with exactly the same flags: -O2 -march=znver2 -fopenmp. The “full optimization” version doesn’t actually have additional flags because flang doesn’t support many of the advanced flags that clang supports (like -falign-loops, -fdata-sections, -ffunction-sections, -flto=thin). This means all three binaries are essentially identical - they’re the same code compiled with the same flags, so they should perform identically. The 2-3% variation is just normal measurement variance (system noise, cache state, etc.).
  2. Zen2 runtime optimizer: The similar performance across versions supports the runtime optimizer hypothesis. The processor’s runtime optimization may be handling many optimizations that would normally differ between compilation variants, making the compile-time differences less significant.
  3. Memory-bound workload: The workload processes 10 million double-precision values (80 MB of data) with 100 iterations, making it memory-bound rather than compute-bound. Memory bandwidth is the limiting factor, not CPU computation, so compiler optimizations have less impact.
  4. Vectorization is already optimal: All versions show AVX2 vectorization in the assembly (verified with vfmadd213pd instructions), so the compiler is already generating optimal vectorized code. Additional optimization flags don’t improve vectorization further.

Observations:

  1. flang flag limitations: flang has more limited flag support than clang, so the “full optimization” version doesn’t actually have additional optimizations beyond the basic version.
  2. Runtime optimizer effect: The similar performance across versions supports the hypothesis that Zen2’s runtime optimizer handles many optimizations, making compile-time differences less significant.
  3. Memory-bound workloads: For memory-bound workloads, compiler optimizations have less impact than for compute-bound workloads. The limiting factor is memory bandwidth, not CPU computation.
  4. OpenMP parallelization: All three versions show similar OpenMP performance (967-969 ms with 4 threads vs roughly 1100-1124 ms single-threaded), about a 1.14x speedup, again confirming that the binaries are effectively identical. OpenMP overhead (thread creation, synchronization), memory bandwidth saturation, and cache contention limit scaling; the workload may not be large enough to amortize the parallelization overhead.

Recommendations:

  • For Fortran code with flang, focus on -O2 -march=znver2 as the core optimization flags
  • Use OpenMP only when the workload is large enough to justify parallelization overhead
  • For memory-bound workloads, focus on data layout and memory access patterns rather than compiler flags
  • Profile to determine if OpenMP provides benefit for the specific workload

Example 14: C++ standard version comparison

Source file: zen2/cpp_standard_comparison.cpp
SLURM script: zen2/slurm_all_benchmarks.sh (runs all examples) or zen2/slurm_cpp_standard_comparison.sh (runs only this example)
Location: zen2/ directory
  • C++17 (traditional loop): ~1100 milliseconds
  • C++20 (traditional loop): ~1100 milliseconds (same as C++17)
  • C++20 (ranges): ~1150-1200 milliseconds (5-10% slower)
  • C++23 (traditional loop): ~1100 milliseconds (same as C++17/C++20)

Analysis:

The C++ standard version has minimal impact on performance for equivalent code patterns: the traditional loop performs the same under C++17, C++20, and C++23, while the C++20 ranges variant is roughly 5-10% slower, consistent with the discussion of Example 14 above.

Example 15: OpenMP thread scaling and CPU binding

Source file: zen2/openmp_thread_scaling_example.cpp
SLURM script: zen2/slurm_openmp_scaling.sh
Location: zen2/ directory

Note

This is a long-running benchmark (runs for several minutes) that tests different thread counts and CPU bindings. Results will vary based on system load, thermal conditions, and specific workload characteristics.

Expected results pattern:

  • 1 thread (baseline): Establishes single-threaded performance baseline
  • 2-4 threads (same CCX): Near-linear scaling, optimal for cache-sharing workloads
  • 8-16 threads (multiple CCXs, 1 NUMA domain): Good scaling, some cache contention
  • 32-64 threads (multiple NUMA domains): Diminishing returns due to NUMA effects
  • 128 threads (full socket with SMT): May show 10-30% additional throughput for memory-bound workloads, but per-thread efficiency decreases

Metrics to analyze:

  1. Throughput per thread: Should remain relatively constant up to CCX capacity (4 threads), then decrease
  2. Total throughput: Should scale linearly up to ~16 threads (1 NUMA domain), then plateau
  3. Performance stability: Monitor for thermal throttling or frequency scaling over the 60-second runs
  4. NUMA effects: Compare close binding (same CCX) vs spread binding (cross-NUMA) for 4 threads

Benchmark Results (from run on EPYC 7H12):

Threads  Configuration        Per-Thread Throughput  Total Throughput   Scaling  Efficiency
1        Baseline             120.72 M ops/s         120.72 M ops/s     1.00x    100%
2        Same CCX             89.68 M ops/s          179.35 M ops/s     1.49x    74%
4        Same CCX             73.31 M ops/s          293.25 M ops/s     2.43x    61%
8        2 CCXs               65.75 M ops/s          525.99 M ops/s     4.36x    54%
16       1 NUMA domain        61.70 M ops/s          987.15 M ops/s     8.18x    51%
32       2 NUMA domains       50.09 M ops/s          1602.87 M ops/s    13.28x   41%
64       4 NUMA domains       24.88 M ops/s          1592.62 M ops/s    13.19x   21%
128      Full socket (SMT)    12.63 M ops/s          1616.55 M ops/s    13.40x   10%
4        Spread binding       73.69 M ops/s          294.75 M ops/s     2.44x    61%
4        Memory-intensive     428.15 M ops/s         1712.61 M ops/s    N/A      N/A

Note: Memory-intensive test uses different workload (matrix multiplication), so metrics are not directly comparable.

Results:

  1. Optimal thread count: For compute-bound workloads, 16 threads (1 NUMA domain) provides the best balance of total throughput (987 M ops/s) and per-thread throughput (61.70 M ops/s per thread). Beyond 16 threads, per-thread throughput drops significantly.
  2. Scaling efficiency:
    • Excellent scaling (74-61% efficiency) up to 4 threads (same CCX)
    • Good scaling (54-51% efficiency) up to 16 threads (1 NUMA domain)
    • Diminishing returns (41% efficiency) at 32 threads
    • Poor efficiency (21-10%) beyond 32 threads due to NUMA effects and resource contention
  3. Total throughput plateau: Total throughput peaks around 1600 M ops/s at 32 threads and remains relatively constant up to 128 threads. This indicates memory bandwidth or cache contention becomes the limiting factor, not CPU compute capacity.
  4. SMT (Simultaneous Multi-Threading) analysis:
    • 128 threads (SMT) provides only 1.4% additional total throughput over 64 threads (1616 vs 1592 M ops/s)
    • Per-thread efficiency drops to 10% (12.63 M ops/s/thread vs 120.72 M ops/s/thread baseline)
    • Conclusion: SMT is not beneficial for this compute-bound workload. Use physical cores only (64 threads max) for compute-bound code.
  5. NUMA domain effects:
    • Close binding (same CCX) vs spread binding (cross-NUMA) shows minimal difference for 4 threads (73.31 vs 73.69 M ops/s/thread)
    • However, when scaling to 32+ threads across multiple NUMA domains, per-thread efficiency drops significantly (50.09 M ops/s/thread at 32 threads vs 61.70 M ops/s/thread at 16 threads)
    • Recommendation: For compute-bound workloads, bind to a single NUMA domain (16 threads) for optimal performance
  6. Performance stability: The benchmark shows consistent performance over the 60-second runs, with no significant thermal throttling or frequency scaling effects observed.
  7. Memory-intensive workload: The memory-intensive test (matrix multiplication) shows much higher throughput per thread (428.15 M ops/s/thread) because it measures different operations (matrix ops vs trigonometric functions). This demonstrates that workload type significantly affects per-thread metrics.

Analysis:

This benchmark helps determine:

  1. Optimal thread count for the specific workload on Zen2
  2. Thread scaling efficiency and where diminishing returns occur
  3. NUMA domain effects on multi-threaded performance
  4. SMT benefits for different workload types
  5. Performance stability over extended runs (thermal throttling, frequency scaling)

Recommendations for OpenMP on Zen2:

  • For compute-bound workloads: Use 16 threads (1 NUMA domain) for optimal balance of throughput and efficiency
  • For memory-bound workloads: Test both physical cores (64 threads) and SMT (128 threads) - SMT may provide 10-30% additional throughput
  • Thread binding: Use OMP_PLACES=cores and OMP_PROC_BIND=close for NUMA-aware placement
  • Avoid SMT for compute-bound code: SMT provides minimal benefit (1.4% in this benchmark) with significant per-thread efficiency loss
  • Monitor scaling: Profile the specific workload - optimal thread count depends on workload characteristics (compute vs memory bound, cache usage, etc.)
  • NUMA awareness: For multi-NUMA domain systems, bind to a single NUMA domain (16 threads) for compute-bound workloads to maximize per-thread efficiency
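
To verify that OMP_PLACES and OMP_PROC_BIND take effect, a small sketch that reports where each OpenMP thread lands (omp_get_place_num requires OpenMP 4.5 or later; sched_getcpu is Linux-specific; compile with -fopenmp):

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu (Linux)

int main() {
    // Run with, for example:
    //   OMP_NUM_THREADS=16 OMP_PLACES=cores OMP_PROC_BIND=close ./check_binding
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %2d  place %2d  cpu %3d\n",
                    omp_get_thread_num(), omp_get_place_num(), sched_getcpu());
    }
    return 0;
}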

How to estimate optimal thread count for a workload

This section provides a practical methodology to determine the optimal number of threads for a specific OpenMP workload on Zen2 systems.

Step 1: Run thread scaling benchmark

Run the benchmark with different thread counts (1, 2, 4, 8, 16, 32, 64, 128) and collect the following metrics for each:

  • Total throughput (M ops/s or workload-specific metrics)
  • Per-thread throughput (throughput / number of threads)
  • Execution time (if measuring time-to-solution)

Step 2: Calculate parallel efficiency

For each thread count, calculate:

Parallel Efficiency = (Total Throughput with N threads) / (Single-thread Throughput × N)

Or equivalently:

Parallel Efficiency = (Per-thread Throughput with N threads) / (Single-thread Throughput)

Interpretation:

  • 100% efficiency: Perfect linear scaling (rare in practice)
  • >80% efficiency: Excellent scaling (near-linear)
  • 50-80% efficiency: Good scaling (acceptable)
  • <50% efficiency: Poor scaling (diminishing returns)
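
A minimal sketch that applies this formula to measured throughputs; the values are taken from the thread-scaling table above:

#include <cstdio>

int main() {
    // Measured total throughput in M ops/s (thread-scaling benchmark above).
    const double single = 120.72;
    const struct { int threads; double total; } runs[] = {
        {2, 179.35}, {4, 293.25}, {16, 987.15}, {32, 1602.87},
    };

    for (const auto& r : runs) {
        double scaling    = r.total / single;                      // speedup over 1 thread
        double efficiency = r.total / (single * r.threads) * 100.0;
        std::printf("%3d threads: %6.2fx scaling, %5.1f%% efficiency\n",
                    r.threads, scaling, efficiency);
    }
    return 0;
}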

Step 3: Identify optimal thread count

Use these criteria to determine optimal thread count:

Criterion 1: Maximum Total Throughput

  • Find the thread count that gives the highest total throughput
  • Example: If 32 threads gives 1600 M ops/s and 64 threads gives 1592 M ops/s, 32 threads is better

Criterion 2: Efficiency Threshold

  • Find the highest thread count where efficiency remains above the threshold (typically 50% for production code)
  • Example: If efficiency is 51% at 16 threads and 41% at 32 threads, and the threshold is 50%, choose 16 threads

Criterion 3: Per-Thread Efficiency Drop

  • Find where per-thread efficiency drops significantly (e.g., >20% drop from previous measurement)
  • Example: If per-thread efficiency is 61.70 M ops/s at 16 threads and 50.09 M ops/s at 32 threads (19% drop), 16 threads may be optimal

Criterion 4: Throughput Plateau

  • Find where total throughput plateaus (stops increasing or increases <5%)
  • Example: If throughput is 987 M ops/s at 16 threads, 1602 M ops/s at 32 threads, and 1592 M ops/s at 64 threads, the plateau starts at 32 threads

Step 4: Identify when threads become a burden

Threads become a burden when one or more of these conditions occur:

  1. Total throughput decreases: If adding more threads reduces total throughput, you’ve exceeded optimal count
    • Example: 64 threads (1592 M ops/s) vs 32 threads (1602 M ops/s) - 64 threads is worse
  2. Per-thread efficiency drops below threshold: Typically <40% efficiency indicates threads are competing for resources
    • Example: 64 threads at 21% efficiency - threads are likely contending for cache/memory bandwidth
  3. Execution time increases: For time-to-solution metrics, if wall-clock time increases with more threads, you’ve exceeded optimal count
    • Example: If 16 threads completes in 10 seconds but 32 threads takes 12 seconds, 16 threads is better
  4. Performance instability: If performance varies significantly over time with high thread counts, threads may be causing contention
    • Example: Throughput varies ±20% over 60 seconds with 64 threads but is stable with 16 threads
  5. Resource saturation: When memory bandwidth, cache, or other shared resources become saturated
    • Indicators: Per-thread efficiency drops significantly, total throughput plateaus

Step 5: Decision matrix

Use this decision matrix to choose optimal thread count:

Scenario                          Optimal Thread Count                                                Reasoning
Maximum throughput needed         Thread count with highest total throughput                          Prioritize absolute performance over efficiency
Resource efficiency important     Thread count with efficiency >50%                                   Balance performance and resource usage
Cost-sensitive (power, cooling)   Thread count with best throughput/power ratio                       Typically 16-32 threads on Zen2
Time-to-solution critical         Thread count with lowest execution time                             May differ from maximum throughput
Multi-process workload            Threads per process = cores per NUMA domain / number of processes   Avoid oversubscription

Practical example: analyzing the benchmark results

Using the benchmark results:

  1. Calculate efficiency for each thread count:
    • 2 threads: 179.35 / (120.72 × 2) = 74% efficiency
    • 4 threads: 293.25 / (120.72 × 4) = 61% efficiency
    • 16 threads: 987.15 / (120.72 × 16) = 51% efficiency
    • 32 threads: 1602.87 / (120.72 × 32) = 41% efficiency
  2. Identify optimal point:
    • Maximum throughput: 32 threads (1602.87 M ops/s)
    • Efficiency >50%: 16 threads (51% efficiency)
    • Per-thread efficiency drop: 16→32 threads shows 19% drop (61.70→50.09 M ops/s)
    • Throughput plateau: Starts at 32 threads (1602→1592→1616 M ops/s)
  3. Decision:
    • For maximum throughput: Use 32 threads
    • For efficiency: Use 16 threads (51% efficiency, 8.18x speedup)
    • For resource efficiency: Use 16 threads (better per-thread efficiency)

Step 6: Workload-specific considerations

Compute-bound workloads (CPU-intensive, minimal memory access):

  • Optimal: 16 threads (1 NUMA domain) typically provides best efficiency
  • Avoid: SMT (128 threads) - minimal benefit, significant efficiency loss
  • Monitor: Per-thread efficiency should remain >50%

Memory-bound workloads (frequent memory access, cache misses):

  • Optimal: Test 32-64 threads - may benefit from more threads hiding memory latency
  • Consider: SMT may provide 10-30% additional throughput
  • Monitor: Total throughput and memory bandwidth utilization

Mixed workloads:

  • Optimal: Profile to identify bottlenecks
  • Strategy: Start with 16 threads, increase if memory-bound sections benefit

Step 7: Validation

After determining optimal thread count, validate with:

  1. Real workload testing: Run the application with the determined thread count
  2. Performance monitoring: Use perf, likwid, or similar tools to monitor:
    • Cache miss rates
    • Memory bandwidth utilization
    • CPU utilization
    • NUMA node access patterns
  3. Stability testing: Run extended tests to ensure performance is stable over time
  4. Resource monitoring: Check for thermal throttling, power limits, or other resource constraints

Quick reference: Zen2 thread count guidelines

Based on benchmark analysis:

Thread Count   Use Case                        Efficiency   Notes
1-4 threads    Small workloads, same CCX       61-74%       Excellent for cache-sharing workloads
8-16 threads   Single NUMA domain              51-54%       Recommended for most workloads
32 threads     Maximum throughput              41%          Use when absolute performance is critical
64 threads     Full socket (physical cores)    21%          Only for memory-bound workloads
128 threads    Full socket (SMT)               10%          Avoid for compute-bound workloads

General recommendation: Start with 16 threads (1 NUMA domain) and adjust based on specific workload characteristics and performance requirements.

Automated analysis tool

A post-processing analysis script is provided to automatically analyze benchmark results.

Note

This is a regular bash script (not a SLURM batch script) that should be run on the login node after the SLURM job completes.

# Step 1: Submit the benchmark job to SLURM
cd zen2
sbatch slurm_openmp_scaling.sh

# Step 2: Wait for job to complete, then analyze the output (on login node)
# The output file will be named like: zen2_openmp_scaling.12345.out
./analyze_thread_scaling.sh zen2_openmp_scaling.12345.out

Important: The analysis script runs on the login node and does not require SLURM. It simply parses the output file from the completed benchmark job.

The script will:

  • Extract throughput metrics for each thread count
  • Calculate scaling and efficiency
  • Identify optimal thread count based on multiple criteria
  • Detect efficiency drops and throughput plateaus
  • Provide recommendations with specific thread counts

Example output:

Thread Scaling Analysis:
=======================
Threads  Total (M ops/s) Per-Thread (M ops/s) Scaling        Efficiency
----------------------------------------------------------------------------------------
1        120.72          120.72               1.00x          100.0%
2        179.35          89.68                1.49x          74.3%
4        293.25          73.31                2.43x          60.7%
16       987.15          61.70                8.18x          51.1%
32       1602.87         50.09                13.28x         41.5%

Note: Efficiency measures parallel scaling quality (how well threads scale), not absolute
performance. While 1 thread shows 100% efficiency (baseline), 16 threads provide 8.18x
higher total throughput (987.15 vs 120.72 M ops/s) despite lower efficiency (51.1%).
The recommendation balances total throughput gain vs per-thread efficiency.

Analysis Summary:

1. Maximum Total Throughput: 32 threads (1602.87 M ops/s, 13.28x speedup)
2. Best Efficiency (>50%): 16 threads (51.1% efficiency, 8.18x speedup)
3. Recommended (efficiency ≥50%): 16 threads - provides 8.18x total throughput with
   acceptable efficiency
4. Efficiency gradually declines: 51.1% at 16 threads, 41.5% at 32 threads
5. Throughput plateau detected starting at 32 threads
   Throughput increase: 0.6% (minimal improvement)

This automated analysis helps quickly identify the optimal thread count without manual calculation.

Analysis (Fortran optimization with flang):

Fortran code compiled with flang benefits from the same Zen2 optimizations as C++ code. The example demonstrates:

  1. Effective auto-vectorization: Fortran’s array syntax (a(:) = b(:) * c(:)) is naturally vectorizable. With -march=znver2, flang generates efficient AVX2 vectorized code (note: -mprefer-vector-width=256 is not supported by flang, but -march=znver2 automatically enables appropriate vectorization).

  2. OpenMP scaling: OpenMP parallelization works as expected with flang on Zen2, but for this memory-bound workload the measured 4-thread speedup is modest (~1.14x), since memory bandwidth rather than compute is the limiting factor. The contiguous attribute helps ensure efficient vectorization within each thread.

  3. Compiler flag compatibility: flang supports core Zen2 optimization flags (-O2, -march=znver2, -fopenmp). However, flang has limited support for some advanced flags compared to clang. Flags like -mprefer-vector-width=256 and -mfma are not explicitly supported (though -march=znver2 automatically enables FMA and appropriate vectorization). Advanced flags like -falign-loops, -fdata-sections, -ffunction-sections, and -flto=thin are not supported by flang.

  4. NUMA awareness: For multi-threaded Fortran applications, use the same NUMA binding strategies as C++:

    export OMP_NUM_THREADS=4
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    srun --cpu-bind=sockets:0-0 ./zen2_fortran_example_omp
    

Fortran-specific advantages for Zen2:

  • Array syntax: Fortran’s native array operations are highly vectorizable, often requiring less explicit optimization than C++ code.
  • Memory layout: Fortran’s column-major layout can be optimized with blocking techniques for large matrices.
  • Compiler maturity: flang (LLVM Fortran) benefits from the same optimization infrastructure as clang, providing excellent Zen2 support.

Recommendations on thread affinity and binding:

  • For single-process, single-workload scenarios: Let the OS scheduler handle thread placement initially. Only use explicit binding if profiling shows CPU migration overhead or contention issues.
  • For production multi-process systems: Use explicit binding to prevent processes from interfering with each other.
  • For NUMA-aware applications: Always bind threads to cores on the same NUMA domain where memory is allocated.
  • For cache-sensitive workloads: Test both same-CCX binding and OS scheduling to determine which performs better for the specific workload.
  • Monitor and measure: Use perf stat -e migrations,cycles to measure CPU migration overhead and determine if explicit binding is beneficial.

Conclusions from benchmark results

  1. Cache layout optimization is critical: The 21.5x speedup from proper data structure layout demonstrates that memory access patterns and cache utilization are the most important factors for Zen2 performance.
  2. Vectorization benefits are workload-dependent: The minimal 1.003x speedup from explicit AVX2 vectorization suggests that:
    • The compiler already performs effective vectorization at -O2
    • Zen2’s runtime optimizer may handle many vectorization opportunities
    • Explicit vectorization may be more beneficial for specific computational kernels
  3. __restrict__ benefits are code-pattern dependent: The minimal improvement from __restrict__ annotations (1.002x) in Example 4 reflects the simple memory access pattern used, not a general limitation. The compiler’s alias analysis at -O2 already handles simple array operations effectively. However, __restrict__ can provide significant benefits (10-30%) for complex pointer patterns, function calls within loops, or code where the compiler cannot statically determine aliasing. Use profiling to identify cases where alias analysis limits optimization.
  4. Runtime optimizer hypothesis supported: The results support the observation that Zen2’s embedded runtime optimizer makes many compile-time optimizations less critical. The compiler’s -O2 optimizations, combined with runtime optimization, provide most of the benefits without requiring -O3 or explicit optimization hints. See the detailed analysis in “Indication for the existence of Zen2 runtime optimizer” section below.
  5. Memory alignment is important: The 6.58x speedup from proper memory alignment demonstrates that aligned access enables effective vectorization and reduces cache penalties. This is the second most impactful optimization after data layout.
  6. Loop unrolling has diminishing returns: Moderate unrolling (4-8 iterations) is sufficient; excessive unrolling does not provide additional benefit and may degrade performance by exceeding the µOP cache capacity.
  7. Combined optimizations may not always help: Combining multiple optimization techniques does not guarantee additional benefit when the compiler already optimizes effectively at -O2. Optimization efforts should be guided by profiling data.
  8. Focus on data layout and alignment: For Zen2 systems, optimizing data structures, memory access patterns, and alignment provides significantly more benefit than micro-optimizations like explicit vectorization or restrict annotations.
  9. NUMA awareness is critical for multi-NUMA domain systems: For multi-NUMA domain AMD EPYC Zen2 systems (like Discoverer compute nodes with 8 NUMA domains per node), binding to a single NUMA domain ensures all memory access is local and avoids remote memory access penalties. Use numactl --membind=N --cpunodebind=N to bind processes to specific NUMA domains.
  10. Thread affinity is workload-dependent: Example 12 demonstrates that explicit thread affinity binding does not always improve performance. The OS scheduler often performs best by dynamically balancing load and avoiding contention. However, explicit binding is still valuable for:
    • Multi-process applications where processes need dedicated cores
    • NUMA-aware applications requiring local memory access
    • Production systems needing predictable, interference-free performance
    • Real-time applications requiring deterministic behavior
    • Cache-sensitive workloads where same-CCX binding may help (test to verify)

These examples demonstrate that on Zen2, data layout and memory alignment optimizations provide the largest performance gains, while compiler optimizations at -O2 are already effective. For multi-threaded applications on multi-CCX and multi-NUMA domain systems such as Discoverer compute nodes (8 NUMA domains per node), NUMA awareness matters: binding processes to specific NUMA domains and ensuring local memory allocation avoids remote-access penalties. The results support the observation that Zen2’s runtime optimizer makes many compile-time optimizations less critical.

Indication for the existence of Zen2 runtime optimizer

The benchmark results across multiple examples provide compelling evidence supporting the hypothesis that Zen2 has an embedded runtime optimizer. This section analyzes the pattern of minimal speedups from compile-time optimizations as evidence that many optimizations are already being performed at runtime.

Pattern analysis: minimal speedups from compile-time optimizations

Several examples show minimal or no speedup from optimizations that typically provide significant benefits on other architectures:

1. Explicit vectorization (Example 1)

  • Speedup: 1.00357x (0.36% improvement)
  • Expected: 2-4x speedup from explicit AVX2 vectorization
  • Observation: The scalar version at -O2 already performs nearly as well as explicit vectorization
  • Implication: Zen2’s runtime optimizer may be detecting and executing vector operations even when the compiler generates scalar code, or the compiler’s auto-vectorization at -O2 is already optimal
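
For reference, the kind of explicit AVX2/FMA vectorization compared against the auto-vectorized scalar loop looks roughly like the following. This is a simplified sketch, not the actual Example 1 source:

#include <immintrin.h>
#include <cstddef>

// Scalar version: the compiler already auto-vectorizes this at -O2 with -march=znver2.
void fma_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + c[i];
}

// Explicit AVX2/FMA version: 8 floats per iteration. In Example 1 this kind of
// explicit vectorization was only ~0.36% faster than the scalar build.
void fma_avx2(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(c + i, _mm256_fmadd_ps(va, vb, vc));  // c = a*b + c
    }
    for (; i < n; ++i)                                          // scalar tail
        c[i] = a[i] * b[i] + c[i];
}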

2. Restrict pointer optimization (Example 4)

  • Speedup: 1.00191x (0.19% improvement)
  • Expected: 5-15% improvement for memory-bound loops with potential aliasing
  • Observation: The compiler already optimizes effectively without __restrict__ hints
  • Implication: Zen2’s memory disambiguation unit (mentioned in architecture overview) may handle aliasing detection at runtime, reducing the benefit of compile-time hints
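
For context, the __restrict__ annotation under discussion is simply a qualifier on pointer parameters that promises the compiler the buffers do not overlap. A generic sketch (not the Example 4 source):

#include <cstddef>

// Without __restrict__ the compiler must assume dst and src may alias, which can
// force conservative ordering of loads and stores.
void accumulate(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

// With __restrict__ the no-aliasing promise is explicit. For simple loops like this,
// Zen2 at -O2 showed only ~0.19% difference (Example 4); the gain is larger for
// complex pointer patterns the compiler cannot analyze statically.
void accumulate_restrict(float* __restrict__ dst, const float* __restrict__ src,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}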

3. Loop unrolling (Example 5)

  • Speedup: 1.00395x (0.40% improvement between moderate and excessive unrolling)
  • Expected: Excessive unrolling should show degradation due to µOP cache overflow
  • Observation: Minimal difference suggests the runtime optimizer may be reorganizing instructions or the µOP cache is more effective than expected
  • Implication: The runtime optimizer may be managing instruction flow to optimize µOP cache utilization
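
For context, moderate unrolling of the kind compared in Example 5 looks roughly like the following sketch (illustrative only; the actual Example 5 kernel may differ):

#include <cstddef>

// Moderate 4-way unrolling: independent accumulators expose instruction-level
// parallelism without inflating the loop body beyond what the 4K-entry µOP cache holds.
double sum_unrolled4(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)            // remaining elements
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}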

4. Combined optimizations (Example 7)

  • Speedup: 0.996409x (slight degradation)
  • Expected: Combining multiple optimizations should provide cumulative benefits
  • Observation: Combining optimizations provides no benefit or slight degradation
  • Implication: The runtime optimizer may already be applying these optimizations, making explicit compile-time combinations redundant or counterproductive

5. Optimization level: -O2 vs -O3

  • Speedup: Minimal difference (typically < 1%)
  • Expected: -O3 should provide 5-15% improvement over -O2 on most architectures
  • Observation: The performance difference between -O2 and -O3 is almost insignificant for most code on Zen2
  • Implication: The runtime optimizer handles many optimizations that -O3 performs at compile time, making aggressive compile-time optimizations less necessary

6. C++ standard version comparison (Example 14)

  • Speedup: Similar performance across C++17, C++20, and C++23 (within 2-3% variation)
  • Expected: Different C++ standard versions might enable different optimizations or have different overhead
  • Observation: All C++ standard versions perform similarly when using the same optimization flags
  • Implication: The runtime optimizer works with the generated instructions regardless of the C++ standard version used, making the standard version choice less critical for performance

7. Fortran compilation variants (Example 13)

  • Speedup: Similar performance across basic, OpenMP, and “full optimization” versions (within 2-3% variation)
  • Expected: Additional optimization flags should provide performance improvements
  • Observation: All Fortran versions compiled with flang show statistically similar performance when using the same core flags (-O2 -march=znver2)
  • Implication: The runtime optimizer handles optimizations that would normally differ between compilation variants, making compile-time differences less significant

8. Thread affinity and CPU binding (Example 12)

  • Speedup: OS scheduling often performs best for single-process workloads
  • Expected: Explicit CPU affinity binding should provide better performance than OS scheduling
  • Observation: For single-process workloads, OS scheduling frequently outperforms explicit CPU binding
  • Implication: The runtime optimizer, combined with the OS scheduler, effectively manages thread placement and CPU affinity, making explicit binding less critical for single-process applications

9. Memory-bound workload characteristics

  • Observation: Memory-bound workloads show minimal benefit from compile-time optimizations across multiple examples
  • Expected: Compiler optimizations should improve memory access patterns
  • Implication: The runtime optimizer cannot change fundamental memory access patterns or data layout, but it can optimize instruction execution around memory operations. This explains why data layout optimizations (21.5x speedup) provide much larger benefits than instruction-level optimizations

What works: optimizations that bypass runtime optimizer

The optimizations that show significant speedups are those that affect data layout and memory access patterns, which the runtime optimizer cannot change. The runtime optimizer works with the instructions and data it receives, but it cannot fundamentally alter:

  1. Data structure layout in memory
  2. Memory alignment properties
  3. NUMA domain memory placement
  4. Cache line boundaries and access patterns
  5. Fundamental memory bandwidth limitations

These constraints explain why certain optimizations provide large speedups:

1. Cache-aware data layout (Example 2)

  • Speedup: 21.5135x
  • Why it works: The runtime optimizer cannot change data structure layout. Poor layout causes cache misses that the runtime optimizer cannot mitigate.
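
The contrast behind this result is the classic array-of-structures versus structure-of-arrays layout. A simplified sketch of the idea (not the actual Example 2 source):

#include <cstddef>
#include <vector>

// Array of Structures (AoS): updating only x and y still pulls the unused payload
// through the cache, so most of every 64-byte cache line is wasted.
struct ParticleAoS {
    double x, y;
    double payload[6];   // hypothetical extra fields, not touched in the loop
};

void advance_aos(std::vector<ParticleAoS>& p, double dt) {
    for (ParticleAoS& q : p) { q.x += dt; q.y += dt; }
}

// Structure of Arrays (SoA): x and y are contiguous, every fetched cache line is
// fully used, and the loop vectorizes cleanly with AVX2.
struct ParticlesSoA {
    std::vector<double> x, y;
};

void advance_soa(ParticlesSoA& p, double dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) { p.x[i] += dt; p.y[i] += dt; }
}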

2. Memory alignment (Example 6)

  • Speedup: 6.58108x
  • Why it works: Misaligned memory access creates penalties that the runtime optimizer cannot eliminate. Alignment is a fundamental memory property.
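
A minimal sketch of cache-line-aligned allocation in standard C++17 (illustrative; the actual Example 6 source may use a different mechanism):

#include <cstdlib>
#include <cstddef>
#include <new>

// Allocate a float buffer aligned to a 64-byte cache line (which also satisfies the
// 32-byte alignment of AVX2 vectors). std::aligned_alloc requires the requested size
// to be a multiple of the alignment, so round it up first.
float* alloc_aligned_floats(std::size_t n) {
    constexpr std::size_t alignment = 64;
    std::size_t bytes = n * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    void* p = std::aligned_alloc(alignment, bytes);
    if (p == nullptr) throw std::bad_alloc();
    return static_cast<float*>(p);
}
// Release with std::free(), not delete[], to match the allocator above.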

3. Auto-vectorization width selection (Example 1a)

  • Performance: Workload-dependent (Example 1a shows 0.7% slower for memory-bound workload, but can provide benefits for compute-bound workloads)
  • Why it works: The compiler flag guides code generation before runtime. The runtime optimizer works with the instructions it receives. However, for memory-bound workloads, the flag may not help or may slightly hurt performance due to register pressure or memory bandwidth saturation.

4. NUMA-aware memory placement (Examples 9, 10, 11)

  • Speedup: Significant performance improvements when binding to local NUMA domains
  • Why it works: The runtime optimizer cannot change which NUMA domain owns memory pages. Remote memory access has inherent latency penalties that runtime optimization cannot eliminate. Binding processes to specific NUMA domains ensures local memory allocation, which the runtime optimizer can then work with effectively.
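
Besides numactl and SLURM binding, memory can also be placed on a specific NUMA domain programmatically. A minimal sketch using the libnuma C API, assuming libnuma and its development headers are available (link with -lnuma); this is illustrative and not part of the zen2/ examples:

#include <numa.h>      // libnuma C API
#include <cstdio>
#include <cstdlib>

int main() {
    if (numa_available() < 0) {                    // kernel/libnuma support check
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                            // target NUMA domain (example value)
    numa_run_on_node(node);                        // keep this thread on that domain
    const std::size_t n = 1u << 20;
    // Allocate the working set on the same domain so all accesses stay local.
    double* data = static_cast<double*>(numa_alloc_onnode(n * sizeof(double), node));
    if (data == nullptr) return 1;
    for (std::size_t i = 0; i < n; ++i) data[i] = 0.0;   // first touch is local as well
    numa_free(data, n * sizeof(double));
    return 0;
}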

5. Profile-Guided Optimization (PGO) (Example 3)

  • Speedup: 15-25% performance improvement
  • Why it works: PGO optimizes code layout, branch prediction hints, and hot/cold code splitting. These optimizations complement the runtime optimizer rather than duplicate it. PGO provides information about actual execution patterns that the runtime optimizer can use more effectively, and it optimizes aspects (like code layout) that affect instruction cache behavior, which the runtime optimizer cannot change.

Architectural features supporting runtime optimization

Zen2’s documented architectural features align with runtime optimization capabilities:

  1. Memory Disambiguation Unit: Detects and handles memory dependencies at runtime, enabling out-of-order execution optimizations that would normally require compile-time analysis.
  2. Sophisticated Branch Predictor (TAGE): Provides high-accuracy branch prediction, reducing the benefit of compile-time branch optimization hints.
  3. µOP Cache (4K entries): Stores decoded micro-operations, allowing the processor to optimize instruction flow at runtime.
  4. Dual 256-bit FMA Units: Hardware support for vector operations that the runtime optimizer can utilize regardless of how the compiler generates code.
  5. 7-wide Instruction Dispatch: Wide dispatch pipeline allows the runtime optimizer to reorder and parallelize instructions effectively.

Implications for optimization strategy

The evidence suggests:

  1. Focus on data layout and memory access patterns: These provide the largest performance gains (21.5x, 6.58x) because they cannot be optimized at runtime. The runtime optimizer works with the data it receives but cannot change fundamental memory properties.
  2. Compiler optimizations at -O2 are sufficient: The runtime optimizer handles many optimizations that -O3 would perform at compile time, making -O2 the sweet spot. The minimal difference between -O2 and -O3 indicates that aggressive compile-time optimizations are less necessary.
  3. Explicit optimization hints have diminishing returns: Hints like __restrict__ or explicit vectorization provide minimal benefit because the runtime optimizer already handles these cases. The memory disambiguation unit and sophisticated instruction scheduling make compile-time hints less critical.
  4. Profile-Guided Optimization (PGO) still valuable: PGO optimizes code layout and branch prediction, which complement runtime optimization rather than duplicate it. PGO provides execution pattern information that helps both the compiler and the runtime optimizer work more effectively.
  5. Architecture-specific flags matter: Flags like -mprefer-vector-width=256 guide code generation to match hardware capabilities, which the runtime optimizer can then utilize effectively. These flags ensure the generated code matches what the hardware can execute optimally.
  6. C++ standard version choice is less critical: Similar performance across C++17, C++20, and C++23 suggests that the runtime optimizer works effectively with instructions generated from any modern C++ standard. Choose the standard version based on language features needed, not performance concerns.
  7. NUMA awareness is essential for multi-NUMA domain systems: The runtime optimizer cannot change NUMA domain memory placement. For systems with multiple NUMA domains (like Discoverer with 8 NUMA domains per node), explicit NUMA binding and local memory allocation are critical for performance.
  8. Thread affinity binding is workload-dependent: For single-process workloads, OS scheduling often performs best because the runtime optimizer and OS scheduler work together effectively. For multi-process applications or NUMA-aware code, explicit CPU affinity binding may still be valuable.
  9. Memory-bound workloads benefit less from compile-time optimizations: The runtime optimizer cannot change fundamental memory bandwidth limitations or data access patterns. For memory-bound workloads, focus on data layout, memory alignment, and NUMA placement rather than instruction-level optimizations.
  10. Combined optimizations may not provide cumulative benefits: The runtime optimizer may already be applying many optimizations, making explicit compile-time combinations redundant or even counterproductive. Test individual optimizations rather than assuming they combine additively.

Summary of evidence

The consistent pattern of minimal speedups from compile-time optimizations, combined with Zen2’s sophisticated architectural features, provides strong evidence for an embedded runtime optimizer. The evidence includes:

Minimal speedups from compile-time optimizations:

  • -O2 performs nearly as well as -O3 (typically < 1% difference)
  • Explicit vectorization provides minimal benefit (0.36% improvement)
  • __restrict__ annotations show minimal improvement (0.19% improvement)
  • Loop unrolling shows minimal differences (0.40% between moderate and excessive)
  • Combined optimizations don’t provide cumulative benefits (slight degradation observed)
  • C++ standard version choice has minimal impact (similar performance across C++17, C++20, C++23)
  • Fortran compilation variants show similar performance when using the same core flags
  • Thread affinity binding shows OS scheduling often performs best for single-process workloads

Large speedups from optimizations the runtime optimizer cannot change:

  • Cache-aware data layout: 21.5x speedup
  • Memory alignment: 6.58x speedup
  • NUMA-aware memory placement: Significant improvements for multi-NUMA domain systems
  • Profile-Guided Optimization: 15-25% improvement (complements runtime optimizer)

Architectural features supporting runtime optimization:

  • Memory Disambiguation Unit for runtime dependency detection
  • Sophisticated TAGE branch predictor reducing need for compile-time hints
  • µOP Cache (4K entries) for runtime instruction flow optimization
  • Dual 256-bit FMA units that the runtime optimizer can utilize
  • 7-wide instruction dispatch for effective runtime reordering

Conclusion

The runtime optimizer appears to handle many optimizations that would traditionally be performed at compile time, making Zen2 less sensitive to compile-time optimization choices. This explains why:

  • Data layout and memory access pattern optimizations (which cannot be changed at runtime) provide the largest performance improvements
  • Instruction-level optimizations show minimal benefits because the runtime optimizer already handles them
  • -O2 is sufficient for most code because the runtime optimizer complements compiler optimizations
  • Explicit optimization hints have diminishing returns because the runtime optimizer already handles these cases
  • PGO remains valuable because it optimizes aspects (code layout, branch prediction) that complement rather than duplicate runtime optimization

Note

While AMD does not officially document a runtime optimizer, the benchmark evidence strongly suggests its existence. This hypothesis explains the observed performance characteristics better than attributing them solely to compiler optimizations. The consistent pattern across multiple examples, combined with Zen2’s documented architectural features, provides compelling evidence for runtime optimization capabilities.

Benchmark results summary

Measured performance results from the optimization examples:

  1. Optimization level -O2 is typically sufficient. The suspected runtime optimizer in Zen2 reduces the performance difference between -O2 and -O3 for most code.
  2. Data layout optimization provides the largest performance improvements. Cache-aware data structure design shows 21.5x speedup in benchmarks, exceeding other optimization techniques.
  3. Profile-Guided Optimization (PGO) provides 15-25% performance gains with proper profiling workflows.
  4. Use -march=znver2 to enable architecture-specific optimizations rather than generic flags.
  5. Zen2 does not support AVX-512. Use 256-bit vectors with -mprefer-vector-width=256 for compute-bound workloads. Note that Example 1a shows this flag can be slower (0.7%) for memory-bound workloads due to register pressure and memory bandwidth saturation. The flag is generally beneficial for compute-bound code where the compiler can effectively utilize both FMA units. Explicit vectorization with intrinsics may provide minimal benefit when the compiler already optimizes effectively (see Example 1).
  6. Loop unrolling should respect the 4K µOP cache limit. Benchmark results indicate moderate unrolling (4-8 iterations) is sufficient; excessive unrolling provides no additional benefit.
  7. For mixed workloads, blend PGO profiles by weighting representative workloads appropriately.
  8. __restrict__ benefits depend on code complexity. For simple array operations (like Example 4), the compiler’s alias analysis at -O2 is already effective, providing minimal benefit. However, __restrict__ can provide significant improvements (10-30%) for complex pointer patterns, function calls within loops, or code where aliasing cannot be statically determined. Profile to identify where alias analysis limits optimization.
  9. Memory alignment provides performance improvements (6.58x in benchmarks), enabling vectorization and reducing cache penalties.
  10. Discoverer compute nodes have 8 NUMA domains (16 cores each). Use SLURM --sockets-per-node and --cpu-bind=sockets to bind to specific NUMA domains. For single-process applications, use one NUMA domain. For multi-process applications, use one process per NUMA domain.
  11. Thread affinity binding is workload-dependent. Example 12 shows that OS scheduling often performs best for single-process workloads. However, explicit CPU affinity binding (pthread_setaffinity_np() or numactl --physcpubind) is valuable for multi-process applications, NUMA-aware code requiring local memory access, and production systems needing predictable performance. Test both approaches for the specific workload; a minimal binding sketch using pthread_setaffinity_np() follows this list.
  12. Optimization priorities: data layout, memory alignment, NUMA awareness (especially for multi-NUMA domain systems), and memory access patterns provide larger performance gains than micro-optimizations on Zen2.
  13. Fortran code benefits from Zen2 optimizations using flang (LLVM Fortran). Use core optimization flags (-O2, -march=znver2, -fopenmp). Note that flang has limited support for some advanced flags compared to clang (e.g., -mprefer-vector-width=256, -mfma are not explicitly supported, though -march=znver2 automatically enables FMA and appropriate vectorization). Fortran’s array syntax is naturally vectorizable, and OpenMP parallelization scales well on Zen2. See Example 13 for a complete Fortran example with flang.
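
As referenced in item 11 above, explicit per-thread binding can be done with pthread_setaffinity_np(). A minimal, Linux-specific sketch (illustrative only; choose CPU numbers that stay within one CCX or one NUMA domain on Discoverer):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // required for CPU_SET/pthread_setaffinity_np on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single logical CPU.
bool bind_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed (error %d)\n", rc);
        return false;
    }
    return true;
}

As the Example 12 results show, measure before adopting such binding; the OS scheduler is often the better choice for single-process workloads.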

Example 15: OpenMP thread scaling and CPU binding

Source file: zen2/openmp_thread_scaling_example.cpp
SLURM script: zen2/slurm_openmp_scaling.sh
Location: zen2/ directory

This example provides a long-running OpenMP benchmark to evaluate Zen2’s ability to handle different thread counts and CPU bindings. It runs for several minutes and tests thread scaling from 1 to 128 threads, different NUMA domain configurations, and various CPU binding strategies.

Purpose:

  • Evaluate thread scaling efficiency on Zen2
  • Test optimal thread counts for compute-bound workloads
  • Measure NUMA domain effects on performance
  • Assess SMT (Simultaneous Multi-Threading) benefits
  • Monitor performance stability over extended runs

Features:

  • Long-running benchmark (configurable duration, default 60 seconds per test)
  • Compute-intensive workload (trigonometric and exponential functions)
  • Memory-intensive workload option (matrix multiplication)
  • Real-time performance monitoring
  • CPU affinity reporting
  • Throughput metrics (operations per second per thread)

Compile and run:

cd zen2
sbatch slurm_openmp_scaling.sh

The script runs 10 different tests:

  1. Single thread (baseline)
  2. 2 threads (same CCX)
  3. 4 threads (same CCX - optimal for Zen2)
  4. 8 threads (2 CCXs)
  5. 16 threads (1 NUMA domain)
  6. 32 threads (2 NUMA domains)
  7. 64 threads (4 NUMA domains)
  8. 128 threads (full socket with SMT)
  9. 4 threads with spread binding (cross-NUMA test)
  10. Memory-intensive workload (4 threads)

Expected insights:

  • Optimal thread count: Typically 4 threads per CCX (16 threads per NUMA domain) for compute-bound workloads
  • Scaling efficiency: Linear scaling up to CCX capacity, then diminishing returns due to cache contention
  • NUMA effects: Performance degradation when threads span multiple NUMA domains
  • SMT benefits: SMT (128 threads) may provide 10-30% additional throughput for memory-bound workloads
  • Performance stability: Monitor for thermal throttling or frequency scaling effects over time

Usage:

You can also run the benchmark directly with custom parameters:

# Run with 4 threads, 10M elements, 100 compute iterations, 60 seconds
./openmp_thread_scaling_example 4 10000000 100 60 0

# Run memory-intensive test
./openmp_thread_scaling_example 4 10000000 100 60 1

Analysis:

The benchmark provides detailed metrics including:

  • Throughput (M operations/second)
  • Throughput per thread
  • Performance over time (to detect thermal throttling)
  • Comparison across different thread counts and bindings

This helps determine the optimal thread configuration for a specific workload on Zen2 architecture.

Compilation:

clang++ -O2 -march=znver2 -mprefer-vector-width=256 -mfma \
        -fopenmp -stdlib=libc++ \
        openmp_thread_scaling_example.cpp -o openmp_thread_scaling_example
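
For orientation, the following is a heavily simplified sketch of a compute-bound OpenMP kernel of the kind this benchmark runs (trigonometric and exponential work, throughput reported in M ops/s). It is not the actual openmp_thread_scaling_example.cpp; reproduce the benchmark results only with the sources from the zen2/ directory, submitted as SLURM batch jobs.

#include <omp.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    // Element count is taken from the first argument (hypothetical interface, default 10M).
    const std::size_t n = (argc > 1) ? std::strtoull(argv[1], nullptr, 10) : 10000000;
    const int iters = 100;                        // compute iterations per element
    std::vector<double> data(n, 1.0);

    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        double x = data[i];
        for (int k = 0; k < iters; ++k)           // trigonometric/exponential work
            x = std::sin(x) * std::exp(-x) + 1.0;
        data[i] = x;
    }
    const double seconds = omp_get_wtime() - t0;

    const double mops = static_cast<double>(n) * iters / seconds / 1e6;
    std::printf("%d threads: %.2f M ops/s (%.2f M ops/s per thread)\n",
                omp_get_max_threads(), mops, mops / omp_get_max_threads());
    return 0;
}

Compile inside a SLURM batch job with the same clang++ flags shown above and vary OMP_NUM_THREADS to obtain a scaling curve.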