GROMACS (on CPU)¶
Versions and build types available¶
Warning
This document describes running GROMACS on the Discoverer CPU cluster.
Supported versions¶
Note
The versions of GROMACS installed in the software repository are built and supported by the Discoverer HPC team. The MPI builds should be employed for running the actual simulations (mdrun) and deriving trajectories, while the threadMPI ones should be regarded mostly as a tool set for trajectory post-processing.
To check which GROMACS versions are currently supported on Discoverer, execute on the login node:
module avail gromacs
The following environment module naming convention is applied for the modules servicing the access to the software repository:
gromacs/MAJOR_N/MAJOR_N.MINOR_N-comp-num_lib-gpuavail-mpi_lib
where:
MAJOR_N
- the major number of the GROMACS version (example: 2022)
MINOR_N
- the minor number of the GROMACS version (example: 1, which stands for 2022.1)
comp
- the compiler collection employed for compiling the source code (example: intel)
num_lib
- the numerical methods’ library providing the BLAS and FFTW implementations that libgromacs is linked against (example: openblas)
gpuavail
- shows whether the build supports GPU acceleration (example: nogpu, which means no GPU support)
mpi_lib
- the MPI library the GROMACS code is linked against (example: openmpi, which implies the use of the Open MPI library)
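As an illustration, one of the modules used later in this document decodes as follows (a sketch; the field breakdown is inferred from the naming convention above, with fftw3-openblas covering both the FFT and BLAS libraries):

# gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
#   MAJOR_N  = 2025           (GROMACS 2025 series)
#   MINOR_N  = 2              (GROMACS 2025.2)
#   comp     = llvm           (LLVM compiler collection)
#   num_lib  = fftw3-openblas (FFTW 3 for FFTs, OpenBLAS for BLAS)
#   gpuavail = nogpu          (no GPU acceleration)
#   mpi_lib  = threadmpi      (built-in thread-MPI, no external MPI)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
gmx --version   # print the build information of the loaded module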
The installed versions are compiled based on the following recipes:
https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/gromacs
Two different builds are available:
Discoverer provides two different GROMACS installations optimized for different use cases (see also Choosing the right build):
1. Thread-MPI build (single-node optimized)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
Use this build when:
- Running simulations on a single compute node
- Want maximum performance for single-node simulations
- Need to run analysis tools (with -ntmpi 1)
- Working with AMD EPYC processors
- Running CPU-only simulations
Features:
- Optimized for single-node performance
- Can run analysis tools by setting -ntmpi 1
- Excellent NUMA optimization
- Lower memory overhead
- Faster startup times
Executable name: gmx
Example usage:
# Single-node simulation
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix

# Analysis tool (single thread-MPI rank)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx mdrun -ntmpi 1 -s npt.tpr -deffnm npt
For more details see Single-Node Thread-MPI Script.
2. External MPI build (Multi-CPU-core and multi-node capable)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi
Use this build when:
- Running simulations across multiple compute nodes
- Need multi-node parallelization
- Using OpenMPI for distributed computing
- Large-scale simulations requiring multiple nodes
Features:
- Supports multi-node simulations
- Uses OpenMPI for inter-node communication
- Compatible with SLURM multi-node job submission
- Can handle larger systems across multiple nodes
Executable name: gmx_mpi
Example usage:
# Multi-node simulation (on 2 nodes, with 128 CPU cores per node)
mpirun -np 256 gmx_mpi mdrun -ntomp 2 -pin auto -s prefix.tpr -deffnm prefix
For more details see Multi-Node External MPI Script.
User-supported versions¶
Users are welcome to bring or compile their own builds of GROMACS and use them, but those builds will not be supported by the Discoverer HPC team.
Running simulations (mdrun)¶
Running simulations means invoking mdrun for generating trajectories based on a given TPR file.
Warning
You MUST NOT execute simulations directly on the login node (login.discoverer.bg). Run your simulations as Slurm jobs only.
Warning
Write your trajectories and analysis results only inside your Personal scratch and storage folder (/discofs/username) and DO NOT (under any circumstances) use your Home folder (/home/username) for that purpose!
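For example, a job directory on the scratch file system can be prepared as follows before submitting (a sketch; the directory and file names are placeholders matching the examples below):

# Create a working directory on the scratch file system
mkdir -p /discofs/$USER/run_gromacs
# Stage the input files (TPR, job script, etc.) there before submission
cp prefix.tpr run_gromacs.sh /discofs/$USER/run_gromacs/
cd /discofs/$USER/run_gromacs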
Single-Node Thread-MPI Script¶
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_node
#SBATCH --time=512:00:00        ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # MUST BE 1 for thread-MPI
#SBATCH --ntasks-per-node 1     # MUST BE 1 for thread-MPI
#SBATCH --cpus-per-task 256     # N MPI threads x M OpenMP threads (128 * 2 for AMD EPYC 7H12)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

# AMD EPYC 7H12 optimization: 2 threads per core
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))   # 256 / 2 = 128

# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY

cd $SLURM_SUBMIT_DIR

gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -v -s prefix.tpr -deffnm prefix -pin auto
Specify the parameters and resources required for successfully running and completing the job:
- Slurm partition of compute nodes, based on your project resource reservation (--partition)
- job name, under which the job will be seen in the queue (--job-name)
- wall time for running the job (--time)
- number of occupied compute nodes (--nodes)
- number of MPI processes per node (--ntasks-per-node)
- number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
- version of GROMACS to run after module load (see Supported versions)
Multi-Node External MPI Script¶
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_multi_node
#SBATCH --time=512:00:00        ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 2               # Multiple nodes
#SBATCH --ntasks-per-node 128   # MPI ranks per node
#SBATCH --cpus-per-task 2       # OpenMP threads per MPI rank
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

# Optimize InfiniBand communication
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun gmx_mpi mdrun -v -s prefix.tpr -deffnm prefix -pin auto
In the scripts above, edit the parameters and resources required for successfully running and completing the job:
- Slurm partition of compute nodes, based on your project resource reservation (--partition)
- job name, under which the job will be seen in the queue (--job-name)
- wall time for running the job (--time)
- number of occupied compute nodes (--nodes)
- number of MPI processes per node (--ntasks-per-node)
- number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
- version of GROMACS to run after module load (see Supported versions)
Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/run_gromacs.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch run_gromacs.sh
Upon successful submission, the standard output will be directed by Slurm into the file /discofs/$USER/run_gromacs/slurm.%j.out (where %j stands for the Slurm job ID), while the standard error output will be stored in /discofs/$USER/run_gromacs/slurm.%j.err.
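The job can then be monitored with the standard Slurm tools, for instance (a sketch; replace 123456 with the job ID reported by sbatch):

# List your jobs currently in the queue
squeue -u $USER
# Follow the standard output of the running job
tail -f /discofs/$USER/run_gromacs/slurm.123456.out
# After completion, inspect the accounting record of the job
sacct -j 123456 --format=JobID,JobName,Elapsed,State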
Running GROMACS tools¶
Script for Executing Single-Threaded Non-Interactive GROMACS Tools¶
Use this for single-threaded GROMACS tools like grompp, editconf, etc.:
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_thread
#SBATCH --time=00:30:00         ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # Single node
#SBATCH --ntasks-per-node 1     # Single task
#SBATCH --cpus-per-task 2       # 1 CPU core (2 threads on AMD EPYC)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Single-threaded tools (no -ntmpi needed)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx editconf -f protein.gro -o protein_box.gro -c -d 1.0 -bt cubic
gmx solvate -cp protein_box.gro -cs spc216.gro -o solv.gro -p topol.top
gmx grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
Script for Executing Interactive GROMACS Tools in Non-Interactive Mode¶
Use this for GROMACS tools like cluster, rms, gyrate, hbond, do_dssp, etc.:
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_interactive
#SBATCH --time=01:00:00         ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # Single node
#SBATCH --ntasks-per-node 1     # Single task
#SBATCH --cpus-per-task 2       # 1 CPU core (2 threads on AMD EPYC)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Interactive tools using echo pipes for input
# Format: echo -e "input1\ninput2\n..." | gmx tool_name [options]
#
# How echo pipes simulate interactive input:
# echo -e "4\n4" simulates: Type "4", press Enter, type "4", press Enter
# So "4\n4" replaces the interactive sequence: 4 [Enter] 4 [Enter]

# Example 1: Cluster analysis
echo -e "1\n1" | gmx cluster -f trajectory.trr -s structure.tpr -n index.ndx \
    -cutoff 0.15 -method jarvis-patrick -M 0 \
    -o cluster_output -g cluster.log -dist cluster_dist \
    -cl cluster.pdb -nst 250 -wcl 10000

# Example 2: RMSD analysis
echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr \
    -o rmsd.xvg -tu ns

# Example 3: Radius of gyration
echo -e "1\n1" | gmx gyrate -f trajectory.trr -s structure.tpr \
    -o gyrate.xvg -p -n index.ndx

# Example 4: Hydrogen bond analysis
echo -e "1\n1" | gmx hbond -f trajectory.trr -s structure.tpr \
    -num hbond.xvg -tu ns

# Example 5: Secondary structure analysis
echo -e "1\n1" | gmx do_dssp -f trajectory.trr -s structure.tpr \
    -o ss.xpm -sc scount.xvg
Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/gromacs_tools.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch gromacs_tools.sh
Tool | Purpose | Typical Input | What You’d Type Interactively | Example Command |
---|---|---|---|---|
gmx cluster | Cluster analysis | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx cluster ... |
gmx rms | RMSD calculation | "4\n1" | Type "4", press Enter, type "1", press Enter | echo -e "4\n1" \| gmx rms ... |
gmx gyrate | Radius of gyration | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx gyrate ... |
gmx hbond | Hydrogen bonds | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx hbond ... |
gmx do_dssp | Secondary structure | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx do_dssp ... |
gmx trjconv | Trajectory conversion | "0" | Type "0", press Enter | echo -e "0" \| gmx trjconv ... |
gmx select | Atom selection | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx select ... |
Understanding the Table Columns:
- “Typical Input”: The echo pipe string that simulates interactive input in SLURM
- “What You’d Type Interactively”: The exact keystrokes you’d make if running the tool on a personal workstation
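For instance, the gmx trjconv row of the table could be run in batch mode like this (a sketch; the file names are placeholders, and group 0 selects the whole system for output):

# Non-interactive trjconv: answer the single "output group" prompt with 0 (System)
echo "0" | gmx trjconv -f prefix.xtc -s prefix.tpr -o prefix_mol.xtc -pbc mol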
How to Convert Interactive Commands to Batch Commands:
Step-by-Step Translation Process:
Interactive Session (on personal workstation):
$ gmx rms
Select group for least squares fit (1-4): 4
Select group for RMSD calculation (1-4): 1
Batch Session (in SLURM script):
echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg
Translation Rules:
- Each number you type → becomes part of the echo string
- Each Enter key press → becomes \n (newline)
- Multiple inputs → separated by \n
- Final Enter → usually not needed (tool processes automatically)
Interactive Action | Echo String | Explanation |
---|---|---|
Type "4", press Enter | "4\n" | First input, followed by a newline |
Type "1", press Enter | "1" | Second input (trailing newline not needed) |
Combined | "4\n1" | Both inputs in one string |
Common Group Numbers (the exact numbers depend on your system; check them with gmx make_ndx as shown below):
- “0”: System (all atoms)
- “1”: Protein
- “2”: Non-protein
- “3”: Water
- “4”: Backbone (protein backbone only)
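To see the group numbers actually defined for your system, the groups can be listed non-interactively, for example (a sketch; the structure file name is a placeholder):

# Print the available index groups and quit ("q" answers the interactive prompt)
echo "q" | gmx make_ndx -f structure.gro -o index.ndx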
Tips for Converting Interactive Commands:
- Test interactively first: Run the command on your workstation to see what inputs are needed
- Count the inputs: Note how many numbers you need to type
- Add newlines: Put \n between each input
- Use echo -e: The -e flag enables \n interpretation
- Pipe to command: Use | to feed the input to the GROMACS tool
Tips for Interactive Tools:
- Test locally first: Run the command interactively to see what inputs are needed
- Use echo -e: The -e flag enables interpretation of backslash escapes like \n
- Check group numbers: Use gmx make_ndx to see available groups and their numbers
- Multiple inputs: Separate multiple inputs with \n for newlines
- Error handling: Check the log files for any input errors (see the example below)
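A quick post-run check for input errors might look like this (a sketch; the file names follow the Slurm and GROMACS output conventions used in the scripts above):

# Look for errors reported on the Slurm error stream and in the GROMACS log files
grep -i "error" slurm.*.err
grep -i "fatal error" *.log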
Technical details¶
Choosing the right build¶
Scenario | Recommended Build | Module to Load |
---|---|---|
Single-node simulation | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi |
Analysis tools | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi |
Multi-node simulation | External MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi |
Note
AMD EPYC optimization applies to both builds. The 2:1 thread-to-core ratio and other AMD EPYC-specific optimizations work with both thread-MPI and external MPI builds. The choice between builds is based on single-node vs. multi-node requirements, not processor optimization.
Performance comparison on Discoverer:
- Thread-MPI: 10-20% faster for single-node simulations
- External MPI: Required for multi-node, but slower for single-node
- Memory Usage: Thread-MPI uses ~30% less memory per node
Important notes:
- Thread-MPI cannot run across multiple nodes
- External MPI can run on single nodes but with a performance penalty
- Analysis tools work with thread-MPI when using -ntmpi 1
- Both builds support the same GROMACS features (except multi-node for thread-MPI)
Understanding thread-MPI¶
Important
Thread-MPI is GROMACS’s internal threading library that implements a subset of the MPI 1.1 specification using system threads instead of separate processes. Based on the source code analysis, here’s what makes it special:
Technical details from GROMACS built-in threading support:
- Built-in implementation: Thread-MPI is included directly in the GROMACS source tree (src/external/thread_mpi/) and is the default parallelization mode
- Cross-platform threading: Uses POSIX pthreads on Linux/Unix and Windows threads on Windows
- Shared memory optimization: Unlike external MPI which uses separate processes, thread-MPI uses threads within a single process, enabling:
- Direct shared memory access
- Lower communication overhead
- Better cache utilization
- Reduced memory footprint
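To confirm which parallelization flavor a loaded build uses, the version banner of the binary can be inspected (a sketch; the exact wording of the output may differ between GROMACS versions):

# Thread-MPI build: the banner should report the built-in thread_mpi library
gmx --version | grep -i "MPI library"
# External MPI build: gmx_mpi should report the external MPI library instead
gmx_mpi --version | grep -i "MPI library"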
Why thread-MPI is superior for single-node simulations:
- Performance benefits:
- Lower Latency: No inter-process communication overhead
- Better Memory Access: Direct shared memory access between threads
- Optimized for NUMA: Thread-MPI can be optimized for NUMA-aware memory placement
- Reduced Context Switching: Threads within same process vs. separate processes
- Resource efficiency:
- Memory Sharing: Threads share the same address space, reducing memory usage
- Faster Startup: No process spawning overhead
- Better Cache Coherence: Shared L3 cache utilization
- GROMACS-specific optimizations:
- Integrated Thread Affinity: Thread-MPI works seamlessly with GROMACS’s internal thread pinning system
- Domain Decomposition: Optimized for GROMACS’s domain decomposition algorithms
- Load Balancing: Better load balancing within single-node scenarios
Thread-MPI vs external MPI comparison:
Aspect | Thread-MPI | External MPI |
---|---|---|
Scope | Single node only | Multi-node capable |
Communication | Shared memory (fast) | Network/Inter-process (slower) |
Memory Usage | Shared address space | Separate process memory |
Startup Time | Fast (thread creation) | Slower (process spawning) |
NUMA Optimization | Excellent | Limited |
GROMACS Integration | Native, optimized | Generic |
When to use thread-MPI and when external MPI:
Use Thread-MPI when:
- Running on a single compute node
- Want maximum performance for single-node simulations
- Need to run GROMACS analysis tools (with -ntmpi 1)
- Working with AMD EPYC processors (excellent NUMA optimization)
- Running CPU-only simulations
Use external MPI when:
- Need multi-node simulations
- Running across multiple compute nodes
- Using specialized MPI features not supported by thread-MPI
Thread-MPI configuration best practices:
# Optimal thread-MPI setup for AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2 # 2 OpenMP threads per MPI rank
export NTMPI=128 # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)
# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
Pinning and thread counts work together¶
Warning
-pin auto and -ntomp are complementary, not alternatives!
A common misconception is that using thread pinning (-pin auto) means you can omit the -ntomp parameter. This is incorrect. Here’s how they work together:
What each parameter does:
- -ntomp: Specifies the number of OpenMP threads per MPI rank
- -pin auto: Controls how GROMACS maps those threads to CPU cores
Why you need both:
# CORRECT: Both parameters work together
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: 128 MPI ranks × 2 OpenMP threads = 256 total threads
# GROMACS pins each of these 256 threads to specific CPU cores

# INCORRECT: Omitting -ntomp
gmx mdrun -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: GROMACS may use default thread count, not optimal for your hardware
How GROMACS uses both parameters:
From the source code analysis, GROMACS’s thread affinity system:
- First: Determines the total number of threads = -ntmpi × -ntomp
- Then: Maps each thread to a specific core using the hardware topology
- Finally: Applies pinning based on the -pin auto settings
Example thread distribution:
Rank 0: Thread 0 → Core 0 (pinned)
Rank 0: Thread 1 → Core 1 (pinned)
Rank 1: Thread 0 → Core 2 (pinned)
Rank 1: Thread 1 → Core 3 (pinned)
...and so on
Best practice: Always specify both
# For AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2
export NTMPI=128
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
This ensures optimal thread distribution and core pinning for your specific hardware.
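Once a run has started, the rank and thread counts mdrun actually used can be verified in its log file (a sketch; the log name follows -deffnm, and the exact log wording may vary between GROMACS versions):

# The log reports the number of thread-MPI ranks and OpenMP threads per rank in use
grep -iE "using [0-9]+ (mpi|openmp) thread" prefix.log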
AMD EPYC thread optimization: The 2:1 Rule¶
Important
AMD EPYC processors benefit from 2 threads per core!
Based on performance testing and GROMACS source code analysis, AMD EPYC processors (including the EPYC 7H12 on Discoverer) show optimal performance when using 2 OpenMP threads per physical core rather than 1:1 or higher ratios.
Why 2:1 Thread-to-Core ratio works best:
- AMD EPYC Architecture: Each EPYC core has 2 hardware threads (SMT - Simultaneous Multithreading)
- Memory Bandwidth: AMD EPYC has excellent memory bandwidth that can sustain 2 threads per core
- Cache Efficiency: Shared L3 cache benefits from 2 threads working on related data
- NUMA Optimization: 2 threads per core better utilize the NUMA topology
Optimal configuration for AMD EPYC 7H12 (128 cores, 256 threads):
# CORRECT: 2 threads per core
export NTOMP=2     # 2 OpenMP threads per MPI rank
export NTMPI=128   # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)

# INCORRECT: 1 thread per core (wastes SMT capability)
export NTOMP=1
export NTMPI=256
# Result: Poorer performance, underutilized hardware

# INCORRECT: 4 threads per core (oversubscription)
export NTOMP=4
export NTMPI=64
# Result: Context switching overhead, cache thrashing
Performance impact:
Thread ratio | Performance | Memory usage | CPU utilisation |
---|---|---|---|
1:1 (1 thread/core) | ~70% of optimal | Lower | ~50% |
2:1 (2 threads/core) | 100% (optimal) | Optimal | ~95% |
4:1 (4 threads/core) | ~60% of optimal | Higher | ~90% |
Why this matters for GROMACS:
- Domain decomposition: GROMACS’s domain decomposition algorithm benefits from having more MPI ranks (128 vs 64)
- Load balancing: More MPI ranks provide better load balancing across the system
- Communication overlap: 2 threads per core allows better overlap of computation and communication
- Memory access patterns: AMD EPYC’s memory subsystem is optimized for 2 threads per core
Implementation in your SLURM scripts:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 256
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
# AMD EPYC 7H12 optimization: 2 threads per core (128 cores, 256 threads)
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP)) # 256 / 2 = 128
# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
Note for other processors:
- Intel Xeon: Often benefits from 1:1 or 2:1 depending on generation
- AMD EPYC: Consistently benefits from 2:1 ratio
- ARM: Varies by implementation, typically 1:1
This 2:1 optimization is specific to AMD EPYC’s architecture and should be applied consistently across all single-node GROMACS simulations on Discoverer.
SLURM resource allocation and accounting for GROMACS tools¶
Why GROMACS tools must use 1 CPU Core (2 Threads):
GROMACS tools (like grompp, cluster, rms, etc.) are designed to run as single-threaded processes. However, for proper SLURM accounting and resource management on AMD EPYC processors, they must be allocated 1 CPU core, which corresponds to 2 threads due to AMD’s SMT (Simultaneous Multithreading) architecture.
SLURM resource allocation requirements on Discoverer:
#SBATCH --ntasks-per-node 1   # Single process
#SBATCH --cpus-per-task 2     # 1 CPU core (2 threads on AMD EPYC)
Why this configuration is mandatory:
- SLURM accounting accuracy:
- SLURM tracks resource usage per CPU core
- 1 core = 2 threads on AMD EPYC 7H12
- Tools must be allocated complete cores for proper billing
- Partial core allocation can cause accounting errors
- AMD EPYC architecture:
- Each physical core has 2 logical threads (SMT)
- Tools cannot use “half a core” - they get the full core
- Even single-threaded tools occupy 1 complete core
- This ensures consistent resource tracking
- Resource management benefits:
- Accurate billing: Users are charged for exactly 1 CPU core
- Fair usage: Prevents resource over-allocation
- Predictable performance: Tools get dedicated core resources
- SLURM compliance: Follows proper resource allocation patterns
Configuration | SLURM billing | Resource usage | Accounting status |
---|---|---|---|
--ntasks-per-node=1 and --cpus-per-task=1 | ❌ Incorrect | ❌ Incomplete CPU core allocation | ❌ Accounting error |
--ntasks-per-node=1 and --cpus-per-task=2 | ✅ Correct | ✅ 1 complete CPU core | ✅ Proper billing |
Tool type | SLURM configuration | #CPU cores utilised | #CPU threads utilised | Purpose |
---|---|---|---|---|
MD integrator (mdrun) | --ntasks-per-node=128 and --cpus-per-task=2 | 128 | 256 | Full node utilisation |
Single process execution | --ntasks-per-node=1 and --cpus-per-task=2 | 1 | 2 | Running GROMACS tools on 1 CPU core |
Best practices for resource allocation on Discoverer:
- Always use complete cores: --cpus-per-task=2 for GROMACS tools (unless GROMACS documentation says otherwise)
- Avoid partial CPU core allocation: Avoid --cpus-per-task=1 when --ntasks-per-node=1
- Match AMD EPYC architecture: 2 CPU threads per core allocation for high performance
- Ensure proper accounting: Complete CPU core allocation for billing accuracy
Why this matters for Discoverer:
- Cost Control: Accurate billing prevents unexpected charges
- Resource Efficiency: Tools receive exactly the resources they require
- Fair Usage: All users follow the same allocation rules
- Performance Predictability: Consistent resource availability
This resource allocation strategy ensures that GROMACS tools are properly accounted for in the SLURM system while making efficient use of the AMD EPYC processor architecture.
CPU Thread affinity and pinning¶
GROMACS has its own internal CPU thread affinity management system (see gmxomp.cpp):
- Automatically sets thread affinity by default when using all CPU cores on a compute node
- Detects CPU thread-related environment variables (OMP_PROC_BIND, GOMP_CPU_AFFINITY, KMP_AFFINITY)
- Disables its own affinity setting when these environment variables are set, to avoid conflicts
Official GROMACS recommendation:
# Let GROMACS handle thread affinity internally (recommended)
gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# Or explicitly enable GROMACS thread pinning
gmx_mpi mdrun -pin on -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# For multi-node simulations
mpirun gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}
Thread Affinity Options:
- -pin auto (default): GROMACS automatically sets thread affinity when using all node cores
- -pin on: Force GROMACS to set thread affinity
- -pin off: Disable GROMACS thread affinity setting
- -pinoffset N: Specify the starting core for thread pinning
- -pinstride N: Specify the stride between pinned cores (see the example below)
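The -pinoffset and -pinstride options matter mainly when more than one mdrun instance shares a node. A minimal sketch, assuming a 128-core node split between two independent thread-MPI runs (the offsets, strides and TPR names are illustrative placeholders and must be matched to the actual node topology):

# First simulation pinned starting at logical core 0
gmx mdrun -ntmpi 64 -ntomp 1 -pin on -pinoffset 0 -pinstride 1 -s run1.tpr -deffnm run1 &
# Second simulation pinned starting at logical core 64, so the two runs do not overlap
gmx mdrun -ntmpi 64 -ntomp 1 -pin on -pinoffset 64 -pinstride 1 -s run2.tpr -deffnm run2 &
wait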
Note
The existing Discoverer documentation uses OpenMP environment variables, but the GROMACS source code suggests letting GROMACS manage CPU thread affinity internally for optimal performance.
Domain decomposition guidelines¶
Grid selection guidelines:
When using multi-node MPI simulations, consider these factors:
System size guidelines:
- Small systems (<50k atoms): 1-2 nodes sufficient
- Medium systems (50k-100k atoms): 2-4 nodes recommended
- Large systems (>100k atoms): 4-8 nodes optimal
- Very large systems (>200k atoms): 8+ nodes required
Communication overhead considerations:
- More nodes = more MPI communication overhead
- Balance parallelization benefits against communication costs
- Optimize InfiniBand communication by setting export UCX_NET_DEVICES=mlx5_0:1
Grid optimization rules:
GROMACS works natively with the following grid configurations:
Nodes | PP Ranks | Grid | PME Ranks | Description |
---|---|---|---|---|
1 | 1 | 1x1x1 | 0 | Single node, no DD |
2 | 2 | 2x1x1 | 0 | 2 PP ranks |
4 | 4 | 2x2x1 | 0 | 4 PP ranks |
8 | 8 | 2x2x2 | 0 | 8 PP ranks |
16 | 16 | 4x2x2 | 0 | 16 PP ranks |
32 | 32 | 4x4x2 | 0 | 32 PP ranks |
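To see which decomposition mdrun actually selected, and to override it when necessary, something along these lines can be used (a sketch; the -dd and -npme values are illustrative and the log wording may differ between GROMACS versions):

# Inspect the domain decomposition grid reported in the mdrun log
grep -i "domain decomposition" prefix.log
# Force a 4x2x2 PP grid and let GROMACS choose the number of PME ranks (-npme -1)
mpirun gmx_mpi mdrun -dd 4 2 2 -npme -1 -s prefix.tpr -deffnm prefix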
Getting help¶
See Getting help