GROMACS (on CPU)

Versions and build types available

Warning

This document describes running GROMACS on the Discoverer CPU cluster.

Supported versions

Note

The versions of GROMACS installed in the software repository are built and supported by the Discoverer HPC team. The MPI builds should be employed for running the actual simulations (mdrun) and deriving trajectories, while the thread-MPI ones should be regarded mostly as a tool set for trajectory post-processing.

To check which GROMACS versions are currently supported on Discoverer, execute on the login node:

module avail gromacs

The following environment module naming convention is applied for the modules servicing the access to the software repository:

gromacs/MAJOR_N/MAJOR_N.MINOR_N-comp-num_lib-gpuavail-mpi_lib

where:

  • MAJOR_N - the major number of the GROMACS version (example: 2022)
  • MINOR_N - the minor number of the GROMACS version (example: 1, which stands for 2022.1)
  • comp - the compiler collection employed for compiling the source code (example: intel)
  • num_lib - the numerical library providing BLAS and FFTW that libgromacs is linked against (example: openblas)
  • gpuavail - shows whether the build supports GPU acceleration (example: nogpu, which means no GPU support)
  • mpi_lib - the MPI library the GROMACS code is linked against (example: openmpi, which implies the use of the Open MPI library)
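
For example, gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi denotes GROMACS 2025.2 compiled with the LLVM compilers, linked against FFTW3 and OpenBLAS, without GPU support, and against the Open MPI library. As a quick sanity check, the build configuration reported by the loaded executable can be inspected (illustrative; the exact output layout depends on the GROMACS release):

module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi
gmx_mpi --version | head -n 20   # lists the compiler, FFT library, and MPI library used for the build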

The installed versions are compiled based on the following recipes:

https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/gromacs

Two different builds available:

Discoverer provides two different GROMACS installations optimized for different use cases (see also Choosing the right build):

1. Thread-MPI build (single-node optimized)

module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

Use this build when:

  • You are running simulations on a single compute node
  • You want maximum performance for single-node simulations
  • You need to run analysis tools (with -ntmpi 1)
  • You are working with AMD EPYC processors
  • You are running CPU-only simulations

Features:

  • Optimized for single-node performance
  • Can run analysis tools by setting -ntmpi 1
  • Excellent NUMA optimization
  • Lower memory overhead
  • Faster startup times

Executable name: gmx

Example usage:

# Single-node simulation
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix

# Analysis tool (single thread-MPI rank)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx mdrun -ntmpi 1 -s npt.tpr -deffnm npt

For more details see Single-Node Thread-MPI Script.

2. External MPI build (Multi-CPU-core and multi-node capable)

module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

Use this build when:

  • You are running simulations across multiple compute nodes
  • You need multi-node parallelization
  • You are using Open MPI for distributed computing
  • You are running large-scale simulations that require multiple nodes

Features:

  • Supports multi-node simulations
  • Uses OpenMPI for inter-node communication
  • Compatible with SLURM multi-node job submission
  • Can handle larger systems across multiple nodes

Executable name: gmx_mpi

Example usage:

# Multi-node simulation (on 2 nodes - with 128 CPU Cores per node)
mpirun -np 256 gmx_mpi mdrun -ntomp 2 -pin auto -s prefix.tpr -deffnm prefix

For more details see Multi-Node External MPI Script.

User-supported versions

Users are welcome to bring or compile their own builds of GROMACS, but those builds will not be supported by the Discoverer HPC team.

Running simulations (mdrun)

Running simulations means invoking mdrun to generate trajectories based on a given TPR file.

Warning

You MUST NOT execute simulations directly on the login node (login.discoverer.bg). Run your simulations as Slurm jobs only.

Warning

Write your trajectories and analysis results only inside your Personal scratch and storage folder (/discofs/username) and DO NOT, under any circumstances, use your Home folder (/home/username) for that purpose!
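
For example, a run directory can be created under the scratch file system before preparing the job (the run_gromacs name below is only an example and matches the submission example later in this document):

mkdir -p /discofs/$USER/run_gromacs
cd /discofs/$USER/run_gromacs
# place the TPR/MDP/topology files and the Slurm batch script here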

Single-Node Thread-MPI Script

#!/bin/bash
#
#SBATCH --partition=cn         ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_node
#SBATCH --time=512:00:00       ### WallTime - set it accordingly

#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

#SBATCH --nodes           1    # MUST BE 1 for thread-MPI
#SBATCH --ntasks-per-node 1    # MUST BE 1 for thread-MPI
#SBATCH --cpus-per-task 256    # N MPI threads x M OpenMP threads (128 * 2 for AMD EPYC 7H12)

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

# AMD EPYC 7H12 optimization: 2 threads per core
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))  # 256 / 2 = 128

# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY

cd $SLURM_SUBMIT_DIR

gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -v -s prefix.tpr -deffnm prefix -pin auto

Specify the parameters and resources required for successfully running and completing the job:

  • Slurm partition of compute nodes, based on your project resource reservation (--partition)
  • job name, under which the job will be seen in the queue (--job-name)
  • wall time for running the job (--time)
  • number of occupied compute nodes (--nodes)
  • number of MPI processes per node (--ntasks-per-node)
  • number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
  • version of GROMACS to run after module load (see Supported versions)

Multi-Node External MPI Script

#!/bin/bash
#
#SBATCH --partition=cn         ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_multi_node
#SBATCH --time=512:00:00       ### WallTime - set it accordingly

#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

#SBATCH --nodes           2    # Multiple nodes
#SBATCH --ntasks-per-node 128  # MPI ranks per node
#SBATCH --cpus-per-task   2    # OpenMP threads per MPI rank

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

# Optimize InfiniBand communication
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK} -v -s prefix.tpr -deffnm prefix -pin auto

In the scripts above, edit the parameters and resources required for successfully running and completing the job:

  • Slurm partition of compute nodes, based on your project resource reservation (--partition)
  • job name, under which the job will be seen in the queue (--job-name)
  • wall time for running the job (--time)
  • number of occupied compute nodes (--nodes)
  • number of MPI processes per node (--ntasks-per-node)
  • number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
  • version of GROMACS to run after module load (see Supported versions)

Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/run_gromacs.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch run_gromacs.sh

Upon successful submission, the standard output will be directed by Slurm into the file /discofs/$USER/run_gromacs/slurm.%j.out (where %j stands for the Slurm job ID), while the standard error output will be stored in /discofs/$USER/run_gromacs/slurm.%j.err.
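
The progress of a submitted job can be followed with the standard Slurm client tools (a sketch; replace <jobid> with the job ID printed by sbatch):

squeue -u $USER                                            # list your pending and running jobs
sacct -j <jobid> --format=JobID,State,Elapsed,AllocCPUS    # accounting summary for the job
tail -f slurm.<jobid>.out                                  # follow the standard output of a running job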

Running GROMACS tools

Script for Executing Single-Threaded Non-Interactive GROMACS Tools

Use this for single-threaded GROMACS tools like grompp, editconf, etc.:

#!/bin/bash
#
#SBATCH --partition=cn         ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_thread
#SBATCH --time=00:30:00        ### WallTime - set it accordingly

#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

#SBATCH --nodes           1    # Single node
#SBATCH --ntasks-per-node 1    # Single task
#SBATCH --cpus-per-task   2    # 1 CPU core (2 threads on AMD EPYC)

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Single-threaded tools (no -ntmpi needed)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx editconf -f protein.gro -o protein_box.gro -c -d 1.0 -bt cubic
gmx solvate -cp protein_box.gro -cs spc216.gro -o solv.gro -p topol.top
gmx grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr

Script for Executing Interactive GROMACS Tools in Non-Interactive Mode

Use this for GROMACS tools like cluster, rms, gyrate, hbond, do_dssp, etc.:

#!/bin/bash
#
#SBATCH --partition=cn         ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_interactive
#SBATCH --time=01:00:00        ### WallTime - set it accordingly

#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

#SBATCH --nodes           1    # Single node
#SBATCH --ntasks-per-node 1    # Single task
#SBATCH --cpus-per-task   2    # 1 CPU core (2 threads on AMD EPYC)

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Interactive tools using echo pipes for input
# Format: echo -e "input1\ninput2\n..." | gmx tool_name [options]
#
# How echo pipes simulate interactive input:
# echo -e "4\n4" simulates: Type "4", press Enter, type "4", press Enter
# So "4\n4" replaces the interactive sequence: 4 [Enter] 4 [Enter]

# Example 1: Cluster analysis
echo -e "1\n1" | gmx cluster -f trajectory.trr -s structure.tpr -n index.ndx \
    -cutoff 0.15 -method jarvis-patrick -M 0 \
    -o cluster_output -g cluster.log -dist cluster_dist \
    -cl cluster.pdb -nst 250 -wcl 10000

# Example 2: RMSD analysis
echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr \
    -o rmsd.xvg -tu ns

# Example 3: Radius of gyration
echo -e "1\n1" | gmx gyrate -f trajectory.trr -s structure.tpr \
    -o gyrate.xvg -p -n index.ndx

# Example 4: Hydrogen bond analysis
echo -e "1\n1" | gmx hbond -f trajectory.trr -s structure.tpr \
    -num hbond.xvg -tu ns

# Example 5: Secondary structure analysis
echo -e "1\n1" | gmx do_dssp -f trajectory.trr -s structure.tpr \
    -o ss.xpm -sc scount.xvg

Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/gromacs_tools.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch gromacs_tools.sh

Common Interactive GROMACS Tools and Their Input Patterns

Tool | Purpose | Typical Input | What You’d Type Interactively | Example Command
gmx cluster | Cluster analysis | "1\n1" | Type “1”, press Enter, type “1”, press Enter | echo -e "1\n1" \| gmx cluster ...
gmx rms | RMSD calculation | "4\n1" | Type “4”, press Enter, type “1”, press Enter | echo -e "4\n1" \| gmx rms ...
gmx gyrate | Radius of gyration | "1\n1" | Type “1”, press Enter, type “1”, press Enter | echo -e "1\n1" \| gmx gyrate ...
gmx hbond | Hydrogen bonds | "1\n1" | Type “1”, press Enter, type “1”, press Enter | echo -e "1\n1" \| gmx hbond ...
gmx do_dssp | Secondary structure | "1\n1" | Type “1”, press Enter, type “1”, press Enter | echo -e "1\n1" \| gmx do_dssp ...
gmx trjconv | Trajectory conversion | "0" | Type “0”, press Enter | echo -e "0" \| gmx trjconv ...
gmx select | Atom selection | "1\n1" | Type “1”, press Enter, type “1”, press Enter | echo -e "1\n1" \| gmx select ...

Understanding the Table Columns:

  • “Typical Input”: The echo pipe string that simulates interactive input in SLURM
  • “What You’d Type Interactively”: The exact keystrokes you’d make if running the tool on a personal workstation

How to Convert Interactive Commands to Batch Commands:

Step-by-Step Translation Process:

  1. Interactive Session (on personal workstation):

    $ gmx rms
    Select group for least squares fit (1-4):
    4
    Select group for RMSD calculation (1-4):
    1
    
  2. Batch Session (in SLURM script):

    echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg
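
    The same input can be fed with printf or a Bash here-string instead of echo -e, if preferred (equivalent sketches):

    printf '4\n1\n' | gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg
    gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg <<< $'4\n1'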
    

Translation Rules:

  • Each number you type → becomes part of the echo string
  • Each Enter key press → becomes \n (newline)
  • Multiple inputs → separated by \n
  • Final Enter → usually not needed (tool processes automatically)

Translation of "4\n1"

Interactive Action | Echo String | Explanation
Type “4”, press Enter | "4\n" | First input, followed by a newline
Type “1”, press Enter | "1" | Second input (no trailing newline needed)
Combined | "4\n1" | Both inputs in one string

Common Group Numbers:

  • “0”: System (all atoms)
  • “1”: Protein
  • “2”: Non-protein
  • “3”: Water
  • “4”: Backbone (protein backbone only)
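
The actual group numbers depend on the system, so it is safest to list them for your own TPR file with gmx make_ndx and quit immediately (a sketch; structure.tpr and index.ndx are placeholder names):

echo "q" | gmx make_ndx -f structure.tpr -o index.ndx
# the available groups and their numbers are printed to the terminal before the tool exits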

Tips for Converting Interactive Commands:

  1. Test interactively first: Run the command on your workstation to see what inputs are needed
  2. Count the inputs: Note how many numbers you need to type
  3. Add newlines: Put \n between each input
  4. Use echo -e: The -e flag enables \n interpretation
  5. Pipe to command: Use | to feed the input to the GROMACS tool

Tips for Interactive Tools:

  1. Test locally first: Run the command interactively to see what inputs are needed
  2. Use echo -e: The -e flag enables interpretation of backslash escapes like \n
  3. Check group numbers: Use gmx make_ndx to see available groups and their numbers
  4. Multiple inputs: Separate multiple inputs with \n for newlines
  5. Error handling: Check the log files for any input errors

Technical details

Choosing the right build

Scenario | Recommended Build | Module to Load
Single-node simulation | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
Analysis tools | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
Multi-node simulation | External MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

Note

AMD EPYC optimization applies to both builds. The 2:1 thread-to-core ratio and other AMD EPYC-specific optimizations work with both thread-MPI and external MPI builds. The choice between builds is based on single-node vs. multi-node requirements, not processor optimization.

Performance comparison on Discoverer:

  • Thread-MPI: 10-20% faster for single-node simulations
  • External MPI: Required for multi-node, but slower for single-node
  • Memory Usage: Thread-MPI uses ~30% less memory per node

Important notes:

  • Thread-MPI cannot run across multiple nodes
  • External MPI can run on single nodes, but with a performance penalty
  • Analysis tools work with thread-MPI when using -ntmpi 1
  • Both builds support the same GROMACS features (except multi-node for thread-MPI)

Understanding thread-MPI

Important

Thread-MPI is GROMACS’s internal threading library that implements a subset of the MPI 1.1 specification using system threads instead of separate processes. Based on the source code analysis, here’s what makes it special:

Technical details from GROMACS built-in threading support:

  1. Built-in implementation: Thread-MPI is included directly in the GROMACS source tree (src/external/thread_mpi/) and is the default parallelization mode
  2. Cross-platform threading: Uses POSIX pthreads on Linux/Unix and Windows threads on Windows
  3. Shared memory optimization: Unlike external MPI which uses separate processes, thread-MPI uses threads within a single process, enabling:
    • Direct shared memory access
    • Lower communication overhead
    • Better cache utilization
    • Reduced memory footprint

Why thread-MPI is superior for single-node simulations:

  1. Performance benefits:
    • Lower Latency: No inter-process communication overhead
    • Better Memory Access: Direct shared memory access between threads
    • Optimized for NUMA: Thread-MPI can be optimized for NUMA-aware memory placement
    • Reduced Context Switching: Threads within same process vs. separate processes
  2. Resource efficiency:
    • Memory Sharing: Threads share the same address space, reducing memory usage
    • Faster Startup: No process spawning overhead
    • Better Cache Coherence: Shared L3 cache utilization
  3. GROMACS-specific optimizations:
    • Integrated Thread Affinity: Thread-MPI works seamlessly with GROMACS’s internal thread pinning system
    • Domain Decomposition: Optimized for GROMACS’s domain decomposition algorithms
    • Load Balancing: Better load balancing within single-node scenarios
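
Whether a given gmx executable was built with thread-MPI or against an external MPI library can be checked from its version report (a sketch; the exact field names may vary slightly between GROMACS releases):

module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
gmx --version | grep -i "MPI library"   # expected to report thread_mpi for this build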

Thread-MPI vs external MPI comparison:

Aspect | Thread-MPI | External MPI
Scope | Single node only | Multi-node capable
Communication | Shared memory (fast) | Network/inter-process (slower)
Memory Usage | Shared address space | Separate process memory
Startup Time | Fast (thread creation) | Slower (process spawning)
NUMA Optimization | Excellent | Limited
GROMACS Integration | Native, optimized | Generic

When to use thread-MPI and when external MPI:

Use Thread-MPI when:

  • You are running on a single compute node
  • You want maximum performance for single-node simulations
  • You need to run GROMACS analysis tools (with -ntmpi 1)
  • You are working with AMD EPYC processors (excellent NUMA optimization)
  • You are running CPU-only simulations

Use external MPI when:

  • You need multi-node simulations
  • You are running across multiple compute nodes
  • You are using specialized MPI features not supported by thread-MPI

Thread-MPI configuration best practices:

# Optimal thread-MPI setup for AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2      # 2 OpenMP threads per MPI rank
export NTMPI=128    # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)

# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY

gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

Pinning and thread counts work together

Warning

-pin auto and -ntomp are complementary, not alternatives!

A common misconception is that using thread pinning (-pin auto) means you can omit the -ntomp parameter. This is incorrect. Here’s how they work together:

What each parameter does:

  • -ntomp: Specifies the number of OpenMP threads per MPI rank
  • -pin auto: Controls how GROMACS maps those threads to CPU cores

Why you need both:

# CORRECT: Both parameters work together
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: 128 MPI ranks × 2 OpenMP threads = 256 total threads
# GROMACS pins each of these 256 threads to specific CPU cores

# INCORRECT: Omitting -ntomp
gmx mdrun -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: GROMACS may use default thread count, not optimal for your hardware

How GROMACS uses both parameters:

From the source code analysis, GROMACS’s thread affinity system:

  1. First: Determines total threads = -ntmpi × -ntomp
  2. Then: Maps each thread to a specific core using hardware topology
  3. Finally: Applies pinning based on -pin auto settings

Example thread distribution:

Rank 0: Thread 0 → Core 0 (pinned)
Rank 0: Thread 1 → Core 1 (pinned)
Rank 1: Thread 0 → Core 2 (pinned)
Rank 1: Thread 1 → Core 3 (pinned)
...and so on

Best practice: Always specify both

# For AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2
export NTMPI=128
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

This ensures optimal thread distribution and core pinning for your specific hardware.

AMD EPYC thread optimization: The 2:1 Rule

Important

AMD EPYC processors benefit from 2 threads per core!

Based on performance testing and GROMACS source code analysis, AMD EPYC processors (including the EPYC 7H12 on Discoverer) show optimal performance when using 2 OpenMP threads per physical core rather than 1:1 or higher ratios.

Why 2:1 Thread-to-Core ratio works best:

  1. AMD EPYC Architecture: Each EPYC core has 2 hardware threads (SMT - Simultaneous Multithreading)
  2. Memory Bandwidth: AMD EPYC has excellent memory bandwidth that can sustain 2 threads per core
  3. Cache Efficiency: Shared L3 cache benefits from 2 threads working on related data
  4. NUMA Optimization: 2 threads per core better utilize the NUMA topology

Optimal configuration for AMD EPYC 7H12 (128 cores, 256 threads):

# CORRECT: 2 threads per core
export NTOMP=2      # 2 OpenMP threads per MPI rank
export NTMPI=128    # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)

# INCORRECT: 1 thread per core (wastes SMT capability)
export NTOMP=1
export NTMPI=256
# Result: Poorer performance, underutilized hardware

# INCORRECT: 4 threads per core (oversubscription)
export NTOMP=4
export NTMPI=64
# Result: Context switching overhead, cache thrashing

Performance impact:

Thread ratio | Performance | Memory usage | CPU utilisation
1:1 (1 thread/core) | ~70% of optimal | Lower | ~50%
2:1 (2 threads/core) | 100% (optimal) | Optimal | ~95%
4:1 (4 threads/core) | ~60% of optimal | Higher | ~90%

Why this matters for GROMACS:

  1. Domain decomposition: GROMACS’s domain decomposition algorithm benefits from having more MPI ranks (128 vs 64)
  2. Load balancing: More MPI ranks provide better load balancing across the system
  3. Communication overlap: 2 threads per core allows better overlap of computation and communication
  4. Memory access patterns: AMD EPYC’s memory subsystem is optimized for 2 threads per core

Implementation in your SLURM scripts:

#!/bin/bash
#SBATCH --nodes           1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 256

module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

# AMD EPYC 7H12 optimization: 2 threads per core (128 cores, 256 threads)
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))  # 256 / 2 = 128

# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY

gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

Note for other processors:

  • Intel Xeon: Often benefits from 1:1 or 2:1 depending on generation
  • AMD EPYC: Consistently benefits from 2:1 ratio
  • ARM: Varies by implementation, typically 1:1

This 2:1 optimization is specific to AMD EPYC’s architecture and should be applied consistently across all single-node GROMACS simulations on Discoverer.
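
The core/thread layout assumed above (128 physical cores with 2 hardware threads each) can be confirmed on a compute node with standard tools (illustrative):

lscpu | grep -E "Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|^CPU\(s\)"
nproc    # total number of logical CPUs visible to the job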

SLURM resource allocation and accounting for GROMACS tools

Why GROMACS tools must use 1 CPU Core (2 Threads):

GROMACS tools (like grompp, cluster, rms, etc.) are designed to run as single-threaded processes. However, for proper SLURM accounting and resource management on AMD EPYC processors, they must be allocated 1 CPU core, which corresponds to 2 threads due to AMD’s SMT (Simultaneous Multithreading) architecture.

SLURM resource allocation requirements on Discoverer:

#SBATCH --ntasks-per-node 1    # Single process
#SBATCH --cpus-per-task   2    # 1 CPU core (2 threads on AMD EPYC)

Why this configuration is mandatory:

  1. SLURM accounting accuracy:
    • SLURM tracks resource usage per CPU core
    • 1 core = 2 threads on AMD EPYC 7H12
    • Tools must be allocated complete cores for proper billing
    • Partial core allocation can cause accounting errors
  2. AMD EPYC architecture:
    • Each physical core has 2 logical threads (SMT)
    • Tools cannot use “half a core” - they get the full core
    • Even single-threaded tools occupy 1 complete core
    • This ensures consistent resource tracking
  3. Resource management benefits:
    • Accurate billing: Users are charged for exactly 1 CPU core
    • Fair usage: Prevents resource over-allocation
    • Predictable performance: Tools get dedicated core resources
    • SLURM compliance: Follows proper resource allocation patterns

Accounting impact

Configuration | SLURM billing | Resource usage | Accounting status
--ntasks-per-node=1 and --cpus-per-task=1 | ❌ Incorrect | ❌ Incomplete (only part of 1 CPU core) | ❌ Accounting error
--ntasks-per-node=1 and --cpus-per-task=2 | ✅ Correct | ✅ 1 complete CPU core | ✅ Proper billing

Tool categories and resource allocation

Tool type | SLURM configuration | #CPU cores utilised | #CPU threads utilised | Purpose
MD integrator (mdrun) | --ntasks-per-node=128 and --cpus-per-task=2 | 128 | 256 | Full node utilisation
Single process execution | --ntasks-per-node=1 and --cpus-per-task=2 | 1 | 2 | Running GROMACS tools on 1 CPU core

Best practices for resource allocation on Discoverer:

  1. Always use complete cores: --cpus-per-task=2 for GROMACS tools (unless GROMACS documentation says otherwise)
  2. Avoid partial CPU core allocation: Avoid --cpus-per-task=1 when --ntasks-per-node=1
  3. Match AMD EPYC architecture: 2 CPU threads per core allocation for high performance
  4. Ensure proper accounting: Complete CPU core allocation for billing accuracy
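
After a job completes, the resources that were actually allocated and billed can be reviewed with sacct (a sketch; replace <jobid> with your job ID):

sacct -j <jobid> --format=JobID,JobName,AllocCPUS,Elapsed,CPUTime,State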

Why this matters for Discoverer:

  • Cost Control: Accurate billing prevents unexpected charges
  • Resource Efficiency: Tools receive exactly the resources they require
  • Fair Usage: All users follow the same allocation rules
  • Performance Predictability: Consistent resource availability

This resource allocation strategy ensures that GROMACS tools are properly accounted for in the SLURM system while making efficient use of the AMD EPYC processor architecture.

CPU Thread affinity and pinning

GROMACS has its own internal CPU thread affinity management system (see gmxomp.cpp):

  1. Automatically sets thread affinity by default when using all CPU cores on a compute node
  2. Detects CPU thread-related environment variables (OMP_PROC_BIND, GOMP_CPU_AFFINITY, KMP_AFFINITY)
  3. Disables its own affinity setting when these environment variables are set to avoid conflicts

Official GROMACS recommendation:

# Let GROMACS handle thread affinity internally (recommended)
gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# Or explicitly enable GROMACS thread pinning
gmx_mpi mdrun -pin on -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# For multi-node simulations
mpirun gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

Thread Affinity Options:

  • -pin auto (default): GROMACS automatically sets thread affinity when using all node cores
  • -pin on: Force GROMACS to set thread affinity
  • -pin off: Disable GROMACS thread affinity setting
  • -pinoffset N: Specify starting core for thread pinning
  • -pinstride N: Specify stride between pinned cores
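
-pinoffset and -pinstride are mainly relevant when more than one mdrun shares a node. As a hedged sketch (adapt the counts, and note that run1.tpr/run2.tpr are placeholder names), two independent thread-MPI runs could be pinned to different halves of a 128-core node like this:

# first run on the first 128 hardware threads
gmx mdrun -ntmpi 64 -ntomp 2 -pin on -pinoffset 0   -pinstride 1 -s run1.tpr -deffnm run1 &
# second run on the remaining 128 hardware threads
gmx mdrun -ntmpi 64 -ntomp 2 -pin on -pinoffset 128 -pinstride 1 -s run2.tpr -deffnm run2 &
wait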

Note

The existing Discoverer documentation uses OpenMP environment variables, but the GROMACS source code suggests letting GROMACS manage CPU thread affinity internally for optimal performance.

Domain decomposition guidelines

Grid selection guidelines:

When using multi-node MPI simulations, consider these factors:

System size guidelines:

  • Small systems (<50k atoms): 1-2 nodes sufficient
  • Medium systems (50k-100k atoms): 2-4 nodes recommended
  • Large systems (>100k atoms): 4-8 nodes optimal
  • Very large systems (>200k atoms): 8+ nodes required

Communication overhead considerations:

  • More nodes = more MPI communication overhead
  • Balance parallelization benefits against communication costs
  • Restrict MPI traffic to the InfiniBand interface by setting export UCX_NET_DEVICES=mlx5_0:1 (as in the multi-node script above)

Grid optimization rules:

GROMACS works natively with the following grid configurations:

Nodes | PP Ranks | Grid | PME Ranks | Description
1 | 1 | 1x1x1 | 0 | Single node, no DD
2 | 2 | 2x1x1 | 0 | 2 PP ranks
4 | 4 | 2x2x1 | 0 | 4 PP ranks
8 | 8 | 2x2x2 | 0 | 8 PP ranks
16 | 16 | 4x2x2 | 0 | 16 PP ranks
32 | 32 | 4x4x2 | 0 | 32 PP ranks
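
The decomposition mdrun actually selected is recorded in its log file and can be inspected after (or during) a run (a sketch; prefix.log corresponds to the -deffnm prefix used in the examples above):

grep -i "Domain decomposition grid" prefix.log
grep -i "PME" prefix.log | head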

Getting help

See Getting help