GROMACS (on CPU)¶
Versions and build types available¶
Warning
This document describes running GROMACS on the Discoverer CPU cluster.
Supported versions¶
Note
The versions of GROMACS installed in the software repository are built and supported by the Discoverer HPC team. The MPI builds should be employed for running the actual simulations (mdrun) and deriving trajectories, while the threadMPI ones should be regarded mostly as a tool set for trajectory post-processing.
To check which GROMACS versions are currently supported on Discoverer, execute on the login node:
module avail gromacs
The following environment module naming convention is applied for the modules servicing the access to the software repository:
gromacs/MAJOR_N/MAJOR_N.MINOR_N-comp-num_lib-gpuavail-mpi_lib
where:
MAJOR_N
- the major number of the GROMACS version (example: 2022)
MINOR_N
- the minor number of the GROMACS version (example: 1, which stands for 2022.1)
comp
- the compiler collection employed for compiling the source code (example: intel)
num_lib
- the numerical methods’ library providing the BLAS and FFTW implementations that libgromacs is linked against (example: openblas)
gpuavail
- shows whether the build supports GPU acceleration (example: nogpu, which means no GPU support)
mpi_lib
- the MPI library the GROMACS code is linked against (example: openmpi, which implies the use of the Open MPI library)
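As an illustration, one of the modules used later in this document decodes as follows (a sketch; the field breakdown is inferred from the naming convention above, with fftw3-openblas covering both the FFT and BLAS libraries):

# gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
#   MAJOR_N  = 2025           (GROMACS 2025 series)
#   MINOR_N  = 2              (GROMACS 2025.2)
#   comp     = llvm           (LLVM compiler collection)
#   num_lib  = fftw3-openblas (FFTW 3 for FFTs, OpenBLAS for BLAS)
#   gpuavail = nogpu          (no GPU acceleration)
#   mpi_lib  = threadmpi      (built-in thread-MPI, no external MPI)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
gmx --version   # print the build information of the loaded module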
The installed versions are compiled based on the following recipes:
https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/gromacs
Two different builds are available:
Discoverer provides two different GROMACS installations optimized for different use cases (see also Choosing the right build):
1. Thread-MPI build (single-node optimized)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
Use this build when:
- Running simulations on a single compute node
- Want maximum performance for single-node simulations
- Need to run analysis tools (with -ntmpi 1)
- Working with AMD EPYC processors
- Running CPU-only simulations
Features:
- Optimized for single-node performance
- Can run analysis tools by setting -ntmpi 1
- Excellent NUMA optimization
- Lower memory overhead
- Faster startup times
Executable name: gmx
Example usage:
# Single-node simulation
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix

# Analysis tool (single thread-MPI rank)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx mdrun -ntmpi 1 -s npt.tpr -deffnm npt
For more details see Single-Node Thread-MPI Script.
2. External MPI build (Multi-CPU-core and multi-node capable)
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi
Use this build when:
- Running simulations across multiple compute nodes
- Need multi-node parallelization
- Using OpenMPI for distributed computing
- Large-scale simulations requiring multiple nodes
Features:
- Supports multi-node simulations
- Uses OpenMPI for inter-node communication
- Compatible with SLURM multi-node job submission
- Can handle larger systems across multiple nodes
Executable name: gmx_mpi
Example usage:
# Multi-node simulation (on 2 nodes, with 128 CPU cores per node)
mpirun -np 256 gmx_mpi mdrun -ntomp 2 -pin auto -s prefix.tpr -deffnm prefix
For more details see Multi-Node External MPI Script.
User-supported versions¶
Users are welcome to bring or compile their own builds of GROMACS and use them, but those builds will not be supported by the Discoverer HPC team.
Running simulations (mdrun)¶
Running simulations means invoking mdrun for generating trajectories based on a given TPR file.
Warning
You MUST NOT execute simulations directly on the login node (login.discoverer.bg). Run your simulations as Slurm jobs only.
Warning
Write your trajectories and analysis results only inside your Personal scratch and storage folder (/discofs/username) and DO NOT (under any circumstances) use your Home folder (/home/username) for that purpose!
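For example, a job directory on the scratch file system can be prepared as follows before submitting (a sketch; the directory and file names are placeholders matching the examples below):

# Create a working directory on the scratch file system
mkdir -p /discofs/$USER/run_gromacs
# Stage the input files (TPR, job script, etc.) there before submission
cp prefix.tpr run_gromacs.sh /discofs/$USER/run_gromacs/
cd /discofs/$USER/run_gromacs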
Single-Node Thread-MPI Script¶
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_node
#SBATCH --time=512:00:00        ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # MUST BE 1 for thread-MPI
#SBATCH --ntasks-per-node 1     # MUST BE 1 for thread-MPI
#SBATCH --cpus-per-task 256     # N MPI threads x M OpenMP threads (128 * 2 for AMD EPYC 7H12)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

# AMD EPYC 7H12 optimization: 2 threads per core
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))   # 256 / 2 = 128

# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY

cd $SLURM_SUBMIT_DIR

gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -v -s prefix.tpr -deffnm prefix -pin auto
Specify the parameters and resources required for successfully running and completing the job:
- Slurm partition of compute nodes, based on your project resource reservation (--partition)
- job name, under which the job will be seen in the queue (--job-name)
- wall time for running the job (--time)
- number of occupied compute nodes (--nodes)
- number of MPI processes per node (--ntasks-per-node)
- number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
- version of GROMACS to run after module load (see Supported versions)
Multi-Node External MPI Script¶
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_multi_node
#SBATCH --time=512:00:00        ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 2               # Multiple nodes
#SBATCH --ntasks-per-node 128   # MPI ranks per node
#SBATCH --cpus-per-task 2       # OpenMP threads per MPI rank
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

# Optimize InfiniBand communication
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun gmx_mpi mdrun -v -s prefix.tpr -deffnm prefix -pin auto
In the scripts above, edit the parameters and resources required for successfully running and completing the job:
- Slurm partition of compute nodes, based on your project resource reservation (--partition)
- job name, under which the job will be seen in the queue (--job-name)
- wall time for running the job (--time)
- number of occupied compute nodes (--nodes)
- number of MPI processes per node (--ntasks-per-node)
- number of threads (i.e. OpenMP threads) per MPI process (--cpus-per-task)
- version of GROMACS to run after module load (see Supported versions)
Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/run_gromacs.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch run_gromacs.sh
Upon successful submission, the standard output will be directed by Slurm into the file /discofs/$USER/run_gromacs/slurm.%j.out (where %j stands for the Slurm job ID), while the standard error output will be stored in /discofs/$USER/run_gromacs/slurm.%j.err.
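The job can then be monitored with the standard Slurm tools, for instance (a sketch; replace 123456 with the job ID reported by sbatch):

# List your jobs currently in the queue
squeue -u $USER
# Follow the standard output of the running job
tail -f /discofs/$USER/run_gromacs/slurm.123456.out
# After completion, inspect the accounting record of the job
sacct -j 123456 --format=JobID,JobName,Elapsed,State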
Running GROMACS tools¶
Script for Executing Single-Threaded Non-Interactive GROMACS Tools¶
Use this for single-threaded GROMACS tools like grompp, editconf, etc.:
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_single_thread
#SBATCH --time=00:30:00         ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # Single node
#SBATCH --ntasks-per-node 1     # Single task
#SBATCH --cpus-per-task 2       # 1 CPU core (2 threads on AMD EPYC)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Single-threaded tools (no -ntmpi needed)
gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
gmx editconf -f protein.gro -o protein_box.gro -c -d 1.0 -bt cubic
gmx solvate -cp protein_box.gro -cs spc216.gro -o solv.gro -p topol.top
gmx grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
Script for Executing Interactive GROMACS Tools in Non-Interactive Mode¶
Use this for GROMACS tools like cluster, rms, gyrate, hbond, do_dssp, etc.:
#!/bin/bash
#
#SBATCH --partition=cn          ### Partition (you may need to change this)
#SBATCH --job-name=gromacs_interactive
#SBATCH --time=01:00:00         ### WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes 1               # Single node
#SBATCH --ntasks-per-node 1     # Single task
#SBATCH --cpus-per-task 2       # 1 CPU core (2 threads on AMD EPYC)
#SBATCH -o slurm.%j.out         # STDOUT
#SBATCH -e slurm.%j.err         # STDERR

module purge
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

cd $SLURM_SUBMIT_DIR

# Interactive tools using echo pipes for input
# Format: echo -e "input1\ninput2\n..." | gmx tool_name [options]
#
# How echo pipes simulate interactive input:
# echo -e "4\n4" simulates: Type "4", press Enter, type "4", press Enter
# So "4\n4" replaces the interactive sequence: 4 [Enter] 4 [Enter]

# Example 1: Cluster analysis
echo -e "1\n1" | gmx cluster -f trajectory.trr -s structure.tpr -n index.ndx \
    -cutoff 0.15 -method jarvis-patrick -M 0 \
    -o cluster_output -g cluster.log -dist cluster_dist \
    -cl cluster.pdb -nst 250 -wcl 10000

# Example 2: RMSD analysis
echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr \
    -o rmsd.xvg -tu ns

# Example 3: Radius of gyration
echo -e "1\n1" | gmx gyrate -f trajectory.trr -s structure.tpr \
    -o gyrate.xvg -p -n index.ndx

# Example 4: Hydrogen bond analysis
echo -e "1\n1" | gmx hbond -f trajectory.trr -s structure.tpr \
    -num hbond.xvg -tu ns

# Example 5: Secondary structure analysis
echo -e "1\n1" | gmx do_dssp -f trajectory.trr -s structure.tpr \
    -o ss.xpm -sc scount.xvg
Save the complete Slurm job description as a file, for example /discofs/$USER/run_gromacs/gromacs_tools.sh, and submit it to the queue:

cd /discofs/$USER/run_gromacs/
sbatch gromacs_tools.sh
Tool | Purpose | Typical Input | What You’d Type Interactively | Example Command |
---|---|---|---|---|
gmx cluster | Cluster analysis | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx cluster ... |
gmx rms | RMSD calculation | "4\n1" | Type "4", press Enter, type "1", press Enter | echo -e "4\n1" \| gmx rms ... |
gmx gyrate | Radius of gyration | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx gyrate ... |
gmx hbond | Hydrogen bonds | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx hbond ... |
gmx do_dssp | Secondary structure | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx do_dssp ... |
gmx trjconv | Trajectory conversion | "0" | Type "0", press Enter | echo -e "0" \| gmx trjconv ... |
gmx select | Atom selection | "1\n1" | Type "1", press Enter, type "1", press Enter | echo -e "1\n1" \| gmx select ... |
Understanding the Table Columns:
- “Typical Input”: The echo pipe string that simulates interactive input in SLURM
- “What You’d Type Interactively”: The exact keystrokes you’d make if running the tool on a personal workstation
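For instance, the gmx trjconv row of the table could be run in batch mode like this (a sketch; the file names are placeholders, and group 0 selects the whole system for output):

# Non-interactive trjconv: answer the single "output group" prompt with 0 (System)
echo "0" | gmx trjconv -f prefix.xtc -s prefix.tpr -o prefix_mol.xtc -pbc mol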
How to Convert Interactive Commands to Batch Commands:
Step-by-Step Translation Process:
Interactive Session (on personal workstation):
$ gmx rms
Select group for least squares fit (1-4): 4
Select group for RMSD calculation (1-4): 1
Batch Session (in SLURM script):
echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg
Translation Rules:
- Each number you type → becomes part of the echo string
- Each Enter key press → becomes \n (newline)
- Multiple inputs → separated by \n
- Final Enter → usually not needed (tool processes automatically)
Interactive Action | Echo String | Explanation |
---|---|---|
Type "4", press Enter | "4\n" | First input, followed by a newline |
Type "1", press Enter | "1" | Second input (trailing newline not needed) |
Combined | "4\n1" | Both inputs in one string |
Common Group Numbers (the exact numbers depend on your system; check them with gmx make_ndx as shown below):
- “0”: System (all atoms)
- “1”: Protein
- “2”: Non-protein
- “3”: Water
- “4”: Backbone (protein backbone only)
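To see the group numbers actually defined for your system, the groups can be listed non-interactively, for example (a sketch; the structure file name is a placeholder):

# Print the available index groups and quit ("q" answers the interactive prompt)
echo "q" | gmx make_ndx -f structure.gro -o index.ndx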
Tips for Converting Interactive Commands:
- Test interactively first: Run the command on your workstation to see what inputs are needed
- Count the inputs: Note how many numbers you need to type
- Add newlines: Put \n between each input
- Use echo -e: The -e flag enables \n interpretation
- Pipe to command: Use | to feed the input to the GROMACS tool
Tips for Interactive Tools:
- Test locally first: Run the command interactively to see what inputs are needed
- Use echo -e: The -e flag enables interpretation of backslash escapes like \n
- Check group numbers: Use gmx make_ndx to see available groups and their numbers
- Multiple inputs: Separate multiple inputs with \n for newlines
- Error handling: Check the log files for any input errors (see the example below)
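A quick post-run check for input errors might look like this (a sketch; the file names follow the Slurm and GROMACS output conventions used in the scripts above):

# Look for errors reported on the Slurm error stream and in the GROMACS log files
grep -i "error" slurm.*.err
grep -i "fatal error" *.log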
Technical details¶
Choosing the right build¶
Scenario | Recommended Build | Module to Load |
---|---|---|
Single-node simulation | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi |
Analysis tools | Thread-MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi |
Multi-node simulation | External MPI | gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi |
Note
AMD EPYC optimization applies to both builds. The 2:1 thread-to-core ratio and other AMD EPYC-specific optimizations work with both thread-MPI and external MPI builds. The choice between builds is based on single-node vs. multi-node requirements, not processor optimization.
Performance comparison on Discoverer:
- Thread-MPI: 10-20% faster for single-node simulations
- External MPI: Required for multi-node, but slower for single-node
- Memory Usage: Thread-MPI uses ~30% less memory per node
Important notes:
- Thread-MPI cannot run across multiple nodes
- External MPI can run on single nodes but with a performance penalty
- Analysis tools work with thread-MPI when using -ntmpi 1
- Both builds support the same GROMACS features (except multi-node for thread-MPI)
Understanding thread-MPI¶
Important
Thread-MPI is GROMACS’s internal threading library that implements a subset of the MPI 1.1 specification using system threads instead of separate processes. Based on the source code analysis, here’s what makes it special:
Technical details from GROMACS built-in threading support:
- Built-in implementation: Thread-MPI is included directly in the GROMACS source tree (src/external/thread_mpi/) and is the default parallelization mode
- Cross-platform threading: Uses POSIX pthreads on Linux/Unix and Windows threads on Windows
- Shared memory optimization: Unlike external MPI which uses separate processes, thread-MPI uses threads within a single process, enabling:
- Direct shared memory access
- Lower communication overhead
- Better cache utilization
- Reduced memory footprint
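To confirm which parallelization flavor a loaded build uses, the version banner of the binary can be inspected (a sketch; the exact wording of the output may differ between GROMACS versions):

# Thread-MPI build: the banner should report the built-in thread_mpi library
gmx --version | grep -i "MPI library"
# External MPI build: gmx_mpi should report the external MPI library instead
gmx_mpi --version | grep -i "MPI library"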
Why thread-MPI is superior for single-node simulations:
- Performance benefits:
- Lower Latency: No inter-process communication overhead
- Better Memory Access: Direct shared memory access between threads
- Optimized for NUMA: Thread-MPI can be optimized for NUMA-aware memory placement
- Reduced Context Switching: Threads within same process vs. separate processes
- Resource efficiency:
- Memory Sharing: Threads share the same address space, reducing memory usage
- Faster Startup: No process spawning overhead
- Better Cache Coherence: Shared L3 cache utilization
- GROMACS-specific optimizations:
- Integrated Thread Affinity: Thread-MPI works seamlessly with GROMACS’s internal thread pinning system
- Domain Decomposition: Optimized for GROMACS’s domain decomposition algorithms
- Load Balancing: Better load balancing within single-node scenarios
Thread-MPI vs external MPI comparison:
Aspect | Thread-MPI | External MPI |
---|---|---|
Scope | Single node only | Multi-node capable |
Communication | Shared memory (fast) | Network/Inter-process (slower) |
Memory Usage | Shared address space | Separate process memory |
Startup Time | Fast (thread creation) | Slower (process spawning) |
NUMA Optimization | Excellent | Limited |
GROMACS Integration | Native, optimized | Generic |
When to use thread-MPI and when external MPI:
Use Thread-MPI when:
- Running on a single compute node
- Want maximum performance for single-node simulations
- Need to run GROMACS analysis tools (with -ntmpi 1)
- Working with AMD EPYC processors (excellent NUMA optimization)
- Running CPU-only simulations
Use external MPI when:
- Need multi-node simulations
- Running across multiple compute nodes
- Using specialized MPI features not supported by thread-MPI
Thread-MPI configuration best practices:
# Optimal thread-MPI setup for AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2 # 2 OpenMP threads per MPI rank
export NTMPI=128 # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)
# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
Pinning and thread counts work together¶
Warning
-pin auto and -ntomp are complementary, not alternatives!
A common misconception is that using thread pinning (-pin auto) means you can omit the -ntomp parameter. This is incorrect. Here’s how they work together:
What each parameter does:
- -ntomp: Specifies the number of OpenMP threads per MPI rank
- -pin auto: Controls how GROMACS maps those threads to CPU cores
Why you need both:
# CORRECT: Both parameters work together
gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: 128 MPI ranks × 2 OpenMP threads = 256 total threads
# GROMACS pins each of these 256 threads to specific CPU cores

# INCORRECT: Omitting -ntomp
gmx mdrun -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
# Result: GROMACS may use default thread count, not optimal for your hardware
How GROMACS uses both parameters:
From the source code analysis, GROMACS’s thread affinity system:
- First: Determines the total number of threads = -ntmpi × -ntomp
- Then: Maps each thread to a specific core using the hardware topology
- Finally: Applies pinning based on the -pin auto settings
Example thread distribution:
Rank 0: Thread 0 → Core 0 (pinned)
Rank 0: Thread 1 → Core 1 (pinned)
Rank 1: Thread 0 → Core 2 (pinned)
Rank 1: Thread 1 → Core 3 (pinned)
...and so on
Best practice: Always specify both
# For AMD EPYC 7H12 (128 cores, 256 threads)
export NTOMP=2
export NTMPI=128
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
This ensures optimal thread distribution and core pinning for your specific hardware.
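Once a run has started, the rank and thread counts mdrun actually used can be verified in its log file (a sketch; the log name follows -deffnm, and the exact log wording may vary between GROMACS versions):

# The log reports the number of thread-MPI ranks and OpenMP threads per rank in use
grep -iE "using [0-9]+ (mpi|openmp) thread" prefix.log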
AMD EPYC thread optimization: The 2:1 Rule¶
Important
AMD EPYC processors benefit from 2 threads per core!
Based on performance testing and GROMACS source code analysis, AMD EPYC processors (including the EPYC 7H12 on Discoverer) show optimal performance when using 2 OpenMP threads per physical core rather than 1:1 or higher ratios.
Why 2:1 Thread-to-Core ratio works best:
- AMD EPYC Architecture: Each EPYC core has 2 hardware threads (SMT - Simultaneous Multithreading)
- Memory Bandwidth: AMD EPYC has excellent memory bandwidth that can sustain 2 threads per core
- Cache Efficiency: Shared L3 cache benefits from 2 threads working on related data
- NUMA Optimization: 2 threads per core better utilize the NUMA topology
Optimal configuration for AMD EPYC 7H12 (128 cores, 256 threads):
# CORRECT: 2 threads per core
export NTOMP=2     # 2 OpenMP threads per MPI rank
export NTMPI=128   # 128 thread-MPI ranks
# Total: 128 × 2 = 256 threads (matches 256 logical threads)

# INCORRECT: 1 thread per core (wastes SMT capability)
export NTOMP=1
export NTMPI=256
# Result: Poorer performance, underutilized hardware

# INCORRECT: 4 threads per core (oversubscription)
export NTOMP=4
export NTMPI=64
# Result: Context switching overhead, cache thrashing
Performance impact:
Thread ratio | Performance | Memory usage | CPU utilisation |
---|---|---|---|
1:1 (1 thread/core) | ~70% of optimal | Lower | ~50% |
2:1 (2 threads/core) | 100% (optimal) | Optimal | ~95% |
4:1 (4 threads/core) | ~60% of optimal | Higher | ~90% |
Why this matters for GROMACS:
- Domain decomposition: GROMACS’s domain decomposition algorithm benefits from having more MPI ranks (128 vs 64)
- Load balancing: More MPI ranks provide better load balancing across the system
- Communication overlap: 2 threads per core allows better overlap of computation and communication
- Memory access patterns: AMD EPYC’s memory subsystem is optimized for 2 threads per core
Implementation in your SLURM scripts:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 256
module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
# AMD EPYC 7H12 optimization: 2 threads per core (128 cores, 256 threads)
export NTOMP=2
export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP)) # 256 / 2 = 128
# Let GROMACS handle thread affinity
unset OMP_PROC_BIND
unset GOMP_CPU_AFFINITY
unset KMP_AFFINITY
gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix
Note for other processors:
- Intel Xeon: Often benefits from 1:1 or 2:1 depending on generation
- AMD EPYC: Consistently benefits from 2:1 ratio
- ARM: Varies by implementation, typically 1:1
This 2:1 optimization is specific to AMD EPYC’s architecture and should be applied consistently across all single-node GROMACS simulations on Discoverer.
SLURM resource allocation and accounting for GROMACS tools¶
Why GROMACS tools must use 1 CPU Core (2 Threads):
GROMACS tools (like grompp, cluster, rms, etc.) are designed to run as single-threaded processes. However, for proper SLURM accounting and resource management on AMD EPYC processors, they must be allocated 1 CPU core, which corresponds to 2 threads due to AMD’s SMT (Simultaneous Multithreading) architecture.
SLURM resource allocation requirements on Discoverer:
#SBATCH --ntasks-per-node 1   # Single process
#SBATCH --cpus-per-task 2     # 1 CPU core (2 threads on AMD EPYC)
Why this configuration is mandatory:
- SLURM accounting accuracy:
- SLURM tracks resource usage per CPU core
- 1 core = 2 threads on AMD EPYC 7H12
- Tools must be allocated complete cores for proper billing
- Partial core allocation can cause accounting errors
- AMD EPYC architecture:
- Each physical core has 2 logical threads (SMT)
- Tools cannot use “half a core” - they get the full core
- Even single-threaded tools occupy 1 complete core
- This ensures consistent resource tracking
- Resource management benefits:
- Accurate billing: Users are charged for exactly 1 CPU core
- Fair usage: Prevents resource over-allocation
- Predictable performance: Tools get dedicated core resources
- SLURM compliance: Follows proper resource allocation patterns
Configuration | SLURM billing | Resource usage | Accounting status |
---|---|---|---|
--ntasks-per-node=1 and --cpus-per-task=1 | ❌ Incorrect | ❌ Incomplete CPU core allocation | ❌ Accounting error |
--ntasks-per-node=1 and --cpus-per-task=2 | ✅ Correct | ✅ 1 complete CPU core | ✅ Proper billing |
Tool type | SLURM configuration | #CPU cores utilised | #CPU threads utilised | Purpose |
---|---|---|---|---|
MD integrator (mdrun) | --ntasks-per-node=128 and --cpus-per-task=2 | 128 | 256 | Full node utilisation |
Single process execution | --ntasks-per-node=1 and --cpus-per-task=2 | 1 | 2 | Running GROMACS tools on 1 CPU core |
Best practices for resource allocation on Discoverer:
- Always use complete cores: --cpus-per-task=2 for GROMACS tools (unless GROMACS documentation says otherwise)
- Avoid partial CPU core allocation: Avoid --cpus-per-task=1 when --ntasks-per-node=1
- Match AMD EPYC architecture: 2 CPU threads per core allocation for high performance
- Ensure proper accounting: Complete CPU core allocation for billing accuracy
Why this matters for Discoverer:
- Cost Control: Accurate billing prevents unexpected charges
- Resource Efficiency: Tools receive exactly the resources they require
- Fair Usage: All users follow the same allocation rules
- Performance Predictability: Consistent resource availability
This resource allocation strategy ensures that GROMACS tools are properly accounted for in the SLURM system while making efficient use of the AMD EPYC processor architecture.
CPU Thread affinity and pinning¶
GROMACS has its own internal CPU thread affinity management system (see gmxomp.cpp):
- Automatically sets thread affinity by default when using all CPU cores on a compute node
- Detects CPU thread-related environment variables (OMP_PROC_BIND, GOMP_CPU_AFFINITY, KMP_AFFINITY)
- Disables its own affinity setting when these environment variables are set, to avoid conflicts
Official GROMACS recommendation:
# Let GROMACS handle thread affinity internally (recommended)
gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# Or explicitly enable GROMACS thread pinning
gmx_mpi mdrun -pin on -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

# For multi-node simulations
mpirun gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}
Thread Affinity Options:
- -pin auto (default): GROMACS automatically sets thread affinity when using all node cores
- -pin on: Force GROMACS to set thread affinity
- -pin off: Disable GROMACS thread affinity setting
- -pinoffset N: Specify the starting core for thread pinning
- -pinstride N: Specify the stride between pinned cores (see the example below)
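The -pinoffset and -pinstride options matter mainly when more than one mdrun instance shares a node. A minimal sketch, assuming a 128-core node split between two independent thread-MPI runs (the offsets, strides and TPR names are illustrative placeholders and must be matched to the actual node topology):

# First simulation pinned starting at logical core 0
gmx mdrun -ntmpi 64 -ntomp 1 -pin on -pinoffset 0 -pinstride 1 -s run1.tpr -deffnm run1 &
# Second simulation pinned starting at logical core 64, so the two runs do not overlap
gmx mdrun -ntmpi 64 -ntomp 1 -pin on -pinoffset 64 -pinstride 1 -s run2.tpr -deffnm run2 &
wait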
Note
The existing Discoverer documentation uses OpenMP environment variables, but the GROMACS source code suggests letting GROMACS manage CPU thread affinity internally for optimal performance.
Domain decomposition guidelines¶
Grid selection guidelines:
When using multi-node MPI simulations, consider these factors:
System size guidelines:
- Small systems (<50k atoms): 1-2 nodes sufficient
- Medium systems (50k-100k atoms): 2-4 nodes recommended
- Large systems (>100k atoms): 4-8 nodes optimal
- Very large systems (>200k atoms): 8+ nodes required
Communication overhead considerations:
- More nodes = more MPI communication overhead
- Balance parallelization benefits against communication costs
- Optimize InfiniBand communication by setting export UCX_NET_DEVICES=mlx5_0:1
Grid optimization rules:
GROMACS works natively with the following grid configurations:
Nodes | PP Ranks | Grid | PME Ranks | Description |
---|---|---|---|---|
1 | 1 | 1x1x1 | 0 | Single node, no DD |
2 | 2 | 2x1x1 | 0 | 2 PP ranks |
4 | 4 | 2x2x1 | 0 | 4 PP ranks |
8 | 8 | 2x2x2 | 0 | 8 PP ranks |
16 | 16 | 4x2x2 | 0 | 16 PP ranks |
32 | 32 | 4x4x2 | 0 | 32 PP ranks |
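To see which decomposition mdrun actually selected, and to override it when necessary, something along these lines can be used (a sketch; the -dd and -npme values are illustrative and the log wording may differ between GROMACS versions):

# Inspect the domain decomposition grid reported in the mdrun log
grep -i "domain decomposition" prefix.log
# Force a 4x2x2 PP grid and let GROMACS choose the number of PME ranks (-npme -1)
mpirun gmx_mpi mdrun -dd 4 2 2 -npme -1 -s prefix.tpr -deffnm prefix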
Getting help¶
See Getting help