GROMACS (CPU and GPU)
=====================

.. toctree::
   :maxdepth: 1
   :caption: Contents:

.. contents:: Table of Contents
   :depth: 3

Versions and build types available
----------------------------------

.. warning::

   This document describes running GROMACS on the Discoverer CPU and GPU clusters.

.. note::

   The versions of GROMACS installed in the software repository are built and
   supported by the Discoverer HPC team. The MPI builds should be employed for
   running the actual simulations (``mdrun``) and deriving trajectories, while
   the thread-MPI ones should be regarded mostly as a tool set for trajectory
   post-processing.

To check which GROMACS versions are currently supported on the Discoverer CPU
and GPU clusters, execute on the login node:

.. code-block:: bash

   module avail gromacs

The following environment module naming convention is applied to the modules
servicing the access to the software repository on both CPU and GPU clusters:

.. code-block:: bash

   gromacs/MAJOR_N/MAJOR_N.MINOR_N-comp-num_lib-gpuavail-mpi_lib

where:

- ``MAJOR_N`` - the major number of the GROMACS version (example: 2022)
- ``MINOR_N`` - the minor number of the GROMACS version (example: 1, which stands for 2022.1)
- ``comp`` - the compiler collection employed for compiling the source code (example: intel)
- ``num_lib`` - the numerical library providing the BLAS and FFTW implementations libgromacs is linked against (example: openblas)
- ``gpuavail`` - shows whether the build supports GPU acceleration (example: nogpu, which means no GPU support)
- ``mpi_lib`` - the MPI library the GROMACS code is linked against (example: openmpi, which implies the use of the Open MPI library)

The installed versions are compiled based on the following recipes:

https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/gromacs

CPU versions (Discoverer CPU cluster)
.....................................

The following are CPU-only GROMACS builds for the Discoverer (CPU) cluster
(AMD EPYC nodes). The module name includes ``nogpu`` in the ``gpuavail``
field. Two builds are available; they are for the Discoverer CPU cluster only
and must not be run on Discoverer+.

Discoverer provides two different GROMACS installations optimized for
different use cases (see also `Choosing the right build`_):

1. Thread-MPI build (single-node optimized)

.. warning::

   This build does not support PLUMED.

.. code:: bash

   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

Use this build when:

- Running simulations on a single compute node
- Seeking maximum performance for single-node simulations
- Running analysis tools (with ``-ntmpi 1``)
- Working with AMD EPYC processors
- Running CPU-only simulations

Features:

- Optimized for single-node performance
- Can run analysis tools by setting ``-ntmpi 1``
- Excellent NUMA optimization
- Lower memory overhead
- Faster startup times

Executable name: ``gmx``

Example usage inside a SLURM job script:

.. code:: bash

   # Single-node simulation
   gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix

   # Analysis tool (single thread-MPI rank)
   gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
   gmx mdrun -ntmpi 1 -s npt.tpr -deffnm npt

For more details see `Single-Node Thread-MPI Script`_.
2. External MPI build (multi-CPU-core and multi-node capable)

.. note::

   This build supports interaction with PLUMED.

.. code:: bash

   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

Use this build when:

- Running simulations across multiple compute nodes
- Multi-node parallelization is needed
- Using Open MPI for distributed computing
- Running large-scale simulations requiring multiple nodes

Features:

- Supports multi-node simulations
- Uses Open MPI for inter-node communication
- Compatible with SLURM multi-node job submission
- Can handle larger systems across multiple nodes

Executable name: ``gmx_mpi``

Example usage inside a SLURM job script:

.. code:: bash

   # Multi-node simulation (on 2 nodes with 128 CPU cores per node)
   mpirun -np 256 gmx_mpi mdrun -ntomp 2 -pin auto -s prefix.tpr -deffnm prefix

For more details see `Multi-Node External MPI Script`_.

GPU versions (Discoverer+)
..........................

.. warning::

   No CPU-only GROMACS build is available or may be run on Discoverer+ under
   any circumstances. If you need to run CPU-only GROMACS, you must request an
   account on the Discoverer CPU cluster and run your CPU-only jobs there. No
   exceptions.

.. warning::

   One GPU is sufficient for most GROMACS GPU runs. Request more than one GPU
   (e.g. ``--gres=gpu:2``) only if you have confirmed that your system and
   workload require it. Over-requesting GPUs wastes resources and can delay
   your job and others.

The following GROMACS modules provide CUDA-accelerated builds for the
Discoverer+ (GPU+CPU) cluster. They run on DGX H200 nodes (NVIDIA H200 GPUs,
Hopper architecture). For cluster and partition details, see the resource
overview documentation.

The modules on the GPU cluster load a CUDA-aware MPI stack (Open MPI built
with CUDA support). Jobs that use the MPI build (``gmx_mpi``) may run on
multiple nodes; the thread-MPI build (``gmx``) runs on a single node only.

Two GPU builds are available. Run ``module avail gromacs`` on Discoverer+ to
see the exact module names (e.g. ``gromacs/2026/-cuda-mpi`` and
``gromacs/2026/-cuda-threadmpi``).

1. Thread-MPI build (single-node, GPU-accelerated)

.. code-block:: bash

   module load gromacs/2026/-cuda-threadmpi

Use this build when:

- Running GPU-accelerated simulations on a single compute node
- Running GROMACS analysis tools (with ``-ntmpi 1``)
- One node is enough and you prefer lower overhead than the MPI build

Features:

- Single process (thread-MPI); one node only
- CUDA used for the non-bonded and PME workload
- Lower memory and startup overhead than the MPI build on one node

Executable name: ``gmx``

Example usage inside a SLURM job script:

.. code-block:: bash

   gmx mdrun -s prefix.tpr -deffnm prefix
   gmx mdrun -ntmpi 1 -s prefix.tpr -deffnm prefix  # analysis tool

Single-node GPU thread-MPI job (Discoverer+)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Example SLURM script for the CUDA thread-MPI build on one node: one process
(thread-MPI), multiple thread-MPI ranks, one GPU. Use ``-ntmpi N`` for the
number of thread-MPI ranks and ``-npme 1`` so that one rank handles PME on the
GPU (required with multiple ranks when PME is on the GPU). Use ``-update cpu``
if your system has three or more consecutively coupled constraints. The
variables ``GMX_FORCE_GPU_AWARE_MPI`` and ``GMX_GPU_PME_DECOMPOSITION`` are
relevant for the external MPI build; they are irrelevant for thread-MPI but
harmless if set. ``GMX_ENABLE_DIRECT_GPU_COMM`` can be useful for multi-rank
thread-MPI GPU runs. Replace the account and QOS with your own.
.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=gromacs_gpu_threadmpi
   #SBATCH --time=02:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --gres=gpu:1
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 1
   #SBATCH --cpus-per-task 8
   #SBATCH -o slurm.%j.out
   #SBATCH -e slurm.%j.err

   module purge || exit
   module load gromacs/2026/-cuda-threadmpi || exit

   export OMP_NUM_THREADS=1
   export OMP_PROC_BIND=true
   export OMP_PLACES=cores

   export GMX_ENABLE_DIRECT_GPU_COMM=1
   export GMX_FORCE_GPU_AWARE_MPI=1
   export GMX_GPU_PME_DECOMPOSITION=1

   cd $SLURM_SUBMIT_DIR

   gmx mdrun -v -s prefix.tpr -deffnm prefix \
       -ntmpi 8 -npme 1 -ntomp 1 \
       -pme gpu -pmefft gpu -bonded gpu -nb gpu -update cpu

2. MPI build (multi-node capable, CUDA-aware MPI)

.. code-block:: bash

   module load gromacs/2026/-cuda-mpi

Use this build when:

- Running GPU-accelerated simulations across multiple compute nodes
- The external MPI library (Open MPI) is needed for your workflow
- Running on one or multiple hosts with CUDA-aware MPI

Features:

- Uses external Open MPI built with CUDA support (CUDA-aware MPI)
- Jobs running ``gmx_mpi`` may run on multiple nodes
- CUDA used for GPU acceleration of the non-bonded and PME workload
- Compatible with SLURM multi-node job submission on Discoverer+

Executable name: ``gmx_mpi``

Example usage inside a SLURM job script:

.. code-block:: bash

   # Multi-node GPU run (example: 2 nodes, N total MPI ranks)
   srun --mpi=pmix gmx_mpi mdrun -s prefix.tpr -deffnm prefix

Single-node GPU MPI job (Discoverer+)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Example SLURM script for the CUDA MPI build on one node with 8 MPI ranks,
1 GPU, and domain decomposition with one PME rank. Use ``srun --mpi=pmix`` so
that the correct number of ranks is launched. Set ``-npme 1`` (or match your
PP rank count) and ensure the number of MPI ranks matches the domain
decomposition (e.g. 8 ranks → 7 PP + 1 PME). Replace the account and QOS with
your own.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=test_gromacs
   #SBATCH --time=02:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --gres=gpu:1
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 8
   #SBATCH --ntasks-per-core 1
   #SBATCH --cpus-per-task 2
   #SBATCH -o slurm.%j.out
   #SBATCH -e slurm.%j.err

   module purge || exit
   module load gromacs/2026/-cuda-mpi || exit

   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=true
   export OMP_PLACES=cores
   export OMP_SCHEDULE=dynamic

   export GMX_ENABLE_DIRECT_GPU_COMM=1
   export GMX_FORCE_GPU_AWARE_MPI=1
   export GMX_GPU_PME_DECOMPOSITION=1

   cd $SLURM_SUBMIT_DIR

   srun --mpi=pmix gmx_mpi mdrun -v -s prefix.tpr -deffnm prefix \
       -npme 1 -pme gpu -pmefft gpu -bonded gpu -nb gpu -update cpu

User-supported versions
.......................

Users are welcome to bring in, or compile, and use their own builds of
GROMACS, but those builds will not be supported by the Discoverer HPC team.

Running simulations (mdrun)
---------------------------

Running simulations means invoking ``mdrun`` to generate trajectories from a
given TPR file.

.. warning::

   You MUST NOT execute simulations directly upon the login node
   (login.discoverer.bg). You have to run your simulations as SLURM jobs only.

.. warning::

   Write your trajectories and results of analysis only inside your
   :doc:`projectfolder` and DO NOT use for that purpose (under any
   circumstances) your :doc:`homefolder`!
Single-node thread-MPI script
.............................

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn           ### Partition (you may need to change this)
   #SBATCH --job-name=gromacs_single_node
   #SBATCH --time=512:00:00         ### WallTime - set it accordingly
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes 1                # MUST BE 1 for thread-MPI
   #SBATCH --ntasks-per-node 1      # MUST BE 1 for thread-MPI
   #SBATCH --cpus-per-task 256      # N MPI threads x M OpenMP threads (128 * 2 for AMD EPYC 7H12)
   #SBATCH -o slurm.%j.out          # STDOUT
   #SBATCH -e slurm.%j.err          # STDERR

   module purge
   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

   # AMD EPYC 7H12 optimization: 2 threads per core
   export NTOMP=2
   export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))   # 256 / 2 = 128

   # Let GROMACS handle thread affinity
   unset OMP_PROC_BIND
   unset GOMP_CPU_AFFINITY
   unset KMP_AFFINITY

   cd $SLURM_SUBMIT_DIR

   gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -v -s prefix.tpr -deffnm prefix -pin auto

.. note::

   Thread-MPI NUMA configuration: unlike external MPI builds, thread-MPI ones
   cannot use ``--ntasks-per-socket``, because thread-MPI runs as a single
   process with internal thread management. Thread-MPI allocates all 256
   logical CPUs to one process and relies on GROMACS's internal ``-pin auto``
   mechanism to optimize thread placement across NUMA domains. This provides
   less explicit control over NUMA domain usage compared to external MPI, but
   simplifies resource management for single-node simulations.

Specify the parameters and resources required for successfully running and
completing the job:

- SLURM partition (``--partition``): Specifies which group of compute nodes (partition) to use. For GROMACS on Discoverer, use the ``cn`` partition, which contains the CPU-optimized nodes with AMD EPYC processors. This partition provides the best performance for molecular dynamics simulations.
- Job name (``--job-name``): A descriptive identifier for your job that will appear in the queue and job listings. Use meaningful names like ``gromacs_protein_sim``, ``gromacs_membrane_run``, or ``gromacs_equilibration`` to easily identify your simulations.
- Wall time (``--time``): Maximum time your job is allowed to run before being terminated. The format is ``HH:MM:SS`` (e.g., ``48:00:00`` for 48 hours, ``12:30:00`` for 12.5 hours). Set this based on your simulation size, expected runtime, and queue policies. Underestimating may cause job termination before completion.
- Number of compute nodes (``--nodes``): How many physical nodes to allocate for your simulation. For single-node thread-MPI simulations, always use ``1``. For multi-node external MPI simulations, this determines the total computational power and memory available.
- Number of MPI processes per node (``--ntasks-per-node``): A critical parameter for GROMACS performance. For thread-MPI builds, it must be ``1`` (single process). For external MPI builds on Discoverer with 8 NUMA domains per node, use ``128`` to get 16 MPI tasks per NUMA domain for optimal memory locality and cache utilization.
- Number of OpenMP threads per MPI process (``--cpus-per-task``): Controls hybrid parallelism by specifying how many logical CPUs each MPI process can use. For thread-MPI: use ``256`` (all available CPUs). For external MPI: use ``2`` to utilize hyperthreading while maintaining good NUMA performance with 2 OpenMP threads per MPI task (a sanity-check sketch for this arithmetic follows after this list).
- GROMACS version (``module load``): Choose the appropriate version and build based on your simulation requirements. Thread-MPI builds are optimized for single-node simulations, while external MPI builds support multi-node scaling. See `Versions and build types available`_ for the available builds and their specific characteristics.
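As referenced in the list above, the arithmetic between ``--cpus-per-task``,
``NTOMP``, and ``NTMPI`` must stay consistent, otherwise ``mdrun`` will
oversubscribe or underuse the allocation. The following minimal sketch (plain
bash, no Discoverer-specific assumptions) can be placed in the job script just
before the ``mdrun`` line to abort early on a mismatch:

.. code:: bash

   # Sanity check: thread-MPI ranks × OpenMP threads must equal the CPU allocation
   NTOMP=2
   NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))

   if (( NTMPI * NTOMP != SLURM_CPUS_PER_TASK )); then
       echo "ERROR: ${NTMPI} ranks x ${NTOMP} threads != ${SLURM_CPUS_PER_TASK} allocated CPUs" >&2
       exit 1
   fi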
Multi-node external MPI script
..............................

.. note::

   Here the term "external" means that the MPI library is not the one built
   into the GROMACS code; in other words, it is not the thread-MPI library
   mentioned above.

This script is used for multi-node external MPI simulations. It is based on
our build of GROMACS that uses Open MPI as the MPI library.

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn                # Partition (you may need to change this)
   #SBATCH --job-name=gromacs_multi_node # Job name
   #SBATCH --time=512:00:00              # WallTime - set it accordingly
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=2                     # Number of nodes
   #SBATCH --ntasks-per-node=128         # Number of MPI tasks to run upon each node
   #SBATCH --ntasks-per-socket=16        # Number of tasks per NUMA-bound socket
   #SBATCH --cpus-per-task=2             # Two threads per MPI rank
   #SBATCH --ntasks-per-core=1           # Each MPI rank is bound to a CPU core
   #SBATCH --mem=251G                    # Do not exceed this on Discoverer CPU cluster
   #SBATCH -o slurm.%j.out               # STDOUT
   #SBATCH -e slurm.%j.err               # STDERR

   module purge
   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi

   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=false

   # Optimize InfiniBand communication
   export UCX_NET_DEVICES=mlx5_0:1

   cd ${SLURM_SUBMIT_DIR}

   mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
          gmx_mpi mdrun -ntomp ${OMP_NUM_THREADS} -v \
          -s prefix.tpr -deffnm prefix

In the script above, edit the parameters and resources required for
successfully running and completing the job:

- SLURM partition of compute nodes (``--partition``): Specifies which group of nodes (partition) to use. For GROMACS on Discoverer, use the ``cn`` partition, which contains the CPU-optimized nodes.
- Job name (``--job-name``): A descriptive name for your job that will appear in the queue. Use meaningful names like ``gromacs_protein_sim`` or ``gromacs_membrane_run``.
- Wall time (``--time``): Maximum time your job can run. The format is ``HH:MM:SS`` (e.g., ``48:00:00`` for 48 hours). Set this based on your simulation size and expected runtime.
- Number of compute nodes (``--nodes``): How many physical nodes to allocate. For multi-node GROMACS simulations, this determines the total computational power available.
- Number of MPI processes per node (``--ntasks-per-node``): Critical for GROMACS performance. On Discoverer with 8 NUMA domains per node, use 128 MPI tasks to get 16 tasks per NUMA domain for optimal memory locality.
- Number of MPI tasks per NUMA domain (``--ntasks-per-socket``): Essential for NUMA-aware performance. Set to 16 to place exactly 16 MPI tasks per NUMA domain (128 total tasks ÷ 8 NUMA domains = 16 per domain). This ensures optimal memory access patterns and cache utilization within each NUMA boundary.
- Number of OpenMP threads per MPI process (``--cpus-per-task``): Controls hybrid parallelism. Use 2 threads per MPI task to utilize hyperthreading while maintaining good NUMA performance.
- GROMACS version (``module load``): Choose the appropriate version based on your simulation requirements. See `Versions and build types available`_ for the available builds and their characteristics.
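The multi-node script above pins UCX to the ``mlx5_0:1`` InfiniBand port.
Before hardcoding that value, the devices UCX actually detects on a node can
be listed; this is a hedged sketch assuming the UCX command-line utilities are
available in the job environment:

.. code:: bash

   # List the network devices UCX can use; names like mlx5_0:1
   # are what UCX_NET_DEVICES expects
   ucx_info -d | grep -i device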
Save the complete SLURM job description as a file, for example
``/valhalla/projects//run_gromacs/run_gromacs.sh``, and submit it to the
queue:

.. code:: bash

   cd /valhalla/projects//run_gromacs/
   sbatch run_gromacs.sh

Upon successful submission, the standard output will be directed by SLURM
into the file ``/valhalla/projects//run_gromacs/slurm.%j.out`` (where ``%j``
stands for the SLURM job ID), while the standard error output will be stored
in ``/valhalla/projects//run_gromacs/slurm.%j.err``.

Running GROMACS tools
---------------------

Script for executing single-threaded non-interactive GROMACS tools
..................................................................

Use this for single-threaded GROMACS tools like ``grompp``, ``editconf``, etc.:

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn           ### Partition (you may need to change this)
   #SBATCH --job-name=gromacs_single_thread
   #SBATCH --time=00:30:00          ### WallTime - set it accordingly
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes 1                # Single node
   #SBATCH --ntasks-per-node 1      # Single task
   #SBATCH --cpus-per-task 2        # 1 CPU core (2 threads on AMD EPYC)
   #SBATCH -o slurm.%j.out          # STDOUT
   #SBATCH -e slurm.%j.err          # STDERR

   module purge
   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

   cd $SLURM_SUBMIT_DIR

   # Single-threaded tools (no -ntmpi needed)
   gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr
   gmx editconf -f protein.gro -o protein_box.gro -c -d 1.0 -bt cubic
   gmx solvate -cp protein_box.gro -cs spc216.gro -o solv.gro -p topol.top
   gmx grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr

Script for executing interactive GROMACS tools in non-interactive mode
......................................................................

Use this for GROMACS tools like ``cluster``, ``rms``, ``gyrate``, ``hbond``,
``do_dssp``, etc.:

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn           ### Partition (you may need to change this)
   #SBATCH --job-name=gromacs_interactive
   #SBATCH --time=01:00:00          ### WallTime - set it accordingly
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes 1                # Single node
   #SBATCH --ntasks-per-node 1      # Single task
   #SBATCH --cpus-per-task 2        # 1 CPU core (2 threads on AMD EPYC)
   #SBATCH -o slurm.%j.out          # STDOUT
   #SBATCH -e slurm.%j.err          # STDERR

   module purge
   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

   cd ${SLURM_SUBMIT_DIR}

   # Interactive tools using echo pipes for input
   # Format: echo -e "input1\ninput2\n..." | gmx tool_name [options]
   #
   # How echo pipes simulate interactive input:
   # echo -e "4\n1" simulates: type "4", press Enter, type "1", press Enter
   # So "4\n1" replaces the interactive sequence: 4 [Enter] 1 [Enter]

   # Example 1: Cluster analysis
   echo -e "1\n1" | gmx cluster -f trajectory.trr -s structure.tpr -n index.ndx \
       -cutoff 0.15 -method jarvis-patrick -M 0 \
       -o cluster_output -g cluster.log -dist cluster_dist \
       -cl cluster.pdb -nst 250 -wcl 10000

   # Example 2: RMSD analysis
   echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr \
       -o rmsd.xvg -tu ns

   # Example 3: Radius of gyration
   echo -e "1\n1" | gmx gyrate -f trajectory.trr -s structure.tpr \
       -o gyrate.xvg -p -n index.ndx

   # Example 4: Hydrogen bond analysis
   echo -e "1\n1" | gmx hbond -f trajectory.trr -s structure.tpr \
       -num hbond.xvg -tu ns

   # Example 5: Secondary structure analysis
   echo -e "1\n1" | gmx do_dssp -f trajectory.trr -s structure.tpr \
       -o ss.xpm -sc scount.xvg

Save the complete SLURM job description as a file, for example
``/valhalla/projects//run_gromacs/gromacs_tools.sh``, and submit it to the
queue:

.. code:: bash

   cd /valhalla/projects//run_gromacs/
   sbatch gromacs_tools.sh
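As an alternative to ``echo -e`` pipes, the same answers can be supplied
through a here-document, which some users find easier to read when several
selections are involved. A sketch of the RMSD example written that way (a
plain shell feature, nothing extra assumed):

.. code:: bash

   # Here-document: each line is one interactive answer, in order
   gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg -tu ns <<'EOF'
   4
   1
   EOF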
.. list-table:: Common Interactive GROMACS Tools and Their Input Patterns
   :header-rows: 1

   * - Tool
     - Purpose
     - Typical Input
     - What You'd Type Interactively
     - Example Command
   * - ``gmx cluster``
     - Cluster analysis
     - ``"1\n1"``
     - Type "1", press Enter, type "1", press Enter
     - ``echo -e "1\n1" | gmx cluster ...``
   * - ``gmx rms``
     - RMSD calculation
     - ``"4\n1"``
     - Type "4", press Enter, type "1", press Enter
     - ``echo -e "4\n1" | gmx rms ...``
   * - ``gmx gyrate``
     - Radius of gyration
     - ``"1\n1"``
     - Type "1", press Enter, type "1", press Enter
     - ``echo -e "1\n1" | gmx gyrate ...``
   * - ``gmx hbond``
     - Hydrogen bonds
     - ``"1\n1"``
     - Type "1", press Enter, type "1", press Enter
     - ``echo -e "1\n1" | gmx hbond ...``
   * - ``gmx do_dssp``
     - Secondary structure
     - ``"1\n1"``
     - Type "1", press Enter, type "1", press Enter
     - ``echo -e "1\n1" | gmx do_dssp ...``
   * - ``gmx trjconv``
     - Trajectory conversion
     - ``"0"``
     - Type "0", press Enter
     - ``echo -e "0" | gmx trjconv ...``
   * - ``gmx select``
     - Atom selection
     - ``"1\n1"``
     - Type "1", press Enter, type "1", press Enter
     - ``echo -e "1\n1" | gmx select ...``

Understanding the table columns:

- "Typical Input": the echo pipe string that simulates interactive input in SLURM
- "What You'd Type Interactively": the exact keystrokes you'd make if running the tool on a personal workstation

How to convert interactive commands to batch commands, step by step:

1. Interactive session (on a personal workstation):

   .. code:: bash

      $ gmx rms
      Select group for least squares fit (1-4): 4
      Select group for RMSD calculation (1-4): 1

2. Batch session (in a SLURM script):

   .. code:: bash

      echo -e "4\n1" | gmx rms -f trajectory.trr -s structure.tpr -o rmsd.xvg

Translation rules:

- Each number you type → becomes part of the echo string
- Each Enter key press → becomes ``\n`` (newline)
- Multiple inputs → separated by ``\n``
- Final Enter → usually not needed (the tool processes the input automatically)

.. list-table:: Translation of ``"4\n1"``
   :widths: 25 25 50
   :header-rows: 1

   * - Interactive Action
     - Echo String
     - Explanation
   * - Type "4", press Enter
     - ``"4\n"``
     - First input, followed by a newline
   * - Type "1", press Enter
     - ``"1"``
     - Second input (the final newline is usually not needed)
   * - Combined
     - ``"4\n1"``
     - Both inputs in one string

Common group numbers:

- "0": System (all atoms)
- "1": Protein
- "2": Non-protein
- "3": Water
- "4": Backbone (protein backbone only)

Tips for converting interactive commands:

1. Test interactively first: run the command on your workstation to see what inputs are needed
2. Count the inputs: note how many numbers you need to type
3. Add newlines: put ``\n`` between each input
4. Use ``echo -e``: the ``-e`` flag enables interpretation of backslash escapes like ``\n``
5. Pipe to the command: use ``|`` to feed the input to the GROMACS tool
6. Check group numbers: use ``gmx make_ndx`` to see the available groups and their numbers (see the sketch below)
7. Error handling: check the log files for any input errors
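Since group numbers depend on the system and on any index file in use, it is
worth printing them before scripting the pipes, as mentioned in the tips
above. A minimal sketch: piping a single ``q`` makes ``gmx make_ndx`` print
the available groups and quit immediately.

.. code:: bash

   # Print the default index groups for a structure, then quit ("q")
   echo q | gmx make_ndx -f structure.gro -o index.ndx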
Technical details
-----------------

This section describes build choices and tuning. Where it refers to
partitions, NUMA, or CPU-only behaviour, it applies to the Discoverer CPU
cluster. For Discoverer+ (GPU cluster), use the GPU builds described under
`GPU versions (Discoverer+)`_; the "Choosing the right build" tables below
summarise the options on both clusters.

Choosing the right build
........................

On Discoverer (CPU cluster)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Scenario
     - Recommended Build
     - Module to Load
   * - Single-node simulation
     - Thread-MPI
     - ``gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi``
   * - Analysis tools
     - Thread-MPI
     - ``gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi``
   * - Multi-node simulation
     - External MPI
     - ``gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi``

On Discoverer+ (GPU cluster)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - Scenario
     - Recommended Build
     - Module / Executable
   * - Single-node GPU simulation, analysis tools
     - Thread-MPI (CUDA)
     - ``gromacs/2026/-cuda-threadmpi``; executable ``gmx``
   * - Multi-node GPU simulation, or when external MPI is required
     - MPI (CUDA-aware)
     - ``gromacs/2026/-cuda-mpi``; executable ``gmx_mpi``

Run ``module avail gromacs`` on Discoverer+ to see the exact module names. No
CPU-only GROMACS may be run on Discoverer+; use the Discoverer CPU cluster for
CPU-only jobs.

.. note::

   AMD EPYC optimization applies to both CPU builds on Discoverer. The 2:1
   thread-to-core ratio and other AMD EPYC-specific optimizations work with
   both thread-MPI and external MPI builds. The choice between builds is based
   on single-node vs. multi-node requirements, not on processor optimization.

Performance comparison on Discoverer (CPU cluster):

- Thread-MPI: 10-20% faster for single-node simulations
- External MPI: required for multi-node runs, but slower on a single node
- Memory usage: thread-MPI uses ~30% less memory per node

Important notes:

- Thread-MPI cannot run across multiple nodes (on either cluster)
- External MPI can run on a single node, but with a performance penalty (CPU cluster)
- On Discoverer+, only the CUDA builds are available; ``gmx_mpi`` may run on multiple nodes with CUDA-aware MPI
- Analysis tools work with thread-MPI when using ``-ntmpi 1`` (CPU or GPU thread-MPI build)
- Both CPU builds support the same GROMACS features (except multi-node runs for thread-MPI)
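The single-node vs. multi-node decision can also be encoded in the job script
itself, so one script serves both cases. A sketch, assuming the CPU-cluster
module names listed above are current (verify with ``module avail gromacs``):

.. code:: bash

   # Pick the build according to the number of nodes SLURM allocated
   if [ "${SLURM_NNODES:-1}" -gt 1 ]; then
       module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-openmpi
       mpirun gmx_mpi mdrun -v -s prefix.tpr -deffnm prefix
   else
       module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi
       gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -v -s prefix.tpr -deffnm prefix
   fi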
Understanding thread-MPI
........................

.. important::

   Thread-MPI is GROMACS's internal threading library that implements a subset
   of the MPI 1.1 specification using system threads instead of separate
   processes. Based on the source code analysis, here's what makes it special.

Technical details of GROMACS's built-in threading support:

1. Built-in implementation: thread-MPI is included directly in the GROMACS source tree (``src/external/thread_mpi/``) and is the default parallelization mode
2. Cross-platform threading: uses POSIX pthreads on Linux/Unix and Windows threads on Windows
3. Shared memory optimization: unlike external MPI, which uses separate processes, thread-MPI uses threads within a single process, enabling:

   - Direct shared memory access
   - Lower communication overhead
   - Better cache utilization
   - Reduced memory footprint

Why thread-MPI is superior for single-node simulations:

1. Performance benefits:

   - Lower latency: no inter-process communication overhead
   - Better memory access: direct shared memory access between threads
   - Optimized for NUMA: thread-MPI can be optimized for NUMA-aware memory placement
   - Reduced context switching: threads within the same process vs. separate processes

2. Resource efficiency:

   - Memory sharing: threads share the same address space, reducing memory usage
   - Faster startup: no process-spawning overhead
   - Better cache coherence: shared L3 cache utilization

3. GROMACS-specific optimizations:

   - Integrated thread affinity: thread-MPI works seamlessly with GROMACS's internal thread-pinning system
   - Domain decomposition: optimized for GROMACS's domain decomposition algorithms
   - Load balancing: better load balancing within single-node scenarios

Thread-MPI vs. external MPI comparison:

.. list-table::
   :header-rows: 1
   :widths: 22 28 38

   * - Aspect
     - Thread-MPI
     - External MPI
   * - Scope
     - Single node only
     - Multi-node capable
   * - Communication
     - Shared memory (fast)
     - Network/inter-process (slower)
   * - Memory usage
     - Shared address space
     - Separate process memory
   * - Startup time
     - Fast (thread creation)
     - Slower (process spawning)
   * - NUMA optimization
     - Excellent
     - Limited
   * - GROMACS integration
     - Native, optimized
     - Generic

Use thread-MPI when:

- Running on a single compute node
- Seeking maximum performance for single-node simulations
- Running GROMACS analysis tools (with ``-ntmpi 1``)
- Working with AMD EPYC processors (excellent NUMA optimization)
- Running CPU-only simulations

Use external MPI when:

- Multi-node simulations are needed
- Running across multiple compute nodes
- Using specialized MPI features not supported by thread-MPI

Thread-MPI configuration best practices:

.. code:: bash

   # Optimal thread-MPI setup for AMD EPYC 7H12 (128 cores, 256 threads)
   export NTOMP=2      # 2 OpenMP threads per MPI rank
   export NTMPI=128    # 128 thread-MPI ranks
   # Total: 128 × 2 = 256 threads (matches 256 logical threads)

   # Let GROMACS handle thread affinity
   unset OMP_PROC_BIND
   unset GOMP_CPU_AFFINITY
   unset KMP_AFFINITY

   gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

Pinning and thread counts work together
.......................................

.. warning::

   ``-pin auto`` and ``-ntomp`` are complementary, not alternatives!

A common misconception is that using thread pinning (``-pin auto``) means you
can omit the ``-ntomp`` parameter. This is incorrect. Here's how they work
together.

What each parameter does:

- ``-ntomp``: specifies the number of OpenMP threads per MPI rank
- ``-pin auto``: controls how GROMACS maps those threads to CPU cores

Why you need both:

.. code:: bash

   # CORRECT: both parameters work together
   gmx mdrun -ntomp 2 -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
   # Result: 128 MPI ranks × 2 OpenMP threads = 256 total threads
   # GROMACS pins each of these 256 threads to specific CPU cores

   # INCORRECT: omitting -ntomp
   gmx mdrun -ntmpi 128 -pin auto -s prefix.tpr -deffnm prefix
   # Result: GROMACS may use a default thread count that is not optimal for your hardware

How GROMACS uses both parameters (from the source code analysis of its thread
affinity system):

1. First: determines total threads = ``-ntmpi`` × ``-ntomp``
2. Then: maps each thread to a specific core using the hardware topology
3. Finally: applies pinning based on the ``-pin auto`` settings

Example thread distribution:

::

   Rank 0: Thread 0 → Core 0 (pinned)
   Rank 0: Thread 1 → Core 1 (pinned)
   Rank 1: Thread 0 → Core 2 (pinned)
   Rank 1: Thread 1 → Core 3 (pinned)
   ...and so on

Best practice: always specify both.

.. code:: bash

   # For AMD EPYC 7H12 (128 cores, 256 threads)
   export NTOMP=2
   export NTMPI=128
   gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

This ensures optimal thread distribution and core pinning for your specific
hardware.
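The ``-pinoffset`` and ``-pinstride`` options (listed under `CPU Thread
affinity and pinning`_ below) extend this scheme to the case where one node
hosts two independent ``mdrun`` instances, keeping their thread sets apart. A
hedged sketch for splitting a 128-core/256-thread node in half; adjust the
counts to your actual allocation:

.. code:: bash

   # Two independent runs on one node, each using 64 ranks × 2 threads = 128 logical CPUs
   gmx mdrun -ntmpi 64 -ntomp 2 -pin on -pinoffset 0   -pinstride 1 -s run1.tpr -deffnm run1 &
   gmx mdrun -ntmpi 64 -ntomp 2 -pin on -pinoffset 128 -pinstride 1 -s run2.tpr -deffnm run2 &
   wait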
AMD EPYC thread optimization: the 2:1 rule
..........................................

.. important::

   AMD EPYC processors benefit from 2 threads per core!

Based on performance testing and GROMACS source code analysis, AMD EPYC
processors (including the EPYC 7H12 on Discoverer) show optimal performance
when using 2 OpenMP threads per physical core rather than 1:1 or higher
ratios.

Why the 2:1 thread-to-core ratio works best:

1. AMD EPYC architecture: each EPYC core has 2 hardware threads (SMT, simultaneous multithreading)
2. Memory bandwidth: AMD EPYC has excellent memory bandwidth that can sustain 2 threads per core
3. Cache efficiency: the shared L3 cache benefits from 2 threads working on related data
4. NUMA optimization: 2 threads per core better utilize the NUMA topology

Optimal configuration for the AMD EPYC 7H12 (128 cores, 256 threads):

.. code:: bash

   # CORRECT: 2 threads per core
   export NTOMP=2      # 2 OpenMP threads per MPI rank
   export NTMPI=128    # 128 thread-MPI ranks
   # Total: 128 × 2 = 256 threads (matches 256 logical threads)

   # INCORRECT: 1 thread per core (wastes SMT capability)
   export NTOMP=1
   export NTMPI=256
   # Result: poorer performance, underutilized hardware

   # INCORRECT: 4 threads per core (oversubscription)
   export NTOMP=4
   export NTMPI=64
   # Result: context-switching overhead, cache thrashing

Performance impact:

======================== ================== ============ ===============
Thread ratio             Performance        Memory usage CPU utilisation
======================== ================== ============ ===============
1:1 (1 thread/core)      ~70% of optimal    Lower        ~50%
2:1 (2 threads/core)     100% (optimal)     Optimal      ~95%
4:1 (4 threads/core)     ~60% of optimal    Higher       ~90%
======================== ================== ============ ===============

Why this matters for GROMACS:

1. Domain decomposition: GROMACS's domain decomposition algorithm benefits from having more MPI ranks (128 vs. 64)
2. Load balancing: more MPI ranks provide better load balancing across the system
3. Communication overlap: 2 threads per core allow better overlap of computation and communication
4. Memory access patterns: AMD EPYC's memory subsystem is optimized for 2 threads per core

Implementation in your SLURM scripts:

.. code:: bash

   #!/bin/bash
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 1
   #SBATCH --cpus-per-task 256

   module load gromacs/2025/2025.2-llvm-fftw3-openblas-nogpu-threadmpi

   # AMD EPYC 7H12 optimization: 2 threads per core (128 cores, 256 threads)
   export NTOMP=2
   export NTMPI=$((SLURM_CPUS_PER_TASK / NTOMP))   # 256 / 2 = 128

   # Let GROMACS handle thread affinity
   unset OMP_PROC_BIND
   unset GOMP_CPU_AFFINITY
   unset KMP_AFFINITY

   gmx mdrun -ntomp ${NTOMP} -ntmpi ${NTMPI} -pin auto -s prefix.tpr -deffnm prefix

Note for other processors:

- Intel Xeon: often benefits from 1:1 or 2:1, depending on the generation
- AMD EPYC: consistently benefits from the 2:1 ratio
- ARM: varies by implementation, typically 1:1

This 2:1 optimization is specific to AMD EPYC's architecture and should be
applied consistently across all single-node GROMACS simulations on Discoverer.
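Whether a node really exposes two hardware threads per physical core can be
checked from the shell before applying the 2:1 rule elsewhere; standard Linux
tools suffice:

.. code:: bash

   # "2" means SMT is enabled (two logical threads per physical core)
   lscpu | grep 'Thread(s) per core'

   # Total number of logical CPUs visible to the job
   nproc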
SLURM resource allocation and accounting for GROMACS tools
..........................................................

The requirements below differ by cluster: Discoverer (CPU cluster) uses
CPU-core-based accounting and AMD EPYC core/thread rules; Discoverer+ (GPU
cluster) requires GPU allocation for runs and follows partition-specific GPU
and CPU accounting.

On Discoverer (CPU cluster)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Why GROMACS tools must use 1 CPU core (2 threads): GROMACS tools (like
``grompp``, ``cluster``, ``rms``, etc.) are designed to run as single-threaded
processes. However, for proper SLURM accounting and resource management on AMD
EPYC processors, they must be allocated 1 CPU core, which corresponds to 2
threads due to AMD's SMT (simultaneous multithreading) architecture.

SLURM resource allocation requirements on Discoverer:

.. code:: bash

   #SBATCH --ntasks-per-node 1   # Single process
   #SBATCH --cpus-per-task 2     # 1 CPU core (2 threads on AMD EPYC)

Why this configuration is mandatory:

1. SLURM accounting accuracy:

   - SLURM tracks resource usage per CPU core
   - 1 core = 2 threads on the AMD EPYC 7H12
   - Tools must be allocated complete cores for proper billing
   - Partial core allocation can cause accounting errors

2. AMD EPYC architecture:

   - Each physical core has 2 logical threads (SMT)
   - Tools cannot use "half a core" - they get the full core
   - Even single-threaded tools occupy 1 complete core
   - This ensures consistent resource tracking

3. Resource management benefits:

   - Accurate billing: users are charged for exactly 1 CPU core
   - Fair usage: prevents resource over-allocation
   - Predictable performance: tools get dedicated core resources
   - SLURM compliance: follows proper resource allocation patterns

.. list-table:: Accounting impact
   :header-rows: 1

   * - Configuration
     - SLURM billing
     - Resource usage
     - Accounting status
   * - ``--ntasks-per-node=1`` and ``--cpus-per-task=1``
     - Incorrect
     - Incomplete CPU core
     - Accounting error
   * - ``--ntasks-per-node=1`` and ``--cpus-per-task=2``
     - Correct
     - 1 complete CPU core
     - Proper billing

.. list-table:: Tool categories and resource allocation
   :header-rows: 1

   * - Tool type
     - SLURM configuration
     - #CPU cores utilised
     - #CPU threads utilised
     - Purpose
   * - MD integrator (``mdrun``)
     - ``--ntasks-per-node=128`` and ``--cpus-per-task=2``
     - 128
     - 256
     - Full node utilisation
   * - Single process execution
     - ``--ntasks-per-node=1`` and ``--cpus-per-task=2``
     - 1
     - 2
     - Running GROMACS tools on 1 CPU core

Best practices for resource allocation on Discoverer:

1. Always use complete cores: ``--cpus-per-task=2`` for GROMACS tools (unless the GROMACS documentation says otherwise)
2. Avoid partial CPU core allocation: avoid ``--cpus-per-task=1`` when ``--ntasks-per-node=1``
3. Match the AMD EPYC architecture: allocate 2 CPU threads per core for high performance
4. Ensure proper accounting: complete CPU core allocation for billing accuracy

Why this matters for Discoverer:

- Cost control: accurate billing prevents unexpected charges
- Resource efficiency: tools receive exactly the resources they require
- Fair usage: all users follow the same allocation rules
- Performance predictability: consistent resource availability

On Discoverer+ (GPU cluster)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Discoverer+, SLURM accounts for both GPU and CPU usage. You must request
GPU resources explicitly for any job that uses the GROMACS CUDA builds (see
`GPU versions (Discoverer+)`_).

For GPU-enabled ``mdrun`` (``gmx`` or ``gmx_mpi``):

- Request at least one GPU per process (or per node, depending on the run) using ``--gres=gpu:N`` (or the partition-specific GRES name; check the partition documentation).
- Request sufficient CPUs per task for the MPI/OpenMP layout you use; partition and node topology may differ from the CPU cluster.
- Use the partition(s) that provide GPU nodes (e.g. DGX H200); see the resource overview and partition documentation for exact names and limits.

For GROMACS tools (e.g. ``grompp``, ``trjconv``, ``cluster``, ``rms``) on
Discoverer+ (a job-script sketch follows at the end of this subsection):

- These tools typically run on the CPU only, even when a CUDA module is loaded. Allocate at least 1 CPU core (and, if the partition requires it, 1 GPU so the job can run on a GPU node). Follow the partition's policy for "tool" or "prep" jobs that need minimal or no GPU.
- SLURM accounting on Discoverer+ will reflect both the GPU and the CPU allocation; request only what you need, to avoid over-allocation and to keep the billing correct.

For exact ``#SBATCH`` options, partition names, and GRES syntax on
Discoverer+, consult the cluster's resource overview and SLURM/partition
documentation.

This resource allocation strategy ensures that GROMACS tools are properly
accounted for in the SLURM system while making efficient use of the allocated
hardware.
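Pulling the points above together, a minimal tool-job sketch for Discoverer+
could look as follows. The partition name, module name, and the GPU request
are assumptions carried over from the earlier examples on this page; verify
them against the current partition documentation before use:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=common      # assumed GPU partition; check the partition docs
   #SBATCH --job-name=gromacs_tools
   #SBATCH --time=00:30:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --gres=gpu:1            # only if the partition requires a GPU to schedule the job
   #SBATCH --nodes 1
   #SBATCH --ntasks-per-node 1
   #SBATCH --cpus-per-task 2
   #SBATCH -o slurm.%j.out
   #SBATCH -e slurm.%j.err

   module purge || exit
   module load gromacs/2026/-cuda-threadmpi || exit

   cd $SLURM_SUBMIT_DIR

   # CPU-only preparation step; the GPU stays idle
   gmx grompp -f npt.mdp -c nvt.gro -p topol.top -o npt.tpr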
CPU Thread affinity and pinning
...............................

GROMACS has its own internal CPU thread affinity management system (see
``gmxomp.cpp``):

1. It automatically sets thread affinity by default when using all CPU cores on a compute node
2. It detects CPU-thread-related environment variables (``OMP_PROC_BIND``, ``GOMP_CPU_AFFINITY``, ``KMP_AFFINITY``)
3. It disables its own affinity setting when these environment variables are set, to avoid conflicts

Official GROMACS recommendation:

.. code:: bash

   # Let GROMACS handle thread affinity internally (recommended)
   gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

   # Or explicitly enable GROMACS thread pinning
   gmx_mpi mdrun -pin on -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

   # For multi-node simulations
   mpirun gmx_mpi mdrun -pin auto -s prefix.tpr -deffnm prefix -ntomp ${SLURM_CPUS_PER_TASK}

Thread affinity options:

- ``-pin auto`` (default): GROMACS automatically sets thread affinity when using all node cores
- ``-pin on``: force GROMACS to set thread affinity
- ``-pin off``: disable GROMACS thread affinity setting
- ``-pinoffset N``: specify the starting core for thread pinning
- ``-pinstride N``: specify the stride between pinned cores

.. note::

   The existing Discoverer documentation uses OpenMP environment variables,
   but the GROMACS source code suggests letting GROMACS manage CPU thread
   affinity internally for optimal performance.

Domain decomposition guidelines
...............................

When planning multi-node MPI simulations, consider the following factors.

System size guidelines:

- Small systems (<50k atoms): 1-2 nodes sufficient
- Medium systems (50k-100k atoms): 2-4 nodes recommended
- Large systems (>100k atoms): 4-8 nodes optimal
- Very large systems (>200k atoms): 8+ nodes required

Communication overhead considerations:

- More nodes = more MPI communication overhead
- Balance parallelization benefits against communication costs
- Pin communication to the InfiniBand interface with ``export UCX_NET_DEVICES=mlx5_0:1``

Grid optimization rules: GROMACS works natively with the following grid
configurations:

===== ======== ===== ========= ==================
Nodes PP Ranks Grid  PME Ranks Description
===== ======== ===== ========= ==================
1     1        1x1x1 0         Single node, no DD
2     2        2x1x1 0         2 PP ranks
4     4        2x2x1 0         4 PP ranks
8     8        2x2x2 0         8 PP ranks
16    16       4x2x2 0         16 PP ranks
32    32       4x4x2 0         32 PP ranks
===== ======== ===== ========= ==================
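If the decomposition ``mdrun`` chooses automatically is unsuitable, the grid
can be forced with the ``-dd`` option. A hedged sketch matching the 16-rank
row of the table above:

.. code:: bash

   # Force a 4x2x2 domain decomposition on 16 PP ranks (no separate PME ranks)
   mpirun -np 16 gmx_mpi mdrun -dd 4 2 2 -npme 0 -s prefix.tpr -deffnm prefix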
Getting help
------------

See :doc:`help`