AmberTools (CPU)¶
About¶
According to the AmberTools website, AmberTools is a comprehensive suite of biomolecular simulation tools that works alongside the AMBER molecular dynamics package. It provides a collection of programs for setting up, running, and analysing molecular dynamics simulations, with a focus on biomolecular systems such as proteins, nucleic acids, and small molecules.
AmberTools is freely available and open-source, providing extensive functionality for preparing simulations, analysing trajectories, and performing computational chemistry calculations. Unlike AMBER’s pmemd.MPI, AmberTools has no licensing restrictions and can be used by both academic and commercial users on Discoverer CPU cluster.
This document describes running AmberTools on Discoverer CPU cluster.
Documentation about how to use AmberTools is available here: https://ambermd.org/Manuals.php
Note
If you want to run pmemd.MPI on Discoverer CPU cluster, see pmemd.MPI (CPU).
Versions available¶
Currently we support the following versions of AmberTools:
- 24
- 25
To check which AmberTools versions are currently supported on Discoverer, execute on the login node:
module avail ambertools
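The output lists the available builds. To make a particular build available in your shell session, load its module and check that the executables are on your PATH. A brief example follows (the module name matches the one used in the job scripts later in this document; adjust it to one of the listed versions):

module avail ambertools          # list the available AmberTools builds
module load ambertools/24/24.0   # load one of the listed builds
which sander cpptraj tleap       # verify that the executables are now in PATH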
AmberTools programs¶
AmberTools includes a wide variety of tools beyond the sander molecular dynamics engines (described in Running the tools below). Some of the most important ones are:
Structure preparation¶
- tleap (tLEaP): Text-based LEaP for molecular structure preparation, topology generation, and system setup. Build and modify molecular structures, assign force field parameters, solvate systems (water, ions), and create topology and coordinate files from the command line (see the example below).
- xleap (xLEaP): Graphical LEaP (X11-based) for molecular structure preparation. Same functionality as tleap with a graphical user interface; requires an X11 display and is useful for interactive structure building.
- parmed (ParmEd): Parameter file editor and molecular structure manipulation. Edit topology files, add/remove atoms, bonds, angles, and dihedrals, modify force field parameters, and combine multiple structures.
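As an illustration of a typical tleap workflow, the sketch below builds a solvated system from a prepared PDB file. The file names (protein.pdb, complex.prmtop, complex.inpcrd), the chosen force fields, and the 10 Å water buffer are placeholders; adapt them to your own system:

# Write a minimal tLEaP input file and run it (a sketch; file names are placeholders)
cat > tleap.in << 'EOF'
source leaprc.protein.ff14SB
source leaprc.water.tip3p
mol = loadpdb protein.pdb
solvatebox mol TIP3PBOX 10.0
addions mol Na+ 0
saveamberparm mol complex.prmtop complex.inpcrd
quit
EOF
tleap -f tleap.in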
Simulation analysis¶
- cpptraj: Powerful trajectory analysis tool (formerly ptraj, serial version). Analyses trajectories from multiple MD engines (AMBER, GROMACS, CHARMM, NAMD), calculates geometric properties (distances, angles, RMSD), and provides hydrogen bond analysis, secondary structure analysis, clustering, principal component analysis, and extensive scripting capabilities (see the example below).
- cpptraj.MPI: Parallel MPI version of cpptraj for multi-node trajectory analysis. Distributes analysis across multiple compute nodes; suitable for large trajectories or computationally intensive analyses.
- cpptraj.OMP: OpenMP parallel version of cpptraj for shared-memory trajectory analysis. Uses threading for parallelisation within a single node; suitable for multi-core workstations.
- process_mdout.perl: Extract and analyse energy data from AMBER MD output files.
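For example, a minimal serial cpptraj run computing a CA RMSD and a hydrogen bond analysis could look like the sketch below (topology, trajectory, and output file names are placeholders):

# Minimal cpptraj input (a sketch; adjust file names and atom masks to your system)
cat > analysis.in << 'EOF'
parm complex.prmtop
trajin trajectory.nc
rms first @CA out rmsd_ca.dat
hbond out hbond.dat avgout hbond_avg.dat
run
EOF
cpptraj -i analysis.in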
Binding free energy calculations¶
- MMPBSA.py: MM-PBSA and MM-GBSA binding free energy calculations (serial version). Calculates binding free energies using implicit solvent models, decomposes binding energies by residue, performs per-residue and per-atom energy decompositions, and supports multiple MD engines (see the example input file below).
- MMPBSA.py.MPI: Parallel MPI version of MMPBSA.py. Distributes MM-PBSA/MM-GBSA calculations across multiple compute nodes; suitable for large systems or multiple-trajectory analysis.
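The MMPBSA.py input file referenced by the job scripts later in this document (mmpbsa.in) is a small namelist-style file. A minimal sketch requesting both a GB and a PB calculation could look as follows (frame range and salt concentration are placeholder values):

cat > mmpbsa.in << 'EOF'
&general
  startframe=1, endframe=100, interval=1, verbose=1,
/
&gb
  igb=5, saltcon=0.150,
/
&pb
  istrng=0.150,
/
EOF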
Quantum mechanics / molecular mechanics¶
- sqm: Semi-empirical quantum mechanics program (AM1, PM3, AM1-D, PM3-D methods, serial version). Geometry optimisations, energy and force calculations.
- sqm.MPI: Parallel MPI version of sqm. Distributes QM calculations across multiple compute nodes; suitable for larger QM regions or multiple QM calculations.
- mdgx: Molecular dynamics geometry and topology exchange tool (serial version). Generates geometry and topology files and converts between different formats.
- mdgx.MPI: Parallel MPI version of mdgx. Distributes processing across multiple compute nodes.
- mdgx.OMP: OpenMP parallel version of mdgx. Uses threading for parallelisation within a single node.
Utility programs¶
- antechamber: Automatic atom type assignment and parameter generation for small molecules (serial version). Generates GAFF parameters for organic molecules, creates parameter files for new compounds, and interfaces with quantum chemistry programs (see the example workflow below).
- parmchk2: Check and generate Amber parameter files for molecules processed by antechamber. Validates GAFF parameters and generates missing parameters.
- reduce: Add missing hydrogens to PDB structures. Places hydrogens at optimal positions and handles protonation states.
- pdb4amber: Prepare PDB files for Amber simulations. Removes non-standard residues, fixes common PDB format issues, and prepares structures for LEaP.
- packmol: Pack molecules into defined regions (solvation, membrane insertion). Solvates systems with water, inserts molecules into membranes, and generates mixed-solvent systems.
- packmol-memgen: Generate membrane configurations using packmol.
- ambpdb: Convert Amber topology/coordinate files to PDB format. Extracts coordinates from trajectory files and converts topology files to PDB.
- ambmask: Manipulate Amber mask expressions. Tests and validates mask syntax; useful for advanced Amber scripting.
- quick: Quantum chemistry (HF/DFT) calculations with QM/MM support (serial version). QM/MM calculations and geometry optimisations.
- quick.MPI: Parallel MPI version of quick for QM/MM calculations.
- gem.pmemd: Generalized Ensemble Methods (GEM) for enhanced sampling (serial version). Temperature replica exchange and Hamiltonian replica exchange.
- gem.pmemd.MPI: Parallel MPI version of gem.pmemd for multi-node GEM simulations.
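A common small-molecule parameterisation workflow combines antechamber and parmchk2, for example as in this sketch (ligand.mol2 and the net charge given with -nc are placeholders for your own ligand):

# Assign GAFF2 atom types and AM1-BCC charges, then generate any missing parameters
antechamber -i ligand.mol2 -fi mol2 -o ligand_gaff.mol2 -fo mol2 -c bcc -nc 0 -at gaff2
parmchk2 -i ligand_gaff.mol2 -f mol2 -o ligand.frcmod -s gaff2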
Additional analysis tools¶
- pbsa: Poisson-Boltzmann surface area calculations. Solvation free energies and electrostatic calculations.
- gbnsr6: Generalized Born (GB) calculations using the GB-Neck2 model. Implicit solvent calculations and solvation free energies.
- simplepbsa: Simplified PB calculations (serial version). Fast PB approximations and binding energy calculations.
- simplepbsa.MPI: Parallel MPI version of simplepbsa.
- rism1d: One-dimensional reference interaction site model. Solvation structure analysis and thermodynamic properties.
- rism3d.snglpnt: Three-dimensional RISM (serial version). 3D solvation structure and site-site correlation functions.
- rism3d.snglpnt.MPI: Parallel MPI version of rism3d.snglpnt.
- saxs_md: Small-angle X-ray scattering analysis from MD trajectories (serial version). Calculates SAXS profiles and compares them with experimental data.
- saxs_md.OMP: OpenMP parallel version of saxs_md.
- saxs_rism: SAXS from RISM calculations (serial version). Combines RISM and SAXS analysis.
- saxs_rism.OMP: OpenMP parallel version of saxs_rism.
- nmode: Normal mode analysis. Vibrational frequencies and entropy calculations.
- mmpbsa_py_energy: Extract energy components from MMPBSA calculations.
- mmpbsa_py_nabnmode: NAB-based normal mode calculations for MMPBSA.
Enhanced sampling and free energy methods¶
- ndfes: Neural network-based free energy surfaces (serial version). Enhanced sampling analysis and free energy calculations.
- ndfes.OMP: OpenMP parallel version of ndfes.
- ndfes-path: Path-based analysis for ndfes calculations.
- ndfes-path.OMP: OpenMP parallel version of ndfes-path.
- ndfes-AvgFESs.py: Average free energy surfaces from multiple simulations.
- ndfes-CheckEquil.py: Check equilibrium in enhanced sampling simulations.
- ndfes-CombineMetafiles.py: Combine metadynamics files.
- ndfes-PrepareAmberData.py: Prepare Amber data for ndfes analysis.
- ndfes-PrintFES.py: Print free energy surfaces.
- ndfes-path-analyzesims.py: Analyse path simulations.
- ndfes-path-prepguess.py: Prepare initial guesses for path calculations.
- edgembar: Energy decomposition group method BAR (serial version). Free energy decomposition and binding energy analysis.
- edgembar.OMP: OpenMP parallel version of edgembar.
- edgembar-WriteGraphHtml.py: Generate HTML graphs for edgembar results.
- edgembar-amber2dats.py: Convert Amber data for edgembar.
- edgembar-bookend2dats.py: Convert bookend data for edgembar.
Parameter fitting and optimisation¶
- paramfit: Parameter fitting for force field development (serial version). Optimises force field parameters and fits them to quantum chemistry data.
- paramfit.OMP: OpenMP parallel version of paramfit.
- resp: Restrained Electrostatic Potential (RESP) fitting. Generates atomic charges from quantum chemistry via ESP fitting.
- respgen: Generate RESP input files.
- parmcal: Parameter calculation utilities.
Python analysis and utility tools¶
- MCPB.py: Metal Center Parameter Builder. Generates parameters for metal-containing systems and fits metal-ligand interactions.
- CartHess2FC.py: Convert Cartesian Hessian to force constants.
- IPMach.py: Ion parameterisation machine learning.
- OptC4.py: Optimise C4 parameters.
- PdbSearcher.py: Search PDB structures.
- ProScrs.py: Protein scoring utilities.
- bar_pbsa.py: BAR method for PBSA calculations.
- py_resp.py: Python interface to RESP calculations.
- pype-resp.py: Enhanced Python RESP interface.
- pyresp_gen.py: Generate RESP input files.
- ceinutil.py, cpinutil.py, cpeinutil.py: Constant pH utilities for constant pH MD setup and pH-dependent calculations.
- cestats, cphstats: Constant pH statistics.
- finddgref.py: Find reference free energy values.
- fitpkaeo.py: Fit pKa values.
- genremdinputs.py: Generate replica exchange MD input files.
- mdout_analyzer.py: Analyse MD output files.
- mdout2pymbar.pl: Convert MD output to PyMBAR format.
- metalpdb2mol2.py: Convert metal-containing PDB to MOL2 format.
- mol2rtf.py: Convert MOL2 to RTF format.
- charmmlipid2amber.py: Convert CHARMM lipid parameters to Amber format.
- amb2chm_par.py, amb2chm_psf_crd.py: Convert Amber to CHARMM formats.
- amb2gro_top_gro.py: Convert Amber to GROMACS formats.
- car_to_files.py: Convert Cartesian coordinate files.
Specialised utilities¶
- AddToBox, ChBox: Manipulate simulation boxes.
- PropPDB: PDB property calculations.
- UnitCell: Unit cell manipulation.
- XrayPrep: Prepare structures for X-ray refinement.
- add_pdb, add_xray: Add structures from PDB or X-ray data.
- process_minout.perl: Process minimisation output.
- process_mdout.perl: Process MD output (already mentioned above).
- teLeap: Terminal-based LEaP (alternative interface).
- xaLeap: X11-based LEaP (alternative interface to xleap).
- ucpp: Utility for processing Amber files.
- test-api, test-api.MPI: API testing tools.
Note: This is not an exhaustive list. AmberTools includes many more specialised tools and utilities. For a complete list of available tools, see the AmberTools documentation or check the bin directory of your installation.
Our AmberTools builds use Open MPI as the MPI library.
Features:
- Supports multi-node simulations
- Uses Open MPI for inter-node communication
- Compatible with SLURM multi-node job submission
- Can handle larger systems across multiple nodes
- Integrated with PLUMED2 for enhanced sampling methods
- Uses LLVM.org OpenMP runtime for optimal threading performance
- GUI support enabled (leap graphics, etc.)
Executable names: sander.MPI, sander, cpptraj, tleap, parmed, antechamber, sqm, MMPBSA.py, and many others.
For more details see Multi-node run using MPI.
Important
Users are welcome to bring or compile their own builds of AmberTools and use them, but those builds will not be supported by the Discoverer HPC team.
Build recipes, build logs, and build documentation for the AmberTools builds provided on Discoverer are available at the AmberTools build repository.
Running the tools¶
Running data analysis or simulations means invoking AmberTools executables (such as sander.MPI, sander, or cpptraj) to prepare systems, run simulations, or analyse trajectories.
Warning
You MUST NOT execute simulations or data analysis tools directly on the login node (login.discoverer.bg). You have to run your simulations and analyses as SLURM jobs only.
Warning
Write your trajectories, data files, and analysis results only inside your Per-project scratch and storage folder, and DO NOT use your Home folder (/home/username) for that purpose under any circumstances!
Common AmberTools executables:
Sander molecular dynamics engines:
- sander: Serial version of sander for small systems or testing (single-threaded)
- sander.MPI: Parallel MPI version of sander for multi-node molecular dynamics simulations across distributed-memory systems
- sander.OMP: OpenMP parallel version of sander for shared-memory parallelisation using threading (single-node, multi-core)
- sander.LES: Locally Enhanced Sampling (LES) version of sander. LES is an enhanced sampling method that allows selected atoms (e.g., side chains or ligands) to be represented by multiple copies, enabling more efficient conformational sampling. This version is serial (single-threaded)
- sander.LES.MPI: LES version with MPI parallelisation, combining Locally Enhanced Sampling with multi-node distributed-memory parallelisation
Other tools:
- cpptraj: Trajectory analysis tool (serial version)
- cpptraj.MPI: Parallel MPI version of cpptraj for multi-node trajectory analysis
- cpptraj.OMP: OpenMP parallel version of cpptraj for shared-memory trajectory analysis
- tleap: Text-based LEaP for structure preparation and topology generation
- xleap: Graphical LEaP (X11-based) for structure preparation and topology generation
- parmed: Parameter file editor
- antechamber: Automatic atom type assignment for small molecules
- sqm: Semi-empirical quantum mechanics program (serial version)
- sqm.MPI: Parallel MPI version of sqm
- mdgx: Molecular dynamics geometry and topology exchange tool (serial version)
- mdgx.MPI: Parallel MPI version of mdgx
- mdgx.OMP: OpenMP parallel version of mdgx
Python tools:
- Serial Python tools: MMPBSA.py - MM-PBSA and MM-GBSA binding free energy calculations (serial version, single-threaded)
- MPI-parallel Python tools: MMPBSA.py.MPI - Parallel MPI version of MMPBSA.py for multi-node binding free energy calculations
Note
Most Python tools in AmberTools (e.g., ante-MMPBSA.py, MCPB.py, etc.) are serial and run on a single CPU core. Only MMPBSA.py.MPI supports MPI parallelisation.
Multi-node run using MPI¶
Note
The SLURM script displayed below applies only to those AmberTools tools that support MPI parallelisation:
- Fortran/C++ MPI tools: sander.MPI, sander.LES.MPI, cpptraj.MPI, sqm.MPI, mdgx.MPI
- Python MPI tools: MMPBSA.py.MPI (see MPI-parallel Python tools)
For detailed guidelines on optimal resource allocation (number of nodes, tasks per node, memory requirements, etc.) based on system size, see Resource allocation guidelines.
This script is used for multi-node MPI runs, but you can use it on a single node as well (by setting --nodes=1 and --ntasks-per-node=N where N is the number of MPI ranks):
#!/bin/bash
#
#SBATCH --partition=cn           # Partition (you may need to change this)
#SBATCH --job-name=sander_mpi    # Job name
#SBATCH --time=512:00:00         # WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes=2                # Number of nodes
#SBATCH --ntasks-per-node=64     # Number of MPI tasks to run upon each node
#SBATCH --ntasks-per-socket=32   # Number of tasks per NUMA-bound socket
#SBATCH --cpus-per-task=1        # Number of OpenMP threads per MPI rank (recommended: 1 for pure MPI)
#SBATCH --ntasks-per-core=1      # Each MPI rank is bound to a CPU core
#SBATCH --mem=251G               # Do not exceed this on Discoverer CPU cluster
#SBATCH -o slurm.%j.out          # STDOUT
#SBATCH -e slurm.%j.err          # STDERR

# Load required modules
module purge || exit
module load ambertools/24/24.0 || exit

# Set OpenMP environment variables (if needed)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close       # Bind threads close to parent MPI process
export OMP_PLACES=cores          # Place threads on cores

# Optimise InfiniBand communication (if available)
export UCX_NET_DEVICES=mlx5_0:1

# Change to submission directory
cd ${SLURM_SUBMIT_DIR}

# Run MPI-parallel AmberTools tool
# Examples:
#   - For Fortran/C++ tools: sander.MPI, cpptraj.MPI, sqm.MPI, mdgx.MPI
#   - For Python tools: MMPBSA.py.MPI
#
# OpenMPI options:
#   --map-by socket:PE=${OMP_NUM_THREADS} binds MPI processes to sockets
#     with PE (Processing Element) threads per MPI rank
#   --bind-to core binds each MPI rank to a CPU core
#   --report-bindings shows CPU binding (useful for debugging)

# Example for Fortran/C++ MPI tool (sander.MPI):
mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
       --bind-to core \
       --report-bindings \
       sander.MPI -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0

# Example for Python MPI tool (MMPBSA.py.MPI):
# mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
#        --bind-to core \
#        --report-bindings \
#        MMPBSA.py.MPI -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc
In the script above, edit the parameters and resources required for successfully running and completing the job. For detailed guidelines on optimal resource allocation, see Resource allocation guidelines.
- SLURM partition of compute nodes (--partition): Specifies which group of nodes (partition) to use. For AmberTools on Discoverer, use the cn partition, which contains the CPU-optimised nodes.
- Job name (--job-name): A descriptive name for your job that will appear in the queue. Use meaningful names like sander_protein_sim or sander_membrane_run.
- Wall time (--time): Maximum time your job can run. The format is HH:MM:SS (e.g., 48:00:00 for 48 hours). Set this based on your simulation size and expected runtime.
- Number of compute nodes (--nodes): How many physical nodes to allocate. For multi-node AmberTools simulations, this determines the total computational power available. See Resource allocation guidelines for recommendations based on system size.
- Number of MPI processes per node (--ntasks-per-node): Critical for AmberTools performance. On Discoverer, with 8 NUMA domains per node, using 64 MPI tasks per node gives 8 tasks per NUMA domain for optimal memory locality. See Resource allocation guidelines for recommended values based on system size.
- Number of MPI tasks per socket (--ntasks-per-socket): Essential for NUMA-aware performance. Set it to 32 to place exactly 32 MPI tasks on each socket (64 tasks per node ÷ 2 sockets per node = 32 per socket). This ensures optimal memory access patterns and cache utilisation within each NUMA boundary.
- Number of OpenMP threads per MPI process (--cpus-per-task): Controls hybrid parallelism. The recommended value is 1 (pure MPI mode), since OpenMP usage in sander.MPI is limited. For information on how this affects CPU thread affinity and pinning, see CPU thread affinity and pinning.
- AmberTools version (module load): Choose the appropriate version based on your simulation requirements. See Versions available for available builds and their characteristics.
Save the complete SLURM job description as a file, for example /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_mpi.sh, and submit it to the queue:
cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
sbatch sander_mpi.sh
Upon successful submission, the standard output will be directed by SLURM into the file /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out (where %j stands for the SLURM job ID), while the standard error output will be stored in /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err.
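After submission you can monitor the job with the standard SLURM client tools, for example:

squeue -u $USER                                       # list your queued and running jobs
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS  # accounting info after the job finishes
tail -f slurm.<jobid>.out                             # follow the standard output of a running job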
Single-threaded execution¶
Note
The SLURM script displayed below applies only to those AmberTools tools that run on a single CPU core using one thread:
- sander: Molecular dynamics simulations
- cpptraj: Trajectory analysis
- sqm: Semi-empirical quantum mechanics program
- tleap: Structure preparation and topology generation
- parmed: Parameter/topology file editing and structure manipulation
- antechamber: Automatic atom type assignment for small molecules
- MMPBSA.py: MM-PBSA and MM-GBSA binding free energy calculations
- mdgx: Molecular dynamics geometry and topology exchange tool
#!/bin/bash
#
#SBATCH --partition=cn # Partition of compute nodes
#SBATCH --job-name=sander_single_threaded # Job name
#SBATCH --time=01:00:00 # WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes=1 # Single node
#SBATCH --ntasks=1 # Single task
#SBATCH --cpus-per-task=1 # One CPU per task
#SBATCH --mem=32G # Memory per task (increase if needed, typically 16-64G for small to medium systems)
#SBATCH -o slurm.%j.out # STDOUT
#SBATCH -e slurm.%j.err # STDERR
# Load required modules
module purge || exit
module load ambertools/24/24.0 || exit
# Change to submission directory
cd ${SLURM_SUBMIT_DIR}
# Run serial sander
sander -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0
In the script above, edit the parameters and resources required for successfully running and completing the job.
- For serial tools:
  - Use --ntasks=1 and --cpus-per-task=1 for single-threaded tools
  - Adjust --mem based on system size (typically 16-64G for small to medium systems)
  - These tools run on a single CPU core and are suitable for small systems or testing
Save the complete SLURM job description as a file, for example /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_serial.sh, and submit it to the queue:
cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
sbatch sander_serial.sh
Upon successful submission, the standard output will be directed by SLURM into the file /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out (where %j stands for the SLURM job ID), while the standard error output will be stored in /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err.
Single-node run using OpenMP¶
This script is used for OpenMP-parallel tools that use threading on a single node:
- sander.OMP: OpenMP molecular dynamics simulations
- cpptraj.OMP: OpenMP trajectory analysis
- mdgx.OMP: OpenMP geometry/topology processing
Warning
OpenMP scaling is not guaranteed! The optimal number of OpenMP threads depends on many factors including algorithm efficiency, memory bandwidth, cache usage, and problem size. Always test different thread counts to find the optimal configuration for your specific system and workload. See OpenMP scaling considerations for more details.
#!/bin/bash
#
#SBATCH --partition=cn # Partition of compute nodes
#SBATCH --job-name=sander_omp # Job name
#SBATCH --time=01:00:00 # WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes=1 # Single node
#SBATCH --ntasks=1 # Single task
#SBATCH --cpus-per-task=64 # Number of OpenMP threads (START WITH FEWER AND TEST SCALING!)
#SBATCH --mem=251G # Memory per task (do not exceed this on Discoverer CPU cluster)
#SBATCH -o slurm.%j.out # STDOUT
#SBATCH -e slurm.%j.err # STDERR
# Load required modules
module purge || exit
module load ambertools/24/24.0 || exit
# Set OpenMP environment variables
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close # Bind threads close to parent process
export OMP_PLACES=cores # Place threads on cores
# Optional: Enable OpenMP verbose output for debugging
# export OMP_DISPLAY_ENV=VERBOSE
# Change to submission directory
cd ${SLURM_SUBMIT_DIR}
# Run OpenMP sander
sander.OMP -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0
In the script above, edit the parameters and resources required for successfully running and completing the job.
- For OpenMP tools:
  - Use --ntasks=1 and --cpus-per-task=N, where N is the number of OpenMP threads (test scaling first!)
  - Set OMP_NUM_THREADS equal to --cpus-per-task
  - Start with fewer threads (8-16) and test scaling before using higher thread counts
  - Adjust --mem based on system size and number of threads
  - These tools use shared-memory threading and are suitable for single-node multi-core simulations
  - Always test scaling: run with different thread counts to find optimal performance
  - Monitor wall-clock time and CPU utilisation to identify the optimal thread count
  - See OpenMP scaling considerations for detailed guidelines on testing and optimising OpenMP performance
Save the complete SLURM job description as a file, for example /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_omp.sh, and submit it to the queue:
cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
sbatch sander_omp.sh
Upon successful submission, the standard output will be directed by SLURM into the file /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out (where %j stands for the SLURM job ID), while the standard error output will be stored in /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err.
MPI-parallel Python tools¶
For MPI-parallel Python tools such as MMPBSA.py.MPI, use the multi-node MPI script. This script is identical to the one shown in the Multi-node run using MPI section, but with the tool replaced by the MPI-parallel Python tool.
For detailed guidelines on optimal resource allocation (number of nodes, tasks per node, memory requirements, etc.) based on system size, see Resource allocation guidelines.
#!/bin/bash
#
#SBATCH --partition=cn           # Partition (you may need to change this)
#SBATCH --job-name=mmpbsa_mpi    # Job name
#SBATCH --time=512:00:00         # WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes=2                # Number of nodes
#SBATCH --ntasks-per-node=64     # Number of MPI tasks to run upon each node
#SBATCH --ntasks-per-socket=32   # Number of tasks per NUMA-bound socket
#SBATCH --cpus-per-task=1        # Number of OpenMP threads per MPI rank (recommended: 1 for pure MPI)
#SBATCH --ntasks-per-core=1      # Each MPI rank is bound to a CPU core
#SBATCH --mem=251G               # Do not exceed this on Discoverer CPU cluster
#SBATCH -o slurm.%j.out          # STDOUT
#SBATCH -e slurm.%j.err          # STDERR

# Load required modules
module purge || exit
module load ambertools/24/24.0 || exit

# Set OpenMP environment variables (if needed)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close       # Bind threads close to parent MPI process
export OMP_PLACES=cores          # Place threads on cores

# Optimise InfiniBand communication (if available)
export UCX_NET_DEVICES=mlx5_0:1

# Change to submission directory
cd ${SLURM_SUBMIT_DIR}

# Run MMPBSA.py.MPI with OpenMPI
#   --map-by socket:PE=${OMP_NUM_THREADS} binds MPI processes to sockets
#     with PE (Processing Element) threads per MPI rank
#   --bind-to core binds each MPI rank to a CPU core
#   --report-bindings shows CPU binding (useful for debugging)
mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
       --bind-to core \
       --report-bindings \
       MMPBSA.py.MPI -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc
In the script above, edit the parameters and resources required for successfully running and completing the job. For detailed guidelines on optimal resource allocation, see Resource allocation guidelines.
- SLURM partition of compute nodes (--partition): Specifies which group of nodes (partition) to use. For AmberTools on Discoverer, use the cn partition, which contains the CPU-optimised nodes.
- Job name (--job-name): A descriptive name for your job that will appear in the queue. Use meaningful names like mmpbsa_protein_binding or mmpbsa_multitrajectory.
- Wall time (--time): Maximum time your job can run. The format is HH:MM:SS (e.g., 48:00:00 for 48 hours). Set this based on your calculation size and expected runtime.
- Number of compute nodes (--nodes): How many physical nodes to allocate. For multi-node MMPBSA.py.MPI runs, this determines the total computational power available. See Resource allocation guidelines for recommendations based on system size.
- Number of MPI processes per node (--ntasks-per-node): Critical for MMPBSA.py.MPI performance. On Discoverer, with 8 NUMA domains per node, using 64 MPI tasks per node gives 8 tasks per NUMA domain for optimal memory locality. See Resource allocation guidelines for recommended values based on system size.
- Number of MPI tasks per socket (--ntasks-per-socket): Essential for NUMA-aware performance. Set it to 32 to place exactly 32 MPI tasks on each socket (64 tasks per node ÷ 2 sockets per node = 32 per socket). This ensures optimal memory access patterns and cache utilisation within each NUMA boundary.
- Number of OpenMP threads per MPI process (--cpus-per-task): Controls hybrid parallelism. The recommended value is 1 (pure MPI mode) for MMPBSA.py.MPI, since OpenMP usage in Python MPI tools is typically minimal. For information on how this affects CPU thread affinity and pinning, see CPU thread affinity and pinning.
- AmberTools version (module load): Choose the appropriate version based on your calculation requirements. See Versions available for available builds and their characteristics.
The OpenMP environment variables (OMP_NUM_THREADS, OMP_PROC_BIND, OMP_PLACES) are set in the script but are typically not used since MMPBSA.py.MPI runs in pure MPI mode with --cpus-per-task=1. These variables are included for consistency with other MPI tools and in case any Python libraries use OpenMP internally.
Note
MMPBSA.py.MPI uses MPI for parallelisation across multiple compute nodes, similar to sander.MPI. For optimal performance with large systems or multiple trajectories, use MMPBSA.py.MPI instead of serial MMPBSA.py.
Save the complete SLURM job description as a file, for example /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/mmpbsa_mpi.sh, and submit it to the queue:
cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
sbatch mmpbsa_mpi.sh
Upon successful submission, the standard output will be directed by SLURM into the file /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out (where %j stands for the SLURM job ID), while the standard error output will be stored in /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err.
Python single-threaded tools¶
Note
Our AmberTools builds include some Python-based tools that run on a single CPU core using one thread, such as MMPBSA.py and ante-MMPBSA.py. They are compatible with Python 3.12.
When loading the environment module ambertools, Python 3.12 is loaded automatically. It is provided by the anaconda3 module (loaded as a dependency of ambertools).
This script is used for Python-based tools that may run serially or with limited parallelism:
- MMPBSA.py: Serial MM-PBSA/MM-GBSA binding free energy calculations
- ante-MMPBSA.py: Pre-processing for MMPBSA
- Other Python tools in AmberTools
#!/bin/bash
#
#SBATCH --partition=cn                       # Partition of compute nodes
#SBATCH --job-name=mmpbsa_single_threaded    # Job name
#SBATCH --time=00:30:00                      # WallTime - set it accordingly
#SBATCH --account=<specify_your_slurm_account_name_here>
#SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>
#SBATCH --nodes=1                            # One node
#SBATCH --ntasks=1                           # One task per node
#SBATCH --cpus-per-task=1                    # One CPU per task
#SBATCH --mem=2G                             # Memory per task (increase if needed, typically 2-16G for small to medium systems)
#SBATCH -o slurm.%j.out                      # STDOUT
#SBATCH -e slurm.%j.err                      # STDERR

# Load required modules
module purge || exit
module load ambertools/24/24.0 || exit

# Change to submission directory
cd ${SLURM_SUBMIT_DIR}

# Run MMPBSA.py
MMPBSA.py -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc
In the script above, edit the parameters and resources required for successfully running and completing the job.
- For Python tools:
  - Use --ntasks=1 and --cpus-per-task=1 for the serial, single-threaded tools such as MMPBSA.py
  - Some Python tools may use internal parallelisation; in that case set --cpus-per-task accordingly
  - Adjust --mem based on system size and tool requirements
Save the complete SLURM job description as a file, for example /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/mmpbsa.sh, and submit it to the queue:
cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
sbatch mmpbsa.sh
Upon successful submission, the standard output will be directed by SLURM into the file /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out (where %j stands for the SLURM job ID), while the standard error output will be stored in /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err.
Scaling and performance considerations¶
Here we provide some guidelines and considerations for scaling and performance of the tools. Always consider the specific system and workload when determining the optimal number of threads and resources. If you are not sure, please contact the Discoverer HPC team (see Getting help).
OpenMP scaling considerations¶
OpenMP parallelisation efficiency depends on several factors:
- Algorithm parallelisation: Some algorithms parallelise better than others. Not all code sections may benefit from threading.
- Memory bandwidth: As the number of threads increases, memory bandwidth may become a bottleneck, limiting scaling.
- Cache coherence: False sharing and cache line conflicts can degrade performance with too many threads.
- Problem size: Small problems may not benefit from many threads due to overhead. Larger problems typically scale better.
- NUMA topology: Thread placement across NUMA domains affects performance. Use OMP_PROC_BIND and OMP_PLACES to control placement.
Testing OpenMP scaling¶
To determine the optimal number of OpenMP threads for your workload:
- Start with fewer threads: Begin testing with 8-16 threads, then gradually increase.
- Test multiple configurations: Run the same workload with different thread counts (e.g., 8, 16, 32, 64) and compare wall-clock times.
- Monitor performance: Check the output logs for wall-clock time (total execution time), CPU utilisation (are all threads being used?), and memory bandwidth utilisation.
- Calculate speedup: Speedup = Time(serial) / Time(threads). Efficiency = Speedup / Number_of_threads. Aim for efficiency > 50%.
- Watch for diminishing returns: If doubling the thread count does not reduce the wall-clock time by a factor of at least 1.5, you have likely hit diminishing returns.
Example scaling test script:
#!/bin/bash
# Test OpenMP scaling by running the same workload with different thread counts.
# Request the largest thread count once in the job script (e.g. --cpus-per-task=64);
# SBATCH directives cannot be changed inside the loop at run time.
for threads in 8 16 32 64; do
    echo "Testing with ${threads} threads"
    export OMP_NUM_THREADS=${threads}
    time sander.OMP -O -i mdin -p prmtop.0 -o out.${threads} -c inpcrd.0 -r restrt.${threads}
    echo "Completed ${threads} threads test"
done
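From the measured wall-clock times you can then compute the speedup and efficiency defined above, for example with a small helper such as this sketch (the timings are placeholder values taken from your own runs):

# Speedup = Time(serial) / Time(threads); Efficiency = Speedup / Number_of_threads
t_serial=5400      # wall-clock time of the baseline run, in seconds (placeholder)
t_threads=900      # wall-clock time of the multi-threaded run, in seconds (placeholder)
threads=64
speedup=$(echo "${t_serial} / ${t_threads}" | bc -l)
efficiency=$(echo "${speedup} / ${threads}" | bc -l)
printf "speedup = %.2f, efficiency = %.2f\n" "${speedup}" "${efficiency}"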
Recommended starting points:
- Small systems (<50k atoms): Start with 8-16 threads
- Medium systems (50k-100k atoms): Start with 16-32 threads
- Large systems (>100k atoms): Start with 32-64 threads
Note
Do not over-subscribe CPU cores! Set --cpus-per-task to no more than the number of physical CPU cores available on the compute node. Over-subscription (using more threads than cores) typically degrades performance due to context switching overhead.
CPU thread affinity and pinning¶
Our build of AmberTools uses Open MPI’s thread affinity management:
- Open MPI binding: Use --map-by socket:PE=${OMP_NUM_THREADS} to bind MPI processes to sockets with PE (Processing Element) threads per MPI rank
- Core binding: Use --bind-to core to bind each MPI rank to a CPU core
- Thread affinity: OpenMP environment variables (OMP_PROC_BIND, OMP_PLACES) control OpenMP thread affinity within each MPI rank
Note
For OpenMP-only tools (e.g., sander.OMP, cpptraj.OMP, mdgx.OMP), thread affinity settings are particularly important for performance. See OpenMP scaling considerations for detailed guidelines on testing and optimising OpenMP thread affinity and scaling.
Recommended OpenMPI settings:
mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
       --bind-to core \
       --report-bindings \
       sander.MPI -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0
Open MPI binding options:
- --map-by socket:PE=${OMP_NUM_THREADS}: Maps MPI processes to sockets with PE threads per rank
- --bind-to core: Binds each MPI rank to a CPU core
- --report-bindings: Shows CPU binding (useful for debugging)
OpenMP thread affinity settings (when combined with Open MPI settings):
- OMP_PROC_BIND=close: Binds threads close to the parent MPI process
- OMP_PLACES=cores: Places threads on cores
Resource allocation guidelines¶
These resource allocation guidelines are specifically for MPI-parallel AmberTools executables that run across multiple compute nodes:
- sander.MPI: Multi-node molecular dynamics simulations
- sander.LES.MPI: Multi-node LES simulations
- cpptraj.MPI: Multi-node trajectory analysis
- MMPBSA.py.MPI: Multi-node binding free energy calculations
- sqm.MPI: Multi-node QM calculations
- mdgx.MPI: Multi-node geometry/topology processing
To achieve optimal performance when running these MPI-parallel AmberTools executables on Discoverer CPU cluster, follow the guidelines below. For details on CPU thread affinity and process pinning, see CPU thread affinity and pinning.
Note
Serial and OpenMP tools (sander, sander.OMP, cpptraj, cpptraj.OMP, etc.) typically run on single nodes or workstations and do not require the multi-node resource allocation strategies described here. For single-node OpenMP tools, set --cpus-per-task to the number of OpenMP threads you want, up to the number of physical cores available on a single node (test scaling first, see OpenMP scaling considerations).
| Scenario | Nodes | Tasks/Node | Tasks/Socket | CPUs/Task | Total Cores | Use Case |
|---|---|---|---|---|---|---|
| Small system | 1 | 32 | 16 | 1 | 32 | <50k atoms |
| Medium system | 2 | 64 | 32 | 1 | 128 | 50k-100k atoms |
| Large system | 4 | 64 | 32 | 1 | 256 | 100k-200k atoms |
| Very large system | 8+ | 64 | 32 | 1 | 512+ | >200k atoms |
Guidelines:
- Number of nodes: Start with 1-2 nodes for small systems, scale up for larger systems
- Tasks per node: Use 32-64 MPI tasks per node depending on system size
- Tasks per socket: Set to distribute tasks evenly across NUMA domains (32 tasks per socket for 64 tasks/node)
- CPUs per task: Always use 1 (pure MPI mode) since OpenMP usage is typically minimal
- Memory: Do not exceed 251G per node on Discoverer CPU cluster
Total resource allocation calculations:
- Total MPI ranks = nodes × tasks-per-node
- Total CPU cores = nodes × tasks-per-node × cpus-per-task
- Example: 2 nodes × 64 tasks/node × 1 cpu/task = 128 cores
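As an illustration, the "Large system" row of the table above corresponds to the following resource-related SBATCH directives (combine them with the full multi-node script shown earlier in this document):

#SBATCH --nodes=4                # 4 nodes
#SBATCH --ntasks-per-node=64     # 64 MPI ranks per node
#SBATCH --ntasks-per-socket=32   # 32 MPI ranks per socket
#SBATCH --cpus-per-task=1        # pure MPI
# Total: 4 nodes x 64 tasks/node x 1 cpu/task = 256 cores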
Build information¶
Build recipes, build logs, and build documentation for the AmberTools builds provided on Discoverer are available at the AmberTools build repository.