AmberTools (CPU)
================

.. toctree::
   :maxdepth: 1
   :caption: Contents:

.. contents:: Table of Contents
   :depth: 3

About
-----

According to the `AmberTools website <https://ambermd.org/AmberTools.php>`_, AmberTools is a comprehensive suite of biomolecular simulation tools that works alongside the AMBER molecular dynamics package. It provides a collection of programs for setting up, running, and analysing molecular dynamics simulations, with a focus on biomolecular systems such as proteins, nucleic acids, and small molecules.

AmberTools is freely available and open-source, providing extensive functionality for preparing simulations, analysing trajectories, and performing computational chemistry calculations. Unlike AMBER's ``pmemd.MPI``, AmberTools has no licensing restrictions and can be used by both academic and commercial users on Discoverer CPU cluster.

This document describes running AmberTools on Discoverer CPU cluster.

Documentation about how to use AmberTools is available here: https://ambermd.org/Manuals.php

.. note:: If you are looking for running ``pmemd.MPI`` on Discoverer CPU cluster, see :doc:`amber`.

Versions available
------------------

Currently we support the following versions of AmberTools:

- 24
- 25

To check which AmberTools versions are currently supported on Discoverer, execute on the login node:

 .. code-block:: bash

    module avail ambertools

AmberTools programs
-------------------

AmberTools24 includes a wide variety of tools beyond the sander variants described above. Some of the most important ones are:

Structure preparation
.....................

- ``tleap`` (tLEaP): Text-based LEaP for molecular structure preparation, topology generation, and system setup
  - Build and modify molecular structures
  - Assign force field parameters
  - Solvate systems (water, ions)
  - Create topology and coordinate files
  - Command-line interface

- ``xleap`` (xLEaP): Graphical LEaP (X11-based) for molecular structure preparation
  - Same functionality as tleap with graphical user interface
  - Requires X11 display for GUI
  - Useful for interactive structure building

- ``parmed`` (ParmEd): Parameter file editor and molecular structure manipulation
  - Edit topology files
  - Add/remove atoms, bonds, angles, dihedrals
  - Modify force field parameters
  - Combine multiple structures

Simulation analysis
...................

- ``cpptraj``: Powerful trajectory analysis tool (formerly ptraj, serial version)
  - Analyse trajectories from multiple MD engines (AMBER, GROMACS, CHARMM, NAMD)
  - Calculate geometric properties (distances, angles, RMSD)
  - Hydrogen bond analysis
  - Secondary structure analysis
  - Clustering and principal component analysis
  - Extensive scripting capabilities

- ``cpptraj.MPI``: Parallel MPI version of cpptraj for multi-node trajectory analysis
  - Distribute analysis across multiple compute nodes
  - Suitable for large trajectories or computationally intensive analyses

- ``cpptraj.OMP``: OpenMP parallel version of cpptraj for shared-memory trajectory analysis
  - Uses threading for parallelisation within a single node
  - Suitable for multi-core workstations

- ``process_mdout.perl``: Extract and analyse energy data from AMBER MD output files

Binding free energy calculations
.................................

- ``MMPBSA.py``: MM-PBSA and MM-GBSA binding free energy calculations (serial version)
  - Calculate binding free energies using implicit solvent models
  - Decompose binding energies by residue
  - Perform per-residue and per-atom energy decompositions
  - Support for multiple MD engines

- ``MMPBSA.py.MPI``: Parallel MPI version of MMPBSA.py
  - Distribute MM-PBSA/MM-GBSA calculations across multiple compute nodes
  - Suitable for large systems or multiple trajectory analysis

Quantum mechanics / molecular mechanics
.......................................

- ``sqm``: Semi-empirical quantum mechanics program (AM1, PM3, AM1-D, PM3-D methods, serial version)
  - Geometry optimisations
  - Energy and force calculations

- ``sqm.MPI``: Parallel MPI version of sqm
  - Distribute QM calculations across multiple compute nodes
  - Suitable for larger QM regions or multiple QM calculations

- ``mdgx``: Molecular dynamics geometry and topology exchange tool (serial version)
  - Generate geometry and topology files
  - Convert between different formats

- ``mdgx.MPI``: Parallel MPI version of mdgx
  - Distribute processing across multiple compute nodes

- ``mdgx.OMP``: OpenMP parallel version of mdgx
  - Uses threading for parallelisation within a single node

Utility programs
.................

- ``antechamber``: Automatic atom type assignment and parameter generation for small molecules (serial version)
  - Generate GAFF parameters for organic molecules
  - Create parameter files for new compounds
  - Interface with quantum chemistry programs

- ``parmchk2``: Check and generate Amber parameter files for molecules processed by antechamber
  - Validates GAFF parameters
  - Generates missing parameters

- ``reduce``: Add missing hydrogens to PDB structures
  - Places hydrogens at optimal positions
  - Handles protonation states

- ``pdb4amber``: Prepare PDB files for Amber simulations
  - Removes non-standard residues
  - Fixes common PDB format issues
  - Prepares structures for leap

- ``packmol``: Pack molecules into defined regions (solvation, membrane insertion)
  - Solvate systems with water
  - Insert molecules into membranes
  - Generate mixed-solvent systems

- ``packmol-memgen``: Generate membrane configurations using packmol

- ``ambpdb``: Convert Amber topology/coordinate files to PDB format
  - Extract coordinates from trajectory files
  - Convert topology files to PDB

- ``ambmask``: Manipulate Amber mask expressions
  - Test and validate mask syntax
  - Useful for advanced Amber scripting

- ``quick``: Semi-empirical quantum mechanics calculations (serial version)
  - QM/MM calculations
  - Geometry optimisations

- ``quick.MPI``: Parallel MPI version of quick for QM/MM calculations

- ``gem.pmemd``: Generalized Ensemble Methods (GEM) for enhanced sampling (serial version)
  - Temperature replica exchange
  - Hamiltonian replica exchange

- ``gem.pmemd.MPI``: Parallel MPI version of gem.pmemd for multi-node GEM simulations

Additional analysis tools
..........................

- ``pbsa``: Poisson-Boltzmann surface area calculations
  - Calculate solvation free energies
  - Electrostatic calculations

- ``gbnsr6``: Generalized Born (GB) calculations using GB-Neck2 model
  - Implicit solvent calculations
  - Solvation free energy calculations

- ``simplepbsa``: Simplified PB calculations (serial version)
  - Fast PB approximations
  - Binding energy calculations

- ``simplepbsa.MPI``: Parallel MPI version of simplepbsa

- ``rism1d``: One-dimensional reference interaction site model
  - Solvation structure analysis
  - Thermodynamic properties

- ``rism3d.snglpnt``: Three-dimensional RISM (serial version)
  - 3D solvation structure
  - Site-site correlation functions

- ``rism3d.snglpnt.MPI``: Parallel MPI version of rism3d.snglpnt

- ``saxs_md``: Small-angle X-ray scattering analysis from MD trajectories (serial version)
  - Calculate SAXS profiles
  - Compare with experimental data

- ``saxs_md.OMP``: OpenMP parallel version of saxs_md

- ``saxs_rism``: SAXS from RISM calculations (serial version)
  - Combine RISM and SAXS analysis

- ``saxs_rism.OMP``: OpenMP parallel version of saxs_rism

- ``nmode``: Normal mode analysis
  - Vibrational frequencies
  - Entropy calculations

- ``mmpbsa_py_energy``: Extract energy components from MMPBSA calculations

- ``mmpbsa_py_nabnmode``: NAB-based normal mode calculations for MMPBSA

Enhanced sampling and free energy methods
.........................................

- ``ndfes``: Neural network-based free energy surfaces (serial version)
  - Enhanced sampling analysis
  - Free energy calculations

- ``ndfes.OMP``: OpenMP parallel version of ndfes

- ``ndfes-path``: Path-based analysis for ndfes calculations

- ``ndfes-path.OMP``: OpenMP parallel version of ndfes-path

- ``ndfes-AvgFESs.py``: Average free energy surfaces from multiple simulations

- ``ndfes-CheckEquil.py``: Check equilibrium in enhanced sampling simulations

- ``ndfes-CombineMetafiles.py``: Combine metadynamics files

- ``ndfes-PrepareAmberData.py``: Prepare Amber data for ndfes analysis

- ``ndfes-PrintFES.py``: Print free energy surfaces

- ``ndfes-path-analyzesims.py``: Analyse path simulations

- ``ndfes-path-prepguess.py``: Prepare initial guesses for path calculations

- ``edgembar``: Energy decomposition group method BAR (serial version)
  - Free energy decomposition
  - Binding energy analysis

- ``edgembar.OMP``: OpenMP parallel version of edgembar

- ``edgembar-WriteGraphHtml.py``: Generate HTML graphs for edgembar results

- ``edgembar-amber2dats.py``: Convert Amber data for edgembar

- ``edgembar-bookend2dats.py``: Convert bookend data for edgembar

Parameter fitting and optimisation
...................................

- ``paramfit``: Parameter fitting for force field development (serial version)
  - Optimise force field parameters
  - Fit to quantum chemistry data

- ``paramfit.OMP``: OpenMP parallel version of paramfit

- ``resp``: Restrained Electrostatic Potential fitting
  - Generate atomic charges from quantum chemistry
  - ESP fitting

- ``respgen``: Generate RESP input files

- ``parmcal``: Parameter calculation utilities

Python analysis and utility tools
...................................

- ``MCPB.py``: Metal Center Parameter Builder
  - Generate parameters for metal-containing systems
  - Fit metal-ligand interactions

- ``CartHess2FC.py``: Convert Cartesian Hessian to force constants

- ``IPMach.py``: Ion parameterisation machine learning

- ``OptC4.py``: Optimise C4 parameters

- ``PdbSearcher.py``: Search PDB structures

- ``ProScrs.py``: Protein scoring utilities

- ``bar_pbsa.py``: BAR method for PBSA calculations

- ``py_resp.py``: Python interface to RESP calculations

- ``pype-resp.py``: Enhanced Python RESP interface

- ``pyresp_gen.py``: Generate RESP input files

- ``ceinutil.py``, ``cpinutil.py``, ``cpeinutil.py``: Constant pH utilities
  - Constant pH MD setup
  - pH-dependent calculations

- ``cestats``, ``cphstats``: Constant pH statistics

- ``finddgref.py``: Find reference free energy values

- ``fitpkaeo.py``: Fit pKa values

- ``genremdinputs.py``: Generate replica exchange MD input files

- ``mdout_analyzer.py``: Analyse MD output files

- ``mdout2pymbar.pl``: Convert MD output to PyMBAR format

- ``metalpdb2mol2.py``: Convert metal-containing PDB to MOL2 format

- ``mol2rtf.py``: Convert MOL2 to RTF format

- ``charmmlipid2amber.py``: Convert CHARMM lipid parameters to Amber format

- ``amb2chm_par.py``, ``amb2chm_psf_crd.py``: Convert Amber to CHARMM formats

- ``amb2gro_top_gro.py``: Convert Amber to GROMACS formats

- ``car_to_files.py``: Convert Cartesian coordinate files

Specialised utilities
.....................

- ``AddToBox``, ``ChBox``: Manipulate simulation boxes

- ``PropPDB``: PDB property calculations

- ``UnitCell``: Unit cell manipulation

- ``XrayPrep``: Prepare structures for X-ray refinement

- ``add_pdb``, ``add_xray``: Add structures from PDB or X-ray data

- ``process_minout.perl``: Process minimisation output

- ``process_mdout.perl``: Process MD output (already mentioned above)

- ``teLeap``: Terminal-based LEaP (alternative interface)

- ``xaLeap``: X11-based LEaP (alternative interface to xleap)

- ``ucpp``: Utility for processing Amber files

- ``test-api``, ``test-api.MPI``: API testing tools

Note: This is not an exhaustive list. AmberTools includes many more specialised tools and utilities. For a complete list of available tools, see the AmberTools documentation or check the ``bin`` directory of your installation.

Our AmberTools builds use Open MPI as the MPI library.

Features:

   - Supports multi-node simulations
   - Uses Open MPI for inter-node communication
   - Compatible with SLURM multi-node job submission
   - Can handle larger systems across multiple nodes
   - Integrated with PLUMED2 for enhanced sampling methods
   - Uses LLVM.org OpenMP runtime for optimal threading performance
   - GUI support enabled (leap graphics, etc.)

Executable names: ``sander.MPI``, ``sander``, ``cpptraj``, ``leap``, ``parmed``, ``antechamber``, ``sqm``, ``MMPBSA.py``, and many others.

For more details see `Multi-node run using MPI`_.

.. important:: Users are welcome to bring, or compile, and use their own builds of AmberTools but **those builds will not be supported by Discoverer HPC team.**

Build recipes, build logs, and build documentation for the AmberTools builds provided on Discoverer are available at the `AmberTools build repository <https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/AmberTools>`_.

Running the tools
-----------------

Running data analysis or simulations means invoking AmberTools executables (such as ``sander.MPI``, ``sander``, ``cpptraj``, etc.) for preparing systems, running simulations, or analysing trajectories.

.. warning:: **You MUST NOT execute simulations or data analysis tools directly upon the login node (login.discoverer.bg).** You have to run your simulations as SLURM jobs only.

.. warning:: Write your trajectories or data files and result of analysis only inside your :doc:`projectfolder` and DO NOT use for that purpose (under any circumstances) your :doc:`homefolder`!

Common AmberTools executables:

  Sander molecular dynamics engines:

    - ``sander``: Serial version of sander for small systems or testing (single-threaded)
    - ``sander.MPI``: Parallel MPI version of sander for multi-node molecular dynamics simulations across distributed memory systems
    - ``sander.OMP``: OpenMP parallel version of sander for shared-memory parallelisation using threading (single-node multi-core)
    - ``sander.LES``: Locally Enhanced Sampling (LES) version of sander. LES is an enhanced sampling method that allows selected atoms (e.g., side chains or ligands) to be represented by multiple copies, enabling more efficient conformational sampling. This version is serial (single-threaded)
    - ``sander.LES.MPI``: LES version with MPI parallelisation, combining Locally Enhanced Sampling with multi-node distributed memory parallelisation

  Other tools:

    - ``cpptraj``: Trajectory analysis tool (serial version)
    - ``cpptraj.MPI``: Parallel MPI version of cpptraj for multi-node trajectory analysis
    - ``cpptraj.OMP``: OpenMP parallel version of cpptraj for shared-memory trajectory analysis
    - ``tleap``: Text-based LEaP for structure preparation and topology generation
    - ``xleap``: Graphical LEaP (X11-based) for structure preparation and topology generation
    - ``parmed``: Parameter file editor
    - ``antechamber``: Automatic atom type assignment for small molecules
    - ``sqm``: Semi-empirical quantum mechanics program (serial version)
    - ``sqm.MPI``: Parallel MPI version of sqm
    - ``mdgx``: Molecular dynamics geometry and topology exchange tool (serial version)
    - ``mdgx.MPI``: Parallel MPI version of mdgx
    - ``mdgx.OMP``: OpenMP parallel version of mdgx

  Python tools:

    Serial Python tools:
    - ``MMPBSA.py``: MM-PBSA and MM-GBSA binding free energy calculations (serial version, single-threaded)

    MPI-parallel Python tools:
    - ``MMPBSA.py.MPI``: Parallel MPI version of MMPBSA.py for multi-node binding free energy calculations

    .. note:: Most Python tools in AmberTools (e.g., ``ante-MMPBSA.py``, ``MCPB.py``, etc.) are serial and run on a single CPU core. Only ``MMPBSA.py.MPI`` supports MPI parallelisation.

Multi-node run using MPI
........................

.. note:: The SLURM script displayed below applies only to those of the AmberTools tools that support MPI parallelisation:

- **Fortran/C++ MPI tools**: ``sander.MPI``, ``sander.LES.MPI``, ``cpptraj.MPI``, ``sqm.MPI``, ``mdgx.MPI``
- **Python MPI tools**: ``MMPBSA.py.MPI`` (see `MPI-parallel Python tools`_)

For detailed guidelines on optimal resource allocation (number of nodes, tasks per node, memory requirements, etc.) based on system size, see `Resource allocation guidelines`_.

This script is used for multi-node MPI runs, but you can use it on a single node as well (by setting ``--nodes=1`` and ``--ntasks-per-node=N`` where N is the number of MPI ranks):

 .. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn         # Partition (you may need to change this)
   #SBATCH --job-name=sander_mpi   # Job name
   #SBATCH --time=512:00:00       # WallTime - set it accordingly

   #SBATCH --account=<specify_your_slurm_account_name_here>
   #SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

   #SBATCH --nodes=2              # Number of nodes
   #SBATCH --ntasks-per-node=64   # Number of MPI tasks to run upon each node
   #SBATCH --ntasks-per-socket=32 # Number of tasks per NUMA-bound socket
   #SBATCH --cpus-per-task=1      # Number of OpenMP threads per MPI rank (recommended: 1 for pure MPI)
   #SBATCH --ntasks-per-core=1    # Each MPI rank is bound to a CPU core
   #SBATCH --mem=251G             # Do not exceed this on Discoverer CPU cluster

   #SBATCH -o slurm.%j.out        # STDOUT
   #SBATCH -e slurm.%j.err        # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Set OpenMP environment variables (if needed)
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=close          # Bind threads close to parent MPI process
   export OMP_PLACES=cores              # Place threads on cores

   # Optimise InfiniBand communication (if available)
   export UCX_NET_DEVICES=mlx5_0:1

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run MPI-parallel AmberTools tool
   # Examples:
   #   - For Fortran/C++ tools: sander.MPI, cpptraj.MPI, sqm.MPI, mdgx.MPI
   #   - For Python tools: MMPBSA.py.MPI
   #
   # OpenMPI options:
   # --map-by socket:PE=${OMP_NUM_THREADS} binds MPI processes to sockets
   #   with PE (Processing Element) threads per MPI rank
   # --bind-to core binds each MPI rank to a CPU core
   # --report-bindings shows CPU binding (useful for debugging)
   
   # Example for Fortran/C++ MPI tool (sander.MPI):
   mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
          --bind-to core \
          --report-bindings \
          sander.MPI -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0
   
   # Example for Python MPI tool (MMPBSA.py.MPI):
   # mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
   #        --bind-to core \
   #        --report-bindings \
   #        MMPBSA.py.MPI -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc

In the script above, edit the parameters and resources required for successfully running and completing the job. For detailed guidelines on optimal resource allocation, see `Resource allocation guidelines`_.

 - SLURM partition of compute nodes (``--partition``): Specifies which group of nodes (partition) to use. For AmberTools on Discoverer, use ``cn`` partition which contains the CPU-optimised nodes.

 - Job name (``--job-name``): A descriptive name for your job that will appear in the queue. Use meaningful names like ``sander_protein_sim`` or ``sander_membrane_run``.

 - Wall time (``--time``): Maximum time your job can run. Format is ``HH:MM:SS`` (e.g., ``48:00:00`` for 48 hours). Set this based on your simulation size and expected runtime.

 - Number of compute nodes (``--nodes``): How many physical nodes to allocate. For multi-node AmberTools simulations, this determines the total computational power available. See `Resource allocation guidelines`_ for recommendations based on system size.

 - Number of MPI processes per node (``--ntasks-per-node``): Critical for AmberTools performance. On Discoverer with 8 NUMA domains per node, use 64 MPI tasks to get 8 tasks per NUMA domain for optimal memory locality. See `Resource allocation guidelines`_ for recommended values based on system size.

 - Number of MPI tasks per NUMA domain (``--ntasks-per-socket``): Essential for NUMA-aware performance. Set to 32 to place exactly 32 MPI tasks per NUMA domain (64 total tasks ÷ 2 sockets per NUMA domain = 32 per domain). This ensures optimal memory access patterns and cache utilisation within each NUMA boundary.

 - Number of OpenMP threads per MPI process (``--cpus-per-task``): Controls hybrid parallelism. **Recommended value is 1** (pure MPI mode) since OpenMP usage in sander.MPI is limited. For information on how this affects CPU thread affinity and pinning, see `CPU thread affinity and pinning`_.

 - AmberTools version (``module load``): Choose the appropriate version based on your simulation requirements. See `Versions available`_ for available builds and their characteristics.

Save the complete SLURM job description as a file, for example ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_mpi.sh``, and submit it to the queue:

 .. code-block:: bash

   cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
   sbatch sander_mpi.sh

Upon successful submission, the standard output will be directed by SLURM into the file ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out`` (where ``%j`` stands for the SLURM job ID), while the standard error output will be stored in ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err``.

Single-threaded execution
.........................

.. note:: The SLURM script displayed below applies only to those of the AmberTools tools that run on a single CPU core using one thread:
  
  - ``sander``: Molecular dynamics simulations
  - ``cpptraj``: Trajectory analysis
  - ``sqm``: Semi-empirical quantum mechanics program
  - ``tleap``: Protein preparation tool
  - ``parmed``: Protein structure manipulation
  - ``antechamber``: Automatic atom type assignment for small molecules
  - ``MMPBSA.py``: MM-PBSA and MM-GBSA binding free energy calculations
  - ``mdgx``: Molecular dynamics geometry and topology exchange tool

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn         # Partition of compute nodes
   #SBATCH --job-name=sander_single_threaded   # Job name
   #SBATCH --time=01:00:00        # WallTime - set it accordingly

   #SBATCH --account=<specify_your_slurm_account_name_here>
   #SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

   #SBATCH --nodes=1              # Single node
   #SBATCH --ntasks=1             # Single task
   #SBATCH --cpus-per-task=1      # One CPU per task
   #SBATCH --mem=32G              # Memory per task (increase if needed, typically 16-64G for small to medium systems)

   #SBATCH -o slurm.%j.out        # STDOUT
   #SBATCH -e slurm.%j.err        # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run serial sander
   sander -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0

In the script above, edit the parameters and resources required for successfully running and completing the job.

For serial tools:
 - Use ``--ntasks=1`` and ``--cpus-per-task=1`` for single-threaded tools
 - Adjust ``--mem`` based on system size (typically 16-64G for small to medium systems)
 - These tools run on a single CPU core and are suitable for small systems or testing

Save the complete SLURM job description as a file, for example ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_serial.sh``, and submit it to the queue:

 .. code-block:: bash

   cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
   sbatch sander_serial.sh

Upon successful submission, the standard output will be directed by SLURM into the file ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out`` (where ``%j`` stands for the SLURM job ID), while the standard error output will be stored in ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err``.

Single-node run using OpenMP
............................

This script is used for **OpenMP-parallel tools** that use threading on a single node:

- ``sander.OMP``: OpenMP molecular dynamics simulations
- ``cpptraj.OMP``: OpenMP trajectory analysis
- ``mdgx.OMP``: OpenMP geometry/topology processing

.. warning:: **OpenMP scaling is not guaranteed!** The optimal number of OpenMP threads depends on many factors including algorithm efficiency, memory bandwidth, cache usage, and problem size. Always test different thread counts to find the optimal configuration for your specific system and workload. See `OpenMP scaling considerations`_ for more details.

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn         # Partition of compute nodes
   #SBATCH --job-name=sander_omp  # Job name
   #SBATCH --time=01:00:00        # WallTime - set it accordingly

   #SBATCH --account=<specify_your_slurm_account_name_here>
   #SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

   #SBATCH --nodes=1              # Single node
   #SBATCH --ntasks=1             # Single task
   #SBATCH --cpus-per-task=64     # Number of OpenMP threads (START WITH FEWER AND TEST SCALING!)
   #SBATCH --mem=251G             # Memory per task (do not exceed this on Discoverer CPU cluster)

   #SBATCH -o slurm.%j.out        # STDOUT
   #SBATCH -e slurm.%j.err        # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Set OpenMP environment variables
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=close          # Bind threads close to parent process
   export OMP_PLACES=cores              # Place threads on cores

   # Optional: Enable OpenMP verbose output for debugging
   # export OMP_DISPLAY_ENV=VERBOSE

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run OpenMP sander
   sander.OMP -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0

In the script above, edit the parameters and resources required for successfully running and completing the job.

For OpenMP tools:
 - Use ``--ntasks=1`` and ``--cpus-per-task=N`` where N is the number of OpenMP threads (**test scaling first!**)
 - Set ``OMP_NUM_THREADS`` equal to ``--cpus-per-task``
 - Start with fewer threads (8-16) and test scaling before using higher thread counts
 - Adjust ``--mem`` based on system size and number of threads
 - These tools use shared-memory threading and are suitable for single-node multi-core simulations
 - Always test scaling: Run with different thread counts to find optimal performance
 - Monitor wall-clock time and CPU utilisation to identify optimal thread count
 - See `OpenMP scaling considerations`_ for detailed guidelines on testing and optimising OpenMP performance

Save the complete SLURM job description as a file, for example ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/sander_omp.sh``, and submit it to the queue:

 .. code-block:: bash

   cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
   sbatch sander_omp.sh

Upon successful submission, the standard output will be directed by SLURM into the file ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out`` (where ``%j`` stands for the SLURM job ID), while the standard error output will be stored in ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err``.

MPI-parallel Python tools
.........................

For **MPI-parallel Python tools** such as ``MMPBSA.py.MPI``, use the **multi-node MPI script**. This script is identical to the one shown in the `Multi-node run using MPI`_ section, but with the tool replaced by the MPI-parallel Python tool.

For detailed guidelines on optimal resource allocation (number of nodes, tasks per node, memory requirements, etc.) based on system size, see `Resource allocation guidelines`_.

 .. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn         # Partition (you may need to change this)
   #SBATCH --job-name=mmpbsa_mpi   # Job name
   #SBATCH --time=512:00:00       # WallTime - set it accordingly

   #SBATCH --account=<specify_your_slurm_account_name_here>
   #SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

   #SBATCH --nodes=2              # Number of nodes
   #SBATCH --ntasks-per-node=64   # Number of MPI tasks to run upon each node
   #SBATCH --ntasks-per-socket=32 # Number of tasks per NUMA-bound socket
   #SBATCH --cpus-per-task=1      # Number of OpenMP threads per MPI rank (recommended: 1 for pure MPI)
   #SBATCH --ntasks-per-core=1    # Each MPI rank is bound to a CPU core
   #SBATCH --mem=251G             # Do not exceed this on Discoverer CPU cluster

   #SBATCH -o slurm.%j.out        # STDOUT
   #SBATCH -e slurm.%j.err        # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Set OpenMP environment variables (if needed)
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=close          # Bind threads close to parent MPI process
   export OMP_PLACES=cores              # Place threads on cores

   # Optimise InfiniBand communication (if available)
   export UCX_NET_DEVICES=mlx5_0:1

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run MMPBSA.py.MPI with OpenMPI
   # --map-by socket:PE=${OMP_NUM_THREADS} binds MPI processes to sockets
   #   with PE (Processing Element) threads per MPI rank
   # --bind-to core binds each MPI rank to a CPU core
   # --report-bindings shows CPU binding (useful for debugging)
   mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
          --bind-to core \
          --report-bindings \
          MMPBSA.py.MPI -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc

In the script above, edit the parameters and resources required for successfully running and completing the job. For detailed guidelines on optimal resource allocation, see `Resource allocation guidelines`_.

 - SLURM partition of compute nodes (``--partition``): Specifies which group of nodes (partition) to use. For AmberTools on Discoverer, use ``cn`` partition which contains the CPU-optimised nodes.

 - Job name (``--job-name``): A descriptive name for your job that will appear in the queue. Use meaningful names like ``mmpbsa_protein_binding`` or ``mmpbsa_multitrajectory``.

 - Wall time (``--time``): Maximum time your job can run. Format is ``HH:MM:SS`` (e.g., ``48:00:00`` for 48 hours). Set this based on your calculation size and expected runtime.

 - Number of compute nodes (``--nodes``): How many physical nodes to allocate. For multi-node ``MMPBSA.py.MPI`` runs, this determines the total computational power available. See `Resource allocation guidelines`_ for recommendations based on system size.

 - Number of MPI processes per node (``--ntasks-per-node``): Critical for ``MMPBSA.py.MPI`` performance. On Discoverer with 8 NUMA domains per node, use 64 MPI tasks to get 8 tasks per NUMA domain for optimal memory locality. See `Resource allocation guidelines`_ for recommended values based on system size.

 - Number of MPI tasks per NUMA domain (``--ntasks-per-socket``): Essential for NUMA-aware performance. Set to 32 to place exactly 32 MPI tasks per NUMA domain (64 total tasks ÷ 2 sockets per NUMA domain = 32 per domain). This ensures optimal memory access patterns and cache utilisation within each NUMA boundary.

 - Number of OpenMP threads per MPI process (``--cpus-per-task``): Controls hybrid parallelism. **Recommended value is 1** (pure MPI mode) for ``MMPBSA.py.MPI`` since OpenMP usage in Python MPI tools is typically minimal. For information on how this affects CPU thread affinity and pinning, see `CPU thread affinity and pinning`_.

 - AmberTools version (``module load``): Choose the appropriate version based on your calculation requirements. See `Versions available`_ for available builds and their characteristics.

The OpenMP environment variables (``OMP_NUM_THREADS``, ``OMP_PROC_BIND``, ``OMP_PLACES``) are set in the script but are typically not used since ``MMPBSA.py.MPI`` runs in pure MPI mode with ``--cpus-per-task=1``. These variables are included for consistency with other MPI tools and in case any Python libraries use OpenMP internally.

.. note:: ``MMPBSA.py.MPI`` uses MPI for parallelisation across multiple compute nodes, similar to ``sander.MPI``. For optimal performance with large systems or multiple trajectories, use ``MMPBSA.py.MPI`` instead of serial ``MMPBSA.py``.

Save the complete SLURM job description as a file, for example ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/mmpbsa_mpi.sh``, and submit it to the queue:

 .. code-block:: bash

   cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
   sbatch mmpbsa_mpi.sh

Upon successful submission, the standard output will be directed by SLURM into the file ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out`` (where ``%j`` stands for the SLURM job ID), while the standard error output will be stored in ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err``.

Python single-threaded tools
............................

.. note:: Our AmberTools builds include some Python-based tools that run on a single CPU core using one thread, such as ``MMPBSA.py`` and ``ante-MMPBSA.py``. They are compatible with Python 3.12.

When loading the environment module ``ambertools``, Python 3.12 is loaded automatically and it is provieded by ``anaconda3`` module (loaded as a dependency of ``ambertools``).

This script is used for **Python-based tools** that may run serially or with limited parallelism:

- ``MMPBSA.py``: Serial MM-PBSA/MM-GBSA binding free energy calculations
- ``ante-MMPBSA.py``: Pre-processing for MMPBSA
- Other Python tools in AmberTools

 .. code:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn         # Partition of compute nodes
   #SBATCH --job-name=mmpbsa_single_threaded   # Job name
   #SBATCH --time=00:30:00        # WallTime - set it accordingly

   #SBATCH --account=<specify_your_slurm_account_name_here>
   #SBATCH --qos=<specify_the_qos_name_here_if_it_is_not_the_default_one_for_the_account>

   #SBATCH --nodes=1              # One node
   #SBATCH --ntasks=1             # One task per node
   #SBATCH --cpus-per-task=1      # One CPU per task
   #SBATCH --mem=2G               # Memory per task (increase if needed, typically 2-16G for small to medium systems)

   #SBATCH -o slurm.%j.out        # STDOUT
   #SBATCH -e slurm.%j.err        # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run MMPBSA.py
   MMPBSA.py -O -i mmpbsa.in -o mmpbsa.dat -sp complex.prmtop -cp complex.prmtop -lp ligand.prmtop -rp receptor.prmtop -y trajectory.nc

In the script above, edit the parameters and resources required for successfully running and completing the job.

For Python tools:
 - Use ``--ntasks=1`` and ``--cpus-per-task=N`` where N is the number of CPU cores (typically 4-16)
 - Some Python tools may use internal parallelisation
 - Adjust ``--mem`` based on system size and tool requirements

Save the complete SLURM job description as a file, for example ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/mmpbsa.sh``, and submit it to the queue:

 .. code-block:: bash

   cd /valhalla/projects/<your_slurm_project_account_name>/run_ambertools/
   sbatch sander_serial.sh

Upon successful submission, the standard output will be directed by SLURM into the file ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.out`` (where ``%j`` stands for the SLURM job ID), while the standard error output will be stored in ``/valhalla/projects/<your_slurm_project_account_name>/run_ambertools/slurm.%j.err``.

Scalling and performance considerations
---------------------------------------

Here we provide some guidelines and considerations for scaling and performance of the tools. Always consider the specific system and workload when determining the optimal number of threads and resources. If you are not sure, please contact the Discoverer HPC team (see :doc:`help`).

OpenMP scaling considerations
.............................

OpenMP parallelisation efficiency depends on several factors:

1. Algorithm parallelisation: Some algorithms parallelise better than others. Not all code sections may benefit from threading.

2. Memory bandwidth: As the number of threads increases, memory bandwidth may become a bottleneck, limiting scaling.

3. Cache coherence: False sharing and cache line conflicts can degrade performance with too many threads.

4. Problem size: Small problems may not benefit from many threads due to overhead. Larger problems typically scale better.

5. NUMA topology: Thread placement across NUMA domains affects performance. Use ``OMP_PROC_BIND`` and ``OMP_PLACES`` to control placement.

Testing OpenMP scaling
.......................

To determine the optimal number of OpenMP threads for your workload:

1. Start with fewer threads: Begin testing with 8-16 threads, then gradually increase.

2. Test multiple configurations: Run the same workload with different thread counts (e.g., 8, 16, 32, 64) and compare wall-clock times.

3. Monitor performance: Check the output logs for:
   - Wall-clock time (total execution time)
   - CPU utilisation (are all threads being used?)
   - Memory bandwidth utilisation

4. Calculate speedup: Speedup = Time(serial) / Time(threads). Efficiency = Speedup / Number_of_threads. Aim for efficiency > 50%.

5. Watch for diminishing returns: If doubling threads doesn't reduce runtime by at least 1.5×, you've likely hit diminishing returns.

Example scaling test script:

 .. code:: bash

   #!/bin/bash
   # Test OpenMP scaling by running the same job with different thread counts
   
   for threads in 8 16 32 64; do
       echo "Testing with ${threads} threads"
       
       #SBATCH --cpus-per-task=${threads}
       # ... rest of SLURM directives ...
       
       export OMP_NUM_THREADS=${threads}
       time sander.OMP -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0
       
       echo "Completed ${threads} threads test"
   done

Recommended starting points:

- Small systems (<50k atoms): Start with 8-16 threads
- Medium systems (50k-100k atoms): Start with 16-32 threads  
- Large systems (>100k atoms): Start with 32-64 threads

.. note:: **Do not over-subscribe CPU cores!** Set ``--cpus-per-task`` to no more than the number of physical CPU cores available on the compute node. Over-subscription (using more threads than cores) typically degrades performance due to context switching overhead.


CPU thread affinity and pinning
-------------------------------

Our build of AmberTools uses Open MPI's thread affinity management:

1. Open MPI binding: Use ``--map-by socket:PE=${OMP_NUM_THREADS}`` to bind MPI processes to sockets with PE (Processing Element) threads per MPI rank
2. Core binding: Use ``--bind-to core`` to bind each MPI rank to a CPU core
3. Thread affinity: OpenMP environment variables (``OMP_PROC_BIND``, ``OMP_PLACES``) control OpenMP thread affinity within each MPI rank

.. note:: For OpenMP-only tools (e.g., ``sander.OMP``, ``cpptraj.OMP``, ``mdgx.OMP``), thread affinity settings are particularly important for performance. See `OpenMP scaling considerations`_ for detailed guidelines on testing and optimising OpenMP thread affinity and scaling.

Recommended OpenMPI settings:

 .. code-block:: bash

    mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
           --bind-to core \
           --report-bindings \
           sander.MPI -O -i mdin -p prmtop.0 -o out.0 -c inpcrd.0 -r restrt.0

Open MPI binding options:

   - ``--map-by socket:PE=${OMP_NUM_THREADS}``: Maps MPI processes to sockets with PE threads per rank
   - ``--bind-to core``: Binds each MPI rank to a CPU core
   - ``--report-bindings``: Shows CPU binding (useful for debugging)

OpenMP thread affinity settings (when combined with Open MPI settings):

   - ``OMP_PROC_BIND=close``: Binds threads close to parent MPI process
   - ``OMP_PLACES=cores``: Places threads on cores

Resource allocation guidelines
------------------------------

These resource allocation guidelines are specifically for **MPI-parallel AmberTools executables** that run across multiple compute nodes:

- ``sander.MPI``: Multi-node molecular dynamics simulations
- ``sander.LES.MPI``: Multi-node LES simulations
- ``cpptraj.MPI``: Multi-node trajectory analysis
- ``MMPBSA.py.MPI``: Multi-node binding free energy calculations
- ``sqm.MPI``: Multi-node QM calculations
- ``mdgx.MPI``: Multi-node geometry/topology processing

For achieving optimal performance when running these MPI-parallel AmberTools on Discoverer CPU cluster, you should follow the following guidelines. For details on CPU thread affinity and process pinning, see `CPU thread affinity and pinning`_.

.. note:: **Serial and OpenMP tools** (``sander``, ``sander.OMP``, ``cpptraj``, ``cpptraj.OMP``, etc.) typically run on single nodes or workstations and do not require the multi-node resource allocation strategies described here. For single-node OpenMP tools, use ``--cpus-per-task`` equal to the number of OpenMP threads you want (typically the number of cores available on a single node).

.. list-table:: Recommended SLURM resource allocation
   :header-rows: 1

   * - Scenario
     - Nodes
     - Tasks/Node
     - Tasks/Socket
     - CPUs/Task
     - Total Cores
     - Use Case

   * - Small system
     - 1
     - 32
     - 16
     - 1
     - 32
     - <50k atoms

   * - Medium system
     - 2
     - 64
     - 32
     - 1
     - 128
     - 50k-100k atoms

   * - Large system
     - 4
     - 64
     - 32
     - 1
     - 256
     - 100k-200k atoms

   * - Very large system
     - 8+
     - 64
     - 32
     - 1
     - 512+
     - >200k atoms

Guidelines:

   - Number of nodes: Start with 1-2 nodes for small systems, scale up for larger systems
   - Tasks per node: Use 32-64 MPI tasks per node depending on system size
   - Tasks per socket: Set to distribute tasks evenly across NUMA domains (32 tasks per socket for 64 tasks/node)
   - CPUs per task: Always use 1 (pure MPI mode) since OpenMP usage is typically minimal
   - Memory: Do not exceed 251G per node on Discoverer CPU cluster

Total resource allocation calculations:

   - Total MPI ranks = nodes × tasks-per-node
   - Total CPU cores = nodes × tasks-per-node × cpus-per-task
   - Example: 2 nodes × 64 tasks/node × 1 cpu/task = 128 cores

Build information
-----------------

Build recipes, build logs, and build documentation for the AmberTools builds provided on Discoverer are available at the `AmberTools build repository <https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/AmberTools>`_.

Getting help
------------

See :doc:`help`

.. _`AmberTools website`: https://ambermd.org/AmberTools.php
.. _`AMBER`: https://ambermd.org/
.. _`Open MPI`: https://www.open-mpi.org/
.. _`PLUMED`: https://www.plumed.org/
.. _`UCX`: https://openucx.org/