pmemd.MPI (CPU)
===============

.. toctree::
   :maxdepth: 1
   :caption: Contents:

.. contents:: Table of Contents
   :depth: 3

About
-----

According to the `AMBER`_ website, ``pmemd`` (Particle Mesh Ewald Molecular Dynamics) is the performance-optimised molecular dynamics engine from the AMBER suite. It provides better performance on multiple CPUs and dramatic speed improvements on GPUs compared to the ``sander`` molecular dynamics code in AmberTools. ``pmemd.MPI`` extends ``pmemd`` with MPI (Message Passing Interface) parallelisation, enabling multi-node simulations across distributed memory systems.

This document describes running ``pmemd.MPI`` on Discoverer CPU cluster.

Documentation about how to use Amber is available here:

https://ambermd.org/Manuals.php

Licence
-------

.. warning::

   You can execute simulations using AMBER ``pmemd.MPI`` free of charge only if
   you are a non-commercial user, or a commercial user with a verified licensing
   agreement with AmberMD. Access to ``pmemd.MPI`` will be granted based on an
   assessment of your eligibility. If you are an academic user and you use
   ``pmemd.MPI`` for research purposes, you are eligible for free access to
   ``pmemd.MPI``.

   You can find the licensing terms on the download page of the `AMBER`_
   website. Read the licence agreement very carefully before applying for
   access to ``pmemd.MPI`` on Discoverer CPU cluster.

   If we discover abuse of the licence, the perpetrator will be banned from
   using Discoverer CPU cluster and the legal owner of the licence will be
   informed about the abuse.

Versions available
------------------

Currently we support the following versions of AMBER ``pmemd.MPI``:

- 24, as part of the AmberTools24 build.

Therefore, if you want to use ``pmemd.MPI``, you need to load the AmberTools24
module:

.. code-block:: bash

   module load ambertools/24/24.0

Once you are approved to use ``pmemd.MPI``, your username will be added to the
POSIX group ``pmemd_mpi`` and you will be able to execute the ``pmemd.MPI``
binary.

.. warning::

   We log the usage of ``pmemd.MPI``, even if you bring your own build of AMBER
   and even if you rename the executable from ``pmemd.MPI`` to something else.
   If we establish that such an execution is not in accordance with the
   licensing terms, you will be banned from using Discoverer CPU cluster and
   the legal owner of the licence will be informed about the abuse.

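Once your access is approved, a quick way to verify the setup is to load the
module and check both the executable and your group membership. This is a
minimal sanity check using the module and group names given above:

.. code-block:: bash

   # Load the AmberTools 24 module and confirm that pmemd.MPI resolves on the PATH
   module load ambertools/24/24.0
   which pmemd.MPI

   # Check whether your username is already a member of the pmemd_mpi POSIX group
   id -nG | grep -qw pmemd_mpi && echo "pmemd_mpi: OK" || echo "pmemd_mpi: not granted yet"
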
Running simulations
-------------------

Running simulations means invoking ``pmemd.MPI`` to generate trajectories based
on the given input files (mdin, prmtop, inpcrd, etc.).

.. warning::

   **You MUST NOT execute simulations directly upon the login node
   (login.discoverer.bg).** You have to run your simulations as SLURM jobs
   only.

.. warning::

   Write your trajectories and analysis results only inside your
   :doc:`projectfolder` and DO NOT, under any circumstances, use your
   :doc:`homefolder` for that purpose!

Below is an example of a SLURM job script for running ``pmemd.MPI`` on
Discoverer CPU cluster:

.. code-block:: bash

   #!/bin/bash
   #
   #SBATCH --partition=cn            # Partition (you may need to change this)
   #SBATCH --job-name=pmemd_mpi      # Job name
   #SBATCH --time=512:00:00          # WallTime - set it accordingly
   #SBATCH --account=<your_account>  # Your project account
   #SBATCH --qos=<your_qos>          # Your quality of service
   #SBATCH --nodes=2                 # Number of nodes
   #SBATCH --ntasks-per-node=64      # Number of MPI tasks to run upon each node
   #SBATCH --ntasks-per-socket=32    # Number of MPI tasks per socket
   #SBATCH --cpus-per-task=1         # Number of OpenMP threads per MPI rank (recommended: 1 for pure MPI)
   #SBATCH --ntasks-per-core=1       # At most one MPI rank per CPU core
   #SBATCH --mem=251G                # Do not exceed this on Discoverer CPU cluster
   #SBATCH -o slurm.%j.out           # STDOUT
   #SBATCH -e slurm.%j.err           # STDERR

   # Load required modules
   module purge || exit
   module load ambertools/24/24.0 || exit

   # Set OpenMP environment variables (REQUIRED by pmemd.MPI)
   # pmemd.MPI will exit if OMP_NUM_THREADS is not set, even if OpenMP is not actually used
   # NOTE: OpenMP usage in pmemd.MPI is very limited:
   # - Velocity/kinetic energy OpenMP only if usemidpoint=false in mdin file
   # - PME force OpenMP only if usemidpoint=true in mdin file
   # - Bonded force OpenMP only on MPI rank 0 (master), all other ranks use serial code
   # In most runs, OpenMP is NOT used, so OMP_NUM_THREADS=1 is recommended
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=close        # Bind threads close to parent MPI process
   export OMP_PLACES=cores           # Place threads on cores

   # Uncomment to verify OpenMP usage (if any):
   # export OMP_DISPLAY_ENV=VERBOSE

   # Optimise InfiniBand communication (if available)
   export UCX_NET_DEVICES=mlx5_0:1

   # Change to submission directory
   cd ${SLURM_SUBMIT_DIR}

   # Run pmemd.MPI with Open MPI
   # --map-by socket:PE=${OMP_NUM_THREADS} maps MPI processes to sockets,
   #   reserving PE (processing element) cores per MPI rank
   # --bind-to core binds each MPI rank to a CPU core
   # --report-bindings shows the CPU binding (useful for debugging)
   mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
          --bind-to core \
          --report-bindings \
          pmemd.MPI -O -i mdin -p prmtop.0 -o out.0 < dummy

In the script above, edit the parameters and resources required for
successfully running and completing the job. For detailed guidelines on
optimal resource allocation, see `Resource allocation guidelines`_.

- SLURM partition of compute nodes (``--partition``): Specifies which group of
  nodes (partition) to use. For AMBER on Discoverer, use the ``cn`` partition,
  which contains the CPU-optimised nodes.
- Job name (``--job-name``): A descriptive name for your job that will appear
  in the queue. Use meaningful names like ``pmemd_protein_sim`` or
  ``pmemd_membrane_run``.
- Wall time (``--time``): Maximum time your job can run. The format is
  ``HH:MM:SS`` (e.g. ``48:00:00`` for 48 hours). Set this based on your
  simulation size and expected runtime.
- Number of compute nodes (``--nodes``): How many physical nodes to allocate.
  For multi-node AMBER simulations, this determines the total computational
  power available. See `Resource allocation guidelines`_ for recommendations
  based on system size.
- Number of MPI processes per node (``--ntasks-per-node``): Critical for AMBER
  performance. On Discoverer, with 8 NUMA domains per node, use 64 MPI tasks
  to get 8 tasks per NUMA domain for optimal memory locality. See
  `Resource allocation guidelines`_ for recommended values based on system size.
- Number of MPI tasks per socket (``--ntasks-per-socket``): Essential for
  NUMA-aware performance. Set it to 32 to place exactly 32 MPI tasks per
  socket (64 tasks per node ÷ 2 sockets per node = 32 per socket). This
  ensures optimal memory access patterns and cache utilisation within each
  NUMA boundary.
- Number of OpenMP threads per MPI process (``--cpus-per-task``): Controls
  hybrid parallelism. **The recommended value is 1** (pure MPI mode), since
  OpenMP usage in ``pmemd.MPI`` is very limited. See
  `OpenMP usage in pmemd.MPI`_ for details. For information on how this
  affects CPU thread affinity and pinning, see
  `CPU thread affinity and pinning`_.
- AMBER version (``module load``): Choose the appropriate version based on
  your simulation requirements. See `Versions available`_ for available builds
  and their characteristics.

Save the complete SLURM job description as a file, for example
``/valhalla/projects/<your_project>/run_amber/pmemd_mpi.sh``, and submit it to
the queue:

.. code-block:: bash

   cd /valhalla/projects/<your_project>/run_amber/
   sbatch pmemd_mpi.sh

Upon successful submission, the standard output will be directed by SLURM into
the file ``/valhalla/projects/<your_project>/run_amber/slurm.%j.out`` (where
``%j`` stands for the SLURM job ID), while the standard error output will be
stored in ``/valhalla/projects/<your_project>/run_amber/slurm.%j.err``.

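After submission you can follow the job from the command line with the
standard SLURM tools. The example below assumes the job name used in the
script above and uses ``<job_id>`` as a placeholder for the ID reported by
``sbatch``:

.. code-block:: bash

   # List your pending and running jobs with the job name from the example script
   squeue -u ${USER} --name=pmemd_mpi

   # After the job finishes, inspect its runtime, state, and exit code
   sacct -j <job_id> --format=JobID,JobName,Elapsed,State,ExitCode
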
OpenMP usage in pmemd.MPI
-------------------------

.. important::

   OpenMP usage in ``pmemd.MPI`` is **very limited** and highly conditional.
   Even though ``OMP_NUM_THREADS`` must be set (``pmemd.MPI`` exits if it is
   not), OpenMP may not actually be used in typical runs.

Here is a summary of the conditions under which OpenMP is used in
``pmemd.MPI``:

1. Bonded forces OpenMP only on the master: Only MPI rank 0 (the master) uses
   OpenMP for bonded force calculations. All other MPI ranks use serial code.
2. Mutually exclusive conditions:

   - Velocity/kinetic energy OpenMP requires ``usemidpoint = false`` in the
     mdin file
   - PME force OpenMP requires ``usemidpoint = true`` in the mdin file
   - These cannot both be true at the same time!

3. Where OpenMP is used:

   - Velocity updates: only if ``usemidpoint = false``
   - Kinetic energy: only if ``usemidpoint = false``
   - Bonded forces: only on MPI rank 0 (the master)
   - PME forces: only if ``usemidpoint = true``

4. Where OpenMP is NOT used:

   - The main PME force calculations (PME direct and reciprocal space) are
     MPI-parallelised only
   - All MPI communication is handled by MPI processes

Recommendations:

1. Use ``OMP_NUM_THREADS=1``: Since OpenMP usage is minimal and conditional,
   use pure MPI mode (``--cpus-per-task=1``)
2. Check your mdin file: Verify whether ``usemidpoint`` is set (it affects
   OpenMP usage)
3. Monitor actual OpenMP usage: Use ``OMP_DISPLAY_ENV=VERBOSE`` to see whether
   OpenMP is actually used
4. Use more MPI ranks: Since OpenMP is limited, maximise MPI parallelism
   instead

Therefore, in most cases, to achieve optimal performance you should use
``--cpus-per-task=1`` (pure MPI) instead of hybrid MPI+OpenMP.

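To check points 2 and 3 of the recommendations above in practice, you can
inspect the input file and ask the OpenMP runtime to report what it actually
does. The file name ``mdin`` follows the example script above:

.. code-block:: bash

   # Check whether usemidpoint is set in the input file; its value decides
   # which (if any) of the limited OpenMP code paths can be taken
   grep -i usemidpoint mdin

   # Ask the OpenMP runtime to print its effective settings at start-up;
   # the report ends up in the job's standard error file (slurm.%j.err)
   export OMP_DISPLAY_ENV=VERBOSE
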
CPU thread affinity and pinning
-------------------------------

Our build of ``pmemd.MPI`` relies on Open MPI's thread affinity management:

1. Open MPI mapping: Use ``--map-by socket:PE=${OMP_NUM_THREADS}`` to map MPI
   processes to sockets, reserving PE (processing element) cores per MPI rank
2. Core binding: Use ``--bind-to core`` to bind each MPI rank to a CPU core
3. Thread affinity: OpenMP environment variables (``OMP_PROC_BIND``,
   ``OMP_PLACES``) control OpenMP thread affinity within each MPI rank

Recommended Open MPI settings:

.. code-block:: bash

   mpirun --map-by socket:PE=${OMP_NUM_THREADS} \
          --bind-to core \
          --report-bindings \
          pmemd.MPI -O -i mdin -p prmtop.0 -o out.0

Open MPI binding options:

- ``--map-by socket:PE=${OMP_NUM_THREADS}``: Maps MPI processes to sockets,
  reserving PE cores per rank
- ``--bind-to core``: Binds each MPI rank to a CPU core
- ``--report-bindings``: Shows the CPU binding (useful for debugging)

OpenMP thread affinity settings (in combination with the Open MPI settings):

- ``OMP_PROC_BIND=close``: Binds threads close to the parent MPI process
- ``OMP_PLACES=cores``: Places threads on cores

.. note::

   Since OpenMP usage in ``pmemd.MPI`` is very limited (see
   `OpenMP usage in pmemd.MPI`_), the thread affinity settings may not have a
   significant impact in practice. However, they must be set correctly for the
   code to run.

Resource allocation guidelines
------------------------------

To achieve optimal performance when running ``pmemd.MPI`` on Discoverer CPU
cluster, follow the guidelines below. For details on CPU thread affinity and
process pinning, see `CPU thread affinity and pinning`_.

.. list-table:: Recommended SLURM resource allocation
   :header-rows: 1

   * - Scenario
     - Nodes
     - Tasks/Node
     - Tasks/Socket
     - CPUs/Task
     - Total Cores
     - Use Case
   * - Small system
     - 1
     - 32
     - 16
     - 1
     - 32
     - <50k atoms
   * - Medium system
     - 2
     - 64
     - 32
     - 1
     - 128
     - 50k-100k atoms
   * - Large system
     - 4
     - 64
     - 32
     - 1
     - 256
     - 100k-200k atoms
   * - Very large system
     - 8+
     - 64
     - 32
     - 1
     - 512+
     - >200k atoms

Guidelines:

- Number of nodes: Start with 1-2 nodes for small systems and scale up for
  larger systems
- Tasks per node: Use 32-64 MPI tasks per node, depending on system size
- Tasks per socket: Set it to distribute tasks evenly across the NUMA domains
  (32 tasks per socket for 64 tasks/node)
- CPUs per task: Always use 1 (pure MPI mode), since OpenMP usage is minimal
  (see `OpenMP usage in pmemd.MPI`_)
- Memory: Do not exceed 251G per node on Discoverer CPU cluster

Total resource allocation calculations (illustrated by the snippet after this
list):

- Total MPI ranks = nodes × tasks-per-node
- Total CPU cores = nodes × tasks-per-node × cpus-per-task
- Example: 2 nodes × 64 tasks/node × 1 cpu/task = 128 cores

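As an illustration of these calculations, the snippet below can be placed in
the job script to print the size of the current allocation from the
environment variables SLURM exports to every job. It assumes the script
requests resources with ``--ntasks-per-node`` and ``--cpus-per-task``, as in
the example above:

.. code-block:: bash

   # Print the size of the current allocation (SLURM exports these variables
   # inside every job); useful as a quick sanity check before calling mpirun
   echo "Nodes:           ${SLURM_JOB_NUM_NODES}"
   echo "Tasks per node:  ${SLURM_NTASKS_PER_NODE}"
   echo "Total MPI ranks: ${SLURM_NTASKS}"
   echo "Total CPU cores: $(( SLURM_NTASKS * SLURM_CPUS_PER_TASK ))"
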
Domain decomposition
--------------------

AMBER's ``pmemd.MPI`` automatically handles domain decomposition:

1. Automatic decomposition: ``pmemd.MPI`` automatically decomposes the system
   across the MPI ranks
2. Load balancing: The decomposition algorithm balances the load across all
   MPI ranks
3. Communication optimisation: PME calculations are optimised for the
   decomposition

Grid selection guidelines:

When planning multi-node MPI simulations, consider the following factors. For
detailed resource allocation recommendations, see
`Resource allocation guidelines`_.

System size guidelines:

- Small systems (<50k atoms): 1-2 nodes are sufficient
- Medium systems (50k-100k atoms): 2-4 nodes recommended
- Large systems (>100k atoms): 4-8 nodes optimal
- Very large systems (>200k atoms): 8+ nodes required

Communication overhead considerations:

- More nodes = more MPI communication overhead
- Balance the benefits of parallelisation against the communication costs
- Keep the communication on the InfiniBand fabric by setting
  ``export UCX_NET_DEVICES=mlx5_0:1``

Build recipe
------------

You can find the complete build recipe and build documentation used to compile
the ``pmemd.MPI`` executable installed on Discoverer CPU cluster here:

https://gitlab.discoverer.bg/vkolev/recipes/-/tree/main/Amber/24/MPI

Getting help
------------

See :doc:`help`

.. _`AMBER`: https://ambermd.org/
.. _`Open MPI`: https://www.open-mpi.org/
.. _`UCX`: https://openucx.org/