ML/AI Workloads on Discoverer CPU Cluster
=========================================

Table of contents
-----------------


-  :ref:`Discoverer CPU partition compute nodes specifications <discoverer-cpu-partition-compute-nodes-specifications>`
-  :ref:`CPU cluster and ML/AI <cpu-cluster-and-mlai>`

   -  :ref:`Distributed training as a basic principle <distributed-training-as-a-basic-principle>`
   -  :ref:`Considerations for ML/AI on CPU clusters <considerations-for-mlai-on-cpu-clusters>`
   -  :ref:`Workload types suitable for CPU clusters <workload-types-suitable-for-cpu-clusters>`
   -  :ref:`Limitations and considerations <limitations-and-considerations>`

-  :ref:`Virtual environment setup <virtual-environment-setup>`

   -  :ref:`Use conda from the anaconda3 module <use-conda-from-the-anaconda3-module>`
   -  :ref:`Common setup: Determine Python version compatibility <common-setup-determine-python-version-compatibility>`
   -  :ref:`Common setup: Create virtual environment with compatible Python version <common-setup-create-virtual-environment-with-compatible-python-version>`
   -  :ref:`Install PyTorch-CPU <install-pytorch-cpu>`

      -  :ref:`Optional: Search for available PyTorch CPU packages <optional-search-for-available-pytorch-cpu-packages>`
      -  :ref:`Installation script <installation-script-pytorch>`

   -  :ref:`Verify PyTorch-CPU installation <verify-pytorch-cpu-installation>`
   -  :ref:`Examine available PyTorch methods, functions, and constants <examine-available-pytorch-methods-functions-and-constants>`
   -  :ref:`Install TensorFlow-CPU <install-tensorflow-cpu>`

      -  :ref:`Optional: Search for available TensorFlow packages <optional-search-for-available-tensorflow-packages>`
      -  :ref:`Installation script <installation-script-tensorflow>`

   -  :ref:`Verify TensorFlow installation <verify-tensorflow-installation>`

      -  :ref:`Examine available TensorFlow methods, functions, and constants <examine-available-tensorflow-methods-functions-and-constants>`

-  :ref:`Create SLURM configuration for ML/AI workloads on CPU clusters <create-slurm-configuration-for-mlai-workloads-on-cpu-clusters>`

   -  :ref:`1. Resource allocation <1-resource-allocation>`
   -  :ref:`2. Job arrays for parallel experiments <2-job-arrays-for-parallel-experiments>`
   -  :ref:`3. Memory management <3-memory-management>`
   -  :ref:`5. Checkpointing <5-checkpointing>`

-  :ref:`Suitable ML/AI workloads for CPU clusters <suitable-mlai-workloads-for-cpu-clusters>`

   -  :ref:`1. Traditional machine learning <1-traditional-machine-learning>`

      -  :ref:`Scikit-learn workflows <scikit-learn-workflows>`
      -  :ref:`XGBoost/LightGBM/CatBoost <xgboostlightgbmcatboost>`

   -  :ref:`2. Deep learning (CPU-optimized) <2-deep-learning-cpu-optimized>`

      -  :ref:`TensorFlow/Keras with CPU optimization <tensorflowkeras-with-cpu-optimization>`
      -  :ref:`PyTorch with CPU backend <pytorch-with-cpu-backend>`

   -  :ref:`3. Natural language processing (NLP) <3-natural-language-processing-nlp>`

      -  :ref:`Traditional NLP pipelines <traditional-nlp-pipelines>`
      -  :ref:`Transformer models (CPU inference) <transformer-models-cpu-inference>`

   -  :ref:`4. Computer vision <4-computer-vision>`

      -  :ref:`Image processing and feature extraction <image-processing-and-feature-extraction>`

   -  :ref:`5. Data science and analytics <5-data-science-and-analytics>`

      -  :ref:`Large-scale data processing <large-scale-data-processing>`

   -  :ref:`6. Hyperparameter optimization <6-hyperparameter-optimization>`

      -  :ref:`Hyperparameter optimization <hyperparameter-optimization>`

   -  :ref:`7. Reinforcement learning <7-reinforcement-learning>`

      -  :ref:`CPU-based RL training <cpu-based-rl-training>`

-  :ref:`Distributed training <distributed-training>`

   -  :ref:`Recommended approach: Use MPI for HPC clusters <recommended-approach-use-mpi-for-hpc-clusters>`

      -  :ref:`Why MPI over Gloo? <why-mpi-over-gloo>`

   -  :ref:`TensorFlow distributed training with Horovod <tensorflow-distributed-training-with-horovod>`
   -  :ref:`Complete working example: Distributed training on MNIST dataset with TensorFlow <complete-working-example-distributed-training-on-mnist-dataset-with-tensorflow>`

      -  :ref:`Overview <overview-tensorflow>`
      -  :ref:`Step 1: Prepare the working directory <step-1-prepare-the-working-directory-tensorflow>`
      -  :ref:`Step 2: Get the training script <step-2-get-the-training-script-tensorflow>`
      -  :ref:`Step 3: Get the SLURM batch script <step-3-get-the-slurm-batch-script-tensorflow>`
      -  :ref:`Step 4: Submit the job <step-4-submit-the-job>`
      -  :ref:`Step 5: Monitor the job <step-5-monitor-the-job-tensorflow>`
      -  :ref:`Step 6: Check the results <step-6-check-the-results>`
      -  :ref:`Differences from PyTorch example <differences-from-pytorch-example>`
      -  :ref:`Troubleshooting <troubleshooting-tensorflow>`
      -  :ref:`Adapting for custom datasets <adapting-for-custom-datasets-tensorflow>`

-  :ref:`Running PyTorch workloads <running-pytorch-workloads>`

   -  :ref:`Package installation for distributed training <package-installation-for-distributed-training>`
   -  :ref:`PyTorch distributed training with MPI <pytorch-distributed-training-with-mpi>`

      -  :ref:`Option 1: PyTorch native DDP with MPI backend <option-1-pytorch-native-ddp-with-mpi-backend>`
      -  :ref:`Option 2: Horovod with PyTorch (recommended for simplicity) <option-2-horovod-with-pytorch-recommended-for-simplicity>`

   -  :ref:`Complete working example: Distributed training on MNIST dataset <complete-working-example-distributed-training-on-mnist-dataset>`

      -  :ref:`Overview <overview-pytorch>`
      -  :ref:`Step 1: Prepare the working directory <step-1-prepare-the-working-directory-pytorch>`
      -  :ref:`Step 2: Get the training script <step-2-get-the-training-script-pytorch>`
      -  :ref:`Step 3: Get the SLURM batch script <step-3-get-the-slurm-batch-script-pytorch>`
      -  :ref:`Step 4: Make the script executable and submit the job <step-4-make-the-script-executable-and-submit-the-job>`
      -  :ref:`Step 5: Monitor the job <step-5-monitor-the-job-pytorch>`
      -  :ref:`Step 6: Check job runtime and statistics <step-6-check-job-runtime-and-statistics>`
      -  :ref:`Scalability testing results <scalability-testing-results>`
      -  :ref:`Achieving better accuracy <achieving-better-accuracy>`
      -  :ref:`Understanding the data flow <understanding-the-data-flow>`
      -  :ref:`Expected output <expected-output>`
      -  :ref:`Understanding process structure <understanding-process-structure>`
      -  :ref:`Troubleshooting <troubleshooting-pytorch>`
      -  :ref:`Adapting for custom datasets <adapting-for-custom-datasets-pytorch>`
      -  :ref:`Scaling analysis: Why different node counts require different numbers of epochs <scaling-analysis-why-different-node-counts-require-different-numbers-of-epochs>`

-  :ref:`Summary: Best practices for distributed training on Discoverer <summary-best-practices-for-distributed-training-on-discoverer>`
-  :ref:`References <references>`
-  :ref:`Getting help <getting-help>`

.. _discoverer-cpu-partition-compute-nodes-specifications:

Discoverer CPU partition compute nodes specifications
-----------------------------------------------------

For more datails see `Discoverer Reseource Overview <https://docs.discoverer.bg/resource_overview.html>`__:

.. _cpu-cluster-and-mlai:

CPU cluster and ML/AI
---------------------

CPU clusters like Discoverer are well-suited for ML/AI workloads, particularly when leveraging distributed training across multiple nodes. The large number of CPU cores (144,384 hardware cores, 288,768 logical CPUs with hyperthreading) and high-speed InfiniBand interconnect (200 Gbps per node) enable efficient parallel processing of machine learning tasks.

.. _distributed-training-as-a-basic-principle:

Distributed training as a basic principle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Distributed training is fundamental for effectively utilizing CPU clusters for ML/AI workloads. Rather than training models on a single node, distributed training divides the workload across multiple nodes and processes, enabling:

-  Faster training times: Multiple nodes work simultaneously on different data shards
-  Scalability: Training can scale from a few nodes to hundreds of nodes depending on the workload
-  Efficient resource allocation: Enables allocation of CPU cores across multiple nodes
-  Large dataset handling: Enables training on datasets that may not fit in a single node’s memory

On Discoverer, distributed training uses MPI (Message Passing Interface) for inter-node communication, which is the standard approach for HPC clusters. Frameworks like Horovod provide a unified interface for distributed training with both PyTorch and TensorFlow, simplifying the implementation of distributed training workflows.

.. _considerations-for-mlai-on-cpu-clusters:

Considerations for ML/AI on CPU clusters

1. Data sharding: In distributed training, datasets are divided (sharded) across all worker processes. Each process sees only a subset of the data per epoch, which affects convergence behavior and requires careful tuning of learning rates and epoch counts.

2. Communication overhead: While distributed training speeds up training, communication between nodes adds overhead. The optimal number of nodes depends on the specific workload, dataset size, and model complexity.

3. Resource allocation: Full node allocation (128 tasks per node, using all 256 logical CPUs) maximizes training efficiency. Partial allocation wastes resources and increases training time.

4. Storage requirements: All data and outputs must be stored in project directories (``/valhalla/projects/<your_slurm_project_account_name>/``), never in home directories or ``/tmp``.

5. Python version compatibility: Python 3.11 is currently the optimal version for ML/AI workloads on Discoverer, providing best compatibility with TensorFlow, PyTorch, and Horovod.

.. _workload-types-suitable-for-cpu-clusters:

Workload types suitable for CPU clusters

CPU clusters excel at: - Traditional machine learning (scikit-learn, XGBoost, etc.) - Deep learning with distributed training (PyTorch, TensorFlow) - Large-scale data preprocessing and feature engineering - Hyperparameter optimization with parallel trials - CPU-optimized inference workloads - Research and development workflows

For production training of very large models or extremely large datasets, GPU clusters may provide better performance, but CPU clusters are highly effective for many ML/AI workloads when properly configured for distributed training.

.. _limitations-and-considerations:

Limitations and considerations

Workloads suitable for CPU clusters: - Traditional ML algorithms (scikit-learn, XGBoost) - Data preprocessing and feature engineering - CPU-optimized inference - Hyperparameter optimization - Smaller neural networks - Batch inference workloads

What’s better on GPU clusters (Discoverer+ GPU partition - Large-scale deep learning training - Transformer model training - Real-time inference requirements - Computer vision with large CNNs - High-throughput training workloads

.. _virtual-environment-setup:

Virtual environment setup
-------------------------

.. note:: Virtual environments replace the need for running containers and simplify the setup and installation process for ML/AI workloads on Discoverer CPU partition. They provide isolated Python environments with all necessary dependencies without the overhead of container management.

.. _use-conda-from-the-anaconda3-module:

Use conda from the anaconda3 module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All virtual environments on Discoverer must be managed by conda from the ``anaconda3`` environment module that is already installed on the system. Users must not download, install, or use their own Anaconda or Miniconda installations. Conda is always available on Discoverer through environment modules and must be accessed by loading the ``anaconda3`` module:

.. code-block:: bash

   module load anaconda3

After loading the module, the ``conda`` executable and the libraries that support that executable become available. All virtual environment creation, package installation, and management must be done using this system-provided Anaconda3 installation. Do not attempt to install Anaconda3/Miniconda3 yourself - ``conda`` tool is already available through the environment module system.

.. note:: Users who prefer containers can use Singularity/Apptainer on the Discoverer CPU partition, but this document does not cover container-based workflows. The examples and instructions in this guide focus on virtual environments as the recommended approach.

.. warning:: Do not use ``conda activate`` when working with virtual environments on Discoverer. Always use ``--prefix`` with environment variable exports to expose the environment. Set ``VIRTUAL_ENV`` to the path of the virtual environment directory (e.g., ``/valhalla/projects/<your_slurm_project_account_name>/virt_envs/torch``), then export:

.. code-block:: bash

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

See the examples below for complete virtual environment installation scripts.

The libmamba solver is recommended to suppoprt all ``conda``-driven installations because it:

  - Resolves dependencies faster than the default solver
  - Handles complex dependency conflicts more reliably
  - Reduces warnings about deprecated version specifications

.. _common-setup-determine-python-version-compatibility:

Common setup: Determine Python version compatibility

Before creating a virtual environment, determine the highest Python version supported by the target packages (pytorch-cpu or tensorflow-cpu). That my need some research or negative feedback organised installation procedure.

3.11 is the current optimal Python version for ML/AI workloads on Discoverer. TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.

.. note:: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions.

Since packages from conda-forge may not include Python version in the search metadata, test installation with different Python versions in different virtual environments (point to the selected environment by using ``--prefix``):

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=search_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=00:10:00
   #SBATCH -o search_pytorch.%j.out
   #SBATCH -e search_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Search for CPU-only PyTorch packages
   conda search -c conda-forge pytorch-cpu

   # Get detailed information
   conda search -c conda-forge pytorch-cpu --info

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=test_python_version
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=00:30:00
   #SBATCH -o test_python_version.%j.out
   #SBATCH -e test_python_version.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Test which Python version works by trying to install the target package
   # Replace PACKAGE_NAME with pytorch-cpu or tensorflow-cpu
   # Replace <your_slurm_project_account_name> with the actual project account name

   PACKAGE_NAME=pytorch-cpu  # or tensorflow-cpu

   # Note: Python 3.11 is the current optimal version for ML/AI workloads on Discoverer.
   # TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.
   # This script tests higher versions first, but Python 3.11 should be used unless testing shows
   # that a higher version is compatible. In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible
   # with higher Python versions.

   # Test Python 3.13 (to test compatibility with future versions)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py313
   [ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
   conda create --prefix ${VIRTUAL_ENV} python=3.13 --solver=libmamba -y
   if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
       echo "Python 3.13 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
       echo "A production environment can now be created with Python 3.13"
       exit 0
   else
       echo "Python 3.13 is not supported. Cleaning up test environment and trying lower version."
       conda env remove --prefix ${VIRTUAL_ENV} -y
   fi

   # Test Python 3.12 (to test compatibility with future versions)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py312
   [ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
   conda create --prefix ${VIRTUAL_ENV} python=3.12 --solver=libmamba -y
   if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
       echo "Python 3.12 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
       echo "A production environment can now be created with Python 3.12"
       exit 0
   else
       echo "Python 3.12 is not supported. Cleaning up test environment and trying lower version."
       conda env remove --prefix ${VIRTUAL_ENV} -y
   fi

   # Python 3.11 is the current optimal version - use this for production
   # Continue with 3.10, etc. only if 3.11 doesn't work

.. _common-setup-create-virtual-environment-with-compatible-python-version:

Common setup: Create virtual environment with compatible Python version

Once the compatible Python version is determined, create the virtual environment using the template SLURM batch script provided below.

.. warning:: Do not use ``conda activate`` when working with virtual environments on Discoverer. Always use ``export PATH=${VIRTUAL_ENV}/bin:${PATH}``, ``export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}`` to expose the environment to the shell.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=cn
   #SBATCH --job-name=install
   #SBATCH --time=00:30:00
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH -o install.%j.out
   #SBATCH -e install.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Export the path to the Python virtual environment folder
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

   # Check if the target folder already exists
   [ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }

   # Create Python virtual environment with Python 3.11
   # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
   # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
   # Note: In the future, those packages might become compatible with higher Python versions
   conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y

   if [ $? -ne 0 ]; then
     echo "Conda Python virtual environment creation failed" >&2
     exit 1
   fi

   echo "Python virtual environment successfully created!"

   # Fully expose the Python virtual environment
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # Verify Python version
   python --version

.. _install-pytorch-cpu:

Install PyTorch-CPU

On Discoverer CPU cluster, PyTorch-CPU must always be installed with ``nomkl`` because MKL (Intel Math Kernel Library) does not work well on AMD Zen2 processors (AMD EPYC 7H12 installed on Discoverer CPU compute nodes). The ``nomkl`` package ensures that MKL libraries are not used, preventing performance issues and compatibility problems on AMD architecture.

For distributed training with Horovod, install ``pytorch-cpu``, and ``torchvision`` packages together with ``horovod`` in the same command line to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed), and to install ``mpi4py`` as a dependency.

.. _optional-search-for-available-pytorch-cpu-packages:

Optional: Search for available PyTorch CPU packages

To check which PyTorch CPU versions are available before installing, save the SLURM batch script below as ``run_search_pytorch.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=search_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=00:10:00
   #SBATCH -o search_pytorch.%j.out
   #SBATCH -e search_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Search for CPU-only PyTorch packages
   conda search -c conda-forge pytorch-cpu

   # Get detailed information
   conda search -c conda-forge pytorch-cpu --info

Submit the job to the queue:

.. code-block:: bash

   sbatch run_search_pytorch.sh

and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).

.. _installation-script-pytorch:

Installation script

Save the SLURM batch script provided below as ``run_install_pytorch.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=install_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=01:30:00
   #SBATCH -o install_pytorch.%j.out
   #SBATCH -e install_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

   # Create virtual environment if it doesn't exist
   if [ ! -d ${VIRTUAL_ENV} ]; then
       # Create Python virtual environment with Python 3.11
       # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
       # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
       # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
       conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
       
       if [ $? -ne 0 ]; then
           echo "Conda Python virtual environment creation failed" >&2
           exit 1
       fi
       
       echo "Python virtual environment successfully created!"
   else
       echo "Virtual environment already exists at ${VIRTUAL_ENV}"
   fi

   # Expose the virtual environment (do not use conda activate)
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # For distributed training with Horovod:
   # Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
   conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba

   # For basic PyTorch installation (without distributed training):
   # Install PyTorch CPU version in nomkl mode (required)
   # torchvision is needed for datasets and image transformations
   # conda install --prefix ${VIRTUAL_ENV} -c conda-forge pytorch-cpu torchvision nomkl --solver=libmamba

   # Install additional packages if needed
   # conda install --prefix ${VIRTUAL_ENV} -c conda-forge transformers --solver=libmamba

   # Install multiple packages at once (pytorch-cpu must always include nomkl)
   # conda install --prefix ${VIRTUAL_ENV} -c conda-forge \
   #     pytorch-cpu \
   #     torchvision \
   #     nomkl \
   #     transformers \
   #     scikit-learn \
   #     pandas \
   #     numpy \
   #     --solver=libmamba

Submit the job to the queue:

.. code-block:: bash

   sbatch run_install_pytorch.sh

and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).

.. note::

   -  With ``conda`` as installer, the PyTorch package to install is named ``pytorch-cpu``, but within the Python code the installation is imported as a module under the name ``torch`` (``import torch``)
   -  ``torchvision`` is required for datasets (e.g., MNIST) and image transformations
   -  For distributed training: ``pytorch-cpu``, ``horovod``, and ``torchvision``\ must be installed together in the same command line to ensure Open MPI is used instead of MPICH and to bring ``mpi4py`` as a dependency
   -  Always include ``nomkl`` when installing ``pytorch-cpu`` on Discoverer CPU partition because of the used AMD Zen2 processor microarchitecture
   -  For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)

.. _verify-pytorch-cpu-installation:

Verify PyTorch-CPU installation

After installing ``pytorch-cpu``, verify the installation. Save the SLURM batch script below as ``run_verify_pytorch.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=verify_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH -o verify_pytorch.%j.out
   #SBATCH -e verify_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   # Ensure the virtual environment is exposed
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # Check PyTorch version
   python -c "import torch; print('PyTorch version:', torch.__version__)"

   # Check torchvision version
   python -c "import torchvision; print('torchvision version:', torchvision.__version__)"

   # Verify PyTorch is CPU-only (should return False for CUDA availability)
   python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

   # Check number of CPU threads PyTorch can use as if SMT is not there
   python -c "import torch; print('Number of threads:', torch.get_num_threads())"

   # Test basic tensor operations
   python -c "import torch; x = torch.randn(5, 3); print('Tensor shape:', x.shape); print('Tensor device:', x.device)"

Submit the job to the queue:

.. code-block:: bash

   sbatch run_verify_pytorch.sh

and follow the content written down in the standard output and standard error files (``verify_pytorch.JOBID.out`` and ``verify_pytorch.JOBID.err``) to see the verification results.

Expected output should show:

-  PyTorch version number
-  torchvision version number
-  CUDA available: False (since this is CPU-only)
-  Number of threads available (no SMT is taken into account)
-  Tensor operations working correctly

.. _examine-available-pytorch-methods-functions-and-constants:

Examine available PyTorch methods, functions, and constants

To inspect what methods, functions, and constants are available in the installed PyTorch, save the SLURM batch script below as ``run_examine_pytorch.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=examine_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH -o examine_pytorch.%j.out
   #SBATCH -e examine_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Export the path to the Python virtual environment folder
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

   # Check if the target folder exists
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   # Ensure the virtual environment is exposed
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # List all available attributes in torch module
   python -c "import torch; print('Available attributes:'); print([attr for attr in dir(torch) if not attr.startswith('_')])"

   # Check specific submodules
   python -c "import torch; print('torch.nn available:', hasattr(torch, 'nn')); print('torch.optim available:', hasattr(torch, 'optim')); print('torch.distributed available:', hasattr(torch, 'distributed'))"

   # Check BLAS backend (should show OpenBLAS when nomkl is used)
   python -c "import torch; print('BLAS library:', torch.__config__.show())"

   # Check if MKLDNN is actually installed and available at runtime
   python -c "import torch; print('MKLDNN available:', torch.backends.mkldnn.is_available() if hasattr(torch.backends, 'mkldnn') else 'mkldnn backend not found')"
   python -c "import torch; print('MKLDNN enabled:', torch.backends.mkldnn.enabled if hasattr(torch.backends, 'mkldnn') else 'N/A')"

   # List available tensor operations
   python -c "import torch; ops = [attr for attr in dir(torch) if callable(getattr(torch, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"

   # Check constants
   python -c "import torch; print('Float32:', torch.float32); print('CPU device:', torch.device('cpu')); print('Default dtype:', torch.get_default_dtype())"

Submit the job to the queue:

.. code-block:: bash

   sbatch run_examine_pytorch.sh

and follow the content written down in the standard output and standard error files (``examine_pytorch.JOBID.out`` and ``examine_pytorch.JOBID.err``) to see the available methods, functions, and constants.

Alternatively, run interactively using ``srun``:

.. code-block:: bash

   srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
        --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash

   # Then in the interactive session:
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
   python -c "import torch; print('Available attributes:', [attr for attr in dir(torch) if not attr.startswith('_')])"
   # ... (continue with other examination commands)

These tests help to verify:

-  Which PyTorch modules are available (nn, optim, distributed, etc.)
-  Which BLAS library is being used (should be OpenBLAS with nomkl)
-  Available tensor operations and constants

The build configuration should show ``BLAS_INFO=open`` and ``LAPACK_INFO=open``, confirming that OpenBLAS is used for BLAS and LAPACK operations instead of MKL. This is the expected behavior when using ``nomkl``. The ``USE_MKLDNN=1`` setting indicates that MKLDNN (Intel MKL-DNN, now oneDNN) is available. MKLDNN is embedded in the libtorch/pytorch package and statically linked into PyTorch, so it remains available even when ``nomkl`` is used. This configuration is correct: ``nomkl`` prevents MKL BLAS/LAPACK usage (which does not work well on AMD Zen2 processors), while MKLDNN (for neural network operations) remains available as it is part of the PyTorch build itself.

.. _install-tensorflow-cpu:

Install TensorFlow-CPU

.. _optional-search-for-available-tensorflow-packages:

Optional: Search for available TensorFlow packages

To check which TensorFlow CPU versions are available before installing, save the SLURM batch script below as ``run_search_tensorflow.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=search_tensorflow
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=00:10:00
   #SBATCH -o search_tensorflow.%j.out
   #SBATCH -e search_tensorflow.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Search for TensorFlow CPU packages
   conda search -c conda-forge tensorflow-cpu

   # Search for specific package with build string
   conda search -c conda-forge "tensorflow-cpu=*"

   # Check what's available in the configured channels
   conda search tensorflow-cpu

Submit the job to the queue:

.. code-block:: bash

   sbatch run_search_tensorflow.sh

and follow the content written down in the standard output and standard error files (``search_tensorflow.JOBID.out`` and ``search_tensorflow.JOBID.err``) to see which TensorFlow CPU versions are available.

.. _installation-script-tensorflow:

Installation script

To use Horovod for distributed training, ``tensorflow-cpu`` and ``horovod`` must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer. Installing them together ensures Horovod is built with CPU-only support.

Important note on compatibility: Even when installing together, conda’s dependency resolver may select versions that are technically compatible at the package level but have runtime API incompatibilities (e.g., Horovod callbacks may not work with certain TensorFlow versions). Training scripts should handle this by falling back to manual synchronization when callbacks are incompatible.

Save the SLURM batch script provided below as ``run_install_tensorflow.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=install_tensorflow
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH --time=01:30:00
   #SBATCH -o install_tensorflow.%j.out
   #SBATCH -e install_tensorflow.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

   # Create virtual environment if it doesn't exist
   if [ ! -d ${VIRTUAL_ENV} ]; then
       # Create Python virtual environment with Python 3.11
       # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
       # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
       # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
       conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
       
       if [ $? -ne 0 ]; then
           echo "Conda Python virtual environment creation failed" >&2
           exit 1
       fi
       
       echo "Python virtual environment successfully created!"
   else
       echo "Virtual environment already exists at ${VIRTUAL_ENV}"
   fi

   # Expose the virtual environment (do not use conda activate)
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # Set channel priority to flexible (required when packages exist in multiple channels)
   conda config --set channel_priority flexible

   # Install TensorFlow CPU and Horovod together (required for CPU-only clusters)
   # If Horovod is installed before TensorFlow, it will install CUDA dependencies which are not needed on CPU-only clusters
   # Installing them together ensures Horovod is built with CPU-only support and conda resolves compatible versions

   # Option 1: Let conda choose latest compatible versions (recommended for best compatibility)
   conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba

   # Option 2: Pin TensorFlow version, let conda choose compatible Horovod
   # Use this to install a specific TensorFlow version (replace <version> with the desired version)
   # conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu=<version> horovod nomkl --solver=libmamba

Submit the job to the queue:

.. code-block:: bash

   sbatch run_install_tensorflow.sh

and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).

.. important::

   - Install TensorFlow and Horovod together: To use Horovod for distributed training, ``tensorflow-cpu`` and ``horovod`` must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer.
   - Version compatibility: Conda’s dependency resolver does not guarantee runtime API compatibility. Even when installing together, runtime API incompatibilities may be encountered (e.g., Horovod callbacks failing with ``AttributeError: 'Variable' object has no attribute 'ref'``). The training script handles this automatically by falling back to manual synchronization.
   - Use flexible channel priority: When “excluded by strict repo priority” appears, set ``channel_priority`` to ``flexible`` to allow conda to use packages from any configured channel
   - Do not specify build string: If pinning a version, install by version only (e.g., ``tensorflow-cpu=<version>``), not with the build string (e.g., ``tensorflow-cpu=<version>=cpu_py313hbca4264_0``)
   - Include nomkl: Similar to PyTorch, include ``nomkl`` for AMD Zen2 processors

Runtime API compatibility: TensorFlow and Horovod may have runtime API incompatibilities even when conda installs them together. If errors like ``Horovod has been shut down`` or ``AttributeError: 'Variable' object has no attribute 'ref'`` are encountered, this indicates the installed versions are incompatible. The training script attempts to handle this, but if Horovod shuts down during initialization, training will fail. In this case, different version combinations may need to be tried or PyTorch may be used instead, which has better compatibility with Horovod.

.. _verify-tensorflow-installation:

Verify TensorFlow installation

After installing ``tensorflow-cpu``, verify the installation. Save the SLURM batch script below as ``run_verify_tensorflow.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=verify_tensorflow
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH -o verify_tensorflow.%j.out
   #SBATCH -e verify_tensorflow.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   # Ensure the virtual environment is exposed
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # Check TensorFlow version
   python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"

   # Check available CPU devices (TensorFlow shows logical CPU devices, not physical sockets)
   python -c "import tensorflow as tf; print('CPU devices:', tf.config.list_physical_devices('CPU'))"

   # Check number of CPU threads TensorFlow can use
   python -c "import tensorflow as tf; print('Inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"

   # Check total CPU cores available to TensorFlow
   python -c "import tensorflow as tf; import os; print('Total CPU cores available:', len(os.sched_getaffinity(0)) if hasattr(os, 'sched_getaffinity') else 'N/A')"

   # Configure TensorFlow to use all available CPU threads (256 threads per node on Discoverer)
   python -c "import tensorflow as tf; tf.config.threading.set_inter_op_parallelism_threads(256); tf.config.threading.set_intra_op_parallelism_threads(256); print('Configured inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Configured intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"

   # Check if TensorFlow has MPI integration
   python -c "import tensorflow as tf; print('tf.distribute available:', hasattr(tf, 'distribute')); print('tf.distribute.Strategy available:', hasattr(tf.distribute, 'Strategy') if hasattr(tf, 'distribute') else False)"

   # Check TensorFlow build configuration for MPI support
   python -c "import tensorflow as tf; config = tf.sysconfig.get_build_info(); print('Build flags:', config.get('cpu_compiler_flags', 'N/A')); import sysconfig; print('Has MPI:', 'mpi' in str(sysconfig.get_config_var('LIBS')).lower() or 'mpi' in str(tf.sysconfig.get_compile_flags()).lower())"

   # Check Horovod installation (if installed with TensorFlow)
   python -c "try: import horovod.tensorflow as hvd; print('Horovod version:', hvd.__version__); print('Horovod TensorFlow support: OK'); except ImportError: print('Horovod not installed or TensorFlow support not available')"

   # Test basic tensor operations
   python -c "import tensorflow as tf; x = tf.constant([[1.0, 2.0], [3.0, 4.0]]); print('Tensor shape:', x.shape); print('Tensor device:', x.device); y = tf.matmul(x, x); print('Matrix multiplication result shape:', y.shape)"

Submit the job to the queue:

.. code-block:: bash

   sbatch run_verify_tensorflow.sh

and follow the content written down in the standard output and standard error files (``verify_tensorflow.JOBID.out`` and ``verify_tensorflow.JOBID.err``) to see the verification results.

Expected output should show:

-  TensorFlow version number
-  CPU devices available
-  Thread configuration (inter-op and intra-op parallelism)
-  Horovod installation status (if installed)
-  Tensor operations working correctly

.. _examine-available-tensorflow-methods-functions-and-constants:

Examine available TensorFlow methods, functions, and constants

To inspect what methods, functions, and constants are available in the installed TensorFlow, save the SLURM batch script below as ``run_examine_tensorflow.sh`` and provide the correct ``account`` and ``qos`` values:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=examine_tensorflow
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=4G
   #SBATCH --time=00:10:00
   #SBATCH -o examine_tensorflow.%j.out
   #SBATCH -e examine_tensorflow.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   # Export the path to the Python virtual environment folder
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

   # Check if the target folder exists
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   # Ensure the virtual environment is exposed
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   # List all available attributes in tensorflow module
   python -c "import tensorflow as tf; print('Available attributes:'); print([attr for attr in dir(tf) if not attr.startswith('_')])"

   # Check specific submodules
   python -c "import tensorflow as tf; print('tf.keras available:', hasattr(tf, 'keras')); print('tf.nn available:', hasattr(tf, 'nn')); print('tf.optimizers available:', hasattr(tf, 'optimizers')); print('tf.distribute available:', hasattr(tf, 'distribute'))"

   # Check TensorFlow build configuration
   python -c "import tensorflow as tf; print('Build info:', tf.sysconfig.get_build_info())"

   # Check available operations
   python -c "import tensorflow as tf; ops = [attr for attr in dir(tf) if callable(getattr(tf, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"

   # Check constants and dtypes
   python -c "import tensorflow as tf; print('Float32:', tf.float32); print('Int32:', tf.int32); print('Default float dtype:', tf.keras.backend.floatx())"

   # Check Keras availability and version
   python -c "import tensorflow as tf; print('Keras version:', tf.keras.__version__ if hasattr(tf.keras, '__version__') else 'N/A'); print('Keras available:', hasattr(tf, 'keras'))"

Submit the job to the queue:

.. code-block:: bash

   sbatch run_examine_tensorflow.sh

and follow the content written down in the standard output and standard error files (``examine_tensorflow.JOBID.out`` and ``examine_tensorflow.JOBID.err``) to see the available methods, functions, and constants.

Alternatively, run interactively using ``srun``:

.. code-block:: bash

   srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
        --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash

   # Then in the interactive session:
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
   python -c "import tensorflow as tf; print('Available attributes:', [attr for attr in dir(tf) if not attr.startswith('_')])"
   # ... (continue with other examination commands)

These tests help to verify:

-  Which TensorFlow modules are available (keras, nn, optimizers, distribute, etc.)
-  Build configuration and available features
-  Available tensor operations and constants
-  Keras integration and version

.. _create-slurm-configuration-for-mlai-workloads-on-cpu-clusters:

Create SLURM configuration for ML/AI workloads on CPU clusters

.. _1-resource-allocation:

1. Resource allocation

-  CPU-intensive tasks: Request all 256 threads per node
-  Memory-intensive tasks: Use fat nodes (1TB RAM) when needed
-  I/O-intensive tasks: Consider network-attached storage performance
-  **Wall time limits**: All jobs have maximum runtime limits specified with ``#SBATCH --time=``. Jobs are automatically terminated when the wall time limit is reached. Plan checkpoint intervals accordingly (see `Checkpointing <#5-checkpointing>`__)

.. _2-job-arrays-for-parallel-experiments:

2. Job arrays for parallel experiments

.. code-block:: bash

   #SBATCH --array=0-99%10  # Run 100 jobs, max 10 concurrent

.. _3-memory-management:

3. Memory management

-  Monitor memory usage with ``sstat`` during job execution
-  Use memory-efficient libraries (e.g., ``pandas`` with chunking)
-  Consider using fat nodes for large datasets

.. _5-checkpointing:

5. Checkpointing

All SLURM jobs on Discoverer have wall time limitations (specified with ``#SBATCH --time=``). When a job reaches its wall time limit, it is automatically terminated by SLURM, regardless of whether training has completed. Therefore, checkpointing is essential for any training job that may exceed the allocated wall time.

The necessity of making checkpoints:

-  Jobs are terminated when wall time is reached, even if training is incomplete
-  Without checkpoints, all progress is lost and training must be restarted from the beginning
-  Checkpoints allow resuming training from the last saved state after the job is terminated

Best practices:

-  Implement checkpointing for all long-running training jobs
-  Save checkpoints to network storage (``/valhalla/projects/<your_slurm_project_account_name>/outputs/``)
-  Plan checkpoint intervals based on the wall time limit (e.g., if wall time is 24 hours, save checkpoints every 2-4 hours)
-  Ensure checkpoint saving happens frequently enough that significant progress is not lost if the job is terminated
-  Only rank 0 should write checkpoints in distributed training to avoid conflicts
-  Include epoch number, model state, optimizer state, and any other necessary information in checkpoints
-  Implement checkpoint loading at the start of training to resume from the last checkpoint if available

.. _suitable-mlai-workloads-for-cpu-clusters:

Suitable ML/AI workloads for CPU clusters

.. _1-traditional-machine-learning:

1. Traditional machine learning

.. _scikit-learn-workflows:

Scikit-learn workflows

-  Large-scale feature engineering and preprocessing
-  Hyperparameter optimization with grid/random search
-  Cross-validation on large datasets
-  Ensemble methods (Random Forest, Gradient Boosting)
-  Model training on datasets that fit in memory (up to 251 GB allocatable per node)

.. _xgboostlightgbmcatboost:

XGBoost/LightGBM/CatBoost

-  Tabular data classification/regression
-  Large-scale gradient boosting
-  Feature importance analysis
-  Model interpretation

.. _2-deep-learning-cpu-optimized:

2. Deep learning (CPU-optimized)

.. _tensorflowkeras-with-cpu-optimization:

TensorFlow/Keras with CPU optimization

-  Training smaller neural networks (CNNs, RNNs, MLPs)
-  Inference workloads
-  Transfer learning with pre-trained models
-  Model fine-tuning
-  Distributed training (see official TensorFlow distributed documentation)

.. _pytorch-with-cpu-backend:

PyTorch with CPU backend

-  Research prototyping
-  Model development and debugging
-  CPU-optimized inference
-  Distributed training (see official PyTorch distributed documentation)

.. _3-natural-language-processing-nlp:

3. Natural language processing (NLP)

.. _traditional-nlp-pipelines:

Traditional NLP pipelines

-  Large-scale text preprocessing and tokenization
-  Feature extraction (TF-IDF, word embeddings)
-  Document classification
-  Sentiment analysis on large corpora

.. _transformer-models-cpu-inference:

Transformer models (CPU inference)

-  Batch inference with Hugging Face models
-  Text generation (smaller models)
-  Embedding extraction
-  Model quantization and optimization

.. _4-computer-vision:

4. Computer vision

.. _image-processing-and-feature-extraction:

Image processing and feature extraction

-  Large-scale image preprocessing
-  Feature extraction (SIFT, SURF, HOG)
-  Image classification with traditional ML
-  Batch image transformations

.. _5-data-science-and-analytics:

5. Data science and analytics

.. _large-scale-data-processing:

Large-scale data processing

-  Pandas/NumPy operations on large datasets
-  Dask for large-scale data processing
-  Feature engineering pipelines
-  Data validation and cleaning

.. _6-hyperparameter-optimization:

6. Hyperparameter optimization

.. _hyperparameter-optimization-1:

.. _hyperparameter-optimization:

Hyperparameter optimization

-  Grid search with parallel trials
-  Random search with parallel trials
-  Bayesian optimization
-  AutoML frameworks

.. _7-reinforcement-learning:

7. Reinforcement learning

.. _cpu-based-rl-training:

CPU-based RL training

-  Policy gradient methods
-  Q-learning variants
-  Multi-agent systems
-  Environment simulation

.. _distributed-training:

Distributed training

.. _recommended-approach-use-mpi-for-hpc-clusters:

Recommended approach: Use MPI for HPC clusters

For multi-user HPC clusters like Discoverer, MPI is the recommended and verified approach for distributed training.

.. _why-mpi-over-gloo:

Why MPI over Gloo?

Gloo (PyTorch’s TCP/IP-based backend) has major practical problems on shared HPC clusters: - Port conflicts: Multiple users can pick the same ``MASTER_PORT``, causing job failures - No automatic port management: Manual coordination required between users - Race conditions: Port availability can change between check and use - Administrative burden: Unsuitable for multi-user environments

MPI was designed for multi-user HPC environments and handles all communication automatically without port conflicts. It provides: - No port management - MPI runtime handles all communication - Multi-user safe - designed for shared clusters - Battle-tested - decades of HPC use - SLURM integration - seamless with ``srun --mpi=``

Conclusion: For multi-user HPC clusters, MPI is the correct choice. Gloo should be avoided in shared environments.

.. _tensorflow-distributed-training-with-horovod:

TensorFlow distributed training with Horovod

Horovod provides a unified interface for distributed training with both TensorFlow and PyTorch. For a complete working example with TensorFlow, see the `Complete working example: Distributed training on MNIST dataset with TensorFlow <#complete-working-example-distributed-training-on-mnist-dataset-with-tensorflow>`__ section below.

Basic TensorFlow with Horovod setup:

.. code-block:: python

   import tensorflow as tf
   import horovod.tensorflow as hvd

   # Initialize Horovod
   hvd.init()

   # Build and compile model
   model = tf.keras.Sequential([
       # Add layers here
   ])

   # Scale learning rate by number of workers
   opt = tf.keras.optimizers.Adam(0.001 * hvd.size())

   # Wrap optimizer with Horovod
   opt = hvd.DistributedOptimizer(opt)

   model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

   # Add Horovod callbacks
   callbacks = [
       hvd.callbacks.BroadcastGlobalVariablesCallback(0),
       hvd.callbacks.MetricAverageCallback(),
       # Add other callbacks here
   ]

   # Only rank 0 should write checkpoints
   if hvd.rank() == 0:
       callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

   model.fit(dataset, epochs=10, callbacks=callbacks)

.. _running-pytorch-workloads:

Running PyTorch workloads

.. _package-installation-for-distributed-training:

Package installation for distributed training

For distributed training with Horovod, install PyTorch, Horovod, and torchvision together in the same command. See `Install PyTorch-CPU <#install-pytorch-cpu>`__ in the Virtual environment setup section for complete installation instructions.

.. important:: Python version requirement:

   -  Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
   -  TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11
   -  The virtual environment must be created with Python 3.11 for optimal compatibility
   -  Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions

.. important:: MPI implementation for Horovod:

   -  Install PyTorch and Horovod together to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed)
   -  No separate MPI module needs to be loaded - Open MPI is installed as a dependency in the virtual environment when Horovod is installed
   -  Use ``srun --mpi=pmix`` to launch jobs - ``srun`` automatically uses all allocated tasks, no need to specify ``-np``
   -  The ``mpirun`` command is also available from the virtual environment (installed as part of Open MPI dependency), but ``srun`` is preferred for SLURM integration

Installation command for distributed training:

.. code-block:: bash

   # Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
   conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba

.. note::

   -  With conda, the package is named ``pytorch-cpu`` (or ``pytorch``), and it is imported as ``import torch``
   -  torchvision is required for datasets (e.g., MNIST) and image transformations
   -  pytorch-cpu, horovod, and torchvision must be installed together in the same command to ensure Open MPI is used instead of MPICH and to bring ``mpi4py`` as a dependency
   -  Open MPI is installed as a dependency in the virtual environment when Horovod is installed - no separate MPI module needs to be loaded
   -  Always include ``nomkl`` when installing ``pytorch-cpu`` on Discoverer (AMD Zen2 processors)
   -  For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)

.. _pytorch-distributed-training-with-mpi:

PyTorch distributed training with MPI

.. _option-1-pytorch-native-ddp-with-mpi-backend:

Option 1: PyTorch native DDP with MPI backend

SLURM batch script:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=pytorch_distributed
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=8
   #SBATCH --cpus-per-task=2
   #SBATCH --mem=16G
   #SBATCH --time=02:00:00
   #SBATCH -o pytorch_distributed.%j.out
   #SBATCH -e pytorch_distributed.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

   # Launch with SLURM's MPI integration
   srun --mpi=pmix python train.py

Python code example:

.. code-block:: python

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data.distributed import DistributedSampler

   # Initialize process group with MPI backend
   # No MASTER_ADDR, no MASTER_PORT needed - MPI handles everything
   dist.init_process_group(backend='mpi')

   # Get local rank and world size
   local_rank = dist.get_rank()
   world_size = dist.get_world_size()

   # Create model and move to device
   model = YourModel()
   model = DDP(model)

   # Create distributed sampler for data loading
   train_sampler = DistributedSampler(
       train_dataset,
       num_replicas=world_size,
       rank=local_rank
   )

   train_loader = torch.utils.data.DataLoader(
       train_dataset,
       batch_size=batch_size,
       sampler=train_sampler
   )

   # Training loop
   for epoch in range(num_epochs):
       train_sampler.set_epoch(epoch)
       for data, target in train_loader:
           # Training code here
           pass

   # Clean up
   dist.destroy_process_group()

.. _option-2-horovod-with-pytorch-recommended-for-simplicity:

Option 2: Horovod with PyTorch (recommended for simplicity)

Horovod works with both TensorFlow and PyTorch and simplifies distributed training setup.

SLURM batch script:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=horovod_pytorch
   #SBATCH --partition=cn
   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos>
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=8
   #SBATCH --cpus-per-task=2
   #SBATCH --mem=16G
   #SBATCH --time=02:00:00
   #SBATCH -o horovod_pytorch.%j.out
   #SBATCH -e horovod_pytorch.%j.err

   cd ${SLURM_SUBMIT_DIR}
   module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
   module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

   export UCX_IB_SL=0

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
   [ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

   # Launch with srun (srun automatically uses all allocated tasks)
   srun --mpi=pmix python train_horovod.py

Python code example:

.. code-block:: python

   import torch
   import horovod.torch as hvd

   # Initialize Horovod
   hvd.init()

   # Scale learning rate by number of workers
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

   # Wrap optimizer with Horovod
   optimizer = hvd.DistributedOptimizer(
       optimizer,
       named_parameters=model.named_parameters()
   )

   # Broadcast initial parameters from rank 0 to all workers
   hvd.broadcast_parameters(model.state_dict(), root_rank=0)

   # Training loop
   for epoch in range(num_epochs):
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = criterion(model(data), target)
           loss.backward()
           optimizer.step()

.. _complete-working-example-distributed-training-on-mnist-dataset:

Complete working example: Distributed training on MNIST dataset

This section provides a complete, working example of distributed training using the MNIST handwritten digit dataset. MNIST is a standard benchmark dataset that’s automatically downloaded when the code is run, making it ideal for testing distributed training setups.

Example files: The complete training script and SLURM batch script are provided as separate, runnable files:

-  ``train_mnist_horovod_pytorch.py`` - Complete Python training script
-  ``run_mnist_training_pytorch.sh`` - SLURM batch submission script

Discoverer Storage Requirements:

-  All data and outputs must be stored under ``/valhalla/projects/<your_slurm_project_account_name>/`` (replace ``<your_slurm_project_account_name>`` with the SLURM account name)
-  Never store data in the home directory (``~/``) - this violates Discoverer’s best practices
-  The example scripts automatically use ``/valhalla/projects/<your_slurm_project_account_name>/data/`` for datasets
-  Outputs (checkpoints, logs) should go to ``/valhalla/projects/<your_slurm_project_account_name>/outputs/``
-  The SLURM script creates these directories automatically

These files can be copied and run directly after configuring the SLURM account settings.

.. _overview-pytorch:

Overview
^^^^^^^^

Dataset: MNIST.*Training set: 60,000 images
-  Test set: 10,000 images
-  Image size: 28×28 pixels
-  Data source: Automatically downloaded from PyTorch’s built-in datasets (no manual download needed)
-  Model: Simple convolutional neural network (CNN) suitable for CPU training
-  Framework: PyTorch with Horovod
-  Training: Distributed across multiple nodes using MPI

.. _step-1-prepare-the-working-directory-pytorch:

Step 1: Prepare the working directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On the Discoverer login node, create a directory for the training job and copy the example files:

.. code-block:: bash

   # Replace <your_slurm_project_account_name> with the actual SLURM account name
   mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

   # Copy the example files (adjust paths as needed)
   cp /path/to/train_mnist_horovod_pytorch.py .
   cp /path/to/run_mnist_training_pytorch.sh .

Note: The example files (``train_mnist_horovod_pytorch.py`` and ``run_mnist_training_pytorch.sh``) are provided as separate, runnable scripts that can be copied and used directly.

.. _step-2-get-the-training-script-pytorch:

Step 2: Get the training script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy the training script to the working directory:

.. code-block:: bash

   # Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
   cp train_mnist_horovod_pytorch.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

The training script (``train_mnist_horovod_pytorch.py``) contains:

-  CNN model definition for MNIST digit classification
-  Distributed data loading with Horovod
-  Training and validation loops with metric aggregation
-  Automatic MNIST dataset download (only rank 0 downloads)
-  Complete distributed training implementation

The full script can be viewed in ``train_mnist_horovod_pytorch.py`` or modified as needed.

.. _step-3-get-the-slurm-batch-script-pytorch:

Step 3: Get the SLURM batch script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy the SLURM batch script to the working directory:

.. code-block:: bash

   # Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
   cp run_mnist_training_pytorch.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

Before submitting, edit ``run_mnist_training_pytorch.sh`` and replace:

-  ``<your_slurm_project_account_name>`` with the actual SLURM account name
-  ``<your_qos>`` with the actual QoS name
-  Verify the virtual environment path matches the setup

The SLURM script (``run_mnist_training_pytorch.sh``) contains:

-  Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
-  Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
-  Module loading (anaconda3 only - Open MPI is installed as a dependency in the virtual environment when Horovod is installed)
-  Virtual environment setup
-  Job launch with ``srun --mpi=pmix`` (srun automatically uses all allocated tasks)

Understanding epochs in distributed training:

-  Epochs are shared across all ranks, not per-rank
-  With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
-  Each rank processes a different subset of data during each epoch
-  All ranks work together on the same epochs simultaneously
-  256 epochs are *not* needed for 256 ranks - the number of epochs is independent of the number of ranks

Resource allocation for full node allocation:

-  Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
-  Recommended: ``--ntasks-per-node=128`` (one rank per core)
-  Recommended: ``--cpus-per-task=2`` (use both threads per core)
-  Memory: Adjust ``--mem`` based on the model size (e.g., 251G for full node - maximum available on Discoverer)
-  Total ranks: 2 nodes × 128 tasks = 256 ranks
-  Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes

The full script can be viewed in ``run_mnist_training_pytorch.sh`` and resource requirements can be modified as needed.

.. _step-4-make-the-script-executable-and-submit-the-job:

Step 4: Make the script executable and submit the job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On the login node:

.. code-block:: bash

   # Edit the script first to set the account and QoS (if not already done)
   # Then submit the job
   sbatch run_mnist_training_pytorch.sh

Before submitting: Ensure ``run_mnist_training_pytorch.sh`` has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct.

.. _step-5-monitor-the-job-pytorch:

Step 5: Monitor the job
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Check job status
   squeue --me

   # Monitor output in real-time (replace JOBID with the actual job ID)
   # Note: These commands should only be executed when squeue --me shows the job is in running state
   tail -f pytorch_cpu_training.JOBID.out

   # Check for errors
   tail -f pytorch_cpu_training.JOBID.err

.. _step-6-check-job-runtime-and-statistics:

Step 6: Check job runtime and statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After the job completes, check how long it took:

.. code-block:: bash

   # Show job-level summary only (aggregated, no job steps)
   sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS

   # Show detailed accounting with job steps (default behavior)
   sacct -j JOBID --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,TotalCPU,MaxRSS,MaxVMSize,NodeList

   # Show elapsed time (wall clock time) for a job
   sacct -j JOBID -X --format=JobID,JobName,Elapsed,State

   # Show resource usage summary
   sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,MaxVMSize,ReqMem,AllocCPUS

   # Show all jobs for today
   sacct --starttime=today -X --format=JobID,JobName,Partition,Elapsed,State,ExitCode

Understanding sacct output:

Without ``-X`` flag (default): Shows job steps separately

-  ``JOBID``: Main job (aggregated across all steps)
-  ``JOBID.batch``: The batch script execution
-  ``JOBID.0``, ``JOBID.1``, etc.: Individual job steps (e.g., MPI processes, srun steps)

With ``-X`` flag: Shows only job-level summary (aggregated)

-  Single line with total statistics for the entire job

Example output without ``-X``:

.. code-block:: bash

   $ sacct -j 4060972 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- ---------- 
   4060972      mnist_hor+   00:00:56   05:03:21                   512 
   4060972.bat+      batch   00:00:56  00:00.513     21176K        256 
   4060972.0    hydra_pmi+   00:00:55   05:03:21 150700188K          4 

Example output with ``-X`` (job-level only):

.. code-block:: bash

   $ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- ---------- 
   4060972      mnist_hor+   00:00:56   00:00:00                   512 

Common fields:

-  ``Elapsed``: Wall clock time (how long the job ran)
-  ``TotalCPU``: Total CPU time used (sum across all CPUs)
-  ``MaxRSS``: Maximum resident set size (peak memory usage)
-  ``MaxVMSize``: Maximum virtual memory size
-  ``AllocCPUS``: Number of CPUs allocated

Recommendation: Use ``-X`` flag for quick job-level summaries. Remove ``-X`` to see detailed breakdown by job steps.

Performance comparison: Scaling with number of CPUs

The benefits of fully allocating nodes are clear when comparing runtime. Note that SLURM’s ``Elapsed`` time includes job startup/teardown overhead, while the training script reports actual computation time:

Partial allocation (8 tasks per node, 32 CPUs total):

.. code-block:: bash

   $ sacct -j 4060968 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- ---------- 
   4060968      mnist_hor+   00:02:35   00:00:00                    32 

-  Wall time (SLURM): 2 minutes 35 seconds
-  Allocated resources: 32 CPUs allocated

Full node allocation (128 tasks per node, 512 CPUs total):

.. code-block:: bash

   $ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- ---------- 
   4060972      mnist_hor+   00:00:56   00:00:00                   512 

-  Wall time (SLURM): 56 seconds
-  Allocated resources: 512 CPUs (128 tasks × 2 nodes × 2 CPUs per task)
-  Speedup: ~2.8× faster (155 seconds → 56 seconds)

Large-scale runs (multiple nodes):

8 nodes (1024 CPUs total):

.. code-block:: bash

   $ sacct -j 4060975 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- ---------- 
   4060975      mnist_hor+   00:00:52   00:00:00                  1024 

-  Wall time (SLURM): 52 seconds
-  Actual training time (from logs): 18.73 seconds
-  Average time per epoch: 3.75 seconds
-  Allocated resources: 1024 CPUs (128 tasks × 8 nodes × 2 CPUs per task)

16 nodes (2048 CPUs total):

.. code-block:: bash

   $ sacct -j 4060974 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
          JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS 
   ------------ ---------- ---------- ---------- ---------- 
   4060974      mnist_hor+   00:00:49   00:00:00                  2048 

-  Wall time (SLURM): 49 seconds
-  Actual training time (from logs): 14.75 seconds
-  Average time per epoch: 2.95 seconds
-  Allocated resources: 2048 CPUs (128 tasks × 16 nodes × 2 CPUs per task)
-  Speedup vs 1024 CPUs: ~1.3× faster training time (18.73s → 14.75s)

Analysis

   - Wall time vs. training time: SLURM’s ``Elapsed`` includes job startup/teardown overhead. The training script’s reported time is the actual computation time.
   - Scaling efficiency: Training time scales well from 512 → 1024 → 2048 CPUs, showing good parallel efficiency.
   - Overhead: Job startup/teardown overhead becomes more noticeable at larger scales (e.g., 49s wall time vs 14.75s training time for 2048 CPUs).
   - Full node allocation: Using 128 tasks per node (one per core) maximizes resource allocation and minimizes training time.

.. _scalability-testing-results:

Scalability testing results

The following results demonstrate how to measure scalability and training speed across different node configurations. These are speed/scalability tests, not optimized for accuracy. The accuracy values shown are lower than optimal because the focus is on demonstrating distributed training execution and measuring speedup.

2 nodes (256 ranks):

-  Training time: 23.80 seconds
-  Learning rate: 0.16 (sqrt scaling)
-  Test accuracy: 66.66% → 10.08% (unstable - not optimized for accuracy)

4 nodes (512 ranks):

-  Training time: 19.07 seconds
-  Learning rate: 0.226 (sqrt scaling)
-  Test accuracy: 64.99% → 21.16% (unstable - not optimized for accuracy)

8 nodes (1024 ranks):

-  Training time: 15.37 seconds
-  Learning rate: 0.1 (fixed scaling)
-  Test accuracy: 19.84% → 66.18% (improving but not optimized)

16 nodes (2048 ranks):

-  Training time: 15.10 seconds
-  Learning rate: 0.1 (fixed scaling)
-  Test accuracy: 20.68% → 64.96% (improving but not optimized)

Analysis:

-  Training speed scales well: 23.80s → 19.07s → 15.37s → 15.10s
-  Diminishing returns beyond 8 nodes (communication overhead)
-  These results demonstrate scalability - actual accuracy requires further tuning (see below)

.. _achieving-better-accuracy:

Achieving better accuracy

To achieve high accuracy (e.g., ~98-99% for MNIST) instead of just speed testing, apply these changes to the training script:

1. Reduce base learning rate: Change ``base_lr = 0.01`` to ``base_lr = 0.001`` or ``0.0001``
2. Train for more epochs: Change ``num_epochs = 5`` to ``num_epochs = 10`` or ``20``
3. Use Adam optimizer: Replace ``optim.SGD`` with ``optim.Adam`` or ``optim.AdamW`` for better stability
4. Add gradient clipping: Before ``optimizer.step()``, add ``torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)``
5. Use learning rate schedule: Add ``StepLR`` scheduler to decay learning rate over epochs
6. Increase batch size: Change ``batch_size = 64`` to ``128`` or ``256`` (if memory allows)

.. note:: Better accuracy typically requires more epochs, lower learning rates, and more careful tuning, which increases training time. For production training, start with 2-4 nodes to establish baseline accuracy, then scale up while monitoring convergence.

Full node allocation significantly reduces training time. For the same workload, using 128 tasks per node instead of 8 tasks per node provides nearly 3× speedup while fully allocating the available resources. Scaling to multiple nodes further reduces training time, though with diminishing returns due to communication overhead.

.. _understanding-the-data-flow:

Understanding the data flow

1. Data location:

   -  The MNIST dataset is already available at ``/opt/software/test/pytorch/mnist_data``
   -  The dataset is automatically loaded from this location - no download is needed
   -  The dataset is approximately 60 MB and includes both training and test sets
   -  The SLURM script sets ``DATA_DIR`` environment variable to this location

2. Data distribution:

   -  Each process gets a different subset of the training data via ``DistributedSampler``
   -  With 16 processes (2 nodes × 8 tasks), each process handles ~3,750 training images
   -  Data is automatically shuffled differently each epoch

3. Model synchronization:

   -  Initial model parameters are broadcast from rank 0 to all workers
   -  After each training step, gradients are averaged across all processes using Horovod’s ring-allreduce algorithm
   -  All processes maintain identical model parameters throughout training

4. Output:

   -  Only rank 0 prints training progress to avoid duplicate output
   -  Metrics (loss, accuracy) are averaged across all processes before printing
   -  Checkpoints should be saved to ``/valhalla/projects/<your_slurm_project_account_name>/outputs`` (never use home directory)
   -  The SLURM script sets ``OUTPUT_DIR`` environment variable for this purpose

.. _expected-output:

Expected output

When checking the output file (``pytorch_cpu_training.JOBID.out``), the output should be similar to:

.. code-block:: bash

   Job ID: 4060967
   Number of nodes: 2
   Tasks per node: 128
   Total tasks: 256
   CPUs per task: 2
   Working directory: /valhalla/projects/your_account/ai_cpu
   Data directory: /opt/software/test/pytorch/mnist_data
   Output directory: /valhalla/projects/your_account/ai_cpu/outputs
   Python: /valhalla/projects/your_account/virt_envs/pytorch/bin/python
   Python version: Python 3.11.14
   Starting training on 256 processes
   Process 0 of 256
   Loading MNIST dataset...
   Data will be stored in: /opt/software/test/pytorch/mnist_data
   Training dataset download completed
   Starting training for 5 epochs...
   Epoch 1, Batch 0, Loss: 2.3130
   Epoch 1 - Average Loss: 0.8803, Average Accuracy: 72.08%
   Test Loss: 0.0985, Test Accuracy: 97.07%
   Epoch 2, Batch 0, Loss: 0.2280
   Epoch 2 - Average Loss: 0.1597, Average Accuracy: 95.36%
   Test Loss: 0.0540, Test Accuracy: 98.29%
   Epoch 3, Batch 0, Loss: 0.1328
   Epoch 3 - Average Loss: 0.1004, Average Accuracy: 97.09%
   Test Loss: 0.0488, Test Accuracy: 98.45%
   Epoch 4, Batch 0, Loss: 0.0106
   Epoch 4 - Average Loss: 0.0834, Average Accuracy: 97.63%
   Test Loss: 0.0505, Test Accuracy: 98.45%
   Epoch 5, Batch 0, Loss: 0.1305
   Epoch 5 - Average Loss: 0.0726, Average Accuracy: 97.86%
   Test Loss: 0.0322, Test Accuracy: 98.99%
   Training completed in 99.44 seconds
   Average time per epoch: 19.89 seconds
   Training job completed

Expected output characteristics:

-  Job information header: Shows resource allocation and paths
-  Process count: Should match ``--nodes × --ntasks-per-node`` (e.g., 2 × 128 = 256 for full allocation)
-  Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
-  Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
-  Training progress: Loss should decrease and accuracy should increase over epochs
-  Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
-  Completion message: “Training job completed” indicates successful finish

.. note:: Only MPI rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.

.. _understanding-process-structure:

Understanding process structure

When checking running processes with ``top`` or ``htop`` on a compute node, the following will be visible:

1. Main Python training processes: 128 processes (one per ``--ntasks-per-node``, matching one per core)

   -  High CPU usage (180-220%) - using multiple threads (2 threads per process)
   -  Memory usage: ~500-530 MB per process
   -  These are the distributed training workers
   -  Each process uses 2 threads (matching ``--cpus-per-task=2``)

2. ``pt_data_worker`` processes: 256 processes (2 per training process)

   -  Lower CPU usage (~2%) - data loading workers
   -  Memory usage: ~280-290 MB per worker
   -  These are PyTorch DataLoader worker processes

The purpose of ``pt_data_worker`` processes: - PyTorch’s ``DataLoader`` uses ``num_workers=2`` (set in the training script) - Each training process spawns 2 worker processes to load data in parallel - This prevents data loading from blocking training computation - With 128 training processes × 2 workers = 256 ``pt_data_worker`` processes per node

Total processes per node (full allocation): - 128 main Python processes (training, one per core) - 256 ``pt_data_worker`` processes (data loading) - Total: 384 processes per node - Total across 2 nodes: 768 processes (512 training + 256 data workers)

Resource allocation: - CPU cores: 128 cores per node × 2 nodes = 256 cores (fully utilized) - Threads: 256 ranks × 2 threads = 512 threads (fully utilizing hyperthreading) - Memory: Adjust ``--mem`` based on model size (e.g., 251G per node for full allocation - maximum available on Discoverer)

This is normal and expected behavior. The worker processes help keep the GPU/CPU busy by pre-loading the next batch while the current batch is being processed.

To reduce worker processes (if needed for resource constraints): - With 256 ranks, having ``num_workers=2`` creates 512 data worker processes (256 × 2) - Consider reducing to ``num_workers=1`` for large-scale runs (256 worker processes total) - Or set ``num_workers=0`` to disable parallel data loading (slower but minimal overhead) - For full node allocation (256 ranks), ``num_workers=1`` is often a good balance

.. _troubleshooting-pytorch:

Troubleshooting

1. Job stuck in pending state with reason “(None)”:

   -  “(None)” means the job is waiting in the queue without a specific reason code

   -  This is common for large resource requests (e.g., 2 full nodes)

   -  Check detailed job information:

      .. code-block:: bash

         scontrol show job JOBID

   -  Check job priority and position in queue:

      .. code-block:: bash

         sprio -j JOBID

   -  Check partition availability:

      .. code-block:: bash

         sinfo -p cn

   -  Check available nodes:

      .. code-block:: bash

         sinfo -p cn -o '%N %t %C %m'

   -  Common reasons for delay:

      -  No 2 full nodes available simultaneously (most common for full node requests)
      -  Other jobs ahead in queue with higher priority
      -  Account/QoS limits (check with ``sacctmgr show assoc user=$USER``)
      -  Partition limits or maintenance

   -  Solutions:

      -  Wait for resources to become available
      -  Reduce resource request (fewer nodes, fewer tasks per node)
      -  Check with cluster administrators about account limits
      -  Consider using fewer nodes or reducing memory request if possible

2. Python version compatibility issues:

   -  Python 3.11 is the current optimal version for ML/AI workloads on Discoverer

   -  If Python version errors are encountered, recreate the virtual environment with Python 3.11:

      .. code-block:: bash

         # Remove the existing environment
         conda env remove --prefix ${VIRTUAL_ENV} -y

         # Create new environment with Python 3.11 (current optimal version)
         # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
         # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
         # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
         conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y

         # Then install packages again (install pytorch-cpu, horovod, and torchvision together to ensure Open MPI and mpi4py)
         conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba

3. Dataset location issues:

   -  The MNIST dataset should already be available at ``/opt/software/test/pytorch/mnist_data``
   -  If the dataset is not found, verify the path is correct and accessible from compute nodes
   -  The SLURM script sets ``DATA_DIR`` to this location automatically

4. Out of memory:

   -  Reduce ``batch_size`` in the training script (currently 64)
   -  Reduce number of workers in DataLoader (``num_workers=2``)

5. Slow training:

   -  Increase ``batch_size`` if memory allows
   -  Reduce number of epochs for testing
   -  Check that ``OMP_NUM_THREADS`` is set correctly

6. Training produces NaN loss values:

   -  Cause: Learning rate is too high when scaled linearly with large number of ranks
   -  With 256+ ranks, linear scaling (base_lr × num_ranks) creates extremely high learning rates
   -  Example: 0.01 × 2048 = 20.48 (way too high, causes divergence)
   -  Solution: The training script now uses square root scaling for large numbers of ranks
   -  For ≤32 ranks: linear scaling (base_lr × num_ranks)
   -  For >32 ranks: square root scaling (base_lr × sqrt(num_ranks))
   -  This prevents NaN values while still benefiting from distributed training
   -  If NaN values still appear, try reducing ``base_lr`` further (e.g., 0.001 instead of 0.01)

7. MPI errors:

   -  Open MPI is installed as a dependency in the virtual environment when Horovod is installed (Open MPI when PyTorch and Horovod are installed together) - no separate MPI module needs to be loaded
   -  Use ``srun --mpi=pmix`` to launch jobs - ``srun`` automatically uses all allocated tasks (no need for ``-np $SLURM_NTASKS``)
   -  Verify that MPI libraries are available: ``which mpirun`` (should be in the virtual environment, though ``srun`` is preferred)
   -  Check that Horovod was installed correctly: ``conda list horovod``
   -  Ensure virtual environment has ``horovod`` and ``mpi4py`` installed
   -  Important: Install PyTorch, Horovod, and torchvision together (``conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba``) to ensure Open MPI is used and to bring ``mpi4py`` as a dependency

.. _adapting-for-custom-datasets-pytorch:

Adapting for custom datasets

To use a custom dataset:

1. Replace the dataset loading section:

   .. code-block:: python

      # Instead of datasets.MNIST, use a custom dataset
      from torch.utils.data import Dataset

      class YourDataset(Dataset):
          def __init__(self, data_path, transform=None):
              # Load data here
              pass

          def __len__(self):
              return len(self.data)

          def __getitem__(self, idx):
              # Return (image, label) tuple
              return image, label

      train_dataset = YourDataset(data_path='/path/to/data', transform=transform)

2. Adjust model architecture:

   -  Modify ``MNIST_CNN`` class to match the input dimensions and number of classes
   -  Update normalization values in transforms if needed

3. Update data directory:

   -  For the example MNIST training, the dataset is at ``/opt/software/test/pytorch/mnist_data``
   -  For custom datasets, store them in ``/valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/``
   -  Update the ``DATA_DIR`` environment variable in the SLURM script to point to the dataset location
   -  Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
   -  Never store data in home directory - always use ``/valhalla/projects/<your_slurm_project_account_name>/``

.. _scaling-analysis-why-different-node-counts-require-different-numbers-of-epochs:

Scaling analysis: Why different node counts require different numbers of epochs

In distributed training with Horovod, the training dataset is sharded (divided) across all worker processes (ranks). Each rank processes only its assigned subset of data during each epoch.

Data sharding formula:

::

   Samples per rank per epoch = Total training samples / Number of ranks

As the number of nodes (and therefore ranks) increases:

1. Fewer samples per rank: Each rank sees progressively fewer training samples per epoch

   -  2 nodes (256 ranks): 60,000 / 256 = 234.4 samples per rank
   -  4 nodes (512 ranks): 60,000 / 512 = 117.2 samples per rank
   -  8 nodes (1024 ranks): 60,000 / 1024 = 58.6 samples per rank
   -  16 nodes (2048 ranks): 60,000 / 2048 = 29.3 samples per rank

2. Reduced effective batch size per rank: With fewer samples, batch sizes must be reduced to ensure multiple batches per epoch

   -  More ranks → smaller batch sizes (64 → 32 → 16)

3. More epochs needed for convergence: With fewer samples per rank, each rank needs more epochs to:

   -  See enough diverse data to learn meaningful patterns
   -  Accumulate sufficient gradient updates for convergence
   -  Overcome the statistical noise from limited data per rank

4. Learning rate scaling: Learning rates are scaled more conservatively with more ranks to prevent training instability

Explanation: 16 nodes required more epochs than 8 nodes

With 16 nodes (2048 ranks), each rank sees only 29.3 samples per epoch. This extremely limited data per rank means: - Each rank sees very little data diversity per epoch - Gradient estimates become noisier, requiring more iterations to converge - More communication overhead with more ranks can introduce slight inconsistencies - Beyond a certain point, adding more ranks doesn’t linearly improve convergence speed due to the data limitation

The following table shows PyTorch training results across different node configurations:

+---------------------+--------------------+--------------------+---------------------+----------------------+
| Metric              | 2 Nodes(256 ranks) | 4 Nodes(512 ranks) | 8 Nodes(1024 ranks) | 16 Nodes(2048 ranks) |
+=====================+====================+====================+=====================+======================+
| Total Ranks         | 256                | 512                | 1024                | 2048                 |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Samples per Rank    | 234.4              | 117.2              | 58.6                | 29.3                 |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Batch Size          | 64                 | 32                 | 16                  | 16                   |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Learning Rate       | 0.016000           | 0.012126           | 0.005657            | 0.006727             |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| LR Scaling          | sqrt (0.5)         | conservative (0.4) | conservative (0.25) | conservative (0.25)  |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Max Epochs          | 20                 | 25                 | 30                  | 30                   |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Epochs to 98%       | 17                 | 13                 | 13                  | 28                   |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Final Accuracy      | 98.14%             | 98.01%             | 98.11%              | 98.12%               |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Total Training Time | 71.44s             | 52.56s             | 52.80s              | 94.36s               |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Time per Epoch      | 4.20s              | 4.04s              | 4.06s               | 3.37s                |
+---------------------+--------------------+--------------------+---------------------+----------------------+
| Early Stopping      | Yes                | Yes                | Yes                 | Yes                  |
+---------------------+--------------------+--------------------+---------------------+----------------------+

Highlights:

1. Training time efficiency: From the total training time, 4 nodes (512 MPI ranks) is the most optimal for computation, achieving 98% accuracy in 52.56 seconds. 8 nodes (1024 ranks) is nearly as efficient at 52.80 seconds, while 2 nodes requires 71.44 seconds and 16 nodes requires 94.36 seconds.
2. Convergence patterns: 4 and 8 nodes show fast convergence (13 epochs), while 16 nodes requires 28 epochs due to extremely limited data per rank
3. Learning rate scaling: Becomes more conservative as ranks increase to prevent training instability
4. Batch size adaptation: Automatically reduced with more ranks to ensure sufficient batches per epoch

Recommendations:

-  Optimal node count for this workload: 4-8 nodes provide the best balance of speed and resource allocation
-  Avoid over-scaling: Beyond 8 nodes, the benefits diminish due to data sharding limitations
-  Early stopping is effective: All runs successfully used early stopping at 98%, saving computation time
-  Adaptive epoch scaling: The script correctly scales max epochs based on rank count, allowing sufficient training time for larger scales

Estimating optimal parallel rank for training jobs:

For each training job, estimate the optimal parallel rank (number of MPI ranks) by:

1. Running multiple test runs with different node counts (e.g., 2, 4, 8, 16 nodes) using the same model architecture and training setup
2. Comparing total training time, convergence speed (epochs to target accuracy), and resource allocation across different configurations
3. Identifying the configuration that achieves the target accuracy in the shortest time while maintaining stable convergence
4. Running production training jobs based on the estimated optimal configuration from these test runs

.. note:: The optimal configuration may vary depending on dataset size, model complexity, and target accuracy. Always validate the optimal rank count through empirical testing before running large-scale production jobs.

.. _complete-working-example-distributed-training-on-mnist-dataset-with-tensorflow:

Complete working example: Distributed training on MNIST dataset with TensorFlow

This section provides a complete, working example of distributed training using TensorFlow/Keras with Horovod on the MNIST handwritten digit dataset. This example follows the same structure as the PyTorch example above but uses TensorFlow/Keras instead.

Example files: The complete training script and SLURM batch script are provided as separate, runnable files: - ``train_mnist_horovod_tf.py`` - Complete Python training script for TensorFlow - ``run_mnist_training_tf.sh`` - SLURM batch submission script for TensorFlow

Discoverer Storage Requirements:

-  All outputs must be stored under ``/valhalla/projects/<your_slurm_project_account_name>/`` (replace ``<your_slurm_project_account_name>`` with the SLURM account name)
-  Never store data or other large files (like outputs) in the user’s home directory (``~/``) - this violates Discoverer CPU partition best file system utilisation practices
-  The MNIST dataset is already available locally at ``/opt/software/test/tf/mnist_data`` - no download is needed
-  Outputs (checkpoints, logs) should go to ``/valhalla/projects/<your_slurm_project_account_name>/outputs/``
-  The SLURM script creates the output directory automatically

.. _overview-tensorflow:

Overview
^^^^^^^^

Dataset: MNIST.*containing 70,000 grayscale images of handwritten digits (0-9):

-  Training set: 60,000 images
-  Test set: 10,000 images
-  Image size: 28×28 pixels
-  Data source: The MNIST dataset is already available locally and is automatically loaded from the local storage location
-  Model: Simple convolutional neural network (CNN) suitable for CPU training
-  Framework: TensorFlow/Keras with Horovod
-  Training: Distributed across multiple nodes using MPI

Prerequisites:

-  TensorFlow and Horovod must be installed together in the same conda install command to ensure CPU-only build
-  If Horovod is installed before TensorFlow, it may install CUDA dependencies which are not needed on CPU-only clusters like Discoverer CPU partition
-  Installation command: ``conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba``
-  See the `TensorFlow installation <#install-tensorflow-cpu>`__ section for complete installation instructions

.. _step-1-prepare-the-working-directory-tensorflow:

Step 1: Prepare the working directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On the Discoverer login node, create a directory for the training job and copy the example files:

.. code-block:: bash

   # Replace <your_slurm_project_account_name> with the actual SLURM account name
   mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

   # Copy the example files (adjust paths as needed)
   cp /path/to/train_mnist_horovod_tf.py .
   cp /path/to/run_mnist_training_tf.sh .

.. note:: The example files (``train_mnist_horovod_tf.py`` and ``run_mnist_training_tf.sh``) are provided as separate, runnable scripts that can be copied and used directly.

.. _step-2-get-the-training-script-tensorflow:

Step 2: Get the training script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy the training script to the working directory:

.. code-block:: bash

   # Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
   cp train_mnist_horovod_tf.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

The training script (``train_mnist_horovod_tf.py``) contains:

-  CNN model definition using TensorFlow/Keras Sequential API
-  Distributed data loading with TensorFlow Dataset API and Horovod sharding
-  Training and validation loops with metric aggregation
-  MNIST dataset loading from local storage
-  Complete distributed training implementation with Horovod callbacks

The full script can be viewed in ``train_mnist_horovod_tf.py`` or modified as needed.

.. _step-3-get-the-slurm-batch-script-tensorflow:

Step 3: Get the SLURM batch script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy the SLURM batch script to the working directory:

.. code-block:: bash

   # Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
   cp run_mnist_training_tf.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
   cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

Before submitting, edit ``run_mnist_training_tf.sh`` and replace:

-  ``<your_slurm_project_account_name>`` with the actual SLURM account name
-  ``<your_qos>`` with the actual QoS name
-  Verify the virtual environment path matches the setup (should point to the TensorFlow virtual environment)

The SLURM script (``run_mnist_training_tf.sh``) contains:

-  Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
-  Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
-  Module loading (``anaconda3`` only is required - Open MPI is already installed as a dependency in the virtual environment during the installation of Horovod)
-  Virtual environment setup
-  Job launch with ``srun --mpi=pmix`` (``srun`` automatically uses all allocated tasks)

Understanding epochs in distributed training:

-  Epochs are shared across all ranks, not per-rank
-  With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
-  Each rank processes a different subset of data during each epoch
-  All ranks work together on the same epochs simultaneously
-  256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks

Resource allocation for full node allocation:

-  Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
-  Recommended: ``--ntasks-per-node=128`` (one rank per core)
-  Recommended: ``--cpus-per-task=2`` (use both threads per core)
-  Memory: Adjust ``--mem`` based on the model size (e.g., 251G for full node - maximum available on Discoverer)
-  Total ranks: 2 nodes × 128 tasks = 256 ranks
-  Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes

.. _step-4-submit-the-job:

Step 4: Submit the job
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Edit the script first to set the account and QoS (if not already done)
   # Then submit the job
   sbatch run_mnist_training_tf.sh

Before submitting: Ensure ``run_mnist_training_tf.sh`` has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct (should point to the TensorFlow virtual environment, not PyTorch).

.. _step-5-monitor-the-job-tensorflow:

Step 5: Monitor the job
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Check job status
   squeue --me

   # View output (replace JOBID with the actual job ID)
   # Note: These commands should only be executed when squeue --me shows the job is in running state
   tail -f mnist_training_tf.JOBID.out

   # View errors (if any)
   tail -f mnist_training_tf.JOBID.err

.. _step-6-check-the-results:

Step 6: Check the results
^^^^^^^^^^^^^^^^^^^^^^^^^

When the job completes, check the output file for training progress and final metrics.

When checking the output file (``mnist_training_tf.JOBID.out``), the output should be similar to:

.. code-block:: bash

   Job ID: 4060967
   Number of nodes: 2
   Tasks per node: 128
   Total tasks: 256
   CPUs per task: 2
   Working directory: /valhalla/projects/your_account/ai_cpu
   Data directory: /opt/software/test/tf/mnist_data
   Output directory: /valhalla/projects/your_account/ai_cpu/outputs
   Python: /valhalla/projects/your_account/virt_envs/tf/bin/python
   Python version: Python 3.11.14
   Starting training on 256 processes
   Process 0 of 256
   Loading MNIST dataset...
   Data will be stored in: /opt/software/test/tf/mnist_data
   Training dataset download completed
   Using learning rate: 0.16 (base: 0.01, ranks: 256, scaling: sqrt)
   Starting training for 5 epochs...
   Epoch 1, Loss: 0.8803, Accuracy: 72.08%, Val Loss: 0.0985, Val Accuracy: 97.07%, Time: 19.89s
   Epoch 2, Loss: 0.1597, Accuracy: 95.36%, Val Loss: 0.0540, Val Accuracy: 98.29%, Time: 19.45s
   Epoch 3, Loss: 0.1004, Accuracy: 97.09%, Val Loss: 0.0488, Val Accuracy: 98.45%, Time: 19.32s
   Epoch 4, Loss: 0.0834, Accuracy: 97.63%, Val Loss: 0.0505, Val Accuracy: 98.45%, Time: 19.28s
   Epoch 5, Loss: 0.0726, Accuracy: 97.86%, Val Loss: 0.0322, Val Accuracy: 98.99%, Time: 19.21s
   Training completed in 99.44 seconds
   Average time per epoch: 19.89 seconds
   Evaluating on test set...
   Test Loss: 0.0322, Test Accuracy: 98.99%
   Training job completed

Expected output characteristics:

-  Job information header: Shows resource allocation and paths
-  Process count: Should match ``--nodes × --ntasks-per-node`` (e.g., 2 × 128 = 256 for full allocation)
-  Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
-  Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
-  Training progress: Loss should decrease and accuracy should increase over epochs
-  Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
-  Completion message: “Training job completed” indicates successful finish

Note: Only rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.

.. _differences-from-pytorch-example:

Differences from PyTorch example

The TensorFlow implementation differs from the PyTorch example in several ways:

-  Data loading: TensorFlow uses ``tf.data.Dataset`` with ``.shard()`` for distributed data splitting, while PyTorch uses ``DistributedSampler``
-  Model definition: TensorFlow uses Keras Sequential API, while PyTorch uses ``nn.Module``
-  Training loop: TensorFlow uses ``model.fit()`` with callbacks, while PyTorch uses manual training loops
-  Metric aggregation: TensorFlow uses Horovod’s ``MetricAverageCallback``, while PyTorch uses ``hvd.allreduce()`` manually
-  Dataset download: TensorFlow downloads to a ``.npz`` file, while PyTorch downloads to a directory structure

.. _troubleshooting-tensorflow:

Troubleshooting

Most troubleshooting steps are the same as the PyTorch example. Additional TensorFlow-specific considerations:

1. TensorFlow dataset sharding:

   -  Ensure ``.shard(hvd.size(), hvd.rank())`` is called on the dataset before batching
   -  Each MPI rank must process a different subset of data

2. Horovod callbacks:

   -  Always include ``BroadcastGlobalVariablesCallback(0)`` to synchronize initial model parameters
   -  Include ``MetricAverageCallback()`` to average metrics across MPI ranks
   -  Only MPI rank #0 should write checkpoints (use ``if hvd.rank() == 0:``)

3. TensorFlow/Keras compatibility:

   -  Ensure TensorFlow and Horovod versions are compatible
   -  Check with: ``python -c "import tensorflow as tf; import horovod.tensorflow as hvd; print('TF:', tf.__version__, 'Horovod:', hvd.__version__)"``

4. Memory issues:

   -  TensorFlow may use more memory than PyTorch for the same model
   -  Reduce ``batch_size`` if OOM errors are encountered
   -  Consider using ``tf.data.Dataset`` prefetching carefully

5. InfiniBand transport errors:

   -  Error: “Transport retry count exceeded on mlx5_0:1/IB” is a known UCX/InfiniBand issue (`UCX issue #5199 <https://github.com/openucx/ucx/issues/5199>`__)

   -  This error typically occurs when there is excessive communication between MPI ranks across different nodes, indicating that the job has reached communication saturation and cannot utilize that number of nodes

   -  Solution: Reduce the number of nodes in the job allocation

   -  If reducing nodes is not feasible, set the UCX service level parameter as a workaround:

      .. code-block:: bash

         export UCX_IB_SL=0

   -  This parameter sets the InfiniBand service level to 0, which can resolve transport retry issues

   -  Note: ``export UCX_IB_SL=0`` is already included in all SLURM batch scripts in this document

   -  Note: Disabling InfiniBand (using TCP/IP instead) is not recommended as it significantly reduces performance for multi-node jobs

.. _adapting-for-custom-datasets-tensorflow:

Adapting for custom datasets

To use a custom dataset with TensorFlow:

1. Replace the dataset loading section:

   .. code-block:: python

      # Instead of tf.keras.datasets.mnist.load_data(), use a custom dataset
      import tensorflow as tf

      # Load data
      x_train, y_train = load_your_training_data()
      x_test, y_test = load_your_test_data()

      # Normalize and reshape as needed
      x_train = x_train.astype('float32') / 255.0
      x_test = x_test.astype('float32') / 255.0

      # Create TensorFlow datasets
      train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
      train_dataset = train_dataset.shard(hvd.size(), hvd.rank())  # Important for distributed training
      train_dataset = train_dataset.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

2. Adjust model architecture:

   -  Modify the ``create_model()`` function to match the input dimensions and number of classes
   -  Update normalization values if needed

3. Update data directory:

   -  Store the dataset in ``/valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/``
   -  Update the ``DATA_DIR`` environment variable in the SLURM script
   -  Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
   -  Never store data in home directory - always use ``/valhalla/projects/<your_slurm_project_account_name>/``

.. _summary-best-practices-for-distributed-training-on-discoverer:

Summary: Best practices for distributed training on Discoverer

1. **Use MPI** - Avoid Gloo in multi-user HPC environments due to possible port conflict issues

2. Proper combination of components:

   -  Horovod + MPI (simplest, works for both TensorFlow and PyTorch)
   -  PyTorch native DDP with MPI backend

3. Launch methods:

   -  With Horovod: ``srun --mpi=pmix python train.py`` (srun automatically uses all allocated tasks)
   -  With PyTorch DDP: ``srun --mpi=pmix python train.py``

4. **No MPI module needed for Horovod:** Open MPI is installed as a dependency in the virtual environment when Horovod is installed (Open MPI when PyTorch and Horovod are installed together) - no separate module loading required

.. _references:

References

-  `Discoverer HPC Resource Overview <https://docs.discoverer.bg/resource_overview.html>`__
-  `SLURM Documentation <https://slurm.schedmd.com/>`__
-  `PyTorch Distributed Training Documentation <https://docs.pytorch.org/docs/stable/distributed.html>`__
-  `TensorFlow Distributed Training Guide <https://www.tensorflow.org/guide/distributed_training>`__
-  Consider checking Discoverer HPC documentation for:

   -  Available software modules
   -  Storage locations and quotas
   -  Job submission best practices
   -  Partition-specific configurations

.. _getting-help:

Getting help

See :doc:`help`