ML/AI Workloads on Discoverer CPU Cluster

Discoverer CPU partition compute nodes specifications

For more details, see the Discoverer Resource Overview.

CPU cluster and ML/AI

CPU clusters like Discoverer are well-suited for ML/AI workloads, particularly when leveraging distributed training across multiple nodes. The large number of CPU cores (144,384 hardware cores, 288,768 logical CPUs with hyperthreading) and high-speed InfiniBand interconnect (200 Gbps per node) enable efficient parallel processing of machine learning tasks.

Distributed training as a basic principle

Distributed training is fundamental for effectively utilizing CPU clusters for ML/AI workloads. Rather than training models on a single node, distributed training divides the workload across multiple nodes and processes, enabling:

  • Faster training times: Multiple nodes work simultaneously on different data shards
  • Scalability: Training can scale from a few nodes to hundreds of nodes depending on the workload
  • Efficient resource allocation: Enables allocation of CPU cores across multiple nodes
  • Large dataset handling: Enables training on datasets that may not fit in a single node’s memory

On Discoverer, distributed training uses MPI (Message Passing Interface) for inter-node communication, which is the standard approach for HPC clusters. Frameworks like Horovod provide a unified interface for distributed training with both PyTorch and TensorFlow, simplifying the implementation of distributed training workflows.

Considerations for ML/AI on CPU clusters

  1. Data sharding: In distributed training, datasets are divided (sharded) across all worker processes. Each process sees only a subset of the data per epoch, which affects convergence behavior and requires careful tuning of learning rates and epoch counts (see the sketch after this list).
  2. Communication overhead: While distributed training speeds up training, communication between nodes adds overhead. The optimal number of nodes depends on the specific workload, dataset size, and model complexity.
  3. Resource allocation: Full node allocation (128 tasks per node, using all 256 logical CPUs) maximizes training efficiency. Partial allocation wastes resources and increases training time.
  4. Storage requirements: All data and outputs must be stored in project directories (/valhalla/projects/<your_slurm_project_account_name>/), never in home directories or /tmp.
  5. Python version compatibility: Python 3.11 is currently the optimal version for ML/AI workloads on Discoverer, providing best compatibility with TensorFlow, PyTorch, and Horovod.
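As a quick illustration of how sharding, communication, and learning-rate tuning interact, here is a back-of-the-envelope sketch with hypothetical numbers (not taken from a real Discoverer job):

# Back-of-the-envelope numbers behind points 1 and 3 (hypothetical values)
nodes, tasks_per_node = 4, 128
world_size = nodes * tasks_per_node             # number of worker processes (ranks)

dataset_size = 1_000_000                        # hypothetical number of training samples
per_rank_batch = 64
base_lr = 0.01

samples_per_rank = dataset_size // world_size   # shard seen by one rank per epoch
effective_batch = per_rank_batch * world_size   # global batch size per optimizer step
scaled_lr = base_lr * world_size                # common "scale LR by number of workers" heuristic

print(world_size, samples_per_rank, effective_batch, scaled_lr)

The larger effective batch size is the reason why the Horovod examples later in this guide scale the learning rate by hvd.size().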

Workload types suitable for CPU clusters

CPU clusters excel at:

  • Traditional machine learning (scikit-learn, XGBoost, etc.)
  • Deep learning with distributed training (PyTorch, TensorFlow)
  • Large-scale data preprocessing and feature engineering
  • Hyperparameter optimization with parallel trials
  • CPU-optimized inference workloads
  • Research and development workflows

For production training of very large models or extremely large datasets, GPU clusters may provide better performance, but CPU clusters are highly effective for many ML/AI workloads when properly configured for distributed training.

Limitations and considerations

Workloads suitable for CPU clusters:

  • Traditional ML algorithms (scikit-learn, XGBoost)
  • Data preprocessing and feature engineering
  • CPU-optimized inference
  • Hyperparameter optimization
  • Smaller neural networks
  • Batch inference workloads

What’s better on GPU clusters (Discoverer+ GPU partition):

  • Large-scale deep learning training
  • Transformer model training
  • Real-time inference requirements
  • Computer vision with large CNNs
  • High-throughput training workloads

Virtual environment setup

Note

Virtual environments replace the need for running containers and simplify the setup and installation process for ML/AI workloads on Discoverer CPU partition. They provide isolated Python environments with all necessary dependencies without the overhead of container management.

Use conda from the anaconda3 module

All virtual environments on Discoverer must be managed by conda from the anaconda3 environment module that is already installed on the system. Users must not download, install, or use their own Anaconda or Miniconda installations. Conda is always available on Discoverer through environment modules and must be accessed by loading the anaconda3 module:

module load anaconda3

After loading the module, the conda executable and the libraries that support that executable become available. All virtual environment creation, package installation, and management must be done using this system-provided Anaconda3 installation. Do not attempt to install Anaconda3/Miniconda3 yourself - the conda tool is already available through the environment module system.

Note

Users who prefer containers can use Singularity/Apptainer on the Discoverer CPU partition, but this document does not cover container-based workflows. The examples and instructions in this guide focus on virtual environments as the recommended approach.

Warning

Do not use conda activate when working with virtual environments on Discoverer. Always use --prefix with environment variable exports to expose the environment. Set VIRTUAL_ENV to the path of the virtual environment directory (e.g., /valhalla/projects/<your_slurm_project_account_name>/virt_envs/torch), then export:

export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

See the examples below for complete virtual environment installation scripts.

The libmamba solver is recommended for all conda-driven installations because it:

  • Resolves dependencies faster than the default solver
  • Handles complex dependency conflicts more reliably
  • Reduces warnings about deprecated version specifications

Common setup: Determine Python version compatibility

Before creating a virtual environment, determine the highest Python version supported by the target packages (pytorch-cpu or tensorflow-cpu). That may require some research, or a trial-and-error installation procedure in which candidate Python versions are tested and discarded when the installation fails.

Python 3.11 is currently the optimal Python version for ML/AI workloads on Discoverer. TensorFlow, PyTorch, Horovod, and related packages have the best compatibility with Python 3.11.

Note

In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions.

Since packages from conda-forge may not include Python version in the search metadata, test installation with different Python versions in different virtual environments (point to the selected environment by using --prefix):

#!/bin/bash
#SBATCH --job-name=test_python_version
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:30:00
#SBATCH -o test_python_version.%j.out
#SBATCH -e test_python_version.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Test which Python version works by trying to install the target package
# Replace PACKAGE_NAME with pytorch-cpu or tensorflow-cpu
# Replace <your_slurm_project_account_name> with the actual project account name

PACKAGE_NAME=pytorch-cpu  # or tensorflow-cpu

# Note: Python 3.11 is the current optimal version for ML/AI workloads on Discoverer.
# TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.
# This script tests higher versions first, but Python 3.11 should be used unless testing shows
# that a higher version is compatible. In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible
# with higher Python versions.

# Test Python 3.13 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py313
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.13 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
    echo "Python 3.13 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
    echo "A production environment can now be created with Python 3.13"
    exit 0
else
    echo "Python 3.13 is not supported. Cleaning up test environment and trying lower version."
    conda env remove --prefix ${VIRTUAL_ENV} -y
fi

# Test Python 3.12 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py312
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.12 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
    echo "Python 3.12 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
    echo "A production environment can now be created with Python 3.12"
    exit 0
else
    echo "Python 3.12 is not supported. Cleaning up test environment and trying lower version."
    conda env remove --prefix ${VIRTUAL_ENV} -y
fi

# Python 3.11 is the current optimal version - use this for production
# Continue with 3.10, etc. only if 3.11 doesn't work

Common setup: Create virtual environment with compatible Python version

Once the compatible Python version is determined, create the virtual environment using the template SLURM batch script provided below.

Warning

Do not use conda activate when working with virtual environments on Discoverer. Always use export PATH=${VIRTUAL_ENV}/bin:${PATH}, export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH} to expose the environment to the shell.

#!/bin/bash
#SBATCH --partition=cn
#SBATCH --job-name=install
#SBATCH --time=00:30:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH -o install.%j.out
#SBATCH -e install.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

# Check if the target folder already exists
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }

# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, those packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y

if [ $? -ne 0 ]; then
  echo "Conda Python virtual environment creation failed" >&2
  exit 1
fi

echo "Python virtual environment successfully created!"

# Fully expose the Python virtual environment
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# Verify Python version
python --version

Install PyTorch-CPU

On Discoverer CPU cluster, PyTorch-CPU must always be installed with nomkl because MKL (Intel Math Kernel Library) does not work well on AMD Zen2 processors (AMD EPYC 7H12 installed on Discoverer CPU compute nodes). The nomkl package ensures that MKL libraries are not used, preventing performance issues and compatibility problems on AMD architecture.

For distributed training with Horovod, install the pytorch-cpu and torchvision packages together with horovod in the same command line, to ensure that Open MPI is pulled in as the MPI dependency (otherwise MPICH might be installed) and that mpi4py is installed as a dependency.

Optional: Search for available PyTorch CPU packages

To check which PyTorch CPU versions are available before installing, save the SLURM batch script below as run_search_pytorch.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=search_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_pytorch.%j.out
#SBATCH -e search_pytorch.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Search for CPU-only PyTorch packages
conda search -c conda-forge pytorch-cpu

# Get detailed information
conda search -c conda-forge pytorch-cpu --info

Submit the job to the queue:

sbatch run_search_pytorch.sh

and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).

Installation script

Save the SLURM batch script provided below as run_install_pytorch.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=install_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_pytorch.%j.out
#SBATCH -e install_pytorch.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
    # Create Python virtual environment with Python 3.11
    # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
    # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
    # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
    conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y

    if [ $? -ne 0 ]; then
        echo "Conda Python virtual environment creation failed" >&2
        exit 1
    fi

    echo "Python virtual environment successfully created!"
else
    echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi

# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# For distributed training with Horovod:
# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba

# For basic PyTorch installation (without distributed training):
# Install PyTorch CPU version in nomkl mode (required)
# torchvision is needed for datasets and image transformations
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge pytorch-cpu torchvision nomkl --solver=libmamba

# Install additional packages if needed
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge transformers --solver=libmamba

# Install multiple packages at once (pytorch-cpu must always include nomkl)
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge \
#     pytorch-cpu \
#     torchvision \
#     nomkl \
#     transformers \
#     scikit-learn \
#     pandas \
#     numpy \
#     --solver=libmamba

Submit the job to the queue:

sbatch run_install_pytorch.sh

and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).

Note

  • With conda as installer, the PyTorch package to install is named pytorch-cpu, but within the Python code the installation is imported as a module under the name torch (import torch)
  • torchvision is required for datasets (e.g., MNIST) and image transformations
  • For distributed training: pytorch-cpu, horovod, and torchvision must be installed together in the same command line to ensure Open MPI is used instead of MPICH and to bring mpi4py as a dependency (a quick check is sketched after this list)
  • Always include nomkl when installing pytorch-cpu on Discoverer CPU partition because of the used AMD Zen2 processor microarchitecture
  • For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)
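A quick way to confirm which MPI implementation actually ended up in the environment (a sketch; run it through a SLURM job or interactive session with the virtual environment exposed, as in the verification scripts below) is to ask mpi4py for its library version string, which names either Open MPI or MPICH:

from mpi4py import MPI

# Prints a string such as "Open MPI v4.x, ..." or "MPICH Version: ...",
# revealing which MPI implementation conda resolved for Horovod.
print(MPI.Get_library_version())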

Verify PyTorch-CPU installation

After installing pytorch-cpu, verify the installation. Save the SLURM batch script below as run_verify_pytorch.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=verify_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_pytorch.%j.out
#SBATCH -e verify_pytorch.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# Check PyTorch version
python -c "import torch; print('PyTorch version:', torch.__version__)"

# Check torchvision version
python -c "import torchvision; print('torchvision version:', torchvision.__version__)"

# Verify PyTorch is CPU-only (should return False for CUDA availability)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Check the number of CPU threads PyTorch can use (reported as if SMT were not present)
python -c "import torch; print('Number of threads:', torch.get_num_threads())"

# Test basic tensor operations
python -c "import torch; x = torch.randn(5, 3); print('Tensor shape:', x.shape); print('Tensor device:', x.device)"

Submit the job to the queue:

sbatch run_verify_pytorch.sh

and follow the content written to the standard output and standard error files (verify_pytorch.JOBID.out and verify_pytorch.JOBID.err) to see the verification results.

Expected output should show:

  • PyTorch version number
  • torchvision version number
  • CUDA available: False (since this is CPU-only)
  • Number of threads available (SMT is not taken into account)
  • Tensor operations working correctly

Examine available PyTorch methods, functions, and constants

To inspect what methods, functions, and constants are available in the installed PyTorch, save the SLURM batch script below as run_examine_pytorch.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=examine_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_pytorch.%j.out
#SBATCH -e examine_pytorch.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch

# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# List all available attributes in torch module
python -c "import torch; print('Available attributes:'); print([attr for attr in dir(torch) if not attr.startswith('_')])"

# Check specific submodules
python -c "import torch; print('torch.nn available:', hasattr(torch, 'nn')); print('torch.optim available:', hasattr(torch, 'optim')); print('torch.distributed available:', hasattr(torch, 'distributed'))"

# Check BLAS backend (should show OpenBLAS when nomkl is used)
python -c "import torch; print('BLAS library:', torch.__config__.show())"

# Check if MKLDNN is actually installed and available at runtime
python -c "import torch; print('MKLDNN available:', torch.backends.mkldnn.is_available() if hasattr(torch.backends, 'mkldnn') else 'mkldnn backend not found')"
python -c "import torch; print('MKLDNN enabled:', torch.backends.mkldnn.enabled if hasattr(torch.backends, 'mkldnn') else 'N/A')"

# List available tensor operations
python -c "import torch; ops = [attr for attr in dir(torch) if callable(getattr(torch, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"

# Check constants
python -c "import torch; print('Float32:', torch.float32); print('CPU device:', torch.device('cpu')); print('Default dtype:', torch.get_default_dtype())"

Submit the job to the queue:

sbatch run_examine_pytorch.sh

and follow the content written to the standard output and standard error files (examine_pytorch.JOBID.out and examine_pytorch.JOBID.err) to see the available methods, functions, and constants.

Alternatively, run interactively using srun:

srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
     --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash

# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import torch; print('Available attributes:', [attr for attr in dir(torch) if not attr.startswith('_')])"
# ... (continue with other examination commands)

These tests help to verify:

  • Which PyTorch modules are available (nn, optim, distributed, etc.)
  • Which BLAS library is being used (should be OpenBLAS with nomkl)
  • Available tensor operations and constants

The build configuration should show BLAS_INFO=open and LAPACK_INFO=open, confirming that OpenBLAS is used for BLAS and LAPACK operations instead of MKL. This is the expected behavior when using nomkl. The USE_MKLDNN=1 setting indicates that MKLDNN (Intel MKL-DNN, now oneDNN) is available. MKLDNN is embedded in the libtorch/pytorch package and statically linked into PyTorch, so it remains available even when nomkl is used. This configuration is correct: nomkl prevents MKL BLAS/LAPACK usage (which does not work well on AMD Zen2 processors), while MKLDNN (for neural network operations) remains available as it is part of the PyTorch build itself.
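A small assertion-style check of the statements above (a sketch based on the substrings that torch.__config__.show() is expected to contain for this build):

import torch

cfg = torch.__config__.show()
print("OpenBLAS used for BLAS:  ", "BLAS_INFO=open" in cfg)
print("OpenBLAS used for LAPACK:", "LAPACK_INFO=open" in cfg)
print("MKLDNN built in:         ", "USE_MKLDNN=1" in cfg)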

Install TensorFlow-CPU

Optional: Search for available TensorFlow packages

To check which TensorFlow CPU versions are available before installing, save the SLURM batch script below as run_search_tensorflow.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=search_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_tensorflow.%j.out
#SBATCH -e search_tensorflow.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Search for TensorFlow CPU packages
conda search -c conda-forge tensorflow-cpu

# Search for specific package with build string
conda search -c conda-forge "tensorflow-cpu=*"

# Check what's available in the configured channels
conda search tensorflow-cpu

Submit the job to the queue:

sbatch run_search_tensorflow.sh

and follow the content written to the standard output and standard error files (search_tensorflow.JOBID.out and search_tensorflow.JOBID.err) to see which TensorFlow CPU versions are available.

Installation script

To use Horovod for distributed training, tensorflow-cpu and horovod must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer. Installing them together ensures Horovod is built with CPU-only support.

Important note on compatibility: Even when installing together, conda’s dependency resolver may select versions that are technically compatible at the package level but have runtime API incompatibilities (e.g., Horovod callbacks may not work with certain TensorFlow versions). Training scripts should handle this by falling back to manual synchronization when callbacks are incompatible.
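As a hedged sketch of what such manual synchronization can look like (the bundled example scripts may implement the fallback differently; hvd.broadcast_variables is part of the horovod.tensorflow API):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Build the same model on every rank
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Manual synchronization, used instead of BroadcastGlobalVariablesCallback when the
# callback is incompatible with the installed TensorFlow version: copy rank 0's
# initial weights to every other worker so that all ranks start from the same state.
hvd.broadcast_variables(model.variables, root_rank=0)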

Save the SLURM batch script provided below as run_install_tensorflow.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=install_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_tensorflow.%j.out
#SBATCH -e install_tensorflow.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
    # Create Python virtual environment with Python 3.11
    # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
    # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
    # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
    conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y

    if [ $? -ne 0 ]; then
        echo "Conda Python virtual environment creation failed" >&2
        exit 1
    fi

    echo "Python virtual environment successfully created!"
else
    echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi

# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# Set channel priority to flexible (required when packages exist in multiple channels)
conda config --set channel_priority flexible

# Install TensorFlow CPU and Horovod together (required for CPU-only clusters)
# If Horovod is installed before TensorFlow, it will install CUDA dependencies which are not needed on CPU-only clusters
# Installing them together ensures Horovod is built with CPU-only support and conda resolves compatible versions

# Option 1: Let conda choose latest compatible versions (recommended for best compatibility)
conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba

# Option 2: Pin TensorFlow version, let conda choose compatible Horovod
# Use this to install a specific TensorFlow version (replace <version> with the desired version)
# conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu=<version> horovod nomkl --solver=libmamba

Submit the job to the queue:

sbatch run_install_tensorflow.sh

and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).

Important

  • Install TensorFlow and Horovod together: To use Horovod for distributed training, tensorflow-cpu and horovod must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer.
  • Version compatibility: Conda’s dependency resolver does not guarantee runtime API compatibility. Even when installing together, runtime API incompatibilities may be encountered (e.g., Horovod callbacks failing with AttributeError: 'Variable' object has no attribute 'ref'). The training script handles this automatically by falling back to manual synchronization.
  • Use flexible channel priority: When “excluded by strict repo priority” appears, set channel_priority to flexible to allow conda to use packages from any configured channel
  • Do not specify build string: If pinning a version, install by version only (e.g., tensorflow-cpu=<version>), not with the build string (e.g., tensorflow-cpu=<version>=cpu_py313hbca4264_0)
  • Include nomkl: Similar to PyTorch, include nomkl for AMD Zen2 processors

Runtime API compatibility: TensorFlow and Horovod may have runtime API incompatibilities even when conda installs them together. If errors like Horovod has been shut down or AttributeError: 'Variable' object has no attribute 'ref' are encountered, this indicates the installed versions are incompatible. The training script attempts to handle this, but if Horovod shuts down during initialization, training will fail. In this case, different version combinations may need to be tried or PyTorch may be used instead, which has better compatibility with Horovod.
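When reporting or debugging such incompatibilities, it helps to record the exact version pair that conda resolved (a small sketch):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
if hvd.rank() == 0:
    # Knowing the resolved version pair helps when pinning a known-good
    # combination in a future installation run.
    print("TensorFlow:", tf.__version__)
    print("Horovod:   ", hvd.__version__)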

Verify TensorFlow installation

After installing tensorflow-cpu, verify the installation. Save the SLURM batch script below as run_verify_tensorflow.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=verify_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_tensorflow.%j.out
#SBATCH -e verify_tensorflow.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# Check TensorFlow version
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"

# Check available CPU devices (TensorFlow shows logical CPU devices, not physical sockets)
python -c "import tensorflow as tf; print('CPU devices:', tf.config.list_physical_devices('CPU'))"

# Check number of CPU threads TensorFlow can use
python -c "import tensorflow as tf; print('Inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"

# Check total CPU cores available to TensorFlow
python -c "import tensorflow as tf; import os; print('Total CPU cores available:', len(os.sched_getaffinity(0)) if hasattr(os, 'sched_getaffinity') else 'N/A')"

# Configure TensorFlow to use all available CPU threads (256 threads per node on Discoverer)
python -c "import tensorflow as tf; tf.config.threading.set_inter_op_parallelism_threads(256); tf.config.threading.set_intra_op_parallelism_threads(256); print('Configured inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Configured intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"

# Check if TensorFlow has MPI integration
python -c "import tensorflow as tf; print('tf.distribute available:', hasattr(tf, 'distribute')); print('tf.distribute.Strategy available:', hasattr(tf.distribute, 'Strategy') if hasattr(tf, 'distribute') else False)"

# Check TensorFlow build configuration for MPI support
python -c "import tensorflow as tf; config = tf.sysconfig.get_build_info(); print('Build flags:', config.get('cpu_compiler_flags', 'N/A')); import sysconfig; print('Has MPI:', 'mpi' in str(sysconfig.get_config_var('LIBS')).lower() or 'mpi' in str(tf.sysconfig.get_compile_flags()).lower())"

# Check Horovod installation (if installed with TensorFlow)
python -c "try: import horovod.tensorflow as hvd; print('Horovod version:', hvd.__version__); print('Horovod TensorFlow support: OK'); except ImportError: print('Horovod not installed or TensorFlow support not available')"

# Test basic tensor operations
python -c "import tensorflow as tf; x = tf.constant([[1.0, 2.0], [3.0, 4.0]]); print('Tensor shape:', x.shape); print('Tensor device:', x.device); y = tf.matmul(x, x); print('Matrix multiplication result shape:', y.shape)"

Submit the job to the queue:

sbatch run_verify_tensorflow.sh

and follow the content written to the standard output and standard error files (verify_tensorflow.JOBID.out and verify_tensorflow.JOBID.err) to see the verification results.

Expected output should show:

  • TensorFlow version number
  • CPU devices available
  • Thread configuration (inter-op and intra-op parallelism)
  • Horovod installation status (if installed)
  • Tensor operations working correctly

Examine available TensorFlow methods, functions, and constants

To inspect what methods, functions, and constants are available in the installed TensorFlow, save the SLURM batch script below as run_examine_tensorflow.sh and provide the correct account and qos values:

#!/bin/bash
#SBATCH --job-name=examine_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_tensorflow.%j.out
#SBATCH -e examine_tensorflow.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

# List all available attributes in tensorflow module
python -c "import tensorflow as tf; print('Available attributes:'); print([attr for attr in dir(tf) if not attr.startswith('_')])"

# Check specific submodules
python -c "import tensorflow as tf; print('tf.keras available:', hasattr(tf, 'keras')); print('tf.nn available:', hasattr(tf, 'nn')); print('tf.optimizers available:', hasattr(tf, 'optimizers')); print('tf.distribute available:', hasattr(tf, 'distribute'))"

# Check TensorFlow build configuration
python -c "import tensorflow as tf; print('Build info:', tf.sysconfig.get_build_info())"

# Check available operations
python -c "import tensorflow as tf; ops = [attr for attr in dir(tf) if callable(getattr(tf, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"

# Check constants and dtypes
python -c "import tensorflow as tf; print('Float32:', tf.float32); print('Int32:', tf.int32); print('Default float dtype:', tf.keras.backend.floatx())"

# Check Keras availability and version
python -c "import tensorflow as tf; print('Keras version:', tf.keras.__version__ if hasattr(tf.keras, '__version__') else 'N/A'); print('Keras available:', hasattr(tf, 'keras'))"

Submit the job to the queue:

sbatch run_examine_tensorflow.sh

and follow the content written to the standard output and standard error files (examine_tensorflow.JOBID.out and examine_tensorflow.JOBID.err) to see the available methods, functions, and constants.

Alternatively, run interactively using srun:

srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
     --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash

# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import tensorflow as tf; print('Available attributes:', [attr for attr in dir(tf) if not attr.startswith('_')])"
# ... (continue with other examination commands)

These tests help to verify:

  • Which TensorFlow modules are available (keras, nn, optimizers, distribute, etc.)
  • Build configuration and available features
  • Available tensor operations and constants
  • Keras integration and version

Create SLURM configuration for ML/AI workloads on CPU clusters

  1. Resource allocation
  • CPU-intensive tasks: Request all 256 threads per node
  • Memory-intensive tasks: Use fat nodes (1TB RAM) when needed
  • I/O-intensive tasks: Consider network-attached storage performance
  • Wall time limits: All jobs have maximum runtime limits specified with #SBATCH --time=. Jobs are automatically terminated when the wall time limit is reached. Plan checkpoint intervals accordingly (see Checkpointing)
  2. Job arrays for parallel experiments
#SBATCH --array=0-99%10  # Run 100 jobs, max 10 concurrent
  3. Memory management
  • Monitor memory usage with sstat during job execution
  • Use memory-efficient libraries (e.g., pandas with chunking)
  • Consider using fat nodes for large datasets
  4. Checkpointing

All SLURM jobs on Discoverer have wall time limitations (specified with #SBATCH --time=). When a job reaches its wall time limit, it is automatically terminated by SLURM, regardless of whether training has completed. Therefore, checkpointing is essential for any training job that may exceed the allocated wall time.

Why checkpoints are necessary:

  • Jobs are terminated when wall time is reached, even if training is incomplete
  • Without checkpoints, all progress is lost and training must be restarted from the beginning
  • Checkpoints allow resuming training from the last saved state after the job is terminated

Best practices:

  • Implement checkpointing for all long-running training jobs
  • Save checkpoints to network storage (/valhalla/projects/<your_slurm_project_account_name>/outputs/)
  • Plan checkpoint intervals based on the wall time limit (e.g., if wall time is 24 hours, save checkpoints every 2-4 hours)
  • Ensure checkpoint saving happens frequently enough that significant progress is not lost if the job is terminated
  • Only rank 0 should write checkpoints in distributed training to avoid conflicts
  • Include epoch number, model state, optimizer state, and any other necessary information in checkpoints
  • Implement checkpoint loading at the start of training to resume from the last checkpoint if available (see the sketch after this list)
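A minimal sketch of these practices with PyTorch and Horovod (the checkpoint path and the stored fields are illustrative, not the exact layout used by the bundled example scripts):

import os
import torch
import horovod.torch as hvd

# Illustrative location; point this at the project outputs directory in a real job
CHECKPOINT = os.environ.get("CHECKPOINT_PATH", "checkpoint.pt")

def save_checkpoint(model, optimizer, epoch):
    # Only rank 0 writes, so several ranks never clobber the same file
    if hvd.rank() == 0:
        torch.save({"epoch": epoch,
                    "model_state": model.state_dict(),
                    "optimizer_state": optimizer.state_dict()},
                   CHECKPOINT)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint, if one exists; return the epoch to start from
    if os.path.exists(CHECKPOINT):
        state = torch.load(CHECKPOINT, map_location="cpu")
        model.load_state_dict(state["model_state"])
        optimizer.load_state_dict(state["optimizer_state"])
        return state["epoch"] + 1
    return 0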

Suitable ML/AI workloads for CPU clusters

  1. Traditional machine learning

Scikit-learn workflows

  • Large-scale feature engineering and preprocessing
  • Hyperparameter optimization with grid/random search
  • Cross-validation on large datasets
  • Ensemble methods (Random Forest, Gradient Boosting)
  • Model training on datasets that fit in memory (up to 251 GB allocatable per node)

XGBoost/LightGBM/CatBoost

  • Tabular data classification/regression
  • Large-scale gradient boosting
  • Feature importance analysis
  • Model interpretation
  2. Deep learning (CPU-optimized)

TensorFlow/Keras with CPU optimization

  • Training smaller neural networks (CNNs, RNNs, MLPs)
  • Inference workloads
  • Transfer learning with pre-trained models
  • Model fine-tuning
  • Distributed training (see official TensorFlow distributed documentation)

PyTorch with CPU backend

  • Research prototyping
  • Model development and debugging
  • CPU-optimized inference
  • Distributed training (see official PyTorch distributed documentation)
  3. Natural language processing (NLP)

Traditional NLP pipelines

  • Large-scale text preprocessing and tokenization
  • Feature extraction (TF-IDF, word embeddings)
  • Document classification
  • Sentiment analysis on large corpora

Transformer models (CPU inference)

  • Batch inference with Hugging Face models
  • Text generation (smaller models)
  • Embedding extraction
  • Model quantization and optimization
  4. Computer vision

Image processing and feature extraction

  • Large-scale image preprocessing
  • Feature extraction (SIFT, SURF, HOG)
  • Image classification with traditional ML
  • Batch image transformations
  5. Data science and analytics

Large-scale data processing

  • Pandas/NumPy operations on large datasets
  • Dask for large-scale data processing
  • Feature engineering pipelines
  • Data validation and cleaning
  6. Hyperparameter optimization

Hyperparameter optimization

  • Grid search with parallel trials
  • Random search with parallel trials
  • Bayesian optimization
  • AutoML frameworks
  7. Reinforcement learning

CPU-based RL training

  • Policy gradient methods
  • Q-learning variants
  • Multi-agent systems
  • Environment simulation

Distributed training

For multi-user HPC clusters like Discoverer, MPI is the recommended and verified approach for distributed training.

Why MPI over Gloo?

Gloo (PyTorch’s TCP/IP-based backend) has major practical problems on shared HPC clusters:

  • Port conflicts: Multiple users can pick the same MASTER_PORT, causing job failures
  • No automatic port management: Manual coordination required between users
  • Race conditions: Port availability can change between check and use
  • Administrative burden: Unsuitable for multi-user environments

MPI was designed for multi-user HPC environments and handles all communication automatically without port conflicts. It provides:

  • No port management: the MPI runtime handles all communication
  • Multi-user safe: designed for shared clusters
  • Battle-tested: decades of HPC use
  • SLURM integration: seamless with srun --mpi=

Conclusion: For multi-user HPC clusters, MPI is the correct choice. Gloo should be avoided in shared environments.

TensorFlow distributed training with Horovod

Horovod provides a unified interface for distributed training with both TensorFlow and PyTorch. For a complete working example with TensorFlow, see the Complete working example: Distributed training on MNIST dataset with TensorFlow section below.

Basic TensorFlow with Horovod setup:

import tensorflow as tf
import horovod.tensorflow.keras as hvd  # Keras API: provides DistributedOptimizer and hvd.callbacks for model.fit

# Initialize Horovod
hvd.init()

# Build and compile model
model = tf.keras.Sequential([
    # Add layers here
])

# Scale learning rate by number of workers
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())

# Wrap optimizer with Horovod
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

# Add Horovod callbacks
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    # Add other callbacks here
]

# Only rank 0 should write checkpoints
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

model.fit(dataset, epochs=10, callbacks=callbacks)
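The snippet above assumes that dataset is already prepared per worker. One way to shard an input pipeline so that each rank reads only its own 1/hvd.size() slice is tf.data's shard transformation (a sketch with synthetic stand-in data):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Synthetic stand-in for real training data
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 28, 28]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int32)))

# Give each Horovod rank its own shard, then shuffle and batch it
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = dataset.shuffle(1024).batch(64)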

Running PyTorch workloads

Package installation for distributed training

For distributed training with Horovod, install PyTorch, Horovod, and torchvision together in the same command. See Install PyTorch-CPU in the Virtual environment setup section for complete installation instructions.

Important

Python version requirement:

  • Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
  • TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11
  • The virtual environment must be created with Python 3.11 for optimal compatibility
  • Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions

Important

MPI implementation for Horovod:

  • Install PyTorch and Horovod together to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed)
  • No separate MPI module needs to be loaded - Open MPI is installed as a dependency in the virtual environment when Horovod is installed
  • Use srun --mpi=pmix to launch jobs - srun automatically uses all allocated tasks, no need to specify -np
  • The mpirun command is also available from the virtual environment (installed as part of Open MPI dependency), but srun is preferred for SLURM integration

Installation command for distributed training:

# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba

Note

  • With conda, the package is named pytorch-cpu (or pytorch), and it is imported as import torch
  • torchvision is required for datasets (e.g., MNIST) and image transformations
  • pytorch-cpu, horovod, and torchvision must be installed together in the same command to ensure Open MPI is used instead of MPICH and to bring mpi4py as a dependency
  • Open MPI is installed as a dependency in the virtual environment when Horovod is installed - no separate MPI module needs to be loaded
  • Always include nomkl when installing pytorch-cpu on Discoverer (AMD Zen2 processors)
  • For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)

PyTorch distributed training with MPI

Option 1: PyTorch native DDP with MPI backend

SLURM batch script:

#!/bin/bash
#SBATCH --job-name=pytorch_distributed
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o pytorch_distributed.%j.out
#SBATCH -e pytorch_distributed.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch with SLURM's MPI integration
srun --mpi=pmix python train.py

Python code example:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

# Initialize process group with MPI backend
# No MASTER_ADDR, no MASTER_PORT needed - MPI handles everything
dist.init_process_group(backend='mpi')

# Get local rank and world size
local_rank = dist.get_rank()
world_size = dist.get_world_size()

# Create model and move to device
model = YourModel()
model = DDP(model)

# Create distributed sampler for data loading
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=world_size,
    rank=local_rank
)

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    sampler=train_sampler
)

# Training loop
for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)
    for data, target in train_loader:
        # Training code here
        pass

# Clean up
dist.destroy_process_group()

Option 2: PyTorch with Horovod

Horovod works with both TensorFlow and PyTorch and simplifies distributed training setup.

SLURM batch script:

#!/bin/bash
#SBATCH --job-name=horovod_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o horovod_pytorch.%j.out
#SBATCH -e horovod_pytorch.%j.err

cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }

export UCX_IB_SL=0

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }

export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch with srun (srun automatically uses all allocated tasks)
srun --mpi=pmix python train_horovod.py

Python code example:

import torch
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Scale learning rate by number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap optimizer with Horovod
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters()
)

# Broadcast initial parameters from rank 0 to all workers
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Training loop
for epoch in range(num_epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

Complete working example: Distributed training on MNIST dataset

This section provides a complete, working example of distributed training using the MNIST handwritten digit dataset. MNIST is a standard benchmark dataset that’s automatically downloaded when the code is run, making it ideal for testing distributed training setups.

Example files: The complete training script and SLURM batch script are provided as separate, runnable files:

  • train_mnist_horovod_pytorch.py - Complete Python training script
  • run_mnist_training_pytorch.sh - SLURM batch submission script

Discoverer Storage Requirements:

  • All data and outputs must be stored under /valhalla/projects/<your_slurm_project_account_name>/ (replace <your_slurm_project_account_name> with the SLURM account name)
  • Never store data in the home directory (~/) - this violates Discoverer’s best practices
  • The example scripts automatically use /valhalla/projects/<your_slurm_project_account_name>/data/ for datasets
  • Outputs (checkpoints, logs) should go to /valhalla/projects/<your_slurm_project_account_name>/outputs/
  • The SLURM script creates these directories automatically

These files can be copied and run directly after configuring the SLURM account settings.

Overview

  • Dataset: MNIST
  • Training set: 60,000 images
  • Test set: 10,000 images
  • Image size: 28×28 pixels
  • Data source: Automatically downloaded from PyTorch’s built-in datasets (no manual download needed)
  • Model: Simple convolutional neural network (CNN) suitable for CPU training (a sketch of such a model follows below)
  • Framework: PyTorch with Horovod
  • Training: Distributed across multiple nodes using MPI
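A hedged sketch of the kind of small CNN such a script can use (the actual model in train_mnist_horovod_pytorch.py may differ in its details):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Small CNN for 28x28 grayscale MNIST digits, light enough for CPU training."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)    # 1x28x28  -> 32x26x26
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)   # 32x26x26 -> 64x24x24
        self.fc1 = nn.Linear(64 * 12 * 12, 128)         # after 2x2 max pooling
        self.fc2 = nn.Linear(128, 10)                   # 10 digit classes

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)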

Step 1: Prepare the working directory

On the Discoverer login node, create a directory for the training job and copy the example files:

# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_pytorch.py .
cp /path/to/run_mnist_training_pytorch.sh .

Note: The example files (train_mnist_horovod_pytorch.py and run_mnist_training_pytorch.sh) are provided as separate, runnable scripts that can be copied and used directly.

Step 2: Get the training script

Copy the training script to the working directory:

# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_pytorch.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

The training script (train_mnist_horovod_pytorch.py) contains:

  • CNN model definition for MNIST digit classification
  • Distributed data loading with Horovod
  • Training and validation loops with metric aggregation
  • Automatic MNIST dataset download (only rank 0 downloads)
  • Complete distributed training implementation

The full script can be viewed in train_mnist_horovod_pytorch.py or modified as needed.

Step 3: Get the SLURM batch script

Copy the SLURM batch script to the working directory:

# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_pytorch.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example

Before submitting, edit run_mnist_training_pytorch.sh and replace:

  • <your_slurm_project_account_name> with the actual SLURM account name
  • <your_qos> with the actual QoS name
  • Verify the virtual environment path matches the setup

The SLURM script (run_mnist_training_pytorch.sh) contains:

  • Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
  • Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
  • Module loading (anaconda3 only - Open MPI is installed as a dependency in the virtual environment when Horovod is installed)
  • Virtual environment setup
  • Job launch with srun --mpi=pmix (srun automatically uses all allocated tasks)

Understanding epochs in distributed training:

  • Epochs are shared across all ranks, not per-rank
  • With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
  • Each rank processes a different subset of data during each epoch (see the worked example after this list)
  • All ranks work together on the same epochs simultaneously
  • 256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks
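A small worked example of the last two points, using the MNIST numbers from the Overview:

# With 2 nodes x 128 tasks = 256 ranks and the 60,000-image MNIST training set,
# every rank processes roughly 234 images per epoch, regardless of the epoch count.
world_size = 2 * 128
train_samples = 60_000
print(train_samples // world_size)   # 234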

Resource allocation for full node allocation:

  • Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
  • Recommended: --ntasks-per-node=128 (one rank per core)
  • Recommended: --cpus-per-task=2 (use both threads per core)
  • Memory: Adjust --mem based on the model size (e.g., 251G for full node - maximum available on Discoverer)
  • Total ranks: 2 nodes × 128 tasks = 256 ranks
  • Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes

The full script can be viewed in run_mnist_training_pytorch.sh and resource requirements can be modified as needed.

Step 4: Make the script executable and submit the job

On the login node:

# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_pytorch.sh

Before submitting: Ensure run_mnist_training_pytorch.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct.

Step 5: Monitor the job

# Check job status
squeue --me

# Monitor output in real-time (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f pytorch_cpu_training.JOBID.out

# Check for errors
tail -f pytorch_cpu_training.JOBID.err

Step 6: Check job runtime and statistics

After the job completes, check how long it took:

# Show job-level summary only (aggregated, no job steps)
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS

# Show detailed accounting with job steps (default behavior)
sacct -j JOBID --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,TotalCPU,MaxRSS,MaxVMSize,NodeList

# Show elapsed time (wall clock time) for a job
sacct -j JOBID -X --format=JobID,JobName,Elapsed,State

# Show resource usage summary
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,MaxVMSize,ReqMem,AllocCPUS

# Show all jobs for today
sacct --starttime=today -X --format=JobID,JobName,Partition,Elapsed,State,ExitCode

Understanding sacct output:

Without -X flag (default): Shows job steps separately

  • JOBID: Main job (aggregated across all steps)
  • JOBID.batch: The batch script execution
  • JOBID.0, JOBID.1, etc.: Individual job steps (e.g., MPI processes, srun steps)

With -X flag: Shows only job-level summary (aggregated)

  • Single line with total statistics for the entire job

Example output without -X:

$ sacct -j 4060972 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972      mnist_hor+   00:00:56   05:03:21                   512
4060972.bat+      batch   00:00:56  00:00.513     21176K        256
4060972.0    hydra_pmi+   00:00:55   05:03:21 150700188K          4

Example output with -X (job-level only):

$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972      mnist_hor+   00:00:56   00:00:00                   512

Common fields:

  • Elapsed: Wall clock time (how long the job ran)
  • TotalCPU: Total CPU time used (sum across all CPUs)
  • MaxRSS: Maximum resident set size (peak memory usage)
  • MaxVMSize: Maximum virtual memory size
  • AllocCPUS: Number of CPUs allocated

Recommendation: Use -X flag for quick job-level summaries. Remove -X to see detailed breakdown by job steps.

Performance comparison: Scaling with number of CPUs

The benefits of fully allocating nodes are clear when comparing runtime. Note that SLURM’s Elapsed time includes job startup/teardown overhead, while the training script reports actual computation time:

Partial allocation (8 tasks per node, 32 CPUs total):

$ sacct -j 4060968 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060968      mnist_hor+   00:02:35   00:00:00                    32
  • Wall time (SLURM): 2 minutes 35 seconds
  • Allocated resources: 32 CPUs (8 tasks × 2 nodes × 2 CPUs per task)

Full node allocation (128 tasks per node, 512 CPUs total):

$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972      mnist_hor+   00:00:56   00:00:00                   512
  • Wall time (SLURM): 56 seconds
  • Allocated resources: 512 CPUs (128 tasks × 2 nodes × 2 CPUs per task)
  • Speedup: ~2.8× faster (155 seconds → 56 seconds)

Large-scale runs (multiple nodes):

8 nodes (1024 CPUs total):

$ sacct -j 4060975 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060975      mnist_hor+   00:00:52   00:00:00                  1024
  • Wall time (SLURM): 52 seconds
  • Actual training time (from logs): 18.73 seconds
  • Average time per epoch: 3.75 seconds
  • Allocated resources: 1024 CPUs (128 tasks × 8 nodes × 2 CPUs per task)

16 nodes (2048 CPUs total):

$ sacct -j 4060974 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
       JobID    JobName    Elapsed   TotalCPU     MaxRSS  AllocCPUS
------------ ---------- ---------- ---------- ----------
4060974      mnist_hor+   00:00:49   00:00:00                  2048
  • Wall time (SLURM): 49 seconds
  • Actual training time (from logs): 14.75 seconds
  • Average time per epoch: 2.95 seconds
  • Allocated resources: 2048 CPUs (128 tasks × 16 nodes × 2 CPUs per task)
  • Speedup vs 1024 CPUs: ~1.3× faster training time (18.73s → 14.75s)

Analysis

  • Wall time vs. training time: SLURM’s Elapsed includes job startup/teardown overhead. The training script’s reported time is the actual computation time.
  • Scaling efficiency: Training time scales well from 512 → 1024 → 2048 CPUs, showing good parallel efficiency (a quick calculation follows this list).
  • Overhead: Job startup/teardown overhead becomes more noticeable at larger scales (e.g., 49s wall time vs 14.75s training time for 2048 CPUs).
  • Full node allocation: Using 128 tasks per node (one per core) maximizes resource allocation and minimizes training time.
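As a quick calculation of that efficiency, the speedup and parallel efficiency for the last doubling (1024 → 2048 CPUs) can be derived directly from the training times reported by the script:

# Training times reported by the script (seconds)
t_1024, t_2048 = 18.73, 14.75

speedup = t_1024 / t_2048                # ~1.27x when doubling the CPU count
efficiency = speedup / (2048 / 1024)     # ~0.63, i.e. communication overhead grows
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")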

Scalability testing results

The following results demonstrate how to measure scalability and training speed across different node configurations. These are speed/scalability tests, not optimized for accuracy. The accuracy values shown are lower than optimal because the focus is on demonstrating distributed training execution and measuring speedup.

2 nodes (256 ranks):

  • Training time: 23.80 seconds
  • Learning rate: 0.16 (sqrt scaling)
  • Test accuracy: 66.66% → 10.08% (unstable - not optimized for accuracy)

4 nodes (512 ranks):

  • Training time: 19.07 seconds
  • Learning rate: 0.226 (sqrt scaling)
  • Test accuracy: 64.99% → 21.16% (unstable - not optimized for accuracy)

8 nodes (1024 ranks):

  • Training time: 15.37 seconds
  • Learning rate: 0.1 (fixed scaling)
  • Test accuracy: 19.84% → 66.18% (improving but not optimized)

16 nodes (2048 ranks):

  • Training time: 15.10 seconds
  • Learning rate: 0.1 (fixed scaling)
  • Test accuracy: 20.68% → 64.96% (improving but not optimized)

Analysis:

  • Training speed scales well: 23.80s → 19.07s → 15.37s → 15.10s
  • Diminishing returns beyond 8 nodes (communication overhead)
  • These results demonstrate scalability - actual accuracy requires further tuning (see below)

Achieving better accuracy

To achieve high accuracy (e.g., ~98-99% for MNIST) instead of just speed testing, apply these changes to the training script (a combined sketch follows the list):

  1. Reduce base learning rate: Change base_lr = 0.01 to base_lr = 0.001 or 0.0001
  2. Train for more epochs: Change num_epochs = 5 to num_epochs = 10 or 20
  3. Use Adam optimizer: Replace optim.SGD with optim.Adam or optim.AdamW for better stability
  4. Add gradient clipping: Before optimizer.step(), add torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  5. Use learning rate schedule: Add StepLR scheduler to decay learning rate over epochs
  6. Increase batch size: Change batch_size = 64 to 128 or 256 (if memory allows)
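A combined sketch of those changes is shown below. It is illustrative only: the model and data loader are placeholders standing in for the ones already defined in train_mnist_horovod_pytorch.py, and the gradient clipping uses Horovod's synchronize()/skip_synchronize() pattern so that clipping happens on the fully averaged gradients.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
import horovod.torch as hvd

hvd.init()

# Placeholders standing in for the CNN and DataLoader defined in the actual script
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
train_loader = [(torch.randn(128, 1, 28, 28), torch.randint(0, 10, (128,)))]  # dummy batch

base_lr = 0.001                                                     # 1. lower base learning rate
num_epochs = 10                                                     # 2. more epochs

optimizer = optim.AdamW(model.parameters(), lr=base_lr * hvd.size() ** 0.5)   # 3. AdamW for stability
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)               # 5. decay LR over epochs

for epoch in range(num_epochs):
    for images, labels in train_loader:                             # 6. larger batch (128) per rank
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.synchronize()                                     # ensure gradients are averaged
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. gradient clipping
        with optimizer.skip_synchronize():
            optimizer.step()
    scheduler.step()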

Note

Better accuracy typically requires more epochs, lower learning rates, and more careful tuning, which increases training time. For production training, start with 2-4 nodes to establish baseline accuracy, then scale up while monitoring convergence.

Full node allocation significantly reduces training time. For the same workload, using 128 tasks per node instead of 8 tasks per node provides nearly 3× speedup while fully allocating the available resources. Scaling to multiple nodes further reduces training time, though with diminishing returns due to communication overhead.

Understanding the data flow

  1. Data location:
    • The MNIST dataset is already available at /opt/software/test/pytorch/mnist_data
    • The dataset is automatically loaded from this location - no download is needed
    • The dataset is approximately 60 MB and includes both training and test sets
    • The SLURM script sets DATA_DIR environment variable to this location
  2. Data distribution:
    • Each process gets a different subset of the training data via DistributedSampler
    • With 256 processes (2 nodes × 128 tasks, full allocation), each process handles ~234 training images per epoch
    • Data is automatically shuffled differently each epoch
  3. Model synchronization:
    • Initial model parameters are broadcast from rank 0 to all workers
    • After each training step, gradients are averaged across all processes using Horovod’s ring-allreduce algorithm
    • All processes maintain identical model parameters throughout training
  4. Output:
    • Only rank 0 prints training progress to avoid duplicate output
    • Metrics (loss, accuracy) are averaged across all processes before printing (see the sketch after this list)
    • Checkpoints should be saved to /valhalla/projects/<your_slurm_project_account_name>/outputs (never use home directory)
    • The SLURM script sets OUTPUT_DIR environment variable for this purpose
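The rank-0-only output and metric averaging described above reduce to a small amount of code; the sketch below (with a placeholder metric and checkpoint payload) shows the general shape:

import os
import torch
import horovod.torch as hvd

hvd.init()

# Average a per-rank metric (e.g., epoch loss) across all ranks before printing
local_loss = torch.tensor(0.123)                        # placeholder per-rank value
avg_loss = hvd.allreduce(local_loss, name="avg_loss")   # averages across ranks by default

if hvd.rank() == 0:
    print(f"Average loss across {hvd.size()} ranks: {avg_loss.item():.4f}")

    # Only rank 0 writes checkpoints, always under the project output directory
    output_dir = os.environ.get("OUTPUT_DIR", "./outputs")
    os.makedirs(output_dir, exist_ok=True)
    torch.save({"avg_loss": avg_loss.item()}, os.path.join(output_dir, "checkpoint.pt"))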

Expected output

When checking the output file (pytorch_cpu_training.JOBID.out), the output should be similar to:

Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/pytorch/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/pytorch/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/pytorch/mnist_data
Training dataset download completed
Starting training for 5 epochs...
Epoch 1, Batch 0, Loss: 2.3130
Epoch 1 - Average Loss: 0.8803, Average Accuracy: 72.08%
Test Loss: 0.0985, Test Accuracy: 97.07%
Epoch 2, Batch 0, Loss: 0.2280
Epoch 2 - Average Loss: 0.1597, Average Accuracy: 95.36%
Test Loss: 0.0540, Test Accuracy: 98.29%
Epoch 3, Batch 0, Loss: 0.1328
Epoch 3 - Average Loss: 0.1004, Average Accuracy: 97.09%
Test Loss: 0.0488, Test Accuracy: 98.45%
Epoch 4, Batch 0, Loss: 0.0106
Epoch 4 - Average Loss: 0.0834, Average Accuracy: 97.63%
Test Loss: 0.0505, Test Accuracy: 98.45%
Epoch 5, Batch 0, Loss: 0.1305
Epoch 5 - Average Loss: 0.0726, Average Accuracy: 97.86%
Test Loss: 0.0322, Test Accuracy: 98.99%
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Training job completed

Expected output characteristics:

  • Job information header: Shows resource allocation and paths
  • Process count: Should match --nodes × --ntasks-per-node (e.g., 2 × 128 = 256 for full allocation)
  • Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
  • Dataset loading: “Training dataset download completed” is printed once rank 0 has loaded the dataset from the local copy (no actual download takes place)
  • Training progress: Loss should decrease and accuracy should increase over epochs
  • Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
  • Completion message: “Training job completed” indicates successful finish

Note

Only MPI rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.

Understanding process structure

When checking running processes with top or htop on a compute node, the following will be visible:

  1. Main Python training processes: 128 processes (one per --ntasks-per-node, matching one per core)
    • High CPU usage (180-220%) - using multiple threads (2 threads per process)
    • Memory usage: ~500-530 MB per process
    • These are the distributed training workers
    • Each process uses 2 threads (matching --cpus-per-task=2)
  2. pt_data_worker processes: 256 processes (2 per training process)
    • Lower CPU usage (~2%) - data loading workers
    • Memory usage: ~280-290 MB per worker
    • These are PyTorch DataLoader worker processes

The purpose of pt_data_worker processes:

  • PyTorch’s DataLoader uses num_workers=2 (set in the training script)
  • Each training process spawns 2 worker processes to load data in parallel
  • This prevents data loading from blocking training computation
  • With 128 training processes × 2 workers = 256 pt_data_worker processes per node

Total processes per node (full allocation):

  • 128 main Python processes (training, one per core)
  • 256 pt_data_worker processes (data loading)
  • Total: 384 processes per node
  • Total across 2 nodes: 768 processes (256 training + 512 data workers)

Resource allocation:

  • CPU cores: 128 cores per node × 2 nodes = 256 cores (fully utilized)
  • Threads: 256 ranks × 2 threads = 512 threads (fully utilizing hyperthreading)
  • Memory: Adjust --mem based on model size (e.g., 251G per node for full allocation - maximum available on Discoverer)

This is normal and expected behavior. The worker processes keep the CPUs busy by pre-loading the next batch while the current batch is being processed.

To reduce worker processes (if needed for resource constraints):

  • With 256 ranks, num_workers=2 creates 512 data worker processes in total (256 × 2)
  • Consider reducing to num_workers=1 for large-scale runs (256 worker processes in total)
  • Or set num_workers=0 to disable parallel data loading (slower but minimal overhead)
  • For full node allocation (256 ranks), num_workers=1 is often a good balance (a minimal DataLoader example follows)
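A minimal DataLoader configuration illustrating the trade-off (dummy tensors stand in for one rank's MNIST shard):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for the sharded MNIST subset of one rank
dataset = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))

# num_workers controls how many pt_data_worker processes each rank spawns:
#   2 -> two workers per rank (512 in total for 256 ranks)
#   1 -> one worker per rank (256 in total), often enough for full-node runs
#   0 -> data is loaded inside the training process (no extra processes)
loader = DataLoader(dataset, batch_size=64, num_workers=1, persistent_workers=True)

for images, labels in loader:
    pass  # the training step would go here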

Troubleshooting

  1. Job stuck in pending state with reason “(None)”:

    • “(None)” means the job is waiting in the queue without a specific reason code

    • This is common for large resource requests (e.g., 2 full nodes)

    • Check detailed job information:

      scontrol show job JOBID
      
    • Check job priority and position in queue:

      sprio -j JOBID
      
    • Check partition availability:

      sinfo -p cn
      
    • Check available nodes:

      sinfo -p cn -o '%N %t %C %m'
      
    • Common reasons for delay:

      • No 2 full nodes available simultaneously (most common for full node requests)
      • Other jobs ahead in queue with higher priority
      • Account/QoS limits (check with sacctmgr show assoc user=$USER)
      • Partition limits or maintenance
    • Solutions:

      • Wait for resources to become available
      • Reduce resource request (fewer nodes, fewer tasks per node)
      • Check with cluster administrators about account limits
      • Consider using fewer nodes or reducing memory request if possible
  2. Python version compatibility issues:

    • Python 3.11 is the current optimal version for ML/AI workloads on Discoverer

    • If Python version errors are encountered, recreate the virtual environment with Python 3.11:

      # Remove the existing environment
      conda env remove --prefix ${VIRTUAL_ENV} -y
      
      # Create new environment with Python 3.11 (current optimal version)
      # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
      # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
      # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
      conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
      
      # Then install packages again (install pytorch-cpu, horovod, and torchvision together to ensure Open MPI and mpi4py)
      conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
      
  3. Dataset location issues:

    • The MNIST dataset should already be available at /opt/software/test/pytorch/mnist_data
    • If the dataset is not found, verify the path is correct and accessible from compute nodes
    • The SLURM script sets DATA_DIR to this location automatically
  4. Out of memory:

    • Reduce batch_size in the training script (currently 64)
    • Reduce number of workers in DataLoader (num_workers=2)
  5. Slow training:

    • Increase batch_size if memory allows
    • Reduce number of epochs for testing
    • Check that OMP_NUM_THREADS is set correctly
  6. Training produces NaN loss values:

    • Cause: Learning rate is too high when scaled linearly with large number of ranks
    • With 256+ ranks, linear scaling (base_lr × num_ranks) creates extremely high learning rates
    • Example: 0.01 × 2048 = 20.48 (way too high, causes divergence)
    • Solution: The training script now uses square root scaling for large numbers of ranks
    • For ≤32 ranks: linear scaling (base_lr × num_ranks)
    • For >32 ranks: square root scaling (base_lr × sqrt(num_ranks))
    • This prevents NaN values while still benefiting from distributed training
    • If NaN values still appear, try reducing base_lr further (e.g., 0.001 instead of 0.01); a sketch of this scaling logic follows the troubleshooting list
  7. MPI errors:

    • Open MPI is installed into the virtual environment as a dependency when PyTorch and Horovod are installed together - no separate MPI module needs to be loaded
    • Use srun --mpi=pmix to launch jobs - srun automatically uses all allocated tasks (no need for -np $SLURM_NTASKS)
    • Verify that MPI libraries are available: which mpirun (should be in the virtual environment, though srun is preferred)
    • Check that Horovod was installed correctly: conda list horovod
    • Ensure virtual environment has horovod and mpi4py installed
    • Important: Install PyTorch, Horovod, and torchvision together (conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba) to ensure Open MPI is used and to bring mpi4py as a dependency
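The learning-rate scaling behaviour referenced in item 6 can be expressed as a small helper; this is a sketch of the approach (the threshold and base values mirror the description above):

import math

def scale_learning_rate(base_lr: float, num_ranks: int) -> float:
    """Linear scaling for small rank counts, square-root scaling for large ones."""
    if num_ranks <= 32:
        return base_lr * num_ranks          # linear scaling
    return base_lr * math.sqrt(num_ranks)   # square-root scaling

print(scale_learning_rate(0.01, 16))    # 0.16 (linear)
print(scale_learning_rate(0.01, 256))   # 0.16 (sqrt: 0.01 * 16)
print(scale_learning_rate(0.01, 2048))  # ~0.45 instead of 20.48 with linear scaling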

Adapting for custom datasets

To use a custom dataset:

  1. Replace the dataset loading section:

    # Instead of datasets.MNIST, use a custom dataset
    from torch.utils.data import Dataset
    
    class YourDataset(Dataset):
        def __init__(self, data_path, transform=None):
            # Load samples and labels (load_your_data is a placeholder for the loading logic)
            self.data, self.targets = load_your_data(data_path)
            self.transform = transform
    
        def __len__(self):
            return len(self.data)
    
        def __getitem__(self, idx):
            # Return a (image, label) tuple, applying the transform if provided
            image, label = self.data[idx], self.targets[idx]
            if self.transform is not None:
                image = self.transform(image)
            return image, label
    
    train_dataset = YourDataset(data_path='/path/to/data', transform=transform)
    
  2. Adjust model architecture:

    • Modify MNIST_CNN class to match the input dimensions and number of classes
    • Update normalization values in transforms if needed
  3. Update data directory:

    • For the example MNIST training, the dataset is at /opt/software/test/pytorch/mnist_data
    • For custom datasets, store them in /valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/
    • Update the DATA_DIR environment variable in the SLURM script to point to the dataset location (the training script can read it as shown in the sketch after this list)
    • Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
    • Never store data in home directory - always use /valhalla/projects/<your_slurm_project_account_name>/
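Inside the training script, reading those environment variables keeps paths out of the code; a minimal sketch (the default values are placeholders):

import os

# The SLURM batch script exports DATA_DIR and OUTPUT_DIR, e.g.
#   DATA_DIR=/valhalla/projects/<your_slurm_project_account_name>/data/dataset_name
#   OUTPUT_DIR=/valhalla/projects/<your_slurm_project_account_name>/outputs
data_dir = os.environ.get("DATA_DIR", "./data")
output_dir = os.environ.get("OUTPUT_DIR", "./outputs")
os.makedirs(output_dir, exist_ok=True)
print(f"Reading data from {data_dir}, writing outputs to {output_dir}")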

Scaling analysis: Why different node counts require different numbers of epochs

In distributed training with Horovod, the training dataset is sharded (divided) across all worker processes (ranks). Each rank processes only its assigned subset of data during each epoch.

Data sharding formula:

Samples per rank per epoch = Total training samples / Number of ranks

As the number of nodes (and therefore ranks) increases:

  1. Fewer samples per rank: Each rank sees progressively fewer training samples per epoch
    • 2 nodes (256 ranks): 60,000 / 256 = 234.4 samples per rank
    • 4 nodes (512 ranks): 60,000 / 512 = 117.2 samples per rank
    • 8 nodes (1024 ranks): 60,000 / 1024 = 58.6 samples per rank
    • 16 nodes (2048 ranks): 60,000 / 2048 = 29.3 samples per rank
  2. Reduced effective batch size per rank: With fewer samples, batch sizes must be reduced to ensure multiple batches per epoch
    • More ranks → smaller batch sizes (64 → 32 → 16); see the sketch after this list
  3. More epochs needed for convergence: With fewer samples per rank, each rank needs more epochs to:
    • See enough diverse data to learn meaningful patterns
    • Accumulate sufficient gradient updates for convergence
    • Overcome the statistical noise from limited data per rank
  4. Learning rate scaling: Learning rates are scaled more conservatively with more ranks to prevent training instability
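The sketch below makes the per-rank arithmetic concrete. The batch-size rule (halve it each time the rank count doubles, but not below 16) is an illustrative heuristic that reproduces the values in the results table below, not necessarily the script's exact logic:

def plan_sharding(total_samples: int, num_ranks: int,
                  base_batch_size: int = 64, base_ranks: int = 256,
                  min_batch_size: int = 16) -> tuple[float, int]:
    """Return (samples per rank, suggested batch size) for a given rank count."""
    samples_per_rank = total_samples / num_ranks
    batch_size, ranks = base_batch_size, base_ranks
    while ranks < num_ranks and batch_size > min_batch_size:
        batch_size //= 2      # halve the batch size for each doubling of ranks
        ranks *= 2
    return samples_per_rank, batch_size

for ranks in (256, 512, 1024, 2048):
    spr, bs = plan_sharding(60_000, ranks)
    print(f"{ranks:4d} ranks: {spr:6.1f} samples per rank, batch size {bs}")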

Explanation: 16 nodes required more epochs than 8 nodes

With 16 nodes (2048 ranks), each rank sees only 29.3 samples per epoch. This extremely limited data per rank means:

  • Each rank sees very little data diversity per epoch
  • Gradient estimates become noisier, requiring more iterations to converge
  • More communication overhead with more ranks can introduce slight inconsistencies
  • Beyond a certain point, adding more ranks does not linearly improve convergence speed because of this data limitation

The following table shows PyTorch training results across different node configurations:

Metric                2 Nodes (256 ranks)   4 Nodes (512 ranks)   8 Nodes (1024 ranks)   16 Nodes (2048 ranks)
Total Ranks           256                   512                   1024                   2048
Samples per Rank      234.4                 117.2                 58.6                   29.3
Batch Size            64                    32                    16                     16
Learning Rate         0.016000              0.012126              0.005657               0.006727
LR Scaling            sqrt (0.5)            conservative (0.4)    conservative (0.25)    conservative (0.25)
Max Epochs            20                    25                    30                     30
Epochs to 98%         17                    13                    13                     28
Final Accuracy        98.14%                98.01%                98.11%                 98.12%
Total Training Time   71.44s                52.56s                52.80s                 94.36s
Time per Epoch        4.20s                 4.04s                 4.06s                  3.37s
Early Stopping        Yes                   Yes                   Yes                    Yes

Highlights:

  1. Training time efficiency: Based on total training time, 4 nodes (512 MPI ranks) is the most efficient configuration, reaching 98% accuracy in 52.56 seconds. 8 nodes (1024 ranks) is nearly as efficient at 52.80 seconds, while 2 nodes requires 71.44 seconds and 16 nodes requires 94.36 seconds.
  2. Convergence patterns: 4 and 8 nodes show fast convergence (13 epochs), while 16 nodes requires 28 epochs due to extremely limited data per rank
  3. Learning rate scaling: Becomes more conservative as ranks increase to prevent training instability
  4. Batch size adaptation: Automatically reduced with more ranks to ensure sufficient batches per epoch

Recommendations:

  • Optimal node count for this workload: 4-8 nodes provide the best balance of speed and resource allocation
  • Avoid over-scaling: Beyond 8 nodes, the benefits diminish due to data sharding limitations
  • Early stopping is effective: All runs successfully used early stopping at 98%, saving computation time
  • Adaptive epoch scaling: The script correctly scales max epochs based on rank count, allowing sufficient training time for larger scales

Estimating the optimal number of parallel ranks for training jobs:

For each training job, estimate the optimal number of MPI ranks by:

  1. Running multiple test runs with different node counts (e.g., 2, 4, 8, 16 nodes) using the same model architecture and training setup
  2. Comparing total training time, convergence speed (epochs to target accuracy), and resource allocation across different configurations
  3. Identifying the configuration that achieves the target accuracy in the shortest time while maintaining stable convergence
  4. Running production training jobs based on the estimated optimal configuration from these test runs

Note

The optimal configuration may vary depending on dataset size, model complexity, and target accuracy. Always validate the optimal rank count through empirical testing before running large-scale production jobs.

Complete working example: Distributed training on MNIST dataset with TensorFlow

This section provides a complete, working example of distributed training using TensorFlow/Keras with Horovod on the MNIST handwritten digit dataset. This example follows the same structure as the PyTorch example above but uses TensorFlow/Keras instead.

Example files: The complete training script and SLURM batch script are provided as separate, runnable files:

  • train_mnist_horovod_tf.py - Complete Python training script for TensorFlow
  • run_mnist_training_tf.sh - SLURM batch submission script for TensorFlow

Discoverer Storage Requirements:

  • All outputs must be stored under /valhalla/projects/<your_slurm_project_account_name>/ (replace <your_slurm_project_account_name> with the SLURM account name)
  • Never store data or other large files (such as outputs) in the user’s home directory (~/) - doing so violates the file system usage best practices for the Discoverer CPU partition
  • The MNIST dataset is already available locally at /opt/software/test/tf/mnist_data - no download is needed
  • Outputs (checkpoints, logs) should go to /valhalla/projects/<your_slurm_project_account_name>/outputs/
  • The SLURM script creates the output directory automatically

Overview

Dataset: MNIST, containing 70,000 grayscale images of handwritten digits (0-9):

  • Training set: 60,000 images
  • Test set: 10,000 images
  • Image size: 28×28 pixels
  • Data source: The MNIST dataset is already available locally and is automatically loaded from the local storage location
  • Model: Simple convolutional neural network (CNN) suitable for CPU training
  • Framework: TensorFlow/Keras with Horovod
  • Training: Distributed across multiple nodes using MPI

Prerequisites:

  • TensorFlow and Horovod must be installed together in the same conda install command to ensure CPU-only build
  • If Horovod is installed before TensorFlow, it may install CUDA dependencies which are not needed on CPU-only clusters like Discoverer CPU partition
  • Installation command: conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba
  • See the TensorFlow installation section for complete installation instructions

Step 1: Prepare the working directory

On the Discoverer login node, create a directory for the training job and copy the example files:

# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_tf.py .
cp /path/to/run_mnist_training_tf.sh .

Note

The example files (train_mnist_horovod_tf.py and run_mnist_training_tf.sh) are provided as separate, runnable scripts that can be copied and used directly.

Step 2: Get the training script

Copy the training script to the working directory:

# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_tf.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

The training script (train_mnist_horovod_tf.py) contains:

  • CNN model definition using TensorFlow/Keras Sequential API
  • Distributed data loading with TensorFlow Dataset API and Horovod sharding
  • Training and validation loops with metric aggregation
  • MNIST dataset loading from local storage
  • Complete distributed training implementation with Horovod callbacks

The full script can be viewed in train_mnist_horovod_tf.py or modified as needed.
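The key TensorFlow/Keras and Horovod pieces listed above combine roughly as in the sketch below. This is not the full train_mnist_horovod_tf.py: the dataset-loading line is a placeholder (the actual script reads the local copy under DATA_DIR), the model is simplified, and exact details may vary with the installed TensorFlow/Horovod versions.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Placeholder loading step - the actual script loads the local copy under DATA_DIR
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., tf.newaxis].astype("float32") / 255.0

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shard(hvd.size(), hvd.rank())                 # each rank gets its own shard
dataset = dataset.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the optimizer so gradients are averaged across ranks
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size() ** 0.5))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),   # start from identical weights
    hvd.callbacks.MetricAverageCallback(),               # average metrics across ranks
]
model.fit(dataset, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)            # only rank 0 prints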

Step 3: Get the SLURM batch script

Copy the SLURM batch script to the working directory:

# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_tf.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf

Before submitting, edit run_mnist_training_tf.sh and replace:

  • <your_slurm_project_account_name> with the actual SLURM account name
  • <your_qos> with the actual QoS name
  • Verify the virtual environment path matches the setup (should point to the TensorFlow virtual environment)

The SLURM script (run_mnist_training_tf.sh) contains:

  • Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
  • Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
  • Module loading (only anaconda3 is required - Open MPI is already installed in the virtual environment as a dependency of Horovod)
  • Virtual environment setup
  • Job launch with srun --mpi=pmix (srun automatically uses all allocated tasks)

Understanding epochs in distributed training:

  • Epochs are shared across all ranks, not per-rank
  • With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
  • Each rank processes a different subset of data during each epoch
  • All ranks work together on the same epochs simultaneously
  • 256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks

Resource allocation for full node allocation:

  • Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
  • Recommended: --ntasks-per-node=128 (one rank per core)
  • Recommended: --cpus-per-task=2 (use both threads per core)
  • Memory: Adjust --mem based on the model size (e.g., 251G for full node - maximum available on Discoverer)
  • Total ranks: 2 nodes × 128 tasks = 256 ranks
  • Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes

Step 4: Submit the job

# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_tf.sh

Before submitting: Ensure run_mnist_training_tf.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct (should point to the TensorFlow virtual environment, not PyTorch).

Step 5: Monitor the job

# Check job status
squeue --me

# View output (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f mnist_training_tf.JOBID.out

# View errors (if any)
tail -f mnist_training_tf.JOBID.err

Step 6: Check the results

When the job completes, check the output file for training progress and final metrics.

When checking the output file (mnist_training_tf.JOBID.out), the output should be similar to:

Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/tf/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/tf/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/tf/mnist_data
Training dataset download completed
Using learning rate: 0.16 (base: 0.01, ranks: 256, scaling: sqrt)
Starting training for 5 epochs...
Epoch 1, Loss: 0.8803, Accuracy: 72.08%, Val Loss: 0.0985, Val Accuracy: 97.07%, Time: 19.89s
Epoch 2, Loss: 0.1597, Accuracy: 95.36%, Val Loss: 0.0540, Val Accuracy: 98.29%, Time: 19.45s
Epoch 3, Loss: 0.1004, Accuracy: 97.09%, Val Loss: 0.0488, Val Accuracy: 98.45%, Time: 19.32s
Epoch 4, Loss: 0.0834, Accuracy: 97.63%, Val Loss: 0.0505, Val Accuracy: 98.45%, Time: 19.28s
Epoch 5, Loss: 0.0726, Accuracy: 97.86%, Val Loss: 0.0322, Val Accuracy: 98.99%, Time: 19.21s
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Evaluating on test set...
Test Loss: 0.0322, Test Accuracy: 98.99%
Training job completed

Expected output characteristics:

  • Job information header: Shows resource allocation and paths
  • Process count: Should match --nodes × --ntasks-per-node (e.g., 2 × 128 = 256 for full allocation)
  • Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
  • Dataset loading: “Training dataset download completed” is printed once rank 0 has loaded the dataset from the local copy (no actual download takes place)
  • Training progress: Loss should decrease and accuracy should increase over epochs
  • Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
  • Completion message: “Training job completed” indicates successful finish

Note: Only rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.

Differences from PyTorch example

The TensorFlow implementation differs from the PyTorch example in several ways:

  • Data loading: TensorFlow uses tf.data.Dataset with .shard() for distributed data splitting, while PyTorch uses DistributedSampler
  • Model definition: TensorFlow uses Keras Sequential API, while PyTorch uses nn.Module
  • Training loop: TensorFlow uses model.fit() with callbacks, while PyTorch uses manual training loops
  • Metric aggregation: TensorFlow uses Horovod’s MetricAverageCallback, while PyTorch uses hvd.allreduce() manually
  • Dataset download: TensorFlow downloads to a .npz file, while PyTorch downloads to a directory structure

Troubleshooting

Most troubleshooting steps are the same as the PyTorch example. Additional TensorFlow-specific considerations:

  1. TensorFlow dataset sharding:

    • Ensure .shard(hvd.size(), hvd.rank()) is called on the dataset before batching
    • Each MPI rank must process a different subset of data
  2. Horovod callbacks:

    • Always include BroadcastGlobalVariablesCallback(0) to synchronize initial model parameters
    • Include MetricAverageCallback() to average metrics across MPI ranks
    • Only MPI rank #0 should write checkpoints (use if hvd.rank() == 0:)
  3. TensorFlow/Keras compatibility:

    • Ensure TensorFlow and Horovod versions are compatible
    • Check with: python -c "import tensorflow as tf; import horovod.tensorflow as hvd; print('TF:', tf.__version__, 'Horovod:', hvd.__version__)"
  4. Memory issues:

    • TensorFlow may use more memory than PyTorch for the same model
    • Reduce batch_size if OOM errors are encountered
    • Use tf.data.Dataset prefetching judiciously - large prefetch buffers increase memory usage
  5. InfiniBand transport errors:

    • Error: “Transport retry count exceeded on mlx5_0:1/IB” is a known UCX/InfiniBand issue (UCX issue #5199)

    • This error typically occurs when there is excessive communication between MPI ranks across different nodes, indicating that the job has reached communication saturation and cannot utilize that number of nodes

    • Solution: Reduce the number of nodes in the job allocation

    • If reducing nodes is not feasible, set the UCX service level parameter as a workaround:

      export UCX_IB_SL=0
      
    • This parameter sets the InfiniBand service level to 0, which can resolve transport retry issues

    • Note: export UCX_IB_SL=0 is already included in all SLURM batch scripts in this document

    • Note: Disabling InfiniBand (using TCP/IP instead) is not recommended as it significantly reduces performance for multi-node jobs

Adapting for custom datasets

To use a custom dataset with TensorFlow:

  1. Replace the dataset loading section:

    # Instead of tf.keras.datasets.mnist.load_data(), use a custom dataset
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd   # provides hvd.size() / hvd.rank()
    
    # Load data (load_your_training_data / load_your_test_data are placeholders)
    x_train, y_train = load_your_training_data()
    x_test, y_test = load_your_test_data()
    
    # Normalize and reshape as needed
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    
    # Create TensorFlow datasets
    batch_size = 64
    train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_dataset = train_dataset.shard(hvd.size(), hvd.rank())  # Important for distributed training
    train_dataset = train_dataset.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    
  2. Adjust model architecture:

    • Modify the create_model() function to match the input dimensions and number of classes
    • Update normalization values if needed
  3. Update data directory:

    • Store the dataset in /valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/
    • Update the DATA_DIR environment variable in the SLURM script
    • Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
    • Never store data in home directory - always use /valhalla/projects/<your_slurm_project_account_name>/

Summary: Best practices for distributed training on Discoverer

  1. Use MPI - Avoid Gloo in multi-user HPC environments due to possible port conflict issues
  2. Proper combination of components:
    • Horovod + MPI (simplest, works for both TensorFlow and PyTorch)
    • PyTorch native DDP with MPI backend
  3. Launch methods:
    • With Horovod: srun --mpi=pmix python train.py (srun automatically uses all allocated tasks)
    • With PyTorch DDP: srun --mpi=pmix python train.py
  4. No MPI module needed for Horovod: Open MPI is installed into the virtual environment as a dependency when PyTorch (or TensorFlow) and Horovod are installed together - no separate module loading is required

References

Getting help

See Getting help