ML/AI Workloads on Discoverer CPU Cluster
Table of contents
Create SLURM configuration for ML/AI workloads on CPU clusters
Suitable ML/AI workloads for CPU clusters
Summary: Best practices for distributed training on Discoverer
Discoverer CPU partition compute nodes specifications
For more datails see Discoverer Reseource Overview:
CPU cluster and ML/AI
CPU clusters like Discoverer are well-suited for ML/AI workloads, particularly when leveraging distributed training across multiple nodes. The large number of CPU cores (144,384 hardware cores, 288,768 logical CPUs with hyperthreading) and high-speed InfiniBand interconnect (200 Gbps per node) enable efficient parallel processing of machine learning tasks.
Distributed training as a basic principle
Distributed training is fundamental for effectively utilizing CPU clusters for ML/AI workloads. Rather than training models on a single node, distributed training divides the workload across multiple nodes and processes, enabling:
Faster training times: Multiple nodes work simultaneously on different data shards
Scalability: Training can scale from a few nodes to hundreds of nodes depending on the workload
Efficient resource allocation: Enables allocation of CPU cores across multiple nodes
Large dataset handling: Enables training on datasets that may not fit in a single node’s memory
On Discoverer, distributed training uses MPI (Message Passing Interface) for inter-node communication, which is the standard approach for HPC clusters. Frameworks like Horovod provide a unified interface for distributed training with both PyTorch and TensorFlow, simplifying the implementation of distributed training workflows.
Considerations for ML/AI on CPU clusters
Data sharding: In distributed training, datasets are divided (sharded) across all worker processes. Each process sees only a subset of the data per epoch, which affects convergence behavior and requires careful tuning of learning rates and epoch counts.
Communication overhead: While distributed training speeds up training, communication between nodes adds overhead. The optimal number of nodes depends on the specific workload, dataset size, and model complexity.
Resource allocation: Full node allocation (128 tasks per node, using all 256 logical CPUs) maximizes training efficiency. Partial allocation wastes resources and increases training time.
Storage requirements: All data and outputs must be stored in project directories (
/valhalla/projects/<your_slurm_project_account_name>/), never in home directories or/tmp.Python version compatibility: Python 3.11 is currently the optimal version for ML/AI workloads on Discoverer, providing best compatibility with TensorFlow, PyTorch, and Horovod.
Workload types suitable for CPU clusters
CPU clusters excel at: - Traditional machine learning (scikit-learn, XGBoost, etc.) - Deep learning with distributed training (PyTorch, TensorFlow) - Large-scale data preprocessing and feature engineering - Hyperparameter optimization with parallel trials - CPU-optimized inference workloads - Research and development workflows
For production training of very large models or extremely large datasets, GPU clusters may provide better performance, but CPU clusters are highly effective for many ML/AI workloads when properly configured for distributed training.
Limitations and considerations
Workloads suitable for CPU clusters: - Traditional ML algorithms (scikit-learn, XGBoost) - Data preprocessing and feature engineering - CPU-optimized inference - Hyperparameter optimization - Smaller neural networks - Batch inference workloads
What’s better on GPU clusters (Discoverer+ GPU partition - Large-scale deep learning training - Transformer model training - Real-time inference requirements - Computer vision with large CNNs - High-throughput training workloads
Virtual environment setup
Note
Virtual environments replace the need for running containers and simplify the setup and installation process for ML/AI workloads on Discoverer CPU partition. They provide isolated Python environments with all necessary dependencies without the overhead of container management.
Use conda from the anaconda3 module
All virtual environments on Discoverer must be managed by conda from the anaconda3 environment module that is already installed on the system. Users must not download, install, or use their own Anaconda or Miniconda installations. Conda is always available on Discoverer through environment modules and must be accessed by loading the anaconda3 module:
module load anaconda3
After loading the module, the conda executable and the libraries that support that executable become available. All virtual environment creation, package installation, and management must be done using this system-provided Anaconda3 installation. Do not attempt to install Anaconda3/Miniconda3 yourself - conda tool is already available through the environment module system.
Note
Users who prefer containers can use Singularity/Apptainer on the Discoverer CPU partition, but this document does not cover container-based workflows. The examples and instructions in this guide focus on virtual environments as the recommended approach.
Warning
Do not use conda activate when working with virtual environments on Discoverer. Always use --prefix with environment variable exports to expose the environment. Set VIRTUAL_ENV to the path of the virtual environment directory (e.g., /valhalla/projects/<your_slurm_project_account_name>/virt_envs/torch), then export:
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
See the examples below for complete virtual environment installation scripts.
The libmamba solver is recommended to suppoprt all conda-driven installations because it:
Resolves dependencies faster than the default solver
Handles complex dependency conflicts more reliably
Reduces warnings about deprecated version specifications
Common setup: Determine Python version compatibility
Before creating a virtual environment, determine the highest Python version supported by the target packages (pytorch-cpu or tensorflow-cpu). That my need some research or negative feedback organised installation procedure.
3.11 is the current optimal Python version for ML/AI workloads on Discoverer. TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.
Note
In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions.
Since packages from conda-forge may not include Python version in the search metadata, test installation with different Python versions in different virtual environments (point to the selected environment by using --prefix):
#!/bin/bash
#SBATCH --job-name=search_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_pytorch.%j.out
#SBATCH -e search_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for CPU-only PyTorch packages
conda search -c conda-forge pytorch-cpu
# Get detailed information
conda search -c conda-forge pytorch-cpu --info
#!/bin/bash
#SBATCH --job-name=test_python_version
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:30:00
#SBATCH -o test_python_version.%j.out
#SBATCH -e test_python_version.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Test which Python version works by trying to install the target package
# Replace PACKAGE_NAME with pytorch-cpu or tensorflow-cpu
# Replace <your_slurm_project_account_name> with the actual project account name
PACKAGE_NAME=pytorch-cpu # or tensorflow-cpu
# Note: Python 3.11 is the current optimal version for ML/AI workloads on Discoverer.
# TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.
# This script tests higher versions first, but Python 3.11 should be used unless testing shows
# that a higher version is compatible. In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible
# with higher Python versions.
# Test Python 3.13 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py313
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.13 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
echo "Python 3.13 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
echo "A production environment can now be created with Python 3.13"
exit 0
else
echo "Python 3.13 is not supported. Cleaning up test environment and trying lower version."
conda env remove --prefix ${VIRTUAL_ENV} -y
fi
# Test Python 3.12 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py312
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.12 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
echo "Python 3.12 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
echo "A production environment can now be created with Python 3.12"
exit 0
else
echo "Python 3.12 is not supported. Cleaning up test environment and trying lower version."
conda env remove --prefix ${VIRTUAL_ENV} -y
fi
# Python 3.11 is the current optimal version - use this for production
# Continue with 3.10, etc. only if 3.11 doesn't work
Common setup: Create virtual environment with compatible Python version
Once the compatible Python version is determined, create the virtual environment using the template SLURM batch script provided below.
Warning
Do not use conda activate when working with virtual environments on Discoverer. Always use export PATH=${VIRTUAL_ENV}/bin:${PATH}, export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH} to expose the environment to the shell.
#!/bin/bash
#SBATCH --partition=cn
#SBATCH --job-name=install
#SBATCH --time=00:30:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH -o install.%j.out
#SBATCH -e install.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Check if the target folder already exists
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, those packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
# Fully expose the Python virtual environment
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Verify Python version
python --version
Install PyTorch-CPU
On Discoverer CPU cluster, PyTorch-CPU must always be installed with nomkl because MKL (Intel Math Kernel Library) does not work well on AMD Zen2 processors (AMD EPYC 7H12 installed on Discoverer CPU compute nodes). The nomkl package ensures that MKL libraries are not used, preventing performance issues and compatibility problems on AMD architecture.
For distributed training with Horovod, install pytorch-cpu, and torchvision packages together with horovod in the same command line to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed), and to install mpi4py as a dependency.
Optional: Search for available PyTorch CPU packages
To check which PyTorch CPU versions are available before installing, save the SLURM batch script below as run_search_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=search_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_pytorch.%j.out
#SBATCH -e search_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for CPU-only PyTorch packages
conda search -c conda-forge pytorch-cpu
# Get detailed information
conda search -c conda-forge pytorch-cpu --info
Submit the job to the queue:
sbatch run_search_pytorch.sh
and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).
Installation script
Save the SLURM batch script provided below as run_install_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=install_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_pytorch.%j.out
#SBATCH -e install_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
else
echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi
# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# For distributed training with Horovod:
# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
# For basic PyTorch installation (without distributed training):
# Install PyTorch CPU version in nomkl mode (required)
# torchvision is needed for datasets and image transformations
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge pytorch-cpu torchvision nomkl --solver=libmamba
# Install additional packages if needed
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge transformers --solver=libmamba
# Install multiple packages at once (pytorch-cpu must always include nomkl)
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge \
# pytorch-cpu \
# torchvision \
# nomkl \
# transformers \
# scikit-learn \
# pandas \
# numpy \
# --solver=libmamba
Submit the job to the queue:
sbatch run_install_pytorch.sh
and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).
Note
With
condaas installer, the PyTorch package to install is namedpytorch-cpu, but within the Python code the installation is imported as a module under the nametorch(import torch)torchvisionis required for datasets (e.g., MNIST) and image transformationsFor distributed training:
pytorch-cpu,horovod, andtorchvisionmust be installed together in the same command line to ensure Open MPI is used instead of MPICH and to bringmpi4pyas a dependencyAlways include
nomklwhen installingpytorch-cpuon Discoverer CPU partition because of the used AMD Zen2 processor microarchitectureFor distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)
Verify PyTorch-CPU installation
After installing pytorch-cpu, verify the installation. Save the SLURM batch script below as run_verify_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=verify_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_pytorch.%j.out
#SBATCH -e verify_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Check PyTorch version
python -c "import torch; print('PyTorch version:', torch.__version__)"
# Check torchvision version
python -c "import torchvision; print('torchvision version:', torchvision.__version__)"
# Verify PyTorch is CPU-only (should return False for CUDA availability)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Check number of CPU threads PyTorch can use as if SMT is not there
python -c "import torch; print('Number of threads:', torch.get_num_threads())"
# Test basic tensor operations
python -c "import torch; x = torch.randn(5, 3); print('Tensor shape:', x.shape); print('Tensor device:', x.device)"
Submit the job to the queue:
sbatch run_verify_pytorch.sh
and follow the content written down in the standard output and standard error files (verify_pytorch.JOBID.out and verify_pytorch.JOBID.err) to see the verification results.
Expected output should show:
PyTorch version number
torchvision version number
CUDA available: False (since this is CPU-only)
Number of threads available (no SMT is taken into account)
Tensor operations working correctly
Examine available PyTorch methods, functions, and constants
To inspect what methods, functions, and constants are available in the installed PyTorch, save the SLURM batch script below as run_examine_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=examine_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_pytorch.%j.out
#SBATCH -e examine_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# List all available attributes in torch module
python -c "import torch; print('Available attributes:'); print([attr for attr in dir(torch) if not attr.startswith('_')])"
# Check specific submodules
python -c "import torch; print('torch.nn available:', hasattr(torch, 'nn')); print('torch.optim available:', hasattr(torch, 'optim')); print('torch.distributed available:', hasattr(torch, 'distributed'))"
# Check BLAS backend (should show OpenBLAS when nomkl is used)
python -c "import torch; print('BLAS library:', torch.__config__.show())"
# Check if MKLDNN is actually installed and available at runtime
python -c "import torch; print('MKLDNN available:', torch.backends.mkldnn.is_available() if hasattr(torch.backends, 'mkldnn') else 'mkldnn backend not found')"
python -c "import torch; print('MKLDNN enabled:', torch.backends.mkldnn.enabled if hasattr(torch.backends, 'mkldnn') else 'N/A')"
# List available tensor operations
python -c "import torch; ops = [attr for attr in dir(torch) if callable(getattr(torch, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"
# Check constants
python -c "import torch; print('Float32:', torch.float32); print('CPU device:', torch.device('cpu')); print('Default dtype:', torch.get_default_dtype())"
Submit the job to the queue:
sbatch run_examine_pytorch.sh
and follow the content written down in the standard output and standard error files (examine_pytorch.JOBID.out and examine_pytorch.JOBID.err) to see the available methods, functions, and constants.
Alternatively, run interactively using srun:
srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
--nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash
# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import torch; print('Available attributes:', [attr for attr in dir(torch) if not attr.startswith('_')])"
# ... (continue with other examination commands)
These tests help to verify:
Which PyTorch modules are available (nn, optim, distributed, etc.)
Which BLAS library is being used (should be OpenBLAS with nomkl)
Available tensor operations and constants
The build configuration should show BLAS_INFO=open and LAPACK_INFO=open, confirming that OpenBLAS is used for BLAS and LAPACK operations instead of MKL. This is the expected behavior when using nomkl. The USE_MKLDNN=1 setting indicates that MKLDNN (Intel MKL-DNN, now oneDNN) is available. MKLDNN is embedded in the libtorch/pytorch package and statically linked into PyTorch, so it remains available even when nomkl is used. This configuration is correct: nomkl prevents MKL BLAS/LAPACK usage (which does not work well on AMD Zen2 processors), while MKLDNN (for neural network operations) remains available as it is part of the PyTorch build itself.
Install TensorFlow-CPU
Optional: Search for available TensorFlow packages
To check which TensorFlow CPU versions are available before installing, save the SLURM batch script below as run_search_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=search_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_tensorflow.%j.out
#SBATCH -e search_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for TensorFlow CPU packages
conda search -c conda-forge tensorflow-cpu
# Search for specific package with build string
conda search -c conda-forge "tensorflow-cpu=*"
# Check what's available in the configured channels
conda search tensorflow-cpu
Submit the job to the queue:
sbatch run_search_tensorflow.sh
and follow the content written down in the standard output and standard error files (search_tensorflow.JOBID.out and search_tensorflow.JOBID.err) to see which TensorFlow CPU versions are available.
Installation script
To use Horovod for distributed training, tensorflow-cpu and horovod must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer. Installing them together ensures Horovod is built with CPU-only support.
Important note on compatibility: Even when installing together, conda’s dependency resolver may select versions that are technically compatible at the package level but have runtime API incompatibilities (e.g., Horovod callbacks may not work with certain TensorFlow versions). Training scripts should handle this by falling back to manual synchronization when callbacks are incompatible.
Save the SLURM batch script provided below as run_install_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=install_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_tensorflow.%j.out
#SBATCH -e install_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
else
echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi
# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Set channel priority to flexible (required when packages exist in multiple channels)
conda config --set channel_priority flexible
# Install TensorFlow CPU and Horovod together (required for CPU-only clusters)
# If Horovod is installed before TensorFlow, it will install CUDA dependencies which are not needed on CPU-only clusters
# Installing them together ensures Horovod is built with CPU-only support and conda resolves compatible versions
# Option 1: Let conda choose latest compatible versions (recommended for best compatibility)
conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba
# Option 2: Pin TensorFlow version, let conda choose compatible Horovod
# Use this to install a specific TensorFlow version (replace <version> with the desired version)
# conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu=<version> horovod nomkl --solver=libmamba
Submit the job to the queue:
sbatch run_install_tensorflow.sh
and follow the content written down in the standard output and standard error files (they will appear once the job is brought into running state).
Important
Install TensorFlow and Horovod together: To use Horovod for distributed training,
tensorflow-cpuandhorovodmust be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer.Version compatibility: Conda’s dependency resolver does not guarantee runtime API compatibility. Even when installing together, runtime API incompatibilities may be encountered (e.g., Horovod callbacks failing with
AttributeError: 'Variable' object has no attribute 'ref'). The training script handles this automatically by falling back to manual synchronization.Use flexible channel priority: When “excluded by strict repo priority” appears, set
channel_prioritytoflexibleto allow conda to use packages from any configured channelDo not specify build string: If pinning a version, install by version only (e.g.,
tensorflow-cpu=<version>), not with the build string (e.g.,tensorflow-cpu=<version>=cpu_py313hbca4264_0)Include nomkl: Similar to PyTorch, include
nomklfor AMD Zen2 processors
Runtime API compatibility: TensorFlow and Horovod may have runtime API incompatibilities even when conda installs them together. If errors like Horovod has been shut down or AttributeError: 'Variable' object has no attribute 'ref' are encountered, this indicates the installed versions are incompatible. The training script attempts to handle this, but if Horovod shuts down during initialization, training will fail. In this case, different version combinations may need to be tried or PyTorch may be used instead, which has better compatibility with Horovod.
Verify TensorFlow installation
After installing tensorflow-cpu, verify the installation. Save the SLURM batch script below as run_verify_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=verify_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_tensorflow.%j.out
#SBATCH -e verify_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Check TensorFlow version
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
# Check available CPU devices (TensorFlow shows logical CPU devices, not physical sockets)
python -c "import tensorflow as tf; print('CPU devices:', tf.config.list_physical_devices('CPU'))"
# Check number of CPU threads TensorFlow can use
python -c "import tensorflow as tf; print('Inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"
# Check total CPU cores available to TensorFlow
python -c "import tensorflow as tf; import os; print('Total CPU cores available:', len(os.sched_getaffinity(0)) if hasattr(os, 'sched_getaffinity') else 'N/A')"
# Configure TensorFlow to use all available CPU threads (256 threads per node on Discoverer)
python -c "import tensorflow as tf; tf.config.threading.set_inter_op_parallelism_threads(256); tf.config.threading.set_intra_op_parallelism_threads(256); print('Configured inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Configured intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"
# Check if TensorFlow has MPI integration
python -c "import tensorflow as tf; print('tf.distribute available:', hasattr(tf, 'distribute')); print('tf.distribute.Strategy available:', hasattr(tf.distribute, 'Strategy') if hasattr(tf, 'distribute') else False)"
# Check TensorFlow build configuration for MPI support
python -c "import tensorflow as tf; config = tf.sysconfig.get_build_info(); print('Build flags:', config.get('cpu_compiler_flags', 'N/A')); import sysconfig; print('Has MPI:', 'mpi' in str(sysconfig.get_config_var('LIBS')).lower() or 'mpi' in str(tf.sysconfig.get_compile_flags()).lower())"
# Check Horovod installation (if installed with TensorFlow)
python -c "try: import horovod.tensorflow as hvd; print('Horovod version:', hvd.__version__); print('Horovod TensorFlow support: OK'); except ImportError: print('Horovod not installed or TensorFlow support not available')"
# Test basic tensor operations
python -c "import tensorflow as tf; x = tf.constant([[1.0, 2.0], [3.0, 4.0]]); print('Tensor shape:', x.shape); print('Tensor device:', x.device); y = tf.matmul(x, x); print('Matrix multiplication result shape:', y.shape)"
Submit the job to the queue:
sbatch run_verify_tensorflow.sh
and follow the content written down in the standard output and standard error files (verify_tensorflow.JOBID.out and verify_tensorflow.JOBID.err) to see the verification results.
Expected output should show:
TensorFlow version number
CPU devices available
Thread configuration (inter-op and intra-op parallelism)
Horovod installation status (if installed)
Tensor operations working correctly
Examine available TensorFlow methods, functions, and constants
To inspect what methods, functions, and constants are available in the installed TensorFlow, save the SLURM batch script below as run_examine_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=examine_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_tensorflow.%j.out
#SBATCH -e examine_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# List all available attributes in tensorflow module
python -c "import tensorflow as tf; print('Available attributes:'); print([attr for attr in dir(tf) if not attr.startswith('_')])"
# Check specific submodules
python -c "import tensorflow as tf; print('tf.keras available:', hasattr(tf, 'keras')); print('tf.nn available:', hasattr(tf, 'nn')); print('tf.optimizers available:', hasattr(tf, 'optimizers')); print('tf.distribute available:', hasattr(tf, 'distribute'))"
# Check TensorFlow build configuration
python -c "import tensorflow as tf; print('Build info:', tf.sysconfig.get_build_info())"
# Check available operations
python -c "import tensorflow as tf; ops = [attr for attr in dir(tf) if callable(getattr(tf, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"
# Check constants and dtypes
python -c "import tensorflow as tf; print('Float32:', tf.float32); print('Int32:', tf.int32); print('Default float dtype:', tf.keras.backend.floatx())"
# Check Keras availability and version
python -c "import tensorflow as tf; print('Keras version:', tf.keras.__version__ if hasattr(tf.keras, '__version__') else 'N/A'); print('Keras available:', hasattr(tf, 'keras'))"
Submit the job to the queue:
sbatch run_examine_tensorflow.sh
and follow the content written down in the standard output and standard error files (examine_tensorflow.JOBID.out and examine_tensorflow.JOBID.err) to see the available methods, functions, and constants.
Alternatively, run interactively using srun:
srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
--nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash
# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import tensorflow as tf; print('Available attributes:', [attr for attr in dir(tf) if not attr.startswith('_')])"
# ... (continue with other examination commands)
These tests help to verify:
Which TensorFlow modules are available (keras, nn, optimizers, distribute, etc.)
Build configuration and available features
Available tensor operations and constants
Keras integration and version
Create SLURM configuration for ML/AI workloads on CPU clusters
Resource allocation
CPU-intensive tasks: Request all 256 threads per node
Memory-intensive tasks: Use fat nodes (1TB RAM) when needed
I/O-intensive tasks: Consider network-attached storage performance
Wall time limits: All jobs have maximum runtime limits specified with
#SBATCH --time=. Jobs are automatically terminated when the wall time limit is reached. Plan checkpoint intervals accordingly (see Checkpointing)
Job arrays for parallel experiments
#SBATCH --array=0-99%10 # Run 100 jobs, max 10 concurrent
Memory management
Monitor memory usage with
sstatduring job executionUse memory-efficient libraries (e.g.,
pandaswith chunking)Consider using fat nodes for large datasets
Checkpointing
All SLURM jobs on Discoverer have wall time limitations (specified with #SBATCH --time=). When a job reaches its wall time limit, it is automatically terminated by SLURM, regardless of whether training has completed. Therefore, checkpointing is essential for any training job that may exceed the allocated wall time.
The necessity of making checkpoints:
Jobs are terminated when wall time is reached, even if training is incomplete
Without checkpoints, all progress is lost and training must be restarted from the beginning
Checkpoints allow resuming training from the last saved state after the job is terminated
Best practices:
Implement checkpointing for all long-running training jobs
Save checkpoints to network storage (
/valhalla/projects/<your_slurm_project_account_name>/outputs/)Plan checkpoint intervals based on the wall time limit (e.g., if wall time is 24 hours, save checkpoints every 2-4 hours)
Ensure checkpoint saving happens frequently enough that significant progress is not lost if the job is terminated
Only rank 0 should write checkpoints in distributed training to avoid conflicts
Include epoch number, model state, optimizer state, and any other necessary information in checkpoints
Implement checkpoint loading at the start of training to resume from the last checkpoint if available
Suitable ML/AI workloads for CPU clusters
Traditional machine learning
Scikit-learn workflows
Large-scale feature engineering and preprocessing
Hyperparameter optimization with grid/random search
Cross-validation on large datasets
Ensemble methods (Random Forest, Gradient Boosting)
Model training on datasets that fit in memory (up to 251 GB allocatable per node)
XGBoost/LightGBM/CatBoost
Tabular data classification/regression
Large-scale gradient boosting
Feature importance analysis
Model interpretation
Deep learning (CPU-optimized)
TensorFlow/Keras with CPU optimization
Training smaller neural networks (CNNs, RNNs, MLPs)
Inference workloads
Transfer learning with pre-trained models
Model fine-tuning
Distributed training (see official TensorFlow distributed documentation)
PyTorch with CPU backend
Research prototyping
Model development and debugging
CPU-optimized inference
Distributed training (see official PyTorch distributed documentation)
Natural language processing (NLP)
Traditional NLP pipelines
Large-scale text preprocessing and tokenization
Feature extraction (TF-IDF, word embeddings)
Document classification
Sentiment analysis on large corpora
Transformer models (CPU inference)
Batch inference with Hugging Face models
Text generation (smaller models)
Embedding extraction
Model quantization and optimization
Computer vision
Image processing and feature extraction
Large-scale image preprocessing
Feature extraction (SIFT, SURF, HOG)
Image classification with traditional ML
Batch image transformations
Data science and analytics
Large-scale data processing
Pandas/NumPy operations on large datasets
Dask for large-scale data processing
Feature engineering pipelines
Data validation and cleaning
Hyperparameter optimization
Hyperparameter optimization
Grid search with parallel trials
Random search with parallel trials
Bayesian optimization
AutoML frameworks
Reinforcement learning
CPU-based RL training
Policy gradient methods
Q-learning variants
Multi-agent systems
Environment simulation
Distributed training
Recommended approach: Use MPI for HPC clusters
For multi-user HPC clusters like Discoverer, MPI is the recommended and verified approach for distributed training.
Why MPI over Gloo?
Gloo (PyTorch’s TCP/IP-based backend) has major practical problems on shared HPC clusters: - Port conflicts: Multiple users can pick the same MASTER_PORT, causing job failures - No automatic port management: Manual coordination required between users - Race conditions: Port availability can change between check and use - Administrative burden: Unsuitable for multi-user environments
MPI was designed for multi-user HPC environments and handles all communication automatically without port conflicts. It provides: - No port management - MPI runtime handles all communication - Multi-user safe - designed for shared clusters - Battle-tested - decades of HPC use - SLURM integration - seamless with srun --mpi=
Conclusion: For multi-user HPC clusters, MPI is the correct choice. Gloo should be avoided in shared environments.
TensorFlow distributed training with Horovod
Horovod provides a unified interface for distributed training with both TensorFlow and PyTorch. For a complete working example with TensorFlow, see the Complete working example: Distributed training on MNIST dataset with TensorFlow section below.
Basic TensorFlow with Horovod setup:
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Build and compile model
model = tf.keras.Sequential([
# Add layers here
])
# Scale learning rate by number of workers
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
# Wrap optimizer with Horovod
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')
# Add Horovod callbacks
callbacks = [
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback(),
# Add other callbacks here
]
# Only rank 0 should write checkpoints
if hvd.rank() == 0:
callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
model.fit(dataset, epochs=10, callbacks=callbacks)
Running PyTorch workloads
Package installation for distributed training
For distributed training with Horovod, install PyTorch, Horovod, and torchvision together in the same command. See Install PyTorch-CPU in the Virtual environment setup section for complete installation instructions.
Important
Python version requirement:
Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11
The virtual environment must be created with Python 3.11 for optimal compatibility
Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
Important
MPI implementation for Horovod:
Install PyTorch and Horovod together to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed)
No separate MPI module needs to be loaded - Open MPI is installed as a dependency in the virtual environment when Horovod is installed
Use
srun --mpi=pmixto launch jobs -srunautomatically uses all allocated tasks, no need to specify-npThe
mpiruncommand is also available from the virtual environment (installed as part of Open MPI dependency), butsrunis preferred for SLURM integration
Installation command for distributed training:
# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
Note
With conda, the package is named
pytorch-cpu(orpytorch), and it is imported asimport torchtorchvision is required for datasets (e.g., MNIST) and image transformations
pytorch-cpu, horovod, and torchvision must be installed together in the same command to ensure Open MPI is used instead of MPICH and to bring
mpi4pyas a dependencyOpen MPI is installed as a dependency in the virtual environment when Horovod is installed - no separate MPI module needs to be loaded
Always include
nomklwhen installingpytorch-cpuon Discoverer (AMD Zen2 processors)For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)
PyTorch distributed training with MPI
Option 1: PyTorch native DDP with MPI backend
SLURM batch script:
#!/bin/bash
#SBATCH --job-name=pytorch_distributed
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o pytorch_distributed.%j.out
#SBATCH -e pytorch_distributed.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Launch with SLURM's MPI integration
srun --mpi=pmix python train.py
Python code example:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
# Initialize process group with MPI backend
# No MASTER_ADDR, no MASTER_PORT needed - MPI handles everything
dist.init_process_group(backend='mpi')
# Get local rank and world size
local_rank = dist.get_rank()
world_size = dist.get_world_size()
# Create model and move to device
model = YourModel()
model = DDP(model)
# Create distributed sampler for data loading
train_sampler = DistributedSampler(
train_dataset,
num_replicas=world_size,
rank=local_rank
)
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=batch_size,
sampler=train_sampler
)
# Training loop
for epoch in range(num_epochs):
train_sampler.set_epoch(epoch)
for data, target in train_loader:
# Training code here
pass
# Clean up
dist.destroy_process_group()
Option 2: Horovod with PyTorch (recommended for simplicity)
Horovod works with both TensorFlow and PyTorch and simplifies distributed training setup.
SLURM batch script:
#!/bin/bash
#SBATCH --job-name=horovod_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o horovod_pytorch.%j.out
#SBATCH -e horovod_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Launch with srun (srun automatically uses all allocated tasks)
srun --mpi=pmix python train_horovod.py
Python code example:
import torch
import horovod.torch as hvd
# Initialize Horovod
hvd.init()
# Scale learning rate by number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Wrap optimizer with Horovod
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters()
)
# Broadcast initial parameters from rank 0 to all workers
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# Training loop
for epoch in range(num_epochs):
for data, target in train_loader:
optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
optimizer.step()
Complete working example: Distributed training on MNIST dataset
This section provides a complete, working example of distributed training using the MNIST handwritten digit dataset. MNIST is a standard benchmark dataset that’s automatically downloaded when the code is run, making it ideal for testing distributed training setups.
Example files: The complete training script and SLURM batch script are provided as separate, runnable files:
train_mnist_horovod_pytorch.py- Complete Python training scriptrun_mnist_training_pytorch.sh- SLURM batch submission script
Discoverer Storage Requirements:
All data and outputs must be stored under
/valhalla/projects/<your_slurm_project_account_name>/(replace<your_slurm_project_account_name>with the SLURM account name)Never store data in the home directory (
~/) - this violates Discoverer’s best practicesThe example scripts automatically use
/valhalla/projects/<your_slurm_project_account_name>/data/for datasetsOutputs (checkpoints, logs) should go to
/valhalla/projects/<your_slurm_project_account_name>/outputs/The SLURM script creates these directories automatically
These files can be copied and run directly after configuring the SLURM account settings.
Overview
Dataset: MNIST.*Training set: 60,000 images - Test set: 10,000 images - Image size: 28×28 pixels - Data source: Automatically downloaded from PyTorch’s built-in datasets (no manual download needed) - Model: Simple convolutional neural network (CNN) suitable for CPU training - Framework: PyTorch with Horovod - Training: Distributed across multiple nodes using MPI
Step 1: Prepare the working directory
On the Discoverer login node, create a directory for the training job and copy the example files:
# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_pytorch.py .
cp /path/to/run_mnist_training_pytorch.sh .
Note: The example files (train_mnist_horovod_pytorch.py and run_mnist_training_pytorch.sh) are provided as separate, runnable scripts that can be copied and used directly.
Step 2: Get the training script
Copy the training script to the working directory:
# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_pytorch.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
The training script (train_mnist_horovod_pytorch.py) contains:
CNN model definition for MNIST digit classification
Distributed data loading with Horovod
Training and validation loops with metric aggregation
Automatic MNIST dataset download (only rank 0 downloads)
Complete distributed training implementation
The full script can be viewed in train_mnist_horovod_pytorch.py or modified as needed.
Step 3: Get the SLURM batch script
Copy the SLURM batch script to the working directory:
# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_pytorch.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
Before submitting, edit run_mnist_training_pytorch.sh and replace:
<your_slurm_project_account_name>with the actual SLURM account name<your_qos>with the actual QoS nameVerify the virtual environment path matches the setup
The SLURM script (run_mnist_training_pytorch.sh) contains:
Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
Module loading (anaconda3 only - Open MPI is installed as a dependency in the virtual environment when Horovod is installed)
Virtual environment setup
Job launch with
srun --mpi=pmix(srun automatically uses all allocated tasks)
Understanding epochs in distributed training:
Epochs are shared across all ranks, not per-rank
With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
Each rank processes a different subset of data during each epoch
All ranks work together on the same epochs simultaneously
256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks
Resource allocation for full node allocation:
Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
Recommended:
--ntasks-per-node=128(one rank per core)Recommended:
--cpus-per-task=2(use both threads per core)Memory: Adjust
--membased on the model size (e.g., 251G for full node - maximum available on Discoverer)Total ranks: 2 nodes × 128 tasks = 256 ranks
Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes
The full script can be viewed in run_mnist_training_pytorch.sh and resource requirements can be modified as needed.
Step 4: Make the script executable and submit the job
On the login node:
# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_pytorch.sh
Before submitting: Ensure run_mnist_training_pytorch.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct.
Step 5: Monitor the job
# Check job status
squeue --me
# Monitor output in real-time (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f pytorch_cpu_training.JOBID.out
# Check for errors
tail -f pytorch_cpu_training.JOBID.err
Step 6: Check job runtime and statistics
After the job completes, check how long it took:
# Show job-level summary only (aggregated, no job steps)
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
# Show detailed accounting with job steps (default behavior)
sacct -j JOBID --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,TotalCPU,MaxRSS,MaxVMSize,NodeList
# Show elapsed time (wall clock time) for a job
sacct -j JOBID -X --format=JobID,JobName,Elapsed,State
# Show resource usage summary
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,MaxVMSize,ReqMem,AllocCPUS
# Show all jobs for today
sacct --starttime=today -X --format=JobID,JobName,Partition,Elapsed,State,ExitCode
Understanding sacct output:
Without -X flag (default): Shows job steps separately
JOBID: Main job (aggregated across all steps)JOBID.batch: The batch script executionJOBID.0,JOBID.1, etc.: Individual job steps (e.g., MPI processes, srun steps)
With -X flag: Shows only job-level summary (aggregated)
Single line with total statistics for the entire job
Example output without -X:
$ sacct -j 4060972 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 05:03:21 512
4060972.bat+ batch 00:00:56 00:00.513 21176K 256
4060972.0 hydra_pmi+ 00:00:55 05:03:21 150700188K 4
Example output with -X (job-level only):
$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 00:00:00 512
Common fields:
Elapsed: Wall clock time (how long the job ran)TotalCPU: Total CPU time used (sum across all CPUs)MaxRSS: Maximum resident set size (peak memory usage)MaxVMSize: Maximum virtual memory sizeAllocCPUS: Number of CPUs allocated
Recommendation: Use -X flag for quick job-level summaries. Remove -X to see detailed breakdown by job steps.
Performance comparison: Scaling with number of CPUs
The benefits of fully allocating nodes are clear when comparing runtime. Note that SLURM’s Elapsed time includes job startup/teardown overhead, while the training script reports actual computation time:
Partial allocation (8 tasks per node, 32 CPUs total):
$ sacct -j 4060968 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060968 mnist_hor+ 00:02:35 00:00:00 32
Wall time (SLURM): 2 minutes 35 seconds
Allocated resources: 32 CPUs allocated
Full node allocation (128 tasks per node, 512 CPUs total):
$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 00:00:00 512
Wall time (SLURM): 56 seconds
Allocated resources: 512 CPUs (128 tasks × 2 nodes × 2 CPUs per task)
Speedup: ~2.8× faster (155 seconds → 56 seconds)
Large-scale runs (multiple nodes):
8 nodes (1024 CPUs total):
$ sacct -j 4060975 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060975 mnist_hor+ 00:00:52 00:00:00 1024
Wall time (SLURM): 52 seconds
Actual training time (from logs): 18.73 seconds
Average time per epoch: 3.75 seconds
Allocated resources: 1024 CPUs (128 tasks × 8 nodes × 2 CPUs per task)
16 nodes (2048 CPUs total):
$ sacct -j 4060974 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ----------
4060974 mnist_hor+ 00:00:49 00:00:00 2048
Wall time (SLURM): 49 seconds
Actual training time (from logs): 14.75 seconds
Average time per epoch: 2.95 seconds
Allocated resources: 2048 CPUs (128 tasks × 16 nodes × 2 CPUs per task)
Speedup vs 1024 CPUs: ~1.3× faster training time (18.73s → 14.75s)
Analysis
Wall time vs. training time: SLURM’s
Elapsedincludes job startup/teardown overhead. The training script’s reported time is the actual computation time.Scaling efficiency: Training time scales well from 512 → 1024 → 2048 CPUs, showing good parallel efficiency.
Overhead: Job startup/teardown overhead becomes more noticeable at larger scales (e.g., 49s wall time vs 14.75s training time for 2048 CPUs).
Full node allocation: Using 128 tasks per node (one per core) maximizes resource allocation and minimizes training time.
Scalability testing results
The following results demonstrate how to measure scalability and training speed across different node configurations. These are speed/scalability tests, not optimized for accuracy. The accuracy values shown are lower than optimal because the focus is on demonstrating distributed training execution and measuring speedup.
2 nodes (256 ranks):
Training time: 23.80 seconds
Learning rate: 0.16 (sqrt scaling)
Test accuracy: 66.66% → 10.08% (unstable - not optimized for accuracy)
4 nodes (512 ranks):
Training time: 19.07 seconds
Learning rate: 0.226 (sqrt scaling)
Test accuracy: 64.99% → 21.16% (unstable - not optimized for accuracy)
8 nodes (1024 ranks):
Training time: 15.37 seconds
Learning rate: 0.1 (fixed scaling)
Test accuracy: 19.84% → 66.18% (improving but not optimized)
16 nodes (2048 ranks):
Training time: 15.10 seconds
Learning rate: 0.1 (fixed scaling)
Test accuracy: 20.68% → 64.96% (improving but not optimized)
Analysis:
Training speed scales well: 23.80s → 19.07s → 15.37s → 15.10s
Diminishing returns beyond 8 nodes (communication overhead)
These results demonstrate scalability - actual accuracy requires further tuning (see below)
Achieving better accuracy
To achieve high accuracy (e.g., ~98-99% for MNIST) instead of just speed testing, apply these changes to the training script:
Reduce base learning rate: Change
base_lr = 0.01tobase_lr = 0.001or0.0001Train for more epochs: Change
num_epochs = 5tonum_epochs = 10or20Use Adam optimizer: Replace
optim.SGDwithoptim.Adamoroptim.AdamWfor better stabilityAdd gradient clipping: Before
optimizer.step(), addtorch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)Use learning rate schedule: Add
StepLRscheduler to decay learning rate over epochsIncrease batch size: Change
batch_size = 64to128or256(if memory allows)
Note
Better accuracy typically requires more epochs, lower learning rates, and more careful tuning, which increases training time. For production training, start with 2-4 nodes to establish baseline accuracy, then scale up while monitoring convergence.
Full node allocation significantly reduces training time. For the same workload, using 128 tasks per node instead of 8 tasks per node provides nearly 3× speedup while fully allocating the available resources. Scaling to multiple nodes further reduces training time, though with diminishing returns due to communication overhead.
Understanding the data flow
Data location:
The MNIST dataset is already available at
/opt/software/test/pytorch/mnist_dataThe dataset is automatically loaded from this location - no download is needed
The dataset is approximately 60 MB and includes both training and test sets
The SLURM script sets
DATA_DIRenvironment variable to this location
Data distribution:
Each process gets a different subset of the training data via
DistributedSamplerWith 16 processes (2 nodes × 8 tasks), each process handles ~3,750 training images
Data is automatically shuffled differently each epoch
Model synchronization:
Initial model parameters are broadcast from rank 0 to all workers
After each training step, gradients are averaged across all processes using Horovod’s ring-allreduce algorithm
All processes maintain identical model parameters throughout training
Output:
Only rank 0 prints training progress to avoid duplicate output
Metrics (loss, accuracy) are averaged across all processes before printing
Checkpoints should be saved to
/valhalla/projects/<your_slurm_project_account_name>/outputs(never use home directory)The SLURM script sets
OUTPUT_DIRenvironment variable for this purpose
Expected output
When checking the output file (pytorch_cpu_training.JOBID.out), the output should be similar to:
Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/pytorch/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/pytorch/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/pytorch/mnist_data
Training dataset download completed
Starting training for 5 epochs...
Epoch 1, Batch 0, Loss: 2.3130
Epoch 1 - Average Loss: 0.8803, Average Accuracy: 72.08%
Test Loss: 0.0985, Test Accuracy: 97.07%
Epoch 2, Batch 0, Loss: 0.2280
Epoch 2 - Average Loss: 0.1597, Average Accuracy: 95.36%
Test Loss: 0.0540, Test Accuracy: 98.29%
Epoch 3, Batch 0, Loss: 0.1328
Epoch 3 - Average Loss: 0.1004, Average Accuracy: 97.09%
Test Loss: 0.0488, Test Accuracy: 98.45%
Epoch 4, Batch 0, Loss: 0.0106
Epoch 4 - Average Loss: 0.0834, Average Accuracy: 97.63%
Test Loss: 0.0505, Test Accuracy: 98.45%
Epoch 5, Batch 0, Loss: 0.1305
Epoch 5 - Average Loss: 0.0726, Average Accuracy: 97.86%
Test Loss: 0.0322, Test Accuracy: 98.99%
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Training job completed
Expected output characteristics:
Job information header: Shows resource allocation and paths
Process count: Should match
--nodes × --ntasks-per-node(e.g., 2 × 128 = 256 for full allocation)Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
Training progress: Loss should decrease and accuracy should increase over epochs
Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
Completion message: “Training job completed” indicates successful finish
Note
Only MPI rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.
Understanding process structure
When checking running processes with top or htop on a compute node, the following will be visible:
Main Python training processes: 128 processes (one per
--ntasks-per-node, matching one per core)High CPU usage (180-220%) - using multiple threads (2 threads per process)
Memory usage: ~500-530 MB per process
These are the distributed training workers
Each process uses 2 threads (matching
--cpus-per-task=2)
pt_data_workerprocesses: 256 processes (2 per training process)Lower CPU usage (~2%) - data loading workers
Memory usage: ~280-290 MB per worker
These are PyTorch DataLoader worker processes
The purpose of pt_data_worker processes: - PyTorch’s DataLoader uses num_workers=2 (set in the training script) - Each training process spawns 2 worker processes to load data in parallel - This prevents data loading from blocking training computation - With 128 training processes × 2 workers = 256 pt_data_worker processes per node
Total processes per node (full allocation): - 128 main Python processes (training, one per core) - 256 pt_data_worker processes (data loading) - Total: 384 processes per node - Total across 2 nodes: 768 processes (512 training + 256 data workers)
Resource allocation: - CPU cores: 128 cores per node × 2 nodes = 256 cores (fully utilized) - Threads: 256 ranks × 2 threads = 512 threads (fully utilizing hyperthreading) - Memory: Adjust --mem based on model size (e.g., 251G per node for full allocation - maximum available on Discoverer)
This is normal and expected behavior. The worker processes help keep the GPU/CPU busy by pre-loading the next batch while the current batch is being processed.
To reduce worker processes (if needed for resource constraints): - With 256 ranks, having num_workers=2 creates 512 data worker processes (256 × 2) - Consider reducing to num_workers=1 for large-scale runs (256 worker processes total) - Or set num_workers=0 to disable parallel data loading (slower but minimal overhead) - For full node allocation (256 ranks), num_workers=1 is often a good balance
Troubleshooting
Job stuck in pending state with reason “(None)”:
“(None)” means the job is waiting in the queue without a specific reason code
This is common for large resource requests (e.g., 2 full nodes)
Check detailed job information:
scontrol show job JOBID
Check job priority and position in queue:
sprio -j JOBID
Check partition availability:
sinfo -p cn
Check available nodes:
sinfo -p cn -o '%N %t %C %m'
Common reasons for delay:
No 2 full nodes available simultaneously (most common for full node requests)
Other jobs ahead in queue with higher priority
Account/QoS limits (check with
sacctmgr show assoc user=$USER)Partition limits or maintenance
Solutions:
Wait for resources to become available
Reduce resource request (fewer nodes, fewer tasks per node)
Check with cluster administrators about account limits
Consider using fewer nodes or reducing memory request if possible
Python version compatibility issues:
Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
If Python version errors are encountered, recreate the virtual environment with Python 3.11:
# Remove the existing environment conda env remove --prefix ${VIRTUAL_ENV} -y # Create new environment with Python 3.11 (current optimal version) # Python 3.11 is the current optimal version for ML/AI workloads on Discoverer # (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11) # Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y # Then install packages again (install pytorch-cpu, horovod, and torchvision together to ensure Open MPI and mpi4py) conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
Dataset location issues:
The MNIST dataset should already be available at
/opt/software/test/pytorch/mnist_dataIf the dataset is not found, verify the path is correct and accessible from compute nodes
The SLURM script sets
DATA_DIRto this location automatically
Out of memory:
Reduce
batch_sizein the training script (currently 64)Reduce number of workers in DataLoader (
num_workers=2)
Slow training:
Increase
batch_sizeif memory allowsReduce number of epochs for testing
Check that
OMP_NUM_THREADSis set correctly
Training produces NaN loss values:
Cause: Learning rate is too high when scaled linearly with large number of ranks
With 256+ ranks, linear scaling (base_lr × num_ranks) creates extremely high learning rates
Example: 0.01 × 2048 = 20.48 (way too high, causes divergence)
Solution: The training script now uses square root scaling for large numbers of ranks
For ≤32 ranks: linear scaling (base_lr × num_ranks)
For >32 ranks: square root scaling (base_lr × sqrt(num_ranks))
This prevents NaN values while still benefiting from distributed training
If NaN values still appear, try reducing
base_lrfurther (e.g., 0.001 instead of 0.01)
MPI errors:
Open MPI is installed as a dependency in the virtual environment when Horovod is installed (Open MPI when PyTorch and Horovod are installed together) - no separate MPI module needs to be loaded
Use
srun --mpi=pmixto launch jobs -srunautomatically uses all allocated tasks (no need for-np $SLURM_NTASKS)Verify that MPI libraries are available:
which mpirun(should be in the virtual environment, thoughsrunis preferred)Check that Horovod was installed correctly:
conda list horovodEnsure virtual environment has
horovodandmpi4pyinstalledImportant: Install PyTorch, Horovod, and torchvision together (
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba) to ensure Open MPI is used and to bringmpi4pyas a dependency
Adapting for custom datasets
To use a custom dataset:
Replace the dataset loading section:
# Instead of datasets.MNIST, use a custom dataset from torch.utils.data import Dataset class YourDataset(Dataset): def __init__(self, data_path, transform=None): # Load data here pass def __len__(self): return len(self.data) def __getitem__(self, idx): # Return (image, label) tuple return image, label train_dataset = YourDataset(data_path='/path/to/data', transform=transform)
Adjust model architecture:
Modify
MNIST_CNNclass to match the input dimensions and number of classesUpdate normalization values in transforms if needed
Update data directory:
For the example MNIST training, the dataset is at
/opt/software/test/pytorch/mnist_dataFor custom datasets, store them in
/valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/Update the
DATA_DIRenvironment variable in the SLURM script to point to the dataset locationEnsure the path is accessible from all compute nodes (project directories are shared across nodes)
Never store data in home directory - always use
/valhalla/projects/<your_slurm_project_account_name>/
Scaling analysis: Why different node counts require different numbers of epochs
In distributed training with Horovod, the training dataset is sharded (divided) across all worker processes (ranks). Each rank processes only its assigned subset of data during each epoch.
Data sharding formula:
Samples per rank per epoch = Total training samples / Number of ranks
As the number of nodes (and therefore ranks) increases:
Fewer samples per rank: Each rank sees progressively fewer training samples per epoch
2 nodes (256 ranks): 60,000 / 256 = 234.4 samples per rank
4 nodes (512 ranks): 60,000 / 512 = 117.2 samples per rank
8 nodes (1024 ranks): 60,000 / 1024 = 58.6 samples per rank
16 nodes (2048 ranks): 60,000 / 2048 = 29.3 samples per rank
Reduced effective batch size per rank: With fewer samples, batch sizes must be reduced to ensure multiple batches per epoch
More ranks → smaller batch sizes (64 → 32 → 16)
More epochs needed for convergence: With fewer samples per rank, each rank needs more epochs to:
See enough diverse data to learn meaningful patterns
Accumulate sufficient gradient updates for convergence
Overcome the statistical noise from limited data per rank
Learning rate scaling: Learning rates are scaled more conservatively with more ranks to prevent training instability
Explanation: 16 nodes required more epochs than 8 nodes
With 16 nodes (2048 ranks), each rank sees only 29.3 samples per epoch. This extremely limited data per rank means: - Each rank sees very little data diversity per epoch - Gradient estimates become noisier, requiring more iterations to converge - More communication overhead with more ranks can introduce slight inconsistencies - Beyond a certain point, adding more ranks doesn’t linearly improve convergence speed due to the data limitation
The following table shows PyTorch training results across different node configurations:
Metric |
2 Nodes(256 ranks) |
4 Nodes(512 ranks) |
8 Nodes(1024 ranks) |
16 Nodes(2048 ranks) |
|---|---|---|---|---|
Total Ranks |
256 |
512 |
1024 |
2048 |
Samples per Rank |
234.4 |
117.2 |
58.6 |
29.3 |
Batch Size |
64 |
32 |
16 |
16 |
Learning Rate |
0.016000 |
0.012126 |
0.005657 |
0.006727 |
LR Scaling |
sqrt (0.5) |
conservative (0.4) |
conservative (0.25) |
conservative (0.25) |
Max Epochs |
20 |
25 |
30 |
30 |
Epochs to 98% |
17 |
13 |
13 |
28 |
Final Accuracy |
98.14% |
98.01% |
98.11% |
98.12% |
Total Training Time |
71.44s |
52.56s |
52.80s |
94.36s |
Time per Epoch |
4.20s |
4.04s |
4.06s |
3.37s |
Early Stopping |
Yes |
Yes |
Yes |
Yes |
Highlights:
Training time efficiency: From the total training time, 4 nodes (512 MPI ranks) is the most optimal for computation, achieving 98% accuracy in 52.56 seconds. 8 nodes (1024 ranks) is nearly as efficient at 52.80 seconds, while 2 nodes requires 71.44 seconds and 16 nodes requires 94.36 seconds.
Convergence patterns: 4 and 8 nodes show fast convergence (13 epochs), while 16 nodes requires 28 epochs due to extremely limited data per rank
Learning rate scaling: Becomes more conservative as ranks increase to prevent training instability
Batch size adaptation: Automatically reduced with more ranks to ensure sufficient batches per epoch
Recommendations:
Optimal node count for this workload: 4-8 nodes provide the best balance of speed and resource allocation
Avoid over-scaling: Beyond 8 nodes, the benefits diminish due to data sharding limitations
Early stopping is effective: All runs successfully used early stopping at 98%, saving computation time
Adaptive epoch scaling: The script correctly scales max epochs based on rank count, allowing sufficient training time for larger scales
Estimating optimal parallel rank for training jobs:
For each training job, estimate the optimal parallel rank (number of MPI ranks) by:
Running multiple test runs with different node counts (e.g., 2, 4, 8, 16 nodes) using the same model architecture and training setup
Comparing total training time, convergence speed (epochs to target accuracy), and resource allocation across different configurations
Identifying the configuration that achieves the target accuracy in the shortest time while maintaining stable convergence
Running production training jobs based on the estimated optimal configuration from these test runs
Note
The optimal configuration may vary depending on dataset size, model complexity, and target accuracy. Always validate the optimal rank count through empirical testing before running large-scale production jobs.
Complete working example: Distributed training on MNIST dataset with TensorFlow
This section provides a complete, working example of distributed training using TensorFlow/Keras with Horovod on the MNIST handwritten digit dataset. This example follows the same structure as the PyTorch example above but uses TensorFlow/Keras instead.
Example files: The complete training script and SLURM batch script are provided as separate, runnable files: - train_mnist_horovod_tf.py - Complete Python training script for TensorFlow - run_mnist_training_tf.sh - SLURM batch submission script for TensorFlow
Discoverer Storage Requirements:
All outputs must be stored under
/valhalla/projects/<your_slurm_project_account_name>/(replace<your_slurm_project_account_name>with the SLURM account name)Never store data or other large files (like outputs) in the user’s home directory (
~/) - this violates Discoverer CPU partition best file system utilisation practicesThe MNIST dataset is already available locally at
/opt/software/test/tf/mnist_data- no download is neededOutputs (checkpoints, logs) should go to
/valhalla/projects/<your_slurm_project_account_name>/outputs/The SLURM script creates the output directory automatically
Overview
Dataset: MNIST.*containing 70,000 grayscale images of handwritten digits (0-9):
Training set: 60,000 images
Test set: 10,000 images
Image size: 28×28 pixels
Data source: The MNIST dataset is already available locally and is automatically loaded from the local storage location
Model: Simple convolutional neural network (CNN) suitable for CPU training
Framework: TensorFlow/Keras with Horovod
Training: Distributed across multiple nodes using MPI
Prerequisites:
TensorFlow and Horovod must be installed together in the same conda install command to ensure CPU-only build
If Horovod is installed before TensorFlow, it may install CUDA dependencies which are not needed on CPU-only clusters like Discoverer CPU partition
Installation command:
conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmambaSee the TensorFlow installation section for complete installation instructions
Step 1: Prepare the working directory
On the Discoverer login node, create a directory for the training job and copy the example files:
# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_tf.py .
cp /path/to/run_mnist_training_tf.sh .
Note
The example files (train_mnist_horovod_tf.py and run_mnist_training_tf.sh) are provided as separate, runnable scripts that can be copied and used directly.
Step 2: Get the training script
Copy the training script to the working directory:
# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_tf.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
The training script (train_mnist_horovod_tf.py) contains:
CNN model definition using TensorFlow/Keras Sequential API
Distributed data loading with TensorFlow Dataset API and Horovod sharding
Training and validation loops with metric aggregation
MNIST dataset loading from local storage
Complete distributed training implementation with Horovod callbacks
The full script can be viewed in train_mnist_horovod_tf.py or modified as needed.
Step 3: Get the SLURM batch script
Copy the SLURM batch script to the working directory:
# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_tf.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
Before submitting, edit run_mnist_training_tf.sh and replace:
<your_slurm_project_account_name>with the actual SLURM account name<your_qos>with the actual QoS nameVerify the virtual environment path matches the setup (should point to the TensorFlow virtual environment)
The SLURM script (run_mnist_training_tf.sh) contains:
Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
Module loading (
anaconda3only is required - Open MPI is already installed as a dependency in the virtual environment during the installation of Horovod)Virtual environment setup
Job launch with
srun --mpi=pmix(srunautomatically uses all allocated tasks)
Understanding epochs in distributed training:
Epochs are shared across all ranks, not per-rank
With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
Each rank processes a different subset of data during each epoch
All ranks work together on the same epochs simultaneously
256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks
Resource allocation for full node allocation:
Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
Recommended:
--ntasks-per-node=128(one rank per core)Recommended:
--cpus-per-task=2(use both threads per core)Memory: Adjust
--membased on the model size (e.g., 251G for full node - maximum available on Discoverer)Total ranks: 2 nodes × 128 tasks = 256 ranks
Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes
Step 4: Submit the job
# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_tf.sh
Before submitting: Ensure run_mnist_training_tf.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct (should point to the TensorFlow virtual environment, not PyTorch).
Step 5: Monitor the job
# Check job status
squeue --me
# View output (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f mnist_training_tf.JOBID.out
# View errors (if any)
tail -f mnist_training_tf.JOBID.err
Step 6: Check the results
When the job completes, check the output file for training progress and final metrics.
When checking the output file (mnist_training_tf.JOBID.out), the output should be similar to:
Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/tf/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/tf/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/tf/mnist_data
Training dataset download completed
Using learning rate: 0.16 (base: 0.01, ranks: 256, scaling: sqrt)
Starting training for 5 epochs...
Epoch 1, Loss: 0.8803, Accuracy: 72.08%, Val Loss: 0.0985, Val Accuracy: 97.07%, Time: 19.89s
Epoch 2, Loss: 0.1597, Accuracy: 95.36%, Val Loss: 0.0540, Val Accuracy: 98.29%, Time: 19.45s
Epoch 3, Loss: 0.1004, Accuracy: 97.09%, Val Loss: 0.0488, Val Accuracy: 98.45%, Time: 19.32s
Epoch 4, Loss: 0.0834, Accuracy: 97.63%, Val Loss: 0.0505, Val Accuracy: 98.45%, Time: 19.28s
Epoch 5, Loss: 0.0726, Accuracy: 97.86%, Val Loss: 0.0322, Val Accuracy: 98.99%, Time: 19.21s
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Evaluating on test set...
Test Loss: 0.0322, Test Accuracy: 98.99%
Training job completed
Expected output characteristics:
Job information header: Shows resource allocation and paths
Process count: Should match
--nodes × --ntasks-per-node(e.g., 2 × 128 = 256 for full allocation)Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
Training progress: Loss should decrease and accuracy should increase over epochs
Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
Completion message: “Training job completed” indicates successful finish
Note: Only rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.
Differences from PyTorch example
The TensorFlow implementation differs from the PyTorch example in several ways:
Data loading: TensorFlow uses
tf.data.Datasetwith.shard()for distributed data splitting, while PyTorch usesDistributedSamplerModel definition: TensorFlow uses Keras Sequential API, while PyTorch uses
nn.ModuleTraining loop: TensorFlow uses
model.fit()with callbacks, while PyTorch uses manual training loopsMetric aggregation: TensorFlow uses Horovod’s
MetricAverageCallback, while PyTorch useshvd.allreduce()manuallyDataset download: TensorFlow downloads to a
.npzfile, while PyTorch downloads to a directory structure
Troubleshooting
Most troubleshooting steps are the same as the PyTorch example. Additional TensorFlow-specific considerations:
TensorFlow dataset sharding:
Ensure
.shard(hvd.size(), hvd.rank())is called on the dataset before batchingEach MPI rank must process a different subset of data
Horovod callbacks:
Always include
BroadcastGlobalVariablesCallback(0)to synchronize initial model parametersInclude
MetricAverageCallback()to average metrics across MPI ranksOnly MPI rank #0 should write checkpoints (use
if hvd.rank() == 0:)
TensorFlow/Keras compatibility:
Ensure TensorFlow and Horovod versions are compatible
Check with:
python -c "import tensorflow as tf; import horovod.tensorflow as hvd; print('TF:', tf.__version__, 'Horovod:', hvd.__version__)"
Memory issues:
TensorFlow may use more memory than PyTorch for the same model
Reduce
batch_sizeif OOM errors are encounteredConsider using
tf.data.Datasetprefetching carefully
InfiniBand transport errors:
Error: “Transport retry count exceeded on mlx5_0:1/IB” is a known UCX/InfiniBand issue (UCX issue #5199)
This error typically occurs when there is excessive communication between MPI ranks across different nodes, indicating that the job has reached communication saturation and cannot utilize that number of nodes
Solution: Reduce the number of nodes in the job allocation
If reducing nodes is not feasible, set the UCX service level parameter as a workaround:
export UCX_IB_SL=0
This parameter sets the InfiniBand service level to 0, which can resolve transport retry issues
Note:
export UCX_IB_SL=0is already included in all SLURM batch scripts in this documentNote: Disabling InfiniBand (using TCP/IP instead) is not recommended as it significantly reduces performance for multi-node jobs
Adapting for custom datasets
To use a custom dataset with TensorFlow:
Replace the dataset loading section:
# Instead of tf.keras.datasets.mnist.load_data(), use a custom dataset import tensorflow as tf # Load data x_train, y_train = load_your_training_data() x_test, y_test = load_your_test_data() # Normalize and reshape as needed x_train = x_train.astype('float32') / 255.0 x_test = x_test.astype('float32') / 255.0 # Create TensorFlow datasets train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) train_dataset = train_dataset.shard(hvd.size(), hvd.rank()) # Important for distributed training train_dataset = train_dataset.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
Adjust model architecture:
Modify the
create_model()function to match the input dimensions and number of classesUpdate normalization values if needed
Update data directory:
Store the dataset in
/valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/Update the
DATA_DIRenvironment variable in the SLURM scriptEnsure the path is accessible from all compute nodes (project directories are shared across nodes)
Never store data in home directory - always use
/valhalla/projects/<your_slurm_project_account_name>/
Summary: Best practices for distributed training on Discoverer
Use MPI - Avoid Gloo in multi-user HPC environments due to possible port conflict issues
Proper combination of components:
Horovod + MPI (simplest, works for both TensorFlow and PyTorch)
PyTorch native DDP with MPI backend
Launch methods:
With Horovod:
srun --mpi=pmix python train.py(srun automatically uses all allocated tasks)With PyTorch DDP:
srun --mpi=pmix python train.py
No MPI module needed for Horovod: Open MPI is installed as a dependency in the virtual environment when Horovod is installed (Open MPI when PyTorch and Horovod are installed together) - no separate module loading required
References
Consider checking Discoverer HPC documentation for:
Available software modules
Storage locations and quotas
Job submission best practices
Partition-specific configurations
Getting help
See Getting help