ML/AI Workloads on Discoverer CPU Cluster¶
Table of contents¶
- Discoverer CPU partition compute nodes specifications
- CPU cluster and ML/AI
- Virtual environment setup
- Use conda from the anaconda3 module
- Common setup: Determine Python version compatibility
- Common setup: Create virtual environment with compatible Python version
- Install PyTorch-CPU
- Verify PyTorch-CPU installation
- Examine available PyTorch methods, functions, and constants
- Install TensorFlow-CPU
- Verify TensorFlow installation
- Create SLURM configuration for ML/AI workloads on CPU clusters
- Suitable ML/AI workloads for CPU clusters
- Distributed training
- Running PyTorch workloads
- Package installation for distributed training
- PyTorch distributed training with MPI
- Complete working example: Distributed training on MNIST dataset
- Overview
- Step 1: Prepare the working directory
- Step 2: Get the training script
- Step 3: Get the SLURM batch script
- Step 4: Make the script executable and submit the job
- Step 5: Monitor the job
- Step 6: Check job runtime and statistics
- Scalability testing results
- Achieving better accuracy
- Understanding the data flow
- Expected output
- Understanding process structure
- Troubleshooting
- Adapting for custom datasets
- Scaling analysis: Why different node counts require different numbers of epochs
- Summary: Best practices for distributed training on Discoverer
- References
- Getting help
Discoverer CPU partition compute nodes specifications¶
For more details, see the Discoverer Resource Overview.
CPU cluster and ML/AI¶
CPU clusters like Discoverer are well-suited for ML/AI workloads, particularly when leveraging distributed training across multiple nodes. The large number of CPU cores (144,384 hardware cores, 288,768 logical CPUs with hyperthreading) and high-speed InfiniBand interconnect (200 Gbps per node) enable efficient parallel processing of machine learning tasks.
Distributed training as a basic principle¶
Distributed training is fundamental for effectively utilizing CPU clusters for ML/AI workloads. Rather than training models on a single node, distributed training divides the workload across multiple nodes and processes, enabling:
- Faster training times: Multiple nodes work simultaneously on different data shards
- Scalability: Training can scale from a few nodes to hundreds of nodes depending on the workload
- Efficient resource allocation: Enables allocation of CPU cores across multiple nodes
- Large dataset handling: Enables training on datasets that may not fit in a single node’s memory
On Discoverer, distributed training uses MPI (Message Passing Interface) for inter-node communication, which is the standard approach for HPC clusters. Frameworks like Horovod provide a unified interface for distributed training with both PyTorch and TensorFlow, simplifying the implementation of distributed training workflows.
Considerations for ML/AI on CPU clusters
- Data sharding: In distributed training, datasets are divided (sharded) across all worker processes. Each process sees only a subset of the data per epoch, which affects convergence behavior and requires careful tuning of learning rates and epoch counts.
- Communication overhead: While distributed training speeds up training, communication between nodes adds overhead. The optimal number of nodes depends on the specific workload, dataset size, and model complexity.
- Resource allocation: Full node allocation (128 tasks per node, using all 256 logical CPUs) maximizes training efficiency. Partial allocation wastes resources and increases training time.
- Storage requirements: All data and outputs must be stored in project directories (/valhalla/projects/<your_slurm_project_account_name>/), never in home directories or /tmp.
- Python version compatibility: Python 3.11 is currently the optimal version for ML/AI workloads on Discoverer, providing the best compatibility with TensorFlow, PyTorch, and Horovod.
Workload types suitable for CPU clusters
CPU clusters excel at:
- Traditional machine learning (scikit-learn, XGBoost, etc.)
- Deep learning with distributed training (PyTorch, TensorFlow)
- Large-scale data preprocessing and feature engineering
- Hyperparameter optimization with parallel trials
- CPU-optimized inference workloads
- Research and development workflows
For production training of very large models or extremely large datasets, GPU clusters may provide better performance, but CPU clusters are highly effective for many ML/AI workloads when properly configured for distributed training.
Limitations and considerations
Workloads suitable for CPU clusters:
- Traditional ML algorithms (scikit-learn, XGBoost)
- Data preprocessing and feature engineering
- CPU-optimized inference
- Hyperparameter optimization
- Smaller neural networks
- Batch inference workloads
What’s better on GPU clusters (Discoverer+ GPU partition):
- Large-scale deep learning training
- Transformer model training
- Real-time inference requirements
- Computer vision with large CNNs
- High-throughput training workloads
Virtual environment setup¶
Note
Virtual environments replace the need for running containers and simplify the setup and installation process for ML/AI workloads on Discoverer CPU partition. They provide isolated Python environments with all necessary dependencies without the overhead of container management.
Use conda from the anaconda3 module¶
All virtual environments on Discoverer must be managed by conda from the anaconda3 environment module that is already installed on the system. Users must not download, install, or use their own Anaconda or Miniconda installations. Conda is always available on Discoverer through environment modules and must be accessed by loading the anaconda3 module:
module load anaconda3
After loading the module, the conda executable and the libraries that support that executable become available. All virtual environment creation, package installation, and management must be done using this system-provided Anaconda3 installation. Do not attempt to install Anaconda3/Miniconda3 yourself - conda tool is already available through the environment module system.
Note
Users who prefer containers can use Singularity/Apptainer on the Discoverer CPU partition, but this document does not cover container-based workflows. The examples and instructions in this guide focus on virtual environments as the recommended approach.
Warning
Do not use conda activate when working with virtual environments on Discoverer. Always use --prefix with environment variable exports to expose the environment. Set VIRTUAL_ENV to the path of the virtual environment directory (e.g., /valhalla/projects/<your_slurm_project_account_name>/virt_envs/torch), then export:
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
See the examples below for complete virtual environment installation scripts.
The libmamba solver is recommended for all conda-driven installations because it:
- Resolves dependencies faster than the default solver
- Handles complex dependency conflicts more reliably
- Reduces warnings about deprecated version specifications
Common setup: Determine Python version compatibility
Before creating a virtual environment, determine the highest Python version supported by the target packages (pytorch-cpu or tensorflow-cpu). This may require some research, or a trial-and-error installation procedure that treats failed installs as feedback.
Python 3.11 is the current optimal version for ML/AI workloads on Discoverer. TensorFlow, PyTorch, Horovod, and related packages have the best compatibility with Python 3.11.
Note
In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions.
Since packages from conda-forge may not include the Python version in their search metadata, test the installation with different Python versions in separate virtual environments (point to the selected environment using --prefix):
#!/bin/bash
#SBATCH --job-name=search_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_pytorch.%j.out
#SBATCH -e search_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for CPU-only PyTorch packages
conda search -c conda-forge pytorch-cpu
# Get detailed information
conda search -c conda-forge pytorch-cpu --info
#!/bin/bash
#SBATCH --job-name=test_python_version
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:30:00
#SBATCH -o test_python_version.%j.out
#SBATCH -e test_python_version.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Test which Python version works by trying to install the target package
# Replace PACKAGE_NAME with pytorch-cpu or tensorflow-cpu
# Replace <your_slurm_project_account_name> with the actual project account name
PACKAGE_NAME=pytorch-cpu # or tensorflow-cpu
# Note: Python 3.11 is the current optimal version for ML/AI workloads on Discoverer.
# TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11.
# This script tests higher versions first, but Python 3.11 should be used unless testing shows
# that a higher version is compatible. In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible
# with higher Python versions.
# Test Python 3.13 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py313
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.13 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
echo "Python 3.13 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
echo "A production environment can now be created with Python 3.13"
exit 0
else
echo "Python 3.13 is not supported. Cleaning up test environment and trying lower version."
conda env remove --prefix ${VIRTUAL_ENV} -y
fi
# Test Python 3.12 (to test compatibility with future versions)
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/test_py312
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
conda create --prefix ${VIRTUAL_ENV} python=3.12 --solver=libmamba -y
if conda install --prefix ${VIRTUAL_ENV} -c conda-forge ${PACKAGE_NAME} --solver=libmamba --dry-run; then
echo "Python 3.12 is supported by ${PACKAGE_NAME}. Test environment kept at ${VIRTUAL_ENV}"
echo "A production environment can now be created with Python 3.12"
exit 0
else
echo "Python 3.12 is not supported. Cleaning up test environment and trying lower version."
conda env remove --prefix ${VIRTUAL_ENV} -y
fi
# Python 3.11 is the current optimal version - use this for production
# Continue with 3.10, etc. only if 3.11 doesn't work
Common setup: Create virtual environment with compatible Python version
Once the compatible Python version is determined, create the virtual environment using the template SLURM batch script provided below.
Warning
Do not use conda activate when working with virtual environments on Discoverer. Always use export PATH=${VIRTUAL_ENV}/bin:${PATH}, export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH} to expose the environment to the shell.
#!/bin/bash
#SBATCH --partition=cn
#SBATCH --job-name=install
#SBATCH --time=00:30:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH -o install.%j.out
#SBATCH -e install.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Check if the target folder already exists
[ -d ${VIRTUAL_ENV} ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit; }
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, those packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
# Fully expose the Python virtual environment
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Verify Python version
python --version
Install PyTorch-CPU
On Discoverer CPU cluster, PyTorch-CPU must always be installed with nomkl because MKL (Intel Math Kernel Library) does not work well on AMD Zen2 processors (AMD EPYC 7H12 installed on Discoverer CPU compute nodes). The nomkl package ensures that MKL libraries are not used, preventing performance issues and compatibility problems on AMD architecture.
For distributed training with Horovod, install the pytorch-cpu and torchvision packages together with horovod in the same command line to ensure Open MPI is pulled in as the MPI dependency (otherwise, MPICH might be installed) and mpi4py is installed as a dependency.
Optional: Search for available PyTorch CPU packages
To check which PyTorch CPU versions are available before installing, save the SLURM batch script below as run_search_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=search_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_pytorch.%j.out
#SBATCH -e search_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for CPU-only PyTorch packages
conda search -c conda-forge pytorch-cpu
# Get detailed information
conda search -c conda-forge pytorch-cpu --info
Submit the job to the queue:
sbatch run_search_pytorch.sh
and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).
Installation script
Save the SLURM batch script provided below as run_install_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=install_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_pytorch.%j.out
#SBATCH -e install_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
else
echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi
# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# For distributed training with Horovod:
# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
# For basic PyTorch installation (without distributed training):
# Install PyTorch CPU version in nomkl mode (required)
# torchvision is needed for datasets and image transformations
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge pytorch-cpu torchvision nomkl --solver=libmamba
# Install additional packages if needed
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge transformers --solver=libmamba
# Install multiple packages at once (pytorch-cpu must always include nomkl)
# conda install --prefix ${VIRTUAL_ENV} -c conda-forge \
# pytorch-cpu \
# torchvision \
# nomkl \
# transformers \
# scikit-learn \
# pandas \
# numpy \
# --solver=libmamba
Submit the job to the queue:
sbatch run_install_pytorch.sh
and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).
Note
- With conda as the installer, the PyTorch package to install is named pytorch-cpu, but within Python code it is imported as a module under the name torch (import torch)
- torchvision is required for datasets (e.g., MNIST) and image transformations
- For distributed training: pytorch-cpu, horovod, and torchvision must be installed together in the same command line to ensure Open MPI is used instead of MPICH and to bring mpi4py as a dependency
- Always include nomkl when installing pytorch-cpu on the Discoverer CPU partition because of the AMD Zen2 processor microarchitecture
- For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)
Verify PyTorch-CPU installation
After installing pytorch-cpu, verify the installation. Save the SLURM batch script below as run_verify_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=verify_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_pytorch.%j.out
#SBATCH -e verify_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Check PyTorch version
python -c "import torch; print('PyTorch version:', torch.__version__)"
# Check torchvision version
python -c "import torchvision; print('torchvision version:', torchvision.__version__)"
# Verify PyTorch is CPU-only (should return False for CUDA availability)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Check the number of CPU threads PyTorch will use (physical cores only; SMT threads are not counted)
python -c "import torch; print('Number of threads:', torch.get_num_threads())"
# Test basic tensor operations
python -c "import torch; x = torch.randn(5, 3); print('Tensor shape:', x.shape); print('Tensor device:', x.device)"
Submit the job to the queue:
sbatch run_verify_pytorch.sh
and follow the content written to the standard output and standard error files (verify_pytorch.JOBID.out and verify_pytorch.JOBID.err) to see the verification results.
Expected output should show:
- PyTorch version number
- torchvision version number
- CUDA available: False (since this is CPU-only)
- Number of threads available (SMT threads are not counted)
- Tensor operations working correctly
Examine available PyTorch methods, functions, and constants
To inspect what methods, functions, and constants are available in the installed PyTorch, save the SLURM batch script below as run_examine_pytorch.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=examine_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_pytorch.%j.out
#SBATCH -e examine_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# List all available attributes in torch module
python -c "import torch; print('Available attributes:'); print([attr for attr in dir(torch) if not attr.startswith('_')])"
# Check specific submodules
python -c "import torch; print('torch.nn available:', hasattr(torch, 'nn')); print('torch.optim available:', hasattr(torch, 'optim')); print('torch.distributed available:', hasattr(torch, 'distributed'))"
# Check BLAS backend (should show OpenBLAS when nomkl is used)
python -c "import torch; print('BLAS library:', torch.__config__.show())"
# Check if MKLDNN is actually installed and available at runtime
python -c "import torch; print('MKLDNN available:', torch.backends.mkldnn.is_available() if hasattr(torch.backends, 'mkldnn') else 'mkldnn backend not found')"
python -c "import torch; print('MKLDNN enabled:', torch.backends.mkldnn.enabled if hasattr(torch.backends, 'mkldnn') else 'N/A')"
# List available tensor operations
python -c "import torch; ops = [attr for attr in dir(torch) if callable(getattr(torch, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"
# Check constants
python -c "import torch; print('Float32:', torch.float32); print('CPU device:', torch.device('cpu')); print('Default dtype:', torch.get_default_dtype())"
Submit the job to the queue:
sbatch run_examine_pytorch.sh
and follow the content written to the standard output and standard error files (examine_pytorch.JOBID.out and examine_pytorch.JOBID.err) to see the available methods, functions, and constants.
Alternatively, run interactively using srun:
srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
--nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash
# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import torch; print('Available attributes:', [attr for attr in dir(torch) if not attr.startswith('_')])"
# ... (continue with other examination commands)
These tests help to verify:
- Which PyTorch modules are available (nn, optim, distributed, etc.)
- Which BLAS library is being used (should be OpenBLAS with nomkl)
- Available tensor operations and constants
The build configuration should show BLAS_INFO=open and LAPACK_INFO=open, confirming that OpenBLAS is used for BLAS and LAPACK operations instead of MKL. This is the expected behavior when using nomkl. The USE_MKLDNN=1 setting indicates that MKLDNN (Intel MKL-DNN, now oneDNN) is available. MKLDNN is embedded in the libtorch/pytorch package and statically linked into PyTorch, so it remains available even when nomkl is used. This configuration is correct: nomkl prevents MKL BLAS/LAPACK usage (which does not work well on AMD Zen2 processors), while MKLDNN (for neural network operations) remains available as it is part of the PyTorch build itself.
Install TensorFlow-CPU
Optional: Search for available TensorFlow packages
To check which TensorFlow CPU versions are available before installing, save the SLURM batch script below as run_search_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=search_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=00:10:00
#SBATCH -o search_tensorflow.%j.out
#SBATCH -e search_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Search for TensorFlow CPU packages
conda search -c conda-forge tensorflow-cpu
# Search for specific package with build string
conda search -c conda-forge "tensorflow-cpu=*"
# Check what's available in the configured channels
conda search tensorflow-cpu
Submit the job to the queue:
sbatch run_search_tensorflow.sh
and follow the content written to the standard output and standard error files (search_tensorflow.JOBID.out and search_tensorflow.JOBID.err) to see which TensorFlow CPU versions are available.
Installation script
To use Horovod for distributed training, tensorflow-cpu and horovod must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer. Installing them together ensures Horovod is built with CPU-only support.
Important note on compatibility: Even when installing together, conda’s dependency resolver may select versions that are technically compatible at the package level but have runtime API incompatibilities (e.g., Horovod callbacks may not work with certain TensorFlow versions). Training scripts should handle this by falling back to manual synchronization when callbacks are incompatible.
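The sketch below shows one possible shape of such a fallback with tf.keras and model.fit(). It is illustrative only, not the exact logic of the example training script; the use_hvd_callbacks flag and the ManualBroadcast class are hypothetical names introduced here.
# Illustrative fallback sketch: prefer Horovod's Keras callbacks, and fall back to a
# manual broadcast when they are known to be incompatible with the installed TensorFlow.
import tensorflow as tf
import horovod.tensorflow as hvd_tf        # low-level ops (broadcast_variables)
import horovod.tensorflow.keras as hvd     # Keras integration (callbacks, DistributedOptimizer)

hvd.init()

use_hvd_callbacks = True  # set to False if the callbacks fail at runtime with this TensorFlow version

if use_hvd_callbacks:
    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        hvd.callbacks.MetricAverageCallback(),
    ]
else:
    class ManualBroadcast(tf.keras.callbacks.Callback):
        """Manually broadcast the model weights from rank 0 when training starts."""
        def on_train_begin(self, logs=None):
            hvd_tf.broadcast_variables(self.model.variables, root_rank=0)

    callbacks = [ManualBroadcast()]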
Save the SLURM batch script provided below as run_install_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=install_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=01:30:00
#SBATCH -o install_tensorflow.%j.out
#SBATCH -e install_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
# Create virtual environment if it doesn't exist
if [ ! -d ${VIRTUAL_ENV} ]; then
# Create Python virtual environment with Python 3.11
# Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
# (TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11)
# Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
if [ $? -ne 0 ]; then
echo "Conda Python virtual environment creation failed" >&2
exit 1
fi
echo "Python virtual environment successfully created!"
else
echo "Virtual environment already exists at ${VIRTUAL_ENV}"
fi
# Expose the virtual environment (do not use conda activate)
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Set channel priority to flexible (required when packages exist in multiple channels)
conda config --set channel_priority flexible
# Install TensorFlow CPU and Horovod together (required for CPU-only clusters)
# If Horovod is installed before TensorFlow, it will install CUDA dependencies which are not needed on CPU-only clusters
# Installing them together ensures Horovod is built with CPU-only support and conda resolves compatible versions
# Option 1: Let conda choose latest compatible versions (recommended for best compatibility)
conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba
# Option 2: Pin TensorFlow version, let conda choose compatible Horovod
# Use this to install a specific TensorFlow version (replace <version> with the desired version)
# conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu=<version> horovod nomkl --solver=libmamba
Submit the job to the queue:
sbatch run_install_tensorflow.sh
and follow the content written to the standard output and standard error files (they will appear once the job enters the running state).
Important
- Install TensorFlow and Horovod together: To use Horovod for distributed training, tensorflow-cpu and horovod must be installed in the same command. Installing Horovod before TensorFlow will cause it to install CUDA dependencies, which are not needed on CPU-only clusters like Discoverer.
- Version compatibility: Conda’s dependency resolver does not guarantee runtime API compatibility. Even when installing together, runtime API incompatibilities may be encountered (e.g., Horovod callbacks failing with AttributeError: 'Variable' object has no attribute 'ref'). The training script handles this automatically by falling back to manual synchronization.
- Use flexible channel priority: When “excluded by strict repo priority” appears, set channel_priority to flexible to allow conda to use packages from any configured channel.
- Do not specify a build string: If pinning a version, install by version only (e.g., tensorflow-cpu=<version>), not with the build string (e.g., tensorflow-cpu=<version>=cpu_py313hbca4264_0).
- Include nomkl: As with PyTorch, include nomkl for AMD Zen2 processors.
Runtime API compatibility: TensorFlow and Horovod may have runtime API incompatibilities even when conda installs them together. If errors like Horovod has been shut down or AttributeError: 'Variable' object has no attribute 'ref' are encountered, this indicates the installed versions are incompatible. The training script attempts to handle this, but if Horovod shuts down during initialization, training will fail. In this case, different version combinations may need to be tried or PyTorch may be used instead, which has better compatibility with Horovod.
Verify TensorFlow installation
After installing tensorflow-cpu, verify the installation. Save the SLURM batch script below as run_verify_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=verify_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o verify_tensorflow.%j.out
#SBATCH -e verify_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Check TensorFlow version
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
# Check available CPU devices (TensorFlow shows logical CPU devices, not physical sockets)
python -c "import tensorflow as tf; print('CPU devices:', tf.config.list_physical_devices('CPU'))"
# Check number of CPU threads TensorFlow can use
python -c "import tensorflow as tf; print('Inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"
# Check total CPU cores available to TensorFlow
python -c "import tensorflow as tf; import os; print('Total CPU cores available:', len(os.sched_getaffinity(0)) if hasattr(os, 'sched_getaffinity') else 'N/A')"
# Configure TensorFlow to use all available CPU threads (256 threads per node on Discoverer)
python -c "import tensorflow as tf; tf.config.threading.set_inter_op_parallelism_threads(256); tf.config.threading.set_intra_op_parallelism_threads(256); print('Configured inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads()); print('Configured intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())"
# Check if TensorFlow has MPI integration
python -c "import tensorflow as tf; print('tf.distribute available:', hasattr(tf, 'distribute')); print('tf.distribute.Strategy available:', hasattr(tf.distribute, 'Strategy') if hasattr(tf, 'distribute') else False)"
# Check TensorFlow build configuration for MPI support
python -c "import tensorflow as tf; config = tf.sysconfig.get_build_info(); print('Build flags:', config.get('cpu_compiler_flags', 'N/A')); import sysconfig; print('Has MPI:', 'mpi' in str(sysconfig.get_config_var('LIBS')).lower() or 'mpi' in str(tf.sysconfig.get_compile_flags()).lower())"
# Check Horovod installation (if installed with TensorFlow)
python << 'EOF'
try:
    import horovod.tensorflow as hvd
    print('Horovod version:', hvd.__version__)
    print('Horovod TensorFlow support: OK')
except ImportError:
    print('Horovod not installed or TensorFlow support not available')
EOF
# Test basic tensor operations
python -c "import tensorflow as tf; x = tf.constant([[1.0, 2.0], [3.0, 4.0]]); print('Tensor shape:', x.shape); print('Tensor device:', x.device); y = tf.matmul(x, x); print('Matrix multiplication result shape:', y.shape)"
Submit the job to the queue:
sbatch run_verify_tensorflow.sh
and follow the content written to the standard output and standard error files (verify_tensorflow.JOBID.out and verify_tensorflow.JOBID.err) to see the verification results.
Expected output should show:
- TensorFlow version number
- CPU devices available
- Thread configuration (inter-op and intra-op parallelism)
- Horovod installation status (if installed)
- Tensor operations working correctly
Examine available TensorFlow methods, functions, and constants
To inspect what methods, functions, and constants are available in the installed TensorFlow, save the SLURM batch script below as run_examine_tensorflow.sh and provide the correct account and qos values:
#!/bin/bash
#SBATCH --job-name=examine_tensorflow
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH -o examine_tensorflow.%j.out
#SBATCH -e examine_tensorflow.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
# Check if the target folder exists
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
# Ensure the virtual environment is exposed
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# List all available attributes in tensorflow module
python -c "import tensorflow as tf; print('Available attributes:'); print([attr for attr in dir(tf) if not attr.startswith('_')])"
# Check specific submodules
python -c "import tensorflow as tf; print('tf.keras available:', hasattr(tf, 'keras')); print('tf.nn available:', hasattr(tf, 'nn')); print('tf.optimizers available:', hasattr(tf, 'optimizers')); print('tf.distribute available:', hasattr(tf, 'distribute'))"
# Check TensorFlow build configuration
python -c "import tensorflow as tf; print('Build info:', tf.sysconfig.get_build_info())"
# Check available operations
python -c "import tensorflow as tf; ops = [attr for attr in dir(tf) if callable(getattr(tf, attr, None)) and not attr.startswith('_')]; print('Available operations (sample):', ops[:20])"
# Check constants and dtypes
python -c "import tensorflow as tf; print('Float32:', tf.float32); print('Int32:', tf.int32); print('Default float dtype:', tf.keras.backend.floatx())"
# Check Keras availability and version
python -c "import tensorflow as tf; print('Keras version:', tf.keras.__version__ if hasattr(tf.keras, '__version__') else 'N/A'); print('Keras available:', hasattr(tf, 'keras'))"
Submit the job to the queue:
sbatch run_examine_tensorflow.sh
and follow the content written to the standard output and standard error files (examine_tensorflow.JOBID.out and examine_tensorflow.JOBID.err) to see the available methods, functions, and constants.
Alternatively, run interactively using srun:
srun --partition=cn --account=<your_slurm_project_account_name> --qos=<your_qos> \
--nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --mem=4G --time=00:10:00 --pty /bin/bash
# Then in the interactive session:
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
python -c "import tensorflow as tf; print('Available attributes:', [attr for attr in dir(tf) if not attr.startswith('_')])"
# ... (continue with other examination commands)
These tests help to verify:
- Which TensorFlow modules are available (keras, nn, optimizers, distribute, etc.)
- Build configuration and available features
- Available tensor operations and constants
- Keras integration and version
Create SLURM configuration for ML/AI workloads on CPU clusters
- Resource allocation
- CPU-intensive tasks: Request all 256 threads per node
- Memory-intensive tasks: Use fat nodes (1TB RAM) when needed
- I/O-intensive tasks: Consider network-attached storage performance
- Wall time limits: All jobs have maximum runtime limits specified with #SBATCH --time=. Jobs are automatically terminated when the wall time limit is reached. Plan checkpoint intervals accordingly (see Checkpointing)
- Job arrays for parallel experiments
#SBATCH --array=0-99%10 # Run 100 jobs, max 10 concurrent
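As an illustration, the array index can select a different hyperparameter for each trial; the train.py script name and its --learning-rate/--trial-id options below are hypothetical:
#!/bin/bash
#SBATCH --job-name=hparam_array
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --array=0-99%10          # 100 trials, at most 10 running concurrently
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o hparam_array.%A_%a.out
#SBATCH -e hparam_array.%A_%a.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
# Each array task trains with a different learning rate (hypothetical script and flags)
LEARNING_RATES=(0.1 0.05 0.01 0.005 0.001)
LR=${LEARNING_RATES[$((SLURM_ARRAY_TASK_ID % 5))]}
python train.py --learning-rate ${LR} --trial-id ${SLURM_ARRAY_TASK_ID}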
- Memory management
- Monitor memory usage with sstat during job execution
- Use memory-efficient libraries (e.g., pandas with chunking)
- Consider using fat nodes for large datasets
- Checkpointing
All SLURM jobs on Discoverer have wall time limitations (specified with #SBATCH --time=). When a job reaches its wall time limit, it is automatically terminated by SLURM, regardless of whether training has completed. Therefore, checkpointing is essential for any training job that may exceed the allocated wall time.
The necessity of making checkpoints:
- Jobs are terminated when wall time is reached, even if training is incomplete
- Without checkpoints, all progress is lost and training must be restarted from the beginning
- Checkpoints allow resuming training from the last saved state after the job is terminated
Best practices:
- Implement checkpointing for all long-running training jobs
- Save checkpoints to network storage (/valhalla/projects/<your_slurm_project_account_name>/outputs/)
- Plan checkpoint intervals based on the wall time limit (e.g., if wall time is 24 hours, save checkpoints every 2-4 hours)
- Ensure checkpoint saving happens frequently enough that significant progress is not lost if the job is terminated
- Only rank 0 should write checkpoints in distributed training to avoid conflicts
- Include epoch number, model state, optimizer state, and any other necessary information in checkpoints
- Implement checkpoint loading at the start of training to resume from the last checkpoint if available (see the sketch after this list)
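A minimal sketch of this pattern for PyTorch with Horovod is shown below. The checkpoint path, save interval, and the model, optimizer, num_epochs, and train_one_epoch names are placeholders, not part of the provided example script; adapt them to the actual project directory and wall time.
# Illustrative checkpointing sketch for PyTorch + Horovod (placeholders, not the example script)
import os
import torch
import horovod.torch as hvd

hvd.init()

ckpt_path = '/valhalla/projects/<your_slurm_project_account_name>/outputs/checkpoint.pt'  # placeholder path
save_every = 2  # save every N epochs; tune this to the wall time limit

# Resume from the last checkpoint if one exists (all ranks load the same file)
start_epoch = 0
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    start_epoch = state['epoch'] + 1

# Ensure every rank starts from the same parameters and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(start_epoch, num_epochs):
    train_one_epoch(model, optimizer, epoch)  # placeholder for the actual training loop
    # Only rank 0 writes checkpoints to avoid conflicts
    if hvd.rank() == 0 and (epoch + 1) % save_every == 0:
        torch.save({'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict()},
                   ckpt_path)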
Suitable ML/AI workloads for CPU clusters
- Traditional machine learning
Scikit-learn workflows
- Large-scale feature engineering and preprocessing
- Hyperparameter optimization with grid/random search
- Cross-validation on large datasets
- Ensemble methods (Random Forest, Gradient Boosting)
- Model training on datasets that fit in memory (up to 251 GB allocatable per node)
XGBoost/LightGBM/CatBoost
- Tabular data classification/regression
- Large-scale gradient boosting
- Feature importance analysis
- Model interpretation
- Deep learning (CPU-optimized)
TensorFlow/Keras with CPU optimization
- Training smaller neural networks (CNNs, RNNs, MLPs)
- Inference workloads
- Transfer learning with pre-trained models
- Model fine-tuning
- Distributed training (see official TensorFlow distributed documentation)
PyTorch with CPU backend
- Research prototyping
- Model development and debugging
- CPU-optimized inference
- Distributed training (see official PyTorch distributed documentation)
- Natural language processing (NLP)
Traditional NLP pipelines
- Large-scale text preprocessing and tokenization
- Feature extraction (TF-IDF, word embeddings)
- Document classification
- Sentiment analysis on large corpora
Transformer models (CPU inference)
- Batch inference with Hugging Face models
- Text generation (smaller models)
- Embedding extraction
- Model quantization and optimization
- Computer vision
Image processing and feature extraction
- Large-scale image preprocessing
- Feature extraction (SIFT, SURF, HOG)
- Image classification with traditional ML
- Batch image transformations
- Data science and analytics
Large-scale data processing
- Pandas/NumPy operations on large datasets
- Dask for large-scale data processing
- Feature engineering pipelines
- Data validation and cleaning
- Hyperparameter optimization
Hyperparameter optimization
- Grid search with parallel trials
- Random search with parallel trials
- Bayesian optimization
- AutoML frameworks
- Reinforcement learning
CPU-based RL training
- Policy gradient methods
- Q-learning variants
- Multi-agent systems
- Environment simulation
Distributed training
Recommended approach: Use MPI for HPC clusters
For multi-user HPC clusters like Discoverer, MPI is the recommended and verified approach for distributed training.
Why MPI over Gloo?
Gloo (PyTorch’s TCP/IP-based backend) has major practical problems on shared HPC clusters:
- Port conflicts: Multiple users can pick the same MASTER_PORT, causing job failures
- No automatic port management: Manual coordination required between users
- Race conditions: Port availability can change between check and use
- Administrative burden: Unsuitable for multi-user environments
MPI was designed for multi-user HPC environments and handles all communication automatically without port conflicts. It provides:
- No port management - the MPI runtime handles all communication
- Multi-user safety - designed for shared clusters
- Battle-tested maturity - decades of HPC use
- SLURM integration - seamless with srun --mpi=
Conclusion: For multi-user HPC clusters, MPI is the correct choice. Gloo should be avoided in shared environments.
TensorFlow distributed training with Horovod
Horovod provides a unified interface for distributed training with both TensorFlow and PyTorch. For a complete working example with TensorFlow, see the Complete working example: Distributed training on MNIST dataset with TensorFlow section below.
Basic TensorFlow with Horovod setup:
import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Initialize Horovod
hvd.init()
# Build and compile model
model = tf.keras.Sequential([
# Add layers here
])
# Scale learning rate by number of workers
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
# Wrap optimizer with Horovod
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')
# Add Horovod callbacks
callbacks = [
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback(),
# Add other callbacks here
]
# Only rank 0 should write checkpoints
if hvd.rank() == 0:
callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
model.fit(dataset, epochs=10, callbacks=callbacks)
Running PyTorch workloads
Package installation for distributed training
For distributed training with Horovod, install PyTorch, Horovod, and torchvision together in the same command. See Install PyTorch-CPU in the Virtual environment setup section for complete installation instructions.
Important
Python version requirement:
- Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
- TensorFlow, PyTorch, Horovod, and related packages have best compatibility with Python 3.11
- The virtual environment must be created with Python 3.11 for optimal compatibility
- Note: In the future, TensorFlow, PyTorch, Horovod, and related packages might become compatible with higher Python versions
Important
MPI implementation for Horovod:
- Install PyTorch and Horovod together to ensure Open MPI is used as the dependency (otherwise, MPICH might be installed)
- No separate MPI module needs to be loaded - Open MPI is installed as a dependency in the virtual environment when Horovod is installed
- Use srun --mpi=pmix to launch jobs - srun automatically uses all allocated tasks, so there is no need to specify -np
- The mpirun command is also available from the virtual environment (installed as part of the Open MPI dependency), but srun is preferred for SLURM integration
Installation command for distributed training:
# Install PyTorch, Horovod, and torchvision together (required to bring Open MPI and mpi4py as dependencies)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
Note
- With conda, the package is named pytorch-cpu (or pytorch), and it is imported as import torch
- torchvision is required for datasets (e.g., MNIST) and image transformations
- pytorch-cpu, horovod, and torchvision must be installed together in the same command to ensure Open MPI is used instead of MPICH and to bring mpi4py as a dependency
- Open MPI is installed as a dependency in the virtual environment when Horovod is installed - no separate MPI module needs to be loaded
- Always include nomkl when installing pytorch-cpu on Discoverer (AMD Zen2 processors)
- For distributed training with Horovod, use Python 3.11 (current optimal version for best compatibility)
PyTorch distributed training with MPI
Option 1: PyTorch native DDP with MPI backend
SLURM batch script:
#!/bin/bash
#SBATCH --job-name=pytorch_distributed
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o pytorch_distributed.%j.out
#SBATCH -e pytorch_distributed.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Launch with SLURM's MPI integration
srun --mpi=pmix python train.py
Python code example:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
# Initialize process group with MPI backend
# No MASTER_ADDR, no MASTER_PORT needed - MPI handles everything
dist.init_process_group(backend='mpi')
# Get the rank of this process (dist.get_rank() returns the global rank) and the world size
local_rank = dist.get_rank()
world_size = dist.get_world_size()
# Create model and move to device
model = YourModel()
model = DDP(model)
# Create distributed sampler for data loading
train_sampler = DistributedSampler(
train_dataset,
num_replicas=world_size,
rank=local_rank
)
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=batch_size,
sampler=train_sampler
)
# Training loop
for epoch in range(num_epochs):
train_sampler.set_epoch(epoch)
for data, target in train_loader:
# Training code here
pass
# Clean up
dist.destroy_process_group()
Option 2: Horovod with PyTorch (recommended for simplicity)
Horovod works with both TensorFlow and PyTorch and simplifies distributed training setup.
SLURM batch script:
#!/bin/bash
#SBATCH --job-name=horovod_pytorch
#SBATCH --partition=cn
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH -o horovod_pytorch.%j.out
#SBATCH -e horovod_pytorch.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit; }
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit; }
export UCX_IB_SL=0
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/torch
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export LD_LIBRARY_PATH=${VIRTUAL_ENV}/lib:${LD_LIBRARY_PATH}
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Launch with srun (srun automatically uses all allocated tasks)
srun --mpi=pmix python train_horovod.py
Python code example:
import torch
import horovod.torch as hvd
# Initialize Horovod
hvd.init()
# Scale learning rate by number of workers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Wrap optimizer with Horovod
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters()
)
# Broadcast initial parameters from rank 0 to all workers
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# Training loop
for epoch in range(num_epochs):
for data, target in train_loader:
optimizer.zero_grad()
loss = criterion(model(data), target)
loss.backward()
optimizer.step()
Complete working example: Distributed training on MNIST dataset
This section provides a complete, working example of distributed training using the MNIST handwritten digit dataset. MNIST is a standard benchmark dataset that’s automatically downloaded when the code is run, making it ideal for testing distributed training setups.
Example files: The complete training script and SLURM batch script are provided as separate, runnable files:
- train_mnist_horovod_pytorch.py - Complete Python training script
- run_mnist_training_pytorch.sh - SLURM batch submission script
Discoverer Storage Requirements:
- All data and outputs must be stored under /valhalla/projects/<your_slurm_project_account_name>/ (replace <your_slurm_project_account_name> with the SLURM account name)
- Never store data in the home directory (~/) - this violates Discoverer’s best practices
- The example scripts automatically use /valhalla/projects/<your_slurm_project_account_name>/data/ for datasets
- Outputs (checkpoints, logs) should go to /valhalla/projects/<your_slurm_project_account_name>/outputs/
- The SLURM script creates these directories automatically
These files can be copied and run directly after configuring the SLURM account settings.
Overview¶
- Dataset: MNIST
- Training set: 60,000 images
- Test set: 10,000 images
- Image size: 28×28 pixels
- Data source: Automatically downloaded from PyTorch’s built-in datasets (no manual download needed)
- Model: Simple convolutional neural network (CNN) suitable for CPU training
- Framework: PyTorch with Horovod
- Training: Distributed across multiple nodes using MPI
Step 1: Prepare the working directory¶
On the Discoverer login node, create a directory for the training job and copy the example files:
# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_pytorch.py .
cp /path/to/run_mnist_training_pytorch.sh .
Note: The example files (train_mnist_horovod_pytorch.py and run_mnist_training_pytorch.sh) are provided as separate, runnable scripts that can be copied and used directly.
Step 2: Get the training script¶
Copy the training script to the working directory:
# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_pytorch.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
The training script (train_mnist_horovod_pytorch.py) contains:
- CNN model definition for MNIST digit classification
- Distributed data loading with Horovod
- Training and validation loops with metric aggregation
- Automatic MNIST dataset download (only rank 0 downloads)
- Complete distributed training implementation
The full script can be viewed in train_mnist_horovod_pytorch.py or modified as needed.
Step 3: Get the SLURM batch script¶
Copy the SLURM batch script to the working directory:
# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_pytorch.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example
Before submitting, edit run_mnist_training_pytorch.sh and replace:
- <your_slurm_project_account_name> with the actual SLURM account name
- <your_qos> with the actual QoS name
- Verify the virtual environment path matches the setup
The SLURM script (run_mnist_training_pytorch.sh) contains:
- Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
- Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
- Module loading (anaconda3 only - Open MPI is installed as a dependency in the virtual environment when Horovod is installed)
- Virtual environment setup
- Job launch with srun --mpi=pmix (srun automatically uses all allocated tasks)
Understanding epochs in distributed training:
- Epochs are shared across all ranks, not per-rank
- With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
- Each rank processes a different subset of data during each epoch (see the sketch after this list)
- All ranks work together on the same epochs simultaneously
- 256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks
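To make this concrete, the short sketch below uses the figures from this example (60,000 MNIST training images, 256 ranks, 5 epochs) to show how much data each rank processes per epoch; the numbers are illustrative:
# Illustrative arithmetic: data sharding across ranks on the MNIST training set
dataset_size = 60000   # MNIST training images
world_size = 256       # 2 nodes x 128 ranks per node
epochs = 5             # shared by all ranks, not multiplied by the rank count

samples_per_rank_per_epoch = dataset_size // world_size
print(samples_per_rank_per_epoch)               # 234 images per rank per epoch (remainder handled by the sampler)
print(samples_per_rank_per_epoch * world_size)  # 59904 - essentially the full dataset is covered each epoch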
Resource allocation for full node allocation:
- Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
- Recommended: --ntasks-per-node=128 (one rank per core)
- Recommended: --cpus-per-task=2 (use both threads per core)
- Memory: Adjust --mem based on the model size (e.g., 251G for a full node - the maximum available on Discoverer)
- Total ranks: 2 nodes × 128 tasks = 256 ranks
- Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes
The full script can be viewed in run_mnist_training_pytorch.sh and resource requirements can be modified as needed.
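For reference, a resource-allocation header matching the figures above might look like the sketch below; the actual run_mnist_training_pytorch.sh may differ in details.
# Illustrative full-node allocation on 2 nodes (256 ranks, 512 logical CPUs in total)
#SBATCH --partition=cn
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128   # one rank per physical core
#SBATCH --cpus-per-task=2       # use both SMT threads of each core
#SBATCH --mem=251G              # maximum allocatable memory per node

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun --mpi=pmix python train_mnist_horovod_pytorch.py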
Step 4: Make the script executable and submit the job¶
On the login node:
# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_pytorch.sh
Before submitting: Ensure run_mnist_training_pytorch.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct.
Step 5: Monitor the job¶
# Check job status
squeue --me
# Monitor output in real-time (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f pytorch_cpu_training.JOBID.out
# Check for errors
tail -f pytorch_cpu_training.JOBID.err
Step 6: Check job runtime and statistics¶
After the job completes, check how long it took:
# Show job-level summary only (aggregated, no job steps)
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
# Show detailed accounting with job steps (default behavior)
sacct -j JOBID --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,TotalCPU,MaxRSS,MaxVMSize,NodeList
# Show elapsed time (wall clock time) for a job
sacct -j JOBID -X --format=JobID,JobName,Elapsed,State
# Show resource usage summary
sacct -j JOBID -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,MaxVMSize,ReqMem,AllocCPUS
# Show all jobs for today
sacct --starttime=today -X --format=JobID,JobName,Partition,Elapsed,State,ExitCode
Understanding sacct output:
Without -X flag (default): Shows job steps separately
- JOBID: Main job (aggregated across all steps)
- JOBID.batch: The batch script execution
- JOBID.0, JOBID.1, etc.: Individual job steps (e.g., MPI processes, srun steps)
With -X flag: Shows only job-level summary (aggregated)
- Single line with total statistics for the entire job
Example output without -X:
$ sacct -j 4060972 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 05:03:21 512
4060972.bat+ batch 00:00:56 00:00.513 21176K 256
4060972.0 hydra_pmi+ 00:00:55 05:03:21 150700188K 4
Example output with -X (job-level only):
$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 00:00:00 512
Common fields:
- Elapsed: Wall clock time (how long the job ran)
- TotalCPU: Total CPU time used (sum across all CPUs)
- MaxRSS: Maximum resident set size (peak memory usage)
- MaxVMSize: Maximum virtual memory size
- AllocCPUS: Number of CPUs allocated
Recommendation: Use -X flag for quick job-level summaries. Remove -X to see detailed breakdown by job steps.
Performance comparison: Scaling with number of CPUs
The benefits of fully allocating nodes are clear when comparing runtime. Note that SLURM’s Elapsed time includes job startup/teardown overhead, while the training script reports actual computation time:
Partial allocation (8 tasks per node, 32 CPUs total):
$ sacct -j 4060968 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060968 mnist_hor+ 00:02:35 00:00:00 32
- Wall time (SLURM): 2 minutes 35 seconds
- Allocated resources: 32 CPUs allocated
Full node allocation (128 tasks per node, 512 CPUs total):
$ sacct -j 4060972 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060972 mnist_hor+ 00:00:56 00:00:00 512
- Wall time (SLURM): 56 seconds
- Allocated resources: 512 CPUs (128 tasks × 2 nodes × 2 CPUs per task)
- Speedup: ~2.8× faster (155 seconds → 56 seconds)
Large-scale runs (multiple nodes):
8 nodes (1024 CPUs total):
$ sacct -j 4060975 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060975 mnist_hor+ 00:00:52 00:00:00 1024
- Wall time (SLURM): 52 seconds
- Actual training time (from logs): 18.73 seconds
- Average time per epoch: 3.75 seconds
- Allocated resources: 1024 CPUs (128 tasks × 8 nodes × 2 CPUs per task)
16 nodes (2048 CPUs total):
$ sacct -j 4060974 -X --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocCPUS
JobID JobName Elapsed TotalCPU MaxRSS AllocCPUS
------------ ---------- ---------- ---------- ---------- ----------
4060974 mnist_hor+ 00:00:49 00:00:00 2048
- Wall time (SLURM): 49 seconds
- Actual training time (from logs): 14.75 seconds
- Average time per epoch: 2.95 seconds
- Allocated resources: 2048 CPUs (128 tasks × 16 nodes × 2 CPUs per task)
- Speedup vs 1024 CPUs: ~1.3× faster training time (18.73s → 14.75s)
Analysis
- Wall time vs. training time: SLURM's Elapsed includes job startup/teardown overhead. The training script's reported time is the actual computation time.
- Scaling efficiency: Training time scales well from 512 → 1024 → 2048 CPUs, showing good parallel efficiency.
- Overhead: Job startup/teardown overhead becomes more noticeable at larger scales (e.g., 49s wall time vs 14.75s training time for 2048 CPUs).
- Full node allocation: Using 128 tasks per node (one per core) maximizes resource allocation and minimizes training time.
Scalability testing results
The following results demonstrate how to measure scalability and training speed across different node configurations. These are speed/scalability tests, not optimized for accuracy. The accuracy values shown are lower than optimal because the focus is on demonstrating distributed training execution and measuring speedup.
2 nodes (256 ranks):
- Training time: 23.80 seconds
- Learning rate: 0.16 (sqrt scaling)
- Test accuracy: 66.66% → 10.08% (unstable - not optimized for accuracy)
4 nodes (512 ranks):
- Training time: 19.07 seconds
- Learning rate: 0.226 (sqrt scaling)
- Test accuracy: 64.99% → 21.16% (unstable - not optimized for accuracy)
8 nodes (1024 ranks):
- Training time: 15.37 seconds
- Learning rate: 0.1 (fixed scaling)
- Test accuracy: 19.84% → 66.18% (improving but not optimized)
16 nodes (2048 ranks):
- Training time: 15.10 seconds
- Learning rate: 0.1 (fixed scaling)
- Test accuracy: 20.68% → 64.96% (improving but not optimized)
Analysis:
- Training speed scales well: 23.80s → 19.07s → 15.37s → 15.10s
- Diminishing returns beyond 8 nodes (communication overhead)
- These results demonstrate scalability - actual accuracy requires further tuning (see below)
Achieving better accuracy
To achieve high accuracy (e.g., ~98-99% for MNIST) instead of just speed testing, apply these changes to the training script (see the sketch after this list):
- Reduce base learning rate: Change base_lr = 0.01 to base_lr = 0.001 or 0.0001
- Train for more epochs: Change num_epochs = 5 to num_epochs = 10 or 20
- Use Adam optimizer: Replace optim.SGD with optim.Adam or optim.AdamW for better stability
- Add gradient clipping: Before optimizer.step(), add torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
- Use learning rate schedule: Add a StepLR scheduler to decay the learning rate over epochs
- Increase batch size: Change batch_size = 64 to 128 or 256 (if memory allows)
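For orientation, a minimal sketch of how these adjustments fit together is shown below. It is not the example script itself: the tiny stand-in model exists only to keep the snippet self-contained, and the synchronize()/skip_synchronize() pattern is Horovod's documented way of combining gradient clipping with hvd.DistributedOptimizer.
import math
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd

hvd.init()

# Stand-in for the MNIST_CNN model defined in the example script
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

base_lr = 0.001                                    # lower base learning rate
lr = base_lr * math.sqrt(hvd.size())               # conservative scaling across ranks
optimizer = optim.AdamW(model.parameters(), lr=lr)             # AdamW instead of SGD
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # call scheduler.step() once per epoch

def clipped_step():
    # Inside the training loop: clip the averaged gradients between synchronize() and step()
    optimizer.synchronize()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    with optimizer.skip_synchronize():
        optimizer.step()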
Note
Better accuracy typically requires more epochs, lower learning rates, and more careful tuning, which increases training time. For production training, start with 2-4 nodes to establish baseline accuracy, then scale up while monitoring convergence.
Full node allocation significantly reduces training time. For the same workload, using 128 tasks per node instead of 8 tasks per node provides nearly 3× speedup while fully allocating the available resources. Scaling to multiple nodes further reduces training time, though with diminishing returns due to communication overhead.
Understanding the data flow
- Data location:
  - The MNIST dataset is already available at /opt/software/test/pytorch/mnist_data
  - The dataset is automatically loaded from this location - no download is needed
  - The dataset is approximately 60 MB and includes both training and test sets
  - The SLURM script sets the DATA_DIR environment variable to this location
- Data distribution:
  - Each process gets a different subset of the training data via DistributedSampler (see the sketch after this list)
  - With 16 processes (2 nodes × 8 tasks), each process handles ~3,750 training images
  - Data is automatically shuffled differently each epoch
- Model synchronization:
  - Initial model parameters are broadcast from rank 0 to all workers
  - After each training step, gradients are averaged across all processes using Horovod's ring-allreduce algorithm
  - All processes maintain identical model parameters throughout training
- Output:
  - Only rank 0 prints training progress to avoid duplicate output
  - Metrics (loss, accuracy) are averaged across all processes before printing
  - Checkpoints should be saved to /valhalla/projects/<your_slurm_project_account_name>/outputs (never use the home directory)
  - The SLURM script sets the OUTPUT_DIR environment variable for this purpose
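The data flow described above corresponds, roughly, to the following skeleton. The dataset path is the one documented above; the tiny linear model is a placeholder, not the example script's CNN:
import torch
import torch.nn.functional as F
import horovod.torch as hvd
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

hvd.init()

# Shared, pre-staged dataset location - no download needed
train_set = datasets.MNIST('/opt/software/test/pytorch/mnist_data', train=True,
                           download=False, transform=transforms.ToTensor())

# Shard the data: each rank sees len(train_set) / hvd.size() samples per epoch
sampler = DistributedSampler(train_set, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(train_set, batch_size=64, sampler=sampler, num_workers=2)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Rank 0's initial weights are broadcast so every rank starts from identical parameters
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(5):
    sampler.set_epoch(epoch)                      # reshuffle the shards differently each epoch
    for data, target in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()                           # gradients are averaged via ring-allreduce
        optimizer.step()
    avg_loss = hvd.allreduce(loss, name='avg_loss')   # average the last batch loss across ranks
    if hvd.rank() == 0:                           # only rank 0 prints (and would save checkpoints)
        print(f'Epoch {epoch + 1}: loss {avg_loss.item():.4f}')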
Expected output
When checking the output file (pytorch_cpu_training.JOBID.out), the output should be similar to:
Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/pytorch/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/pytorch/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/pytorch/mnist_data
Training dataset download completed
Starting training for 5 epochs...
Epoch 1, Batch 0, Loss: 2.3130
Epoch 1 - Average Loss: 0.8803, Average Accuracy: 72.08%
Test Loss: 0.0985, Test Accuracy: 97.07%
Epoch 2, Batch 0, Loss: 0.2280
Epoch 2 - Average Loss: 0.1597, Average Accuracy: 95.36%
Test Loss: 0.0540, Test Accuracy: 98.29%
Epoch 3, Batch 0, Loss: 0.1328
Epoch 3 - Average Loss: 0.1004, Average Accuracy: 97.09%
Test Loss: 0.0488, Test Accuracy: 98.45%
Epoch 4, Batch 0, Loss: 0.0106
Epoch 4 - Average Loss: 0.0834, Average Accuracy: 97.63%
Test Loss: 0.0505, Test Accuracy: 98.45%
Epoch 5, Batch 0, Loss: 0.1305
Epoch 5 - Average Loss: 0.0726, Average Accuracy: 97.86%
Test Loss: 0.0322, Test Accuracy: 98.99%
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Training job completed
Expected output characteristics:
- Job information header: Shows resource allocation and paths
- Process count: Should match --nodes × --ntasks-per-node (e.g., 2 × 128 = 256 for full allocation)
- Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
- Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
- Training progress: Loss should decrease and accuracy should increase over epochs
- Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
- Completion message: “Training job completed” indicates successful finish
Note
Only MPI rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.
Understanding process structure
When checking running processes with top or htop on a compute node, the following will be visible:
- Main Python training processes: 128 processes (one per --ntasks-per-node, matching one per core)
  - High CPU usage (180-220%) - each process uses 2 threads (matching --cpus-per-task=2)
  - Memory usage: ~500-530 MB per process
  - These are the distributed training workers
- pt_data_worker processes: 256 processes (2 per training process)
  - Lower CPU usage (~2%) - data loading workers
  - Memory usage: ~280-290 MB per worker
  - These are PyTorch DataLoader worker processes
The purpose of pt_data_worker processes:
- PyTorch's DataLoader uses num_workers=2 (set in the training script)
- Each training process spawns 2 worker processes to load data in parallel
- This prevents data loading from blocking training computation
- With 128 training processes × 2 workers = 256 pt_data_worker processes per node
Total processes per node (full allocation):
- 128 main Python processes (training, one per core)
- 256 pt_data_worker processes (data loading)
- Total: 384 processes per node
- Total across 2 nodes: 768 processes (256 training + 512 data workers)
Resource allocation:
- CPU cores: 128 cores per node × 2 nodes = 256 cores (fully utilized)
- Threads: 256 ranks × 2 threads = 512 threads (fully utilizing hyperthreading)
- Memory: Adjust --mem based on the model size (e.g., 251G per node for full allocation - the maximum available on Discoverer)
This is normal and expected behavior. The worker processes keep the CPUs busy by pre-loading the next batch while the current batch is being processed.
To reduce worker processes (if needed for resource constraints):
- With 256 ranks, num_workers=2 creates 512 data worker processes in total (256 × 2)
- Consider reducing to num_workers=1 for large-scale runs (256 worker processes total)
- Or set num_workers=0 to disable parallel data loading (slower data loading but minimal process overhead)
- For full node allocation (256 ranks), num_workers=1 is often a good balance (see the snippet below)
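The worker count is simply the num_workers argument of the DataLoader. A self-contained illustration, with dummy data standing in for the MNIST pipeline of the example script:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the MNIST dataset used by the example script
dataset = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))

# num_workers controls how many pt_data_worker processes each rank spawns
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=1)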
Troubleshooting
Job stuck in pending state with reason “(None)”:
“(None)” means the job is waiting in the queue without a specific reason code
This is common for large resource requests (e.g., 2 full nodes)
Check detailed job information:
scontrol show job JOBID
Check job priority and position in queue:
sprio -j JOBID
Check partition availability:
sinfo -p cn
Check available nodes:
sinfo -p cn -o '%N %t %C %m'
Common reasons for delay:
- No 2 full nodes available simultaneously (most common for full node requests)
- Other jobs ahead in queue with higher priority
- Account/QoS limits (check with sacctmgr show assoc user=$USER)
- Partition limits or maintenance
Solutions:
- Wait for resources to become available
- Reduce resource request (fewer nodes, fewer tasks per node)
- Check with cluster administrators about account limits
- Consider using fewer nodes or reducing memory request if possible
Python version compatibility issues:
Python 3.11 is the current optimal version for ML/AI workloads on Discoverer
If Python version errors are encountered, recreate the virtual environment with Python 3.11:
# Remove the existing environment
conda env remove --prefix ${VIRTUAL_ENV} -y
# Create a new environment with Python 3.11 (the current optimal version for ML/AI workloads on Discoverer:
# TensorFlow, PyTorch, Horovod, and related packages have the best compatibility with Python 3.11;
# in the future they might become compatible with higher Python versions)
conda create --prefix ${VIRTUAL_ENV} python=3.11 --solver=libmamba -y
# Then install packages again (install pytorch-cpu, horovod, and torchvision together to ensure Open MPI and mpi4py)
conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba
Dataset location issues:
- The MNIST dataset should already be available at /opt/software/test/pytorch/mnist_data
- If the dataset is not found, verify the path is correct and accessible from the compute nodes
- The SLURM script sets DATA_DIR to this location automatically
Out of memory:
- Reduce batch_size in the training script (currently 64)
- Reduce the number of workers in the DataLoader (num_workers=2)
Slow training:
- Increase batch_size if memory allows
- Reduce the number of epochs for testing
- Check that OMP_NUM_THREADS is set correctly
Training produces NaN loss values:
- Cause: The learning rate is too high when scaled linearly with a large number of ranks
- With 256+ ranks, linear scaling (base_lr × num_ranks) creates extremely high learning rates
- Example: 0.01 × 2048 = 20.48 (way too high, causes divergence)
- Solution: The training script now uses square root scaling for large numbers of ranks
- For ≤32 ranks: linear scaling (base_lr × num_ranks)
- For >32 ranks: square root scaling (base_lr × sqrt(num_ranks))
- This prevents NaN values while still benefiting from distributed training
- If NaN values still appear, try reducing base_lr further (e.g., 0.001 instead of 0.01) - see the sketch below
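The scaling rule can be written as a small helper. The 32-rank threshold comes from the description above; treat the function name as an assumption rather than part of the example script:
import math
import horovod.torch as hvd

def scale_learning_rate(base_lr, num_ranks):
    # Linear scaling is safe for small rank counts; square root scaling
    # keeps the effective learning rate reasonable for large rank counts
    if num_ranks <= 32:
        return base_lr * num_ranks
    return base_lr * math.sqrt(num_ranks)

hvd.init()
lr = scale_learning_rate(base_lr=0.01, num_ranks=hvd.size())
# e.g. 256 ranks: 0.01 * sqrt(256) = 0.16, matching the value reported in the example output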
MPI errors:
- Open MPI is installed as a dependency in the virtual environment when Horovod is installed (together with PyTorch) - no separate MPI module needs to be loaded
- Use srun --mpi=pmix to launch jobs - srun automatically uses all allocated tasks (no need for -np $SLURM_NTASKS)
- Verify that MPI libraries are available: which mpirun (should be in the virtual environment, though srun is preferred)
- Check that Horovod was installed correctly: conda list horovod
- Ensure the virtual environment has horovod and mpi4py installed
- Important: Install PyTorch, Horovod, and torchvision together (conda install --prefix ${VIRTUAL_ENV} pytorch-cpu nomkl horovod torchvision --solver=libmamba) to ensure Open MPI is used and to bring in mpi4py as a dependency
Adapting for custom datasets
To use a custom dataset:
Replace the dataset loading section:
# Instead of datasets.MNIST, use a custom dataset
from torch.utils.data import Dataset

class YourDataset(Dataset):
    def __init__(self, data_path, transform=None):
        # Load data here
        pass

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return (image, label) tuple
        return image, label

train_dataset = YourDataset(data_path='/path/to/data', transform=transform)
Adjust model architecture:
- Modify the MNIST_CNN class to match the input dimensions and number of classes
- Update the normalization values in the transforms if needed
Update data directory:
- For the example MNIST training, the dataset is at /opt/software/test/pytorch/mnist_data
- For custom datasets, store them in /valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/
- Update the DATA_DIR environment variable in the SLURM script to point to the dataset location
- Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
- Never store data in the home directory - always use /valhalla/projects/<your_slurm_project_account_name>/
Scaling analysis: Why different node counts require different numbers of epochs
In distributed training with Horovod, the training dataset is sharded (divided) across all worker processes (ranks). Each rank processes only its assigned subset of data during each epoch.
Data sharding formula:
Samples per rank per epoch = Total training samples / Number of ranks
As the number of nodes (and therefore ranks) increases:
- Fewer samples per rank: Each rank sees progressively fewer training samples per epoch
- 2 nodes (256 ranks): 60,000 / 256 = 234.4 samples per rank
- 4 nodes (512 ranks): 60,000 / 512 = 117.2 samples per rank
- 8 nodes (1024 ranks): 60,000 / 1024 = 58.6 samples per rank
- 16 nodes (2048 ranks): 60,000 / 2048 = 29.3 samples per rank
- Reduced effective batch size per rank: With fewer samples, batch sizes must be reduced to ensure multiple batches per epoch
- More ranks → smaller batch sizes (64 → 32 → 16)
- More epochs needed for convergence: With fewer samples per rank, each rank needs more epochs to:
- See enough diverse data to learn meaningful patterns
- Accumulate sufficient gradient updates for convergence
- Overcome the statistical noise from limited data per rank
- Learning rate scaling: Learning rates are scaled more conservatively with more ranks to prevent training instability
Explanation: 16 nodes required more epochs than 8 nodes
With 16 nodes (2048 ranks), each rank sees only 29.3 samples per epoch. This extremely limited data per rank means:
- Each rank sees very little data diversity per epoch
- Gradient estimates become noisier, requiring more iterations to converge
- More communication overhead with more ranks can introduce slight inconsistencies
- Beyond a certain point, adding more ranks does not linearly improve convergence speed due to the data limitation
The following table shows PyTorch training results across different node configurations:
| Metric | 2 Nodes (256 ranks) | 4 Nodes (512 ranks) | 8 Nodes (1024 ranks) | 16 Nodes (2048 ranks) |
|---|---|---|---|---|
| Total Ranks | 256 | 512 | 1024 | 2048 |
| Samples per Rank | 234.4 | 117.2 | 58.6 | 29.3 |
| Batch Size | 64 | 32 | 16 | 16 |
| Learning Rate | 0.016000 | 0.012126 | 0.005657 | 0.006727 |
| LR Scaling | sqrt (0.5) | conservative (0.4) | conservative (0.25) | conservative (0.25) |
| Max Epochs | 20 | 25 | 30 | 30 |
| Epochs to 98% | 17 | 13 | 13 | 28 |
| Final Accuracy | 98.14% | 98.01% | 98.11% | 98.12% |
| Total Training Time | 71.44s | 52.56s | 52.80s | 94.36s |
| Time per Epoch | 4.20s | 4.04s | 4.06s | 3.37s |
| Early Stopping | Yes | Yes | Yes | Yes |
Highlights:
- Training time efficiency: By total training time, 4 nodes (512 MPI ranks) is optimal for this workload, reaching 98% accuracy in 52.56 seconds. 8 nodes (1024 ranks) is nearly as efficient at 52.80 seconds, while 2 nodes requires 71.44 seconds and 16 nodes requires 94.36 seconds.
- Convergence patterns: 4 and 8 nodes show fast convergence (13 epochs), while 16 nodes requires 28 epochs due to extremely limited data per rank
- Learning rate scaling: Becomes more conservative as ranks increase to prevent training instability
- Batch size adaptation: Automatically reduced with more ranks to ensure sufficient batches per epoch
Recommendations:
- Optimal node count for this workload: 4-8 nodes provide the best balance of speed and resource allocation
- Avoid over-scaling: Beyond 8 nodes, the benefits diminish due to data sharding limitations
- Early stopping is effective: All runs successfully used early stopping at 98%, saving computation time
- Adaptive epoch scaling: The script scales max epochs based on the rank count, allowing sufficient training time at larger scales (an illustrative sketch of such rank-dependent rules follows)
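The batch size, learning rate, and epoch adaptations summarised in the table can be approximated by rank-dependent rules like the ones below. These formulas are reconstructed from the table values, not copied from the training script, so treat them as illustrative assumptions:
import math

def adapt_for_ranks(num_ranks, base_lr=0.001, base_batch=64, base_epochs=20):
    # Smaller batches and more epochs as the data per rank shrinks,
    # and a progressively more conservative learning-rate exponent
    batch_size = max(16, base_batch // max(1, num_ranks // 256))       # 64 -> 32 -> 16 -> 16
    max_epochs = min(30, base_epochs + 5 * max(0, int(math.log2(max(1, num_ranks // 256)))))
    exponent = 0.5 if num_ranks <= 256 else 0.4 if num_ranks <= 512 else 0.25
    lr = base_lr * (num_ranks ** exponent)                             # sqrt -> conservative scaling
    return batch_size, lr, max_epochs

for ranks in (256, 512, 1024, 2048):
    bs, lr, ep = adapt_for_ranks(ranks)
    print(f"{ranks} ranks: batch_size={bs}, lr={lr:.6f}, max_epochs={ep}")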
Estimating optimal parallel rank for training jobs:
For each training job, estimate the optimal parallel rank (number of MPI ranks) by:
- Running multiple test runs with different node counts (e.g., 2, 4, 8, 16 nodes) using the same model architecture and training setup
- Comparing total training time, convergence speed (epochs to target accuracy), and resource allocation across different configurations
- Identifying the configuration that achieves the target accuracy in the shortest time while maintaining stable convergence
- Running production training jobs based on the estimated optimal configuration from these test runs
Note
The optimal configuration may vary depending on dataset size, model complexity, and target accuracy. Always validate the optimal rank count through empirical testing before running large-scale production jobs.
Complete working example: Distributed training on MNIST dataset with TensorFlow
This section provides a complete, working example of distributed training using TensorFlow/Keras with Horovod on the MNIST handwritten digit dataset. This example follows the same structure as the PyTorch example above but uses TensorFlow/Keras instead.
Example files: The complete training script and SLURM batch script are provided as separate, runnable files:
- train_mnist_horovod_tf.py - Complete Python training script for TensorFlow
- run_mnist_training_tf.sh - SLURM batch submission script for TensorFlow
Discoverer Storage Requirements:
- All outputs must be stored under /valhalla/projects/<your_slurm_project_account_name>/ (replace <your_slurm_project_account_name> with the SLURM account name)
- Never store data or other large files (like outputs) in the user's home directory (~/) - this violates the Discoverer CPU partition's file system utilisation best practices
- The MNIST dataset is already available locally at /opt/software/test/tf/mnist_data - no download is needed
- Outputs (checkpoints, logs) should go to /valhalla/projects/<your_slurm_project_account_name>/outputs/
- The SLURM script creates the output directory automatically
Overview¶
Dataset: MNIST, containing 70,000 grayscale images of handwritten digits (0-9):
- Training set: 60,000 images
- Test set: 10,000 images
- Image size: 28×28 pixels
- Data source: The MNIST dataset is already available locally and is automatically loaded from the local storage location
- Model: Simple convolutional neural network (CNN) suitable for CPU training
- Framework: TensorFlow/Keras with Horovod
- Training: Distributed across multiple nodes using MPI
Prerequisites:
- TensorFlow and Horovod must be installed together in the same conda install command to ensure CPU-only build
- If Horovod is installed before TensorFlow, it may install CUDA dependencies which are not needed on CPU-only clusters like Discoverer CPU partition
- Installation command: conda install --prefix ${VIRTUAL_ENV} tensorflow-cpu horovod nomkl --solver=libmamba
- See the TensorFlow installation section for complete installation instructions
Step 1: Prepare the working directory¶
On the Discoverer login node, create a directory for the training job and copy the example files:
# Replace <your_slurm_project_account_name> with the actual SLURM account name
mkdir -p /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
# Copy the example files (adjust paths as needed)
cp /path/to/train_mnist_horovod_tf.py .
cp /path/to/run_mnist_training_tf.sh .
Note
The example files (train_mnist_horovod_tf.py and run_mnist_training_tf.sh) are provided as separate, runnable scripts that can be copied and used directly.
Step 2: Get the training script¶
Copy the training script to the working directory:
# Copy the training script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp train_mnist_horovod_tf.py /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
The training script (train_mnist_horovod_tf.py) contains:
- CNN model definition using TensorFlow/Keras Sequential API
- Distributed data loading with TensorFlow Dataset API and Horovod sharding
- Training and validation loops with metric aggregation
- MNIST dataset loading from local storage
- Complete distributed training implementation with Horovod callbacks
The full script can be viewed in train_mnist_horovod_tf.py or modified as needed.
Step 3: Get the SLURM batch script¶
Copy the SLURM batch script to the working directory:
# Copy the SLURM script (replace <your_slurm_project_account_name> with the actual SLURM account name)
cp run_mnist_training_tf.sh /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf/
cd /valhalla/projects/<your_slurm_project_account_name>/distributed_training_example_tf
Before submitting, edit run_mnist_training_tf.sh and replace:
- <your_slurm_project_account_name> with the actual SLURM account name
- <your_qos> with the actual QoS name
- Verify the virtual environment path matches the setup (it should point to the TensorFlow virtual environment)
The SLURM script (run_mnist_training_tf.sh) contains:
- Resource allocation (2 nodes, 128 tasks per node, 256 total processes)
- Full node allocation: 128 tasks per node = one per core, 2 threads per task = uses both threads per core
- Module loading (only anaconda3 is required - Open MPI is already installed as a dependency in the virtual environment during the installation of Horovod)
- Virtual environment setup
- Job launch with srun --mpi=pmix (srun automatically uses all allocated tasks)
Understanding epochs in distributed training:
- Epochs are shared across all ranks, not per-rank
- With 256 ranks, training still uses the same number of epochs (e.g., 5 epochs)
- Each rank processes a different subset of data during each epoch
- All ranks work together on the same epochs simultaneously
- 256 epochs are not needed for 256 ranks - the number of epochs is independent of the number of ranks
Resource allocation for full node allocation:
- Discoverer nodes: 128 cores per node, 2 threads per core = 256 logical CPUs per node
- Recommended: --ntasks-per-node=128 (one rank per core)
- Recommended: --cpus-per-task=2 (use both threads per core)
- Memory: Adjust --mem based on the model size (e.g., 251G for a full node - the maximum available on Discoverer)
- Total ranks: 2 nodes × 128 tasks = 256 ranks
- Total threads: 256 ranks × 2 threads = 512 threads across 2 nodes
Step 4: Submit the job¶
# Edit the script first to set the account and QoS (if not already done)
# Then submit the job
sbatch run_mnist_training_tf.sh
Before submitting: Ensure run_mnist_training_tf.sh has been edited to set the SLURM account name and QoS, and verify the virtual environment path is correct (should point to the TensorFlow virtual environment, not PyTorch).
Step 5: Monitor the job¶
# Check job status
squeue --me
# View output (replace JOBID with the actual job ID)
# Note: These commands should only be executed when squeue --me shows the job is in running state
tail -f mnist_training_tf.JOBID.out
# View errors (if any)
tail -f mnist_training_tf.JOBID.err
Step 6: Check the results¶
When the job completes, check the output file for training progress and final metrics.
When checking the output file (mnist_training_tf.JOBID.out), the output should be similar to:
Job ID: 4060967
Number of nodes: 2
Tasks per node: 128
Total tasks: 256
CPUs per task: 2
Working directory: /valhalla/projects/your_account/ai_cpu
Data directory: /opt/software/test/tf/mnist_data
Output directory: /valhalla/projects/your_account/ai_cpu/outputs
Python: /valhalla/projects/your_account/virt_envs/tf/bin/python
Python version: Python 3.11.14
Starting training on 256 processes
Process 0 of 256
Loading MNIST dataset...
Data will be stored in: /opt/software/test/tf/mnist_data
Training dataset download completed
Using learning rate: 0.16 (base: 0.01, ranks: 256, scaling: sqrt)
Starting training for 5 epochs...
Epoch 1, Loss: 0.8803, Accuracy: 72.08%, Val Loss: 0.0985, Val Accuracy: 97.07%, Time: 19.89s
Epoch 2, Loss: 0.1597, Accuracy: 95.36%, Val Loss: 0.0540, Val Accuracy: 98.29%, Time: 19.45s
Epoch 3, Loss: 0.1004, Accuracy: 97.09%, Val Loss: 0.0488, Val Accuracy: 98.45%, Time: 19.32s
Epoch 4, Loss: 0.0834, Accuracy: 97.63%, Val Loss: 0.0505, Val Accuracy: 98.45%, Time: 19.28s
Epoch 5, Loss: 0.0726, Accuracy: 97.86%, Val Loss: 0.0322, Val Accuracy: 98.99%, Time: 19.21s
Training completed in 99.44 seconds
Average time per epoch: 19.89 seconds
Evaluating on test set...
Test Loss: 0.0322, Test Accuracy: 98.99%
Training job completed
Expected output characteristics:
- Job information header: Shows resource allocation and paths
- Process count: Should match --nodes × --ntasks-per-node (e.g., 2 × 128 = 256 for full allocation)
- Full allocation: With 128 tasks per node and 2 CPUs per task, all 256 logical CPUs per node are allocated
- Dataset download: “Training dataset download completed” confirms only rank 0 downloaded
- Training progress: Loss should decrease and accuracy should increase over epochs
- Final accuracy: Should reach ~98-99% for MNIST (this is expected performance)
- Completion message: “Training job completed” indicates successful finish
Note: Only rank 0 prints output to avoid duplicate messages. All 256 processes are working, but output is only visible from one process.
Differences from PyTorch example
The TensorFlow implementation differs from the PyTorch example in several ways:
- Data loading: TensorFlow uses tf.data.Dataset with .shard() for distributed data splitting, while PyTorch uses DistributedSampler
- Model definition: TensorFlow uses the Keras Sequential API, while PyTorch uses nn.Module
- Training loop: TensorFlow uses model.fit() with callbacks, while PyTorch uses manual training loops
- Metric aggregation: TensorFlow uses Horovod's MetricAverageCallback, while PyTorch uses hvd.allreduce() manually
- Dataset download: TensorFlow downloads to a .npz file, while PyTorch downloads to a directory structure
Troubleshooting
Most troubleshooting steps are the same as the PyTorch example. Additional TensorFlow-specific considerations:
TensorFlow dataset sharding:
- Ensure .shard(hvd.size(), hvd.rank()) is called on the dataset before batching
- Each MPI rank must process a different subset of data
Horovod callbacks:
- Always include BroadcastGlobalVariablesCallback(0) to synchronize initial model parameters
- Include MetricAverageCallback() to average metrics across MPI ranks
- Only MPI rank #0 should write checkpoints (use if hvd.rank() == 0:) - see the sketch below
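A minimal sketch of how these callbacks are typically wired into model.fit() with horovod.tensorflow.keras; the model and data here are placeholders rather than the example script's CNN and MNIST pipeline:
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Placeholder model and data; the example script builds a CNN and loads MNIST instead
model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28)),
                             tf.keras.layers.Dense(10, activation='softmax')])
x = tf.random.uniform((512, 28, 28))
y = tf.random.uniform((512,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).shard(hvd.size(), hvd.rank()).batch(64)

# Wrap the optimizer so gradients are averaged across ranks
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001))
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),  # rank 0's weights to all ranks
    hvd.callbacks.MetricAverageCallback(),              # average metrics across ranks
]
if hvd.rank() == 0:                                      # only rank 0 writes checkpoints
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoints/ckpt-{epoch:02d}.keras'))

model.fit(dataset, epochs=5, callbacks=callbacks, verbose=2 if hvd.rank() == 0 else 0)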
TensorFlow/Keras compatibility:
- Ensure TensorFlow and Horovod versions are compatible
- Check with: python -c "import tensorflow as tf; import horovod.tensorflow as hvd; print('TF:', tf.__version__, 'Horovod:', hvd.__version__)"
Memory issues:
- TensorFlow may use more memory than PyTorch for the same model
- Reduce batch_size if OOM errors are encountered
- Consider using tf.data.Dataset prefetching carefully
InfiniBand transport errors:
Error: “Transport retry count exceeded on mlx5_0:1/IB” is a known UCX/InfiniBand issue (UCX issue #5199)
This error typically occurs when there is excessive communication between MPI ranks across different nodes, indicating that the job has reached communication saturation and cannot utilize that number of nodes
Solution: Reduce the number of nodes in the job allocation
If reducing nodes is not feasible, set the UCX service level parameter as a workaround:
export UCX_IB_SL=0
This parameter sets the InfiniBand service level to 0, which can resolve transport retry issues
Note: export UCX_IB_SL=0 is already included in all SLURM batch scripts in this document.
Note: Disabling InfiniBand (using TCP/IP instead) is not recommended, as it significantly reduces performance for multi-node jobs.
Adapting for custom datasets
To use a custom dataset with TensorFlow:
Replace the dataset loading section:
# Instead of tf.keras.datasets.mnist.load_data(), use a custom dataset
import tensorflow as tf

# Load data
x_train, y_train = load_your_training_data()
x_test, y_test = load_your_test_data()

# Normalize and reshape as needed
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Create TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shard(hvd.size(), hvd.rank())  # Important for distributed training
train_dataset = train_dataset.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
Adjust model architecture:
- Modify the create_model() function to match the input dimensions and number of classes
- Update the normalization values if needed
Update data directory:
- Store the dataset in /valhalla/projects/<your_slurm_project_account_name>/data/dataset_name/
- Update the DATA_DIR environment variable in the SLURM script
- Ensure the path is accessible from all compute nodes (project directories are shared across nodes)
- Never store data in the home directory - always use /valhalla/projects/<your_slurm_project_account_name>/
Summary: Best practices for distributed training on Discoverer
- Use MPI - Avoid Gloo in multi-user HPC environments due to possible port conflict issues
- Proper combination of components:
- Horovod + MPI (simplest, works for both TensorFlow and PyTorch)
- PyTorch native DDP with MPI backend
- Launch methods:
  - With Horovod: srun --mpi=pmix python train.py (srun automatically uses all allocated tasks)
  - With PyTorch DDP: srun --mpi=pmix python train.py
- No MPI module needed for Horovod: Open MPI is installed as a dependency in the virtual environment when Horovod is installed together with PyTorch or TensorFlow - no separate module loading is required
References
- Discoverer HPC Resource Overview
- SLURM Documentation
- PyTorch Distributed Training Documentation
- TensorFlow Distributed Training Guide
- Consider checking Discoverer HPC documentation for:
- Available software modules
- Storage locations and quotas
- Job submission best practices
- Partition-specific configurations
Getting help
See Getting help