PyTorch

Important

Discoverer HPC provides public access to the version of PyTorch included in the Intel Distribution for Python.

See Python for more details about the benefits of running the Intel Distribution for Python.

Warning

No GPU accelerators are currently available on Discoverer HPC. PyTorch therefore runs in CPU-only mode, and any PyTorch or Torchvision functionality that requires CUDA/cuDNN cannot be used.

Versions supported

The version of PyTorch supported on Discoverer HPC is the one included in the Intel Distribution for Python, which is publicly available in the software repository. Running that version does not require setting up a virtual environment with Conda or pip.

Note

PyTorch comes with the Torchvision module installed.

Running PyTorch

Warning

Do not run PyTorch directly on the login node. Always run it as a Slurm batch job.

To load the PyTorch environment, load the module intel.universe.pytorch from within your Slurm batch script:

module load intel.universe.pytorch

Once loaded, that module provides access to the correct Python interpreter and the PyTorch 2 module. If you need to combine PyTorch 2 with specific Python packages that are not included in the distribution by default, you can create a virtual environment based on the same Python interpreter.
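The following sketch shows one way to build such a virtual environment on top of the module's interpreter (the location is illustrative; the `--system-site-packages` flag lets the venv inherit the bundled PyTorch and Torchvision):

```shell
# Assumed workflow: run after `module load intel.universe.pytorch`,
# so that `python3` below resolves to the module's interpreter.
VENV="$(mktemp -d)/torch-venv"   # illustrative location; use e.g. /discofs/$USER in practice

# Inherit the distribution's site-packages (PyTorch, Torchvision) in the venv
python3 -m venv --system-site-packages "$VENV"

# Activate it; extra packages can then be installed on top:
#   python3 -m pip install <extra-packages>
. "$VENV/bin/activate"
```

With `--system-site-packages`, `import torch` inside the venv still resolves to the distribution's build, while pip-installed additions live in the venv and leave the system installation untouched.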

Checking the version

The easiest way to check the versions of PyTorch and Torchvision available in the software repository is to execute the following Slurm batch script:

#!/bin/bash
#
#SBATCH --partition=cn         # Partition name (ask the support team to clarify it)
#SBATCH --job-name=torch_version
#SBATCH --time=00:01:00        # WallTime - one minute is more than enough here

#SBATCH --nodes           1    # May vary
#SBATCH --ntasks-per-node 1    # Must be 1
#SBATCH --cpus-per-task   1    # Must be 1

#SBATCH -o slurm.check_pytorch_version.out        # STDOUT
#SBATCH -e slurm.check_pytorch_version.err        # STDERR

module purge
module load intel.universe.pytorch

cd $SLURM_SUBMIT_DIR

python -c "import torch;print('Torch:',torch.__version__)"
python -c "import torchvision;print('Torchvision:',torchvision.__version__)"

To do that, store the script content in a file, for example /discofs/${USER}/check_pytorch_version.sbatch, and submit it as a job to the queue:

cd /discofs/${USER}
sbatch check_pytorch_version.sbatch

Then check the content of the file slurm.check_pytorch_version.out to find out which versions of PyTorch and Torchvision are reported there.

Thread control

The version of PyTorch that comes with the Intel Distribution for Python adopts the TBB threading model. To better understand how thread control can be applied to PyTorch, read the following document, keeping in mind that the PyTorch installation on Discoverer HPC is built against MKL-DNN:

CPU threading and TorchScript inference
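As an illustration (the helper name and fallback value below are this sketch's own, not part of the Discoverer documentation), the CPU count granted by Slurm can be read from the environment and handed to PyTorch's thread control:

```python
import os

def slurm_cpu_count(default=1):
    """Threads granted by Slurm via --cpus-per-task, with a fallback
    for runs outside a Slurm allocation."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))

# Inside the job script environment one would typically pin PyTorch to it:
#   import torch
#   torch.set_num_threads(slurm_cpu_count())
print("threads to use:", slurm_cpu_count())
```

Matching the PyTorch thread count to `--cpus-per-task` avoids oversubscribing the cores Slurm has actually allocated to the job.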

Slurm batch script (example)

Given below is an example of a Slurm batch script that runs a Python code invoking PyTorch:

#!/bin/bash
#
#SBATCH --partition=cn         # Partition name (ask the support team to clarify it)
#SBATCH --job-name=torch_run
#SBATCH --time=512:00:00       # WallTime - set it accordingly

#SBATCH --nodes           1    # May vary
#SBATCH --ntasks-per-node 1    # Must be 1 for non-MPI processes
#SBATCH --cpus-per-task   16   # See 'Thread control' above to understand what number
                               # to supply here instead of 16 (16 is an example). You may
                               # run a series of benchmarks varying that number until you
                               # reach an optimal speed.

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load intel.universe.pytorch

cd $SLURM_SUBMIT_DIR

python my_torch_based_code.py

where my_torch_based_code.py is your PyTorch-based Python code.
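For illustration, a minimal stand-in for my_torch_based_code.py (entirely hypothetical; your real workload goes here) performing a small CPU-only computation could look like:

```python
# Hypothetical my_torch_based_code.py: a CPU-only sanity computation.
import torch

torch.manual_seed(0)           # make the random tensors reproducible
x = torch.randn(64, 32)        # random input batch
w = torch.randn(32, 8)         # random weight matrix
y = torch.relu(x @ w)          # one linear transform + ReLU, on CPU
print("output shape:", tuple(y.shape))
```

Since no GPU accelerators are available on Discoverer HPC, all tensors stay on the CPU; no `.cuda()` or `device="cuda"` calls should appear in job code.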

Specify the parameters and resources required for successfully running and completing the job:

  • Slurm partition of compute nodes, based on your project resource reservation (--partition)
  • job name, under which the job will be seen in the queue (--job-name)
  • wall time for running the job (--time)
  • number of threads to use (--cpus-per-task) - see “Thread control” above

Save the complete Slurm job description as a file, for example /discofs/$USER/run_torch/torch.batch and submit it to the queue:

cd /discofs/$USER/run_torch
sbatch torch.batch

Follow the information stored by the running job in slurm.%j.out and slurm.%j.err, where %j stands for the actual ID number assigned to the job in the queue (you will get that number upon submission).

Getting help

See Getting help