PyTorch¶
Important
Discoverer HPC provides public access to the version of PyTorch included in the Intel Distribution for Python.
See Python for more details about the benefits of running the Intel Distribution for Python.
Warning
No GPU accelerators are currently available on Discoverer HPC. Torchvision functionality that depends on a working CUDA/cuDNN stack cannot be run.
Versions supported¶
The version of PyTorch supported on Discoverer HPC is the one shipped with the Intel Distribution for Python, which is publicly available in the software repository. Running that version does not require setting up a virtual environment with Conda or pip.
Note
PyTorch comes with the Torchvision module installed.
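To verify that the corresponding environment module is present in the repository, you can list it on the login node (a quick query that does not run PyTorch itself):
module avail intel.universe.pytorch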
Running PyTorch¶
Warning
Do not run PyTorch directly on the login node. Always run it as a Slurm batch job.
To load the PyTorch environment, load the module intel.universe.pytorch
from within your Slurm batch script:
module load intel.universe.pytorch
Once loaded, that module provides access to the correct Python interpreter and the PyTorch 2 package. In case you need to combine PyTorch 2 with specific modules that are not included in the distribution by default, you can create a virtual environment based on the same Python interpreter.
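For example, a minimal sketch of creating and using such an environment (the location /discofs/${USER}/pytorch-venv and the package name are illustrative only; --system-site-packages keeps the PyTorch installation provided by the module visible inside the environment):
module load intel.universe.pytorch
python -m venv --system-site-packages /discofs/${USER}/pytorch-venv
source /discofs/${USER}/pytorch-venv/bin/activate
pip install scikit-learn   # illustrative package name - install what your code actually needs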
Checking the version¶
The easiest way to check the versions of PyTorch and Torchvision available in the software repository is to execute the following Slurm batch script:
#!/bin/bash
#
#SBATCH --partition=cn # Partition name (ask the support team to clarify it)
#SBATCH --job-name=torch_version
#SBATCH --time=00:01:00 # WallTime - one minute is more than enough here
#SBATCH --nodes 1 # May vary
#SBATCH --ntasks-per-node 1 # Must be 1
#SBATCH --cpus-per-task 1 # Must be 1
#SBATCH -o slurm.check_pytorch_version.out # STDOUT
#SBATCH -e slurm.check_pytorch_version.err # STDERR
module purge
module load intel.universe.pytorch
cd $SLURM_SUBMIT_DIR
python -c "import torch;print('Torch:',torch.version.__version__)"
python -c "import torchvision;print('Torchvision:',torchvision.version.__version__)"
To achieve that, store the script content into a file, for example /discofs/${USER}/check_pytorch_version.sbatch,
and submit it as a job to the queue:
cd /discofs/${USER}
sbatch check_pytorch_version.sbatch
Then check the content of the file slurm.check_pytorch_version.out
to find out which versions of PyTorch and Torchvision are reported there.
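For example:
cat slurm.check_pytorch_version.out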
Thread control¶
The version of PyTorch that comes with the Intel Distribution for Python adopts the TBB threading model. To better understand how thread control can be imposed on PyTorch, consult the PyTorch documentation on CPU threading, keeping in mind that the PyTorch installation on Discoverer HPC is built against MKL-DNN.
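For example, a minimal sketch of imposing thread control from within a Slurm batch script (assuming the standard OpenMP/MKL environment variables, which an MKL-DNN build is normally expected to honor; verify the effect with your own benchmarks):
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # size of the intra-op thread pool
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # threads used by MKL kernels
python -c "import torch; print('Intra-op threads:', torch.get_num_threads())"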
Slurm batch script (example)¶
Given below is an example of a Slurm batch script that runs a Python code invoking PyTorch:
#!/bin/bash
#
#SBATCH --partition=cn # Partition name (ask the support team to clarify it)
#SBATCH --job-name=torch_run
#SBATCH --time=512:00:00 # WallTime - set it accordingly
#SBATCH --nodes 1 # May vary
#SBATCH --ntasks-per-node 1 # Must be 1 for non-MPI processes
#SBATCH --cpus-per-task 16 # See 'Thread control' above to understand what number
                           # to supply here instead of 16 (16 is an example). You may
                           # run a series of benchmarks, varying that number until you
                           # reach an optimal speed.
#SBATCH -o slurm.%j.out # STDOUT
#SBATCH -e slurm.%j.err # STDERR
module purge
module load intel.universe.pytorch
cd $SLURM_SUBMIT_DIR
python my_torch_based_code.py
where my_torch_based_code.py
is your PyTorch-based Python code.
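For illustration only, here is one way to create a minimal CPU-only script that could stand in for my_torch_based_code.py (a hypothetical example, not part of the distribution):
cat > my_torch_based_code.py <<'EOF'
import torch

# Minimal CPU-only sanity check: multiply two random matrices.
x = torch.rand(2048, 2048)
y = x @ x.t()
print('PyTorch', torch.__version__, '- result shape:', tuple(y.shape))
EOF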
Specify the parameters and resources required for successfully running and completing the job:
- Slurm partition of compute nodes, based on your project resource reservation (--partition)
- job name, under which the job will be seen in the queue (--job-name)
- wall time for running the job (--time)
- number of threads to use (--cpus-per-task) - see "Thread control" above
Save the complete Slurm job description as a file, for example /discofs/$USER/run_torch/torch.batch
and submit it to the queue:
cd /discofs/$USER/run_torch
sbatch torch.batch
Follow the information written by the running job into slurm.%j.out and slurm.%j.err, where %j stands for the actual ID number assigned to the job in the queue (you will get that number upon submission).
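For example, assuming the job was assigned ID 123456 (an illustrative number), you can watch the queue and follow the output live:
squeue -u $USER            # check the state of your jobs
tail -f slurm.123456.out   # follow STDOUT while the job is running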
Getting help¶
See Getting help