PyTorch (GPU)

About

This document shows how to install and use PyTorch in a Python virtual environment on the Discoverer+ GPU cluster. Note that the method used does not lock the shell environment into the virtual environment.

The guide covers the complete workflow from creating a conda environment to running PyTorch jobs, ensuring that users can overcome common Slurm configuration challenges and successfully utilize the GPU resources available on the cluster.

Use Conda to install PyTorch with NVIDIA CUDA support on the Discoverer+ GPU cluster

Note that we need to use a Python version that is appropriate for the latest stable PyTorch release. In our case, that is 3.11. While Python 3.13 and 3.14 are available, PyTorch doesn’t have full support for these newer versions yet, and we cannot rely on bleeding-edge technology for running production jobs on HPC systems.

Here we use a Slurm interactive session bound to the project’s Slurm account, but on a CPU-only basis, so no GPU resources from the account will be spent. This is supported by the QoS named “2cpu-single-host”.

Start an interactive Bash session on one of the compute nodes (this implies invoking the srun tool). The example below creates an interactive Bash session that will last up to 30 minutes:

srun -N 1 -n 2 --partition=common \
     --account=your_slurm_project_account_name \
     --qos 2cpu-single-host --time=00:30:00 --pty /bin/bash

Wait for the session to start. Only then follow the instructions given below.

We will use the running session to create a Python virtual environment and install PyTorch with CUDA support therein. That means all commands given below must be executed within that same interactive Bash session. Do not execute them directly on the login node.

module load anaconda3
module load nvidia/cuda/12/12.8
conda create \
     --prefix /valhalla/projects/your_slurm_project_account_name/pytorch_env/ \
     python=3.11
conda install \
     --prefix /valhalla/projects/your_slurm_project_account_name/pytorch_env/ \
     pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

Of course, you need to type “y” whenever Conda asks you to confirm the installation of packages.

In case of success (no errors displayed), you will obtain a Python 3.11 virtual environment with the latest PyTorch with CUDA support. That environment will be located in the following folder:

/valhalla/projects/your_slurm_project_account_name/pytorch_env/

You can test the integrity of the installation in that same interactive Bash session (or another interactive session):

/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torch; print('PyTorch version:', torch.__version__)"
/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torchvision; print('Torchvision git version:', torchvision.version.git_version)"
/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torchvision; print('Torchvision CUDA version:', torchvision.version.cuda)"

You should get results like these:

PyTorch version: 2.5.1
Torchvision git version: 3ac97aa9120137381ed1060f37237e44485ac2aa
Torchvision CUDA version: 12.1
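
If you prefer a single check, you can collect similar probes in a small script and run it with the environment’s interpreter. The file name check_install.py used below is just an example, not something shipped with the cluster software:

import torch
import torchvision

# Report the package versions and the CUDA builds they were compiled against
print("PyTorch version:", torch.__version__)
print("PyTorch CUDA build:", torch.version.cuda)
print("Torchvision version:", torchvision.__version__)
print("Torchvision CUDA build:", torchvision.version.cuda)

Run it as /valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python check_install.py.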

Now you can press Ctrl-D to terminate the interactive Bash session controlled by Slurm. Otherwise, you may leave the session open, but Slurm will terminate it once the requested 30 minutes elapse.

Running PyTorch on Discoverer+

Once the installation has completed successfully as explained above, PyTorch can be used either through a Slurm batch job or interactively via srun. In both cases the job must use the default QoS of the Slurm account, which here is the QoS named “your_slurm_project_account_name”. Otherwise PyTorch and Torchvision will not be able to access the GPU devices on the compute nodes.

Running PyTorch interactively

This is not the recommended way of running PyTorch. Use this example for quick checks only!

For testing purposes, we need a Python helper script that can be downloaded from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py

To download the code:

cd /valhalla/projects/your_slurm_project_account_name/
wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py
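
The helper script reports the PyTorch and Torchvision versions, enumerates the visible GPUs and runs a few allocation and performance tests. For orientation only, a minimal sketch of that kind of GPU probe (this is not the actual pytorch_gpu_detection.py) could look like this:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Number of GPUs:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, "
              f"{props.total_memory / 1024**3:.2f} GB, "
              f"compute capability {props.major}.{props.minor}")
    # Quick sanity check: allocate a tensor and multiply it on the first GPU
    x = torch.randn(1000, 1000, device="cuda:0")
    y = x @ x
    print("Matrix multiplication OK, result shape:", tuple(y.shape))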

In the example below we request the utilization of 2 GPUs (--gres=gpu:2):

srun -N 1 -n 2 --gres=gpu:2 \
   --partition=common \
   --account=your_slurm_project_account_name \
   --qos your_slurm_project_account_name \
   --time=00:30:00 --pty /bin/bash

Once the interactive session has started, we need to load the CUDA library, point the shell to the virtual environment, and run the test Python script that calls PyTorch:

module load nvidia/cuda/12/12.8
export PATH="/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/pytorch_env"
python /valhalla/projects/your_slurm_project_account_name/pytorch_gpu_detection.py

In case of successful execution, the following result will be displayed:

============================================================
 PyTorch & Torchvision GPU Detection Script
============================================================
PyTorch version: 2.5.1
Torchvision version: 0.20.1
CUDA available: True
CUDA version: 12.1
cuDNN version: 90100
Number of GPUs: 2

============================================================
 GPU Details
============================================================

GPU 0:
  Name: NVIDIA H200
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0
  Multiprocessors: 132
  Warp Size: 32
  Available attributes: ['L2_cache_size', 'gcnArchName',
                        'is_integrated', 'is_multi_gpu_board',
                        'major', 'max_threads_per_multi_processor',
                        'minor', 'multi_processor_count',
                        'name', 'regs_per_multiprocessor',
                        'total_memory', 'uuid', 'warp_size']

GPU 1:
  Name: NVIDIA H200
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0
  Multiprocessors: 132
  Warp Size: 32
  Available attributes: ['L2_cache_size', 'gcnArchName',
                         'is_integrated', 'is_multi_gpu_board',
                         'major', 'max_threads_per_multi_processor',
                         'minor', 'multi_processor_count', 'name',
                         'regs_per_multiprocessor', 'total_memory',
                         'uuid', 'warp_size']

============================================================
 GPU Memory Test
============================================================
Current GPU: 0
Successfully allocated test tensor on GPU 0
Tensor shape: torch.Size([1000, 1000])
Tensor device: cuda:0
Tensor dtype: torch.float32
Successfully performed matrix multiplication on GPU 0
Result shape: torch.Size([1000, 1000])
Memory cleaned up successfully

============================================================
 GPU Performance Test
============================================================
Running simple performance test on GPU 0...
Matrix multiplication (5000x5000) completed in 0.0076 seconds
Result shape: torch.Size([5000, 5000])

============================================================
 Torchvision GPU Test
============================================================
Testing Torchvision on GPU 0...
Created test image tensor on GPU: torch.Size([1, 3, 224, 224])
Applied normalization transform: torch.Size([1, 3, 224, 224])
Loaded ResNet18 model and moved to GPU 0
Model forward pass successful: torch.Size([1, 1000])
Torchvision datasets module loaded successfully
[SUCCESS] Torchvision GPU test completed successfully!

============================================================
 Environment Information
============================================================
Python version: 3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]
Platform: linux
Current working directory: /home/tfraunholz
CUDA_HOME: /usr/local/cuda-12.8
CUDA_PATH: /usr/local/cuda-12.8
LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64

============================================================
 Script Complete
============================================================
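
Before moving on to batch jobs, you may run one more quick, optional probe from the interactive session (a sketch, not part of the helper script) to confirm that both requested GPUs are exposed by Slurm and visible to PyTorch:

import os
import torch

# Slurm normally exposes the GPUs granted via --gres through CUDA_VISIBLE_DEVICES
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())

With --gres=gpu:2 the reported number of visible GPUs should be 2.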

Running PyTorch within a Slurm batch script

Create the following Slurm batch script:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=test_pytorch
#SBATCH --time=00:30:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

#SBATCH --account=your_slurm_project_account_name
#SBATCH --qos=your_slurm_project_account_name

#SBATCH -o test_pytorch.%j.out
#SBATCH -e test_pytorch.%j.err

export PATH="/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/pytorch_env"

module load nvidia/cuda/12/12.8

cd $SLURM_SUBMIT_DIR

python /valhalla/projects/your_slurm_project_account_name/pytorch_gpu_detection.py

and save it as /valhalla/projects/your_slurm_project_account_name/test_pytorch.sbatch.

If you don’t find pytorch_gpu_detection.py, download it from here:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py

To submit the job to the queue:

sbatch /valhalla/projects/your_slurm_project_account_name/test_pytorch.sbatch

Once successfully submitted, you can check if the job is running by executing:

squeue --me

If the job is running at the moment, information about its execution will be presented as:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1980    common test_pyt username  R       0:06      1 dgx1

The execution of the job creates two files in the current directory: one capturing the standard output and another collecting the standard error messages:

test_pytorch.1980.err
test_pytorch.1980.out

Here 1980 is the job ID; the number will be different in your case.

The file test_pytorch.1980.out will contain the results (they should match those reported for the interactive execution).
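
Once the test job completes successfully, you can replace the call to pytorch_gpu_detection.py in the batch script with your own workload. The following is a minimal, hypothetical workload sketch (the file name my_workload.py is arbitrary) that simply verifies the allocated GPU is actually used:

import torch

# Use the GPU when available; a GPU job on Discoverer+ should always report cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# A toy forward/backward pass standing in for a real model and training loop
model = torch.nn.Linear(1024, 1024).to(device)
data = torch.randn(64, 1024, device=device)
loss = model(data).sum()
loss.backward()
print("Forward/backward pass completed, loss:", loss.item())

In the batch script, only the last line needs to change, for example to python /valhalla/projects/your_slurm_project_account_name/my_workload.py.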