PyTorch (GPU)

About

This document shows how to install and use PyTorch in a Python virtual environment on the Discoverer+ GPU cluster. Note that the method used does not lock the shell environment into the virtual environment.

The guide covers the complete workflow from creating a conda environment to running PyTorch jobs, ensuring that users can overcome common Slurm configuration challenges and successfully utilize the GPU resources available on the cluster.

Use Conda to install PyTorch with NVIDIA CUDA support on the Discoverer+ GPU cluster

Note that we need to use a Python version that is appropriate for the latest stable PyTorch release. In our case, that is 3.11. While Python 3.13 and 3.14 are available, PyTorch doesn’t have full support for these newer versions yet, and we cannot rely on bleeding-edge technology for running production jobs on HPC systems.

Here we use a Slurm interactive session bound to the project’s Slurm account, but on a CPU-only basis, so no GPU resources from the account will be spent. This is supported by the QoS named “2cpu-single-host”.

Start an interactive Bash session on one of the compute nodes (this implies invoking the srun tool). The example below creates an interactive Bash session that will last up to 30 minutes:

srun -N 1 -n 2 --partition=common \
     --account=your_slurm_project_account_name \
     --qos 2cpu-single-host --time=00:30:00 --pty /bin/bash

Wait for the session to start. Only then follow the instructions given below.

We will use the running session to create a Python virtual environment and install PyTorch with CUDA support therein. That means all commands given below must be executed within that same interactive Bash session. Do not execute them directly on the login node.

module load anaconda3
module load nvidia/cuda/12/12.8
conda create \
     --prefix /valhalla/projects/your_slurm_project_account_name/pytorch_env/ \
     python=3.11
conda install \
     --prefix /valhalla/projects/your_slurm_project_account_name/pytorch_env/ \
     pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

Of course, you need to type “y” whenever Conda asks you to confirm the installation of packages.

In case of success (no errors displayed), you will obtain a Python 3.11 virtual environment with the latest PyTorch with CUDA support. That environment will be located in the following folder:

/valhalla/projects/your_slurm_project_account_name/pytorch_env/

You can test the integrity of the installation in that same interactive Bash session (or another interactive session):

/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torch; print('PyTorch version:', torch.__version__)"
/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torchvision; print('Torchvision git version:', torchvision.version.git_version)"
/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python \
    -c "import torchvision; print('Torchvision CUDA version:', torchvision.version.cuda)"

You should get results like these:

PyTorch version: 2.5.1
Torchvision git version: 3ac97aa9120137381ed1060f37237e44485ac2aa
Torchvision CUDA version: 12.1
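
If you prefer a single check, you can collect similar probes in a small script and run it with the environment’s interpreter. The file name check_install.py used below is just an example, not something shipped with the cluster software:

import torch
import torchvision

# Report the package versions and the CUDA builds they were compiled against
print("PyTorch version:", torch.__version__)
print("PyTorch CUDA build:", torch.version.cuda)
print("Torchvision version:", torchvision.__version__)
print("Torchvision CUDA build:", torchvision.version.cuda)

Run it as /valhalla/projects/your_slurm_project_account_name/pytorch_env/bin/python check_install.py.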

Now you can press Ctrl-D to terminate the interactive Bash session controlled by Slurm. Otherwise, you may leave the session open, but Slurm will terminate it once the requested 30 minutes elapse.

Running PyTorch on Discoverer+

Once the installation has completed successfully as explained above, PyTorch can be used either through a Slurm batch job or interactively via srun. In both cases the job must use the default QoS of the Slurm account, which here is the QoS named “your_slurm_project_account_name”. Otherwise PyTorch and Torchvision will not be able to access the GPU devices on the compute nodes.

Running PyTorch interactively

This is not the recommended way of running PyTorch. Use this example for quick checks only!

For testing purposes, we need a Python helper script that can be downloaded from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py

To download the code:

cd /valhalla/projects/your_slurm_project_account_name/
wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py
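
The helper script reports the PyTorch and Torchvision versions, enumerates the visible GPUs and runs a few allocation and performance tests. For orientation only, a minimal sketch of that kind of GPU probe (this is not the actual pytorch_gpu_detection.py) could look like this:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Number of GPUs:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, "
              f"{props.total_memory / 1024**3:.2f} GB, "
              f"compute capability {props.major}.{props.minor}")
    # Quick sanity check: allocate a tensor and multiply it on the first GPU
    x = torch.randn(1000, 1000, device="cuda:0")
    y = x @ x
    print("Matrix multiplication OK, result shape:", tuple(y.shape))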

In the example below we request the utilization of 2 GPUs (--gres=gpu:2):

srun -N 1 -n 2 --gres=gpu:2 \
   --partition=common \
   --account=your_slurm_project_account_name \
   --qos your_slurm_project_account_name \
   --time=00:30:00 --pty /bin/bash

Once the interactive session has started, we need to load the CUDA library, point the shell to the virtual environment, and run the test Python script that calls PyTorch:

module load nvidia/cuda/12/12.8
export PATH="/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/pytorch_env"
python /valhalla/projects/your_slurm_project_account_name/pytorch_gpu_detection.py

In case of successful execution, the following result will be displayed:

============================================================
 PyTorch & Torchvision GPU Detection Script
============================================================
PyTorch version: 2.5.1
Torchvision version: 0.20.1
CUDA available: True
CUDA version: 12.1
cuDNN version: 90100
Number of GPUs: 2

============================================================
 GPU Details
============================================================

GPU 0:
  Name: NVIDIA H200
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0
  Multiprocessors: 132
  Warp Size: 32
  Available attributes: ['L2_cache_size', 'gcnArchName',
                        'is_integrated', 'is_multi_gpu_board',
                        'major', 'max_threads_per_multi_processor',
                        'minor', 'multi_processor_count',
                        'name', 'regs_per_multiprocessor',
                        'total_memory', 'uuid', 'warp_size']

GPU 1:
  Name: NVIDIA H200
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0
  Multiprocessors: 132
  Warp Size: 32
  Available attributes: ['L2_cache_size', 'gcnArchName',
                         'is_integrated', 'is_multi_gpu_board',
                         'major', 'max_threads_per_multi_processor',
                         'minor', 'multi_processor_count', 'name',
                         'regs_per_multiprocessor', 'total_memory',
                         'uuid', 'warp_size']

============================================================
 GPU Memory Test
============================================================
Current GPU: 0
Successfully allocated test tensor on GPU 0
Tensor shape: torch.Size([1000, 1000])
Tensor device: cuda:0
Tensor dtype: torch.float32
Successfully performed matrix multiplication on GPU 0
Result shape: torch.Size([1000, 1000])
Memory cleaned up successfully

============================================================
 GPU Performance Test
============================================================
Running simple performance test on GPU 0...
Matrix multiplication (5000x5000) completed in 0.0076 seconds
Result shape: torch.Size([5000, 5000])

============================================================
 Torchvision GPU Test
============================================================
Testing Torchvision on GPU 0...
Created test image tensor on GPU: torch.Size([1, 3, 224, 224])
Applied normalization transform: torch.Size([1, 3, 224, 224])
Loaded ResNet18 model and moved to GPU 0
Model forward pass successful: torch.Size([1, 1000])
Torchvision datasets module loaded successfully
[SUCCESS] Torchvision GPU test completed successfully!

============================================================
 Environment Information
============================================================
Python version: 3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]
Platform: linux
Current working directory: /home/tfraunholz
CUDA_HOME: /usr/local/cuda-12.8
CUDA_PATH: /usr/local/cuda-12.8
LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64

============================================================
 Script Complete
============================================================
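
Before moving on to batch jobs, you may run one more quick, optional probe from the interactive session (a sketch, not part of the helper script) to confirm that both requested GPUs are exposed by Slurm and visible to PyTorch:

import os
import torch

# Slurm normally exposes the GPUs granted via --gres through CUDA_VISIBLE_DEVICES
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())

With --gres=gpu:2 the reported number of visible GPUs should be 2.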

Running PyTorch within a Slurm batch script

Create the following Slurm batch script:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=test_pytorch
#SBATCH --time=00:30:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

#SBATCH --account=your_slurm_project_account_name
#SBATCH --qos=your_slurm_project_account_name

#SBATCH -o test_pytorch.%j.out
#SBATCH -e test_pytorch.%j.err

export PATH="/valhalla/projects/your_slurm_project_account_name/pytorch_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/pytorch_env"

module load nvidia/cuda/12/12.8

cd $SLURM_SUBMIT_DIR

python /valhalla/projects/your_slurm_project_account_name/pytorch_gpu_detection.py

and save it as /valhalla/projects/your_slurm_project_account_name/test_pytorch.sbatch.

If you don’t find pytorch_gpu_detection.py, download it from here:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/pytorch_gpu_detection.py

To submit the job to the queue:

sbatch /valhalla/projects/your_slurm_project_account_name/test_pytorch.sbatch

Once successfully submitted, you can check if the job is running by executing:

squeue --me

If the job is running at the moment, information about its execution will be presented as:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1980    common test_pyt username  R       0:06      1 dgx1

The execution of the job creates two files in the current directory: one capturing the standard output and another collecting the standard error messages:

test_pytorch.1980.err
test_pytorch.1980.out

Here 1980 is the job ID; the number will be different in your case.

The file test_pytorch.1980.out will contain the results (they should match those reported for the interactive execution).
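
Once the test job completes successfully, you can replace the call to pytorch_gpu_detection.py in the batch script with your own workload. The following is a minimal, hypothetical workload sketch (the file name my_workload.py is arbitrary) that simply verifies the allocated GPU is actually used:

import torch

# Use the GPU when available; a GPU job on Discoverer+ should always report cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# A toy forward/backward pass standing in for a real model and training loop
model = torch.nn.Linear(1024, 1024).to(device)
data = torch.randn(64, 1024, device=device)
loss = model(data).sum()
loss.backward()
print("Forward/backward pass completed, loss:", loss.item())

In the batch script, only the last line needs to change, for example to python /valhalla/projects/your_slurm_project_account_name/my_workload.py.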