TensorFlow (GPU)¶

About ¶

This document shows how to install and use TensorFlow with GPU support in a Python virtual environment on the Discoverer+ GPU cluster. Note that the method used does not lock the shell environment into the Python virtual environment. Therefore, no Conda artefacts are left in the Bash profile of the user.

Important

You can read the document Python virtual environments (GPU) for more information about why we recommend the use of Python virtual environments on Discoverer+.

Note

In most cases, installing, running, and version controlling TensorFlow on the host system using Python virtual environments is as effective as using containers, and sometimes even more so. Python virtual environments offer a lightweight alternative to containers without the overhead of a container runtime, while still providing isolation for different package versions and dependencies.

Warning

We do not recommend installing TensorFlow and PyTorch in the same Python virtual environment, because of possible package version conflicts that may not be easily resolved.

The best practice is to create a separate Python virtual environment to host the TensorFlow installation. That same virtual environment can be later used to run TensorFlow jobs on the cluster using Slurm.

Prerequisites ¶

To proceed with the installation, you need to have:

A Discoverer+ user account and a project account (e.g. ehpc-aif-2025pg01-226).

A Slurm account with a QoS that allows running CPU-only jobs (e.g. 2cpu-single-host).

Access to the anaconda3 environment module.

The installation part is described in Performing the installation.

To test the installation, you need to have:

Access to GPUs on the cluster through a Slurm account on Discoverer+ that has GRES resources assigned to it.

Python code (a script or a Jupyter notebook) that uses TensorFlow with GPU support.

A Slurm batch script to run the TensorFlow Python code on the cluster as a job provisioned and controlled by Slurm.

A Slurm interactive session to run the TensorFlow Python code on the cluster through an interactive Bash session started with srun.

The test part is described in Running TensorFlow on Discoverer+.

List of the versions available for the installation ¶

To install TensorFlow with GPU support on Discoverer+, we need to request the installation of the TensorFlow-GPU package from the conda-forge channel. Then the TensorFlow package will be automatically installed as a dependency of the TensorFlow-GPU package.

The first step towards determining which TensorFlow-GPU versions and builds are available is to execute the following commands on the login node of Discoverer+:

module load anaconda3
conda search tensorflow-gpu -c conda-forge | grep tensorflow-gpu

Note

Even though we specify -c conda-forge, conda search may still show results from other configured channels (such as pkgs/main) in the output. The output includes a column indicating which channel each package comes from. This is normal behavior: conda search shows all available packages across all configured channels, while the -c conda-forge flag prioritizes conda-forge when installing.

Look at the end of the output where the latest versions are listed; it may look like:

tensorflow-gpu                2.18.0 cuda126py310h418687c_200  conda-forge
tensorflow-gpu                2.18.0 cuda126py310h418687c_250  conda-forge
tensorflow-gpu                2.18.0 cuda126py310h418687c_251  conda-forge
tensorflow-gpu                2.18.0 cuda126py311h418687c_200  conda-forge
tensorflow-gpu                2.18.0 cuda126py311h418687c_250  conda-forge
tensorflow-gpu                2.18.0 cuda126py312h418687c_200  conda-forge
tensorflow-gpu                2.18.0 cuda126py312h418687c_201  conda-forge
tensorflow-gpu                2.18.0 cuda126py312h418687c_250  conda-forge
tensorflow-gpu                2.18.0 cuda126py39h418687c_200  conda-forge
tensorflow-gpu                2.18.0 cuda126py39h418687c_250  conda-forge
tensorflow-gpu                2.18.0 cuda128h6316801_202  conda-forge
tensorflow-gpu                2.18.0 cuda128h6316801_252  conda-forge
tensorflow-gpu                2.18.1 cuda124py310hd65659d_200  pkgs/main
tensorflow-gpu                2.18.1 cuda124py311hd65659d_200  pkgs/main
tensorflow-gpu                2.18.1 cuda124py312hd65659d_200  pkgs/main
tensorflow-gpu                2.18.1 cuda124py39hd65659d_200  pkgs/main
tensorflow-gpu                2.19.0 cuda128h6316801_202  conda-forge
tensorflow-gpu                2.19.0 cuda128h6316801_203  conda-forge
tensorflow-gpu                2.19.0 cuda128h6316801_253  conda-forge
tensorflow-gpu                2.19.1 cuda124py310hd65659d_200  pkgs/main
tensorflow-gpu                2.19.1 cuda124py311hd65659d_200  pkgs/main
tensorflow-gpu                2.19.1 cuda124py312hd65659d_200  pkgs/main
tensorflow-gpu                2.19.1 cuda124py39hd65659d_200  pkgs/main

We are most interested in the second and third columns, where the versions and build strings are listed. Note that the versions of CUDA and Python base are encoded in the build strings. For instance, cuda124py312hd65659d_200 for version 2.19.1 shows CUDA 12.4 and Python 3.12 compatibility.
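The decoding of a build string described above can be sketched in Python. This is only an illustration: the layout `cuda<digits>[py<digits>]h<hash>_<build number>` is inferred from the listing shown in this document, the function name `parse_build_string` is hypothetical, and the assumptions about the number of digits in the major versions are marked in the comments.

```python
import re

# Assumed layout, inferred from the listing above:
#   cuda<digits>[py<digits>]h<hash>_<build number>
BUILD_RE = re.compile(r"^cuda(\d+)(?:py(\d+))?h([0-9a-f]+)_(\d+)$")


def parse_build_string(build):
    """Return (cuda_version, python_version, build_number) for a build string.

    python_version is None when the build string carries no py<digits> tag.
    """
    match = BUILD_RE.match(build)
    if match is None:
        raise ValueError(f"unrecognised build string: {build!r}")
    cuda_digits, py_digits, _hash, build_number = match.groups()
    # Assumption: the CUDA major version has two digits (e.g. 124 -> 12.4).
    cuda = f"{cuda_digits[:2]}.{cuda_digits[2:]}"
    # Assumption: the Python major version has one digit (e.g. 312 -> 3.12).
    python = f"{py_digits[0]}.{py_digits[1:]}" if py_digits else None
    return cuda, python, int(build_number)


if __name__ == "__main__":
    for build in ("cuda124py312hd65659d_200", "cuda126py39h418687c_250",
                  "cuda128h6316801_202"):
        print(build, "->", parse_build_string(build))
```

Build strings that do not follow the assumed layout raise a ValueError instead of being guessed at.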

When planning the installation of the TensorFlow-GPU package, be sure to create in advance a Python virtual environment with the same Python base as the one encoded in the build string. Suppose we want Python 3.12 compatibility. In that case, we may choose to install TensorFlow-GPU version 2.19.1, which has a build for Python 3.12. When you specify only the version (e.g., tensorflow-gpu=2.19.1), conda will automatically select a build compatible with your Python virtual environment.
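To shortlist the builds that match a chosen Python base, the text printed by conda search can be filtered. The sketch below is a hypothetical helper (`builds_for_python` is not part of conda) and assumes the four-column layout shown in the listing above and purely numeric dotted versions. Builds without a py<digits> tag are simply excluded by this filter.

```python
SAMPLE = """\
tensorflow-gpu                2.18.0 cuda126py312h418687c_250  conda-forge
tensorflow-gpu                2.18.0 cuda128h6316801_252  conda-forge
tensorflow-gpu                2.19.1 cuda124py311hd65659d_200  pkgs/main
tensorflow-gpu                2.19.1 cuda124py312hd65659d_200  pkgs/main
"""


def builds_for_python(search_output, py_tag):
    """Return (version, build, channel) tuples whose build contains py_tag.

    search_output is the text printed by `conda search`; py_tag is e.g. "py312".
    Results are sorted by version, newest last.
    """
    rows = []
    for line in search_output.splitlines():
        fields = line.split()
        # Expected columns: name, version, build, channel.
        if len(fields) != 4 or fields[0] != "tensorflow-gpu":
            continue
        _name, version, build, channel = fields
        # The trailing "h" keeps "py31" from matching "py312".
        if py_tag + "h" in build:
            rows.append((version, build, channel))
    # Sort numerically on the dotted version, e.g. "2.19.1" -> (2, 19, 1).
    rows.sort(key=lambda row: tuple(int(part) for part in row[0].split(".")))
    return rows


if __name__ == "__main__":
    for row in builds_for_python(SAMPLE, "py312"):
        print(*row)
```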

Performing the installation ¶

The installation of TensorFlow in general should be performed on a per-user or per-project basis. We do not provide pre-installed TensorFlow-GPU packages on the cluster, so you should not look for environment modules to load a specific TensorFlow installation. Instead, each user or project should install TensorFlow individually into their project folder, using the Conda tool. This way the installation is not shared with other users and projects, can include a specific version of TensorFlow-GPU, or other packages, and may be updated later at the user’s discretion.

Therefore, you need to create a separate Python virtual environment to host each TensorFlow installation. That same virtual environment can be later used to run TensorFlow jobs on the cluster using Slurm.

Warning

We strongly discourage running installation tasks directly on the login node on the Discoverer+ system. This is because the login node is shared by all users and the installation tasks, which are usually highly I/O intensive, may actively compete for resources with other users’ scripts and processes and exhaust the login node resources. Therefore, the installation tasks should be run only as Slurm jobs.

Note

Allocation of GPU resources from the project account is not required for installing TensorFlow-GPU packages through Slurm. This is because the Slurm installation job scripts provided below use the CONDA_OVERRIDE_CUDA environment variable to tell conda which CUDA version to use. If CONDA_OVERRIDE_CUDA is not set in the script, conda will fail with errors indicating that the __cuda component is missing on the target system, even if the environment module nvidia/cuda/12 is loaded. In other words, loading the environment module nvidia/cuda/12 is not enough for the dependency solver incorporated in conda to detect the presence of CUDA libraries.
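If the installation is driven from Python rather than from Bash, the same override can be scoped to a single conda invocation by passing a modified environment to the child process. The following is a minimal sketch under that assumption: the helper name `conda_install_invocation` is hypothetical, and nothing is executed until the result is handed to subprocess.run on a system where conda is available.

```python
import os


def conda_install_invocation(prefix, spec, channel, cuda_version):
    """Build the argument list and environment for a conda install call.

    Nothing is executed here; pass the result to subprocess.run(cmd, env=env)
    on a system where the conda tool is available.
    """
    cmd = ["conda", "install", "--prefix", prefix, "-c", channel, spec, "-y"]
    # Copy the current environment and add the override for the solver only;
    # the environment of the calling process stays unchanged.
    env = dict(os.environ, CONDA_OVERRIDE_CUDA=cuda_version)
    return cmd, env


if __name__ == "__main__":
    cmd, env = conda_install_invocation(
        "/path/to/virt_envs/tf", "tensorflow-gpu=2.19.1", "pkgs/main", "12.4")
    print("CONDA_OVERRIDE_CUDA=" + env["CONDA_OVERRIDE_CUDA"], " ".join(cmd))
```

Because the arguments are passed as a list, the channel and the package specification reach conda as single arguments without any shell quoting.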

Regarding the versions of TensorFlow-GPU installed, you have two options:

Install a specific version (e.g., tensorflow-gpu=2.19.1=cuda124py312hd65659d_200 or tensorflow-gpu=2.19.1), see Install a specific version

This installs specific TensorFlow and TensorFlow-GPU versions, usually from a selected Conda channel. Both the package version and Conda channel must be quoted to perform the installation correctly. Be aware that by performing such an installation, you may see warnings about STRICT_REPO_PRIORITY reported to the standard error stream. These can be safely ignored, as the installation will complete successfully.

Note

Version specification requires quotes: you must quote both the channel and the package specification. For example: conda install --prefix ${VIRTUAL_ENV} -c "pkgs/main" "tensorflow-gpu=2.19.1" -y. Without quotes, conda will fail with conda: error: unrecognized arguments.

Install the latest version (e.g., tensorflow-gpu), see Install the latest version

This installs the latest available TensorFlow and TensorFlow-GPU versions from the enabled Conda channels. Usually, this approach relies on the standard conda installation from the conda-forge channel and doesn’t require setting any special channel priority settings.

You may install additional packages related to the TensorFlow installation, such as:

tensorflow-datasets
tensorflow-hub
tensorflow-text
tensorflow-probability
tensorflow-ranking
tensorflow-serving
tensorflow-transform
tensorflow-addons

If you need any of them, just add it to the conda install command in the installation scripts shown below.

Install a specific version ¶

Create a Slurm script (e.g. install_tensorflow_gpu.sh) with the content shown below. This script installs a specific version of TensorFlow-GPU (2.19.1). Replace <your_slurm_project_account_name> with the actual Slurm project account name. You may also need to change the VIRTUAL_ENV folder, if you are planning to use a different one.

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=install
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G

#SBATCH -o install.%j.out
#SBATCH -e install.%j.err

# Ensure that install.%j.out and install.%j.err are saved in the directory
# where you submit the job. Set the working directory of the Bash shell to the
# folder from which the script is launched.
cd ${SLURM_SUBMIT_DIR}

# Unload all previously loaded modules (in case you inherit the Bash environment)
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit 1; }

# Load the module anaconda to access the conda tool
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit 1; }

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

# Check if the target folder already exists.
[ -d "${VIRTUAL_ENV}" ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit 1; }

# Now use conda to create fresh Python virtual environment
conda create --prefix ${VIRTUAL_ENV} python=3.12 -y

if [ $? -ne 0 ]; then
  echo "Conda Python virtual environment creation failed" >&2
  exit 1
fi

echo "Python virtual environment successfully created!"

# Fully expose the Python virtual environment to the installation next
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# Install TensorFlow-GPU version 2.19.1
# Note: CONDA_OVERRIDE_CUDA tells conda which CUDA version to use, so GPU
# resources are not required during installation. This also allows specifying
# the full build string. The package specification must be quoted. This command
# may produce warnings about STRICT_REPO_PRIORITY in the standard error output,
# but the installation will complete successfully.
CONDA_OVERRIDE_CUDA=12.4 conda install --prefix ${VIRTUAL_ENV} \
   -c "pkgs/main" "tensorflow-gpu=2.19.1=cuda124py312hd65659d_200" -y

if [ $? -ne 0 ]; then
  echo "Conda installation failed" >&2
  exit 1
fi

echo "Successful installation of TensorFlow-GPU!"

To submit the installation job to the Slurm queue see Submitting the installation jobs to the Slurm queue.

Install the latest version ¶

Create a Slurm script (e.g. install_tensorflow_gpu_latest.sh) with the content shown below. This script installs the latest available version of TensorFlow-GPU. Replace <your_slurm_project_account_name> with the actual Slurm project account name. You may also need to change the VIRTUAL_ENV folder if you are planning to use a different one.

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=install
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G

#SBATCH -o install.%j.out
#SBATCH -e install.%j.err

# Ensure that install.%j.out and install.%j.err are saved in the directory
# where you submit the job. Set the working directory of the Bash shell to the
# folder from which the script is launched.
cd ${SLURM_SUBMIT_DIR}

# Unload all previously loaded modules (in case you inherit the Bash environment)
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit 1; }

# Load the module anaconda to access the conda tool
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit 1; }

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

# Check if the target folder already exists.
[ -d "${VIRTUAL_ENV}" ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit 1; }

# Now use conda to create fresh Python virtual environment
conda create --prefix ${VIRTUAL_ENV} python=3.12 -y

if [ $? -ne 0 ]; then
  echo "Conda Python virtual environment creation failed" >&2
  exit 1
fi

echo "Python virtual environment successfully created!"

# Fully expose the Python virtual environment to the installation next
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# Install latest TensorFlow-GPU version
# Note: CONDA_OVERRIDE_CUDA tells conda which CUDA version to use, so GPU
# resources are not required during installation. Using conda-forge channel
# for standard installation of the latest version. Conda will automatically
# select the latest compatible version and build for Python 3.12.
CONDA_OVERRIDE_CUDA=12.4 conda install --prefix ${VIRTUAL_ENV} \
   -c conda-forge tensorflow-gpu -y

if [ $? -ne 0 ]; then
  echo "Conda installation failed" >&2
  exit 1
fi

echo "Successful installation of TensorFlow-GPU!"

To submit the installation job to the Slurm queue see Submitting the installation jobs to the Slurm queue.

Submitting the installation jobs to the Slurm queue ¶

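To submit the installation script to the Slurm queue, use the command below (replace the file name with install_tensorflow_gpu_latest.sh if you created the script that installs the latest version):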

sbatch install_tensorflow_gpu.sh

and wait for the job to be accepted and finished.

Once the job is finished, you can check if the installation was successful by inspecting the standard output stream stored in the file install.%j.out, where Slurm replaces %j with the numeric job ID (replace <job_id> below with the ID reported by sbatch):

cat install.<job_id>.out

and the standard error stream stored in the file install.%j.err:

cat install.<job_id>.err

If the installation was successful, you should see in install.%j.out the following messages:

Python virtual environment successfully created!
Successful installation of TensorFlow-GPU!

If the installation failed, the file install.%j.err will contain error messages indicating the cause of the failure, which you can use for troubleshooting. See the section Troubleshooting for more details.
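The check of the output file can also be automated. The sketch below assumes the two success messages printed by the installation scripts in this document. The helper name `missing_markers` is hypothetical, and the file name passed to it must be the actual one, with %j already replaced by the job ID.

```python
from pathlib import Path

# Messages echoed by the installation scripts on success.
EXPECTED = (
    "Python virtual environment successfully created!",
    "Successful installation of TensorFlow-GPU!",
)


def missing_markers(out_file):
    """Return the expected messages that are absent from the job output file."""
    text = Path(out_file).read_text()
    return [marker for marker in EXPECTED if marker not in text]


if __name__ == "__main__":
    import sys

    missing = missing_markers(sys.argv[1])
    if missing:
        print("Installation incomplete, missing:", *missing, sep="\n  ")
        sys.exit(1)
    print("Both success messages found.")
```

An empty list means that both stages reported success. Otherwise the returned messages show which stage did not complete.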

Running TensorFlow on Discoverer+¶

Once the installation is performed successfully as explained above, the GPU-accelerated TensorFlow installation can be utilised through a Slurm job, or run interactively by utilising srun. In this case, the Slurm job must utilise an account and QoS that allow the allocation of GPUs as GRES resources (e.g. --gres=gpu:1). Otherwise TensorFlow will not be able to utilise the GPU devices on the compute nodes.
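As a minimal sketch of what GPU detection amounts to, the snippet below asks TensorFlow for the GPU devices it can see. It is an independent illustration, not the helper script referenced in this document. It uses tf.config.list_physical_devices and returns None when TensorFlow cannot be imported, so it can also be run outside the virtual environment without failing.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or None if TensorFlow
    is not importable in the current Python environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow is installed, but no GPU is visible")
```

An empty list inside a job usually means that no GPU was allocated to it, for example because --gres=gpu:1 was not requested.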

For testing purposes, we provide a helper Python script, available for download at:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

Download the code into the project folder:

cd /valhalla/projects/<your_slurm_project_account_name>/
wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

Then use one of the examples below to run the downloaded Python script that calls the GPU-accelerated TensorFlow.

Running TensorFlow using Slurm batch script ¶

Create a Slurm script (e.g. run_tensorflow_gpu.sh) with the content shown below. Replace <your_slurm_project_account_name> with the actual Slurm project account name. You may also need to change the VIRTUAL_ENV folder, if you set a different one during the installation.

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=run_tensorflow_gpu
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_slurm_project_account_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --gres=gpu:1

#SBATCH -o run_tensorflow_gpu.%j.out
#SBATCH -e run_tensorflow_gpu.%j.err

# Ensure that run_tensorflow_gpu.%j.out and run_tensorflow_gpu.%j.err are saved
# in the directory where you submit the job. Set the working directory of the
# Bash shell to the folder from which the script is launched.
cd ${SLURM_SUBMIT_DIR}

# Unload all previously loaded modules (in case you inherit the Bash environment)
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit 1; }

# Load the CUDA module to access the CUDA libraries
module load nvidia/cuda/12 || { echo "Failed to load nvidia/cuda/12 module. Exiting."; exit 1; }

# Export the path to the Python virtual environment folder
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf

# Check if the target folder already exists.
[ -d "${VIRTUAL_ENV}" ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

# Fully expose the Python virtual environment to the TensorFlow execution next
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# Run the TensorFlow GPU detection script
python tensorflow_gpu_detection.py

To submit the job to the queue:

sbatch run_tensorflow_gpu.sh

Once successfully submitted, you can check if the job is queued and/or running by executing:

squeue --me

If the job is running at the moment, information about its execution will be presented as:

JOBID PARTITION NAME    USER      ST      TIME  NODES NODELIST(REASON)
 1980 common    run_ten username  R       0:06      1 dgx1
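When the state of a job has to be checked from a script, the table printed by squeue can be turned into records. The sketch below is a hypothetical helper (`parse_squeue` is not part of Slurm) that assumes the default column layout shown above, where only the last column may contain spaces.

```python
SAMPLE = """\
JOBID PARTITION NAME    USER      ST      TIME  NODES NODELIST(REASON)
 1980 common    run_ten username  R       0:06      1 dgx1
"""


def parse_squeue(text):
    """Parse the default `squeue` table into a list of dicts keyed by column."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # The last column may contain spaces, so limit the number of splits.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


if __name__ == "__main__":
    for job in parse_squeue(SAMPLE):
        print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

The ST column holds the job state code, such as R for a running job in the listing above.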

The execution of the job will create two files in the current directory: one capturing the standard output, and another collecting the standard error messages:

run_tensorflow_gpu.1980.err
run_tensorflow_gpu.1980.out

Here 1980 is the job id. That number in your case will be different.

The file run_tensorflow_gpu.1980.out will contain the results (should be the same as those reported for the interactive execution). See Results of the proper execution of the test script for more details about what to expect to find in the file.

Running TensorFlow interactively ¶

Warning

This is not a recommended way of running TensorFlow. Use this example for checks only!

In the example below we request the execution of an interactive Bash session with Slurm that utilises 1 GPU (--gres=gpu:1) for 30 minutes (we may exit the session after the test by pressing Ctrl+D without waiting for the time to elapse):

srun --nodes=1 --ntasks-per-node=2 --gres=gpu:1 \
   --job-name=run_tensorflow_gpu \
   --partition=common \
   --account=<your_slurm_project_account_name> \
   --qos=<your_slurm_project_account_name> \
   --time=00:30:00 --pty /bin/bash

Once the interactive session is successfully started, we need to expose the CUDA library installation to the environment and run the downloaded test Python script, which calls the GPU-accelerated TensorFlow (copy and paste the commands shown below into the interactive session):

module load nvidia/cuda/12/12.8
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/tf
[ -d ${VIRTUAL_ENV} ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
python /valhalla/projects/${SLURM_JOB_ACCOUNT}/tensorflow_gpu_detection.py

See Results of the proper execution of the test script for more details about what to expect to get as a result.

Results of the proper execution of the test script ¶

In case of successful execution, output similar to the following will be displayed (the reported versions, paths, and timings depend on your installation):

============================================================
 TensorFlow GPU Detection Script
============================================================

Library Import Check
--------------------
✓ TensorFlow version: 2.16.1
✓ NumPy version: 1.26.4
✓ CUDA available: True

CUDA and GPU Information
------------------------
CUDA available: True
CUDA version: 12.8
Number of GPUs: 1

GPU Details
-----------

GPU 0:
  Name: /physical_device:GPU:0
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0

TensorFlow GPU Test
-------------------
Creating test tensors...
Tensor A shape: (1000, 1000)
Tensor B shape: (1000, 1000)
Running matrix multiplication on GPU...
Result shape: (1000, 1000)
Computation time: 0.0123 seconds
Device: /GPU:0
GPU memory after computation: 0.08 GB
Memory cleaned up successfully
[SUCCESS] TensorFlow GPU test completed successfully!

Neural Network Test
-------------------
Creating simple neural network...
Model created successfully
Running forward pass...
Input shape: (32, 784)
Output shape: (32, 10)
Forward pass time: 0.0045 seconds
Device: /GPU:0
[SUCCESS] Neural network test completed successfully!

Environment Information
-----------------------
Python version: 3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]
Platform: linux
Current working directory: /home/username
CUDA_HOME: /usr/local/cuda-12.8
CUDA_PATH: /usr/local/cuda-12.8
LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64
VIRTUAL_ENV: /valhalla/projects/<your_slurm_project_account_name>/virt_envs/tf

============================================================
 Test Summary
============================================================
Tests passed: 3/3
[SUCCESS] All tests passed! TensorFlow is working correctly.

Help ¶

If you experience issues with the installation, contact the Discoverer HPC support team (see Getting help).