TensorFlow (GPU)

About

This document shows how to install and use TensorFlow with GPU support in a Python virtual environment on the Discoverer+ GPU cluster. Note that the method described here does not lock the shell environment into the virtual environment.

Important: TensorFlow cannot be installed in the same Conda environment as PyTorch due to dependency conflicts. Create a separate Conda environment for TensorFlow. If you need both frameworks, maintain one environment for each and switch between them as needed.

The guide covers the complete workflow from creating a conda environment to running TensorFlow jobs, ensuring that users can overcome common Slurm configuration challenges and successfully utilize the GPU resources available on the cluster.

Use Conda to install TensorFlow with NVIDIA CUDA support on Discoverer+ GPU cluster

Note that we need a Python version that the latest stable TensorFlow release supports. In our case, that is 3.11. While Python 3.13 and 3.14 are available, TensorFlow does not fully support these newer versions yet, and we cannot rely on bleeding-edge technology for running production jobs on HPC systems.
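
A quick way to confirm that a given interpreter matches the version targeted here is a small check like the one below. This is only a sketch: the helper name python_matches is made up for illustration, and 3.11 is simply the version chosen in this guide.

```python
import sys


def python_matches(expected=(3, 11), version_info=None):
    """Return True if the (major, minor) version equals `expected`.

    `python_matches` is a hypothetical helper, not part of TensorFlow or Conda.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == tuple(expected)


# Report on the interpreter that runs this snippet.
status = "matches" if python_matches() else "does not match"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} {status} 3.11")
```

Running it with the interpreter of a freshly created environment shows at a glance whether that environment was built with the intended Python version.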

Here we use a Slurm interactive session bound to the project Slurm account, but on a CPU-only basis. This way no GPU resources from the account will be spent. This is supported by the QoS named “2cpu-single-host”.

Start an interactive Bash session on one of the compute nodes (this requires invoking the srun tool). The example below creates an interactive Bash session that will last up to 30 minutes:

srun -N 1 -n 2 --partition=common \
   --account=your_slurm_project_account_name \
   --qos 2cpu-single-host --time=00:30:00 --pty /bin/bash

Wait for the session to start. Only then follow the instructions given below.

We will use the running session to create a separate Python virtual environment and install TensorFlow with CUDA support in it. Note that this creates a new environment (tensorflow_env) that is separate from any existing PyTorch environment (pytorch_env). All commands provided below must be executed in that same interactive Bash session. Do not execute them directly on the login node.

module load anaconda3
module load nvidia/cuda/12/12.8
conda create \
   --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
   python=3.11
conda install \
   --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
   tensorflow-gpu -c conda-forge

You need to type “y” whenever Conda asks you to confirm the installation of packages.

In case of success (no errors displayed), you will obtain a Python 3.11 virtual environment with the latest TensorFlow with CUDA support. That environment will be located in the following folder:

/valhalla/projects/your_slurm_project_account_name/tensorflow_env/
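
Because the environment is addressed by its prefix rather than activated, a driver script can build the same settings that this guide later exports by hand (PATH and VIRTUAL_ENV). The sketch below shows one possible way to do that. The helper name env_without_activation is hypothetical, and the path is the placeholder used throughout this guide.

```python
import os
from pathlib import Path


def env_without_activation(prefix, base_env=None):
    """Return a copy of the environment with the Conda prefix first on PATH.

    Hypothetical helper: it mirrors exporting PATH and VIRTUAL_ENV by hand
    instead of running `conda activate`.
    """
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = str(Path(prefix) / "bin")
    old_path = env.get("PATH", "")
    # Prepend the environment's bin directory, keeping any existing entries.
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    env["VIRTUAL_ENV"] = str(Path(prefix))
    return env


# Placeholder path taken from this guide; substitute your own project account.
prefix = "/valhalla/projects/your_slurm_project_account_name/tensorflow_env"
env = env_without_activation(prefix, {"PATH": "/usr/bin"})
print(env["PATH"])
```

The returned dictionary can be passed as the env argument of subprocess.run, so the calling shell itself is never modified.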

You can test the integrity of the installation in that same interactive Bash session (or another interactive session):

/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
   -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
   -c "import tensorflow as tf; print('CUDA available:', tf.config.list_physical_devices('GPU'))"

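You should get results like these. The device list is populated only if the session has a GPU allocated, so in a CPU-only session it may be empty ([]):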

TensorFlow version: 2.16.1
CUDA available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
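
The two one-liners above can also be combined into a small script that distinguishes a broken installation from a session that simply has no GPU. This is a minimal sketch, not the helper script used later in this guide. The function names describe_gpus and check_installation are invented for illustration.

```python
def describe_gpus(device_names):
    """Summarise a list of GPU device names in one line."""
    if not device_names:
        return "No GPU visible to TensorFlow"
    return f"{len(device_names)} GPU(s) visible: " + ", ".join(device_names)


def check_installation():
    """Return 0 if TensorFlow imports, 1 otherwise, printing what was found."""
    try:
        # Imported lazily so that an import failure is reported cleanly.
        import tensorflow as tf
    except ImportError as exc:
        print(f"TensorFlow import failed: {exc}")
        return 1
    print("TensorFlow version:", tf.__version__)
    gpus = tf.config.list_physical_devices("GPU")
    print(describe_gpus([gpu.name for gpu in gpus]))
    return 0


if __name__ == "__main__":
    check_installation()
```

Save it under any name and run it with the bin/python interpreter of the environment.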

Now you can press Ctrl-D to terminate the interactive Bash session controlled by Slurm. Alternatively, you may leave the session open, but Slurm will terminate it once the 30-minute time limit is reached.

Running TensorFlow on Discoverer+

Once the installation has completed successfully as explained above, TensorFlow can be run through a Slurm batch job, or interactively by means of srun. In both cases, the job must use the default QoS of the Slurm account, which here is the QoS named “your_slurm_project_account_name”. Otherwise TensorFlow will not be able to access the GPU devices on the compute nodes.
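
One way to confirm from inside a job that GPUs were actually granted is to inspect CUDA_VISIBLE_DEVICES, which Slurm normally sets for jobs that receive GPUs through --gres. Whether Discoverer+ is configured this way is an assumption, and allocated_gpu_ids is a hypothetical helper name, so treat the following as a sketch.

```python
import os


def allocated_gpu_ids(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES (may be empty)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


ids = allocated_gpu_ids()
if ids:
    print(f"Job sees {len(ids)} GPU(s): {', '.join(ids)}")
else:
    print("CUDA_VISIBLE_DEVICES is empty or unset; check --gres and --qos")
```

If the variable is empty in a job that requested GPUs, the QoS and --gres options are the first things to review.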

Running TensorFlow interactively

This is not a recommended way of running TensorFlow. Use this example for checks only!

For testing, we need a Python helper script that can be downloaded from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To download the script:

cd /valhalla/projects/your_slurm_project_account_name/
wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

In the example below we request 2 GPUs (--gres=gpu:2):

srun -N 1 -n 2 --gres=gpu:2 \
   --partition=common \
   --account=your_slurm_project_account_name \
   --qos your_slurm_project_account_name \
   --time=00:30:00 --pty /bin/bash

Once the interactive session has started, load the CUDA module, put the environment first on PATH, and run the test Python script that calls TensorFlow:

module load nvidia/cuda/12/12.8
export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"
python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

In case of successful execution, the following result will be displayed:

============================================================
 TensorFlow GPU Detection Script
============================================================

Library Import Check
--------------------
✓ TensorFlow version: 2.16.1
✓ NumPy version: 1.26.4
✓ CUDA available: True

CUDA and GPU Information
------------------------
CUDA available: True
CUDA version: 12.1
Number of GPUs: 1

GPU Details
-----------

GPU 0:
  Name: /physical_device:GPU:0
  Memory Total: 139.83 GB
  Memory Allocated: 0.00 GB
  Memory Cached: 0.00 GB
  Compute Capability: 9.0

TensorFlow GPU Test
-------------------
Creating test tensors...
Tensor A shape: (1000, 1000)
Tensor B shape: (1000, 1000)
Running matrix multiplication on GPU...
Result shape: (1000, 1000)
Computation time: 0.0123 seconds
Device: /GPU:0
GPU memory after computation: 0.08 GB
Memory cleaned up successfully
[SUCCESS] TensorFlow GPU test completed successfully!

Neural Network Test
-------------------
Creating simple neural network...
Model created successfully
Running forward pass...
Input shape: (32, 784)
Output shape: (32, 10)
Forward pass time: 0.0045 seconds
Device: /GPU:0
[SUCCESS] Neural network test completed successfully!

Environment Information
-----------------------
Python version: 3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]
Platform: linux
Current working directory: /home/username
CUDA_HOME: /usr/local/cuda-12.8
CUDA_PATH: /usr/local/cuda-12.8
LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64
VIRTUAL_ENV: /valhalla/projects/your_slurm_project_account_name/tensorflow_env

============================================================
 Test Summary
============================================================
Tests passed: 3/3
[SUCCESS] All tests passed! TensorFlow is working correctly.

Running TensorFlow within a Slurm batch script

Create the following Slurm batch script:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=test_tensorflow
#SBATCH --time=00:30:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

#SBATCH --account=your_slurm_project_account_name
#SBATCH --qos your_slurm_project_account_name

#SBATCH -o test_tensorflow.%j.out
#SBATCH -e test_tensorflow.%j.err

export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"

module load nvidia/cuda/12/12.8

cd "$SLURM_SUBMIT_DIR"

python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

and save it as /valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch.

If you cannot find tensorflow_gpu_detection.py, download it from here:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To submit the job to the queue:

sbatch /valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch

Once successfully submitted, you can check if the job is running by executing:

squeue --me

If the job is running at the moment, information about its execution will be presented as:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1980    common test_ten username  R       0:06      1 dgx1

The job will create two files in the directory from which it was submitted: one capturing the standard output and another collecting the standard error messages:

test_tensorflow.1980.err
test_tensorflow.1980.out

Here 1980 is the job ID. In your case that number will be different.

The file test_tensorflow.1980.out will contain the results, which should match those reported for the interactive execution.
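
When many jobs are submitted, reading each output file by hand becomes tedious. The sketch below extracts the "Tests passed: N/M" line that the helper script prints in its summary. The function name parse_summary is hypothetical, and the line format is taken from the sample output shown earlier.

```python
import re


def parse_summary(text):
    """Return (passed, total) from a 'Tests passed: N/M' line, or None if absent."""
    match = re.search(r"^Tests passed:\s*(\d+)/(\d+)\s*$", text, flags=re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Inline sample that mimics the tail of the job output.
sample = "Test Summary\nTests passed: 3/3\n[SUCCESS] All tests passed!\n"
print(parse_summary(sample))  # prints (3, 3)
```

For a real job, read the file first, for example with pathlib.Path("test_tensorflow.1980.out").read_text(), and pass the text to parse_summary.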

Additional TensorFlow Libraries

You may also want to install additional TensorFlow libraries depending on your use case:

# For TensorFlow Extended (TFX) - ML pipeline platform
conda install \
 --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
 tfx -c conda-forge

# For TensorFlow Probability - probabilistic programming
conda install \
 --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
 tensorflow-probability -c conda-forge

# For TensorFlow Datasets - ready-to-use datasets
conda install \
 --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
 tensorflow-datasets -c conda-forge

# For TensorFlow Hub - pre-trained models
conda install \
 --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
 tensorflow-hub -c conda-forge
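
To see which of these optional libraries are already present in the environment, a small check along the following lines may help. It is a sketch only: missing_packages is a hypothetical helper, and the mapping pairs each package name with the module name it is normally imported under.

```python
from importlib.util import find_spec

# Conda package name -> Python module name used in import statements.
OPTIONAL_MODULES = {
    "tfx": "tfx",
    "tensorflow-probability": "tensorflow_probability",
    "tensorflow-datasets": "tensorflow_datasets",
    "tensorflow-hub": "tensorflow_hub",
}


def missing_packages(modules=OPTIONAL_MODULES):
    """Return the package names whose import module cannot be located."""
    return [pkg for pkg, module in modules.items() if find_spec(module) is None]


print("Not installed:", missing_packages() or "none")
```

find_spec only locates the modules without importing them, so the check stays fast even when TensorFlow itself is installed.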

Example Usage

Here’s a simple example of how to use TensorFlow in your Python scripts:

import tensorflow as tf
import numpy as np

# Check GPU availability
print("GPUs available:", tf.config.list_physical_devices('GPU'))

# Create a simple neural network
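model = tf.keras.Sequential([
    # Declare the input shape explicitly; Keras 3 warns when input_shape is passed to a layer
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])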

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Generate some dummy data
x_train = np.random.random((1000, 784)).astype(np.float32)
y_train = np.random.randint(0, 10, (1000,)).astype(np.int32)

# Train the model
with tf.device('/GPU:0'):
    history = model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=1)

print("Training completed successfully!")
print(f"Final accuracy: {history.history['accuracy'][-1]:.4f}")
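
The example above pins training to a single device (/GPU:0), while the jobs in this guide request two GPUs. One common way to use all visible GPUs is tf.distribute.MirroredStrategy. The following is a minimal sketch under that assumption. The helper names global_batch_size and train_on_all_gpus are invented for illustration, and the model and dummy data simply mirror the example above.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Batch size for model.fit so that each replica gets per_replica_batch samples."""
    return per_replica_batch * max(1, num_replicas)


def train_on_all_gpus(per_replica_batch=32, epochs=1):
    """Train a small model on dummy data across every GPU TensorFlow can see."""
    # Imported lazily so the helper above stays usable without TensorFlow.
    import numpy as np
    import tensorflow as tf

    # MirroredStrategy replicates the model on each visible GPU
    # (it falls back to a single replica when only a CPU is present).
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Model and optimizer must be created inside the strategy scope.
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(784,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    # Dummy data, same shapes as in the single-GPU example.
    x_train = np.random.random((1000, 784)).astype(np.float32)
    y_train = np.random.randint(0, 10, (1000,)).astype(np.int32)

    # Scale the batch so each replica still processes per_replica_batch samples.
    batch = global_batch_size(per_replica_batch, strategy.num_replicas_in_sync)
    return model.fit(x_train, y_train, epochs=epochs, batch_size=batch, verbose=1)
```

Calling train_on_all_gpus() inside a job started with --gres=gpu:2 should spread each batch over both devices. Scaling the batch size with the replica count is a design choice, not a requirement.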