TensorFlow (GPU)
================

.. contents:: Table of Contents
   :depth: 2

About
-----

This document shows how to install and use TensorFlow with GPU support in a Python virtual environment on the Discoverer+ GPU cluster. The method described here does not lock the shell environment into the virtual environment.

**Important**: TensorFlow cannot be installed in the same Conda environment as PyTorch because of dependency conflicts. If you need both frameworks, maintain a separate environment for each and switch between them as needed.

This guide covers the complete workflow, from creating a Conda environment to running TensorFlow jobs, including the Slurm settings required to access the GPU resources available on the cluster.

Use Conda to install TensorFlow with NVIDIA CUDA support on Discoverer+ GPU cluster
------------------------------------------------------------------------------------

We need a Python version that is supported by the latest stable TensorFlow release. In our case, that is 3.11. Python 3.13 and 3.14 are available, but TensorFlow does not fully support them yet, and production jobs on HPC systems should not depend on bleeding-edge software.
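
A small guard at the top of a job script can catch an unsuitable interpreter before TensorFlow is imported. The sketch below is illustrative only: the helper name ``python_is_supported`` and the version bounds (3.9 to 3.12) are assumptions made for this example, so check the release notes of the TensorFlow version you actually install.

.. code-block:: python

   import sys

   # Assumed bounds for illustration; adjust them to the TensorFlow release in use.
   MIN_PYTHON = (3, 9)
   MAX_PYTHON = (3, 12)


   def python_is_supported(version=None, minimum=MIN_PYTHON, maximum=MAX_PYTHON):
       """Return True if the (major, minor) part of ``version`` lies within the bounds."""
       if version is None:
           version = sys.version_info
       major_minor = tuple(version[:2])
       return minimum <= major_minor <= maximum


   if __name__ == "__main__":
       current = f"{sys.version_info.major}.{sys.version_info.minor}"
       if python_is_supported():
           print(f"Python {current} is within the assumed supported range")
       else:
           print(f"Python {current} is outside the assumed supported range")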

Here we use a Slurm interactive session bound to the project Slurm account, but with CPU resources only. This way, no GPU resources from the account are spent. This is supported by the QoS named ``2cpu-single-host``.

Start an interactive Bash session on one of the compute nodes by invoking ``srun``. The example below creates an interactive Bash session that lasts up to 30 minutes:

.. code-block:: bash

   srun -N 1 -n 2 --partition=common \
      --account=your_slurm_project_account_name \
      --qos 2cpu-single-host --time=00:30:00 --pty /bin/bash

Wait for the session to start. Only then follow the instructions given below.

We will use the running session to create a **separate** Python virtual environment and install TensorFlow with CUDA support in it. This creates a new environment (``tensorflow_env``) that is independent of any existing PyTorch environment (``pytorch_env``). Run all commands below inside that same interactive Bash session. Do not execute them directly on the login node.

.. code-block:: bash

   module load anaconda3
   module load nvidia/cuda/12/12.8
   conda create \
      --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
      python=3.11
   conda install \
      --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
      tensorflow-gpu -c conda-forge

Type "y" whenever Conda asks you to confirm the installation of packages.

If no errors are displayed, you will have a Python 3.11 virtual environment containing the latest TensorFlow build with CUDA support. The environment is located in the following folder:

.. code-block:: bash

   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/

You can test the integrity of the installation in that same interactive Bash session (or another interactive session):

.. code-block:: bash

   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
   -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
   -c "import tensorflow as tf; print('CUDA available:', tf.config.list_physical_devices('GPU'))"

You should get results like these:

.. code-block:: bash

   TensorFlow version: 2.16.1
   CUDA available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Now you can press Ctrl-D to terminate the interactive Bash session controlled by Slurm. If you leave the session open, Slurm will terminate it once the 30-minute time limit is reached.

Running TensorFlow on Discoverer+
---------------------------------

Once the installation has completed successfully as explained above, TensorFlow can be used in a Slurm batch job or interactively through ``srun``. In both cases, the job must use the default QoS of the Slurm account, which here is the QoS named ``your_slurm_project_account_name``. Otherwise, TensorFlow will not be able to access the GPU devices on the compute nodes.
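
One way to confirm from inside a job that GPUs were actually granted is to inspect the environment before importing TensorFlow. The following is a minimal sketch, assuming that Slurm on the cluster exports ``CUDA_VISIBLE_DEVICES`` for ``--gres=gpu:N`` allocations (a common but site-dependent configuration); the helper name ``visible_gpu_ids`` is hypothetical.

.. code-block:: python

   import os


   def visible_gpu_ids(environ=None):
       """Return the device identifiers listed in CUDA_VISIBLE_DEVICES."""
       if environ is None:
           environ = os.environ
       raw = environ.get("CUDA_VISIBLE_DEVICES", "")
       # Drop empty items so that an unset or empty variable yields an empty list.
       return [item.strip() for item in raw.split(",") if item.strip()]


   if __name__ == "__main__":
       gpus = visible_gpu_ids()
       if gpus:
           print(f"CUDA_VISIBLE_DEVICES lists {len(gpus)} device(s): {', '.join(gpus)}")
       else:
           # Assumption: inside a GPU job this variable is set by Slurm.
           print("CUDA_VISIBLE_DEVICES is empty or unset; check --gres and --qos")

An empty result inside a job that requested GPUs suggests reviewing the ``--gres`` and ``--qos`` options of the request.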

Running TensorFlow interactively
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is not a recommended way of running TensorFlow. Use this example for checks only!

For testing, we need a Python helper script that can be downloaded from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To download the code:

.. code-block:: bash

   cd /valhalla/projects/your_slurm_project_account_name/
   wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

In the example below, we request 2 GPUs (``--gres=gpu:2``):

.. code-block:: bash

   srun -N 1 -n 2 --gres=gpu:2 \
      --partition=common \
      --account=your_slurm_project_account_name \
      --qos your_slurm_project_account_name \
      --time=00:30:00 --pty /bin/bash

Once the interactive session has started, load the CUDA module and run the test Python script that calls TensorFlow:

.. code-block:: bash

   module load nvidia/cuda/12/12.8
   export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
   export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"
   python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

If the script runs successfully, output similar to the following is displayed (version numbers, GPU count, and timings depend on your allocation):

.. code-block:: bash

   ============================================================
    TensorFlow GPU Detection Script
   ============================================================

   Library Import Check
   --------------------
   ✓ TensorFlow version: 2.16.1
   ✓ NumPy version: 1.26.4
   ✓ CUDA available: True

   CUDA and GPU Information
   ------------------------
   CUDA available: True
   CUDA version: 12.1
   Number of GPUs: 1

   GPU Details
   -----------

   GPU 0:
     Name: /physical_device:GPU:0
     Memory Total: 139.83 GB
     Memory Allocated: 0.00 GB
     Memory Cached: 0.00 GB
     Compute Capability: 9.0

   TensorFlow GPU Test
   -------------------
   Creating test tensors...
   Tensor A shape: (1000, 1000)
   Tensor B shape: (1000, 1000)
   Running matrix multiplication on GPU...
   Result shape: (1000, 1000)
   Computation time: 0.0123 seconds
   Device: /GPU:0
   GPU memory after computation: 0.08 GB
   Memory cleaned up successfully
   [SUCCESS] TensorFlow GPU test completed successfully!

   Neural Network Test
   -------------------
   Creating simple neural network...
   Model created successfully
   Running forward pass...
   Input shape: (32, 784)
   Output shape: (32, 10)
   Forward pass time: 0.0045 seconds
   Device: /GPU:0
   [SUCCESS] Neural network test completed successfully!

   Environment Information
   -----------------------
   Python version: 3.11.13 (main, Jun  5 2025, 13:12:00) [GCC 11.2.0]
   Platform: linux
   Current working directory: /home/username
   CUDA_HOME: /usr/local/cuda-12.8
   CUDA_PATH: /usr/local/cuda-12.8
   LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64
   VIRTUAL_ENV: /valhalla/projects/your_slurm_project_account_name/tensorflow_env

   ============================================================
    Test Summary
   ============================================================
   Tests passed: 3/3
   [SUCCESS] All tests passed! TensorFlow is working correctly.

Running TensorFlow within a Slurm batch script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create the following Slurm batch script:

.. code-block:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=test_tensorflow
   #SBATCH --time=00:30:00

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2

   #SBATCH --account=your_slurm_project_account_name
   #SBATCH --qos your_slurm_project_account_name

   #SBATCH -o test_tensorflow.%j.out
   #SBATCH -e test_tensorflow.%j.err

   export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
   export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"

   module load nvidia/cuda/12/12.8

   cd "$SLURM_SUBMIT_DIR"

   python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

Save it as ``/valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch``.

If ``tensorflow_gpu_detection.py`` is not already in the project folder, download it from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To submit the job to the queue:

.. code-block:: bash

   sbatch /valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch
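
When submission is scripted, the job ID can be taken from the confirmation message that ``sbatch`` prints (normally ``Submitted batch job <id>``) and used to predict the log file names defined by the ``-o`` and ``-e`` patterns in the batch script. The helper names below are hypothetical, and the sample message is hard-coded for illustration:

.. code-block:: python

   import re


   def job_id_from_sbatch(output):
       """Extract the numeric job ID from the sbatch confirmation message."""
       match = re.search(r"Submitted batch job (\d+)", output)
       return int(match.group(1)) if match else None


   def expected_log_files(job_id, prefix="test_tensorflow"):
       """Build the names matching the patterns <prefix>.%j.out and <prefix>.%j.err."""
       return [f"{prefix}.{job_id}.out", f"{prefix}.{job_id}.err"]


   # Hard-coded sample message; in practice this is the text printed by sbatch.
   job_id = job_id_from_sbatch("Submitted batch job 1980")
   print(expected_log_files(job_id))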

Once the job is submitted, you can check whether it is running by executing:

.. code-block:: bash

   squeue --me

If the job is running, its status is displayed as follows:

.. code-block:: bash

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1980    common test_tf username  R       0:06      1 dgx1
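
If the job state needs to be checked from a script, the table above can be turned into dictionaries keyed by column name. This is a rough sketch that assumes the default ``squeue`` column layout shown above and job names without spaces; the function name ``parse_squeue`` is hypothetical, and the sample text is copied from the output above.

.. code-block:: python

   def parse_squeue(text):
       """Parse squeue table output into a list of dicts keyed by column header."""
       lines = [line for line in text.splitlines() if line.strip()]
       if not lines:
           return []
       header = lines[0].split()
       jobs = []
       for line in lines[1:]:
           # The last column may contain spaces, so cap the number of splits.
           fields = line.strip().split(None, len(header) - 1)
           jobs.append(dict(zip(header, fields)))
       return jobs


   sample = """\
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1980    common test_tf username  R       0:06      1 dgx1
   """

   jobs = parse_squeue(sample)
   # "R" in the ST column means the job is running.
   print(jobs[0]["JOBID"], jobs[0]["ST"])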

The job creates two files in the directory from which it was submitted: one capturing the standard output and another collecting the standard error messages:

.. code-block:: bash

   test_tensorflow.1980.err
   test_tensorflow.1980.out

Here ``1980`` is the job ID; yours will be different.

The file ``test_tensorflow.1980.out`` contains the results, which should match those reported for the interactive execution.
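
For automated checks, the summary line of the output file can be inspected instead of reading the whole report. The sketch below assumes the ``Tests passed: N/M`` line format shown in the interactive output above; the function names are hypothetical, and the sample text stands in for the contents of the ``.out`` file.

.. code-block:: python

   import re

   SUMMARY_PATTERN = re.compile(r"Tests passed:\s*(\d+)/(\d+)")


   def parse_test_summary(text):
       """Return (passed, total) from the summary line, or None if it is absent."""
       match = SUMMARY_PATTERN.search(text)
       if match is None:
           return None
       return int(match.group(1)), int(match.group(2))


   def all_tests_passed(text):
       """Return True only if a summary line exists and every test passed."""
       summary = parse_test_summary(text)
       return summary is not None and summary[0] == summary[1]


   # Stand-in for the file contents; in practice, read test_tensorflow.<jobid>.out.
   sample = "Tests passed: 3/3\n[SUCCESS] All tests passed! TensorFlow is working correctly.\n"
   print(all_tests_passed(sample))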

Additional TensorFlow Libraries
-------------------------------

You may also want to install additional TensorFlow libraries depending on your use case:

.. code-block:: bash

   # For TensorFlow Extended (TFX) - ML pipeline platform
   conda install \
    --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
    tfx -c conda-forge

   # For TensorFlow Probability - probabilistic programming
   conda install \
    --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
    tensorflow-probability -c conda-forge

   # For TensorFlow Datasets - ready-to-use datasets
   conda install \
    --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
    tensorflow-datasets -c conda-forge

   # For TensorFlow Hub - pre-trained models
   conda install \
    --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
    tensorflow-hub -c conda-forge

Example Usage
-------------

Here's a simple example of how to use TensorFlow in your Python scripts:

.. code-block:: python

   import tensorflow as tf
   import numpy as np

   # Check GPU availability
   print("GPUs available:", tf.config.list_physical_devices('GPU'))

   # Create a simple neural network
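   model = tf.keras.Sequential([
       # Declare the input shape explicitly; recent Keras versions warn
       # when input_shape is passed to the first layer instead.
       tf.keras.Input(shape=(784,)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dense(64, activation='relu'),
       tf.keras.layers.Dense(10, activation='softmax')
   ])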

   # Compile the model
   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   # Generate some dummy data
   x_train = np.random.random((1000, 784)).astype(np.float32)
   y_train = np.random.randint(0, 10, (1000,)).astype(np.int32)

   # Train the model
   with tf.device('/GPU:0'):
       history = model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=1)

   print("Training completed successfully!")
   print(f"Final accuracy: {history.history['accuracy'][-1]:.4f}")