TensorFlow (GPU)
================

.. contents:: Table of Contents
   :depth: 2

About
-----

This document shows how to install and use TensorFlow with GPU support in a Python virtual environment on the Discoverer+ GPU cluster. Note that the method used does not lock the shell environment into the virtual environment.

**Important**: TensorFlow cannot be installed in the same conda environment as PyTorch due to dependency conflicts. You must create a separate conda environment for TensorFlow. If you need both frameworks, maintain separate environments and switch between them as needed.

The guide covers the complete workflow, from creating a conda environment to running TensorFlow jobs, so that users can overcome common Slurm configuration challenges and use the GPU resources available on the cluster.

Use Conda to install TensorFlow with NVIDIA CUDA support on Discoverer+ GPU cluster
------------------------------------------------------------------------------------

We need a Python version that is appropriate for the latest stable TensorFlow release. In our case, that is 3.11. While Python 3.13 and 3.14 are available, TensorFlow does not fully support these newer versions yet, and we cannot rely on bleeding-edge technology for running production jobs on HPC systems.

Here we use a Slurm interactive session bound to the project Slurm account, but on a CPU-only basis. This way no GPU resources from the account are spent. This is supported by the QoS named ``2cpu-single-host``.

Start an interactive Bash session on one of the compute nodes (this requires invoking the ``srun`` tool). The example below creates an interactive Bash session that lasts 30 minutes:

.. code-block:: bash

   srun -N 1 -n 2 --partition=common \
        --account=your_slurm_project_account_name \
        --qos 2cpu-single-host --time=00:30:00 --pty /bin/bash

Wait for the session to start. Only then follow the instructions given below.
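Before installing anything, it can help to confirm that the shell really runs inside the Slurm session and not on the login node. The following is a minimal sketch, not part of the official workflow: it assumes Slurm exports its usual job variables (``SLURM_JOB_ID``, ``SLURM_JOB_NODELIST``, ``SLURM_CPUS_ON_NODE``) into the session, and the helper name ``slurm_session_info`` is hypothetical. It can be run with any Python 3 interpreter available on the node.

.. code-block:: python

   import os


   def slurm_session_info(env=None):
       """Return basic facts about the enclosing Slurm job, or None outside Slurm."""
       env = os.environ if env is None else env
       job_id = env.get("SLURM_JOB_ID")
       if job_id is None:
           # No job ID in the environment: most likely a login-node shell.
           return None
       return {
           "job_id": job_id,
           "nodes": env.get("SLURM_JOB_NODELIST", "unknown"),
           "cpus_on_node": env.get("SLURM_CPUS_ON_NODE", "unknown"),
       }


   if __name__ == "__main__":
       info = slurm_session_info()
       if info is None:
           print("Not inside a Slurm job; do not install from here.")
       else:
           print(f"Inside Slurm job {info['job_id']} on {info['nodes']} "
                 f"({info['cpus_on_node']} CPUs)")

If the script reports that it is not inside a Slurm job, wait for ``srun`` to hand over the session before continuing.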
We will use the running session to create a **separate** Python virtual environment and install TensorFlow with CUDA support in it. All commands provided below must therefore be run in that same interactive Bash session. Do not execute them directly on the login node. Note that this creates a new environment (``tensorflow_env``) that is separate from any existing PyTorch environment (``pytorch_env``).

.. code-block:: bash

   module load anaconda3
   module load nvidia/cuda/12/12.8

   # Create the environment with the Python version TensorFlow supports
   conda create \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         python=3.11

   # Install TensorFlow with CUDA support into that environment
   conda install \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         tensorflow-gpu -c conda-forge

Type "y" whenever Conda asks you to confirm the installation of packages. If no errors are displayed, you obtain a Python 3.11 virtual environment with the latest TensorFlow with CUDA support. That environment is located in the following folder:

.. code-block:: bash

   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/

You can test the integrity of the installation in that same interactive Bash session (or in another interactive session):

.. code-block:: bash

   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
         -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"

   /valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python \
         -c "import tensorflow as tf; print('CUDA available:', tf.config.list_physical_devices('GPU'))"

You should get results like these:

.. code-block:: bash

   TensorFlow version: 2.16.1
   CUDA available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If the session exposes no GPU devices, the second command prints an empty list (``[]``). In a CPU-only session that alone does not indicate a failed installation.

Now you can press Ctrl-D to terminate the interactive Bash session controlled by Slurm. Otherwise, you may leave the session open, but Slurm will terminate it once it has run for 30 minutes.
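The two one-line checks above can also be combined into a single script. The sketch below is one possible way to do that, not the method used by the cluster documentation; the function name ``describe_installation`` is hypothetical. It reports separately whether the TensorFlow build includes CUDA support (``tf.test.is_built_with_cuda()``) and how many GPUs are visible, since those two facts can differ in a session without GPU access.

.. code-block:: python

   import importlib.util


   def describe_installation() -> str:
       """Return a one-line summary of the TensorFlow installation."""
       if importlib.util.find_spec("tensorflow") is None:
           return "TensorFlow is not importable from this interpreter"
       # Import lazily so the function also works where TensorFlow is missing.
       import tensorflow as tf

       gpus = tf.config.list_physical_devices("GPU")
       return (f"TensorFlow {tf.__version__}, "
               f"built with CUDA: {tf.test.is_built_with_cuda()}, "
               f"visible GPUs: {len(gpus)}")


   if __name__ == "__main__":
       print(describe_installation())

Run it with the interpreter of the new environment, that is ``/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin/python``, so that the check refers to the installation created above.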
Running TensorFlow on Discoverer+
---------------------------------

Once the installation has completed successfully as explained above, TensorFlow can be used through a Slurm batch job or run interactively with ``srun``. In both cases the job must use the default QoS of the Slurm account, which here is the QoS named ``your_slurm_project_account_name``. Otherwise TensorFlow will not be able to access the GPU devices on the compute nodes.

Running TensorFlow interactively
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is not a recommended way of running TensorFlow. Use this example for checks only!

For the tests we need a Python helper script that can be downloaded from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To download the script:

.. code-block:: bash

   cd /valhalla/projects/your_slurm_project_account_name/
   wget https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

In the example below we request 2 GPUs (``--gres=gpu:2``):

.. code-block:: bash

   srun -N 1 -n 2 --gres=gpu:2 \
        --partition=common \
        --account=your_slurm_project_account_name \
        --qos your_slurm_project_account_name \
        --time=00:30:00 --pty /bin/bash

Once the interactive session has started, load the CUDA module and run the test Python script that calls TensorFlow:

.. code-block:: bash

   module load nvidia/cuda/12/12.8

   # Put the environment's interpreter first in PATH without activating conda
   export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
   export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"

   python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

In case of successful execution, the following result is displayed:

.. code-block:: bash

   ============================================================
   TensorFlow GPU Detection Script
   ============================================================

   Library Import Check
   --------------------
   ✓ TensorFlow version: 2.16.1
   ✓ NumPy version: 1.26.4
   ✓ CUDA available: True

   CUDA and GPU Information
   ------------------------
   CUDA available: True
   CUDA version: 12.1
   Number of GPUs: 1

   GPU Details
   -----------
   GPU 0:
     Name: /physical_device:GPU:0
     Memory Total: 139.83 GB
     Memory Allocated: 0.00 GB
     Memory Cached: 0.00 GB
     Compute Capability: 9.0

   TensorFlow GPU Test
   -------------------
   Creating test tensors...
   Tensor A shape: (1000, 1000)
   Tensor B shape: (1000, 1000)
   Running matrix multiplication on GPU...
   Result shape: (1000, 1000)
   Computation time: 0.0123 seconds
   Device: /GPU:0
   GPU memory after computation: 0.08 GB
   Memory cleaned up successfully
   [SUCCESS] TensorFlow GPU test completed successfully!

   Neural Network Test
   -------------------
   Creating simple neural network...
   Model created successfully
   Running forward pass...
   Input shape: (32, 784)
   Output shape: (32, 10)
   Forward pass time: 0.0045 seconds
   Device: /GPU:0
   [SUCCESS] Neural network test completed successfully!

   Environment Information
   -----------------------
   Python version: 3.11.13 (main, Jun 5 2025, 13:12:00) [GCC 11.2.0]
   Platform: linux
   Current working directory: /home/username
   CUDA_HOME: /usr/local/cuda-12.8
   CUDA_PATH: /usr/local/cuda-12.8
   LD_LIBRARY_PATH: /usr/local/cuda-12.8/lib64
   VIRTUAL_ENV: /valhalla/projects/your_slurm_project_account_name/tensorflow_env

   ============================================================
   Test Summary
   ============================================================
   Tests passed: 3/3
   [SUCCESS] All tests passed! TensorFlow is working correctly.

Running TensorFlow within a Slurm batch script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create the following Slurm batch script:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=test_tensorflow
   #SBATCH --time=00:30:00
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --account=your_slurm_project_account_name
   #SBATCH --qos your_slurm_project_account_name
   #SBATCH -o test_tensorflow.%j.out
   #SBATCH -e test_tensorflow.%j.err

   # Put the environment's interpreter first in PATH without activating conda
   export PATH="/valhalla/projects/your_slurm_project_account_name/tensorflow_env/bin:$PATH"
   export VIRTUAL_ENV="/valhalla/projects/your_slurm_project_account_name/tensorflow_env"

   module load nvidia/cuda/12/12.8

   # Run from the directory the job was submitted from
   cd "$SLURM_SUBMIT_DIR"

   python /valhalla/projects/your_slurm_project_account_name/tensorflow_gpu_detection.py

and save it as ``/valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch``.

If ``tensorflow_gpu_detection.py`` is not yet present in the project folder, download it from:

https://gitlab.discoverer.bg/vkolev/snippets/-/raw/main/checks/tensorflow_gpu_detection.py

To submit the job to the queue:

.. code-block:: bash

   sbatch /valhalla/projects/your_slurm_project_account_name/test_tensorflow.sbatch

Once the job is submitted, you can check whether it is running by executing:

.. code-block:: bash

   squeue --me

If the job is running at the moment, information about its execution is presented as:

.. code-block:: bash

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1980    common  test_tf username  R       0:06      1 dgx1

The job creates two files in the directory from which it was submitted: one captures the standard output, and the other collects the standard error messages:

.. code-block:: bash

   test_tensorflow.1980.err
   test_tensorflow.1980.out

Here ``1980`` is the job ID. In your case that number will be different. The file ``test_tensorflow.1980.out`` contains the results, which should match those reported for the interactive execution.

Additional TensorFlow Libraries
-------------------------------

You may also want to install additional TensorFlow libraries, depending on your use case:

.. code-block:: bash

   # For TensorFlow Extended (TFX) - ML pipeline platform
   conda install \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         tfx -c conda-forge

   # For TensorFlow Probability - probabilistic programming
   conda install \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         tensorflow-probability -c conda-forge

   # For TensorFlow Datasets - ready-to-use datasets
   conda install \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         tensorflow-datasets -c conda-forge

   # For TensorFlow Hub - pre-trained models
   conda install \
         --prefix /valhalla/projects/your_slurm_project_account_name/tensorflow_env/ \
         tensorflow-hub -c conda-forge

Example Usage
-------------

Here is a simple example of how to use TensorFlow in your Python scripts:

.. code-block:: python

   import numpy as np
   import tensorflow as tf

   # Check GPU availability
   print("GPUs available:", tf.config.list_physical_devices('GPU'))

   # Create a simple neural network; an explicit Input layer is used because
   # recent Keras versions prefer it over input_shape on the first Dense layer
   model = tf.keras.Sequential([
       tf.keras.Input(shape=(784,)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dense(64, activation='relu'),
       tf.keras.layers.Dense(10, activation='softmax')
   ])

   # Compile the model
   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   # Generate some dummy data: 1000 samples, 784 features, 10 classes
   x_train = np.random.random((1000, 784)).astype(np.float32)
   y_train = np.random.randint(0, 10, (1000,)).astype(np.int32)

   # Train the model on the first GPU
   with tf.device('/GPU:0'):
       history = model.fit(x_train, y_train,
                           epochs=5,
                           batch_size=32,
                           verbose=1)

   print("Training completed successfully!")
   print(f"Final accuracy: {history.history['accuracy'][-1]:.4f}")
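
The example above pins training to ``/GPU:0``, while the batch script shown earlier requests two GPUs with ``--gres=gpu:2``. Before scaling a script up, it can help to confirm which devices the job actually received. The following is a minimal sketch under one assumption: that Slurm exports ``CUDA_VISIBLE_DEVICES`` for jobs that request GPUs, which is common but not confirmed here for Discoverer+. The helper name ``allocated_gpu_ids`` is hypothetical.

.. code-block:: python

   import os


   def allocated_gpu_ids(env=None):
       """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES (may be empty)."""
       env = os.environ if env is None else env
       raw = env.get("CUDA_VISIBLE_DEVICES", "")
       # The variable holds a comma-separated list such as "0,1".
       return [item.strip() for item in raw.split(",") if item.strip()]


   if __name__ == "__main__":
       gpu_ids = allocated_gpu_ids()
       print(f"GPUs allocated to this job: {len(gpu_ids)} {gpu_ids}")

An empty list means either that no GPU was requested or that the variable is not set on this system, so treat the result as a hint and cross-check it with ``tf.config.list_physical_devices('GPU')``.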