TensorFlow

Important

Discoverer HPC provides public access to the TensorFlow Optimizations from Intel backed by the Intel Distribution for Python. Both are included in the Intel oneAPI installation available in the public software repository.

Warning

No GPU accelerators are currently available on Discoverer HPC. That means the tensorflow_gpu package is not supported.
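
If you want to verify that from within TensorFlow itself, the following one-liner (executed after loading the module described in "Running TensorFlow" below) should report an empty list of GPU devices:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"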

Versions supported

1.x

Version 1.x is not officially supported on Discoverer HPC. If you cannot rewrite your Python code to match the version 2.x syntax requirements, you can install TensorFlow 1.x in your Personal scratch and storage folder (/discofs/username) by using Conda.
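
A minimal sketch of such an installation, assuming Conda is already available to you (for instance, through a Miniconda installation in your personal folder); the environment path and version pins below are only examples:

conda create --prefix /discofs/${USER}/tf1-env python=3.7 -y
conda activate /discofs/${USER}/tf1-env
pip install "tensorflow==1.15.*"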

2.x

Versions 2.x of TensorFlow are supported on Discoverer HPC as part of Intel oneAPI and the Intel Python distribution included there, publicly available in the public software repository. Running that version does not require setting up a virtual environment with Conda or pip.

Running TensorFlow

To load the TensorFlow 2 environment, load the module intel.universe.tensorflow from within your Slurm batch script:

module load intel.universe.tensorflow

Once loaded, that module provides access to the correct Python interpreter and the TensorFlow 2 module. In case you need to combine TensorFlow 2 with specific packages that are not included by default in the distribution, you can create a virtual environment based on the same Python interpreter.
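
For example, a minimal sketch of creating such an environment on top of the loaded module (the environment path and package name here are placeholders):

module load intel.universe.tensorflow
python -m venv --system-site-packages /discofs/${USER}/tf2-venv   # the venv inherits the TensorFlow 2 installation
source /discofs/${USER}/tf2-venv/bin/activate
pip install <additional_package>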

Checking the version

The easiest way to check the version of TensorFlow available in the software repository is to execute the following Slurm batch script:

#!/bin/bash
#
#SBATCH --partition=cn         # Partition name (ask the support team to clarify it)
#SBATCH --job-name=tf
#SBATCH --time=00:01:00        # WallTime - one minute is more than enough here
#SBATCH --account=<your_slurm_account_name_here>
#SBATCH --qos=<your_qos_name_here>

#SBATCH --nodes           1    # May vary
#SBATCH --ntasks-per-node 1    # Must be 1
#SBATCH --cpus-per-task   1    # Must be 1

#SBATCH -o slurm.check_tensorflow_version.out        # STDOUT
#SBATCH -e slurm.check_tensorflow_version.err        # STDERR

module purge
module load intel.universe.tensorflow

cd $SLURM_SUBMIT_DIR

python -c "import tensorflow;print('Tensorflow:',tensorflow.version.VERSION)"

To do that, store the script content in a file, for example /discofs/${USER}/check_tensorflow_version.sbatch, and submit it as a job to the queue:

cd /discofs/${USER}
sbatch check_tensorflow_version.sbatch

Then check the content of the file slurm.check_tensorflow_version.out to find out which version of TensorFlow is reported there.

Thread control

Consider adding thread control to your code. TensorFlow adopts the TBB threading model, and the best way to control it is from within the Python code that imports the tensorflow module (place this in the __main__ section of the code or, alternatively, in __init__.py):

import tensorflow as tf
num_threads = 16 # You need to estimate the optimum value here
tf.config.threading.set_inter_op_parallelism_threads(num_threads)
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.set_soft_device_placement(True)
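
Instead of hard-coding num_threads, you may derive it from the Slurm allocation so that it always matches --cpus-per-task. A minimal sketch, assuming the code is started from within a Slurm job:

import os
import tensorflow as tf

# Match the thread count to the number of CPU cores allocated by Slurm (falls back to 1)
num_threads = int(os.environ.get('SLURM_CPUS_PER_TASK', '1'))
tf.config.threading.set_inter_op_parallelism_threads(num_threads)
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.set_soft_device_placement(True)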

Slurm batch script (example)

Given below is an example of a Slurm batch script that runs Python code invoking TensorFlow:

#!/bin/bash
#
#SBATCH --partition=cn         # Partition name (ask the support team to clarify it)
#SBATCH --job-name=tf
#SBATCH --time=512:00:00       # WallTime - set it accordingly
#SBATCH --account=<your_slurm_account_name_here>
#SBATCH --qos=<your_qos_name_here>

#SBATCH --nodes           1    # May vary
#SBATCH --ntasks-per-node 1    # Must be 1 if MPI is not used
#SBATCH --cpus-per-task   16   # See the 'Thread control' section above to understand what
                               # number to supply here instead of 16 (16 is an example). You
                               # may run a series of benchmarks varying that number until you
                               # reach an optimal speed.

#SBATCH -o slurm.%j.out        # STDOUT
#SBATCH -e slurm.%j.err        # STDERR

module purge
module load intel.universe.tensorflow

export FI_PROVIDER=verbs
export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

python my_tf_based_code.py

where my_tf_based_code.py is your TensorFlow-based Python code.

Specify the parameters and resources required for successfully running and completing the job:

  • Slurm partition of compute nodes, based on your project resource reservation (--partition)
  • job name, under which the job will be seen in the queue (--job-name)
  • wall time for running the job (--time)
  • number of threads to use - that should match num_threads in the code example above (--cpus-per-task)

Save the complete Slurm job description as a file, for example /discofs/$USER/run_tf/tf.batch, and submit it to the queue:

cd /discofs/$USER/run_tf
sbatch tf.batch
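
You can then follow the status of the job in the queue and, once it starts, inspect its output (the job ID assigned by sbatch appears in the output file names because of the %j placeholder):

squeue -u ${USER}
tail -f slurm.<jobid>.out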

Getting help

See Getting help