Python virtual environments (GPU)

About

This document explains why Python virtual environments created and managed by Conda from Anaconda are the preferred method to install and use software packages on a per-user or per-project basis on the Discoverer+ GPU cluster.

Important

Understanding this approach is essential for effectively running your GPU-accelerated computing tasks on Discoverer+.

We provide a set of packages through environment modules, and they can boost productivity at the CPU level whenever that is critical. However, the most important packages, those whose performance depends on GPU acceleration through CUDA libraries, need to be installed in separate Python virtual environments.

Benefits of Python virtual environments

Note

Python virtual environments address situations where different tasks, or classes of tasks, need different packages or different versions of the same package. When you work on multiple tasks, each with its own dependency requirements, managing packages in a single shared global Python installation becomes problematic. In that case, creating and using a separate Python virtual environment for each task is probably the best approach.

By hosting separate Python virtual environments, we ensure that packages installed in one environment cannot conflict with packages installed in another. This isolation prevents version incompatibilities that could break existing projects when new packages are installed or updated. We can even create several Python virtual environments for the same topic, each containing different versions of the same packages.

Note

By using Conda, we formally create Python virtual environments. However, these environments may also contain packages with libraries and tools that do not require Python to be executed. Conda can manage not only Python packages but also system-level libraries, binaries, and tools that can be used independently of Python. This makes Conda environments more versatile than traditional Python-only virtual environments, allowing you to manage a complete software stack including non-Python dependencies.

To summarise, each Python virtual environment maintains its own independent set of Python packages, allowing users to:

Install packages without affecting other projects or users
Use specific package versions required by each project
Maintain reproducibility across different tasks and timeframes
Avoid dependency conflicts that can cause runtime errors

Python virtual environments vs containers

Python virtual environments supported by Conda from Anaconda offer a lightweight alternative to containers, without the overhead of a container runtime, while still providing isolation for different package versions and dependencies. They fit well with the resource management paradigm of Slurm on the Discoverer+ cluster. Access to GPUs (managed by Slurm as GRES resources) works the same way whether or not the application is installed in a Python virtual environment.
Creating and managing Python virtual environments with Conda is much simpler than creating and managing containers with Docker or Podman. It can be easily automated and does not require spending time on learning containerisation technologies.
Python virtual environments are easy to examine, manage (including enhanced package management), troubleshoot, and put under version control. They are also easy to share between the users of the same or different projects and can be copied between storage locations as easily as copying a single folder with its contents (for example, using rsync or scp/sftp).
Even if containers offer a more complete isolation of the environment, they have a higher overhead in terms of resource usage and startup time. They may also have a complex directory structure that is not easy to manage on the host system. In addition, the mapping of UID and GID POSIX attributes inside the container may cause problems when running jobs on a cluster controlled by Slurm.
Python virtual environments are a good compromise between a shared global installation and containers, as they offer a good balance of isolation and performance. They also better match the way Slurm manages the resources on the cluster on a per-user or per-project basis when running jobs.
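
To illustrate the point about version control, the package list of an environment can be written to a text file and kept alongside the project code. The sketch below assumes that the conda tool is already available in the session (on Discoverer+, through the anaconda3 environment module). The function names export_env_spec and recreate_env and the file name environment.yml are illustrative choices, not part of the Discoverer+ setup.

```shell
#!/bin/bash
# Sketch only: the helper names and file names are hypothetical.

# Write the package list of the environment at prefix $1 to the YAML file $2.
export_env_spec() {
  local prefix="$1"
  local spec_file="$2"
  conda env export --prefix "${prefix}" > "${spec_file}"
}

# Recreate an environment at the new prefix $1 from the YAML file $2.
recreate_env() {
  local new_prefix="$1"
  local spec_file="$2"
  conda env create --prefix "${new_prefix}" --file "${spec_file}"
}

# Example calls (paths are placeholders):
#   export_env_spec /path/to/virt_envs/myenv environment.yml
#   recreate_env /path/to/virt_envs/myenv-copy environment.yml
```

Recreating an environment is as I/O intensive as creating one, so it is probably best done inside a Slurm job rather than on the login node.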

Utilising Conda on Discoverer+ for managing Python virtual environments

Important

On Discoverer+, the Conda tool and a basic channel of locally installable packages come with a centralised Anaconda installation, which becomes accessible after loading the corresponding environment module:

module load anaconda3

Warning

We urge our users not to install Anaconda or Miniconda by themselves in the home or project folders on Discoverer+, because that wastes storage space. In practice, users run Conda quite seldom: once or twice per week on average, or sometimes only several times during the entire project lifecycle. From that perspective, installing separate Anaconda or Miniconda distributions in the home or project folders is inefficient.

Even if the packages in the Conda channels we use do not deliver the highest performance at the CPU level, we can live with that downside. We expect Discoverer+ to process mainly tasks that rely on CUDA to accelerate code on the GPU, rather than massive CPU-bound tasks on the host. This allows us to prioritise compatibility and stability of CUDA-linked packages over maximum CPU-level performance, so that the packages installed in the Python virtual environments support our GPU-focused computing paradigm effectively.

When working on multiple tasks with different topics and package dependencies, you may need to create and maintain several Python virtual environments. Each virtual environment should be created in a separate location and can host different packages or different versions of the same packages. The recommended approach is to create each virtual environment in a dedicated folder within your project directory structure. For example:

/valhalla/projects/<your_slurm_project_account_name>/virt_envs/
├── torch/          # PyTorch environment
├── tf/             # TensorFlow environment
├── ml-project1/    # Environment for specific ML project
├── data-science/   # Environment for data science tools
└── custom/         # Custom environment for other packages
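
The layout above can be handled with a small helper that turns an environment name into its full prefix. This is only a sketch: env_prefix is a hypothetical function and VIRT_ENVS_ROOT is an assumed variable that you would set to the virt_envs folder of your own project.

```shell
#!/bin/bash
# Sketch only: env_prefix is a hypothetical helper, not a site-provided tool.
# VIRT_ENVS_ROOT is an assumed variable holding the parent folder, e.g.
# /valhalla/projects/<your_slurm_project_account_name>/virt_envs

# Print the prefix of the environment named $1 and make sure that the
# parent folder exists. The environment folder itself is NOT created here,
# so that "conda create --prefix" can still create it from scratch.
env_prefix() {
  local name="$1"
  local root="${VIRT_ENVS_ROOT:?VIRT_ENVS_ROOT is not set}"
  mkdir -p "${root}" || return 1
  printf '%s/%s\n' "${root%/}" "${name}"
}

# Example call:
#   export VIRTUAL_ENV="$(env_prefix torch)"
```

Keeping the parent folder in one variable means that the scripts for all your environments differ only in the environment name.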

Unattended installation using Slurm batch job

Creating Python virtual environments through Slurm batch jobs is the recommended approach on Discoverer+. This ensures that the installation process does not compete for resources with other users on the login node.

Create a Slurm script (e.g. create_virtual_env.sh) with the content below. Adjust the account, partition/QoS, target directory, and Python version as needed.

Warning

We strongly discourage running installation tasks directly on the login node on the Discoverer+ system. This is because the login node is shared by all users and the installation tasks, which are usually highly I/O intensive, may actively compete for resources with other users’ scripts and processes and exhaust the login node resources. Therefore, the installation tasks should be run only as Slurm jobs.

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=create_env
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G

#SBATCH -o create_env.%j.out
#SBATCH -e create_env.%j.err

# Ensure that create_env.%j.out and create_env.%j.err are saved in the directory
# where you submit the job. Set the working directory of the Bash shell to the
# folder from which the script is launched.
cd "${SLURM_SUBMIT_DIR}" || { echo "Failed to enter ${SLURM_SUBMIT_DIR}. Exiting."; exit 1; }

# Unload all previously loaded modules (in case you inherit the Bash environment)
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit 1; }

# Load the module anaconda to access the conda tool
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit 1; }

# Export the path to the Python virtual environment folder
# Change the environment name (e.g., 'myenv') to match your specific use case
export VIRTUAL_ENV="/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/myenv"

# Check if the target folder already exists. Exit with a non-zero status,
# so that Slurm records the job as failed.
[ -d "${VIRTUAL_ENV}" ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit 1; }

# Now use conda to create fresh Python virtual environment
# Change the Python version (e.g., python=3.11) to match your requirements
conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

if [ $? -ne 0 ]; then
  echo "Conda Python virtual environment creation failed" >&2
  exit 1
fi

echo "Python virtual environment successfully created!"

# Fully expose the Python virtual environment
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# At this point, you can install packages specific to this virtual environment.
# For example:
# CONDA_OVERRIDE_CUDA=12.9 conda install --prefix ${VIRTUAL_ENV} \
#    -c conda-forge package1 package2 -y

echo "Virtual environment ready for package installation!"

Then submit the script to the Slurm queue:

sbatch create_virtual_env.sh

and wait for the job to finish.
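
If you want a script to wait for the job instead of checking the queue by hand, one possible approach is to poll squeue until the job no longer appears. The sketch below is an illustration only: wait_for_job is a hypothetical helper name and the 30-second polling interval is an assumed default.

```shell
#!/bin/bash
# Sketch only: wait_for_job is a hypothetical helper, not a Slurm command.

# Block until the job with ID $1 has left the queue.
# $2 is the number of seconds between polls (assumed default: 30).
wait_for_job() {
  local job_id="$1"
  local interval="${2:-30}"
  # "squeue -h" suppresses the header line, so empty output means that the
  # job is no longer pending or running.
  while [ -n "$(squeue -h -j "${job_id}" 2>/dev/null)" ]; do
    sleep "${interval}"
  done
}

# Example calls:
#   job_id="$(sbatch --parsable create_virtual_env.sh)"
#   wait_for_job "${job_id%%;*}"   # drop an optional ";cluster" suffix
```

The helper only reports that the job has left the queue. Whether the environment was actually created still has to be checked in the job output.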

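Once the job is finished, you can check whether the virtual environment was created successfully by inspecting the standard output stream stored in the file create_env.%j.out. Slurm replaces %j in the file name with the numeric ID of the job, so use that ID (shown here as <job_id>) when opening the file:

cat create_env.<job_id>.out

If the virtual environment was created successfully, you should see the following messages: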

Python virtual environment successfully created!
Virtual environment ready for package installation!

If the creation failed, you should see error messages indicating the cause of failure in create_env.%j.err.
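
The same check can be scripted. The sketch below searches the output file for the success message printed by the batch script above; env_creation_succeeded is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: env_creation_succeeded is a hypothetical helper name.

# Return success if the output file $1 contains the message that the
# batch script prints after "conda create" has finished without errors.
env_creation_succeeded() {
  local out_file="$1"
  grep -qF "Python virtual environment successfully created!" "${out_file}"
}

# Example call (replace <job_id> with the numeric ID of your job):
#   env_creation_succeeded "create_env.<job_id>.out" && echo "Environment is ready"
```

The -F option makes grep treat the message as a fixed string, so the punctuation in it is not interpreted as a pattern.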

Interactive creation using Slurm srun

Warning

This is not a recommended way of creating Python virtual environments. Use this example for checks only!

If you prefer to run the commands yourself in a Bash session, request an interactive session using srun:

srun --partition=common \
     --account=<your_slurm_project_account_name> \
     --qos=2cpu-single-host \
     --nodes=1 \
     --ntasks-per-node=2 \
     --cpus-per-task=1 \
     --mem=16G \
     --time=00:30:00 \
     --pty bash

Once the requested Bash session is started, run the following commands in it (copy and paste the commands shown below into the interactive session):

# Note: if any step below fails, "exit" closes the interactive session
# and releases the Slurm allocation.

# Unload all previously loaded modules
module purge || { echo "Failed to purge the loaded modules. Exiting."; exit 1; }

# Load the module anaconda to access the conda tool
module load anaconda3 || { echo "Failed to load anaconda3 module. Exiting."; exit 1; }

# Export the path to the Python virtual environment folder
# Change the environment name (e.g., 'myenv') to match your specific use case
export VIRTUAL_ENV="/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/myenv"

# Check if the target folder already exists
[ -d "${VIRTUAL_ENV}" ] && { echo "The folder ${VIRTUAL_ENV} exists. Exiting."; exit 1; }

# Create fresh Python virtual environment
# Change the Python version (e.g., python=3.11) to match your requirements
conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

if [ $? -ne 0 ]; then
  echo "Conda Python virtual environment creation failed" >&2
  exit 1
fi

echo "Python virtual environment successfully created!"

# Fully expose the Python virtual environment
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# At this point, you can install packages specific to this virtual environment.
# For example:
# CONDA_OVERRIDE_CUDA=12.9 conda install --prefix ${VIRTUAL_ENV} \
#    -c conda-forge package1 package2 -y

echo "Virtual environment ready for package installation!"

The creation stops if the folder that should contain the Python virtual environment already exists. In that case, you need to remove it first:

rm -rf "${VIRTUAL_ENV}"

Then retry the creation.
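
Because rm -rf deletes whatever path it is given, it may be worth guarding the removal. The sketch below relies on the fact that a Conda environment contains a conda-meta folder and refuses to delete anything else; remove_env is a hypothetical helper name. A folder left behind by a creation that failed very early may lack conda-meta and would then still need to be removed by hand.

```shell
#!/bin/bash
# Sketch only: remove_env is a hypothetical helper name.

# Remove the environment at prefix $1, but only if it looks like a
# Conda environment (it contains a conda-meta folder).
remove_env() {
  local prefix="$1"
  if [ -z "${prefix}" ] || [ ! -d "${prefix}/conda-meta" ]; then
    echo "Refusing to remove '${prefix}': not a Conda environment folder." >&2
    return 1
  fi
  rm -rf -- "${prefix}"
}

# Example call:
#   remove_env "${VIRTUAL_ENV}"
```

The empty-string check protects against an unset VIRTUAL_ENV variable, which is the most likely way to pass a wrong path by accident.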

Managing multiple virtual environments

Once you have created multiple virtual environments, you can manage them independently:

Each virtual environment has its own Python interpreter and package installations (own bin/ and lib/ folders)
You can activate different environments by setting the PATH and VIRTUAL_ENV environment variables in your Slurm scripts
Different environments can have different Python versions
Different environments can host different packages or different versions of the same packages
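
To get an overview of the points above, the environments under one parent folder can be listed together with their Python versions. This is a minimal sketch: list_envs is a hypothetical helper, and it assumes that every environment is a direct subfolder of the folder passed as its argument.

```shell
#!/bin/bash
# Sketch only: list_envs is a hypothetical helper name.

# Print "<environment name><TAB><Python version>" for every subfolder of $1
# that contains its own Python interpreter.
list_envs() {
  local root="$1"
  local dir
  for dir in "${root}"/*/; do
    # Skip folders that do not contain a Python interpreter.
    [ -x "${dir}bin/python" ] || continue
    printf '%s\t%s\n' "$(basename "${dir}")" "$("${dir}bin/python" --version 2>&1)"
  done
}

# Example call (the path is a placeholder for your own project folder):
#   list_envs /valhalla/projects/<your_slurm_project_account_name>/virt_envs
```

Each interpreter is called through its full path, so the listing does not depend on the current PATH.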

To use a specific virtual environment in your Slurm batch script, add the following lines after loading required modules:

export VIRTUAL_ENV="/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/myenv"
[ -d "${VIRTUAL_ENV}" ] || { echo "The folder ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }
export PATH="${VIRTUAL_ENV}/bin:${PATH}"
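
If several job scripts use the same pattern, the three lines can be wrapped in a function. The sketch below is one possible form: use_env is a hypothetical helper name, and the extra check for bin/python is an added assumption about what a usable environment should contain.

```shell
#!/bin/bash
# Sketch only: use_env is a hypothetical helper name.

# Expose the environment at prefix $1 by setting VIRTUAL_ENV and by putting
# its bin folder first in PATH. Fail if it has no Python interpreter.
use_env() {
  local prefix="$1"
  if [ ! -x "${prefix}/bin/python" ]; then
    echo "No Python interpreter found in ${prefix}/bin." >&2
    return 1
  fi
  export VIRTUAL_ENV="${prefix}"
  export PATH="${VIRTUAL_ENV}/bin:${PATH}"
}

# Example call in a Slurm batch script:
#   use_env "/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/myenv" || exit 1
#   python --version
```

Because the bin folder of the environment comes first in PATH, the python command resolves to the interpreter of that environment.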

Warning

In most cases, you do not need the classic Conda activation mechanism on Discoverer+. Use the conda activate command only when a particular virtual environment cannot be used without it.

For examples of how to create and utilise virtual environments for specific packages like PyTorch or TensorFlow, see:

PyTorch (GPU)

TensorFlow (GPU)