Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM

This guide covers the complete process of deploying Mistral AI’s Mixtral 8x7B and Mixtral 8x22B on a single DGX H200 node within a SLURM-managed cluster (Discoverer+), using Conda for environment management and vLLM for inference.

Contents

  1. Model overview

  2. Hardware and software prerequisites

  3. Environment setup with Conda on Discoverer+

  4. Installing vLLM in the Conda environment

  5. Baseline deployment

  6. Memory layout and GPU allocation

  7. Expert parallelism

  8. KV cache optimisation

  9. Chunked prefill and scheduler tuning

  10. Speculative decoding with n-gram prompt lookup

  11. MoE Triton kernel tuning

  12. Full optimised SLURM job scripts

  13. Benchmarking

  14. Known caveats and constraints


1. Model overview

Mixtral 8x7B (released December 2023) and Mixtral 8x22B (released April 2024) are sparse Mixture-of-Experts language models from Mistral AI, both licensed under Apache 2.0.

| Property | Mixtral 8x7B | Mixtral 8x22B |
| --- | --- | --- |
| Total parameters | 46.7B | 141B |
| Active parameters per token | 12.9B (top-2 of 8 experts) | 39B (top-2 of 8 experts) |
| Experts per layer | 8 | 8 |
| Active experts per token | 2 | 2 |
| Attention mechanism | GQA | GQA |
| Context window | 32,768 tokens | 65,536 tokens |
| BF16 VRAM requirement | ~94 GB | ~263 GB |
| Licence | Apache 2.0 | Apache 2.0 |
| Hugging Face identifier (base) | mistralai/Mixtral-8x7B-v0.1 | mistralai/Mixtral-8x22B-v0.1 |
| Hugging Face identifier (instruct) | mistralai/Mixtral-8x7B-Instruct-v0.1 | mistralai/Mixtral-8x22B-Instruct-v0.1 |

Both models are natively supported in vLLM without --trust-remote-code. Neither model uses MLA attention or emits reasoning tokens, and neither requires a tool-call parser, so deployment is substantially simpler than for Kimi K2.5/K2.6.

The active parameter count governs compute per forward pass. Mixtral 8x7B activates 12.9B parameters per token, so each token costs roughly the compute of a 13B dense model; Mixtral 8x22B activates 39B, comparable to a 39B dense model. Memory, by contrast, is governed by the total parameter count, because all expert weights must still be loaded into VRAM.
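The gap between compute cost and memory cost can be made concrete with a short calculation. The sketch below is illustrative only: the function name active_fraction_pct is hypothetical, and the inputs are the parameter counts from the model overview, expressed in tenths of a billion so that shell integer arithmetic suffices.

```shell
# Percentage of the loaded weights that take part in processing one token.
# Arguments are in units of 0.1B parameters (129 means 12.9B).
active_fraction_pct() {
    local active_tenths=$1 total_tenths=$2
    # Add half the divisor before dividing to round to the nearest percent.
    echo $(( (active_tenths * 100 + total_tenths / 2) / total_tenths ))
}

echo "Mixtral 8x7B:  $(active_fraction_pct 129 467)% of weights active per token"
echo "Mixtral 8x22B: $(active_fraction_pct 390 1410)% of weights active per token"
```

Both models come out at about 28%: roughly a quarter of the weights do work for any given token, but all of them occupy VRAM.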


2. Hardware and software prerequisites

DGX H200 system specifications

The NVIDIA DGX H200 provides the following hardware relevant to this deployment:

| Component | Specification |
| --- | --- |
| GPUs | 8x NVIDIA H200 SXM Tensor Core GPU |
| GPU memory | 141 GB HBM3e per GPU, 1,128 GB total |
| GPU memory bandwidth | 4.8 TB/s per GPU |
| GPU interconnect | 18x NVLink 4.0 connections per GPU, 900 GB/s bidirectional per GPU |
| NVSwitch | 4x NVSwitch, 7.2 TB/s aggregate bidirectional GPU-to-GPU bandwidth |
| Host CPUs | 2x Intel Xeon Platinum 8480C, 112 cores total |
| System memory | 2 TB DDR5 |
| NVMe storage | 8x 3.84 TB (data), 2x 1.92 TB (OS) |
| Network | 10x ConnectX-7, 400 Gb/s InfiniBand/Ethernet |

Both Mixtral models fit comfortably on a single DGX H200 node with substantial KV cache headroom remaining. Mixtral 8x7B requires only 1–2 GPUs for weights; Mixtral 8x22B requires 2–4 GPUs. Neither model requires a full 8-GPU allocation, which is an important consideration on a shared cluster with only 4 nodes in total. The scripts in this guide use TP=2 for 8x7B and TP=4 for 8x22B, leaving the remaining GPUs available for other users.
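One way to see why these tensor-parallel sizes are sufficient is to divide the weight footprint by the TP size and compare the result with the 141 GB available on each GPU. This is a rough sketch using the whole-GB figures quoted in this guide; weights_per_gpu_gb is an illustrative helper, and the calculation deliberately ignores activations, CUDA overhead, and the KV cache.

```shell
# Per-GPU share of the weights at a given tensor-parallel size, rounded up
# to a whole GB.
weights_per_gpu_gb() {
    local total_weights_gb=$1 tp=$2
    echo $(( (total_weights_gb + tp - 1) / tp ))
}

gpu_memory_gb=141   # HBM3e per H200, as listed in the hardware table

for tp in 1 2 4; do
    for spec in "8x7B:94" "8x22B:263"; do
        name=${spec%%:*}
        total=${spec##*:}
        share=$(weights_per_gpu_gb "$total" "$tp")
        if [ "$share" -lt "$gpu_memory_gb" ]; then
            echo "Mixtral $name TP=$tp: $share GB weights per GPU, $(( gpu_memory_gb - share )) GB left"
        else
            echo "Mixtral $name TP=$tp: $share GB weights per GPU, does not fit"
        fi
    done
done
```

Mixtral 8x22B technically fits at TP=2, but with only about 9 GB per GPU left over, which is why this guide uses TP=4 for it.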

Software requirements

| Component | Minimum version | Notes |
| --- | --- | --- |
| CUDA toolkit | 12.1 | FP8 KV cache requires 11.8 or later |
| NVIDIA driver | 535.x | 560+ recommended |
| Python | 3.11 | as specified in the Conda environment |
| vLLM | 0.19.1 | pin this version for stability |
| PyTorch | 2.5+ | installed as a vLLM dependency |

On Discoverer+, CUDA libraries are provided through the cluster environment module system and do not need to be installed manually inside the Conda environment.


3. Environment setup with Conda on Discoverer+

On Discoverer+, Conda is provided through the centralised Anaconda installation and accessed via the module system. Do not install a separate Anaconda or Miniconda distribution in your home or project directory.

The recommended location for virtual environments on Discoverer+ is:

/valhalla/projects/<your_slurm_project_account_name>/virt_envs/

Creating the vLLM environment via a SLURM batch job

Environment creation must not be run on the login node. Submit a SLURM batch job instead.

Save the following as create_vllm_mixtral_env.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=create_vllm_mixtral_env
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G

#SBATCH -o create_vllm_mixtral_env.%j.out
#SBATCH -e create_vllm_mixtral_env.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

[ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

if [ $? -ne 0 ]; then
    echo "Conda environment creation failed." >&2
    exit 1
fi

echo "Conda environment created successfully."
export PATH=${VIRTUAL_ENV}/bin:${PATH}

echo "Environment ready for vLLM installation."

Submit and verify:

sbatch create_vllm_mixtral_env.sh
cat create_vllm_mixtral_env.<jobid>.out

4. Installing vLLM in the Conda environment

Why pip is used inside the Conda environment

Conda is the preferred package manager on Discoverer+. For vLLM, the conda-forge channel only carries versions up to 0.10.x, significantly behind the 0.19.1 release. The vLLM project distributes current releases exclusively through PyPI wheels, so pip is necessary for this package.

Pip installs into the Conda environment provided that export PATH=${VIRTUAL_ENV}/bin:${PATH} is set before calling pip. This causes pip to install all packages into ${VIRTUAL_ENV}/lib/python3.11/site-packages/ — entirely within the project storage path on /valhalla. Nothing is written to ~/.local or the home directory.

To confirm the correct pip binary is active at any point during a job:

which pip
# must print: /valhalla/projects/<account>/virt_envs/vllm-mixtral/bin/pip
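The same check can be turned into a guard that stops a job before anything is installed in the wrong place. A minimal sketch; is_inside_env is a hypothetical helper name, and it only compares paths, so it is safe to call at any point in a script.

```shell
# Succeeds only if the given executable path lies under <env_prefix>/bin/.
# Both arguments are plain paths; nothing is executed.
is_inside_env() {
    local tool_path=$1 env_prefix=$2
    case "$tool_path" in
        "$env_prefix"/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside a job script, after PATH has been exported:
#   is_inside_env "$(command -v pip)" "${VIRTUAL_ENV}" \
#       || { echo "pip is not from ${VIRTUAL_ENV}. Exiting." >&2; exit 1; }
```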

Installing vLLM via a SLURM batch job

Save the following as install_vllm_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=install_vllm_mixtral
#SBATCH --time=01:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G

#SBATCH -o install_vllm_mixtral.%j.out
#SBATCH -e install_vllm_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

[ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

export PATH=${VIRTUAL_ENV}/bin:${PATH}

echo "Using pip at: $(which pip)"

pip install "vllm==0.19.1"

if [ $? -ne 0 ]; then
    echo "vLLM installation failed." >&2
    exit 1
fi

pip install "huggingface_hub[cli]"

echo "vLLM installation complete."
echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

Submit:

sbatch install_vllm_mixtral.sh

Verify in the job output that the install location is under /valhalla and not ~/.local.

Downloading model weights

Both Mixtral models are available on Hugging Face under Apache 2.0. Download to project storage, not to the home directory. Adjust the model identifier and directory for whichever model you are deploying.

Save as download_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=download_mixtral
#SBATCH --time=02:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G

#SBATCH -o download_mixtral.%j.out
#SBATCH -e download_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

# Set MODEL_ID to the desired variant:
#   mistralai/Mixtral-8x7B-Instruct-v0.1   (~94 GB BF16)
#   mistralai/Mixtral-8x22B-Instruct-v0.1  (~263 GB BF16)
MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

huggingface-cli download ${MODEL_ID} \
    --local-dir ${MODEL_DIR} \
    --local-dir-use-symlinks False

echo "Download complete. Weights at ${MODEL_DIR}."

Approximate download times: 8x7B (94 GB) — under 1 hour; 8x22B (263 GB) — 1–2 hours.
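Because a download of this size can be interrupted, it is worth confirming that every weight shard arrived before starting a server. The sketch below assumes the checkpoint uses the usual Hugging Face sharded layout with a model.safetensors.index.json file in the model directory; missing_shards is an illustrative name, not part of any tool.

```shell
# Print the shard files named in the safetensors index that are absent from
# the model directory, one per line. Prints the index path itself if the
# index is missing. No output means every listed shard is present.
missing_shards() {
    local model_dir=$1
    local index="${model_dir}/model.safetensors.index.json"
    [ -f "$index" ] || { echo "$index"; return 0; }
    # Shard names are the only quoted strings in the index ending in .safetensors.
    grep -o '"[^"]*\.safetensors"' "$index" | tr -d '"' | sort -u |
    while read -r shard; do
        [ -f "${model_dir}/${shard}" ] || echo "$shard"
    done
}

# Usage at the end of the download job:
#   missing=$(missing_shards "${MODEL_DIR}")
#   [ -z "$missing" ] || { echo "Incomplete download, missing: $missing" >&2; exit 1; }
```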


5. Baseline deployment

The following SLURM jobs start vLLM inference servers for each model. All flags listed are required or strongly recommended for correct behaviour. The two models use different GPU counts and tensor parallel sizes, so separate scripts are provided.

Mixtral 8x7B baseline

Save as serve_mixtral_8x7b_baseline.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o vllm_mixtral_8x7b.%j.out
#SBATCH -e vllm_mixtral_8x7b.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 2 \
    --dtype bfloat16

Mixtral 8x22B baseline

Save as serve_mixtral_8x22b_baseline.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G

#SBATCH -o vllm_mixtral_8x22b.%j.out
#SBATCH -e vllm_mixtral_8x22b.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 4 \
    --dtype bfloat16

Why no --trust-remote-code

Both Mixtral architectures are natively registered in vLLM. The flag is not required and should not be passed unless loading a custom or modified checkpoint.

Why no --reasoning-parser or --tool-call-parser

Mixtral models do not emit structured reasoning tokens. Tool calling uses standard function-calling syntax handled natively by vLLM’s OpenAI-compatible API without a model-specific parser.


6. Memory layout and GPU allocation

VRAM consumption breakdown (BF16 weights)

| Component | Mixtral 8x7B (TP=2) | Mixtral 8x22B (TP=4) |
| --- | --- | --- |
| All expert weights (BF16) | ~94 GB total, ~47 GB per GPU | ~263 GB total, ~66 GB per GPU |
| Activations and CUDA overhead | ~2–4 GB per GPU | ~2–4 GB per GPU |
| KV cache (remainder) | ~90 GB per GPU | ~70 GB per GPU |

Each H200 GPU has 141 GB of HBM3e. With TP=2 for 8x7B and TP=4 for 8x22B, both models leave substantial KV cache headroom, much more than the Kimi models do, because the weights are significantly smaller.

GPU memory utilisation

--gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 is safe for both models on H200 and recovers additional KV cache space.

Context length

--max-model-len 32768   # for Mixtral 8x7B
--max-model-len 65536   # for Mixtral 8x22B

The config.json for Mixtral 8x7B sets max_position_embeddings to 32,768 and sliding_window to null — there is no active sliding window attention. The practical maximum for vLLM serving is therefore 32,768 tokens, which is what --max-model-len 32768 uses in this guide. Mixtral 8x22B sets max_position_embeddings to 65,536 tokens. Do not leave --max-model-len at the model default if you are serving short-context workloads, as the KV cache reservation is proportional to this value.

GPU count and shared cluster etiquette

Discoverer+ has 4 DGX H200 nodes and 32 GPUs in total. Using only 2 GPUs for Mixtral 8x7B and 4 GPUs for Mixtral 8x22B leaves the majority of the node available for other users. Do not request 8 GPUs for either model.


7. Expert parallelism

Background

Expert parallelism (EP) assigns different experts to different GPUs rather than sharding each expert’s weight matrix across all TP ranks. For Mixtral’s 8 experts distributed across the TP group, EP reduces the inter-GPU communication volume per forward pass because tokens route to the GPU holding the relevant expert rather than all-reducing partial results across all GPUs.

Enabling expert parallelism

--enable-expert-parallel

This flag changes the communication pattern of the MoE layers and is only effective when tensor-parallel-size × data-parallel-size > 1. Both the TP=2 (8x7B) and TP=4 (8x22B) configurations satisfy this condition.

Mixtral has 8 experts per layer, so each GPU holds 4 experts at TP=2 and 2 experts at TP=4. The flag is beneficial for both configurations.
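The split of experts across GPUs can be pictured with a small helper. This is only an illustration: it assumes experts are assigned to ranks in contiguous groups, which may not match the placement vLLM actually uses, and experts_on_rank is a hypothetical name.

```shell
# Expert indices held by one rank when num_experts are divided evenly and
# contiguously across num_ranks (assumed layout, for illustration only).
experts_on_rank() {
    local num_experts=$1 num_ranks=$2 rank=$3
    local per_rank=$(( num_experts / num_ranks ))
    local e=$(( rank * per_rank ))
    local last=$(( (rank + 1) * per_rank - 1 ))
    local out=""
    while [ "$e" -le "$last" ]; do
        out="${out:+$out }$e"
        e=$(( e + 1 ))
    done
    echo "$out"
}

for rank in 0 1; do
    echo "TP=2 rank $rank holds experts: $(experts_on_rank 8 2 "$rank")"
done
for rank in 0 1 2 3; do
    echo "TP=4 rank $rank holds experts: $(experts_on_rank 8 4 "$rank")"
done
```

Under this layout a token routed to experts 1 and 6 at TP=4 touches only ranks 0 and 3, rather than requiring every rank to compute a slice of both experts.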


8. KV cache optimisation

FP8 KV cache

--kv-cache-dtype fp8

Quantising the KV cache from BF16 to FP8 halves memory per cached token and reduces memory bandwidth during attention decode steps. Requires CUDA 11.8 or later. Validated on H200 (Hopper architecture) by the vLLM team. Both Mixtral models use GQA, which already results in smaller KV caches than standard MHA; FP8 halves this further.

Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to 1.0. For better accuracy under extreme quantisation conditions, supply a calibrated scale file via --quantization-param-path.
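To see what halving means in absolute terms, the per-token KV footprint can be computed from the model shape. A minimal sketch: the layer counts (32 for 8x7B, 56 for 8x22B), the 8 KV heads, and the head dimension of 128 are assumptions taken from the published Mixtral configurations rather than from this guide, so check them against config.json in your checkpoint. The function names are illustrative.

```shell
# KV cache bytes for one token, summed over all TP ranks:
# 2 (K and V) x layers x KV heads x head dim x bytes per element
# (2 for BF16, 1 for FP8).
kv_bytes_per_token() {
    local layers=$1 kv_heads=$2 head_dim=$3 bytes_per_elem=$4
    echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
}

# KV cache size in MiB for one sequence of the given length.
kv_mib_for_sequence() {
    local tokens=$1 bytes_per_token=$2
    echo $(( tokens * bytes_per_token / 1024 / 1024 ))
}

for bytes_per_elem in 2 1; do
    small=$(kv_bytes_per_token 32 8 128 "$bytes_per_elem")   # assumed 8x7B shape
    large=$(kv_bytes_per_token 56 8 128 "$bytes_per_elem")   # assumed 8x22B shape
    echo "bytes/element=$bytes_per_elem:" \
         "8x7B @32768 tokens = $(kv_mib_for_sequence 32768 "$small") MiB," \
         "8x22B @65536 tokens = $(kv_mib_for_sequence 65536 "$large") MiB"
done
```

Under these assumptions a full 32,768-token sequence on 8x7B needs 4,096 MiB of KV cache in BF16 and 2,048 MiB in FP8, so the same memory budget holds twice as many cached tokens.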

Prefix caching

--enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating redundant prefill computation. Particularly effective for RAG workloads that prepend the same system prompt and document chunks across many requests. Enabled by default in vLLM V1; specify explicitly if using an older version.
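The idea behind prefix reuse can be shown with a toy model. This is not vLLM's implementation: it assumes tokens are grouped into fixed-size blocks and that each block's hash also covers the hash of the block before it, so a block is reusable only if the entire prefix up to it is identical. The function names, the block size of 4, and the use of cksum as the hash are all choices made for this sketch.

```shell
# Emit one chained hash per complete block of block_size tokens.
# Tokens are passed as separate arguments after the block size.
block_hashes() {
    local block_size=$1
    shift
    local prev="" chunk="" count=0 tok
    for tok in "$@"; do
        chunk+="${tok},"
        count=$(( count + 1 ))
        if [ "$count" -eq "$block_size" ]; then
            # The previous block's hash is part of the input, forming a chain.
            prev=$(printf '%s|%s' "$prev" "$chunk" | cksum | cut -d' ' -f1)
            echo "$prev"
            chunk=""
            count=0
        fi
    done
}

# Count the leading blocks whose chained hashes match in two hash lists.
shared_prefix_blocks() {
    local -a first=($1) second=($2)
    local n=0
    while [ "$n" -lt "${#first[@]}" ] && [ "$n" -lt "${#second[@]}" ] \
          && [ "${first[$n]}" = "${second[$n]}" ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

system_prompt="11 12 13 14 15 16 17 18"   # made-up token IDs shared by both requests
request_a=$(block_hashes 4 $system_prompt 21 22 23 24)
request_b=$(block_hashes 4 $system_prompt 31 32 33 34)
echo "Reusable blocks: $(shared_prefix_blocks "$request_a" "$request_b")"
```

The two requests share the two blocks that cover the common system prompt; the third block differs, so only that one would need a fresh prefill.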

CPU offload (swap space)

--swap-space 16

The value is in GiB per GPU. A lower value than the Kimi guides is appropriate here because both Mixtral models leave far more KV cache headroom; swap is less likely to be needed. Adjust upward if preemption warnings appear in the vLLM logs:

WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

9. Chunked prefill and scheduler tuning

Chunked prefill

--enable-chunked-prefill \
--max-num-batched-tokens 8192

Chunked prefill breaks large prefill computations into chunks interleaved with decode steps, preventing single long-context requests from blocking decode throughput for all other in-flight requests. Particularly relevant for Mixtral 8x22B with its 64K context window.

--max-num-batched-tokens controls the total tokens processed per scheduling step. A value of 8,192 is a reasonable starting point for both models on H200.
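A back-of-the-envelope model shows how the token budget spreads a long prefill over several scheduling steps. It is a simplification, not a description of the vLLM scheduler: it assumes each in-flight decode sequence takes one token of the budget per step and that the remainder goes to a single prefilling request. prefill_steps is a hypothetical helper.

```shell
# Number of scheduling steps needed to prefill a prompt when the per-step
# token budget is shared with sequences that are already decoding.
prefill_steps() {
    local prompt_tokens=$1 budget=$2 decoding_seqs=$3
    local chunk=$(( budget - decoding_seqs ))
    if [ "$chunk" -le 0 ]; then
        echo "token budget leaves no room for prefill" >&2
        return 1
    fi
    # Ceiling division: the last chunk may be partial.
    echo $(( (prompt_tokens + chunk - 1) / chunk ))
}

echo "Idle server:      $(prefill_steps 65536 8192 0) steps"
echo "64 decoding seqs: $(prefill_steps 65536 8192 64) steps"
```

With a budget of 8,192, a 65,536-token prompt takes 8 steps on an idle server and 9 when 64 sequences are decoding. A smaller budget lengthens the prefill but lets running requests emit tokens more often in between.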

Max concurrent sequences

--max-num-seqs 256

The default is 256. Reduce if KV cache OOM errors occur under high concurrency; increase if GPU utilisation is consistently below 80%. Both Mixtral models have substantial remaining KV cache headroom, so the default 256 is generally viable without reduction.


10. Speculative decoding with n-gram prompt lookup

Neither Mistral AI nor the vLLM community has published dedicated draft models trained specifically on the Mixtral architecture. The appropriate speculative decoding strategy for Mixtral on vLLM is n-gram prompt lookup decoding, which requires no additional model download.

How n-gram speculative decoding works

N-gram speculative decoding matches the last N tokens of the current generation against occurrences of those same tokens in the input prompt, then proposes the tokens that follow in the prompt as draft candidates. The main model verifies them in a single forward pass. This is particularly effective for RAG workloads where the model is likely to quote or closely paraphrase retrieved document content.

No additional VRAM is required — the draft proposals are generated from the input context without loading any additional model weights.
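The lookup itself is simple enough to sketch in a few lines. The following toy version works on space-separated words instead of token IDs and is not vLLM's code; trying the longest n-gram first and taking the earliest match in the prompt are assumptions of this sketch, and both function names are hypothetical.

```shell
# Propose up to k draft tokens by matching the last n generated tokens
# against the prompt.
#   $1 = n, $2 = k, $3 = prompt tokens, $4 = tokens generated so far
# Prints the tokens that follow the earliest match, or nothing.
ngram_propose() {
    local n=$1 k=$2
    local -a prompt=($3) generated=($4)
    local plen=${#prompt[@]} glen=${#generated[@]}
    local start j matched from
    [ "$glen" -ge "$n" ] || return 0
    # Stop early enough that at least one token follows the match.
    for (( start = 0; start + n < plen; start++ )); do
        matched=1
        for (( j = 0; j < n; j++ )); do
            if [ "${prompt[start + j]}" != "${generated[glen - n + j]}" ]; then
                matched=0
                break
            fi
        done
        if [ "$matched" -eq 1 ]; then
            from=$(( start + n ))
            echo "${prompt[@]:$from:$k}"
            return 0
        fi
    done
    return 0
}

# Try n-gram lengths from max_n down to min_n and return the first proposal.
ngram_propose_range() {
    local min_n=$1 max_n=$2 k=$3 n draft
    for (( n = max_n; n >= min_n; n-- )); do
        draft=$(ngram_propose "$n" "$k" "$4" "$5")
        if [ -n "$draft" ]; then
            echo "$draft"
            return 0
        fi
    done
    return 0
}

prompt="the cat sat on the mat and the cat slept"
generated="then the cat"
echo "Draft: $(ngram_propose_range 2 4 3 "$prompt" "$generated")"
```

Here the last two generated words, "the cat", occur in the prompt, so the three words that follow them there ("sat on the") become the draft. The main model then accepts or rejects them in one verification pass.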

Enabling n-gram speculative decoding

The verified syntax in vLLM 0.19.1 is:

--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Fields:

  • method: must be "ngram" exactly

  • num_speculative_tokens: number of tokens proposed per step; 5 is a reasonable starting value for RAG workloads

  • prompt_lookup_min: minimum n-gram length to match; 2 means at least a 2-token match is required before proposing

  • prompt_lookup_max: maximum n-gram length to search; larger values find more specific matches but incur slightly higher search cost per step

Effectiveness

N-gram speculative decoding provides a meaningful latency reduction only when the model output closely follows the input prompt — which is the case for document summarisation, extraction, and RAG answer generation. For open-ended generation tasks where the model does not repeat input text, the acceptance rate will be low and the overhead may reduce throughput marginally. Benchmark both configurations on your representative workload before deploying n-gram speculation to production.


11. MoE Triton kernel tuning

Without a tuned configuration, vLLM logs at startup:

WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The benchmark_moe.py script writes a hardware-specific JSON file into a target directory. Setting VLLM_TUNED_CONFIG_FOLDER to that directory before serving causes vLLM to load it automatically. Run separate tuning jobs for 8x7B and 8x22B since their expert dimensions differ.

Running the tuning script via SLURM

Save as tune_moe_mixtral.sh, adjusting MODEL_PATH, TUNING_DIR, and --tp-size for the model you are tuning:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=tune_moe_mixtral
#SBATCH --time=02:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o tune_moe_mixtral.%j.out
#SBATCH -e tune_moe_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# Adjust for 8x7B (tp-size 2) or 8x22B (tp-size 4 with gres=gpu:4)
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

mkdir -p ${TUNING_DIR}

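# benchmark_moe.py ships in the vLLM source repository, not in the pip wheel.
# This relative path assumes the job is submitted from the root of a vLLM
# checkout that matches the installed version.
# --dtype auto tunes the unquantised kernels for the checkpoint's native BF16
# weights; the script does not accept "bfloat16" as a value.
python benchmarks/kernels/benchmark_moe.py \
    --model ${MODEL_PATH} \
    --tp-size 2 \
    --dtype auto \
    --tune \
    --save-dir ${TUNING_DIR}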

echo "Tuning complete. Config written to ${TUNING_DIR}."

For Mixtral 8x22B, change --gres=gpu:4, --tp-size 4, and both MODEL_PATH and TUNING_DIR accordingly.

The tuning run takes 30–90 minutes. Re-run if you change GPU count, TP size, or model.

Loading the tuned configuration at serve time

export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

Set this before the vllm serve call. vLLM logs confirmation of loading it at startup.
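A directory that exists but is empty, for instance after a tuning job that timed out, would still be exported by a plain directory test. The guard below is a small sketch of a stricter check; has_tuned_configs is an illustrative name, and it only assumes that the tuning output is one or more .json files directly inside the directory.

```shell
# Succeeds if the directory contains at least one regular .json file.
has_tuned_configs() {
    local dir=$1 f
    for f in "$dir"/*.json; do
        # If nothing matches, the pattern stays literal and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Usage in a serve script, before the vllm serve call:
#   if has_tuned_configs "${TUNING_DIR}"; then
#       export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
#   else
#       echo "No tuned MoE config in ${TUNING_DIR}; using vLLM defaults." >&2
#   fi
```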


12. Full optimised SLURM job scripts

Mixtral 8x7B — full optimised script

Save as serve_mixtral_8x7b_optimised.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b_opt
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o vllm_mixtral_8x7b_opt.%j.out
#SBATCH -e vllm_mixtral_8x7b_opt.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 32768 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --swap-space 16 \
    --enable-expert-parallel \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

sbatch serve_mixtral_8x7b_optimised.sh

Mixtral 8x22B — full optimised script

Save as serve_mixtral_8x22b_optimised.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b_opt
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G

#SBATCH -o vllm_mixtral_8x22b_opt.%j.out
#SBATCH -e vllm_mixtral_8x22b_opt.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b

[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 65536 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --swap-space 16 \
    --enable-expert-parallel \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

sbatch serve_mixtral_8x22b_optimised.sh

The vLLM server binds to port 8000 by default. Retrieve the compute node hostname from the job output file or with squeue -j <jobid> -h -o %N, then connect your client to http://<node_hostname>:8000/v1.
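Loading the weights takes a while, so a client started immediately after submission will usually fail to connect. The helper below is a sketch of a readiness wait; wait_until_ready is a hypothetical name, and the probe is passed in as a command so the function itself makes no assumptions about the server. The commented usage relies on the /health and /v1/models routes of vLLM's OpenAI-compatible server.

```shell
# Run a probe command repeatedly until it succeeds or the attempts run out.
#   $1 = maximum attempts, $2 = delay between attempts in seconds,
#   remaining arguments = the probe command and its arguments
wait_until_ready() {
    local attempts=$1 delay_s=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        attempts=$(( attempts - 1 ))
        sleep "$delay_s"
    done
    return 1
}

# Usage, with <node_hostname> filled in:
#   wait_until_ready 60 10 curl -sf "http://<node_hostname>:8000/health" \
#       && curl -s "http://<node_hostname>:8000/v1/models"
```

The second call lists the served model name, which is the value clients must send in the model field of their requests.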


13. Benchmarking

Submit benchmarks as SLURM jobs. Replace <inference_node_hostname> with the hostname of the node running the server job, and set the --model value to match the model that server is serving. The benchmark is an HTTP client, so its resource request does not need to match the server's GPU count. Save as benchmark_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=benchmark_mixtral
#SBATCH --time=01:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

#SBATCH -o benchmark_mixtral.%j.out
#SBATCH -e benchmark_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

SERVER_URL=http://<inference_node_hostname>:8000

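# The serve scripts start vLLM from a local path, so the served model name is
# that path. --model must match it; it is also used to load the tokenizer.
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

vllm bench serve \
    --base-url ${SERVER_URL} \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model ${MODEL_PATH} \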
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 20

Key metrics to track

  • output tokens per second (decode throughput)

  • time to first token (TTFT) — prefill latency

  • inter-token latency (ITL) — decode latency per token

  • KV cache utilisation — reported in vLLM logs and Prometheus metrics

  • speculative decoding acceptance rate — reported in vLLM logs when n-gram is active; low acceptance rates indicate the workload does not benefit from prompt lookup speculation
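For a single request, the two latency metrics reduce to simple differences between timestamps. The sketch below shows the arithmetic only; it assumes you have recorded the send time and each token's arrival time yourself, and latency_summary is an illustrative name rather than part of vllm bench.

```shell
# Read timestamps in seconds, one per line, from stdin. The first line is the
# time the request was sent; every following line is the arrival time of one
# output token. Prints TTFT and mean ITL in milliseconds.
latency_summary() {
    awk 'NR == 1 { sent = $1; next }
         NR == 2 { first = $1 }
         { last = $1; tokens++ }
         END {
             if (tokens == 0) { print "no tokens"; exit }
             ttft = first - sent
             # ITL is the mean gap between consecutive tokens after the first.
             itl = (tokens > 1) ? (last - first) / (tokens - 1) : 0
             printf "ttft_ms=%.0f itl_ms=%.0f tokens=%d\n", ttft * 1000, itl * 1000, tokens
         }'
}

printf '0.000\n0.250\n0.270\n0.290\n0.310\n' | latency_summary
```

In this made-up trace the first token arrives 250 ms after the request and the remaining three follow 20 ms apart, so TTFT reflects prefill cost while ITL reflects the steady decode rate.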

Profiling GPU utilisation

From within a SLURM job on the compute node:

nvidia-smi dmon -s u -d 1

The allocated GPUs (2 for 8x7B, 4 for 8x22B) should show compute utilisation above 70% under sustained load.


14. Known caveats and constraints

pip inside the Conda environment

All pip invocations in this guide follow export PATH=${VIRTUAL_ENV}/bin:${PATH}. If a SLURM script omits this line and calls pip directly, packages will install into ~/.local/lib/python3.11/site-packages/ and consume home directory quota.

N-gram speculative decoding in vLLM V1 mode

There is a known issue (vLLM GitHub issue #16883) in which n-gram speculative decoding does not function correctly when vLLM is running in V1 engine mode (VLLM_USE_V1=1). In vLLM 0.19.1, V1 is enabled by default for supported models. If n-gram speculation produces no speedup or incorrect output, disable V1 mode by setting VLLM_USE_V1=0 before the serve call:

export VLLM_USE_V1=0

Add this line to the serve script above the vllm serve call if you encounter issues with n-gram speculation.

N-gram speculation and workload fit

N-gram prompt lookup only benefits workloads where the model output closely follows the input prompt. For open-ended chat, creative writing, or reasoning tasks that do not quote the input, the acceptance rate will be low and the speculative overhead may cause a modest throughput reduction. Measure the acceptance rate in vLLM logs before committing to n-gram speculation in production.

FP8 KV cache with chunked prefill

There is a known interaction between --kv-cache-dtype fp8 and --enable-chunked-prefill in vLLM versions below 0.17.0. This has been resolved in the recommended vLLM 0.19.1.

Mixtral 8x7B context window

The config.json for mistralai/Mixtral-8x7B-v0.1 sets "sliding_window": null and "max_position_embeddings": 32768. Mixtral 8x7B does not use Sliding Window Attention in its shipped form, despite early documentation and secondary sources describing it as an architectural feature. The practical maximum context for vLLM serving is 32,768 tokens, which is what --max-model-len 32768 uses in this guide. There is no 8,192-token hard limit on reliable recall; this would only apply if the model had been configured with a fixed sliding window, which it is not.

Separate MoE tuning configs per model

The tuned Triton kernel configuration for Mixtral 8x7B at TP=2 is not valid for Mixtral 8x22B at TP=4, as the expert dimensions and parallel configuration differ. Use separate TUNING_DIR paths for each model and point VLLM_TUNED_CONFIG_FOLDER to the correct directory in each serve script.

GPU count and Discoverer+ etiquette

As Discoverer+ has only 4 DGX H200 nodes totalling 32 GPUs, requesting 2 GPUs for 8x7B or 4 GPUs for 8x22B is appropriate. Do not request 8 GPUs for either model; neither requires a full node, and doing so would unnecessarily deprive other users of GPU access.

Login node usage

On Discoverer+, all computationally or I/O-intensive operations must be submitted as SLURM jobs. This includes Conda environment creation, package installation, model weight downloading, server startup, and benchmarking.

Conda activation on Discoverer+

The guide uses export PATH=${VIRTUAL_ENV}/bin:${PATH} rather than conda activate. This is the recommended approach on Discoverer+ and does not require initialising a Conda shell.