Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM

This guide covers the complete process of deploying Mistral AI’s Mixtral 8x7B and Mixtral 8x22B on a single DGX H200 node within a SLURM-managed cluster (Discoverer+), using Conda for environment management and vLLM for inference.

Contents

Model overview
Hardware and software prerequisites
Environment setup with Conda on Discoverer+
Installing vLLM in the Conda environment
Baseline deployment
Memory layout and GPU allocation
Expert parallelism
KV cache optimisation
Chunked prefill and scheduler tuning
Speculative decoding with n-gram prompt lookup
MoE Triton kernel tuning
Full optimised SLURM job scripts
Benchmarking
Known caveats and constraints

1. Model overview

Mixtral 8x7B (released December 2023) and Mixtral 8x22B (released April 2024) are sparse Mixture-of-Experts language models from Mistral AI, both licensed under Apache 2.0.

Property	Mixtral 8x7B	Mixtral 8x22B
Total parameters	46.7B	141B
Active parameters per token	12.9B (top-2 of 8 experts)	39B (top-2 of 8 experts)
Experts per layer	8	8
Active experts per token	2	2
Attention mechanism	GQA	GQA
Context window	32,768 tokens	65,536 tokens
BF16 VRAM requirement	~94 GB	~263 GB
Licence	Apache 2.0	Apache 2.0
Hugging Face identifier (base)	`mistrala i/Mixtral-8x7B-v0.1`	`mistralai /Mixtral-8x22B-v0.1`
Hugging Face identifier (instruct)	`mistralai/Mixtral -8x7B-Instruct-v0.1`	`mistralai/Mixtral- 8x22B-Instruct-v0.1`

Both models are natively supported in vLLM without --trust-remote-code. Neither model uses MLA attention, reasoning tokens, or requires a tool-call parser — deployment is substantially simpler than Kimi K2.5/K2.6.

The active parameter count governs compute per forward pass: Mixtral 8x7B activates 12.9B parameters per token and processes each token with the equivalent compute of a 14B dense model, and Mixtral 8x22B with the equivalent of a 39B dense model, despite loading all expert weights into VRAM.

2. Hardware and software prerequisites

DGX H200 system specifications

The NVIDIA DGX H200 provides the following hardware relevant to this deployment:

Component	Specification
GPUs	8x NVIDIA H200 SXM Tensor Core GPU
GPU memory	141 GB HBM3e per GPU, 1,128 GB total
GPU memory bandwidth	4.8 TB/s per GPU
GPU interconnect	18x NVLink 4.0 connections per GPU, 900 GB/s bidirectional per GPU
NVSwitch	4x NVSwitch, 7.2 TB/s aggregate bidirectional GPU-to-GPU bandwidth
Host CPUs	2x Intel Xeon Platinum 8480C, 112 cores total
System memory	2 TB DDR5
NVMe storage	8x 3.84 TB (data), 2x 1.92 TB (OS)
Network	10x ConnectX-7, 400 Gb/s InfiniBand/Ethernet

Both Mixtral models fit comfortably on a single DGX H200 node with substantial KV cache headroom remaining. Mixtral 8x7B requires only 1–2 GPUs for weights; Mixtral 8x22B requires 2–4 GPUs. Neither model requires a full 8-GPU allocation, which is an important consideration on a shared cluster with only 4 nodes total. The guides below use TP=2 for 8x7B and TP=4 for 8x22B, leaving remaining GPUs available for other users.

Software requirements

Component	Minimum version	Notes
CUDA toolkit	12.1	12.8 required for FP8 KV cache on Hopper
NVIDIA driver	535.x	560+ recommended
Python	3.11	as specified in the Conda environment
vLLM	0.19.1	pin this version for stability
PyTorch	2.5+	installed as a vLLM dependency

On Discoverer+, CUDA libraries are provided through the cluster environment module system and do not need to be installed manually inside the Conda environment.

3. Environment setup with Conda on Discoverer+

On Discoverer+, Conda is provided through the centralised Anaconda installation and accessed via the module system. Do not install a separate Anaconda or Miniconda distribution in your home or project directory.

The recommended location for virtual environments on Discoverer+ is:

/valhalla/projects/<your_slurm_project_account_name>/virt_envs/

Creating the vLLM environment via a SLURM batch job

Environment creation must not be run on the login node. Submit a SLURM batch job instead.

Save the following as create_vllm_mixtral_env.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=create_vllm_mixtral_env
#SBATCH --time=00:30:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G

#SBATCH -o create_vllm_mixtral_env.%j.out
#SBATCH -e create_vllm_mixtral_env.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

[ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

if [ $? -ne 0 ]; then
    echo "Conda environment creation failed." >&2
    exit 1
fi

echo "Conda environment created successfully."
export PATH=${VIRTUAL_ENV}/bin:${PATH}

echo "Environment ready for vLLM installation."

Submit and verify:

sbatch create_vllm_mixtral_env.sh
cat create_vllm_mixtral_env.<jobid>.out

4. Installing vLLM in the Conda environment

Why pip is used inside the Conda environment

Conda is the preferred package manager on Discoverer+. For vLLM, the conda-forge channel only carries versions up to 0.10.x, significantly behind the 0.19.1 release. The vLLM project distributes current releases exclusively through PyPI wheels, so pip is necessary for this package.

Pip installs into the Conda environment provided that export PATH=${VIRTUAL_ENV}/bin:${PATH} is set before calling pip. This causes pip to install all packages into ${VIRTUAL_ENV}/lib/python3.11/site-packages/ — entirely within the project storage path on /valhalla. Nothing is written to ~/.local or the home directory.

To confirm the correct pip binary is active at any point during a job:

which pip
# must print: /valhalla/projects/<account>/virt_envs/vllm-mixtral/bin/pip

Installing vLLM via a SLURM batch job

Save the following as install_vllm_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=install_vllm_mixtral
#SBATCH --time=01:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G

#SBATCH -o install_vllm_mixtral.%j.out
#SBATCH -e install_vllm_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

[ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

export PATH=${VIRTUAL_ENV}/bin:${PATH}

echo "Using pip at: $(which pip)"

pip install "vllm==0.19.1"

if [ $? -ne 0 ]; then
    echo "vLLM installation failed." >&2
    exit 1
fi

pip install "huggingface_hub[cli]"

echo "vLLM installation complete."
echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

Submit:

sbatch install_vllm_mixtral.sh

Verify in the job output that the install location is under /valhalla and not ~/.local.

Downloading model weights

Both Mixtral models are available on Hugging Face under Apache 2.0. Download to project storage, not to the home directory. Adjust the model identifier and directory for whichever model you are deploying.

Save as download_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=download_mixtral
#SBATCH --time=02:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G

#SBATCH -o download_mixtral.%j.out
#SBATCH -e download_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

# Set MODEL_ID to the desired variant:
#   mistralai/Mixtral-8x7B-Instruct-v0.1   (~94 GB BF16)
#   mistralai/Mixtral-8x22B-Instruct-v0.1  (~263 GB BF16)
MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

huggingface-cli download ${MODEL_ID} \
    --local-dir ${MODEL_DIR} \
    --local-dir-use-symlinks False

echo "Download complete. Weights at ${MODEL_DIR}."

Approximate download times: 8x7B (94 GB) — under 1 hour; 8x22B (263 GB) — 1–2 hours.

5. Baseline deployment

The following SLURM jobs start vLLM inference servers for each model. All flags listed are required or strongly recommended for correct behaviour. The two models use different GPU counts and tensor parallel sizes, so separate scripts are provided.

Mixtral 8x7B baseline

Save as serve_mixtral_8x7b_baseline.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o vllm_mixtral_8x7b.%j.out
#SBATCH -e vllm_mixtral_8x7b.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 2 \
    --dtype bfloat16

Mixtral 8x22B baseline

Save as serve_mixtral_8x22b_baseline.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G

#SBATCH -o vllm_mixtral_8x22b.%j.out
#SBATCH -e vllm_mixtral_8x22b.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 4 \
    --dtype bfloat16

Why no `--trust-remote-code`

Both Mixtral architectures are natively registered in vLLM. The flag is not required and should not be passed unless loading a custom or modified checkpoint.

Why no `--reasoning-parser` or `--tool-call-parser`

Mixtral models do not emit structured reasoning tokens. Tool calling uses standard function-calling syntax handled natively by vLLM’s OpenAI-compatible API without a model-specific parser.

6. Memory layout and GPU allocation

VRAM consumption breakdown (BF16 weights)

Component	Mixtral 8x7B (TP=2)	Mixtral 8x22B (TP=4)
All expert weights (BF16)	~94 GB total, ~47 GB per GPU	~263 GB total, ~66 GB per GPU
Activations and CUDA overhead	~2–4 GB per GPU	~2–4 GB per GPU
KV cache (remainder)	~90 GB per GPU	~70 GB per GPU

The DGX H200 single GPU has 141 GB HBM3e. With TP=2 for 8x7B and TP=4 for 8x22B, both models leave substantial KV cache headroom — much more than the Kimi models, because the weights are significantly smaller.

GPU memory utilisation

--gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 is safe for both models on H200 and recovers additional KV cache space.

Context length

--max-model-len 32768   # for Mixtral 8x7B
--max-model-len 65536   # for Mixtral 8x22B

The config.json for Mixtral 8x7B sets max_position_embeddings to 32,768 and sliding_window to null — there is no active sliding window attention. The practical maximum for vLLM serving is therefore 32,768 tokens, which is what --max-model-len 32768 uses in this guide. Mixtral 8x22B sets max_position_embeddings to 65,536 tokens. Do not leave --max-model-len at the model default if you are serving short-context workloads, as the KV cache reservation is proportional to this value.

GPU count and shared cluster etiquette

Discoverer+ has 4 DGX H200 nodes and 32 GPUs in total. Using only 2 GPUs for Mixtral 8x7B and 4 GPUs for Mixtral 8x22B leaves the majority of the node available for other users. Do not request 8 GPUs for either model.

7. Expert parallelism

Background

Expert parallelism (EP) assigns different experts to different GPUs rather than sharding each expert’s weight matrix across all TP ranks. For Mixtral’s 8 experts distributed across the TP group, EP reduces the inter-GPU communication volume per forward pass because tokens route to the GPU holding the relevant expert rather than all-reducing partial results across all GPUs.

Enabling expert parallelism

--enable-expert-parallel

This flag modifies MoE communication patterns for layers and is only effective when tensor-parallel-size × data-parallel-size > 1. On both TP=2 (8x7B) and TP=4 (8x22B) configurations, this condition is satisfied.

For Mixtral’s 8 experts at TP=2, each GPU holds approximately 4 experts. At TP=4, each GPU holds approximately 2 experts. The flag is beneficial for both configurations.

8. KV cache optimisation

FP8 KV cache

--kv-cache-dtype fp8

Quantising the KV cache from BF16 to FP8 halves memory per cached token and reduces memory bandwidth during attention decode steps. Requires CUDA 11.8 or later. Validated on H200 (Hopper architecture) by the vLLM team. Both Mixtral models use GQA, which already results in smaller KV caches than standard MHA; FP8 halves this further.

Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to 1.0. For better accuracy under extreme quantisation conditions, supply a calibrated scale file via --quantization-param-path.

Prefix caching

--enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating redundant prefill computation. Particularly effective for RAG workloads that prepend the same system prompt and document chunks across many requests. Enabled by default in vLLM V1; specify explicitly if using an older version.

CPU offload (swap space)

--swap-space 16

The value is in GiB per GPU. A lower value than the Kimi guides is appropriate here because both Mixtral models leave far more KV cache headroom; swap is less likely to be needed. Adjust upward if preemption warnings appear in the vLLM logs:

WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

9. Chunked prefill and scheduler tuning

Chunked prefill

--enable-chunked-prefill \
--max-num-batched-tokens 8192

Chunked prefill breaks large prefill computations into chunks interleaved with decode steps, preventing single long-context requests from blocking decode throughput for all other in-flight requests. Particularly relevant for Mixtral 8x22B with its 64K context window.

--max-num-batched-tokens controls the total tokens processed per scheduling step. A value of 8,192 is a reasonable starting point for both models on H200.

Max concurrent sequences

--max-num-seqs 256

The default is 256. Reduce if KV cache OOM errors occur under high concurrency; increase if GPU utilisation is consistently below 80%. Both Mixtral models have substantial remaining KV cache headroom, so the default 256 is generally viable without reduction.

10. Speculative decoding with n-gram prompt lookup

Neither Mistral AI nor the vLLM community has published dedicated draft models trained specifically on the Mixtral architecture. The appropriate speculative decoding strategy for Mixtral on vLLM is n-gram prompt lookup decoding, which requires no additional model download.

How n-gram speculative decoding works

N-gram speculative decoding matches the last N tokens of the current generation against occurrences of those same tokens in the input prompt, then proposes the tokens that follow in the prompt as draft candidates. The main model verifies them in a single forward pass. This is particularly effective for RAG workloads where the model is likely to quote or closely paraphrase retrieved document content.

No additional VRAM is required — the draft proposals are generated from the input context without loading any additional model weights.

Enabling n-gram speculative decoding

The verified syntax in vLLM 0.19.1 is:

--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Fields:

method: must be "ngram" exactly
num_speculative_tokens: number of tokens proposed per step; 5 is a reasonable starting value for RAG workloads
prompt_lookup_min: minimum n-gram length to match; 2 means at least a 2-token match is required before proposing
prompt_lookup_max: maximum n-gram length to search; larger values find more specific matches but incur slightly higher search cost per step

Effectiveness

N-gram speculative decoding provides a meaningful latency reduction only when the model output closely follows the input prompt — which is the case for document summarisation, extraction, and RAG answer generation. For open-ended generation tasks where the model does not repeat input text, the acceptance rate will be low and the overhead may reduce throughput marginally. Benchmark both configurations on your representative workload before deploying n-gram speculation to production.

11. MoE Triton kernel tuning

Without a tuned configuration, vLLM logs at startup:

WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The benchmark_moe.py script writes a hardware-specific JSON file into a target directory. Setting VLLM_TUNED_CONFIG_FOLDER to that directory before serving causes vLLM to load it automatically. Run separate tuning jobs for 8x7B and 8x22B since their expert dimensions differ.

Running the tuning script via SLURM

Save as tune_moe_mixtral.sh, adjusting MODEL_PATH, TUNING_DIR, and --tp-size for the model you are tuning:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=tune_moe_mixtral
#SBATCH --time=02:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o tune_moe_mixtral.%j.out
#SBATCH -e tune_moe_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

# Adjust for 8x7B (tp-size 2) or 8x22B (tp-size 4 with gres=gpu:4)
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

mkdir -p ${TUNING_DIR}

python benchmarks/kernels/benchmark_moe.py \
    --model ${MODEL_PATH} \
    --tp-size 2 \
    --dtype bfloat16 \
    --tune \
    --save-dir ${TUNING_DIR}

echo "Tuning complete. Config written to ${TUNING_DIR}."

For Mixtral 8x22B, change --gres=gpu:4, --tp-size 4, and both MODEL_PATH and TUNING_DIR accordingly.

The tuning run takes 30–90 minutes. Re-run if you change GPU count, TP size, or model.

Loading the tuned configuration at serve time

export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

Set this before the vllm serve call. vLLM logs confirmation of loading it at startup.

12. Full optimised SLURM job scripts

Mixtral 8x7B — full optimised script

Save as serve_mixtral_8x7b_optimised.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b_opt
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G

#SBATCH -o vllm_mixtral_8x7b_opt.%j.out
#SBATCH -e vllm_mixtral_8x7b_opt.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 32768 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --swap-space 16 \
    --enable-expert-parallel \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

sbatch serve_mixtral_8x7b_optimised.sh

Mixtral 8x22B — full optimised script

Save as serve_mixtral_8x22b_optimised.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b_opt
#SBATCH --time=24:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G

#SBATCH -o vllm_mixtral_8x22b_opt.%j.out
#SBATCH -e vllm_mixtral_8x22b_opt.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}

export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b

[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

vllm serve ${MODEL_PATH} \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 65536 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --swap-space 16 \
    --enable-expert-parallel \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

sbatch serve_mixtral_8x22b_optimised.sh

The vLLM server binds to port 8000 by default. Retrieve the compute node hostname from the job output file and connect your client to http://<node_hostname>:8000/v1.

13. Benchmarking

Submit benchmarks as SLURM jobs. Replace <inference_node_hostname> with the hostname from the server job output. Adjust the --model identifier and --gres count for the model you are benchmarking. Save as benchmark_mixtral.sh:

#!/bin/bash

#SBATCH --partition=common
#SBATCH --job-name=benchmark_mixtral
#SBATCH --time=01:00:00

#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

#SBATCH -o benchmark_mixtral.%j.out
#SBATCH -e benchmark_mixtral.%j.err

cd ${SLURM_SUBMIT_DIR}

module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}

SERVER_URL=http://<inference_node_hostname>:8000

# Adjust --model to match whichever model is running on the server
vllm bench serve \
    --base-url ${SERVER_URL} \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 20

Key metrics to track

output tokens per second (decode throughput)
time to first token (TTFT) — prefill latency
inter-token latency (ITL) — decode latency per token
KV cache utilisation — reported in vLLM logs and Prometheus metrics
speculative decoding acceptance rate — reported in vLLM logs when n-gram is active; low acceptance rates indicate the workload does not benefit from prompt lookup speculation

Profiling GPU utilisation

From within a SLURM job on the compute node:

nvidia-smi dmon -s u -d 1

The allocated GPUs (2 for 8x7B, 4 for 8x22B) should show compute utilisation above 70% under sustained load.

14. Known caveats and constraints

pip inside the Conda environment

All pip invocations in this guide follow export PATH=${VIRTUAL_ENV}/bin:${PATH}. If a SLURM script omits this line and calls pip directly, packages will install into ~/.local/lib/python3.11/site-packages/ and consume home directory quota.

N-gram speculative decoding in vLLM V1 mode

There is a known issue (vLLM GitHub issue #16883) in which n-gram speculative decoding does not function correctly when vLLM is running in V1 engine mode (VLLM_USE_V1=1). In vLLM 0.19.1, V1 is enabled by default for supported models. If n-gram speculation produces no speedup or incorrect output, disable V1 mode by setting VLLM_USE_V1=0 before the serve call:

export VLLM_USE_V1=0

Add this line to the serve script above the vllm serve call if you encounter issues with n-gram speculation.

N-gram speculation and workload fit

N-gram prompt lookup only benefits workloads where the model output closely follows the input prompt. For open-ended chat, creative writing, or reasoning tasks that do not quote the input, the acceptance rate will be low and the speculative overhead may cause a modest throughput reduction. Measure the acceptance rate in vLLM logs before committing to n-gram speculation in production.

FP8 KV cache with chunked prefill

There is a known interaction between --kv-cache-dtype fp8 and --enable-chunked-prefill in vLLM versions below 0.17.0. This has been resolved in the recommended vLLM 0.19.1.

Mixtral 8x7B context window

The config.json for mistralai/Mixtral-8x7B-v0.1 sets "sliding_window": null and "max_position_embeddings": 32768. Mixtral 8x7B does not use Sliding Window Attention in its shipped form, despite early documentation and secondary sources describing it as an architectural feature. The practical maximum context for vLLM serving is 32,768 tokens, which is what --max-model-len 32768 uses in this guide. There is no 8,192-token hard limit on reliable recall; this would only apply if the model had been configured with a fixed sliding window, which it is not.

Separate MoE tuning configs per model

The tuned Triton kernel configuration for Mixtral 8x7B at TP=2 is not valid for Mixtral 8x22B at TP=4, as the expert dimensions and parallel configuration differ. Use separate TUNING_DIR paths for each model and point VLLM_TUNED_CONFIG_FOLDER to the correct directory in each serve script.

GPU count and Discoverer+ etiquette

As Discoverer+ has only 2 DGX H200 nodes totalling 16 GPUs, requesting 2 GPUs for 8x7B or 4 GPUs for 8x22B is appropriate. Do not request 8 GPUs for either model; neither requires a full node, and doing so would unnecessarily deprive other users of GPU access.

Conda activation on Discoverer+

The guide uses export PATH=${VIRTUAL_ENV}/bin:${PATH} rather than conda activate. This is the recommended approach on Discoverer+ and does not require initialising a Conda shell.

Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM

Contents

1. Model overview

2. Hardware and software prerequisites

DGX H200 system specifications

Software requirements

3. Environment setup with Conda on Discoverer+

Creating the vLLM environment via a SLURM batch job

4. Installing vLLM in the Conda environment

Why pip is used inside the Conda environment

Installing vLLM via a SLURM batch job

Downloading model weights

5. Baseline deployment

Mixtral 8x7B baseline

Mixtral 8x22B baseline

Why no --trust-remote-code

Why no --reasoning-parser or --tool-call-parser

6. Memory layout and GPU allocation

VRAM consumption breakdown (BF16 weights)

GPU memory utilisation

Context length

GPU count and shared cluster etiquette

7. Expert parallelism

Background

Enabling expert parallelism

8. KV cache optimisation

FP8 KV cache

Prefix caching

CPU offload (swap space)

9. Chunked prefill and scheduler tuning

Chunked prefill

Max concurrent sequences

10. Speculative decoding with n-gram prompt lookup

How n-gram speculative decoding works

Enabling n-gram speculative decoding

Effectiveness

11. MoE Triton kernel tuning

Running the tuning script via SLURM

Loading the tuned configuration at serve time

12. Full optimised SLURM job scripts

Mixtral 8x7B — full optimised script

Mixtral 8x22B — full optimised script

13. Benchmarking

Key metrics to track

Profiling GPU utilisation

14. Known caveats and constraints

pip inside the Conda environment

N-gram speculative decoding in vLLM V1 mode

N-gram speculation and workload fit

FP8 KV cache with chunked prefill

Mixtral 8x7B context window

Separate MoE tuning configs per model

GPU count and Discoverer+ etiquette

Login node usage

Conda activation on Discoverer+

Why no `--trust-remote-code`

Why no `--reasoning-parser` or `--tool-call-parser`