Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM
This guide covers the complete process of deploying Mistral AI’s Mixtral 8x7B and Mixtral 8x22B on a single DGX H200 node within a SLURM-managed cluster (Discoverer+), using Conda for environment management and vLLM for inference.
Contents
1. Model overview
Mixtral 8x7B (released December 2023) and Mixtral 8x22B (released April 2024) are sparse Mixture-of-Experts language models from Mistral AI, both licensed under Apache 2.0.
Property |
Mixtral 8x7B |
Mixtral 8x22B |
|---|---|---|
Total parameters |
46.7B |
141B |
Active parameters per token |
12.9B (top-2 of 8 experts) |
39B (top-2 of 8 experts) |
Experts per layer |
8 |
8 |
Active experts per token |
2 |
2 |
Attention mechanism |
GQA |
GQA |
Context window |
32,768 tokens |
65,536 tokens |
BF16 VRAM requirement |
~94 GB |
~263 GB |
Licence |
Apache 2.0 |
Apache 2.0 |
Hugging Face identifier (base) |
|
|
Hugging Face identifier (instruct) |
|
|
Both models are natively supported in vLLM without
--trust-remote-code. Neither model uses MLA attention, reasoning
tokens, or requires a tool-call parser — deployment is substantially
simpler than Kimi K2.5/K2.6.
The active parameter count governs compute per forward pass: Mixtral 8x7B activates 12.9B parameters per token and processes each token with the equivalent compute of a 14B dense model, and Mixtral 8x22B with the equivalent of a 39B dense model, despite loading all expert weights into VRAM.
2. Hardware and software prerequisites
DGX H200 system specifications
The NVIDIA DGX H200 provides the following hardware relevant to this deployment:
Component |
Specification |
|---|---|
GPUs |
8x NVIDIA H200 SXM Tensor Core GPU |
GPU memory |
141 GB HBM3e per GPU, 1,128 GB total |
GPU memory bandwidth |
4.8 TB/s per GPU |
GPU interconnect |
18x NVLink 4.0 connections per GPU, 900 GB/s bidirectional per GPU |
NVSwitch |
4x NVSwitch, 7.2 TB/s aggregate bidirectional GPU-to-GPU bandwidth |
Host CPUs |
2x Intel Xeon Platinum 8480C, 112 cores total |
System memory |
2 TB DDR5 |
NVMe storage |
8x 3.84 TB (data), 2x 1.92 TB (OS) |
Network |
10x ConnectX-7, 400 Gb/s InfiniBand/Ethernet |
Both Mixtral models fit comfortably on a single DGX H200 node with substantial KV cache headroom remaining. Mixtral 8x7B requires only 1–2 GPUs for weights; Mixtral 8x22B requires 2–4 GPUs. Neither model requires a full 8-GPU allocation, which is an important consideration on a shared cluster with only 4 nodes total. The guides below use TP=2 for 8x7B and TP=4 for 8x22B, leaving remaining GPUs available for other users.
Software requirements
Component |
Minimum version |
Notes |
|---|---|---|
CUDA toolkit |
12.1 |
12.8 required for FP8 KV cache on Hopper |
NVIDIA driver |
535.x |
560+ recommended |
Python |
3.11 |
as specified in the Conda environment |
vLLM |
0.19.1 |
pin this version for stability |
PyTorch |
2.5+ |
installed as a vLLM dependency |
On Discoverer+, CUDA libraries are provided through the cluster environment module system and do not need to be installed manually inside the Conda environment.
3. Environment setup with Conda on Discoverer+
On Discoverer+, Conda is provided through the centralised Anaconda installation and accessed via the module system. Do not install a separate Anaconda or Miniconda distribution in your home or project directory.
The recommended location for virtual environments on Discoverer+ is:
/valhalla/projects/<your_slurm_project_account_name>/virt_envs/
Creating the vLLM environment via a SLURM batch job
Environment creation must not be run on the login node. Submit a SLURM batch job instead.
Save the following as create_vllm_mixtral_env.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=create_vllm_mixtral_env
#SBATCH --time=00:30:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH -o create_vllm_mixtral_env.%j.out
#SBATCH -e create_vllm_mixtral_env.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }
conda create --prefix ${VIRTUAL_ENV} python=3.11 -y
if [ $? -ne 0 ]; then
echo "Conda environment creation failed." >&2
exit 1
fi
echo "Conda environment created successfully."
export PATH=${VIRTUAL_ENV}/bin:${PATH}
echo "Environment ready for vLLM installation."
Submit and verify:
sbatch create_vllm_mixtral_env.sh
cat create_vllm_mixtral_env.<jobid>.out
4. Installing vLLM in the Conda environment
Why pip is used inside the Conda environment
Conda is the preferred package manager on Discoverer+. For vLLM, the conda-forge channel only carries versions up to 0.10.x, significantly behind the 0.19.1 release. The vLLM project distributes current releases exclusively through PyPI wheels, so pip is necessary for this package.
Pip installs into the Conda environment provided that
export PATH=${VIRTUAL_ENV}/bin:${PATH} is set before calling pip.
This causes pip to install all packages into
${VIRTUAL_ENV}/lib/python3.11/site-packages/ — entirely within the
project storage path on /valhalla. Nothing is written to
~/.local or the home directory.
To confirm the correct pip binary is active at any point during a job:
which pip
# must print: /valhalla/projects/<account>/virt_envs/vllm-mixtral/bin/pip
Installing vLLM via a SLURM batch job
Save the following as install_vllm_mixtral.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=install_vllm_mixtral
#SBATCH --time=01:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G
#SBATCH -o install_vllm_mixtral.%j.out
#SBATCH -e install_vllm_mixtral.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
echo "Using pip at: $(which pip)"
pip install "vllm==0.19.1"
if [ $? -ne 0 ]; then
echo "vLLM installation failed." >&2
exit 1
fi
pip install "huggingface_hub[cli]"
echo "vLLM installation complete."
echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"
Submit:
sbatch install_vllm_mixtral.sh
Verify in the job output that the install location is under
/valhalla and not ~/.local.
Downloading model weights
Both Mixtral models are available on Hugging Face under Apache 2.0. Download to project storage, not to the home directory. Adjust the model identifier and directory for whichever model you are deploying.
Save as download_mixtral.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=download_mixtral
#SBATCH --time=02:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=2cpu-single-host
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=32G
#SBATCH -o download_mixtral.%j.out
#SBATCH -e download_mixtral.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
# Set MODEL_ID to the desired variant:
# mistralai/Mixtral-8x7B-Instruct-v0.1 (~94 GB BF16)
# mistralai/Mixtral-8x22B-Instruct-v0.1 (~263 GB BF16)
MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
huggingface-cli download ${MODEL_ID} \
--local-dir ${MODEL_DIR} \
--local-dir-use-symlinks False
echo "Download complete. Weights at ${MODEL_DIR}."
Approximate download times: 8x7B (94 GB) — under 1 hour; 8x22B (263 GB) — 1–2 hours.
5. Baseline deployment
The following SLURM jobs start vLLM inference servers for each model. All flags listed are required or strongly recommended for correct behaviour. The two models use different GPU counts and tensor parallel sizes, so separate scripts are provided.
Mixtral 8x7B baseline
Save as serve_mixtral_8x7b_baseline.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b
#SBATCH --time=24:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G
#SBATCH -o vllm_mixtral_8x7b.%j.out
#SBATCH -e vllm_mixtral_8x7b.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
vllm serve ${MODEL_PATH} \
--tensor-parallel-size 2 \
--dtype bfloat16
Mixtral 8x22B baseline
Save as serve_mixtral_8x22b_baseline.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b
#SBATCH --time=24:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G
#SBATCH -o vllm_mixtral_8x22b.%j.out
#SBATCH -e vllm_mixtral_8x22b.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
vllm serve ${MODEL_PATH} \
--tensor-parallel-size 4 \
--dtype bfloat16
Why no --trust-remote-code
Both Mixtral architectures are natively registered in vLLM. The flag is not required and should not be passed unless loading a custom or modified checkpoint.
Why no --reasoning-parser or --tool-call-parser
Mixtral models do not emit structured reasoning tokens. Tool calling uses standard function-calling syntax handled natively by vLLM’s OpenAI-compatible API without a model-specific parser.
6. Memory layout and GPU allocation
VRAM consumption breakdown (BF16 weights)
Component |
Mixtral 8x7B (TP=2) |
Mixtral 8x22B (TP=4) |
|---|---|---|
All expert weights (BF16) |
~94 GB total, ~47 GB per GPU |
~263 GB total, ~66 GB per GPU |
Activations and CUDA overhead |
~2–4 GB per GPU |
~2–4 GB per GPU |
KV cache (remainder) |
~90 GB per GPU |
~70 GB per GPU |
The DGX H200 single GPU has 141 GB HBM3e. With TP=2 for 8x7B and TP=4 for 8x22B, both models leave substantial KV cache headroom — much more than the Kimi models, because the weights are significantly smaller.
GPU memory utilisation
--gpu-memory-utilization 0.92
The default in vLLM is 0.90. Setting 0.92 is safe for both models on H200 and recovers additional KV cache space.
Context length
--max-model-len 32768 # for Mixtral 8x7B
--max-model-len 65536 # for Mixtral 8x22B
The config.json for Mixtral 8x7B sets max_position_embeddings to
32,768 and sliding_window to null — there is no active sliding
window attention. The practical maximum for vLLM serving is therefore
32,768 tokens, which is what --max-model-len 32768 uses in this
guide. Mixtral 8x22B sets max_position_embeddings to 65,536 tokens.
Do not leave --max-model-len at the model default if you are serving
short-context workloads, as the KV cache reservation is proportional to
this value.
7. Expert parallelism
Background
Expert parallelism (EP) assigns different experts to different GPUs rather than sharding each expert’s weight matrix across all TP ranks. For Mixtral’s 8 experts distributed across the TP group, EP reduces the inter-GPU communication volume per forward pass because tokens route to the GPU holding the relevant expert rather than all-reducing partial results across all GPUs.
Enabling expert parallelism
--enable-expert-parallel
This flag modifies MoE communication patterns for layers and is only
effective when tensor-parallel-size × data-parallel-size > 1. On
both TP=2 (8x7B) and TP=4 (8x22B) configurations, this condition is
satisfied.
For Mixtral’s 8 experts at TP=2, each GPU holds approximately 4 experts. At TP=4, each GPU holds approximately 2 experts. The flag is beneficial for both configurations.
8. KV cache optimisation
FP8 KV cache
--kv-cache-dtype fp8
Quantising the KV cache from BF16 to FP8 halves memory per cached token and reduces memory bandwidth during attention decode steps. Requires CUDA 11.8 or later. Validated on H200 (Hopper architecture) by the vLLM team. Both Mixtral models use GQA, which already results in smaller KV caches than standard MHA; FP8 halves this further.
Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to
1.0. For better accuracy under extreme quantisation conditions, supply a
calibrated scale file via --quantization-param-path.
Prefix caching
--enable-prefix-caching
Reuses the computed KV cache for identical prompt prefixes, eliminating redundant prefill computation. Particularly effective for RAG workloads that prepend the same system prompt and document chunks across many requests. Enabled by default in vLLM V1; specify explicitly if using an older version.
CPU offload (swap space)
--swap-space 16
The value is in GiB per GPU. A lower value than the Kimi guides is appropriate here because both Mixtral models leave far more KV cache headroom; swap is less likely to be needed. Adjust upward if preemption warnings appear in the vLLM logs:
WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode
9. Chunked prefill and scheduler tuning
Chunked prefill
--enable-chunked-prefill \
--max-num-batched-tokens 8192
Chunked prefill breaks large prefill computations into chunks interleaved with decode steps, preventing single long-context requests from blocking decode throughput for all other in-flight requests. Particularly relevant for Mixtral 8x22B with its 64K context window.
--max-num-batched-tokens controls the total tokens processed per
scheduling step. A value of 8,192 is a reasonable starting point for
both models on H200.
Max concurrent sequences
--max-num-seqs 256
The default is 256. Reduce if KV cache OOM errors occur under high concurrency; increase if GPU utilisation is consistently below 80%. Both Mixtral models have substantial remaining KV cache headroom, so the default 256 is generally viable without reduction.
10. Speculative decoding with n-gram prompt lookup
Neither Mistral AI nor the vLLM community has published dedicated draft models trained specifically on the Mixtral architecture. The appropriate speculative decoding strategy for Mixtral on vLLM is n-gram prompt lookup decoding, which requires no additional model download.
How n-gram speculative decoding works
N-gram speculative decoding matches the last N tokens of the current generation against occurrences of those same tokens in the input prompt, then proposes the tokens that follow in the prompt as draft candidates. The main model verifies them in a single forward pass. This is particularly effective for RAG workloads where the model is likely to quote or closely paraphrase retrieved document content.
No additional VRAM is required — the draft proposals are generated from the input context without loading any additional model weights.
Enabling n-gram speculative decoding
The verified syntax in vLLM 0.19.1 is:
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'
Fields:
method: must be"ngram"exactlynum_speculative_tokens: number of tokens proposed per step; 5 is a reasonable starting value for RAG workloadsprompt_lookup_min: minimum n-gram length to match; 2 means at least a 2-token match is required before proposingprompt_lookup_max: maximum n-gram length to search; larger values find more specific matches but incur slightly higher search cost per step
Effectiveness
N-gram speculative decoding provides a meaningful latency reduction only when the model output closely follows the input prompt — which is the case for document summarisation, extraction, and RAG answer generation. For open-ended generation tasks where the model does not repeat input text, the acceptance rate will be low and the overhead may reduce throughput marginally. Benchmark both configurations on your representative workload before deploying n-gram speculation to production.
11. MoE Triton kernel tuning
Without a tuned configuration, vLLM logs at startup:
WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!
The benchmark_moe.py script writes a hardware-specific JSON file
into a target directory. Setting VLLM_TUNED_CONFIG_FOLDER to that
directory before serving causes vLLM to load it automatically. Run
separate tuning jobs for 8x7B and 8x22B since their expert dimensions
differ.
Running the tuning script via SLURM
Save as tune_moe_mixtral.sh, adjusting MODEL_PATH,
TUNING_DIR, and --tp-size for the model you are tuning:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=tune_moe_mixtral
#SBATCH --time=02:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G
#SBATCH -o tune_moe_mixtral.%j.out
#SBATCH -e tune_moe_mixtral.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}
# Adjust for 8x7B (tp-size 2) or 8x22B (tp-size 4 with gres=gpu:4)
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b
mkdir -p ${TUNING_DIR}
python benchmarks/kernels/benchmark_moe.py \
--model ${MODEL_PATH} \
--tp-size 2 \
--dtype bfloat16 \
--tune \
--save-dir ${TUNING_DIR}
echo "Tuning complete. Config written to ${TUNING_DIR}."
For Mixtral 8x22B, change --gres=gpu:4, --tp-size 4, and both
MODEL_PATH and TUNING_DIR accordingly.
The tuning run takes 30–90 minutes. Re-run if you change GPU count, TP size, or model.
Loading the tuned configuration at serve time
export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b
Set this before the vllm serve call. vLLM logs confirmation of
loading it at startup.
12. Full optimised SLURM job scripts
Mixtral 8x7B — full optimised script
Save as serve_mixtral_8x7b_optimised.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x7b_opt
#SBATCH --time=24:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=28
#SBATCH --mem=256G
#SBATCH -o vllm_mixtral_8x7b_opt.%j.out
#SBATCH -e vllm_mixtral_8x7b_opt.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b
[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
vllm serve ${MODEL_PATH} \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92 \
--max-model-len 32768 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--swap-space 16 \
--enable-expert-parallel \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'
Submit:
sbatch serve_mixtral_8x7b_optimised.sh
Mixtral 8x22B — full optimised script
Save as serve_mixtral_8x22b_optimised.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=vllm_mixtral_8x22b_opt
#SBATCH --time=24:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=56
#SBATCH --mem=512G
#SBATCH -o vllm_mixtral_8x22b_opt.%j.out
#SBATCH -e vllm_mixtral_8x22b_opt.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
[ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
export PATH=${VIRTUAL_ENV}/bin:${PATH}
export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b
[ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
vllm serve ${MODEL_PATH} \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92 \
--max-model-len 65536 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--swap-space 16 \
--enable-expert-parallel \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'
Submit:
sbatch serve_mixtral_8x22b_optimised.sh
The vLLM server binds to port 8000 by default. Retrieve the compute node
hostname from the job output file and connect your client to
http://<node_hostname>:8000/v1.
13. Benchmarking
Submit benchmarks as SLURM jobs. Replace <inference_node_hostname>
with the hostname from the server job output. Adjust the --model
identifier and --gres count for the model you are benchmarking. Save
as benchmark_mixtral.sh:
#!/bin/bash
#SBATCH --partition=common
#SBATCH --job-name=benchmark_mixtral
#SBATCH --time=01:00:00
#SBATCH --account=<your_slurm_project_account_name>
#SBATCH --qos=<your_qos_name>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH -o benchmark_mixtral.%j.out
#SBATCH -e benchmark_mixtral.%j.err
cd ${SLURM_SUBMIT_DIR}
module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }
export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
export PATH=${VIRTUAL_ENV}/bin:${PATH}
SERVER_URL=http://<inference_node_hostname>:8000
# Adjust --model to match whichever model is running on the server
vllm bench serve \
--base-url ${SERVER_URL} \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000 \
--request-rate 20
Key metrics to track
output tokens per second (decode throughput)
time to first token (TTFT) — prefill latency
inter-token latency (ITL) — decode latency per token
KV cache utilisation — reported in vLLM logs and Prometheus metrics
speculative decoding acceptance rate — reported in vLLM logs when n-gram is active; low acceptance rates indicate the workload does not benefit from prompt lookup speculation
Profiling GPU utilisation
From within a SLURM job on the compute node:
nvidia-smi dmon -s u -d 1
The allocated GPUs (2 for 8x7B, 4 for 8x22B) should show compute utilisation above 70% under sustained load.
14. Known caveats and constraints
pip inside the Conda environment
All pip invocations in this guide follow
export PATH=${VIRTUAL_ENV}/bin:${PATH}. If a SLURM script omits this
line and calls pip directly, packages will install into
~/.local/lib/python3.11/site-packages/ and consume home directory
quota.
N-gram speculative decoding in vLLM V1 mode
There is a known issue (vLLM GitHub issue #16883) in which n-gram
speculative decoding does not function correctly when vLLM is running in
V1 engine mode (VLLM_USE_V1=1). In vLLM 0.19.1, V1 is enabled by
default for supported models. If n-gram speculation produces no speedup
or incorrect output, disable V1 mode by setting VLLM_USE_V1=0 before
the serve call:
export VLLM_USE_V1=0
Add this line to the serve script above the vllm serve call if you
encounter issues with n-gram speculation.
N-gram speculation and workload fit
N-gram prompt lookup only benefits workloads where the model output closely follows the input prompt. For open-ended chat, creative writing, or reasoning tasks that do not quote the input, the acceptance rate will be low and the speculative overhead may cause a modest throughput reduction. Measure the acceptance rate in vLLM logs before committing to n-gram speculation in production.
FP8 KV cache with chunked prefill
There is a known interaction between --kv-cache-dtype fp8 and
--enable-chunked-prefill in vLLM versions below 0.17.0. This has
been resolved in the recommended vLLM 0.19.1.
Mixtral 8x7B context window
The config.json for mistralai/Mixtral-8x7B-v0.1 sets
"sliding_window": null and "max_position_embeddings": 32768.
Mixtral 8x7B does not use Sliding Window Attention in its shipped form,
despite early documentation and secondary sources describing it as an
architectural feature. The practical maximum context for vLLM serving is
32,768 tokens, which is what --max-model-len 32768 uses in this
guide. There is no 8,192-token hard limit on reliable recall; this would
only apply if the model had been configured with a fixed sliding window,
which it is not.
Separate MoE tuning configs per model
The tuned Triton kernel configuration for Mixtral 8x7B at TP=2 is not
valid for Mixtral 8x22B at TP=4, as the expert dimensions and parallel
configuration differ. Use separate TUNING_DIR paths for each model
and point VLLM_TUNED_CONFIG_FOLDER to the correct directory in each
serve script.
GPU count and Discoverer+ etiquette
As Discoverer+ has only 2 DGX H200 nodes totalling 16 GPUs, requesting 2 GPUs for 8x7B or 4 GPUs for 8x22B is appropriate. Do not request 8 GPUs for either model; neither requires a full node, and doing so would unnecessarily deprive other users of GPU access.
Login node usage
On Discoverer+, all computationally or I/O-intensive operations must be submitted as SLURM jobs. This includes Conda environment creation, package installation, model weight downloading, server startup, and benchmarking.
Conda activation on Discoverer+
The guide uses export PATH=${VIRTUAL_ENV}/bin:${PATH} rather than
conda activate. This is the recommended approach on Discoverer+ and
does not require initialising a Conda shell.