Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM
==============================================================

This guide covers the complete process of deploying Mistral AI’s Mixtral
8x7B and Mixtral 8x22B on a single DGX H200 node within a SLURM-managed
cluster (Discoverer+), using Conda for environment management and vLLM
for inference.


Contents
--------

1.  `Model overview <mix-model-overview_>`_
2.  `Hardware and software prerequisites <mix-hardware-and-software-prerequisites_>`_
3.  `Environment setup with Conda on Discoverer+ <mix-environment-setup-with-conda-on-discoverer_>`_
4.  `Installing vLLM in the Conda environment <mix-installing-vllm-in-the-conda-environment_>`_
5.  `Baseline deployment <mix-baseline-deployment_>`_
6.  `Memory layout and GPU allocation <mix-memory-layout-and-gpu-allocation_>`_
7.  `Expert parallelism <mix-expert-parallelism_>`_
8.  `KV cache optimisation <mix-kv-cache-optimisation_>`_
9.  `Chunked prefill and scheduler tuning <mix-chunked-prefill-and-scheduler-tuning_>`_
10.  `Speculative decoding with n-gram prompt lookup <mix-speculative-decoding-with-n-gram-prompt-lookup_>`_
11.  `MoE Triton kernel tuning <mix-moe-triton-kernel-tuning_>`_
12.  `Full optimised SLURM job scripts <mix-full-optimised-slurm-job-scripts_>`_
13.  `Benchmarking <mix-benchmarking_>`_
14.  `Known caveats and constraints <mix-known-caveats-and-constraints_>`_

--------------

.. _mix-model-overview:

1. Model overview
-----------------

Mixtral 8x7B (released December 2023) and Mixtral 8x22B (released April
2024) are sparse Mixture-of-Experts language models from Mistral AI,
both licensed under Apache 2.0.

+-----------------------+-------------------------------------------+-------------------------------------------+
| Property              | Mixtral 8x7B                              | Mixtral 8x22B                             |
+=======================+===========================================+===========================================+
| Total parameters      | 46.7B                                     | 141B                                      |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Active parameters per | 12.9B (top-2 of 8 experts)                | 39B (top-2 of 8 experts)                  |
| token                 |                                           |                                           |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Experts per layer     | 8                                         | 8                                         |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Active experts per    | 2                                         | 2                                         |
| token                 |                                           |                                           |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Attention mechanism   | GQA                                       | GQA                                       |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Context window        | 32,768 tokens                             | 65,536 tokens                             |
+-----------------------+-------------------------------------------+-------------------------------------------+
| BF16 VRAM requirement | ~94 GB                                    | ~263 GB                                   |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Licence               | Apache 2.0                                | Apache 2.0                                |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Hugging Face          | ``mistralai/Mixtral-8x7B-v0.1``           | ``mistralai/Mixtral-8x22B-v0.1``          |
| identifier (base)     |                                           |                                           |
+-----------------------+-------------------------------------------+-------------------------------------------+
| Hugging Face          | ``mistralai/Mixtral-8x7B-Instruct-v0.1``  | ``mistralai/Mixtral-8x22B-Instruct-v0.1`` |
| identifier (instruct) |                                           |                                           |
+-----------------------+-------------------------------------------+-------------------------------------------+

Both models are natively supported in vLLM without
``--trust-remote-code``. Neither model uses MLA attention, reasoning
tokens, or requires a tool-call parser — deployment is substantially
simpler than Kimi K2.5/K2.6.

The active parameter count governs compute per forward pass. Mixtral
8x7B activates 12.9B parameters per token, so each token costs roughly
the compute of a 13B dense model; Mixtral 8x22B activates 39B, roughly
the compute of a 39B dense model. Memory is governed by the total
parameter count instead, because every expert must be resident in VRAM.
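
The weight sizes in the table follow from the total parameter counts at
2 bytes per BF16 parameter. The sketch below reproduces that arithmetic;
the function name ``bf16_weight_size`` is illustrative and not part of
any tool used in this guide.

.. code:: bash

   #!/bin/bash
   # Estimate BF16 weight size from a parameter count given in billions.
   # BF16 stores each parameter in 2 bytes; KV cache and activations are
   # not included.
   bf16_weight_size() {
       awk -v p="$1" 'BEGIN {
           bytes = p * 1e9 * 2
           printf "%.1f GB, %.1f GiB\n", bytes / 1e9, bytes / (1024 ^ 3)
       }'
   }

   bf16_weight_size 46.7   # Mixtral 8x7B  -> 93.4 GB, 87.0 GiB
   bf16_weight_size 141    # Mixtral 8x22B -> 282.0 GB, 262.6 GiB

By this estimate the ~94 GB figure for 8x7B is in decimal gigabytes,
while the ~263 figure for 8x22B corresponds to binary GiB (about 282 GB
decimal). Keep the unit in mind when budgeting against the 141 GB
(decimal) capacity of each H200.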

--------------

.. _mix-hardware-and-software-prerequisites:

2. Hardware and software prerequisites
--------------------------------------

DGX H200 system specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The NVIDIA DGX H200 provides the following hardware relevant to this
deployment:

+-----------------------------------+-----------------------------------+
| Component                         | Specification                     |
+===================================+===================================+
| GPUs                              | 8x NVIDIA H200 SXM Tensor Core    |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| GPU memory                        | 141 GB HBM3e per GPU, 1,128 GB    |
|                                   | total                             |
+-----------------------------------+-----------------------------------+
| GPU memory bandwidth              | 4.8 TB/s per GPU                  |
+-----------------------------------+-----------------------------------+
| GPU interconnect                  | 18x NVLink 4.0 connections per    |
|                                   | GPU, 900 GB/s bidirectional per   |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| NVSwitch                          | 4x NVSwitch, 7.2 TB/s aggregate   |
|                                   | bidirectional GPU-to-GPU          |
|                                   | bandwidth                         |
+-----------------------------------+-----------------------------------+
| Host CPUs                         | 2x Intel Xeon Platinum 8480C, 112 |
|                                   | cores total                       |
+-----------------------------------+-----------------------------------+
| System memory                     | 2 TB DDR5                         |
+-----------------------------------+-----------------------------------+
| NVMe storage                      | 8x 3.84 TB (data), 2x 1.92 TB     |
|                                   | (OS)                              |
+-----------------------------------+-----------------------------------+
| Network                           | 10x ConnectX-7, 400 Gb/s          |
|                                   | InfiniBand/Ethernet               |
+-----------------------------------+-----------------------------------+

Both Mixtral models fit on a single DGX H200 node with substantial KV
cache headroom. In BF16, the Mixtral 8x7B weights fit on 1–2 GPUs.
Mixtral 8x22B needs 4 GPUs in BF16: at TP=2 the weight shard alone is
over 130 GB per GPU, which leaves no usable room for KV cache on a
141 GB device. Neither model requires a full 8-GPU allocation, which
matters on a shared cluster with only 4 nodes in total. The scripts in
this guide use TP=2 for 8x7B and TP=4 for 8x22B, leaving the remaining
GPUs available for other users.

Software requirements
~~~~~~~~~~~~~~~~~~~~~

============= =============== ========================================
Component     Minimum version Notes
============= =============== ========================================
CUDA toolkit  12.1            12.8 required for FP8 KV cache on Hopper
NVIDIA driver 535.x           560+ recommended
Python        3.11            as specified in the Conda environment
vLLM          0.19.1          pin this version for stability
PyTorch       2.5+            installed as a vLLM dependency
============= =============== ========================================

On Discoverer+, CUDA libraries are provided through the cluster
environment module system and do not need to be installed manually
inside the Conda environment.

--------------

.. _mix-environment-setup-with-conda-on-discoverer:

3. Environment setup with Conda on Discoverer+
----------------------------------------------

On Discoverer+, Conda is provided through the centralised Anaconda
installation and accessed via the module system. Do not install a
separate Anaconda or Miniconda distribution in your home or project
directory.

The recommended location for virtual environments on Discoverer+ is:

::

   /valhalla/projects/<your_slurm_project_account_name>/virt_envs/

Creating the vLLM environment via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Environment creation must not be run on the login node. Submit a SLURM
batch job instead.

Save the following as ``create_vllm_mixtral_env.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=create_vllm_mixtral_env
   #SBATCH --time=00:30:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G

   #SBATCH -o create_vllm_mixtral_env.%j.out
   #SBATCH -e create_vllm_mixtral_env.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

   [ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

   conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

   if [ $? -ne 0 ]; then
       echo "Conda environment creation failed." >&2
       exit 1
   fi

   echo "Conda environment created successfully."
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Environment ready for vLLM installation."

Submit and verify:

.. code:: bash

   sbatch create_vllm_mixtral_env.sh
   cat create_vllm_mixtral_env.<jobid>.out

--------------

.. _mix-installing-vllm-in-the-conda-environment:

4. Installing vLLM in the Conda environment
-------------------------------------------

Why pip is used inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Conda is the preferred package manager on Discoverer+. For vLLM, the
conda-forge channel only carries versions up to 0.10.x, significantly
behind the 0.19.1 release. The vLLM project distributes current releases
exclusively through PyPI wheels, so pip is necessary for this package.

Pip installs into the Conda environment provided that
``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` is set before calling pip.
This causes pip to install all packages into
``${VIRTUAL_ENV}/lib/python3.11/site-packages/`` — entirely within the
project storage path on ``/valhalla``. Nothing is written to
``~/.local`` or the home directory.

To confirm the correct pip binary is active at any point during a job:

.. code:: bash

   which pip
   # must print: /valhalla/projects/<account>/virt_envs/vllm-mixtral/bin/pip
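
The same check can be made automatic inside a job script. The following
is a minimal sketch; ``tool_is_in_env`` is a hypothetical helper name,
and it only compares path prefixes, so it does not prove that the
environment itself is intact.

.. code:: bash

   #!/bin/bash
   # Succeeds only when the resolved tool path lives inside <prefix>/bin/.
   # Arguments: resolved tool path, environment prefix.
   tool_is_in_env() {
       local tool_path=$1 env_prefix=$2
       case "${tool_path}" in
           "${env_prefix}"/bin/*) return 0 ;;
           *) return 1 ;;
       esac
   }

   # Typical use after exporting VIRTUAL_ENV and PATH in a job script:
   #   tool_is_in_env "$(command -v pip)" "${VIRTUAL_ENV}" \
   #       || { echo "pip resolves outside ${VIRTUAL_ENV}. Exiting." >&2; exit 1; }

Failing early this way avoids a silent install into ``~/.local`` when the
``PATH`` export is missing or misspelled.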

Installing vLLM via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save the following as ``install_vllm_mixtral.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=install_vllm_mixtral
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o install_vllm_mixtral.%j.out
   #SBATCH -e install_vllm_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   pip install "vllm==0.19.1"

   if [ $? -ne 0 ]; then
       echo "vLLM installation failed." >&2
       exit 1
   fi

   pip install "huggingface_hub[cli]"

   echo "vLLM installation complete."
   echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
   echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

Submit:

.. code:: bash

   sbatch install_vllm_mixtral.sh

Verify in the job output that the install location is under
``/valhalla`` and not ``~/.local``.

Downloading model weights
~~~~~~~~~~~~~~~~~~~~~~~~~

Both Mixtral models are available on Hugging Face under Apache 2.0.
Download to project storage, not to the home directory. Adjust the model
identifier and directory for whichever model you are deploying.

Save as ``download_mixtral.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=download_mixtral
   #SBATCH --time=02:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o download_mixtral.%j.out
   #SBATCH -e download_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   # Set MODEL_ID to the desired variant:
   #   mistralai/Mixtral-8x7B-Instruct-v0.1   (~94 GB BF16)
   #   mistralai/Mixtral-8x22B-Instruct-v0.1  (~263 GB BF16)
   MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
   MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

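   # --local-dir writes real files into MODEL_DIR. The older
   # --local-dir-use-symlinks option is deprecated in recent
   # huggingface_hub releases and is omitted here.
   huggingface-cli download ${MODEL_ID} \
       --local-dir ${MODEL_DIR}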

   echo "Download complete. Weights at ${MODEL_DIR}."

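Approximate download times: under 1 hour for 8x7B (~94 GB) and 1–2 hours
for 8x22B (~263 GB). The 2-hour ``--time`` limit in the script leaves no
margin for 8x22B, so raise it before downloading that model.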

--------------

.. _mix-baseline-deployment:

5. Baseline deployment
----------------------

The following SLURM jobs start vLLM inference servers for each model.
All flags listed are required or strongly recommended for correct
behaviour. The two models use different GPU counts and tensor parallel
sizes, so separate scripts are provided.

Mixtral 8x7B baseline
~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x7b_baseline.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x7b
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G

   #SBATCH -o vllm_mixtral_8x7b.%j.out
   #SBATCH -e vllm_mixtral_8x7b.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 2 \
       --dtype bfloat16

Mixtral 8x22B baseline
~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x22b_baseline.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x22b
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:4
   #SBATCH --cpus-per-task=56
   #SBATCH --mem=512G

   #SBATCH -o vllm_mixtral_8x22b.%j.out
   #SBATCH -e vllm_mixtral_8x22b.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 4 \
       --dtype bfloat16

Why no ``--trust-remote-code``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both Mixtral architectures are natively registered in vLLM. The flag is
not required and should not be passed unless loading a custom or
modified checkpoint.

Why no ``--reasoning-parser`` or ``--tool-call-parser``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

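Mixtral models do not emit structured reasoning tokens, so no reasoning
parser applies. Named function calling (a request that pins
``tool_choice`` to one function) is handled by vLLM’s OpenAI-compatible
API without a model-specific parser. Automatic tool choice is a separate
feature that needs ``--enable-auto-tool-choice`` together with a
``--tool-call-parser``; it is not used in this guide.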

--------------

.. _mix-memory-layout-and-gpu-allocation:

6. Memory layout and GPU allocation
-----------------------------------

VRAM consumption breakdown (BF16 weights)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-----------------------+-----------------------+-----------------------+
| Component             | Mixtral 8x7B (TP=2)   | Mixtral 8x22B (TP=4)  |
+=======================+=======================+=======================+
| All expert weights    | ~94 GB total, ~47 GB  | ~263 GB total, ~66 GB |
| (BF16)                | per GPU               | per GPU               |
+-----------------------+-----------------------+-----------------------+
| Activations and CUDA  | ~2–4 GB per GPU       | ~2–4 GB per GPU       |
| overhead              |                       |                       |
+-----------------------+-----------------------+-----------------------+
| KV cache (remainder)  | ~90 GB per GPU        | ~70 GB per GPU        |
+-----------------------+-----------------------+-----------------------+

Each H200 GPU in the DGX H200 has 141 GB of HBM3e. With TP=2 for 8x7B
and TP=4 for 8x22B, both models leave substantial KV cache headroom,
much more than the Kimi models, because the weights are significantly
smaller.

GPU memory utilisation
~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 is safe for both models on
H200 and recovers additional KV cache space.
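
To see what the utilisation setting means in gigabytes, the sketch below
estimates the per-GPU KV cache budget. It assumes the budget is
``capacity × utilisation`` and that weights, activations, and KV cache
all come out of it, which is a simplification. ``kv_budget_gb`` is an
illustrative name, and the 4 GB overhead is the upper end of the range
in the table above.

.. code:: bash

   #!/bin/bash
   # Rough per-GPU KV cache budget in GB.
   # Arguments: total weight size (GB), tensor-parallel size,
   #            --gpu-memory-utilization value, per-GPU overhead (GB).
   kv_budget_gb() {
       awk -v w="$1" -v tp="$2" -v util="$3" -v ovh="$4" 'BEGIN {
           gpu_gb = 141    # H200 HBM3e capacity per GPU
           printf "%.1f\n", gpu_gb * util - w / tp - ovh
       }'
   }

   kv_budget_gb 94 2 0.92 4    # Mixtral 8x7B,  TP=2 -> 78.7
   kv_budget_gb 263 4 0.92 4   # Mixtral 8x22B, TP=4 -> 60.0

These values are lower than the "remainder" row of the table, which
subtracts from the full 141 GB. Dropping from 0.92 to the default 0.90
removes a further 141 × 0.02 ≈ 2.8 GB per GPU.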

Context length
~~~~~~~~~~~~~~

.. code:: bash

   --max-model-len 32768   # for Mixtral 8x7B
   --max-model-len 65536   # for Mixtral 8x22B

The ``config.json`` for Mixtral 8x7B sets ``max_position_embeddings`` to
32,768 and ``sliding_window`` to null, so sliding window attention is
not active and the practical maximum for vLLM serving is 32,768 tokens.
Mixtral 8x22B sets ``max_position_embeddings`` to 65,536. For
short-context workloads, set ``--max-model-len`` below the model
default: the worst-case KV cache footprint of a single request scales
with this value, so a lower limit lets the same cache hold more
concurrent sequences.
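
The limit stored with a downloaded checkpoint can be read directly from
its ``config.json``. The helper below is a sketch that relies on
``grep`` and ``sed`` only; ``config_int`` is a hypothetical name, and it
assumes the key appears once with an integer value.

.. code:: bash

   #!/bin/bash
   # Print an integer field from a Hugging Face config.json.
   # Arguments: path to config.json, key name.
   config_int() {
       local file=$1 key=$2
       grep -o "\"${key}\"[[:space:]]*:[[:space:]]*[0-9]*" "${file}" \
           | sed 's/.*:[[:space:]]*//'
   }

   # Example, with MODEL_PATH as defined in the serve scripts:
   #   config_int "${MODEL_PATH}/config.json" max_position_embeddings

A ``null`` value, as for ``sliding_window``, yields an empty result
rather than a number.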

GPU count and shared cluster etiquette
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discoverer+ has 4 DGX H200 nodes and 32 GPUs in total. Using only 2 GPUs
for Mixtral 8x7B and 4 GPUs for Mixtral 8x22B leaves the majority of the
node available for other users. Do not request 8 GPUs for either model.

--------------

.. _mix-expert-parallelism:

7. Expert parallelism
---------------------

Background
~~~~~~~~~~

Expert parallelism (EP) assigns different experts to different GPUs
rather than sharding each expert’s weight matrix across all TP ranks.
For Mixtral’s 8 experts distributed across the TP group, EP reduces the
inter-GPU communication volume per forward pass because tokens route to
the GPU holding the relevant expert rather than all-reducing partial
results across all GPUs.

Enabling expert parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --enable-expert-parallel

This flag changes the communication pattern of the MoE layers and only
takes effect when ``tensor-parallel-size × data-parallel-size > 1``.
Both configurations in this guide, TP=2 (8x7B) and TP=4 (8x22B), satisfy
this condition.

Mixtral has 8 experts per layer, so each GPU holds 4 experts at TP=2 and
2 experts at TP=4. The flag is expected to help in both configurations;
confirm this with a benchmark run with and without it.
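
The split of experts across GPUs can be pictured with a small sketch. It
assumes a contiguous assignment of expert indices to ranks, which is an
illustration and not a statement about how vLLM orders experts
internally; ``expert_layout`` is a hypothetical name.

.. code:: bash

   #!/bin/bash
   # List which expert indices each rank would hold under a contiguous split.
   # Arguments: number of experts, expert-parallel group size.
   expert_layout() {
       local experts=$1 ranks=$2
       local per_rank=$(( experts / ranks ))
       local r first last
       for (( r = 0; r < ranks; r++ )); do
           first=$(( r * per_rank ))
           last=$(( first + per_rank - 1 ))
           echo "rank ${r}: experts ${first}-${last}"
       done
   }

   expert_layout 8 2   # Mixtral 8x7B at TP=2
   expert_layout 8 4   # Mixtral 8x22B at TP=4

With top-2 routing, a token's two experts may sit on the same rank or on
different ranks, so per-rank load depends on the router's choices for
the batch.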

--------------

.. _mix-kv-cache-optimisation:

8. KV cache optimisation
------------------------

FP8 KV cache
~~~~~~~~~~~~

.. code:: bash

   --kv-cache-dtype fp8

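Quantising the KV cache from BF16 to FP8 halves memory per cached token
and reduces memory bandwidth during attention decode steps. FP8 KV cache
needs CUDA 11.8 or later as a baseline; the software requirements table
in section 2 lists CUDA 12.8 as required for FP8 KV cache on Hopper, so
treat 12.8 as the effective minimum here. The vLLM team has validated
the feature on H200 (Hopper architecture). Both Mixtral models use GQA,
which already yields a smaller KV cache than standard MHA; FP8 halves it
again.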

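Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to
1.0. For better accuracy, use calibrated scales. This guide's original
recommendation is a scale file passed via ``--quantization-param-path``;
confirm with ``vllm serve --help`` that your installed version still
accepts that flag before relying on it.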
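
The halving can be checked with the standard per-token KV cache formula.
The sketch below assumes the layer counts, KV head counts, and head
dimension from the published ``config.json`` files (32 layers for 8x7B,
56 for 8x22B, 8 KV heads, head dimension 128); verify them against your
local checkpoint. ``kv_bytes_per_token`` is an illustrative name.

.. code:: bash

   #!/bin/bash
   # KV cache bytes per token, summed over all layers and the TP group:
   #   2 (K and V) x layers x KV heads x head dim x bytes per element
   kv_bytes_per_token() {
       local layers=$1 kv_heads=$2 head_dim=$3 elem_bytes=$4
       echo $(( 2 * layers * kv_heads * head_dim * elem_bytes ))
   }

   kv_bytes_per_token 32 8 128 2   # Mixtral 8x7B,  BF16 -> 131072
   kv_bytes_per_token 32 8 128 1   # Mixtral 8x7B,  FP8  -> 65536
   kv_bytes_per_token 56 8 128 2   # Mixtral 8x22B, BF16 -> 229376
   kv_bytes_per_token 56 8 128 1   # Mixtral 8x22B, FP8  -> 114688

Under these assumptions a single 32,768-token sequence on 8x7B needs
4 GiB of KV cache in BF16 and 2 GiB in FP8.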

Prefix caching
~~~~~~~~~~~~~~

.. code:: bash

   --enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating
redundant prefill computation. Particularly effective for RAG workloads
that prepend the same system prompt and document chunks across many
requests. Enabled by default in vLLM V1; specify explicitly if using an
older version.

CPU offload (swap space)
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --swap-space 16

The value is in GiB per GPU. A lower value than the Kimi guides is
appropriate here because both Mixtral models leave far more KV cache
headroom; swap is less likely to be needed. Adjust upward if preemption
warnings appear in the vLLM logs:

::

   WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

--------------

.. _mix-chunked-prefill-and-scheduler-tuning:

9. Chunked prefill and scheduler tuning
---------------------------------------

Chunked prefill
~~~~~~~~~~~~~~~

.. code:: bash

   --enable-chunked-prefill \
   --max-num-batched-tokens 8192

Chunked prefill breaks large prefill computations into chunks
interleaved with decode steps, preventing single long-context requests
from blocking decode throughput for all other in-flight requests.
Particularly relevant for Mixtral 8x22B with its 64K context window.

``--max-num-batched-tokens`` controls the total tokens processed per
scheduling step. A value of 8,192 is a reasonable starting point for
both models on H200.
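
As a rough guide to what the token budget means for long prompts, the
sketch below computes the minimum number of scheduling steps needed to
prefill a prompt. It ignores decode tokens, which share the same budget,
so real step counts can be higher. ``prefill_steps`` is an illustrative
name.

.. code:: bash

   #!/bin/bash
   # Minimum number of scheduler steps to prefill a prompt when each step
   # may process at most `budget` tokens (ceiling division).
   prefill_steps() {
       local prompt_tokens=$1 budget=$2
       echo $(( (prompt_tokens + budget - 1) / budget ))
   }

   prefill_steps 65536 8192   # full-context 8x22B prompt -> 8
   prefill_steps 3000 8192    # short prompt fits in one step -> 1

A larger budget shortens prefill for long prompts but lets each step
take longer, which raises inter-token latency for requests that are
already decoding.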

Max concurrent sequences
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --max-num-seqs 256

The default is 256. Reduce if KV cache OOM errors occur under high
concurrency; increase if GPU utilisation is consistently below 80%. Both
Mixtral models have substantial remaining KV cache headroom, so the
default 256 is generally viable without reduction.

--------------

.. _mix-speculative-decoding-with-n-gram-prompt-lookup:

10. Speculative decoding with n-gram prompt lookup
--------------------------------------------------

Neither Mistral AI nor the vLLM community has published dedicated draft
models trained specifically on the Mixtral architecture. The appropriate
speculative decoding strategy for Mixtral on vLLM is n-gram prompt
lookup decoding, which requires no additional model download.

How n-gram speculative decoding works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

N-gram speculative decoding matches the last N tokens of the current
generation against occurrences of those same tokens in the input prompt,
then proposes the tokens that follow in the prompt as draft candidates.
The main model verifies them in a single forward pass. This is
particularly effective for RAG workloads where the model is likely to
quote or closely paraphrase retrieved document content.

No additional VRAM is required — the draft proposals are generated from
the input context without loading any additional model weights.
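
The matching step can be illustrated with a toy proposer that works on
whitespace-separated words instead of model tokens. This is a sketch of
the idea only, not vLLM's implementation: it returns the first match
found for the longest matching n-gram, and ``ngram_propose`` is a
hypothetical name.

.. code:: bash

   #!/bin/bash
   # Toy prompt-lookup proposer over whitespace-separated words.
   # Arguments: prompt text, generated text, min n-gram, max n-gram,
   #            number of draft tokens to propose.
   ngram_propose() {
       local -a prompt=($1)
       local -a gen=($2)
       local min_n=$3 max_n=$4 k=$5
       local n i j start from match
       # Try the longest n-gram first so the most specific match wins.
       for (( n = max_n; n >= min_n; n-- )); do
           (( n > ${#gen[@]} )) && continue
           start=$(( ${#gen[@]} - n ))
           # Leave at least one prompt token after the match to propose.
           for (( i = 0; i + n < ${#prompt[@]}; i++ )); do
               match=1
               for (( j = 0; j < n; j++ )); do
                   if [[ "${prompt[i + j]}" != "${gen[start + j]}" ]]; then
                       match=0
                       break
                   fi
               done
               if (( match )); then
                   from=$(( i + n ))
                   echo "${prompt[@]:from:k}"
                   return 0
               fi
           done
       done
       return 1   # no match: nothing is proposed for this step
   }

   ngram_propose "the capital of France is Paris" "I think the capital of" 2 10 3

The call prints ``France is Paris``: the last three generated words
match the start of the prompt, and the three words that follow become
the draft. The target model then accepts or rejects those drafts during
verification.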

Enabling n-gram speculative decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The verified syntax in vLLM 0.19.1 is:

.. code:: bash

   --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Fields:

-  ``method``: must be ``"ngram"`` exactly
-  ``num_speculative_tokens``: number of tokens proposed per step; 5 is
   a reasonable starting value for RAG workloads
-  ``prompt_lookup_min``: minimum n-gram length to match; 2 means at
   least a 2-token match is required before proposing
-  ``prompt_lookup_max``: maximum n-gram length to search; larger values
   find more specific matches but incur slightly higher search cost per
   step
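
Quoting JSON inside a shell command line is easy to get wrong, so it can
help to build the value from variables. The helper below is a sketch;
``ngram_spec_config`` is a hypothetical name, and the field names are
the ones listed above.

.. code:: bash

   #!/bin/bash
   # Build the --speculative-config JSON from shell variables.
   # Arguments: num_speculative_tokens, prompt_lookup_min, prompt_lookup_max.
   ngram_spec_config() {
       local k=$1 min_n=$2 max_n=$3
       if (( min_n > max_n )); then
           echo "prompt_lookup_min must not exceed prompt_lookup_max" >&2
           return 1
       fi
       printf '{"method": "ngram", "num_speculative_tokens": %d, "prompt_lookup_min": %d, "prompt_lookup_max": %d}\n' \
           "${k}" "${min_n}" "${max_n}"
   }

   SPEC_CONFIG=$(ngram_spec_config 5 2 10)
   echo "${SPEC_CONFIG}"
   # Pass it to the server as:  --speculative-config "${SPEC_CONFIG}"

Keeping the three numbers in variables also makes it simple to sweep
them when benchmarking acceptance rates.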

Effectiveness
~~~~~~~~~~~~~

N-gram speculative decoding provides a meaningful latency reduction only
when the model output closely follows the input prompt — which is the
case for document summarisation, extraction, and RAG answer generation.
For open-ended generation tasks where the model does not repeat input
text, the acceptance rate will be low and the overhead may reduce
throughput marginally. Benchmark both configurations on your
representative workload before deploying n-gram speculation to
production.

--------------

.. _mix-moe-triton-kernel-tuning:

11. MoE Triton kernel tuning
----------------------------

Without a tuned configuration, vLLM logs at startup:

::

   WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The ``benchmark_moe.py`` script writes a hardware-specific JSON file
into a target directory. Setting ``VLLM_TUNED_CONFIG_FOLDER`` to that
directory before serving causes vLLM to load it automatically. Run
separate tuning jobs for 8x7B and 8x22B since their expert dimensions
differ.

Running the tuning script via SLURM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

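Save as ``tune_moe_mixtral.sh``, adjusting ``MODEL_PATH``,
``TUNING_DIR``, and ``--tp-size`` for the model you are tuning.
``benchmarks/kernels/benchmark_moe.py`` ships in the vLLM source
repository rather than in the PyPI wheel, and the script below calls it
by relative path, so submit the job from the root of a vLLM checkout
that matches the installed version: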

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=tune_moe_mixtral
   #SBATCH --time=02:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G

   #SBATCH -o tune_moe_mixtral.%j.out
   #SBATCH -e tune_moe_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   # Adjust for 8x7B (tp-size 2) or 8x22B (tp-size 4 with gres=gpu:4)
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

   mkdir -p ${TUNING_DIR}

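   # --dtype auto takes the weight dtype (BF16 here) from the model config.
   # Check `python benchmarks/kernels/benchmark_moe.py --help` for the
   # values your checkout accepts.
   python benchmarks/kernels/benchmark_moe.py \
       --model ${MODEL_PATH} \
       --tp-size 2 \
       --dtype auto \
       --tune \
       --save-dir ${TUNING_DIR}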

   echo "Tuning complete. Config written to ${TUNING_DIR}."

For Mixtral 8x22B, change ``--gres=gpu:4``, ``--tp-size 4``, and both
``MODEL_PATH`` and ``TUNING_DIR`` accordingly.

The tuning run takes 30–90 minutes. Re-run if you change GPU count, TP
size, or model.

Loading the tuned configuration at serve time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

Set this before the ``vllm serve`` call. vLLM logs confirmation of
loading it at startup.
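
The tuning job creates ``TUNING_DIR`` before the benchmark runs, so the
directory can exist while empty if tuning failed. A stricter guard
checks for at least one JSON file before exporting the variable. This is
a sketch; ``has_tuned_config`` is a hypothetical helper, and it does not
validate the file contents.

.. code:: bash

   #!/bin/bash
   # Succeeds when the directory exists and holds at least one .json file.
   has_tuned_config() {
       local dir=$1 f
       [ -d "${dir}" ] || return 1
       for f in "${dir}"/*.json; do
           [ -e "${f}" ] && return 0
       done
       return 1
   }

   # In a serve script, with TUNING_DIR as defined in the tuning job:
   #   if has_tuned_config "${TUNING_DIR}"; then
   #       export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
   #   else
   #       echo "No tuned MoE config in ${TUNING_DIR}; using vLLM defaults." >&2
   #   fi

Logging the fallback makes it obvious in the job output why the default
MoE config warning appears.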

--------------

.. _mix-full-optimised-slurm-job-scripts:

12. Full optimised SLURM job scripts
------------------------------------

Mixtral 8x7B — full optimised script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x7b_optimised.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x7b_opt
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G

   #SBATCH -o vllm_mixtral_8x7b_opt.%j.out
   #SBATCH -e vllm_mixtral_8x7b_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 2 \
       --dtype bfloat16 \
       --gpu-memory-utilization 0.92 \
       --max-model-len 32768 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 16 \
       --enable-expert-parallel \
       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

.. code:: bash

   sbatch serve_mixtral_8x7b_optimised.sh

Mixtral 8x22B — full optimised script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x22b_optimised.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x22b_opt
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:4
   #SBATCH --cpus-per-task=56
   #SBATCH --mem=512G

   #SBATCH -o vllm_mixtral_8x22b_opt.%j.out
   #SBATCH -e vllm_mixtral_8x22b_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b

   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 4 \
       --dtype bfloat16 \
       --gpu-memory-utilization 0.92 \
       --max-model-len 65536 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 16 \
       --enable-expert-parallel \
       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

.. code:: bash

   sbatch serve_mixtral_8x22b_optimised.sh

The vLLM server binds to port 8000 by default. The serve scripts do not
print the compute node hostname, so look it up with
``squeue -j <jobid> -o %N`` once the job is running, then connect your
client to ``http://<node_hostname>:8000/v1``.
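
Loading the weights takes several minutes, so a client started right
after ``sbatch`` will find the port closed. The sketch below polls until
the server answers. ``wait_for_url`` is a hypothetical helper; it
assumes ``curl`` is available and uses the ``/health`` and
``/v1/models`` routes of vLLM's OpenAI-compatible server.

.. code:: bash

   #!/bin/bash
   # Poll a URL until it answers successfully or the attempts run out.
   # Arguments: URL, max attempts, seconds between attempts.
   wait_for_url() {
       local url=$1 attempts=$2 delay=$3 i
       for (( i = 1; i <= attempts; i++ )); do
           if curl --silent --fail --output /dev/null "${url}"; then
               return 0
           fi
           sleep "${delay}"
       done
       return 1
   }

   # Example, with <node_hostname> filled in:
   #   wait_for_url "http://<node_hostname>:8000/health" 60 10 \
   #       && curl --silent "http://<node_hostname>:8000/v1/models"

The ``/v1/models`` response also shows the exact model name that client
requests must use.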

--------------

.. _mix-benchmarking:

13. Benchmarking
----------------

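Submit benchmarks as SLURM jobs. Replace ``<inference_node_hostname>``
with the hostname of the node running the server job. The ``--model``
value must match the model name the server registered, which is the
local ``MODEL_PATH`` unless the server was started with
``--served-model-name``. The benchmark client only sends HTTP requests,
so its ``--gres`` request does not need to scale with the model. Save as
``benchmark_mixtral.sh``: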

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=benchmark_mixtral
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G

   #SBATCH -o benchmark_mixtral.%j.out
   #SBATCH -e benchmark_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   SERVER_URL=http://<inference_node_hostname>:8000

   # Adjust --model to match whichever model is running on the server
   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
       --dataset-name sharegpt \
       --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
       --num-prompts 1000 \
       --request-rate 20

Key metrics to track
~~~~~~~~~~~~~~~~~~~~

-  output tokens per second (decode throughput)
-  time to first token (TTFT) — prefill latency
-  inter-token latency (ITL) — decode latency per token
-  KV cache utilisation — reported in vLLM logs and Prometheus metrics
-  speculative decoding acceptance rate — reported in vLLM logs when
   n-gram is active; low acceptance rates indicate the workload does not
   benefit from prompt lookup speculation
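
The throughput and latency figures in the list above are printed to the
benchmark job's standard output, which the script writes to
``benchmark_mixtral.<job_id>.out``. The sketch below filters that file
down to the relevant lines. The helper name ``summarise_bench`` is
illustrative, and the label wording it matches (``throughput``, ``TTFT``,
``ITL``) is an assumption about the report format of the installed vLLM
release; adjust the pattern if your output differs.

.. code:: bash

   # Print the throughput and latency lines from a benchmark job log.
   # Usage: summarise_bench <benchmark_output_file>
   summarise_bench() {
       grep -E 'throughput|TTFT|ITL' "$1"
   }

   # Example:
   #   summarise_bench benchmark_mixtral.<job_id>.out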

Profiling GPU utilisation
~~~~~~~~~~~~~~~~~~~~~~~~~

From within a SLURM job on the compute node:

.. code:: bash

   nvidia-smi dmon -s u -d 1

The allocated GPUs (2 for 8x7B, 4 for 8x22B) should show compute
utilisation above 70% under sustained load.
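
To compare a run against that threshold without watching the terminal,
the ``dmon`` samples can be averaged per GPU. This is a small sketch
under the assumption that the second column of the ``-s u`` output is
the SM utilisation percentage and that header lines start with ``#``;
the helper name ``avg_sm_util`` is illustrative.

.. code:: bash

   # Average the SM utilisation column of `nvidia-smi dmon -s u` per GPU.
   # Reads dmon output on stdin; header lines (starting with '#') are skipped.
   avg_sm_util() {
       awk '!/^#/ && NF >= 2 { sum[$1] += $2; n[$1]++ }
            END { for (g in sum) printf "GPU %s: %.1f%% SM\n", g, sum[g] / n[g] }' | sort
   }

   # Example: collect 60 one-second samples, then print the averages.
   #   nvidia-smi dmon -s u -d 1 -c 60 | avg_sm_util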

--------------

.. _mix-known-caveats-and-constraints:

14. Known caveats and constraints
---------------------------------

pip inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every pip invocation in this guide runs after
``export PATH=${VIRTUAL_ENV}/bin:${PATH}``, so that ``pip`` resolves to
the environment's own copy. If a SLURM script omits this line and calls
pip directly, packages will install into
``~/.local/lib/python3.11/site-packages/`` and consume home directory
quota.
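
A script can guard against this by checking where ``pip`` resolves
before installing anything. The sketch below is one way to do it; the
helper name ``path_in_env`` is illustrative and not part of Conda or
pip.

.. code:: bash

   # Return 0 only if the given executable path lives under <env_prefix>/bin.
   # Usage: path_in_env <executable_path> <env_prefix>
   path_in_env() {
       local exe=$1 env_prefix=$2
       [[ -n ${exe} && ${exe} == "${env_prefix}"/bin/* ]]
   }

   # Example guard, placed after the PATH export and before any pip call:
   #   path_in_env "$(command -v pip)" "${VIRTUAL_ENV}" \
   #       || { echo "pip is outside ${VIRTUAL_ENV}. Exiting."; exit 1; }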

N-gram speculative decoding in vLLM V1 mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known issue (vLLM GitHub issue #16883) in which n-gram
speculative decoding does not function correctly when vLLM is running in
V1 engine mode (``VLLM_USE_V1=1``). In vLLM 0.19.1, V1 is enabled by
default for supported models. If n-gram speculation produces no speedup
or incorrect output, disable V1 mode by setting ``VLLM_USE_V1=0`` before
the serve call:

.. code:: bash

   export VLLM_USE_V1=0

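Place this line above the ``vllm serve`` call in the serve script. If
the installed vLLM release no longer includes the V0 engine, this
variable cannot switch engines; in that case, remove
``--speculative-config`` from the serve command instead.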

N-gram speculation and workload fit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

N-gram prompt lookup only benefits workloads where the model output
closely follows the input prompt. For open-ended chat, creative writing,
or reasoning tasks that do not quote the input, the acceptance rate will
be low and the speculative overhead may cause a modest throughput
reduction. Measure the acceptance rate in vLLM logs before committing to
n-gram speculation in production.
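
Besides the logs, the acceptance rate can be estimated from the server's
Prometheus ``/metrics`` endpoint as accepted tokens divided by drafted
tokens. The sketch below assumes the two counters are named
``vllm:spec_decode_num_accepted_tokens_total`` and
``vllm:spec_decode_num_draft_tokens_total``; these names are assumptions
and may differ between vLLM releases, so both can be overridden as
arguments. The helper name ``acceptance_rate`` is illustrative.

.. code:: bash

   # Read Prometheus text on stdin and print accepted/drafted as a percentage.
   # Usage: acceptance_rate [accepted_counter_name] [draft_counter_name]
   acceptance_rate() {
       local accepted=${1:-vllm:spec_decode_num_accepted_tokens_total}
       local drafted=${2:-vllm:spec_decode_num_draft_tokens_total}
       awk -v a="${accepted}" -v d="${drafted}" '
           /^#/ { next }                        # skip HELP/TYPE comments
           index($0, a) == 1 { acc += $NF }     # sample value is the last field
           index($0, d) == 1 { drf += $NF }
           END {
               if (drf > 0) { printf "%.1f%%\n", 100 * acc / drf }
               else { print "no draft tokens recorded" }
           }'
   }

   # Example:
   #   curl -s "http://<node_hostname>:8000/metrics" | acceptance_rate

The counters are cumulative since server start, so the result is an
average over all requests served so far rather than a recent window.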

FP8 KV cache with chunked prefill
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Combining ``--kv-cache-dtype fp8`` with ``--enable-chunked-prefill`` is
known to cause problems in vLLM versions earlier than 0.17.0. The vLLM
0.19.1 release recommended in this guide is not affected.
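
If the environment may have been built with an older release, a serve
script can check the installed version before combining the two flags.
This is a minimal sketch that relies on GNU ``sort -V`` for version
ordering; the helper name ``version_ge`` is illustrative.

.. code:: bash

   # Return 0 if version $1 is greater than or equal to version $2.
   version_ge() {
       [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
   }

   # Example, placed before the vllm serve call:
   #   VLLM_VERSION=$(python -c 'import vllm; print(vllm.__version__)')
   #   version_ge "${VLLM_VERSION}" 0.17.0 \
   #       || { echo "vLLM ${VLLM_VERSION} is older than 0.17.0. Exiting."; exit 1; }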

Mixtral 8x7B context window
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``config.json`` for ``mistralai/Mixtral-8x7B-v0.1`` sets
``"sliding_window": null`` and ``"max_position_embeddings": 32768``.
Mixtral 8x7B does not use Sliding Window Attention in its shipped form,
despite early documentation and secondary sources describing it as an
architectural feature. The practical maximum context for vLLM serving is
therefore 32,768 tokens, the value passed as ``--max-model-len 32768`` in
this guide. Reliable recall is not capped at 8,192 tokens; such a cap
would apply only if the model were configured with a fixed sliding
window, which it is not.
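
The two fields can be checked directly against the downloaded weights.
The sketch below prints them from a Hugging Face ``config.json`` using
the Python interpreter from the environment; the helper name
``show_context_config`` is illustrative. A JSON ``null`` is printed as
``None``.

.. code:: bash

   # Print the context-related fields from a Hugging Face config.json.
   # Usage: show_context_config <path_to_config.json>
   show_context_config() {
       python3 -c 'import json, sys; cfg = json.load(open(sys.argv[1])); print("sliding_window:", cfg.get("sliding_window")); print("max_position_embeddings:", cfg.get("max_position_embeddings"))' "$1"
   }

   # Example:
   #   show_context_config "${MODEL_PATH}/config.json"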

Separate MoE tuning configs per model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tuned Triton kernel configuration for Mixtral 8x7B at TP=2 is not
valid for Mixtral 8x22B at TP=4, as the expert dimensions and parallel
configuration differ. Use separate ``TUNING_DIR`` paths for each model
and point ``VLLM_TUNED_CONFIG_FOLDER`` to the correct directory in each
serve script.
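
One way to avoid picking up the wrong directory is to clear the variable
whenever the model-specific directory is missing, so that a value
inherited from the submitting shell is never reused. The following is a
sketch under that assumption; the helper name ``use_tuning_dir`` is
illustrative.

.. code:: bash

   # Export VLLM_TUNED_CONFIG_FOLDER only if the model-specific directory
   # exists; otherwise clear any inherited value and fall back to defaults.
   use_tuning_dir() {
       local dir=$1
       if [ -d "${dir}" ]; then
           export VLLM_TUNED_CONFIG_FOLDER=${dir}
       else
           unset VLLM_TUNED_CONFIG_FOLDER
           echo "No tuned MoE configs in ${dir}; using vLLM defaults." >&2
       fi
   }

   # Example for the 8x22B serve script:
   #   use_tuning_dir /valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b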

GPU count and Discoverer+ etiquette
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As Discoverer+ has only 2 DGX H200 nodes totalling 16 GPUs, requesting 2
GPUs for 8x7B or 4 GPUs for 8x22B is appropriate. Do not request 8 GPUs
for either model; neither requires a full node, and doing so would
unnecessarily deprive other users of GPU access.

Login node usage
~~~~~~~~~~~~~~~~

On Discoverer+, all computationally or I/O-intensive operations must be
submitted as SLURM jobs. This includes Conda environment creation,
package installation, model weight downloading, server startup, and
benchmarking.

Conda activation on Discoverer+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The guide uses ``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` rather than
``conda activate``. This is the recommended approach on Discoverer+ and
does not require initialising a Conda shell.