Serving Mixtral 8x7B and 8x22B on DGX H200 with vLLM and SLURM
==============================================================

This guide covers the complete process of deploying Mistral AI’s Mixtral 8x7B and Mixtral 8x22B on a single DGX H200 node within a SLURM-managed cluster (Discoverer+), using Conda for environment management and vLLM for inference.

Contents
--------

1. Model overview
2. Hardware and software prerequisites
3. Environment setup with Conda on Discoverer+
4. Installing vLLM in the Conda environment
5. Baseline deployment
6. Memory layout and GPU allocation
7. Expert parallelism
8. KV cache optimisation
9. Chunked prefill and scheduler tuning
10. Speculative decoding with n-gram prompt lookup
11. MoE Triton kernel tuning
12. Full optimised SLURM job scripts
13. Benchmarking
14. Known caveats and constraints

--------------

.. _mix-model-overview:

1. Model overview
-----------------

Mixtral 8x7B (released December 2023) and Mixtral 8x22B (released April 2024) are sparse Mixture-of-Experts language models from Mistral AI, both licensed under Apache 2.0.

================================== ======================================== =========================================
Property                           Mixtral 8x7B                             Mixtral 8x22B
================================== ======================================== =========================================
Total parameters                   46.7B                                    141B
Active parameters per token        12.9B (top-2 of 8 experts)               39B (top-2 of 8 experts)
Experts per layer                  8                                        8
Active experts per token           2                                        2
Attention mechanism                GQA                                      GQA
Context window                     32,768 tokens                            65,536 tokens
BF16 VRAM requirement              ~94 GB                                   ~263 GB
Licence                            Apache 2.0                               Apache 2.0
Hugging Face identifier (base)     ``mistralai/Mixtral-8x7B-v0.1``          ``mistralai/Mixtral-8x22B-v0.1``
Hugging Face identifier (instruct) ``mistralai/Mixtral-8x7B-Instruct-v0.1`` ``mistralai/Mixtral-8x22B-Instruct-v0.1``
================================== ======================================== =========================================

Both models are natively supported in vLLM without ``--trust-remote-code``. Neither model uses MLA attention or reasoning tokens, and neither requires a tool-call parser, so deployment is substantially simpler than for Kimi K2.5/K2.6.
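
As a rough illustration of the "top-2 of 8 experts" row in the table above, the sketch below picks the two highest-scoring experts for one token and normalises their weights with a softmax over just those two scores. The function name ``top2_route`` and the example router scores are hypothetical; this is a plain-Python sketch of the routing idea, not vLLM's fused MoE kernel.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    if len(router_logits) < 2:
        raise ValueError("need at least two experts")
    # Indices of the two largest router scores, best first.
    top2 = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:2]
    # Softmax over the two selected scores only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routes = top2_route(logits)
print(routes)
```

Only the two selected experts run their feed-forward block for this token, which is why the active parameter count is much smaller than the total even though every expert must be resident in VRAM.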

The active parameter count governs compute per forward pass: Mixtral 8x7B activates 12.9B parameters per token and therefore processes each token with roughly the compute of a 13B dense model, and Mixtral 8x22B with roughly that of a 39B dense model, even though all expert weights must be loaded into VRAM.

--------------

.. _mix-hardware-and-software-prerequisites:

2. Hardware and software prerequisites
--------------------------------------

DGX H200 system specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The NVIDIA DGX H200 provides the following hardware relevant to this deployment:

==================== ==================================================================
Component            Specification
==================== ==================================================================
GPUs                 8x NVIDIA H200 SXM Tensor Core GPU
GPU memory           141 GB HBM3e per GPU, 1,128 GB total
GPU memory bandwidth 4.8 TB/s per GPU
GPU interconnect     18x NVLink 4.0 connections per GPU, 900 GB/s bidirectional per GPU
NVSwitch             4x NVSwitch, 7.2 TB/s aggregate bidirectional GPU-to-GPU bandwidth
Host CPUs            2x Intel Xeon Platinum 8480C, 112 cores total
System memory        2 TB DDR5
NVMe storage         8x 3.84 TB (data), 2x 1.92 TB (OS)
Network              10x ConnectX-7, 400 Gb/s InfiniBand/Ethernet
==================== ==================================================================

Both
Mixtral models fit comfortably on a single DGX H200 node with substantial KV cache headroom remaining. Mixtral 8x7B requires only 1–2 GPUs for weights; Mixtral 8x22B requires 2–4 GPUs. Neither model requires a full 8-GPU allocation, which is an important consideration on a shared cluster with only 4 nodes in total. The scripts in this guide use TP=2 for 8x7B and TP=4 for 8x22B, leaving the remaining GPUs available for other users.

Software requirements
~~~~~~~~~~~~~~~~~~~~~

============= =============== ========================================
Component     Minimum version Notes
============= =============== ========================================
CUDA toolkit  12.1            12.8 required for FP8 KV cache on Hopper
NVIDIA driver 535.x           560+ recommended
Python        3.11            as specified in the Conda environment
vLLM          0.19.1          pin this version for stability
PyTorch       2.5+            installed as a vLLM dependency
============= =============== ========================================

On Discoverer+, CUDA libraries are provided through the cluster environment module system and do not need to be installed manually inside the Conda environment.

--------------

.. _mix-environment-setup-with-conda-on-discoverer:

3. Environment setup with Conda on Discoverer+
----------------------------------------------

On Discoverer+, Conda is provided through the centralised Anaconda installation and accessed via the module system. Do not install a separate Anaconda or Miniconda distribution in your home or project directory.

The recommended location for virtual environments on Discoverer+ is:

::

   /valhalla/projects//virt_envs/

Creating the vLLM environment via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Environment creation must not be run on the login node. Submit a SLURM batch job instead. Save the following as ``create_vllm_mixtral_env.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=create_vllm_mixtral_env
   #SBATCH --time=00:30:00
   #SBATCH --account=
   #SBATCH --qos=2cpu-single-host
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH -o create_vllm_mixtral_env.%j.out
   #SBATCH -e create_vllm_mixtral_env.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

   # Refuse to overwrite an existing environment.
   [ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

   conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

   if [ $? -ne 0 ]; then
       echo "Conda environment creation failed." >&2
       exit 1
   fi

   echo "Conda environment created successfully."

   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Environment ready for vLLM installation."

Submit and verify:

.. code:: bash

   sbatch create_vllm_mixtral_env.sh
   cat create_vllm_mixtral_env..out

--------------

.. _mix-installing-vllm-in-the-conda-environment:

4. Installing vLLM in the Conda environment
-------------------------------------------

Why pip is used inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Conda is the preferred package manager on Discoverer+. For vLLM, the conda-forge channel only carries versions up to 0.10.x, significantly behind the 0.19.1 release. The vLLM project distributes current releases exclusively through PyPI wheels, so pip is necessary for this package.

Pip installs into the Conda environment provided that ``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` is set before calling pip. This causes pip to install all packages into ``${VIRTUAL_ENV}/lib/python3.11/site-packages/``, entirely within the project storage path on ``/valhalla``. Nothing is written to ``~/.local`` or the home directory.

To confirm the correct pip binary is active at any point during a job:

.. code:: bash

   which pip
   # must print: /valhalla/projects//virt_envs/vllm-mixtral/bin/pip

Installing vLLM via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save the following as ``install_vllm_mixtral.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=install_vllm_mixtral
   #SBATCH --time=01:00:00
   #SBATCH --account=
   #SBATCH --qos=2cpu-single-host
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G
   #SBATCH -o install_vllm_mixtral.%j.out
   #SBATCH -e install_vllm_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

   # Put the environment first on PATH so that pip installs into it.
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   pip install "vllm==0.19.1"

   if [ $? -ne 0 ]; then
       echo "vLLM installation failed." >&2
       exit 1
   fi

   pip install "huggingface_hub[cli]"

   echo "vLLM installation complete."
   echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
   echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

Submit:

.. code:: bash

   sbatch install_vllm_mixtral.sh

Verify in the job output that the install location is under ``/valhalla`` and not ``~/.local``.

Downloading model weights
~~~~~~~~~~~~~~~~~~~~~~~~~

Both Mixtral models are available on Hugging Face under Apache 2.0. Download to project storage, not to the home directory. Adjust the model identifier and directory for whichever model you are deploying.

Save as ``download_mixtral.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=download_mixtral
   #SBATCH --time=02:00:00
   #SBATCH --account=
   #SBATCH --qos=2cpu-single-host
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G
   #SBATCH -o download_mixtral.%j.out
   #SBATCH -e download_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   # Set MODEL_ID to the desired variant:
   #   mistralai/Mixtral-8x7B-Instruct-v0.1   (~94 GB BF16)
   #   mistralai/Mixtral-8x22B-Instruct-v0.1  (~263 GB BF16)
   MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1
   MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

   huggingface-cli download ${MODEL_ID} \
       --local-dir ${MODEL_DIR} \
       --local-dir-use-symlinks False

   echo "Download complete. Weights at ${MODEL_DIR}."

Approximate download times: under 1 hour for 8x7B (94 GB) and 1–2 hours for 8x22B (263 GB).

--------------

.. _mix-baseline-deployment:

5. Baseline deployment
----------------------

The following SLURM jobs start vLLM inference servers for each model. All flags listed are required or strongly recommended for correct behaviour. The two models use different GPU counts and tensor parallel sizes, so separate scripts are provided.

Mixtral 8x7B baseline
~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x7b_baseline.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x7b
   #SBATCH --time=24:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G
   #SBATCH -o vllm_mixtral_8x7b.%j.out
   #SBATCH -e vllm_mixtral_8x7b.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 2 \
       --dtype bfloat16

Mixtral 8x22B baseline
~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x22b_baseline.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x22b
   #SBATCH --time=24:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:4
   #SBATCH --cpus-per-task=56
   #SBATCH --mem=512G
   #SBATCH -o vllm_mixtral_8x22b.%j.out
   #SBATCH -e vllm_mixtral_8x22b.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 4 \
       --dtype bfloat16

Why no ``--trust-remote-code``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both Mixtral architectures are natively registered in vLLM. The flag is not required and should not be passed unless loading a custom or modified checkpoint.

Why no ``--reasoning-parser`` or ``--tool-call-parser``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mixtral models do not emit structured reasoning tokens. Tool calling uses standard function-calling syntax handled natively by vLLM’s OpenAI-compatible API without a model-specific parser.

--------------

.. _mix-memory-layout-and-gpu-allocation:

6. Memory layout and GPU allocation
-----------------------------------

VRAM consumption breakdown (BF16 weights)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

============================= ============================ =============================
Component                     Mixtral 8x7B (TP=2)          Mixtral 8x22B (TP=4)
============================= ============================ =============================
All expert weights (BF16)     ~94 GB total, ~47 GB per GPU ~263 GB total, ~66 GB per GPU
Activations and CUDA overhead ~2–4 GB per GPU              ~2–4 GB per GPU
KV cache (remainder)          ~90 GB per GPU               ~70 GB per GPU
============================= ============================ =============================

Each GPU in the DGX H200 has 141 GB of HBM3e. With TP=2 for 8x7B and TP=4 for 8x22B, both models leave substantial KV cache headroom, much more than the Kimi models, because the weights are significantly smaller.

GPU memory utilisation
~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 is safe for both models on H200 and recovers additional KV cache space.

Context length
~~~~~~~~~~~~~~

.. code:: bash

   --max-model-len 32768   # for Mixtral 8x7B
   --max-model-len 65536   # for Mixtral 8x22B

The ``config.json`` for Mixtral 8x7B sets ``max_position_embeddings`` to 32,768 and ``sliding_window`` to null; there is no active sliding window attention. The practical maximum for vLLM serving is therefore 32,768 tokens, which is what ``--max-model-len 32768`` uses in this guide. Mixtral 8x22B sets ``max_position_embeddings`` to 65,536 tokens.

Do not leave ``--max-model-len`` at the model default if you are serving short-context workloads, as the KV cache reservation is proportional to this value.

GPU count and shared cluster etiquette
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discoverer+ has 4 DGX H200 nodes and 32 GPUs in total. Using only 2 GPUs for Mixtral 8x7B and 4 GPUs for Mixtral 8x22B leaves the remaining GPUs on the node available for other users. Do not request 8 GPUs for either model.

--------------

.. _mix-expert-parallelism:

7. Expert parallelism
---------------------

Background
~~~~~~~~~~

Expert parallelism (EP) assigns different experts to different GPUs rather than sharding each expert’s weight matrix across all TP ranks. For Mixtral’s 8 experts distributed across the TP group, EP reduces the inter-GPU communication volume per forward pass because tokens route to the GPU holding the relevant expert rather than all-reducing partial results across all GPUs.

Enabling expert parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --enable-expert-parallel

This flag modifies the communication pattern of the MoE layers and is only effective when ``tensor-parallel-size × data-parallel-size > 1``. On both TP=2 (8x7B) and TP=4 (8x22B) configurations, this condition is satisfied.

For Mixtral’s 8 experts at TP=2, each GPU holds approximately 4 experts.
At TP=4, each GPU holds approximately 2 experts. The flag is beneficial for both configurations.

--------------

.. _mix-kv-cache-optimisation:

8. KV cache optimisation
------------------------

FP8 KV cache
~~~~~~~~~~~~

.. code:: bash

   --kv-cache-dtype fp8

Quantising the KV cache from BF16 to FP8 halves memory per cached token and reduces memory bandwidth during attention decode steps. Requires CUDA 11.8 or later. Validated on H200 (Hopper architecture) by the vLLM team.

Both Mixtral models use GQA, which already results in smaller KV caches than standard MHA; FP8 halves this further.

Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to 1.0. For better accuracy under extreme quantisation conditions, supply a calibrated scale file via ``--quantization-param-path``.

Prefix caching
~~~~~~~~~~~~~~

.. code:: bash

   --enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating redundant prefill computation. Particularly effective for RAG workloads that prepend the same system prompt and document chunks across many requests. Enabled by default in vLLM V1; specify explicitly if using an older version.

CPU offload (swap space)
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --swap-space 16

The value is in GiB per GPU. A lower value than the Kimi guides is appropriate here because both Mixtral models leave far more KV cache headroom; swap is less likely to be needed. Adjust upward if preemption warnings appear in the vLLM logs:

::

   WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

--------------

.. _mix-chunked-prefill-and-scheduler-tuning:

9. Chunked prefill and scheduler tuning
---------------------------------------

Chunked prefill
~~~~~~~~~~~~~~~

.. code:: bash

   --enable-chunked-prefill \
   --max-num-batched-tokens 8192

Chunked prefill breaks large prefill computations into chunks interleaved with decode steps, preventing single long-context requests from blocking decode throughput for all other in-flight requests. Particularly relevant for Mixtral 8x22B with its 64K context window.

``--max-num-batched-tokens`` controls the total tokens processed per scheduling step. A value of 8,192 is a reasonable starting point for both models on H200.

Max concurrent sequences
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --max-num-seqs 256

The default is 256. Reduce if KV cache OOM errors occur under high concurrency; increase if GPU utilisation is consistently below 80%. Both Mixtral models have substantial remaining KV cache headroom, so the default 256 is generally viable without reduction.

--------------

.. _mix-speculative-decoding-with-n-gram-prompt-lookup:

10. Speculative decoding with n-gram prompt lookup
--------------------------------------------------

Neither Mistral AI nor the vLLM community has published dedicated draft models trained specifically on the Mixtral architecture. The appropriate speculative decoding strategy for Mixtral on vLLM is n-gram prompt lookup decoding, which requires no additional model download.

How n-gram speculative decoding works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

N-gram speculative decoding matches the last N tokens of the current generation against occurrences of those same tokens in the input prompt, then proposes the tokens that follow in the prompt as draft candidates. The main model verifies them in a single forward pass. This is particularly effective for RAG workloads where the model is likely to quote or closely paraphrase retrieved document content.

No additional VRAM is required: the draft proposals are generated from the input context without loading any additional model weights.
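
The matching step described above can be sketched in a few lines of Python. This is a simplified illustration, assuming token IDs are plain integers and that only the prompt is searched; ``propose_ngram_draft`` is a hypothetical helper, not vLLM's implementation, although its parameters are named after the ``--speculative-config`` fields used in this guide.

```python
def propose_ngram_draft(
    prompt_ids,
    generated_ids,
    num_speculative_tokens=5,
    prompt_lookup_min=2,
    prompt_lookup_max=10,
):
    """Propose draft tokens by matching the tail of the output against the prompt."""
    if prompt_lookup_min < 1:
        raise ValueError("prompt_lookup_min must be at least 1")
    longest = min(prompt_lookup_max, len(generated_ids))
    # Try the longest n-gram first: longer matches are more specific.
    for n in range(longest, prompt_lookup_min - 1, -1):
        tail = generated_ids[-n:]
        # Slide over the prompt looking for the same n tokens.
        for start in range(len(prompt_ids) - n + 1):
            if prompt_ids[start:start + n] == tail:
                # Propose whatever follows the match in the prompt.
                draft = prompt_ids[start + n:start + n + num_speculative_tokens]
                if draft:
                    return draft
    # No match: nothing is proposed and decoding proceeds normally.
    return []


# Hypothetical token IDs: the output has just reproduced "12, 13" from the prompt.
prompt = [10, 11, 12, 13, 14, 15, 16]
generated = [99, 12, 13]
draft = propose_ngram_draft(prompt, generated)
print(draft)
```

The main model then scores the proposed tokens in one forward pass and keeps the longest prefix it agrees with, so a wrong proposal costs some wasted compute but does not change the output.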

Enabling n-gram speculative decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The verified syntax in vLLM 0.19.1 is:

.. code:: bash

   --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Fields:

- ``method``: must be ``"ngram"`` exactly
- ``num_speculative_tokens``: number of tokens proposed per step; 5 is a reasonable starting value for RAG workloads
- ``prompt_lookup_min``: minimum n-gram length to match; 2 means at least a 2-token match is required before proposing
- ``prompt_lookup_max``: maximum n-gram length to search; larger values find more specific matches but incur slightly higher search cost per step

Effectiveness
~~~~~~~~~~~~~

N-gram speculative decoding provides a meaningful latency reduction only when the model output closely follows the input prompt, which is the case for document summarisation, extraction, and RAG answer generation. For open-ended generation tasks where the model does not repeat input text, the acceptance rate will be low and the overhead may reduce throughput marginally.

Benchmark both configurations on your representative workload before deploying n-gram speculation to production.

--------------

.. _mix-moe-triton-kernel-tuning:

11. MoE Triton kernel tuning
----------------------------

Without a tuned configuration, vLLM logs at startup:

::

   WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The ``benchmark_moe.py`` script writes a hardware-specific JSON file into a target directory. Setting ``VLLM_TUNED_CONFIG_FOLDER`` to that directory before serving causes vLLM to load it automatically. Run separate tuning jobs for 8x7B and 8x22B since their expert dimensions differ.

Running the tuning script via SLURM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``tune_moe_mixtral.sh``, adjusting ``MODEL_PATH``, ``TUNING_DIR``, and ``--tp-size`` for the model you are tuning:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=tune_moe_mixtral
   #SBATCH --time=02:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G
   #SBATCH -o tune_moe_mixtral.%j.out
   #SBATCH -e tune_moe_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   # Adjust for 8x7B (tp-size 2) or 8x22B (tp-size 4 with gres=gpu:4)
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

   mkdir -p ${TUNING_DIR}

   # benchmark_moe.py lives in the vLLM source repository, not in the PyPI wheel;
   # the relative path below assumes a vLLM checkout in the submit directory.
   python benchmarks/kernels/benchmark_moe.py \
       --model ${MODEL_PATH} \
       --tp-size 2 \
       --dtype bfloat16 \
       --tune \
       --save-dir ${TUNING_DIR}

   echo "Tuning complete. Config written to ${TUNING_DIR}."

For Mixtral 8x22B, set ``--gres=gpu:4`` and ``--tp-size 4``, and change both ``MODEL_PATH`` and ``TUNING_DIR`` accordingly. The tuning run takes 30–90 minutes. Re-run if you change GPU count, TP size, or model.

Loading the tuned configuration at serve time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

Set this before the ``vllm serve`` call. vLLM logs a confirmation at startup when it loads the tuned configuration.

--------------

.. _mix-full-optimised-slurm-job-scripts:

12. Full optimised SLURM job scripts
------------------------------------

Mixtral 8x7B: full optimised script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x7b_optimised.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x7b_opt
   #SBATCH --time=24:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --cpus-per-task=28
   #SBATCH --mem=256G
   #SBATCH -o vllm_mixtral_8x7b_opt.%j.out
   #SBATCH -e vllm_mixtral_8x7b_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x7b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x7b

   # Use the tuned MoE kernel config only if the tuning job has been run.
   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 2 \
       --dtype bfloat16 \
       --gpu-memory-utilization 0.92 \
       --max-model-len 32768 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 16 \
       --enable-expert-parallel \
       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

.. code:: bash

   sbatch serve_mixtral_8x7b_optimised.sh

Mixtral 8x22B: full optimised script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``serve_mixtral_8x22b_optimised.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_mixtral_8x22b_opt
   #SBATCH --time=24:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:4
   #SBATCH --cpus-per-task=56
   #SBATCH --mem=512G
   #SBATCH -o vllm_mixtral_8x22b_opt.%j.out
   #SBATCH -e vllm_mixtral_8x22b_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/mixtral-8x22b-instruct
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_mixtral_8x22b

   # Use the tuned MoE kernel config only if the tuning job has been run.
   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 4 \
       --dtype bfloat16 \
       --gpu-memory-utilization 0.92 \
       --max-model-len 65536 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 16 \
       --enable-expert-parallel \
       --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 10}'

Submit:

.. code:: bash

   sbatch serve_mixtral_8x22b_optimised.sh

The vLLM server binds to port 8000 by default. Retrieve the compute node hostname from the job output file and connect your client to ``http://:8000/v1``.

--------------

.. _mix-benchmarking:

13. Benchmarking
----------------

Submit benchmarks as SLURM jobs. Insert the hostname from the server job output into ``SERVER_URL``. Adjust the ``--model`` identifier and ``--gres`` count for the model you are benchmarking.

Save as ``benchmark_mixtral.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=benchmark_mixtral
   #SBATCH --time=01:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G
   #SBATCH -o benchmark_mixtral.%j.out
   #SBATCH -e benchmark_mixtral.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-mixtral
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   SERVER_URL=http://:8000

   # Adjust --model to match whichever model is running on the server.
   # Note: vLLM registers the model under the name passed to `vllm serve`
   # (the MODEL_PATH in the serve scripts) unless --served-model-name is set.
   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
       --dataset-name sharegpt \
       --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
       --num-prompts 1000 \
       --request-rate 20

Key metrics to track
~~~~~~~~~~~~~~~~~~~~

- output tokens per second (decode throughput)
- time to first token (TTFT): prefill latency
- inter-token latency (ITL): decode latency per token
- KV cache utilisation: reported in vLLM logs and Prometheus metrics
- speculative decoding acceptance rate: reported in vLLM logs when n-gram is active; low acceptance rates indicate the workload does not benefit from prompt lookup speculation

Profiling GPU utilisation
~~~~~~~~~~~~~~~~~~~~~~~~~

From within a SLURM job on the compute node:

.. code:: bash

   nvidia-smi dmon -s u -d 1

The allocated GPUs (2 for 8x7B, 4 for 8x22B) should show compute utilisation above 70% under sustained load.

--------------

.. _mix-known-caveats-and-constraints:

14. Known caveats and constraints
---------------------------------

pip inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All pip invocations in this guide follow ``export PATH=${VIRTUAL_ENV}/bin:${PATH}``.
If a SLURM script omits this line and calls pip directly, packages will install into ``~/.local/lib/python3.11/site-packages/`` and consume home directory quota.

N-gram speculative decoding in vLLM V1 mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known issue (vLLM GitHub issue #16883) in which n-gram speculative decoding does not function correctly when vLLM is running in V1 engine mode (``VLLM_USE_V1=1``). In vLLM 0.19.1, V1 is enabled by default for supported models. If n-gram speculation produces no speedup or incorrect output, disable V1 mode by setting ``VLLM_USE_V1=0`` before the serve call:

.. code:: bash

   export VLLM_USE_V1=0

Add this line to the serve script above the ``vllm serve`` call if you encounter issues with n-gram speculation.

N-gram speculation and workload fit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

N-gram prompt lookup only benefits workloads where the model output closely follows the input prompt. For open-ended chat, creative writing, or reasoning tasks that do not quote the input, the acceptance rate will be low and the speculative overhead may cause a modest throughput reduction. Measure the acceptance rate in vLLM logs before committing to n-gram speculation in production.

FP8 KV cache with chunked prefill
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known interaction between ``--kv-cache-dtype fp8`` and ``--enable-chunked-prefill`` in vLLM versions below 0.17.0. This has been resolved in the recommended vLLM 0.19.1.

Mixtral 8x7B context window
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``config.json`` for ``mistralai/Mixtral-8x7B-v0.1`` sets ``"sliding_window": null`` and ``"max_position_embeddings": 32768``. Mixtral 8x7B does not use Sliding Window Attention in its shipped form, despite early documentation and secondary sources describing it as an architectural feature. The practical maximum context for vLLM serving is 32,768 tokens, which is what ``--max-model-len 32768`` uses in this guide.
There is no 8,192-token hard limit on reliable recall; this would only apply if the model had been configured with a fixed sliding window, which it is not.

Separate MoE tuning configs per model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tuned Triton kernel configuration for Mixtral 8x7B at TP=2 is not valid for Mixtral 8x22B at TP=4, as the expert dimensions and parallel configuration differ. Use separate ``TUNING_DIR`` paths for each model and point ``VLLM_TUNED_CONFIG_FOLDER`` to the correct directory in each serve script.

GPU count and Discoverer+ etiquette
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As Discoverer+ has only 4 DGX H200 nodes totalling 32 GPUs, requesting 2 GPUs for 8x7B or 4 GPUs for 8x22B is appropriate. Do not request 8 GPUs for either model; neither requires a full node, and doing so would unnecessarily deprive other users of GPU access.

Login node usage
~~~~~~~~~~~~~~~~

On Discoverer+, all computationally or I/O-intensive operations must be submitted as SLURM jobs. This includes Conda environment creation, package installation, model weight downloading, server startup, and benchmarking.

Conda activation on Discoverer+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The guide uses ``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` rather than ``conda activate``. This is the recommended approach on Discoverer+ and does not require initialising a Conda shell.
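
Because the environment is selected through ``PATH`` rather than ``conda activate``, it can be worth confirming from inside a job that the Python interpreter really resolves under ``VIRTUAL_ENV``. The following is a minimal sketch; ``is_inside_env`` is a hypothetical helper name, and the check assumes ``VIRTUAL_ENV`` is exported as in the scripts in this guide.

```python
import os
import sys


def is_inside_env(path, env_prefix):
    """Return True if `path` lies under `env_prefix` after resolving symlinks."""
    path = os.path.realpath(path)
    env_prefix = os.path.realpath(env_prefix)
    # commonpath compares whole path components, so /a/bc is not "inside" /a/b.
    return os.path.commonpath([path, env_prefix]) == env_prefix


if __name__ == "__main__":
    # VIRTUAL_ENV is assumed to be exported by the calling SLURM script.
    env = os.environ.get("VIRTUAL_ENV", "")
    ok = bool(env) and is_inside_env(sys.executable, env)
    print(f"python: {sys.executable}")
    print("inside VIRTUAL_ENV" if ok else "NOT inside VIRTUAL_ENV; check PATH")
```

Saved under any file name and run with ``python`` after the ``export PATH=...`` line, it reports whether the active interpreter belongs to the project environment or to some other installation.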