Serving Kimi K2.5 on DGX H200 with vLLM and SLURM
=================================================

This guide covers the complete process of deploying Moonshot AI’s Kimi
K2.5 on a single DGX H200 node within a SLURM-managed cluster
(Discoverer+), using Conda for environment management and vLLM for
inference. Some components are installed with ``pip``.

Contents
--------

1.  `Model overview <k25-model-overview_>`_
2.  `Hardware and software prerequisites <k25-hardware-and-software-prerequisites_>`_
3.  `Environment setup with Conda on Discoverer+ <k25-environment-setup-with-conda-on-discoverer_>`_
4.  `Installing vLLM in the Conda environment <k25-installing-vllm-in-the-conda-environment_>`_
5.  `Baseline deployment <k25-baseline-deployment_>`_
6.  `Memory layout and GPU allocation <k25-memory-layout-and-gpu-allocation_>`_
7.  `Expert parallelism <k25-expert-parallelism_>`_
8.  `KV cache optimisation <k25-kv-cache-optimisation_>`_
9.  `Chunked prefill and scheduler tuning <k25-chunked-prefill-and-scheduler-tuning_>`_
10.  `Eagle3 speculative decoding <k25-eagle3-speculative-decoding_>`_
11.  `MoE Triton kernel tuning <k25-moe-triton-kernel-tuning_>`_
12.  `Full optimised SLURM job script <k25-full-optimised-slurm-job-script_>`_
13.  `Benchmarking <k25-benchmarking_>`_
14.  `Known caveats and constraints <k25-known-caveats-and-constraints_>`_

--------------


.. _k25-model-overview:

1. Model overview
-----------------

Kimi K2.5 is a 1 trillion parameter Mixture-of-Experts (MoE) model
released by Moonshot AI in January 2026. Key architectural
characteristics relevant to deployment:

-  1 trillion total parameters, 32 billion active parameters per token
-  384 experts per MoE layer, with 8 experts selected per token plus 1
   shared expert
-  61 transformer layers
-  Multi-head Latent Attention (MLA), which compresses the KV cache by
   approximately 10× compared to standard MHA
-  256,000 token context window
-  Native multimodal support via a MoonViT vision encoder
-  Weights available at ``moonshotai/Kimi-K2.5`` on Hugging Face under a
   modified MIT licence

The MLA attention mechanism is the single most important architectural
property for deployment planning. Its roughly 10× smaller KV cache makes
long-context serving materially more practical.

--------------


.. _k25-hardware-and-software-prerequisites:

2. Hardware and software prerequisites
--------------------------------------

DGX H200 system specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The NVIDIA DGX H200 provides the following hardware relevant to this
deployment:

+-----------------------------------+-----------------------------------+
| Component                         | Specification                     |
+===================================+===================================+
| GPUs                              | 8x NVIDIA H200 SXM Tensor Core    |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| GPU memory                        | 141 GB HBM3e per GPU, 1,128 GB    |
|                                   | total                             |
+-----------------------------------+-----------------------------------+
| GPU memory bandwidth              | 4.8 TB/s per GPU                  |
+-----------------------------------+-----------------------------------+
| GPU interconnect                  | 18x NVLink 4.0 connections per    |
|                                   | GPU, 900 GB/s bidirectional per   |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| NVSwitch                          | 4x NVSwitch, 7.2 TB/s aggregate   |
|                                   | bidirectional GPU-to-GPU          |
|                                   | bandwidth                         |
+-----------------------------------+-----------------------------------+
| Host CPUs                         | 2x Intel Xeon Platinum 8480C, 112 |
|                                   | cores total                       |
+-----------------------------------+-----------------------------------+
| System memory                     | 2 TB DDR5                         |
+-----------------------------------+-----------------------------------+
| NVMe storage                      | 8x 3.84 TB (data), 2x 1.92 TB     |
|                                   | (OS)                              |
+-----------------------------------+-----------------------------------+
| Network                           | 10x ConnectX-7, 400 Gb/s          |
|                                   | InfiniBand/Ethernet               |
+-----------------------------------+-----------------------------------+

The 4x NVSwitch fabric provides full all-to-all GPU connectivity at 7.2
TB/s, which is critical for the all-reduce operations in tensor
parallelism across all 8 GPUs. This is substantially higher bandwidth
than PCIe-connected multi-GPU systems and directly affects the
efficiency of communication during inference.

Software requirements
~~~~~~~~~~~~~~~~~~~~~

+-----------------------+-----------------------+-----------------------+
| Component             | Minimum version       | Notes                 |
+=======================+=======================+=======================+
| CUDA toolkit          | 12.1                  | 12.8 required for FP8 |
|                       |                       | KV cache on Hopper    |
+-----------------------+-----------------------+-----------------------+
| NVIDIA driver         | 535.x                 | 560+ recommended      |
+-----------------------+-----------------------+-----------------------+
| Python                | 3.11                  | as specified in the   |
|                       |                       | Conda environment     |
+-----------------------+-----------------------+-----------------------+
| vLLM                  | 0.19.1                | verified by Moonshot  |
|                       |                       | AI; pin this version  |
|                       |                       | for stability         |
+-----------------------+-----------------------+-----------------------+
| PyTorch               | 2.5+                  | installed as a vLLM   |
|                       |                       | dependency            |
+-----------------------+-----------------------+-----------------------+

On Discoverer+, the necessary CUDA libraries are provided through the
cluster environment module system and do not need to be installed
manually inside the Conda environment:

.. code:: bash

   module load nvidia/cuda/12/12.8
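
A preflight check along the following lines can confirm the driver
minimum from the table before a long job is queued. This is a sketch
rather than part of the Discoverer+ tooling: ``version_ge`` is a
hypothetical helper, and it assumes GNU ``sort -V`` is available on the
node.

.. code:: bash

   # Hypothetical helper: succeeds when version $1 >= version $2 (GNU sort -V)
   version_ge() {
       [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
   }

   # Only query the driver where nvidia-smi exists, i.e. on a GPU node
   if command -v nvidia-smi >/dev/null 2>&1; then
       driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
       if version_ge "${driver}" "535"; then
           echo "Driver ${driver} meets the 535.x minimum."
       else
           echo "Driver ${driver} is older than 535.x." >&2
       fi
   fi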

--------------


.. _k25-environment-setup-with-conda-on-discoverer:

3. Environment setup with Conda on Discoverer+
----------------------------------------------

On Discoverer+, Conda is provided by the centralised Anaconda
installation and is accessed through the module system:
.. code:: bash

   module load anaconda3

Do not install a separate Anaconda or Miniconda distribution in your
home or project directory.

The recommended location for virtual environments on Discoverer+ is:

::

   /valhalla/projects/<your_slurm_project_account_name>/virt_envs/

Creating the vLLM environment via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Environment creation must not be run on the login node, as installation
tasks are I/O-intensive and compete for shared login node resources.
Submit a SLURM batch job instead.

Save the following as ``create_vllm_env.sh``, replacing
``<your_slurm_project_account_name>`` with your actual account name:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=create_vllm_env
   #SBATCH --time=00:30:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G

   #SBATCH -o create_vllm_env.%j.out
   #SBATCH -e create_vllm_env.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi

   [ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

   conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

   if [ $? -ne 0 ]; then
       echo "Conda environment creation failed." >&2
       exit 1
   fi

   echo "Conda environment created successfully."
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Environment ready for vLLM installation."

Submit and verify:

.. code:: bash

   sbatch create_vllm_env.sh
   cat create_vllm_env.<jobid>.out
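
Where blocking the terminal until the job ends is acceptable, the two
steps can be combined. The sketch below assumes the standard ``sbatch``
options ``--parsable`` and ``--wait``; ``job_id_from_parsable`` is a
hypothetical helper that strips the optional cluster suffix from the
parsable output.

.. code:: bash

   # Hypothetical helper: "12345;cluster" -> "12345"
   job_id_from_parsable() {
       printf '%s\n' "${1%%;*}"
   }

   if command -v sbatch >/dev/null 2>&1 && [ -f create_vllm_env.sh ]; then
       # --wait blocks until the job ends; --parsable prints only the job ID
       submitted=$(sbatch --parsable --wait create_vllm_env.sh)
       jobid=$(job_id_from_parsable "${submitted}")
       cat "create_vllm_env.${jobid}.out"
   fi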

--------------


.. _k25-installing-vllm-in-the-conda-environment:

4. Installing vLLM in the Conda environment
-------------------------------------------

Use of ``pip`` inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Conda is the preferred package manager on Discoverer+ and should be used
wherever packages are available in a suitable version on conda-forge.
For vLLM, the conda-forge channel currently only carries versions up to
0.10.x — significantly behind the 0.19.1 release verified by Moonshot AI
for Kimi K2.5. The vLLM project distributes current releases exclusively
through PyPI wheels, so ``pip`` is necessary for this specific package.

``pip`` is safe to use here provided it is invoked through the ``pip``
binary that resides inside the Conda environment. Setting
``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` before calling ``pip`` causes
``pip`` to install all packages into
``${VIRTUAL_ENV}/lib/python3.11/site-packages/`` — entirely within the
project storage path on ``/valhalla``. Nothing is written to
``~/.local`` or to any other location within the user’s home directory.
Spillage into the home directory only occurs when ``pip`` is called
without an active environment, using the system Python binary.

To confirm that the correct ``pip`` binary is active, run the following
inside a job, after the ``PATH`` export:

.. code:: bash

   which pip
   # must print: /valhalla/projects/<account>/virt_envs/vllm-kimi/bin/pip

Installing vLLM via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save the following as ``install_vllm.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=install_vllm
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o install_vllm.%j.out
   #SBATCH -e install_vllm.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

   # Expose the Conda environment. pip below installs into ${VIRTUAL_ENV}, not into ~/.local
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   # vLLM 0.19.1 is not on conda-forge; install from PyPI using the environment's own pip
   pip install "vllm==0.19.1"

   if [ $? -ne 0 ]; then
       echo "vLLM installation failed." >&2
       exit 1
   fi

   # huggingface_hub provides huggingface-cli for weight downloading
   pip install "huggingface_hub[cli]"

   echo "vLLM installation complete."
   echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
   echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

The final echo lines confirm installation went into the Conda
environment and not the home directory. Verify before proceeding.

Downloading model weights
~~~~~~~~~~~~~~~~~~~~~~~~~

Model weights must be stored in project storage. The BF16 checkpoint is
approximately 2 TB; home directory quota on Discoverer+ cannot
accommodate this. Setting ``HF_HOME`` redirects the Hugging Face
metadata cache away from ``~/.cache/huggingface``. Submit a download
job:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=download_kimi
   #SBATCH --time=04:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o download_kimi.%j.out
   #SBATCH -e download_kimi.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
   MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5

   huggingface-cli download moonshotai/Kimi-K2.5 \
       --local-dir ${MODEL_DIR} \
       --local-dir-use-symlinks False

   echo "Download complete. Weights at ${MODEL_DIR}."

Allow 2-4 hours. Ensure project storage allocation exceeds 2.5 TB before
submitting.
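
The storage requirement can be checked at the top of the download job.
The following is a sketch under two assumptions: ``has_free_gb`` is a
hypothetical helper, and GNU ``df`` (with ``--output``) is available.
``df`` reports free space on the filesystem, which is not necessarily
the same as the remaining project quota.

.. code:: bash

   # Hypothetical helper: succeeds if the filesystem holding $1 has at least $2 GB free
   has_free_gb() {
       local dir="$1" required="$2" avail
       # -BGB reports in units of 10^9 bytes; keep only the digits of the last line
       avail=$(df -BGB --output=avail "${dir}" | tail -n 1 | tr -dc '0-9')
       [ "${avail:-0}" -ge "${required}" ]
   }

   # Inside a SLURM job: stop early if less than 2.5 TB is free
   if [ -n "${SLURM_JOB_ACCOUNT:-}" ]; then
       has_free_gb "/valhalla/projects/${SLURM_JOB_ACCOUNT}" 2500 \
           || { echo "Less than 2.5 TB free in project storage." >&2; exit 1; }
   fi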

--------------


.. _k25-baseline-deployment:

5. Baseline deployment
----------------------

The following SLURM job starts a vLLM inference server on a DGX H200
node. All flags listed are required for correct behaviour with Kimi
K2.5.

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=gpu
   #SBATCH --job-name=vllm_kimi
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<gpu_qos_for_your_cluster>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gpus-per-node=8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o vllm_kimi.%j.out
   #SBATCH -e vllm_kimi.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 8 \
       --mm-encoder-tp-mode data \
       --tool-call-parser kimi_k2 \
       --reasoning-parser kimi_k2 \
       --enable-auto-tool-choice \
       --trust-remote-code

Flag explanations
~~~~~~~~~~~~~~~~~

-  ``--tensor-parallel-size 8``: shards model weights across all 8 GPUs.
   Required to fit the model in VRAM. The DGX H200 NVSwitch fabric
   provides 7.2 TB/s aggregate GPU-to-GPU bandwidth, making all-reduce
   operations efficient.
-  ``--mm-encoder-tp-mode data``: deploys the MoonViT vision encoder in
   data-parallel mode. The encoder is small enough that tensor
   parallelism adds communication overhead with negligible memory
   benefit. Encoder weights are replicated across TP ranks and image
   inputs are processed in parallel. Reduce
   ``--gpu-memory-utilization`` slightly if OOM errors occur at startup.
-  ``--tool-call-parser kimi_k2``: required for correct parsing of
   function call syntax from model output when using agentic or RAG
   workflows.
-  ``--reasoning-parser kimi_k2``: K2.5 emits reasoning content in a
   structured format; without this flag, reasoning tokens are not
   correctly separated from final output in the API response. As of vLLM
   0.9.0, specifying this flag implicitly enables reasoning mode; the
   deprecated ``--enable-reasoning`` flag is no longer needed.
-  ``--enable-auto-tool-choice``: permits the model to autonomously
   decide when to call a tool, rather than requiring the client to
   specify ``tool_choice`` in each request.
-  ``--trust-remote-code``: required for K2.5’s MLA attention
   implementation, which defines custom architecture classes not present
   in the vLLM codebase.

--------------


.. _k25-memory-layout-and-gpu-allocation:

6. Memory layout and GPU allocation
-----------------------------------

VRAM consumption breakdown (8x H200, BF16 weights)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-----------------------+-----------------------+-----------------------+
| Component             | Approximate size      | Notes                 |
+=======================+=======================+=======================+
| MoE routed expert     | ~410 GB               | distributed across TP |
| weights               |                       | group                 |
+-----------------------+-----------------------+-----------------------+
| Attention layers      | ~120 GB               | all 61 MLA layers     |
| (BF16)                |                       |                       |
+-----------------------+-----------------------+-----------------------+
| Shared expert weights | ~12 GB                | one shared expert per |
|                       |                       | MoE layer             |
+-----------------------+-----------------------+-----------------------+
| Dense layer 0,        | ~7 GB                 | first layer is fully  |
| embeddings, lm_head   |                       | dense                 |
+-----------------------+-----------------------+-----------------------+
| Activations and CUDA  | ~10 GB                | varies with batch     |
| overhead              |                       | size                  |
+-----------------------+-----------------------+-----------------------+
| KV cache (remainder)  | ~50-100 GB            | at                    |
|                       |                       | ``--gpu-memo          |
|                       |                       | ry-utilization 0.92`` |
+-----------------------+-----------------------+-----------------------+

Total VRAM across 8x H200 is 1,128 GB, of which vLLM may claim about
1,038 GB at ``--gpu-memory-utilization 0.92``. At BF16 precision, the
weights in the table occupy approximately 550 GB, leaving roughly 490 GB
in aggregate (about 61 GB per GPU) for KV cache, activations, and CUDA
overhead.
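
The budget can be re-derived from the table's figures with a few lines
of shell arithmetic. This is a minimal sketch: ``kv_headroom_gb`` is a
hypothetical helper, the inputs are the approximate sizes quoted in the
table, and ``awk`` is used because Bash has no floating-point
arithmetic.

.. code:: bash

   # Hypothetical helper: GB left after weights once the utilisation cap is applied
   # Arguments: total VRAM (GB), --gpu-memory-utilization value, weight size (GB)
   kv_headroom_gb() {
       awk -v total="$1" -v util="$2" -v weights="$3" \
           'BEGIN { printf "%.0f\n", total * util - weights }'
   }

   # Table rows: routed experts + attention + shared experts + dense/embeddings
   weights=$((410 + 120 + 12 + 7))
   echo "Weights: ${weights} GB"
   echo "Headroom at 0.92: $(kv_headroom_gb 1128 0.92 "${weights}") GB"
   echo "Headroom at 0.90: $(kv_headroom_gb 1128 0.90 "${weights}") GB"

With these inputs the headroom is 489 GB at 0.92 and 466 GB at 0.90, so
raising the setting from the default buys roughly 23 GB across the node.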

GPU memory utilisation
~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 on the DGX H200 with 8-way TP
is safe for K2.5 and recovers approximately 23 GB of additional KV cache
space across the node. Do not exceed 0.95 without careful testing;
values above this risk OOM on prefill spikes.

Context length
~~~~~~~~~~~~~~

.. code:: bash

   --max-model-len 65536

K2.5 supports up to 256,000 tokens. Do not leave ``--max-model-len`` at
the model default unless your workload genuinely requires it. The value
sets the largest KV cache footprint a single sequence can reach, and
vLLM refuses to start if one sequence of that length does not fit in the
available KV cache. Thanks to MLA’s 10× KV compression, 65,536 tokens is
viable on the DGX H200 at ``--gpu-memory-utilization 0.92``. Increase to
131,072 if longer contexts are required, and verify with a benchmark run
first.

System memory in SLURM
~~~~~~~~~~~~~~~~~~~~~~

The DGX H200 has 2 TB of system RAM. ``--mem=1800G`` reserves 1,800 GB
for the job, leaving approximately 200 GB for the OS. The job allocation
also backs the swap-space optimisation in section 8, which uses CPU RAM
as KV cache overflow storage.

--------------


.. _k25-expert-parallelism:

7. Expert parallelism
---------------------

Background
~~~~~~~~~~

Standard tensor parallelism for MoE models keeps every expert on every
GPU but shards each expert’s weight matrices, so each GPU holds a slice
of all experts. Expert parallelism (EP) instead assigns whole experts to
different GPUs. This reduces per-GPU memory pressure and, at sufficient
concurrency, increases GPU utilisation because different requests
activate different expert subsets.

For K2.5 with 384 experts across 8 GPUs, EP assigns 48 experts per GPU
(plus the shared expert, which remains replicated).

Enabling expert parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --enable-expert-parallel

This substitutes expert parallelism for tensor parallelism on MoE
layers. Attention layers continue to use tensor parallelism regardless.
Note that this flag only takes effect when
``tensor-parallel-size x data-parallel-size > 1``; on a single-node
8-GPU deployment with TP=8, this condition is satisfied.

Expert parallelism load balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

K2.5’s router is trained to distribute tokens across experts, but
traffic can become skewed in practice. vLLM’s Expert Parallel Load
Balancer (EPLB) redistributes expert mappings to even the load. Enable
it with ``--enable-eplb`` and configure it through ``--eplb-config`` as
a JSON object:

.. code:: bash

   --enable-expert-parallel \
   --enable-eplb \
   --eplb-config '{"window_size": 1000, "step_interval": 1000}'

``window_size`` controls how many forward-pass steps of load statistics
are retained. ``step_interval`` controls how often expert rearrangement
is triggered; the default is 3000 steps. Setting 1000 makes rebalancing
more responsive under bursty traffic at the cost of slightly higher
rearrangement overhead. Do not set ``step_interval`` lower than
``window_size``, as the rebalancer would then operate on incomplete
statistics.
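
The constraint between the two values can be enforced when the flag is
assembled. A minimal sketch, assuming only the two JSON keys shown
above; ``eplb_config`` is a hypothetical helper, not a vLLM utility:

.. code:: bash

   # Hypothetical helper: prints the --eplb-config JSON,
   # or fails if step_interval is lower than window_size
   eplb_config() {
       local window_size="$1" step_interval="$2"
       if [ "${step_interval}" -lt "${window_size}" ]; then
           echo "step_interval (${step_interval}) is below window_size (${window_size})" >&2
           return 1
       fi
       printf '{"window_size": %d, "step_interval": %d}\n' "${window_size}" "${step_interval}"
   }

   EPLB_CONFIG=$(eplb_config 1000 1000) && echo "--eplb-config '${EPLB_CONFIG}'"

In a job script, the resulting string would be passed as
``--eplb-config "${EPLB_CONFIG}"`` on the ``vllm serve`` command line.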

Skipping non-local expert weights on load
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --skip-non-local-expert-weights

With EP active, each GPU rank only needs its own expert shard. This flag
causes each rank to skip loading expert weights that will not reside on
that GPU, reducing storage I/O by approximately 7/8 on an 8-GPU node.
This has no effect if the checkpoint uses a 3D fused-expert format.

--------------


.. _k25-kv-cache-optimisation:

8. KV cache optimisation
------------------------

FP8 KV cache
~~~~~~~~~~~~

.. code:: bash

   --kv-cache-dtype fp8

Quantising the KV cache from BF16 to FP8 halves memory per cached token
and reduces memory bandwidth during attention decode steps. FP8 KV cache
in general requires CUDA 11.8 or later; on Hopper, use CUDA 12.8 as
listed in section 2. On H200 (Hopper architecture), this has been
validated by the vLLM team to preserve near-baseline accuracy. The main
accuracy caveat concerns hybrid-attention models with small
sliding-window layers, which does not apply to K2.5.

Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to
1.0. For better accuracy under extreme quantisation conditions, supply a
calibrated scale file via ``--quantization-param-path``.

Prefix caching
~~~~~~~~~~~~~~

.. code:: bash

   --enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating
redundant prefill computation. Prefix caching is enabled by default in
vLLM V1; specify the flag explicitly if using an older version.

Effectiveness depends entirely on prompt structure. To maximise cache
hit rates:

-  keep the system prompt identical across all requests
-  prepend retrieved document chunks in a consistent order
-  do not insert dynamic content (timestamps, request IDs) before shared
   content

CPU offload (swap space)
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --swap-space 32

The value is in GiB per GPU. On the DGX H200 with 2 TB system RAM and 8
GPUs, this allocates up to 256 GiB of CPU RAM for KV offload across all
ranks — well within the 1,800 GB SLURM allocation.

Swap is a fallback, not a primary optimisation. Monitor for preemption
warnings:

::

   WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

Frequent preemptions indicate KV pressure. Remedies in order of
preference: increase ``--gpu-memory-utilization``, decrease
``--max-model-len``, decrease ``--max-num-seqs``.
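
Counting the warning in the job logs gives a quick measure of how often
preemption occurs. The sketch below matches the message text shown
above; ``count_preemptions`` is a hypothetical helper, and both the
``.out`` and ``.err`` files are searched because the stream vLLM logs to
depends on its logging configuration.

.. code:: bash

   # Hypothetical helper: counts swap-preemption warnings across the given log files
   count_preemptions() {
       # grep -c exits non-zero on a count of 0, so "|| true" keeps the status clean
       cat "$@" 2>/dev/null | grep -c 'is preempted by PreemptionMode' || true
   }

   # Usage, with the job ID of the running server job:
   #   count_preemptions vllm_kimi.<jobid>.out vllm_kimi.<jobid>.err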

--------------


.. _k25-chunked-prefill-and-scheduler-tuning:

9. Chunked prefill and scheduler tuning
---------------------------------------

Chunked prefill
~~~~~~~~~~~~~~~

.. code:: bash

   --enable-chunked-prefill \
   --max-num-batched-tokens 8192

Without chunked prefill, a single long-context request occupies the GPU
entirely during prefill, blocking decode for all other in-flight
requests. Chunked prefill breaks large prefill computations into chunks
and interleaves them with decode steps.

``--max-num-batched-tokens`` controls the total tokens processed per
scheduling step across all requests. A value of 8,192 is a reasonable
starting point for the DGX H200. Larger values improve GPU utilisation
at the cost of increased per-step latency; values below 4,096 may leave
the GPU underutilised on prefill-heavy workloads.

Async scheduling, which overlaps scheduling overhead with decoding, is
enabled by default in recent vLLM versions. Disable with
``--no-async-scheduling`` only if unexpected behaviour is observed.

Max concurrent sequences
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --max-num-seqs 256

Caps the number of sequences in flight simultaneously. The default is
256. Reduce this if KV cache OOM errors occur under high concurrency
with long contexts; increase it if GPU utilisation is consistently below
80%.

--------------


.. _k25-eagle3-speculative-decoding:

10. Eagle3 speculative decoding
-------------------------------

Speculative decoding accelerates decode throughput by using a small
draft model to generate candidate tokens, which the main model verifies
in a single forward pass. Eagle3 draft models are trained to match the
hidden-state distribution of the target model specifically.

Two Eagle3 draft models are available for K2.5:

-  ``lightseekorg/kimi-k2.5-eagle3-mla`` for Instant mode (non-thinking)
-  ``nvidia/Kimi-K2.5-Thinking-Eagle3`` for Thinking mode (reasoning
   enabled)

Download the appropriate Eagle3 weights to project storage using the
same download job pattern from section 4.

Enabling Eagle3
~~~~~~~~~~~~~~~

.. code:: bash

   --speculative-config '{"model": "/valhalla/projects/<account>/models/kimi-k2.5-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}'

Fields:

-  ``model``: local path to Eagle3 weights on project storage
-  ``method``: must be ``"eagle3"``; not auto-detected for Eagle3
-  ``num_speculative_tokens``: tokens the draft model generates per step
   before the main model verifies; 3 is the value used in the official
   vLLM recipe for K2.5

Higher values of ``num_speculative_tokens`` increase potential speedup
per accepted run but also increase verification cost when tokens are
rejected. Values of 4 or 5 may yield further gains on predictable RAG
output; evaluate on representative traffic.

Speculative decoding requires vLLM 0.18.0 or later for Eagle3 support,
satisfied by the recommended 0.19.1.

Interaction with prefix caching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These two features can be used together in recent vLLM versions. Verify
in server startup logs that both report as active if using both
simultaneously.

--------------


.. _k25-moe-triton-kernel-tuning:

11. MoE Triton kernel tuning
----------------------------

vLLM uses Triton kernels for MoE expert routing and computation. Without
a tuned configuration, vLLM logs at startup:

::

   WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The ``benchmark_moe.py`` tuning script writes a hardware-specific JSON
file named after the GPU and expert dimensions
(e.g. ``E=384,N=...,device_name=NVIDIA_H200_141GB_HBM3e.json``) into a
target directory. Setting ``VLLM_TUNED_CONFIG_FOLDER`` to that directory
before serving causes vLLM to load it automatically.

Running the tuning script via SLURM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``tune_moe_kernels.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=gpu
   #SBATCH --job-name=tune_moe
   #SBATCH --time=02:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<gpu_qos_for_your_cluster>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gpus-per-node=8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o tune_moe.%j.out
   #SBATCH -e tune_moe.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning

   mkdir -p ${TUNING_DIR}

   # benchmark_moe.py is part of the vLLM source tree and is not installed by
   # the PyPI wheel: submit this job from a checkout of the vLLM repository
   # --tune  runs the configuration sweep
   # --save-dir  directory where the JSON config file is written
   # --tp-size   must match --tensor-parallel-size used during inference
   python benchmarks/kernels/benchmark_moe.py \
       --model ${MODEL_PATH} \
       --tp-size 8 \
       --dtype bfloat16 \
       --tune \
       --save-dir ${TUNING_DIR}

   echo "Tuning complete. Config written to ${TUNING_DIR}."

The tuning run takes 30-90 minutes. Re-run if you change GPU count, TP
size, or model.

Loading the tuned configuration at serve time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning

Set this before the ``vllm serve`` call. vLLM locates and loads the
matching config file automatically and logs:

::

   INFO fused_moe.py Using configuration from /path/to/moe_tuning/E=384,...json
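
An existing but empty tuning directory would still be exported, so it
can be worth checking for a matching file first. A minimal sketch,
assuming the ``E=384,...`` file naming described above;
``has_tuned_config`` is a hypothetical helper and ``your_account`` is a
placeholder used only when the script runs outside a SLURM job:

.. code:: bash

   # Hypothetical helper: succeeds if the directory holds a tuned config for 384 experts
   has_tuned_config() {
       # compgen -G is a Bash builtin that succeeds when the glob matches a file
       compgen -G "$1/E=384,*.json" >/dev/null
   }

   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT:-your_account}/configs/moe_tuning
   if has_tuned_config "${TUNING_DIR}"; then
       export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
   else
       echo "No tuned MoE config in ${TUNING_DIR}; vLLM will use its default config." >&2
   fi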

--------------


.. _k25-full-optimised-slurm-job-script:

12. Full optimised SLURM job script
-----------------------------------

Save as ``serve_kimi_optimised.sh``.

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=gpu
   #SBATCH --job-name=vllm_kimi_opt
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<gpu_qos_for_your_cluster>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gpus-per-node=8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o vllm_kimi_opt.%j.out
   #SBATCH -e vllm_kimi_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5
   EAGLE3_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5-eagle3-mla
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning

   # Load tuned MoE kernel configuration if the directory exists
   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 8 \
       --mm-encoder-tp-mode data \
       --gpu-memory-utilization 0.92 \
       --max-model-len 65536 \
       --dtype bfloat16 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 32 \
       --enable-expert-parallel \
       --enable-eplb \
       --eplb-config '{"window_size": 1000, "step_interval": 1000}' \
       --skip-non-local-expert-weights \
       --tool-call-parser kimi_k2 \
       --reasoning-parser kimi_k2 \
       --enable-auto-tool-choice \
       --speculative-config "{\"model\": \"${EAGLE3_PATH}\", \"method\": \"eagle3\", \"num_speculative_tokens\": 3}" \
       --trust-remote-code

For Thinking mode workloads, change ``EAGLE3_PATH``:

.. code:: bash

   EAGLE3_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5-thinking-eagle3

Submit:

.. code:: bash

   sbatch serve_kimi_optimised.sh

The vLLM server binds to port 8000 by default. Retrieve the compute node
hostname from the job output file, or query it with
``squeue -j <job_id> -h -o %N``, and connect your client to
``http://<node_hostname>:8000/v1``.
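
Loading the weights takes time, so a client may want to wait until the
server answers before sending traffic. The following is a minimal sketch,
assuming ``curl`` is available on the client node and that the server
exposes vLLM's ``/health`` endpoint. The function name
``wait_for_server`` and its timeout arguments are illustrative and not
part of vLLM or SLURM.

```shell
#!/bin/bash
# Poll the server until it answers on /health or a deadline passes.
# wait_for_server is a hypothetical helper, not a vLLM or SLURM command.
# Usage: wait_for_server <base_url> <timeout_seconds> <poll_interval_seconds>
wait_for_server() {
    wfs_url=$1
    wfs_deadline=$(( $(date +%s) + $2 ))
    # curl -f turns HTTP error responses into a non-zero exit code
    until curl -sf -o /dev/null --max-time 5 "${wfs_url}/health"; do
        if [ "$(date +%s)" -ge "${wfs_deadline}" ]; then
            echo "Server at ${wfs_url} not ready after $2 s" >&2
            return 1
        fi
        sleep "$3"
    done
    echo "Server at ${wfs_url} is ready"
}

# Example (placeholders as in the rest of this guide):
#   NODE=$(squeue -j <job_id> -h -o %N)
#   wait_for_server "http://${NODE}:8000" 1800 15
```

The timeout and polling interval in the example are arbitrary; choose
values that match the weight loading time observed on your cluster.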

--------------


.. _k25-benchmarking:

13. Benchmarking
----------------

Submit benchmarks as SLURM jobs. Replace ``<inference_node_hostname>``
with the hostname from the server job output. Save as
``benchmark_kimi.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=gpu
   #SBATCH --job-name=benchmark_kimi
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<gpu_qos_for_your_cluster>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gpus-per-node=1
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G

   #SBATCH -o benchmark_kimi.%j.out
   #SBATCH -e benchmark_kimi.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

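   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   # Keep the dataset download out of the home directory quota
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   # Unless --served-model-name is set, the server registers the model
   # under the path passed to `vllm serve`, so the benchmark must request
   # that same name. The local copy also supplies the tokenizer.
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.5
   SERVER_URL=http://<inference_node_hostname>:8000

   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model ${MODEL_PATH} \
       --dataset-name hf \
       --dataset-path lmarena-ai/VisionArena-Chat \
       --num-prompts 1000 \
       --request-rate 20 \
       --trust-remote-code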

Key metrics to track
~~~~~~~~~~~~~~~~~~~~

-  output tokens per second: decode throughput
-  time to first token (TTFT): prefill latency
-  inter-token latency (ITL): decode latency per token
-  KV cache utilisation: reported in vLLM logs and Prometheus metrics
-  preemption count: an increase indicates KV cache pressure
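
The last two metrics in the list above can be read from the server's
Prometheus endpoint. The following is a minimal sketch, assuming the
server exposes ``/metrics`` and that the relevant series contain
``cache_usage`` or ``preemptions`` in their names. Exact metric names
differ between vLLM versions, so the pattern is deliberately loose, and
``filter_kv_metrics`` is an illustrative helper name.

```shell
#!/bin/bash
# Keep only KV cache usage and preemption series from Prometheus text
# read on stdin. filter_kv_metrics is a hypothetical helper; the name
# pattern is an assumption, so check it against your server's /metrics.
filter_kv_metrics() {
    # Comment lines (# HELP / # TYPE) do not start with "vllm:" and are dropped
    grep -E '^vllm:[A-Za-z0-9_:]*(cache_usage|preemptions)' || true
}

# Example, using the SERVER_URL placeholder from the benchmark script:
#   curl -s "${SERVER_URL}/metrics" | filter_kv_metrics
```

Sampling this at intervals during a benchmark run shows whether
preemptions climb as KV cache usage approaches its limit.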

Profiling GPU utilisation
~~~~~~~~~~~~~~~~~~~~~~~~~

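Run the following on the compute node that hosts the server, from within
a SLURM job (for example a job step attached to the server job with
``srun --jobid=<server_job_id> --overlap``, where your SLURM version
supports it):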

.. code:: bash

   nvidia-smi dmon -s u -d 1

Under sustained load, all 8 GPUs should show compute utilisation (the
``sm`` column) above 70%. Values that stay below this level point to a
scheduling or communication bottleneck rather than a compute bottleneck.
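
The 70% threshold can also be checked in a script. The following is a
minimal sketch, assuming a single sample is taken with
``nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits``.
``flag_low_util`` is an illustrative helper, and a single sample is
noisy, so repeat it over the benchmark run before drawing conclusions.

```shell
#!/bin/bash
# Read "index, utilisation" CSV rows on stdin and print the GPUs whose
# utilisation is below a threshold (default 70, matching the text above).
# flag_low_util is a hypothetical helper, not an NVIDIA tool.
flag_low_util() {
    awk -F', *' -v min="${1:-70}" \
        '$2 + 0 < min { printf "GPU %s: %s%% (below %s%%)\n", $1, $2, min }'
}

# Example on the compute node:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#       --format=csv,noheader,nounits | flag_low_util 70
```

No output means every GPU in that sample was at or above the threshold.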

--------------


.. _k25-known-caveats-and-constraints:

14. Known caveats and constraints
---------------------------------

``pip`` inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every ``pip`` invocation in this guide comes after
``export PATH=${VIRTUAL_ENV}/bin:${PATH}``, so it resolves to the ``pip``
inside the Conda environment. If a new SLURM script omits this line and
calls ``pip`` directly, packages will install into
``~/.local/lib/python3.11/site-packages/`` and consume home directory
quota. Always verify with ``which pip`` before any installation step.
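
The ``which pip`` check can be turned into a guard that stops a job
script before anything is installed in the wrong place. The following is
a minimal sketch, assuming ``VIRTUAL_ENV`` is exported as in the scripts
in this guide. ``pip_is_in_env`` is an illustrative helper name.

```shell
#!/bin/bash
# Succeed only if the first pip on PATH lives under ${VIRTUAL_ENV}/bin.
# pip_is_in_env is a hypothetical helper, not part of pip or Conda.
pip_is_in_env() {
    # An unset VIRTUAL_ENV would make the pattern below match /bin/pip
    [ -n "${VIRTUAL_ENV:-}" ] || return 1
    pie_path=$(command -v pip) || return 1
    case "${pie_path}" in
        "${VIRTUAL_ENV}/bin/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example guard before an installation step:
#   pip_is_in_env || { echo "pip is outside ${VIRTUAL_ENV}. Exiting."; exit 1; }
```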

FP8 KV cache with chunked prefill
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known interaction between ``--kv-cache-dtype fp8`` and
``--enable-chunked-prefill`` in vLLM versions below 0.17.0 that can
produce type incompatibility errors. This has been resolved in the
recommended vLLM 0.19.1. If using an older version, disable one of the
two flags and verify stability before re-enabling both.
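
A job script can check the installed version before combining the two
flags. The following is a minimal sketch, assuming GNU ``sort -V`` is
available for version ordering. ``version_at_least`` is an illustrative
helper, and pre-release or development version strings may not order as
expected.

```shell
#!/bin/bash
# Succeed if version $1 is greater than or equal to version $2.
# version_at_least is a hypothetical helper built on GNU sort -V.
version_at_least() {
    # After a version sort, the smaller of the two versions comes first
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example, reading the version from the active environment:
#   VLLM_VERSION=$(python -c 'import vllm; print(vllm.__version__)')
#   version_at_least "${VLLM_VERSION}" 0.17.0 \
#       || echo "vLLM ${VLLM_VERSION}: avoid fp8 KV cache with chunked prefill" >&2
```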

Speculative decoding and prefix caching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These two features can interact unexpectedly in some vLLM versions.
Verify in server startup logs that both report as active when using them
simultaneously.

``--mm-encoder-tp-mode`` and memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--mm-encoder-tp-mode data`` replicates vision encoder weights across
all TP ranks. If OOM errors occur at startup, reduce
``--gpu-memory-utilization`` from 0.92 to 0.90 as a first remediation.

``--skip-non-local-expert-weights``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This flag only reduces storage I/O during weight loading and has no
effect at inference time. It does nothing at all if your checkpoint uses
a 3D fused-expert format. Compare weight loading time with and without
the flag to confirm it takes effect for your checkpoint format.

MoE tuning config naming
~~~~~~~~~~~~~~~~~~~~~~~~

The file written by ``benchmark_moe.py --tune --save-dir`` is named
automatically based on the model’s expert configuration and the detected
GPU name. The exact filename will depend on how CUDA reports the H200
device name on your cluster. Confirm the file is present in
``TUNING_DIR`` and that vLLM logs confirmation of loading it at serve
startup.
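
The presence check can be scripted before ``vllm serve`` starts. The
following is a minimal sketch, assuming the tuned configurations are
written as ``.json`` files directly inside ``TUNING_DIR``.
``list_tuned_configs`` is an illustrative helper name.

```shell
#!/bin/bash
# Print the tuned MoE config files found directly inside a directory.
# list_tuned_configs is a hypothetical helper; the .json extension is an
# assumption about how the tuning script names its output.
list_tuned_configs() {
    find "$1" -maxdepth 1 -type f -name '*.json' 2>/dev/null || true
}

# Example, replacing the directory-only test in the job script:
#   if [ -n "$(list_tuned_configs "${TUNING_DIR}")" ]; then
#       export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}
#   else
#       echo "No tuned MoE configs in ${TUNING_DIR}; using vLLM defaults" >&2
#   fi
```

This only confirms that a file exists. Whether vLLM selects it still has
to be confirmed in the startup log, as described above.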

vLLM version pinning
~~~~~~~~~~~~~~~~~~~~

Moonshot AI has verified deployment against vLLM 0.19.1. Nightly builds
and later versions may introduce API changes or regressions in
K2.5-specific code paths. Pin to 0.19.1 for production and test newer
versions in a staging environment before promoting.

Reasoning mode default
~~~~~~~~~~~~~~~~~~~~~~

K2.5 enables Thinking mode by default. To suppress reasoning output on a
per-request basis, pass ``thinking_token_budget: 0`` as a sampling
parameter. The ``--reasoning-parser kimi_k2`` flag is still required
even when reasoning is suppressed per-request, as it initialises the
parser infrastructure.

Login node usage
~~~~~~~~~~~~~~~~

On Discoverer+, all computationally or I/O-intensive operations must be
submitted as SLURM jobs. This includes Conda environment creation,
package installation, model weight downloading, server startup, and
benchmarking. Running any of these on the login node is prohibited and
will compete for resources shared with all other users.

Conda activation on Discoverer+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The guide uses ``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` rather than
``conda activate`` to expose the virtual environment within SLURM
scripts. This is the recommended approach on Discoverer+ and does not
require initialising a Conda shell. Use ``conda activate`` only if a
specific package or script explicitly requires it.