Serving Kimi K2.6 on DGX H200 with vLLM and SLURM
=================================================

This guide covers the complete process of deploying Moonshot AI’s Kimi
K2.6 on a single DGX H200 node within a SLURM-managed cluster
(Discoverer+), using Conda for environment management and vLLM for
inference (some of the components are instaled using ``pip``).

Contents
--------

1.  `Model overview <k26-model-overview_>`_
2.  `Hardware and software prerequisites <k26-hardware-and-software-prerequisites_>`_
3.  `Environment setup with Conda on Discoverer+ <k26-environment-setup-with-conda-on-discoverer_>`_
4.  `Installing vLLM in the Conda environment <k26-installing-vllm-in-the-conda-environment_>`_
5.  `Baseline deployment <k26-baseline-deployment_>`_
6.  `Memory layout and GPU allocation <k26-memory-layout-and-gpu-allocation_>`_
7.  `Expert parallelism <k26-expert-parallelism_>`_
8.  `KV cache optimisation <k26-kv-cache-optimisation_>`_
9.  `Chunked prefill and scheduler tuning <k26-chunked-prefill-and-scheduler-tuning_>`_
10.  `Eagle3 speculative decoding <k26-eagle3-speculative-decoding_>`_
11.  `MoE Triton kernel tuning <k26-moe-triton-kernel-tuning_>`_
12.  `Full optimised SLURM job script <k26-full-optimised-slurm-job-script_>`_
13.  `Benchmarking <k26-benchmarking_>`_
14.  `Known caveats and constraints <k26-known-caveats-and-constraints_>`_

--------------


.. _k26-model-overview:

1. Model overview
-----------------

Kimi K2.6 is a 1 trillion parameter Mixture-of-Experts (MoE) model
released by Moonshot AI on April 20, 2026. It shares the same base
architecture as Kimi K2.5; Moonshot’s own deployment guide states
explicitly: “Kimi-K2.6 has the same architecture as Kimi-K2.5, and the
deployment method can be directly reused.” Key architectural
characteristics relevant to deployment:

=========================== =================================
Property                    Value
=========================== =================================
Total parameters            1 trillion
Active parameters per token 32 billion
Layers                      61 (including 1 dense layer)
Attention mechanism         Multi-head Latent Attention (MLA)
Number of experts           384
Experts selected per token  8 routed + 1 shared
Attention heads             64
Activation function         SwiGLU
Vision encoder              MoonViT (400M parameters)
Vocabulary size             160K
Context window              262,144 tokens
=========================== =================================

MLA compresses the KV cache by approximately 10× compared to standard
MHA, making long-context serving materially more practical.

K2.6 differs from K2.5 in the following deployment-relevant ways:

-  A native INT4 checkpoint (``moonshotai/Kimi-K2.6-INT4``) is
   available, quantised during training rather than post-hoc. The
   verified VRAM requirement for the INT4 checkpoint is approximately
   640 GB, well within the single DGX H200 node envelope of 1,128 GB.
-  Native video input is supported but flagged by Moonshot as
   experimental for third-party deployments; do not rely on it for
   production workloads on self-hosted vLLM.
-  The ``transformers`` library version must be ``>=4.57.1, <5.0.0``.

Weights are available at ``moonshotai/Kimi-K2.6`` (BF16, approximately 2
TB on disk) and ``moonshotai/Kimi-K2.6-INT4`` (INT4, approximately 594
GB on disk) on Hugging Face under a modified MIT licence.

--------------


.. _k26-hardware-and-software-prerequisites:

2. Hardware and software prerequisites
--------------------------------------

DGX H200 system specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The NVIDIA DGX H200 provides the following hardware relevant to this
deployment:

+-----------------------------------+-----------------------------------+
| Component                         | Specification                     |
+===================================+===================================+
| GPUs                              | 8x NVIDIA H200 SXM Tensor Core    |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| GPU memory                        | 141 GB HBM3e per GPU, 1,128 GB    |
|                                   | total                             |
+-----------------------------------+-----------------------------------+
| GPU memory bandwidth              | 4.8 TB/s per GPU                  |
+-----------------------------------+-----------------------------------+
| GPU interconnect                  | 18x NVLink 4.0 connections per    |
|                                   | GPU, 900 GB/s bidirectional per   |
|                                   | GPU                               |
+-----------------------------------+-----------------------------------+
| NVSwitch                          | 4x NVSwitch, 7.2 TB/s aggregate   |
|                                   | bidirectional GPU-to-GPU          |
|                                   | bandwidth                         |
+-----------------------------------+-----------------------------------+
| Host CPUs                         | 2x Intel Xeon Platinum 8480C, 112 |
|                                   | cores total                       |
+-----------------------------------+-----------------------------------+
| System memory                     | 2 TB DDR5                         |
+-----------------------------------+-----------------------------------+
| NVMe storage                      | 8x 3.84 TB (data), 2x 1.92 TB     |
|                                   | (OS)                              |
+-----------------------------------+-----------------------------------+
| Network                           | 10x ConnectX-7, 400 Gb/s          |
|                                   | InfiniBand/Ethernet               |
+-----------------------------------+-----------------------------------+

The 4x NVSwitch fabric provides full all-to-all GPU connectivity at 7.2
TB/s, which is critical for all-reduce operations in tensor parallelism
across all 8 GPUs.

Software requirements
~~~~~~~~~~~~~~~~~~~~~

+---------------+------------------+---------------------------------+
| Component     | Minimum version  | Notes                           |
+===============+==================+=================================+
| CUDA toolkit  | 12.1             | 12.8 required for FP8 KV cache  |
|               |                  | on Hopper                       |
+---------------+------------------+---------------------------------+
| NVIDIA driver | 535.x            | 560+ recommended                |
+---------------+------------------+---------------------------------+
| Python        | 3.11             | as specified in the Conda       |
|               |                  | environment                     |
+---------------+------------------+---------------------------------+
| vLLM          | 0.19.1           | verified by Moonshot AI for     |
|               |                  | K2.6; pin this version          |
+---------------+------------------+---------------------------------+
| transformers  | >=4.57.1, <5.0.0 | required by the K2.6 model code |
+---------------+------------------+---------------------------------+
| PyTorch       | 2.5+             | installed as a vLLM dependency  |
+---------------+------------------+---------------------------------+

On Discoverer+, CUDA libraries are provided through the cluster
environment module system and do not need to be installed manually
inside the Conda environment.

--------------


.. _k26-environment-setup-with-conda-on-discoverer:

3. Environment setup with Conda on Discoverer+
----------------------------------------------

On Discoverer+, Conda is provided through the centralised Anaconda
installation and accessed via the module system. Do not install a
separate Anaconda or Miniconda distribution in your home or project
directory.

The recommended location for virtual environments on Discoverer+ is:

::

   /valhalla/projects/<your_slurm_project_account_name>/virt_envs/

Create a dedicated environment for K2.6, separate from any K2.5
environment, to keep the two deployments independent.

Creating the vLLM environment via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Environment creation must not be run on the login node. Submit a SLURM
batch job instead.

Save the following as ``create_vllm_k26_env.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=create_vllm_k26_env
   #SBATCH --time=00:30:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G

   #SBATCH -o create_vllm_k26_env.%j.out
   #SBATCH -e create_vllm_k26_env.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26

   [ -d ${VIRTUAL_ENV} ] && { echo "Environment ${VIRTUAL_ENV} already exists. Exiting."; exit 1; }

   conda create --prefix ${VIRTUAL_ENV} python=3.11 -y

   if [ $? -ne 0 ]; then
       echo "Conda environment creation failed." >&2
       exit 1
   fi

   echo "Conda environment created successfully."
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Environment ready for vLLM installation."

Submit and verify:

.. code:: bash

   sbatch create_vllm_k26_env.sh
   cat create_vllm_k26_env.<jobid>.out

--------------


.. _k26-installing-vllm-in-the-conda-environment:

4. Installing vLLM in the Conda environment
-------------------------------------------

Use of ``pip`` inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Conda is the preferred package manager on Discoverer+ and should be used
wherever packages are available in a suitable version on conda-forge.
For vLLM, the conda-forge channel currently only carries versions up to
0.10.x — significantly behind the 0.19.1 release verified by Moonshot AI
for K2.6. The vLLM project distributes current releases exclusively
through PyPI wheels, so ``pip`` is necessary for this specific package.

``pip`` is safe to use here provided it is invoked through the ``pip``
binary that resides inside the Conda environment. Setting
``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` before calling ``pip`` causes
``pip`` to install all packages into
``${VIRTUAL_ENV}/lib/python3.11/site-packages/`` — entirely within the
project storage path on ``/valhalla``. Nothing is written to
``~/.local`` or to any other location within the user’s home directory.
The home directory spillage only occurs when ``pip`` is called without
an active environment, using the system Python binary.

To confirm the correct ``pip`` binary is active at any point during a
job, execute as a job within the created virtual environment:

.. code:: bash

   which pip
   # must print: /valhalla/projects/<account>/virt_envs/vllm-kimi-k26/bin/pip

Installing vLLM via a SLURM batch job
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save the following as ``install_vllm_k26.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=install_vllm_k26
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o install_vllm_k26.%j.out
   #SBATCH -e install_vllm_k26.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

   # Expose the Conda environment. pip below installs into ${VIRTUAL_ENV}, not into ~/.local
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   # vLLM 0.19.1 is not on conda-forge; install from PyPI using the environment's own pip
   pip install "vllm==0.19.1"

   if [ $? -ne 0 ]; then
       echo "vLLM installation failed." >&2
       exit 1
   fi

   pip install "huggingface_hub[cli]"

   # K2.6 requires transformers >=4.57.1, <5.0.0
   pip install "transformers>=4.57.1,<5.0.0"

   echo "vLLM installation complete."
   echo "Installed vLLM version: $(python -c 'import vllm; print(vllm.__version__)')"
   echo "Install location: $(python -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')"

Submit:

.. code:: bash

   sbatch install_vllm_k26.sh

Verify in the job output that the install location is under
``/valhalla`` and not ``~/.local``.

Downloading model weights
~~~~~~~~~~~~~~~~~~~~~~~~~

The INT4 checkpoint is recommended for single-node deployment on DGX
H200. It requires approximately 640 GB of VRAM and approximately 594 GB
of disk space. Ensure project storage has at least 700 GB free before
submitting.

Save the following as ``download_kimi_k26.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=download_kimi_k26
   #SBATCH --time=04:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=32G

   #SBATCH -o download_kimi_k26.%j.out
   #SBATCH -e download_kimi_k26.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4

   huggingface-cli download moonshotai/Kimi-K2.6-INT4 \
       --local-dir ${MODEL_DIR} \
       --local-dir-use-symlinks False

   echo "Download complete. Weights at ${MODEL_DIR}."

To download the BF16 checkpoint instead, substitute
``moonshotai/Kimi-K2.6`` and update ``MODEL_DIR`` accordingly. Allow 2-4
hours for the INT4 checkpoint and significantly longer for BF16.

--------------


.. _k26-baseline-deployment:

5. Baseline deployment
----------------------

The following SLURM job starts a vLLM inference server on a single DGX
H200 node. All flags listed are required for correct behaviour with Kimi
K2.6.

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_kimi_k26
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o vllm_kimi_k26.%j.out
   #SBATCH -e vllm_kimi_k26.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 8 \
       --mm-encoder-tp-mode data \
       --tool-call-parser kimi_k2 \
       --reasoning-parser kimi_k2 \
       --enable-auto-tool-choice \
       --trust-remote-code

Flag explanations
~~~~~~~~~~~~~~~~~

``--tensor-parallel-size 8`` Shards model weights across all 8 GPUs.
Required to fit the model in VRAM. The DGX H200 NVSwitch fabric provides
7.2 TB/s aggregate GPU-to-GPU bandwidth, making all-reduce operations
efficient.

``--mm-encoder-tp-mode data`` Deploys the MoonViT vision encoder in
data-parallel mode. The encoder (400M parameters) is small enough that
tensor parallelism adds communication overhead with negligible memory
benefit. Confirmed in the official vLLM recipe for K2.6.

``--tool-call-parser kimi_k2`` Required for correct parsing of function
call syntax. The ``kimi_k2`` parser covers the entire K2 series
including K2.6, as confirmed in Moonshot’s deploy guide.

``--reasoning-parser kimi_k2`` K2.6 enables Thinking mode by default.
Without this flag, reasoning tokens are not correctly separated from
final output in the API response. As of vLLM 0.9.0, this flag implicitly
enables reasoning mode.

``--enable-auto-tool-choice`` Permits the model to decide when to call a
tool without the client specifying ``tool_choice`` in each request.

``--trust-remote-code`` Required for K2.6’s MLA attention
implementation, which defines custom architecture classes not present in
the vLLM codebase.

--------------


.. _k26-memory-layout-and-gpu-allocation:

6. Memory layout and GPU allocation
-----------------------------------

VRAM consumption breakdown (8x H200, INT4 weights)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+------------------------+------------------+------------------------+
| Component              | Approximate size | Notes                  |
+========================+==================+========================+
| Model weights (INT4)   | ~640 GB          | distributed across TP  |
|                        |                  | group                  |
+------------------------+------------------+------------------------+
| Activations and CUDA   | ~10 GB           | varies with batch size |
| overhead               |                  |                        |
+------------------------+------------------+------------------------+
| KV cache (remainder)   | ~450 GB          | at                     |
|                        |                  | ``--gpu-mem            |
|                        |                  | ory-utilization 0.92`` |
+------------------------+------------------+------------------------+

The INT4 checkpoint leaves substantially more KV cache headroom than the
BF16 checkpoint on a single DGX H200 node, making longer context lengths
and higher concurrency viable without moving to a two-node deployment.

For BF16 weights, the breakdown mirrors the K2.5 guide: approximately
550 GB for weights, leaving approximately 550 GB for KV cache,
activations, and overhead at ``--gpu-memory-utilization 0.92``.

GPU memory utilisation
~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --gpu-memory-utilization 0.92

The default in vLLM is 0.90. Setting 0.92 on the DGX H200 with 8-way TP
is safe and recovers approximately 23 GB of additional KV cache space
across the node.

Context length
~~~~~~~~~~~~~~

.. code:: bash

   --max-model-len 131072

K2.6 supports up to 262,144 tokens. With INT4 weights and the large
remaining KV cache, higher values are feasible; start conservatively at
131,072 and increase to 262,144 only after verifying headroom with a
benchmark run. Every sequence reserves KV cache proportional to
``--max-model-len``; do not leave this at the model default.

System memory in SLURM
~~~~~~~~~~~~~~~~~~~~~~

The DGX H200 has 2 TB of system RAM. ``--mem=1800G`` reserves 1,800 GB,
leaving approximately 200 GB for the OS. This supports the swap-space
optimisation in section 8.

--------------


.. _k26-expert-parallelism:

7. Expert parallelism
---------------------

Background
~~~~~~~~~~

Expert parallelism (EP) assigns different experts to different GPUs
rather than replicating all experts on every GPU and sharding them. For
K2.6 with 384 experts across 8 GPUs, EP assigns approximately 48 experts
per GPU (plus the shared expert, which remains replicated). This reduces
per-GPU memory pressure from expert weights and improves GPU utilisation
at high concurrency.

Enabling expert parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --enable-expert-parallel

This flag only takes effect when
``tensor-parallel-size × data-parallel-size > 1``. On a single-node
8-GPU deployment with TP=8, this condition is satisfied.

Expert parallelism load balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --enable-expert-parallel \
   --enable-eplb \
   --eplb-config '{"window_size": 1000, "step_interval": 1000}'

``window_size`` controls how many forward-pass steps of load statistics
are retained. ``step_interval`` controls how often rearrangement is
triggered; the default is 3000 steps. A value of 1000 makes rebalancing
more responsive. Do not set ``step_interval`` lower than
``window_size``, as the rebalancer would then operate on incomplete
statistics.

Skipping non-local expert weights on load
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --skip-non-local-expert-weights

With EP active, each GPU rank only needs its own expert shard. This flag
reduces storage I/O at load time by approximately 7/8 on an 8-GPU node.
It has no effect if the checkpoint uses a 3D fused-expert format.

--------------


.. _k26-kv-cache-optimisation:

8. KV cache optimisation
------------------------

FP8 KV cache
~~~~~~~~~~~~

.. code:: bash

   --kv-cache-dtype fp8

Quantising the KV cache from BF16 to FP8 halves memory per cached token.
Requires CUDA 11.8 or later. Validated on H200 (Hopper architecture) by
the vLLM team.

Without a pre-calibrated checkpoint, vLLM defaults KV scale factors to
1.0. For better accuracy under extreme quantisation conditions, supply a
calibrated scale file via ``--quantization-param-path``.

Prefix caching
~~~~~~~~~~~~~~

.. code:: bash

   --enable-prefix-caching

Reuses the computed KV cache for identical prompt prefixes, eliminating
redundant prefill computation. Prefix caching is enabled by default in
vLLM V1; specify the flag explicitly if using an older version.

To maximise cache hit rates:

-  keep the system prompt identical across all requests
-  prepend retrieved document chunks in a consistent order
-  do not insert dynamic content (timestamps, request IDs) before shared
   content

CPU offload (swap space)
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --swap-space 32

The value is in GiB per GPU. On the DGX H200 with 2 TB system RAM and 8
GPUs, this allocates up to 256 GiB of CPU RAM for KV offload across all
ranks.

Swap is a fallback mechanism. Monitor for preemption warnings:

::

   WARNING scheduler.py Sequence group N is preempted by PreemptionMode.SWAP mode

Remedies in order of preference: increase ``--gpu-memory-utilization``,
decrease ``--max-model-len``, decrease ``--max-num-seqs``.

--------------


.. _k26-chunked-prefill-and-scheduler-tuning:

9. Chunked prefill and scheduler tuning
---------------------------------------

Chunked prefill
~~~~~~~~~~~~~~~

.. code:: bash

   --enable-chunked-prefill \
   --max-num-batched-tokens 8192

Chunked prefill breaks large prefill computations into chunks
interleaved with decode steps, preventing single long-context requests
from blocking decode throughput for all other in-flight requests.

``--max-num-batched-tokens`` controls the total tokens processed per
scheduling step. A value of 8,192 is a reasonable starting point for the
DGX H200.

Max concurrent sequences
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   --max-num-seqs 256

Caps the number of sequences in flight simultaneously. Reduce if KV
cache OOM errors occur under high concurrency; increase if GPU
utilisation is consistently below 80%.

--------------


.. _k26-eagle3-speculative-decoding:

10. Eagle3 speculative decoding
-------------------------------

Speculative decoding accelerates decode throughput by using a small
draft model to generate candidate tokens, which the main model verifies
in a single forward pass. Eagle3 draft models are trained specifically
on the hidden-state distribution of the target model.

Two Eagle3 draft models are available for the K2 series. At the time of
writing, Moonshot has not published K2.6-specific Eagle3 weights. Use
the K2.5 Eagle3 models only if they have been confirmed compatible with
K2.6 by Moonshot or the vLLM community; do not assume compatibility
without verification.

-  ``lightseekorg/kimi-k2.5-eagle3-mla`` — Instant mode
-  ``nvidia/Kimi-K2.5-Thinking-Eagle3`` — Thinking mode

If K2.6-specific Eagle3 weights become available, download them to
project storage and substitute the path in the serve command.

Enabling Eagle3
~~~~~~~~~~~~~~~

.. code:: bash

   --speculative-config '{"model": "/valhalla/projects/<account>/models/kimi-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}'

Fields:

-  ``model``: local path to Eagle3 weights on project storage
-  ``method``: must be ``"eagle3"``; not auto-detected
-  ``num_speculative_tokens``: 3 is the value used in the official vLLM
   recipe for the K2 series

--------------


.. _k26-moe-triton-kernel-tuning:

11. MoE Triton kernel tuning
----------------------------

Without a tuned configuration, vLLM logs at startup:

::

   WARNING fused_moe.py Using default MoE config. Performance might be sub-optimal!

The ``benchmark_moe.py`` script writes a hardware-specific JSON file
named by GPU and expert dimensions into a target directory. Setting
``VLLM_TUNED_CONFIG_FOLDER`` to that directory before serving causes
vLLM to load it automatically.

Running the tuning script via SLURM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Save as ``tune_moe_k26.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=tune_moe_k26
   #SBATCH --time=02:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o tune_moe_k26.%j.out
   #SBATCH -e tune_moe_k26.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26

   mkdir -p ${TUNING_DIR}

   python benchmarks/kernels/benchmark_moe.py \
       --model ${MODEL_PATH} \
       --tp-size 8 \
       --dtype bfloat16 \
       --tune \
       --save-dir ${TUNING_DIR}

   echo "Tuning complete. Config written to ${TUNING_DIR}."

The tuning run takes 30-90 minutes. Re-run if you change GPU count, TP
size, or model.

Loading the tuned configuration at serve time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   export VLLM_TUNED_CONFIG_FOLDER=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26

Set this before the ``vllm serve`` call. vLLM logs confirmation of
loading it at startup.

--------------


.. _k26-full-optimised-slurm-job-script:

12. Full optimised SLURM job script
-----------------------------------

Save as ``serve_kimi_k26_optimised.sh``.

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_kimi_k26_opt
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o vllm_kimi_k26_opt.%j.out
   #SBATCH -e vllm_kimi_k26_opt.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   EAGLE3_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-eagle3-mla
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26

   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   vllm serve ${MODEL_PATH} \
       --tensor-parallel-size 8 \
       --mm-encoder-tp-mode data \
       --gpu-memory-utilization 0.92 \
       --max-model-len 131072 \
       --dtype bfloat16 \
       --kv-cache-dtype fp8 \
       --enable-prefix-caching \
       --enable-chunked-prefill \
       --max-num-batched-tokens 8192 \
       --swap-space 32 \
       --enable-expert-parallel \
       --enable-eplb \
       --eplb-config '{"window_size": 1000, "step_interval": 1000}' \
       --skip-non-local-expert-weights \
       --tool-call-parser kimi_k2 \
       --reasoning-parser kimi_k2 \
       --enable-auto-tool-choice \
       --speculative-config "{\"model\": \"${EAGLE3_PATH}\", \"method\": \"eagle3\", \"num_speculative_tokens\": 3}" \
       --trust-remote-code

For Thinking mode workloads, change ``EAGLE3_PATH`` to the
Thinking-Eagle3 model path once K2.6-compatible weights are available.

Submit:

.. code:: bash

   sbatch serve_kimi_k26_optimised.sh

The vLLM server binds to port 8000 by default. Retrieve the compute node
hostname from the job output file and connect your client to
``http://<node_hostname>:8000/v1``.

--------------


.. _k26-benchmarking:

13. Benchmarking
----------------

Submit benchmarks as SLURM jobs. Replace ``<inference_node_hostname>``
with the hostname from the server job output. Save as
``benchmark_kimi_k26.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=benchmark_kimi_k26
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G

   #SBATCH -o benchmark_kimi_k26.%j.out
   #SBATCH -e benchmark_kimi_k26.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   SERVER_URL=http://<inference_node_hostname>:8000

   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model moonshotai/Kimi-K2.6 \
       --dataset-name hf \
       --dataset-path lmarena-ai/VisionArena-Chat \
       --num-prompts 1000 \
       --request-rate 20 \
       --trust-remote-code

Key metrics to track
~~~~~~~~~~~~~~~~~~~~

-  output tokens per second (decode throughput)
-  time to first token (TTFT) — prefill latency
-  inter-token latency (ITL) — decode latency per token
-  KV cache utilisation — reported in vLLM logs and Prometheus metrics
-  preemption count — an increase indicates KV pressure

Profiling GPU utilisation
~~~~~~~~~~~~~~~~~~~~~~~~~

From within a SLURM job on the compute node:

.. code:: bash

   nvidia-smi dmon -s u -d 1

Under sustained load, all 8 GPUs should show compute utilisation above
70%.

--------------


.. _k26-known-caveats-and-constraints:

14. Known caveats and constraints
---------------------------------

``pip`` inside the Conda environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All ``pip`` invocations in this guide follow
``export PATH=${VIRTUAL_ENV}/bin:${PATH}``. If a new SLURM script omits
this line and calls ``pip`` directly, packages will install into
``~/.local/lib/python3.11/site-packages/`` and consume home directory
quota. Always verify with ``which pip`` before any installation step.

FP8 KV cache with chunked prefill
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a known interaction between ``--kv-cache-dtype fp8`` and
``--enable-chunked-prefill`` in vLLM versions below 0.17.0 that can
produce type incompatibility errors. This has been resolved in the
recommended vLLM 0.19.1.

Speculative decoding and prefix caching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These two features can interact unexpectedly in some vLLM versions.
Verify in server startup logs that both report as active when using them
simultaneously.

Eagle3 weights for K2.6
~~~~~~~~~~~~~~~~~~~~~~~

At the time of writing, Moonshot has not published Eagle3 draft model
weights specifically trained on K2.6. The serve script above includes
the ``--speculative-config`` flag with a placeholder path. Do not
populate this path with K2.5 Eagle3 weights without first confirming
compatibility, as the hidden-state distribution may differ between K2.5
and K2.6 due to the additional post-training applied for K2.6. Remove
the flag entirely if confirmed-compatible Eagle3 weights are not
available.

``--mm-encoder-tp-mode`` and memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--mm-encoder-tp-mode data`` replicates vision encoder weights across
all TP ranks. If OOM errors occur at startup, reduce
``--gpu-memory-utilization`` from 0.92 to 0.90 as a first remediation.

``--skip-non-local-expert-weights``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This flag only reduces storage I/O on load and has no effect at
inference time. It has no effect if the checkpoint uses a 3D
fused-expert format.

MoE tuning config naming
~~~~~~~~~~~~~~~~~~~~~~~~

The file written by ``benchmark_moe.py --tune --save-dir`` is named
automatically by hardware and expert dimensions. Confirm the file is
present in ``TUNING_DIR`` and that vLLM logs confirmation of loading it
at serve startup.

vLLM version pinning
~~~~~~~~~~~~~~~~~~~~

Moonshot AI has verified K2.6 deployment against vLLM 0.19.1. Pin to
this version for production and test newer versions in a staging
environment before promoting.

Reasoning mode default
~~~~~~~~~~~~~~~~~~~~~~

K2.6 enables Thinking mode by default. To suppress reasoning output on a
per-request basis, pass
``{'chat_template_kwargs': {"thinking": False}}`` in ``extra_body`` when
using vLLM or SGLang, as specified in the K2.6 model card. The
``--reasoning-parser kimi_k2`` flag is still required at serve time
regardless.

Login node usage
~~~~~~~~~~~~~~~~

On Discoverer+, all computationally or I/O-intensive operations must be
submitted as SLURM jobs. This includes Conda environment creation,
package installation, model weight downloading, server startup, and
benchmarking.

Conda activation on Discoverer+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The guide uses ``export PATH=${VIRTUAL_ENV}/bin:${PATH}`` rather than
``conda activate``. This is the recommended approach on Discoverer+ and
does not require initialising a Conda shell.