Running Kimi K2.5+ across two DGX H200 nodes on Discoverer+
===========================================================

This guide covers the additional configuration required to run Kimi
K2.5+ across both DGX H200 nodes available on Discoverer+, using Ray for
multi-node coordination and vLLM for inference. Throughout, K2.5+ means
Kimi K2.5 and compatible later Kimi releases (for example K2.6), using
the Conda environment, weight paths, and flags from the matching
single-node guide. It assumes that single-node setup is already complete
before you follow the steps below.

Contents
--------

1.  `Discoverer+ hardware constraints <mn-discoverer-hardware-constraints_>`_
2.  `Multi-node parallelism concepts <mn-multi-node-parallelism-concepts_>`_
3.  `Additional prerequisites <mn-additional-prerequisites_>`_
4.  `Installing Ray in the Conda environment <mn-installing-ray-in-the-conda-environment_>`_
5.  `Two-node deployment <mn-two-node-deployment_>`_
6.  `MoE Triton kernel tuning for two nodes <mn-moe-triton-kernel-tuning-for-two-nodes_>`_
7.  `Benchmarking a two-node deployment <mn-benchmarking-a-two-node-deployment_>`_
8.  `Verifying InfiniBand inter-node communication <mn-verifying-infiniband-inter-node-communication_>`_
9.  `Known caveats <mn-known-caveats_>`_

--------------


.. _mn-discoverer-hardware-constraints:

1. Discoverer+ hardware constraints
-----------------------------------

Discoverer+ has exactly two DGX H200 nodes, ``dgx1`` and ``dgx2``, both
in the ``common`` partition of the ``disco-plus`` Slurm cluster. Each
node has 8 NVIDIA H200 GPUs (141 GB HBM3e each), giving 16 GPUs and
2,256 GB of VRAM in total across the cluster.
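
The totals above follow from simple multiplication. As a quick sanity
check, the sketch below reproduces them; the variable names are
illustrative and do not come from any Discoverer+ script:

.. code:: bash

   #!/bin/bash
   # Hardware figures quoted in this section
   NUM_NODES=2
   GPUS_PER_NODE=8
   VRAM_PER_GPU_GB=141

   TOTAL_GPUS=$((NUM_NODES * GPUS_PER_NODE))
   VRAM_PER_NODE_GB=$((GPUS_PER_NODE * VRAM_PER_GPU_GB))
   TOTAL_VRAM_GB=$((TOTAL_GPUS * VRAM_PER_GPU_GB))

   echo "GPUs: ${TOTAL_GPUS}; VRAM per node: ${VRAM_PER_NODE_GB} GB; total VRAM: ${TOTAL_VRAM_GB} GB"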

A two-node job using ``--nodes=2`` requests all 16 GPUs, that is, the
entire GPU capacity of Discoverer+. The Discoverer+ documentation states: “For
multi-node GPU applications, you may need to set ``--nodes=2``, but that
is rare for a single job to require the use of 16 NVIDIA H200 GPUs.”
Because the cluster is shared, reserve this configuration for workloads
that genuinely need both nodes: while the job runs, no other GPU job can
start on Discoverer+.

The inter-node network provides 10 × 400 Gbps InfiniBand ConnectX-7 per
node, which is the fast path required for efficient multi-node GPU
communication via NCCL.

--------------


.. _mn-multi-node-parallelism-concepts:

2. Multi-node parallelism concepts
----------------------------------

Parallelism strategy
~~~~~~~~~~~~~~~~~~~~

For two DGX H200 nodes with 8 GPUs each, the standard configuration
documented by vLLM is:

-  ``--tensor-parallel-size 8`` — all 8 GPUs within each node process
   each layer in parallel
-  ``--pipeline-parallel-size 2`` — the model layers are split across
   the two nodes; node 0 processes approximately the first half of
   layers, node 1 processes the remainder

Cross-node communication occurs only at the layer boundary between the
two pipeline stages. This makes the configuration less sensitive to
inter-node latency than cross-node tensor parallelism, where every layer
requires all-reduce communication across all 16 GPUs simultaneously.
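
The relationship between the two flags and the allocation can be checked
before submitting a job. The following is a minimal sketch;
``check_parallel_layout`` is a hypothetical helper, not part of vLLM or
Slurm. It verifies that the tensor-parallel group fits inside one node
and that TP × PP matches the number of allocated GPUs:

.. code:: bash

   #!/bin/bash
   # check_parallel_layout TP PP NODES GPUS_PER_NODE   (hypothetical helper)
   # Succeeds when TP fits inside one node and TP x PP uses every allocated GPU.
   check_parallel_layout() {
       local tp="$1" pp="$2" nodes="$3" gpus_per_node="$4"
       local world_size=$((tp * pp))
       local available=$((nodes * gpus_per_node))

       if [ "${tp}" -gt "${gpus_per_node}" ]; then
           echo "TP=${tp} would span nodes (only ${gpus_per_node} GPUs per node)." >&2
           return 1
       fi
       if [ "${world_size}" -ne "${available}" ]; then
           echo "TP x PP = ${world_size}, but the allocation has ${available} GPUs." >&2
           return 1
       fi
       echo "OK: TP=${tp} x PP=${pp} uses all ${available} GPUs."
   }

   # The layout used in this guide: TP=8, PP=2 on 2 nodes with 8 GPUs each
   check_parallel_layout 8 2 2 8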

Why Ray is required
~~~~~~~~~~~~~~~~~~~

vLLM’s documentation states that multiprocessing is used by default when
sufficient GPUs are available on the same node; for multi-node
deployments, Ray is required. Ray acts as the cluster coordinator: it
discovers both nodes, allocates GPU resources, and manages distributed
vLLM workers. The flag ``--distributed-executor-backend ray`` instructs
vLLM to use it.

--------------


.. _mn-additional-prerequisites:

3. Additional prerequisites
---------------------------

Shared model storage
~~~~~~~~~~~~~~~~~~~~

``/valhalla/projects/`` is a Lustre parallel filesystem mounted on all
Discoverer+ compute nodes. Store the model weights there once. Both
nodes read from the same path — no per-node download is required.

``/weka`` is also available on Discoverer+ as a faster WEKA parallel
filesystem (273 TB on NVMe). Storing the model on ``/weka`` may reduce
startup time for large checkpoints.
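
Before submitting a long job, it can help to confirm that the checkpoint
directory is visible and readable. The sketch below assumes a Hugging
Face style checkpoint that contains a ``config.json`` file;
``check_model_dir`` is a hypothetical helper name:

.. code:: bash

   #!/bin/bash
   # check_model_dir DIR   (hypothetical helper)
   # Succeeds when DIR exists and contains a readable config.json.
   check_model_dir() {
       local dir="$1"
       if [ ! -d "${dir}" ]; then
           echo "Missing directory: ${dir}" >&2
           return 1
       fi
       if [ ! -r "${dir}/config.json" ]; then
           echo "No readable config.json in ${dir}" >&2
           return 1
       fi
       echo "OK: ${dir}"
   }

   # Example call (path taken from the scripts in this guide):
   #   check_model_dir /valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4

Running the same check on each node of an allocation would show whether
both nodes see the shared path.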

Network configuration for NCCL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For NCCL to use InfiniBand with GPUDirect RDMA between nodes, set the
following environment variables before launching vLLM. These follow the
Discoverer+ documentation convention for inter-node communication:

.. code:: bash

   export NCCL_IB_HCA=mlx5
   export UCX_NET_DEVICES=mlx5_0:1
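
A job script can fail early if these variables are missing rather than
silently falling back to a slower transport. The following is a minimal
sketch; ``require_env`` is a hypothetical helper, not a Slurm or NCCL
command:

.. code:: bash

   #!/bin/bash
   # require_env NAME...   (hypothetical helper)
   # Returns 1 if any of the named environment variables is unset or empty.
   require_env() {
       local name missing=0
       for name in "$@"; do
           if [ -z "${!name:-}" ]; then
               echo "Environment variable ${name} is not set." >&2
               missing=1
           fi
       done
       return "${missing}"
   }

   # Example call, placed just before the server is launched:
   #   require_env NCCL_IB_HCA UCX_NET_DEVICES || exit 1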

Port availability
~~~~~~~~~~~~~~~~~

Ray requires the following ports to be accessible between both nodes:

-  6379 — Ray head port (configurable)
-  8265 — Ray dashboard
-  8076 — Ray object store

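vLLM uses port 8000 by default for the inference server. The serve
script in this guide replaces 6379 with a job-specific Ray port between
6379 and 7378, so that range must be reachable as well. Confirm with
your system administrator that these ports are open between the ``dgx``
nodes on Discoverer+.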
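
Reachability can also be probed from one node to the other while a
service is listening. The sketch below uses bash's built-in ``/dev/tcp``
redirection together with ``timeout`` from coreutils; ``port_open`` is a
hypothetical helper name:

.. code:: bash

   #!/bin/bash
   # port_open HOST PORT   (hypothetical helper)
   # Succeeds if a TCP connection to HOST:PORT can be opened within 2 seconds.
   port_open() {
       local host="$1" port="$2"
       timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
   }

   # Example call from dgx1 while Ray and vLLM are running on dgx2:
   #   for p in 6379 8265 8076 8000; do
   #       port_open dgx2 "${p}" && echo "${p} open" || echo "${p} closed"
   #   done

A port is reported as closed when nothing is listening on it, so a
negative result only indicates a network restriction if the service is
known to be running.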

--------------


.. _mn-installing-ray-in-the-conda-environment:

4. Installing Ray in the Conda environment
------------------------------------------

Ray must be added to the Conda environment you created for your K2.5+
deployment. This is an additional installation on top of that
environment; do not recreate the environment from scratch. Submit the
following as ``install_ray.sh``, setting ``VIRTUAL_ENV`` to the same
path as in your single-node K2.5+ guide:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=install_ray
   #SBATCH --time=00:30:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=2cpu-single-host

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G

   #SBATCH -o install_ray.%j.out
   #SBATCH -e install_ray.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV from your K2.5+ single-node guide (example below uses vllm-kimi-k26)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   pip install "ray[default]"

   if [ $? -ne 0 ]; then
       echo "Ray installation failed." >&2
       exit 1
   fi

   echo "Ray installed."
   echo "Ray version: $(python -c 'import ray; print(ray.__version__)')"
   echo "Install location: $(python -c 'import ray, os; print(os.path.dirname(ray.__file__))')"

Submit:

.. code:: bash

   sbatch install_ray.sh

Verify in the job output that Ray installed into ``/valhalla`` and not
``~/.local``.
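
This check can be scripted. The following is a minimal sketch;
``installed_under`` is a hypothetical helper that tests whether a
reported install path lies below a given prefix:

.. code:: bash

   #!/bin/bash
   # installed_under PATH PREFIX   (hypothetical helper)
   # Succeeds when PATH is located below the directory PREFIX.
   installed_under() {
       local path="$1" prefix="${2%/}"
       case "${path}" in
           "${prefix}"/*) return 0 ;;
           *) return 1 ;;
       esac
   }

   # Example call at the end of install_ray.sh:
   #   RAY_DIR=$(python -c 'import ray, os; print(os.path.dirname(ray.__file__))')
   #   installed_under "${RAY_DIR}" "${VIRTUAL_ENV}" || echo "Ray is outside ${VIRTUAL_ENV}" >&2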

--------------


.. _mn-two-node-deployment:

5. Two-node deployment
----------------------

The following script starts a Ray cluster across both allocated nodes
and then launches vLLM on the head node. Example paths below illustrate
one K2.5+ layout; set ``VIRTUAL_ENV``, ``MODEL_PATH``, and
``EAGLE3_PATH`` to match your single-node K2.5+ guide, and point
``TUNING_DIR`` at the two-node tuning directory from section 6.

Save as ``serve_kimi_2node.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=vllm_kimi_2node
   #SBATCH --time=24:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=1
   #SBATCH --ntasks-per-core=1
   #SBATCH --cpus-per-task=112
   #SBATCH --gres=gpu:8
   #SBATCH --mem=1800G

   #SBATCH -o vllm_kimi_2node.%j.out
   #SBATCH -e vllm_kimi_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # From your K2.5+ single-node guide (example: vllm-kimi-k26)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   # Checkpoint paths from your K2.5+ single-node guide
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   EAGLE3_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-eagle3-mla
   # Two-node tuning directory written by tune_moe_2node.sh (not the single-node one)
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26_2node

   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   # InfiniBand configuration per Discoverer+ documentation
   export NCCL_IB_HCA=mlx5
   export UCX_NET_DEVICES=mlx5_0:1

   # Retrieve the two node hostnames allocated by SLURM
   NODES=($(scontrol show hostnames ${SLURM_JOB_NODELIST}))
   HEAD_NODE=${NODES[0]}
   # hostname -i can print several addresses on one line; keep only the first
   HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w ${HEAD_NODE} hostname -i | awk '{print $1; exit}')
   WORKER_NODE=${NODES[1]}

   # Derive a unique Ray port from the job ID to avoid conflicts with other jobs
   RAY_PORT=$((6379 + SLURM_JOB_ID % 1000))

   echo "Head node:   ${HEAD_NODE} (${HEAD_NODE_IP})"
   echo "Worker node: ${WORKER_NODE}"
   echo "Ray port:    ${RAY_PORT}"

   # Start Ray head on node 0. "ray start --block" stays in the foreground so
   # the job step remains alive; if the step's shell exited, Slurm would end the
   # step and kill Ray with it. srun itself is backgrounded so the script can
   # continue. --overlap (Slurm 20.11+) lets the job steps below share the nodes.
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           ray start --head \
               --node-ip-address=${HEAD_NODE_IP} \
               --port=${RAY_PORT} \
               --num-gpus=8 \
               --block
       " &

   sleep 15

   # Start Ray worker on node 1 in the same way
   srun --overlap --nodes=1 --ntasks=1 -w ${WORKER_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           export NCCL_IB_HCA=mlx5
           ray start \
               --address=${HEAD_NODE_IP}:${RAY_PORT} \
               --num-gpus=8 \
               --block
       " &

   sleep 20

   # Verify Ray cluster sees both nodes before launching vLLM
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           python -c '
   import ray, sys
   ray.init(address=\"${HEAD_NODE_IP}:${RAY_PORT}\")
   nodes = [n for n in ray.nodes() if n[\"Alive\"]]  # skip nodes Ray has marked dead
   print(f\"Ray cluster: {len(nodes)} live node(s)\")
   for n in nodes:
       print(n[\"NodeManagerAddress\"], n[\"Resources\"].get(\"GPU\", 0), \"GPUs\")
   if len(nodes) != 2:
       print(\"ERROR: Expected 2 live nodes in Ray cluster\", file=sys.stderr)
       sys.exit(1)
   '
       "

   if [ $? -ne 0 ]; then
       echo "Ray cluster did not form correctly. Exiting." >&2
       exit 1
   fi

   # Launch vLLM on the head node; Ray distributes workers to the second node automatically
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           export HF_HOME=${HF_HOME}
           export NCCL_IB_HCA=mlx5
           export UCX_NET_DEVICES=mlx5_0:1
           [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

           vllm serve ${MODEL_PATH} \
               --tensor-parallel-size 8 \
               --pipeline-parallel-size 2 \
               --distributed-executor-backend ray \
               --mm-encoder-tp-mode data \
               --gpu-memory-utilization 0.92 \
               --max-model-len 131072 \
               --dtype bfloat16 \
               --kv-cache-dtype fp8 \
               --enable-prefix-caching \
               --enable-chunked-prefill \
               --max-num-batched-tokens 8192 \
               --swap-space 32 \
               --enable-expert-parallel \
               --enable-eplb \
               --eplb-config '{\"window_size\": 1000, \"step_interval\": 1000}' \
               --skip-non-local-expert-weights \
               --tool-call-parser kimi_k2 \
               --reasoning-parser kimi_k2 \
               --enable-auto-tool-choice \
               --speculative-config \"{\\\"model\\\": \\\"${EAGLE3_PATH}\\\", \\\"method\\\": \\\"eagle3\\\", \\\"num_speculative_tokens\\\": 3}\" \
               --trust-remote-code
       "

   # When vllm serve exits, the script ends and Slurm stops the Ray job steps.

Submit:

.. code:: bash

   sbatch serve_kimi_2node.sh

The vLLM server binds to port 8000 on the head node. Retrieve the head
node hostname from the job output and connect your client to
``http://<head_node_hostname>:8000/v1``.
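
Loading a large checkpoint across two nodes takes time, so a client may
need to wait before sending requests. The sketch below polls the
server's ``/health`` endpoint with ``curl``; ``wait_for_server`` and its
default retry values are illustrative assumptions, and the endpoint is
assumed to be available as in recent vLLM versions:

.. code:: bash

   #!/bin/bash
   # wait_for_server BASE_URL [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
   # Polls BASE_URL/health until it answers successfully or attempts run out.
   wait_for_server() {
       local base_url="$1" attempts="${2:-60}" delay="${3:-10}"
       local i
       for ((i = 1; i <= attempts; i++)); do
           if curl --silent --fail --max-time 5 --output /dev/null "${base_url}/health"; then
               echo "Server is up after ${i} attempt(s)."
               return 0
           fi
           sleep "${delay}"
       done
       echo "Server at ${base_url} did not respond after ${attempts} attempt(s)." >&2
       return 1
   }

   # Example call (replace the placeholder with the head node hostname):
   #   wait_for_server "http://<head_node_hostname>:8000" 90 20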

Key flags for multi-node operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--pipeline-parallel-size 2``
   Splits the 61 model layers across 2 nodes. Node 0 processes
   approximately the first 30 layers; node 1 processes the remainder.
   Cross-node communication occurs only at the pipeline stage boundary.

``--distributed-executor-backend ray``
   Required for multi-node deployments. Instructs vLLM to use Ray for
   worker coordination across nodes instead of the single-node
   multiprocessing backend.

``--tensor-parallel-size 8``
   Each node still uses all 8 of its GPUs for intra-node tensor
   parallelism. Combined with ``--pipeline-parallel-size 2``, the total
   GPU count used is 16.
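
The layer split can be illustrated with integer arithmetic. The sketch
below assumes a simple even split in which the last stage takes the
remainder, matching the description above; ``stage_layers`` is a
hypothetical helper, and vLLM's own partitioning may place the extra
layer differently:

.. code:: bash

   #!/bin/bash
   # stage_layers NUM_LAYERS PP_SIZE STAGE   (hypothetical helper)
   # Prints the number of layers on pipeline stage STAGE (0-based) for an
   # even split where the last stage also takes the remainder.
   stage_layers() {
       local num_layers="$1" pp_size="$2" stage="$3"
       local base=$((num_layers / pp_size))
       local remainder=$((num_layers % pp_size))
       if [ "${stage}" -eq $((pp_size - 1)) ]; then
           echo $((base + remainder))
       else
           echo "${base}"
       fi
   }

   for stage in 0 1; do
       echo "stage ${stage}: $(stage_layers 61 2 "${stage}") layers"
   done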

--------------


.. _mn-moe-triton-kernel-tuning-for-two-nodes:

6. MoE Triton kernel tuning for two nodes
-----------------------------------------

The tuned MoE kernel configuration generated for single-node deployment
(TP=8, PP=1) is not valid for two-node deployment (TP=8, PP=2) because
the tensor parallel configuration seen by each node’s workers differs.
Run a separate tuning job for the two-node configuration.

Save as ``tune_moe_2node.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=tune_moe_2node
   #SBATCH --time=02:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G

   #SBATCH -o tune_moe_2node.%j.out
   #SBATCH -e tune_moe_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV and MODEL_PATH from your K2.5+ single-node guide
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26_2node

   mkdir -p ${TUNING_DIR}

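   # benchmark_moe.py lives in the vLLM source tree, not in the installed package;
   # this relative path assumes the job is submitted from a vLLM checkout.
   python benchmarks/kernels/benchmark_moe.py \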
       --model ${MODEL_PATH} \
       --tp-size 8 \
       --dtype bfloat16 \
       --tune \
       --save-dir ${TUNING_DIR}

   echo "Tuning complete. Config written to ${TUNING_DIR}."

Point ``VLLM_TUNED_CONFIG_FOLDER`` to this separate directory when
running the two-node serve job.

--------------


.. _mn-benchmarking-a-two-node-deployment:

7. Benchmarking a two-node deployment
-------------------------------------

The benchmark job connects to the head node’s server over the network.
It does not need to run on the same nodes as the inference server. Save
as ``benchmark_kimi_2node.sh``:

.. code:: bash

   #!/bin/bash

   #SBATCH --partition=common
   #SBATCH --job-name=benchmark_kimi_2node
   #SBATCH --time=01:00:00

   #SBATCH --account=<your_slurm_project_account_name>
   #SBATCH --qos=<your_qos_name>

   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
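   # No GPU is requested: the benchmark is only an HTTP client, and the
   # two-node server job already holds all 16 GPUs on Discoverer+.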
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G

   #SBATCH -o benchmark_kimi_2node.%j.out
   #SBATCH -e benchmark_kimi_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV from your K2.5+ single-node guide
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

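   # Replace the placeholder; the quotes stop bash from reading "<" as a redirection
   SERVER_URL="http://<head_node_hostname>:8000"

   # --model must match the name the running server exposes. Unless the server was
   # started with --served-model-name, that name is the path passed to "vllm serve".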
   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model moonshotai/Kimi-K2.6 \
       --dataset-name hf \
       --dataset-path lmarena-ai/VisionArena-Chat \
       --num-prompts 1000 \
       --request-rate 20 \
       --trust-remote-code

--------------


.. _mn-verifying-infiniband-inter-node-communication:

8. Verifying InfiniBand inter-node communication
------------------------------------------------

Set ``NCCL_DEBUG=INFO`` inside the vLLM serve subshell and inspect the
job output for channel lines of the following form, where the number is
the index of the network device and can vary:

::

   [send] via NET/IB/0/GDRDMA

This confirms NCCL is using InfiniBand with GPUDirect RDMA, which is the
efficient path. If instead you see:

::

   [send] via NET/Socket/0

NCCL has fallen back to TCP sockets, which are too slow for efficient
cross-node parallelism. Check that ``NCCL_IB_HCA=mlx5`` and
``UCX_NET_DEVICES=mlx5_0:1`` are exported inside the vLLM serve
subshell, and ask your system administrator whether GPUDirect RDMA is
enabled between the allocated nodes.
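
The log inspection described above can be automated with ``grep``. The
following is a minimal sketch; ``nccl_transport`` is a hypothetical
helper, and its patterns accept the transport string with or without a
device index:

.. code:: bash

   #!/bin/bash
   # nccl_transport LOGFILE   (hypothetical helper)
   # Prints which inter-node transport the NCCL debug lines in LOGFILE report.
   nccl_transport() {
       local log="$1"
       if grep -Eq 'via NET/IB(/[0-9]+)?/GDRDMA' "${log}"; then
           echo "ib-gdrdma"      # InfiniBand with GPUDirect RDMA
       elif grep -q 'via NET/IB' "${log}"; then
           echo "ib"             # InfiniBand without GPUDirect RDMA
       elif grep -q 'via NET/Socket' "${log}"; then
           echo "socket"         # TCP fallback
       else
           echo "unknown"        # no matching NCCL channel lines found
       fi
   }

   # Example call on a serve job's output file:
   #   nccl_transport vllm_kimi_2node.<job_id>.out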

--------------


.. _mn-known-caveats:

9. Known caveats
----------------

Pipeline parallelism and latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pipeline parallelism increases single-request latency because each token
must traverse both pipeline stages sequentially. For throughput-oriented
batch workloads this cost is acceptable. For interactive workloads,
measure time to first token on representative prompts before committing
to a two-node configuration.

Ray cluster stability
~~~~~~~~~~~~~~~~~~~~~

If either node fails or is preempted by SLURM, the entire Ray cluster
fails without automatic recovery. Design your application to detect
server unavailability and resubmit the SLURM job if needed.

Ray port conflicts
~~~~~~~~~~~~~~~~~~

The serve script derives ``RAY_PORT`` from the SLURM job ID to avoid
conflicts between concurrent jobs. Confirm no other vLLM jobs are
running on overlapping nodes before submitting.
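
The port derivation used by the serve script can be reproduced in
isolation. The sketch below wraps the same arithmetic in a hypothetical
``ray_port_for_job`` function to show the resulting range and where
collisions remain possible:

.. code:: bash

   #!/bin/bash
   # ray_port_for_job JOB_ID   (hypothetical helper)
   # Same formula as RAY_PORT in serve_kimi_2node.sh: 6379 + JOB_ID % 1000.
   ray_port_for_job() {
       local job_id="$1"
       echo $((6379 + job_id % 1000))
   }

   ray_port_for_job 123456   # 6379 + 456
   ray_port_for_job 124456   # same remainder, therefore the same port

Ports always fall between 6379 and 7378, and two job IDs that differ by
a multiple of 1000 map to the same port. Only the port passed to
``--port`` is varied by this formula.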

Separate MoE tuning configs for one and two nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tuned kernel configuration for single-node serving (TP=8, PP=1) must
not be shared with the two-node configuration (TP=8, PP=2). Use the MoE
tuning directory from your K2.5+ single-node guide for one-node jobs,
and a separate directory for two-node jobs (for example append
``_2node`` to that directory name). Point ``VLLM_TUNED_CONFIG_FOLDER``
to the correct directory in each serve script.
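
One way to keep the two directories apart is to derive the path from the
pipeline-parallel size. The following is a minimal sketch of the
``_2node`` naming convention described above; ``tuning_dir_for`` is a
hypothetical helper:

.. code:: bash

   #!/bin/bash
   # tuning_dir_for BASE_DIR PP_SIZE   (hypothetical helper)
   # Prints BASE_DIR for single-node jobs (PP=1) and BASE_DIR_2node for PP=2.
   tuning_dir_for() {
       local base="${1%/}" pp_size="$2"
       case "${pp_size}" in
           1) echo "${base}" ;;
           2) echo "${base}_2node" ;;
           *) echo "Unsupported pipeline-parallel size: ${pp_size}" >&2; return 1 ;;
       esac
   }

   # Example call in a serve script:
   #   export VLLM_TUNED_CONFIG_FOLDER=$(tuning_dir_for /path/to/moe_tuning_k26 2)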

Speculative decoding with pipeline parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Eagle3 speculative decoding combined with ``--pipeline-parallel-size 2``
may not work in every vLLM version. If the server fails to start with
both options active, remove ``--speculative-config`` to confirm that the
baseline configuration starts cleanly, and re-enable it only once your
vLLM version is known to support the combination.

EPLB with pipeline parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EPLB expert load balancer requires all EP ranks to trigger
rearrangement at the same step. Across pipeline stages on different
nodes this synchronisation must be maintained via NCCL. If hangs are
observed during rearrangement, increase ``step_interval`` in
``--eplb-config``.

Shared storage health
~~~~~~~~~~~~~~~~~~~~~

Both nodes read model weights from the same ``/valhalla`` path. Verify
the Lustre mount is healthy on both nodes before submitting. A
filesystem issue on either node will cause vLLM weight loading to fail
on that worker.