Running Kimi K2.5+ across two DGX H200 nodes on Discoverer+
===========================================================

This guide covers the additional configuration required to run Kimi K2.5+ across both DGX H200 nodes available on Discoverer+, using Ray for multi-node coordination and vLLM for inference. Throughout, K2.5+ means Kimi K2.5 and compatible later Kimi releases (for example K2.6), using the Conda environment, weight paths, and flags from the matching single-node guide. It assumes that the single-node setup is already complete before you follow the steps below.

Contents
--------

1. `Discoverer+ hardware constraints <mn-discoverer-hardware-constraints_>`_
2. `Multi-node parallelism concepts <mn-multi-node-parallelism-concepts_>`_
3. `Additional prerequisites <mn-additional-prerequisites_>`_
4. `Installing Ray in the Conda environment <mn-installing-ray-in-the-conda-environment_>`_
5. `Two-node deployment <mn-two-node-deployment_>`_
6. `MoE Triton kernel tuning for two nodes <mn-moe-triton-kernel-tuning-for-two-nodes_>`_
7. `Benchmarking a two-node deployment <mn-benchmarking-a-two-node-deployment_>`_
8. `Verifying InfiniBand inter-node communication <mn-verifying-infiniband-inter-node-communication_>`_
9. `Known caveats <mn-known-caveats_>`_

--------------

.. _mn-discoverer-hardware-constraints:

1. Discoverer+ hardware constraints
-----------------------------------

Discoverer+ has exactly 2 DGX H200 nodes, ``dgx1`` and ``dgx2``, both in the ``common`` partition on the ``disco-plus`` Slurm cluster. Each node has 8 NVIDIA H200 GPUs (141 GB HBM3e each), giving 16 GPUs and 2,256 GB of total VRAM across the entire cluster.

A two-node job using ``--nodes=2`` requests all 16 GPUs, that is, the entire GPU capacity of Discoverer+. The Discoverer+ documentation states: “For multi-node GPU applications, you may need to set ``--nodes=2``, but that is rare for a single job to require the use of 16 NVIDIA H200 GPUs.”

Two nodes is therefore the hard upper limit for a single inference server job, and while such a job runs it monopolises the GPU resources of a shared cluster.

The inter-node network provides 10 × 400 Gbps InfiniBand ConnectX-7 per node, which is the fast path required for efficient multi-node GPU communication via NCCL.

--------------

.. _mn-multi-node-parallelism-concepts:

2. Multi-node parallelism concepts
----------------------------------

Parallelism strategy
~~~~~~~~~~~~~~~~~~~~

For two DGX H200 nodes with 8 GPUs each, the standard configuration documented by vLLM is:

- ``--tensor-parallel-size 8``: all 8 GPUs within each node process each layer in parallel
- ``--pipeline-parallel-size 2``: the model layers are split across the two nodes; node 0 processes approximately the first half of the layers and node 1 processes the remainder

Cross-node communication occurs only at the layer boundary between the two pipeline stages. This makes the configuration less sensitive to inter-node latency than cross-node tensor parallelism, where every layer requires all-reduce communication across all 16 GPUs simultaneously.

Why Ray is required
~~~~~~~~~~~~~~~~~~~

vLLM’s documentation states that multiprocessing is used by default when sufficient GPUs are available on the same node; for multi-node deployments, Ray is required. Ray acts as the cluster coordinator: it discovers both nodes, allocates GPU resources, and manages the distributed vLLM workers. The flag ``--distributed-executor-backend ray`` instructs vLLM to use it.

--------------

.. _mn-additional-prerequisites:

3. Additional prerequisites
---------------------------

Shared model storage
~~~~~~~~~~~~~~~~~~~~

``/valhalla/projects/`` is a Lustre parallel filesystem mounted on all Discoverer+ compute nodes. Store the model weights there once. Both nodes read from the same path, so no per-node download is required.

``/weka`` is also available on Discoverer+ as a faster WEKA parallel filesystem (273 TB on NVMe). Storing the model on ``/weka`` may reduce startup time for large checkpoints.

Network configuration for NCCL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For NCCL to use InfiniBand with GPUDirect RDMA between nodes, set the following environment variables before launching vLLM. These follow the Discoverer+ documentation convention for inter-node communication:

.. code:: bash

   export NCCL_IB_HCA=mlx5
   export UCX_NET_DEVICES=mlx5_0:1

Port availability
~~~~~~~~~~~~~~~~~

Ray requires the following ports to be accessible between both nodes:

- 6379: Ray head port (configurable)
- 8265: Ray dashboard
- 8076: Ray object store

vLLM uses port 8000 by default for the inference server. Confirm with your system administrator that these ports are open between the ``dgx`` nodes on Discoverer+.

--------------

.. _mn-installing-ray-in-the-conda-environment:

4. Installing Ray in the Conda environment
------------------------------------------

Ray must be added to the Conda environment you created for your K2.5+ deployment. This is an additional installation on top of that environment; do not recreate the environment from scratch.

Submit the following as ``install_ray.sh``, setting ``VIRTUAL_ENV`` to the same path as in your single-node K2.5+ guide:

.. code:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=install_ray
   #SBATCH --time=00:30:00
   #SBATCH --account=
   #SBATCH --qos=2cpu-single-host
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=2
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=16G
   #SBATCH -o install_ray.%j.out
   #SBATCH -e install_ray.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV from your K2.5+ single-node guide (example below uses vllm-kimi-k26)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26

   [ -d ${VIRTUAL_ENV} ] || { echo "Environment ${VIRTUAL_ENV} does not exist. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   echo "Using pip at: $(which pip)"

   pip install "ray[default]"

   if [ $? -ne 0 ]; then
       echo "Ray installation failed." >&2
       exit 1
   fi

   echo "Ray installed."
   echo "Ray version: $(python -c 'import ray; print(ray.__version__)')"
   echo "Install location: $(python -c 'import ray, os; print(os.path.dirname(ray.__file__))')"

Submit:

.. code:: bash

   sbatch install_ray.sh

Verify in the job output that Ray installed into ``/valhalla`` and not ``~/.local``.

--------------

.. _mn-two-node-deployment:

5. Two-node deployment
----------------------

The following script starts a Ray cluster across both allocated nodes and then launches vLLM on the head node. The example paths below illustrate one K2.5+ layout; set ``VIRTUAL_ENV``, ``MODEL_PATH``, ``EAGLE3_PATH``, and ``TUNING_DIR`` to match your single-node K2.5+ guide.

Save as ``serve_kimi_2node.sh``:

.. code:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=vllm_kimi_2node
   #SBATCH --time=24:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=1
   #SBATCH --ntasks-per-core=1
   #SBATCH --cpus-per-task=112
   #SBATCH --gres=gpu:8
   #SBATCH --mem=1800G
   #SBATCH -o vllm_kimi_2node.%j.out
   #SBATCH -e vllm_kimi_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # From your K2.5+ single-node guide (example: vllm-kimi-k26)
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   [ -d ${VIRTUAL_ENV} ] || { echo "Conda environment not found. Exiting."; exit 1; }

   export PATH=${VIRTUAL_ENV}/bin:${PATH}
   export HF_HOME=/valhalla/projects/${SLURM_JOB_ACCOUNT}/hf_cache

   # Checkpoint and tuning paths from your K2.5+ single-node guide
   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   EAGLE3_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-eagle3-mla
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26
   [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

   # InfiniBand configuration per Discoverer+ documentation
   export NCCL_IB_HCA=mlx5
   export UCX_NET_DEVICES=mlx5_0:1

   # Retrieve the two node hostnames allocated by SLURM
   NODES=($(scontrol show hostnames ${SLURM_JOB_NODELIST}))
   HEAD_NODE=${NODES[0]}
   HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w ${HEAD_NODE} hostname -i | head -n1)
   WORKER_NODE=${NODES[1]}

   # Derive a unique Ray port from the job ID to avoid conflicts with other jobs
   RAY_PORT=$((6379 + SLURM_JOB_ID % 1000))

   echo "Head node:   ${HEAD_NODE} (${HEAD_NODE_IP})"
   echo "Worker node: ${WORKER_NODE}"
   echo "Ray port:    ${RAY_PORT}"

   # Start Ray head on node 0.
   # --overlap lets the later job steps share this node with the long-running Ray step.
   # The trailing 'wait' keeps the step alive; Ray is stopped when the step exits.
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           ray start --head \
               --node-ip-address=${HEAD_NODE_IP} \
               --port=${RAY_PORT} \
               --num-gpus=8 \
               --block &
           sleep 10
           echo 'Ray head started on ${HEAD_NODE}.'
           wait
       " &

   sleep 15

   # Start Ray worker on node 1 (same pattern as the head step)
   srun --overlap --nodes=1 --ntasks=1 -w ${WORKER_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           export NCCL_IB_HCA=mlx5
           ray start \
               --address=${HEAD_NODE_IP}:${RAY_PORT} \
               --num-gpus=8 \
               --block &
           sleep 10
           echo 'Ray worker started on ${WORKER_NODE}.'
           wait
       " &

   sleep 20

   # Verify Ray cluster sees both nodes before launching vLLM
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           python -c '
   import ray, sys
   ray.init(address=\"${HEAD_NODE_IP}:${RAY_PORT}\")
   nodes = ray.nodes()
   print(f\"Ray cluster: {len(nodes)} node(s)\")
   for n in nodes:
       print(n[\"NodeManagerAddress\"], n[\"Resources\"].get(\"GPU\", 0), \"GPUs\")
   if len(nodes) != 2:
       print(\"ERROR: Expected 2 nodes in Ray cluster\", file=sys.stderr)
       sys.exit(1)
   '
       "

   if [ $? -ne 0 ]; then
       echo "Ray cluster did not form correctly. Exiting." >&2
       exit 1
   fi

   # Launch vLLM on the head node; Ray distributes workers to the second node automatically
   srun --overlap --nodes=1 --ntasks=1 -w ${HEAD_NODE} \
       bash -c "
           export PATH=${VIRTUAL_ENV}/bin:${PATH}
           export HF_HOME=${HF_HOME}
           export NCCL_IB_HCA=mlx5
           export UCX_NET_DEVICES=mlx5_0:1
           [ -d ${TUNING_DIR} ] && export VLLM_TUNED_CONFIG_FOLDER=${TUNING_DIR}

           vllm serve ${MODEL_PATH} \
               --tensor-parallel-size 8 \
               --pipeline-parallel-size 2 \
               --distributed-executor-backend ray \
               --mm-encoder-tp-mode data \
               --gpu-memory-utilization 0.92 \
               --max-model-len 131072 \
               --dtype bfloat16 \
               --kv-cache-dtype fp8 \
               --enable-prefix-caching \
               --enable-chunked-prefill \
               --max-num-batched-tokens 8192 \
               --swap-space 32 \
               --enable-expert-parallel \
               --enable-eplb \
               --eplb-config '{\"window_size\": 1000, \"step_interval\": 1000}' \
               --skip-non-local-expert-weights \
               --tool-call-parser kimi_k2 \
               --reasoning-parser kimi_k2 \
               --enable-auto-tool-choice \
               --speculative-config \"{\\\"model\\\": \\\"${EAGLE3_PATH}\\\", \\\"method\\\": \\\"eagle3\\\", \\\"num_speculative_tokens\\\": 3}\" \
               --trust-remote-code
       "

   wait

Submit:

.. code:: bash

   sbatch serve_kimi_2node.sh

The vLLM server binds to port 8000 on the head node. Retrieve the head node hostname from the job output and connect your client to ``http://:8000/v1``.
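
Retrieving the head node hostname can be scripted. The following is a minimal sketch, assuming the job output still contains the ``Head node: <host> (<ip>)`` line that the serve script echoes; the function names ``head_node_from_log`` and ``base_url_from_log`` are hypothetical helpers introduced here for illustration.

.. code:: bash

   #!/bin/bash
   # head_node_from_log FILE
   # Prints the hostname from the first "Head node: <host> (<ip>)" line in FILE.
   head_node_from_log() {
       sed -n 's/^Head node: *\([^ ]*\) (.*$/\1/p' "$1" | head -n1
   }

   # base_url_from_log FILE
   # Prints the OpenAI-compatible base URL (port 8000, the vLLM default used above).
   # Returns non-zero if FILE has no "Head node:" line.
   base_url_from_log() {
       local host
       host="$(head_node_from_log "$1")"
       if [ -z "${host}" ]; then
           echo "No 'Head node:' line found in $1" >&2
           return 1
       fi
       echo "http://${host}:8000/v1"
   }

For example, with a hypothetical job ID of 12345, ``curl -s "$(base_url_from_log vllm_kimi_2node.12345.out)/models"`` should list the served model once the server has finished loading.
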
Key flags for multi-node operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``--pipeline-parallel-size 2``
   Splits the 61 model layers across 2 nodes. Node 0 processes approximately the first 30 layers; node 1 processes the remainder. Cross-node communication occurs only at the pipeline stage boundary.

``--distributed-executor-backend ray``
   Required for multi-node deployments. Instructs vLLM to use Ray for worker coordination across nodes instead of the single-node multiprocessing backend.

``--tensor-parallel-size 8``
   Each node still uses all 8 of its GPUs for intra-node tensor parallelism. Combined with ``--pipeline-parallel-size 2``, the total GPU count used is 16.

--------------

.. _mn-moe-triton-kernel-tuning-for-two-nodes:

6. MoE Triton kernel tuning for two nodes
-----------------------------------------

The tuned MoE kernel configuration generated for single-node deployment (TP=8, PP=1) is not valid for two-node deployment (TP=8, PP=2) because the tensor parallel configuration seen by each node’s workers differs. Run a separate tuning job for the two-node configuration.

Save as ``tune_moe_2node.sh``:

.. code:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=tune_moe_2node
   #SBATCH --time=02:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:8
   #SBATCH --cpus-per-task=112
   #SBATCH --mem=1800G
   #SBATCH -o tune_moe_2node.%j.out
   #SBATCH -e tune_moe_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV and MODEL_PATH from your K2.5+ single-node guide
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   MODEL_PATH=/valhalla/projects/${SLURM_JOB_ACCOUNT}/models/kimi-k2.6-int4
   TUNING_DIR=/valhalla/projects/${SLURM_JOB_ACCOUNT}/configs/moe_tuning_k26_2node

   mkdir -p ${TUNING_DIR}

   python benchmarks/kernels/benchmark_moe.py \
       --model ${MODEL_PATH} \
       --tp-size 8 \
       --dtype bfloat16 \
       --tune \
       --save-dir ${TUNING_DIR}

   echo "Tuning complete. Config written to ${TUNING_DIR}."

Point ``VLLM_TUNED_CONFIG_FOLDER`` to this separate directory when running the two-node serve job.

--------------

.. _mn-benchmarking-a-two-node-deployment:

7. Benchmarking a two-node deployment
-------------------------------------

The benchmark job connects to the head node’s server over the network. It does not need to run on the same nodes as the inference server.

Save as ``benchmark_kimi_2node.sh``:

.. code:: bash

   #!/bin/bash
   #SBATCH --partition=common
   #SBATCH --job-name=benchmark_kimi_2node
   #SBATCH --time=01:00:00
   #SBATCH --account=
   #SBATCH --qos=
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:1
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=64G
   #SBATCH -o benchmark_kimi_2node.%j.out
   #SBATCH -e benchmark_kimi_2node.%j.err

   cd ${SLURM_SUBMIT_DIR}

   module purge || { echo "Failed to purge modules. Exiting."; exit 1; }
   module load anaconda3 || { echo "Failed to load anaconda3. Exiting."; exit 1; }

   # VIRTUAL_ENV from your K2.5+ single-node guide
   export VIRTUAL_ENV=/valhalla/projects/${SLURM_JOB_ACCOUNT}/virt_envs/vllm-kimi-k26
   export PATH=${VIRTUAL_ENV}/bin:${PATH}

   SERVER_URL=http://:8000

   # Hugging Face model id for your K2.5+ deployment (must match the running server)
   vllm bench serve \
       --base-url ${SERVER_URL} \
       --backend openai-chat \
       --endpoint /v1/chat/completions \
       --model moonshotai/Kimi-K2.6 \
       --dataset-name hf \
       --dataset-path lmarena-ai/VisionArena-Chat \
       --num-prompts 1000 \
       --request-rate 20 \
       --trust-remote-code

--------------

.. _mn-verifying-infiniband-inter-node-communication:

8. Verifying InfiniBand inter-node communication
------------------------------------------------

Set ``NCCL_DEBUG=TRACE`` inside the vLLM serve subshell and inspect the job output for:

::

   [send] via NET/IB/GDRDMA

This confirms NCCL is using InfiniBand with GPUDirect RDMA, which is the efficient path. If instead you see:

::

   [send] via NET/Socket

NCCL has fallen back to raw TCP, which is not efficient for cross-node parallelism. Check that ``NCCL_IB_HCA=mlx5`` and ``UCX_NET_DEVICES=mlx5_0:1`` are exported inside the vLLM serve subshell, and ask your system administrator whether GPUDirect RDMA is enabled between the allocated nodes.

--------------

.. _mn-known-caveats:

9. Known caveats
----------------

Pipeline parallelism and latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pipeline parallelism increases single-request latency because each token must traverse both pipeline stages sequentially. For throughput-oriented batch workloads this cost is acceptable. For interactive workloads, measure time to first token on representative prompts before committing to a two-node configuration.

Ray cluster stability
~~~~~~~~~~~~~~~~~~~~~

If either node fails or is preempted by SLURM, the entire Ray cluster fails without automatic recovery. Design your application to detect server unavailability and resubmit the SLURM job if needed.
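
One way to detect unavailability is a small watchdog that probes the server and resubmits the job after several consecutive failures. The sketch below is illustrative only: the function names, the failure threshold, and the polling interval are hypothetical, and it assumes the server answers ``GET /health`` with HTTP 200 when healthy, which is the behaviour of vLLM's OpenAI-compatible server in the versions we are aware of.

.. code:: bash

   #!/bin/bash
   # decide_action STATUS FAILURES MAX_FAILURES
   # Prints "ok" for HTTP 200, "resubmit" once the consecutive failure count
   # (including this probe) reaches MAX_FAILURES, and "retry" otherwise.
   decide_action() {
       local status="$1" failures="$2" max_failures="$3"
       if [ "${status}" = "200" ]; then
           echo "ok"
       elif [ $((failures + 1)) -ge "${max_failures}" ]; then
           echo "resubmit"
       else
           echo "retry"
       fi
   }

   # probe BASE_URL
   # Prints the HTTP status of BASE_URL/health; curl reports 000 if unreachable.
   probe() {
       curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/health" 2>/dev/null || true
   }

   # watch_server BASE_URL JOB_SCRIPT [MAX_FAILURES] [INTERVAL_SECONDS]
   # Polls until the failure threshold is reached, then resubmits and returns.
   watch_server() {
       local url="$1" job_script="$2" max_failures="${3:-3}" interval="${4:-60}"
       local failures=0
       while true; do
           case "$(decide_action "$(probe "${url}")" "${failures}" "${max_failures}")" in
               ok)       failures=0 ;;
               retry)    failures=$((failures + 1)) ;;
               resubmit) sbatch "${job_script}"; return 0 ;;
           esac
           sleep "${interval}"
       done
   }

A call such as ``watch_server http://dgx1:8000 serve_kimi_2node.sh`` would run from a login node or a lightweight job. Because SLURM may pick a different head node for the resubmitted job, clients should re-read the head node hostname from the new job output.
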
Ray port conflicts
~~~~~~~~~~~~~~~~~~

The serve script derives ``RAY_PORT`` from the SLURM job ID (``6379 + SLURM_JOB_ID % 1000``, a value between 6379 and 7378) to reduce the chance of conflicts between concurrent jobs. Two jobs whose IDs differ by a multiple of 1000 still receive the same port, so confirm that no other vLLM jobs are running on overlapping nodes before submitting.

Separate MoE tuning configs for one and two nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The tuned kernel configuration for single-node serving (TP=8, PP=1) must not be shared with the two-node configuration (TP=8, PP=2). Use the MoE tuning directory from your K2.5+ single-node guide for one-node jobs, and a separate directory for two-node jobs (for example, append ``_2node`` to that directory name). Point ``VLLM_TUNED_CONFIG_FOLDER`` to the correct directory in each serve script.

Speculative decoding with pipeline parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Eagle3 speculative decoding may not work together with ``--pipeline-parallel-size 2`` in some vLLM versions. If the server fails to start with both flags active, remove ``--speculative-config`` first to confirm that the baseline is stable, then re-enable it.

EPLB with pipeline parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EPLB expert load balancer requires all EP ranks to trigger rearrangement at the same step. Across pipeline stages on different nodes, this synchronisation must be maintained via NCCL. If hangs are observed during rearrangement, increase ``step_interval`` in ``--eplb-config``.

Shared storage health
~~~~~~~~~~~~~~~~~~~~~

Both nodes read model weights from the same ``/valhalla`` path. Verify that the Lustre mount is healthy on both nodes before submitting. A filesystem issue on either node will cause vLLM weight loading to fail on that worker.
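
A quick pre-flight check can catch an unhealthy mount before the model starts loading. The sketch below assumes a Hugging Face style checkpoint directory that contains ``config.json``; the function name ``check_model_dir`` is a hypothetical helper, not part of vLLM or the Discoverer+ tooling.

.. code:: bash

   #!/bin/bash
   # check_model_dir DIR
   # Succeeds only if DIR is a readable directory containing a non-empty
   # config.json (assumption: Hugging Face style checkpoint layout).
   check_model_dir() {
       local dir="$1"
       if [ ! -d "${dir}" ] || [ ! -r "${dir}" ]; then
           echo "$(hostname): ${dir} is missing or unreadable" >&2
           return 1
       fi
       if [ ! -s "${dir}/config.json" ]; then
           echo "$(hostname): ${dir}/config.json is missing or empty" >&2
           return 1
       fi
       echo "$(hostname): ${dir} is readable"
   }

   # Possible use inside the serve script, before starting Ray (one task per node):
   #   srun --nodes=2 --ntasks-per-node=1 \
   #       bash -c "$(declare -f check_model_dir); check_model_dir '${MODEL_PATH}'" || exit 1

The check reads only directory metadata and one small file, so it will not detect every Lustre problem, but it fails fast when the mount is absent on either node.
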