Expert Parallel Deployment¶

vLLM 支援 Expert Parallelism (EP)，它允許將 Mixture-of-Experts (MoE) 模型中的專家部署在獨立的 GPU 上，從而提高區域性性、效率和整體吞吐量。

EP 通常與 Data Parallelism (DP) 結合使用。雖然 DP 可以獨立於 EP 使用，但 EP 與 DP 結合使用時效率更高。您可以在此處閱讀更多關於資料並行性的資訊。

先決條件¶

在使用 EP 之前，您需要安裝必要的依賴項。我們正在積極努力使未來的安裝過程更加簡便。

安裝 DeepEP 和 pplx-kernels：根據 vLLM EP 核心指南此處設定主機環境。
安裝 DeepGEMM 庫：請遵循官方說明。
對於分離式服務：透過執行 install_gdrcopy.sh 指令碼安裝 gdrcopy（例如，install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"）。您可以在此處找到可用的作業系統版本。

Backend Selection Guide¶

vLLM 提供多種 EP 通訊後端。使用 --all2all-backend 選擇一個。

Backend	Use Case	特性	Best For
`allgather_reducescatter`	Default backend	Standard all2all using allgather/reducescatter primitives	General purpose, works with any EP+DP configuration
`pplx`	Single node	Chunked prefill support, efficient intra-node communication	Single-node deployments, development
`deepep_high_throughput`	Multi-node prefill	Grouped GEMM with continuous layout, optimized for prefill	Prefill-dominated workloads, high-throughput scenarios
`deepep_low_latency`	Multi-node decode	CUDA graph support, masked layout, optimized for decode	Decode-dominated workloads, low-latency scenarios
`flashinfer_all2allv`	MNNVL systems	FlashInfer alltoallv kernels for multi-node NVLink	Systems with NVLink across nodes
`naive`	Testing/debugging	Simple broadcast-based implementation	Debugging, not recommended for production

Single Node Deployment¶

警告

EP is an experimental feature. Argument names and default values may change in the future.

配置¶

透過設定 --enable-expert-parallel 標誌來啟用 EP。EP 大小會自動計算為

EP_SIZE = TP_SIZE × DP_SIZE

其中

TP_SIZE: Tensor parallel size
DP_SIZE: Data parallel size
EP_SIZE: Expert parallel size (computed automatically)

Layer Behavior with EP Enabled¶

When EP is enabled, different layers in MoE models behave differently

Layer Type	Behavior	Parallelism Used
Expert (MoE) Layers	Sharded across all EP ranks	Expert Parallel (EP) of size `TP × DP`
Attention Layers	Behavior depends on TP size	See below

Attention layer parallelism

When TP = 1: Attention weights are replicated across all DP ranks (data parallelism)
When TP > 1: Attention weights are sharded using tensor parallelism across TP ranks within each DP group

For example, with TP=2, DP=4 (8 GPUs total)

Expert layers form an EP group of size 8, with experts distributed across all GPUs
Attention layers use TP=2 within each of the 4 DP groups

Key Difference from Data Parallel Deployment

Without --enable-expert-parallel, MoE layers would use tensor parallelism (forming a TP group of size TP × DP), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.

Example Command¶

The following command serves a DeepSeek-V3-0324 model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.

# Single node EP deployment with pplx backend
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \       # Tensor parallelism across 1 GPU
    --data-parallel-size 8 \         # Data parallelism across 8 processes
    --enable-expert-parallel \       # Enable expert parallelism
    --all2all-backend pplx           # Use pplx communication backend

Multi-Node Deployment¶

For multi-node deployment, use the DeepEP communication kernel with one of two modes (see Backend Selection Guide above).

部署步驟¶

Run one command per node - Each node requires its own launch command
Configure networking - Ensure proper IP addresses and port configurations
Set node roles - First node handles requests, additional nodes run in headless mode

Example: 2-Node Deployment¶

The following example deploys DeepSeek-V3-0324 across 2 nodes using deepep_low_latency mode

# Node 1 (Primary - handles incoming requests)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \               # TP size per node
    --enable-expert-parallel \               # Enable EP
    --data-parallel-size 16 \                # Total DP size across all nodes
    --data-parallel-size-local 8 \           # Local DP size on this node (8 GPUs per node)
    --data-parallel-address 192.168.1.100 \  # Replace with actual IP of Node 1
    --data-parallel-rpc-port 13345 \         # RPC communication port, can be any port as long as reachable by all nodes
    --api-server-count=8                     # Number of API servers for load handling (scaling this out to # local ranks is recommended)

# Node 2 (Secondary - headless mode, no API server)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \               # TP size per node
    --enable-expert-parallel \               # Enable EP
    --data-parallel-size 16 \                # Total DP size across all nodes
    --data-parallel-size-local 8 \           # Local DP size on this node
    --data-parallel-start-rank 8 \           # Starting rank offset for this node
    --data-parallel-address 192.168.1.100 \  # IP of primary node (Node 1)
    --data-parallel-rpc-port 13345 \         # Same RPC port as primary
    --headless                               # No API server, worker only

Key Configuration Notes¶

Headless mode: Secondary nodes run with --headless flag, meaning all client requests are handled by the primary node
Rank calculation: --data-parallel-start-rank should equal the cumulative local DP size of previous nodes
Load scaling: Adjust --api-server-count on the primary node to handle higher request loads

Network Configuration¶

InfiniBand Clusters

On InfiniBand networked clusters, set this environment variable to prevent initialization hangs

export GLOO_SOCKET_IFNAME=eth0

This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup.

Expert Parallel Load Balancer (EPLB)¶

While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts.

Configuration¶

Enable EPLB with the --enable-eplb flag.

When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.

EPLB Parameters¶

Configure EPLB with the --eplb-config argument, which accepts a JSON string. The available keys and their descriptions are

引數	描述	預設值
`window_size`	Number of engine steps to track for rebalancing decisions	1000
`step_interval`	Frequency of rebalancing (every N engine steps)	3000
`log_balancedness`	Log balancedness metrics (avg tokens per expert ÷ max tokens per expert)	`false`
`num_redundant_experts`	Additional global experts per EP rank beyond equal distribution	`0`
`use_async`	Use non-blocking EPLB for reduced latency overhead	`false`
`policy`	The policy type for expert parallel load balancing	`"default"`

例如

vllm serve Qwen/Qwen3-30B-A3B \
  --enable-eplb \
  --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'

Prefer individual arguments instead of JSON?

vllm serve Qwen/Qwen3-30B-A3B \
        --enable-eplb \
        --eplb-config.window_size 1000 \
        --eplb-config.step_interval 3000 \
        --eplb-config.num_redundant_experts 2 \
        --eplb-config.log_balancedness true

Expert Distribution Formula¶

Default: Each EP rank has NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS experts
With redundancy: Each EP rank has (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS experts

Memory Footprint Overhead¶

EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory constrained environments or when KV cache space is at a premium.

This overhead equals NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS. For DeepSeekV3, this is approximately 2.4 GB for one redundant expert per EP rank.

Example Command¶

Single node deployment with EPLB enabled

# Single node with EPLB load balancing
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \       # Tensor parallelism
    --data-parallel-size 8 \         # Data parallelism
    --enable-expert-parallel \       # Enable EP
    --all2all-backend pplx \         # Use pplx communication backend
    --enable-eplb \                  # Enable load balancer
    --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'

For multi-node deployment, add these EPLB flags to each node's command. We recommend setting --eplb-config '{"num_redundant_experts":32}' to 32 in large scale use cases so the most popular experts are always available.

Advanced Configuration¶

Performance Optimization¶

DeepEP kernels: The high_throughput and low_latency kernels are optimized for disaggregated serving and may show poor performance for mixed workloads
Dual Batch Overlap: Use --enable-dbo to overlap all-to-all communication with compute. See Dual Batch Overlap for more details.
Async scheduling (experimental): Try --async-scheduling to overlap scheduling with model execution.

故障排除¶

non-zero status: 7 cannot register cq buf: When using Infiniband/RoCE, make sure host VM and pods show ulimit -l "unlimited".
init failed for transport: IBGDA: The InfiniBand GDA kernel modules are missing. Run tools/ep_kernels/configure_system_drivers.sh on each GPU node and reboot. Also fixes error NVSHMEM API called before NVSHMEM initialization has completed.
NVSHMEM peer disconnect: Usually a networking misconfiguration. If deploying via Kubernetes, verify that every pod runs with hostNetwork: true, securityContext.privileged: true to access Infiniband.

基準測試¶

Use simulator flags VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random and VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 so token routing is balanced across EP ranks.
Increasing VLLM_MOE_DP_CHUNK_SIZE may increase throughput by increasing the maximum batch size for inter-rank token transfers. This may cause DeepEP to throw assert self.nvshmem_qp_depth >= (num_max_dispatch_tokens_per_rank + 1) * 2, which can be fixed by increasing environment variable NVSHMEM_QP_DEPTH.

Disaggregated Serving (Prefill/Decode Split)¶

For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.

架構概述¶

Prefill Instance: Uses deepep_high_throughput backend for optimal prefill performance
Decode Instance: Uses deepep_low_latency backend for minimal decode latency
KV Cache Transfer: Connects instances via NIXL or other KV connectors

Setup Steps¶

Install gdrcopy/ucx/nixl: For maximum performance, run the install_gdrcopy.sh 指令碼安裝 gdrcopy（例如，install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"）。您可以在此處找到可用的作業系統版本。如果未安裝 gdrcopy，使用純 pip install nixl 也能正常工作，但效能較低。nixl 和 ucx 作為依賴項透過 pip 安裝。對於非 CUDA 平臺，要使用非 CUDA UCX 構建安裝 nixl，請執行 install_nixl_from_source_ubuntu.py 指令碼。
Configure Both Instances: Add this flag to both prefill and decode instances --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}. Noted, you may also specify one or multiple NIXL_Backend. Such as: --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
Client Orchestration: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions.

Client Orchestration Example¶

from openai import OpenAI
import uuid

try:
    # 1: Set up clients for prefill and decode instances
    openai_api_key = "EMPTY"  # vLLM doesn't require a real API key

    # Replace these IP addresses with your actual instance addresses
    prefill_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.100:8000/v1",  # Prefill instance URL
    )
    decode_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.101:8001/v1",  # Decode instance URL  
    )

    # Get model name from prefill instance
    models = prefill_client.models.list()
    model = models.data[0].id
    print(f"Using model: {model}")

    # 2: Prefill Phase
    # Generate unique request ID to link prefill and decode operations
    request_id = str(uuid.uuid4())
    print(f"Request ID: {request_id}")

    prefill_response = prefill_client.completions.create(
        model=model,
        # Prompt must exceed vLLM's block size (16 tokens) for PD to work
        prompt="Write a detailed explanation of Paged Attention for Transformers works including the management of KV cache for multi-turn conversations",
        max_tokens=1,  # Force prefill-only operation
        extra_body={
            "kv_transfer_params": {
                "do_remote_decode": True,     # Enable remote decode
                "do_remote_prefill": False,   # This is the prefill instance
                "remote_engine_id": None,     # Will be populated by vLLM
                "remote_block_ids": None,     # Will be populated by vLLM
                "remote_host": None,          # Will be populated by vLLM
                "remote_port": None,          # Will be populated by vLLM
            }
        },
        extra_headers={"X-Request-Id": request_id},
    )

    print("-" * 50)
    print("✓ Prefill completed successfully")
    print(f"Prefill response: {prefill_response.choices[0].text}")

    # 3: Decode Phase
    # Transfer KV cache parameters from prefill to decode instance
    decode_response = decode_client.completions.create(
        model=model,
        prompt="This prompt is ignored during decode",  # Original prompt not needed
        max_tokens=150,  # Generate up to 150 tokens
        extra_body={
            "kv_transfer_params": prefill_response.kv_transfer_params  # Pass KV cache info
        },
        extra_headers={"X-Request-Id": request_id},  # Same request ID
    )

    print("-" * 50)
    print("✓ Decode completed successfully")
    print(f"Final response: {decode_response.choices[0].text}")

except Exception as e:
    print(f"❌ Error during disaggregated serving: {e}")
    print("Check that both prefill and decode instances are running and accessible")

Benchmarking¶

To simulate the decode deployment of disaggregated serving, pass --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}' to the vllm serve invocation. The connector populates KV cache with random values so decode can be profiled in isolation.
CUDAGraph capture: Use --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' to enable CUDA graph capture for decode only and save KV cache.