Fused MoE Kernel Features

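This document gives an overview of the various MoE kernels, both modular and non-modular, to make it easier to select an appropriate set of kernels for a particular situation. It also covers the all2all backends used by modular kernels.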

Fused MoE Modular All2All backends

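Several all2all communication backends are used to implement expert parallelism (EP) for the `FusedMoE` layer. The `FusedMoEPrepareAndFinalize` subclasses provide an interface to each all2all backend.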

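The following table describes the relevant features of each backend: activation format, supported quantization schemes, and async support.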

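The output activation format (standard or batched) is the format produced by the prepare step of the `FusedMoEPrepareAndFinalize` subclass, and the finalize step requires that same format. The prepare method of every backend expects activations in standard format, and every finalize method returns activations in standard format. For more details on the formats, see the Fused MoE Modular Kernel document.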
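The difference between the two formats can be sketched with plain Python lists. This is only an illustration, assuming that standard format is a contiguous `(num_tokens, hidden)` array and that batched format is a padded `(num_experts, max_tokens, hidden)` array, as described in the Fused MoE Modular Kernel document. `to_batched` and its arguments are hypothetical names, not vLLM functions.

```python
def to_batched(tokens, expert_ids, num_experts, max_tokens):
    """Group token rows by expert and pad each expert's slab to max_tokens rows."""
    hidden = len(tokens[0])
    # One zero-padded (max_tokens, hidden) slab per expert.
    batched = [[[0.0] * hidden for _ in range(max_tokens)]
               for _ in range(num_experts)]
    counts = [0] * num_experts  # number of valid rows per expert
    for row, expert in zip(tokens, expert_ids):
        slot = counts[expert]
        if slot >= max_tokens:
            raise ValueError(f"expert {expert} received more than {max_tokens} tokens")
        batched[expert][slot] = list(row)
        counts[expert] += 1
    return batched, counts


# Standard format: 3 tokens with hidden size 2, each routed to one expert.
standard = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batched, counts = to_batched(standard, expert_ids=[1, 0, 1],
                             num_experts=2, max_tokens=2)
```

Here `counts` records how many rows of each expert's slab hold real tokens; the remaining rows are padding.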

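The quantization types and formats columns list the quantization schemes supported by each `FusedMoEPrepareAndFinalize` class. Quantization can happen before or after the dispatch, depending on which formats the all2all backend supports. For example, deepep_high_throughput only supports dispatching block-quantized fp8; any other format is dispatched in higher precision and quantized afterwards. The output of the prepare step of every backend is the quantized type. The finalize step generally requires the same input type as the original activations. For example, if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, prepare returns fp8 activations with per-tensor scales and finalize takes bfloat16 activations. See the diagrams in the Fused MoE Modular Kernel document for more details on the types and formats of the activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16.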

Async backends support DBO (Dual Batch Overlap) and shared expert overlap, where the shared experts are computed during the combine step.

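Certain models, e.g. Llama, require the topk weights to be applied to the input activations rather than the output activations when topk==1. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to handle this flag.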
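The two placements of the router weight can be illustrated with a toy example. This is a minimal sketch in plain Python, not vLLM code: `moe_topk1`, `apply_weight_on_input`, and the squaring "expert" are hypothetical stand-ins chosen to show that the two placements give different results once the expert is non-linear.

```python
def moe_topk1(x, weight, expert_fn, apply_weight_on_input):
    """Route one token to a single expert (topk == 1) and apply its router weight."""
    if apply_weight_on_input:
        # Scale the input activations, then run the expert.
        return expert_fn([weight * v for v in x])
    # Run the expert, then scale the output activations.
    return [weight * v for v in expert_fn(x)]


def square_expert(xs):
    """Stand-in for a non-linear expert."""
    return [v * v for v in xs]


on_input = moe_topk1([2.0, 3.0], 0.5, square_expert, apply_weight_on_input=True)
on_output = moe_topk1([2.0, 3.0], 0.5, square_expert, apply_weight_on_input=False)
```

Because the results differ, the placement has to match what the model was trained with, which is why it appears as a per-backend and per-kernel feature in the tables.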

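Unless otherwise specified, backends are selected with the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `flashinfer` can work with EP, or with DP without EP.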
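As a rough sketch of how a backend might be selected, the helper below assembles a `vllm serve` command line. The backend names are the ones from the table below that are chosen through `--all2all-backend`; the helper `serve_command`, the model name, and the choice of `--enable-expert-parallel` with `--data-parallel-size` to obtain EP+DP are assumptions for illustration and should be checked against the installed vLLM version.

```python
import shlex

# Backend names from the table in this document that are selected
# via --all2all-backend (flashinfer is controlled by an env var instead).
CLI_BACKENDS = (
    "naive",
    "pplx",
    "deepep_high_throughput",
    "deepep_low_latency",
    "flashinfer_all2allv",
)


def serve_command(model, backend, data_parallel_size=2):
    """Build a shell command that serves `model` with EP+DP and the given backend."""
    if backend not in CLI_BACKENDS:
        raise ValueError(f"{backend!r} is not selected via --all2all-backend")
    args = [
        "vllm", "serve", model,
        "--enable-expert-parallel",                      # assumed flag for EP
        "--data-parallel-size", str(data_parallel_size),  # assumed flag for DP
        "--all2all-backend", backend,
    ]
    return shlex.join(args)


# "my-org/my-moe-model" is a placeholder, not a real checkpoint.
print(serve_command("my-org/my-moe-model", "deepep_low_latency"))
```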

Backend	Output act. format	Quant. types	Quant. format	Async	Apply Weight On Input	Subclass
naive	standard	all¹	G,A,T	N	⁶	layer.py
pplx	batched	fp8,int8	G,A,T	Y	Y	`PplxPrepareAndFinalize`
deepep_high_throughput	standard	fp8	G(128),A,T²	Y	Y	`DeepEPHTPrepareAndFinalize`
deepep_low_latency	batched	fp8	G(128),A,T³	Y	Y	`DeepEPLLPrepareAndFinalize`
flashinfer_all2allv	standard	nvfp4,fp8	G,A,T	N	N	`FlashInferAllToAllMoEPrepareAndFinalize`
flashinfer⁴	standard	nvfp4,fp8	G,A,T	N	N	`FlashInferCutlassMoEPrepareAndFinalize`
MoEPrepareAndFinalizeNoEP⁵	standard	fp8,int8	G,A,T	N	Y	`MoEPrepareAndFinalizeNoEP`
BatchedPrepareAndFinalize⁵	batched	fp8,int8	G,A,T	N	Y	`BatchedPrepareAndFinalize`

Table key

1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A,T quantization occurs after dispatch.
3. All quantization happens after dispatch.
4. Controlled by different env vars (VLLM_FLASHINFER_MOE_BACKEND "throughput" or "latency")
5. This is a no-op dispatcher that can be paired with any modular experts to produce a modular kernel that runs without dispatch or combine. These cannot be selected via environment variable. They are generally used for testing or for adapting an experts subclass to the fused_experts API.
6. This depends on the experts implementation.

G - Grouped
G(N) - Grouped w/block size N
A - Per activation token
T - Per tensor
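To make the granularity legend concrete, the sketch below computes the scales each format would produce for a small activation matrix. It is an illustration in plain Python rather than vLLM code. It assumes fp8 means float8_e4m3fn (largest finite value 448) and that groups run along the hidden dimension of each token; all function names are hypothetical.

```python
FP8_MAX = 448.0  # largest finite value of float8_e4m3fn


def _scale(values):
    """Scale that maps the largest magnitude onto FP8_MAX (1.0 for all-zero input)."""
    amax = max(abs(v) for v in values)
    return amax / FP8_MAX if amax > 0 else 1.0


def per_tensor_scale(x):
    """T: a single scale for the whole tensor."""
    return _scale([v for row in x for v in row])


def per_token_scales(x):
    """A: one scale per activation token (row)."""
    return [_scale(row) for row in x]


def grouped_scales(x, block):
    """G(N): one scale per contiguous block of N elements within each row."""
    return [[_scale(row[i:i + block]) for i in range(0, len(row), block)]
            for row in x]


x = [[448.0, -2.0, 0.5, 1.0],
     [4.48, 0.0, -896.0, 8.96]]

t_scale = per_tensor_scale(x)     # 1 scale
a_scales = per_token_scales(x)    # 2 scales, one per row
g_scales = grouped_scales(x, 2)   # 2 x 2 scales, one per block of 2 elements
```

Finer granularity keeps small values from being dominated by a single outlier, at the cost of storing and communicating more scales.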

Modular kernels are supported by the following FusedMoEMethodBase classes.

Fused Experts Kernels

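There are a variety of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton `fused_experts` function. Many have modular kernel adapters so they can be used with compatible all2all backends. The table below lists each experts kernel and its particular properties.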

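Each kernel must be given input activations in one of its supported formats. Some kernel flavors support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed to match certain all2all backends, e.g. `pplx` and `DeepEPLLPrepareAndFinalize`.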

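As with the backend kernels, each experts kernel only supports certain quantization formats. Non-modular experts receive activations in the original type and quantize them internally. Modular experts expect the activations to already be in the quantized format. Both kinds of experts produce outputs in the original activation type.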

Each experts kernel supports one or more activation functions, e.g. silu or gelu, which are applied to the intermediate results.

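As with the backends, some experts support applying the topk weights to the input activations. The entries in that column of this table only apply to the non-modular experts.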

Most experts flavors include an equivalent modular interface, which is a subclass of `FusedMoEPermuteExpertsUnpermute`.

To be used with a particular `FusedMoEPrepareAndFinalize` subclass, an MoE kernel must have a compatible activation format, quantization type, and quantization format.
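The compatibility rule can be sketched as a set-intersection check over a few rows transcribed from the two tables in this document. This is a simplified, hypothetical model: the `Caps` type and `compatible` function are not part of vLLM, group block sizes and the unquantized float16/bfloat16 case are not modeled, and passing the check does not mean a combination has been tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caps:
    act_formats: frozenset    # "standard" and/or "batched"
    quant_types: frozenset    # e.g. {"fp8", "int8"}
    quant_formats: frozenset  # subset of {"G", "A", "T"}


# Rows transcribed from the backend table (output activation format).
BACKENDS = {
    "pplx": Caps(frozenset({"batched"}), frozenset({"fp8", "int8"}), frozenset("GAT")),
    "deepep_high_throughput": Caps(frozenset({"standard"}), frozenset({"fp8"}), frozenset("GAT")),
}

# Rows transcribed from the experts table (input activation format).
EXPERTS = {
    "cutlass_fp8": Caps(frozenset({"standard", "batched"}), frozenset({"fp8"}), frozenset("AT")),
    "flashinfer": Caps(frozenset({"standard"}), frozenset({"nvfp4", "fp8"}), frozenset("T")),
}


def compatible(backend, experts):
    """True if the pair shares an activation format, a quant type, and a quant format."""
    b, e = BACKENDS[backend], EXPERTS[experts]
    return bool(
        b.act_formats & e.act_formats
        and b.quant_types & e.quant_types
        and b.quant_formats & e.quant_formats
    )


print(compatible("pplx", "cutlass_fp8"))  # batched + fp8 + A/T overlap
print(compatible("pplx", "flashinfer"))   # no shared activation format
```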

Kernel	Input act. format	Quant. types	Quant. format	Activation function	Apply Weight On Input	Modular	Source
triton	standard	all¹	G,A,T	silu, gelu, swigluoai, silu_no_mul, gelu_no_mul	Y	Y	`fused_experts`, `TritonExperts`
triton (batched)	batched	all¹	G,A,T	silu, gelu	⁶	Y	`BatchedTritonExperts`
deep gemm	standard, batched	fp8	G(128),A,T	silu, gelu	⁶	Y	`deep_gemm_moe_fp8`, `DeepGemmExperts`, `BatchedDeepGemmExperts`
cutlass_fp4	standard, batched	nvfp4	A,T	silu	Y	Y	`cutlass_moe_fp4`, `CutlassExpertsFp4`
cutlass_fp8	standard, batched	fp8	A,T	silu, gelu	Y	Y	`cutlass_moe_fp8`, `CutlassExpertsFp8`, `CutlassBatchedExpertsFp8`
flashinfer	standard	nvfp4, fp8	T	⁵	N	Y	`flashinfer_cutlass_moe_fp4`, `FlashInferExperts`
gpt oss triton	standard	N/A	N/A	⁵	Y	Y	`triton_kernel_fused_experts`, `OAITritonExperts`
marlin	standard, batched	³ / N/A	³ / N/A	silu, swigluoai	Y	Y	`fused_marlin_moe`, `MarlinExperts`, `BatchedMarlinExperts`
trtllm	standard	mxfp4, nvfp4	G(16),G(32)	⁵	N	Y	`TrtLlmGenExperts`
pallas	standard	N/A	N/A	silu	N	N	`fused_moe`
iterative	standard	N/A	N/A	silu	N	N	`fused_moe`
rocm aiter moe	standard	fp8	G(128),A,T	silu, gelu	Y	N	`rocm_aiter_fused_experts`
cpu_fused_moe	standard	N/A	N/A	silu	N	N	`CPUFusedMOE`
naive batched⁴	batched	int8, fp8	G,A,T	silu, gelu	⁶	Y	`NaiveBatchedExperts`

Table key

1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A dispatcher wrapper around triton and deep gemm experts. Will select based on type + shape + quantization params
3. uint4, uint8, fp8, fp4
4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
5. The activation parameter is ignored and SwiGlu is used by default instead.
6. Only handled by or supported when used with modular kernels.

Modular Kernel "families"

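The following table shows "families" of modular kernels that are intended to work together. Some combinations may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend works with any non-modular experts.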

backend	`FusedMoEPrepareAndFinalize` subclasses	`FusedMoEPermuteExpertsUnpermute` subclasses
deepep_high_throughput	`DeepEPHTPrepareAndFinalize`	`DeepGemmExperts`, `TritonExperts`, `TritonOrDeepGemmExperts`, `CutlassExpertsFp8`, `MarlinExperts`
deepep_low_latency, pplx	`DeepEPLLPrepareAndFinalize`, `PplxPrepareAndFinalize`	`BatchedDeepGemmExperts`, `BatchedTritonExperts`, `CutlassBatchedExpertsFp8`, `BatchedMarlinExperts`
flashinfer	`FlashInferCutlassMoEPrepareAndFinalize`	`FlashInferExperts`