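Supported Models

vLLM supports generative and pooling models across various tasks. If a model supports more than one task, you can set the task via the --task argument.

For each task, we list the model architectures that have been implemented in vLLM. Alongside each architecture, we include some popular models that use it.

Model Implementation

vLLM

If vLLM natively supports a model, its implementation can be found in vllm/model_executor/models.

These are the models listed under supported-text-models and supported-mm-models.

Transformers

vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models and common vision language models are supported. Vision language models currently accept only image inputs; support for video inputs will be added in future releases.

To check whether the modeling backend is Transformers, you can simply do this: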

from vllm import LLM
llm = LLM(model=..., task="generate")  # Name or path of your model
llm.apply_model(lambda model: print(type(model)))

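If it is TransformersForCausalLM or TransformersForMultimodalLM, then it is based on Transformers.

Tip

You can force the use of TransformersForCausalLM by setting model_impl="transformers" for offline inference or --model-impl transformers for the OpenAI-compatible server.

Note

vLLM may not fully optimize the Transformers implementation, so you may see degraded performance when comparing a native model to a Transformers model in vLLM.

Note

For vision language models, if you load with dtype="auto", vLLM loads the whole model using the dtype from the config, if it exists. In contrast, native Transformers respects the dtype attribute of each backbone in the model. This might cause a slight difference in performance.

Custom models

If a model is supported natively by neither vLLM nor Transformers, it can still be used in vLLM!

For a model to be compatible with the Transformers backend for vLLM, it must:

Be a Transformers-compatible custom model (see Transformers - Customizing models):
- The model directory must have the correct structure (e.g. config.json is present).
- config.json must contain auto_map.AutoModel.
Be a model compatible with the Transformers backend for vLLM (see Writing custom models):
- Customisation should be done in the base model (e.g. in MyModel, not MyModelForCausalLM).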
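
The two directory-level requirements above can be checked before handing a model to vLLM. The following is a minimal sketch using only the standard library; the helper name check_custom_model_dir is hypothetical and is not part of vLLM or Transformers, and it covers only the config.json checks listed above, not whether the modeling code itself is compatible.

```python
import json
import tempfile
from pathlib import Path


def check_custom_model_dir(model_dir):
    """Return a list of problems; an empty list means the basic checks pass."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return ["config.json is missing"]
    config = json.loads(config_path.read_text(encoding="utf-8"))
    auto_map = config.get("auto_map")
    problems = []
    if not isinstance(auto_map, dict) or "AutoModel" not in auto_map:
        problems.append("config.json has no auto_map.AutoModel entry")
    return problems


# Usage with a throwaway directory standing in for a real model directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text(
        json.dumps({"auto_map": {"AutoModel": "modeling_my_model.MyModel"}}),
        encoding="utf-8",
    )
    print(check_custom_model_dir(tmp))  # []
```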

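If the compatible model is:

On the Hugging Face Model Hub, simply set trust_remote_code=True for offline inference or --trust-remote-code for the OpenAI-compatible server.
In a local directory, simply pass the directory path as model=<MODEL_DIR> for offline inference or run vllm serve <MODEL_DIR> for the OpenAI-compatible server.

This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!

Writing custom models

This section details the modifications needed to make a Transformers-compatible custom model compatible with the Transformers backend for vLLM. (We assume that a Transformers-compatible custom model has already been created; see Transformers - Customizing models.)

To make your model compatible with the Transformers backend, it needs:

kwargs passed down through all modules from MyModel to MyAttention.
MyAttention to use ALL_ATTENTION_FUNCTIONS to call attention.
MyModel to contain _supports_attention_backend = True.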

modeling_my_model.py

from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
from torch import nn

class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):
        ...
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...

class MyModel(PreTrainedModel):
    _supports_attention_backend = True

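Here is what happens in the background when this model is loaded:

The config is loaded.
The MyModel Python class is loaded from the auto_map in the config, and the model is checked to be is_backend_compatible().
MyModel is loaded into TransformersForCausalLM or TransformersForMultimodalLM (see vllm/model_executor/models/transformers.py), which sets self.config._attn_implementation = "vllm" so that vLLM's attention layer is used.

That's it!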
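
The dispatch described above can be pictured with a toy registry. This is a minimal sketch in plain Python, not the Transformers or vLLM implementation: TOY_ATTENTION_FUNCTIONS, ToyConfig, ToyAttention, and ToyModel are hypothetical stand-ins, and the attention functions only return strings. It shows why kwargs must reach the attention module and how switching _attn_implementation changes which function runs.

```python
def eager_attention(module, query, key, value, **kwargs):
    # Stand-in for a default attention implementation.
    return f"eager(q={query}, extra={sorted(kwargs)})"


def backend_attention(module, query, key, value, **kwargs):
    # Stand-in for an attention layer supplied by the serving backend.
    return f"backend(q={query}, extra={sorted(kwargs)})"


# Toy registry keyed by implementation name (hypothetical, for illustration).
TOY_ATTENTION_FUNCTIONS = {"eager": eager_attention, "vllm": backend_attention}


class ToyConfig:
    def __init__(self, attn_implementation="eager"):
        self._attn_implementation = attn_implementation


class ToyAttention:
    def __init__(self, config):
        self.config = config

    def forward(self, hidden_states, **kwargs):
        # Look the function up at call time, so a later config change takes effect.
        attention_interface = TOY_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_interface(self, hidden_states, hidden_states, hidden_states, **kwargs)


class ToyModel:
    def __init__(self, config):
        self.attn = ToyAttention(config)

    def forward(self, hidden_states, **kwargs):
        # kwargs are forwarded untouched all the way down to the attention module.
        return self.attn.forward(hidden_states, **kwargs)


config = ToyConfig()
model = ToyModel(config)
print(model.forward("h"))
config._attn_implementation = "vllm"  # mirrors the step described above
print(model.forward("h", attn_metadata="m"))
```

If ToyModel.forward dropped its kwargs, the backend function would never see attn_metadata, which is the failure mode the first requirement guards against.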

For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add base_model_tp_plan and/or base_model_pp_plan to your model's config class:

configuration_my_model.py

from transformers import PretrainedConfig

class MyConfig(PretrainedConfig):
    base_model_tp_plan = {
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

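base_model_tp_plan is a dict that maps fully qualified layer name patterns to tensor parallel styles (currently only "colwise" and "rowwise" are supported).
base_model_pp_plan is a dict that maps direct child layer names to a tuple of lists of strs:
- You only need to do this for layers that are not present on all pipeline stages
- vLLM assumes that there is only one nn.ModuleList, which is distributed across the pipeline stages
- The list in the first element of the tuple contains the names of the input arguments
- The list in the last element of the tuple contains the names of the variables the layer outputs to in your modeling code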
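
To make the two plan formats concrete, here is a minimal sketch that looks up the tensor parallel style for a layer name and checks the shape of a pipeline plan. It assumes that `*` in a pattern stands for a numeric layer index; that interpretation, and the helper names tp_style_for and pp_plan_is_well_formed, are illustrative assumptions rather than vLLM's actual matching code.

```python
import re

base_model_tp_plan = {
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.down_proj": "rowwise",
}
base_model_pp_plan = {
    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
    "norm": (["hidden_states"], ["hidden_states"]),
}


def tp_style_for(layer_name, plan):
    """Return the style of the first pattern matching layer_name, else None."""
    for pattern, style in plan.items():
        # Escape the pattern, then let '*' match a numeric layer index (assumption).
        regex = re.escape(pattern).replace(r"\*", r"\d+")
        if re.fullmatch(regex, layer_name):
            return style
    return None


def pp_plan_is_well_formed(plan):
    """Check that every value is a (list of str, list of str) tuple."""
    for value in plan.values():
        if not (isinstance(value, tuple) and len(value) == 2):
            return False
        for names in value:
            if not (isinstance(names, list) and all(isinstance(n, str) for n in names)):
                return False
    return True


print(tp_style_for("layers.3.self_attn.k_proj", base_model_tp_plan))  # colwise
print(pp_plan_is_well_formed(base_model_pp_plan))  # True
```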

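Loading a Model

Hugging Face Hub

By default, vLLM loads models from the Hugging Face (HF) Hub. To change the download path for models, you can set the HF_HOME environment variable; for more details, refer to their official documentation.

To determine whether a given model is natively supported, you can check the config.json file inside the HF repository. If the "architectures" field contains a model architecture listed below, then it should be natively supported.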
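
The config.json check described above is easy to script. This is a minimal sketch, assuming you have the file contents as a string; NATIVE_ARCHITECTURES is a small hand-copied sample from the tables on this page, not an exhaustive or authoritative list, and natively_supported is a hypothetical helper name.

```python
import json

# Small sample copied from the tables on this page (not exhaustive).
NATIVE_ARCHITECTURES = {"LlamaForCausalLM", "Qwen2ForCausalLM", "MistralForCausalLM"}


def natively_supported(config_json_text, known=NATIVE_ARCHITECTURES):
    """Return True if any entry of the "architectures" field is in `known`."""
    config = json.loads(config_json_text)
    return any(arch in known for arch in config.get("architectures", []))


print(natively_supported('{"architectures": ["LlamaForCausalLM"]}'))  # True
print(natively_supported('{"architectures": ["MyModelForCausalLM"]}'))  # False
```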

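Models do not need to be natively supported to be used in vLLM. The Transformers backend enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).

Tip

The easiest way to check whether your model is really supported at runtime is to run the program below: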

from vllm import LLM

# For generative models (task=generate) only
llm = LLM(model=..., task="generate")  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)

# For pooling models (task={embed,classify,reward,score}) only
llm = LLM(model=..., task="embed")  # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)

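If vLLM successfully returns text (for generative models) or hidden states (for pooling models), your model is supported.

Otherwise, please refer to Adding a New Model for instructions on how to implement your model in vLLM. Alternatively, you can open an issue on GitHub to request vLLM support.

Download a model

If you prefer, you can use the Hugging Face CLI to download a model or specific files from a model repository: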

# Download a model
huggingface-cli download HuggingFaceH4/zephyr-7b-beta

# Specify a custom cache directory
huggingface-cli download HuggingFaceH4/zephyr-7b-beta --cache-dir ./path/to/cache

# Download a specific file from a model repo
huggingface-cli download HuggingFaceH4/zephyr-7b-beta eval_results.json

List the downloaded models

Use the Hugging Face CLI to manage models stored in the local cache:

# List cached models
huggingface-cli scan-cache

# Show detailed (verbose) output
huggingface-cli scan-cache -v

# Specify a custom cache directory
huggingface-cli scan-cache --dir ~/.cache/huggingface/hub

Delete a cached model

Use the Hugging Face CLI to interactively delete downloaded models from the cache:

Commands

# The `delete-cache` command requires extra dependencies to work with the TUI.
# Please run `pip install huggingface_hub[cli]` to install them.

# Launch the interactive TUI to select models to delete
$ huggingface-cli delete-cache
? Select revisions to delete: 1 revisions selected counting for 438.9M.
  ○ None of the following (if selected, nothing will be deleted).
Model BAAI/bge-base-en-v1.5 (438.9M, used 1 week ago)
❯ ◉ a5beb1e3: main # modified 1 week ago

Model BAAI/bge-large-en-v1.5 (1.3G, used 1 week ago)
  ○ d4aa6901: main # modified 1 week ago

Model BAAI/bge-reranker-base (1.1G, used 4 weeks ago)
  ○ 2cfc18c9: main # modified 4 weeks ago

Press <space> to select, <enter> to validate and <ctrl+c> to quit without modification.

# Need to confirm after selected
? Select revisions to delete: 1 revision(s) selected.
? 1 revisions selected counting for 438.9M. Confirm deletion ? Yes
Start deletion.
Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.

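Using a proxy

Here are some tips for loading/downloading models from Hugging Face using a proxy:

Set the proxy globally for your session (or set it in the profile file):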

export http_proxy=http://your.proxy.server:port
export https_proxy=http://your.proxy.server:port

Set the proxy for just the current command:

https_proxy=http://your.proxy.server:port huggingface-cli download <model_name>

# or use vllm cmd directly
https_proxy=http://your.proxy.server:port  vllm serve <model_name> --disable-log-requests

Set the proxy in the Python interpreter:

import os

os.environ['http_proxy'] = 'http://your.proxy.server:port'
os.environ['https_proxy'] = 'http://your.proxy.server:port'

ModelScope

To use models from ModelScope instead of the Hugging Face Hub, set an environment variable:

export VLLM_USE_MODELSCOPE=True

And use it together with trust_remote_code=True.

from vllm import LLM

llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)

# For generative models (task=generate) only
output = llm.generate("Hello, my name is")
print(output)

# For pooling models (task={embed,classify,reward,score}) only
output = llm.encode("Hello, my name is")
print(output)

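Feature Status Legend

✅︎ indicates that the feature is supported for the model.
🚧 indicates that the feature is planned but not yet supported for the model.
⚠️ indicates that the feature is available but may have known issues or limitations.

List of Text-only Language Models

Generative Models

See this page for more information on how to use generative models.

Text Generation

Specified using --task generate.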

Architecture	Models	Example HF Models	LoRA	PP	V1
`AquilaForCausalLM`	Aquila, Aquila2	`BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.	✅︎	✅︎	✅︎
`ArceeForCausalLM`	Arcee (AFM)	`arcee-ai/AFM-4.5B-Base`, etc.	✅︎	✅︎	✅︎
`ArcticForCausalLM`	Arctic	`Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc.		✅︎	✅︎
`BaiChuanForCausalLM`	Baichuan2, Baichuan	`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.	✅︎	✅︎	✅︎
`BailingMoeForCausalLM`	Ling	`inclusionAI/Ling-lite-1.5`, `inclusionAI/Ling-plus`, etc.	✅︎	✅︎	✅︎
`BambaForCausalLM`	Bamba	`ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B`	✅︎	✅︎	✅︎
`BloomForCausalLM`	BLOOM, BLOOMZ, BLOOMChat	`bigscience/bloom`, `bigscience/bloomz`, etc.		✅︎
`BartForConditionalGeneration`	BART	`facebook/bart-base`, `facebook/bart-large-cnn`, etc.
`ChatGLMModel`, `ChatGLMForConditionalGeneration`	ChatGLM	`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc.	✅︎	✅︎	✅︎
`CohereForCausalLM`, `Cohere2ForCausalLM`	Command-R	`CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc.	✅︎	✅︎	✅︎
`DbrxForCausalLM`	DBRX	`databricks/dbrx-base`, `databricks/dbrx-instruct`, etc.		✅︎	✅︎
`DeciLMForCausalLM`	DeciLM	`nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc.	✅︎	✅︎	✅︎
`DeepseekForCausalLM`	DeepSeek	`deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat`, etc.		✅︎	✅︎
`DeepseekV2ForCausalLM`	DeepSeek-V2	`deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat`, etc.		✅︎	✅︎
`DeepseekV3ForCausalLM`	DeepSeek-V3	`deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3`, etc.		✅︎	✅︎
`Dots1ForCausalLM`	dots.llm1	`rednote-hilab/dots.llm1.base`, `rednote-hilab/dots.llm1.inst`, etc.		✅︎	✅︎
`Ernie4_5_ForCausalLM`	Ernie4.5	`baidu/ERNIE-4.5-0.3B-PT`, etc.	✅︎	✅︎	✅︎
`Ernie4_5_MoeForCausalLM`	Ernie4.5MoE	`baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc.	✅︎	✅︎	✅︎
`ExaoneForCausalLM`	EXAONE-3	`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.	✅︎	✅︎	✅︎
`Exaone4ForCausalLM`	EXAONE-4	`LGAI-EXAONE/EXAONE-4.0-32B`, etc.	✅︎	✅︎	✅︎
`Fairseq2LlamaForCausalLM`	Llama (fairseq2 format)	`mgleize/fairseq2-dummy-Llama-3.2-1B`, etc.	✅︎	✅︎	✅︎
`FalconForCausalLM`	Falcon	`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.		✅︎	✅︎
`FalconMambaForCausalLM`	FalconMamba	`tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc.		✅︎	✅︎
`FalconH1ForCausalLM`	Falcon-H1	`tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc.	✅︎	✅︎	✅︎
`GemmaForCausalLM`	Gemma	`google/gemma-2b`, `google/gemma-1.1-2b-it`, etc.	✅︎	✅︎	✅︎
`Gemma2ForCausalLM`	Gemma 2	`google/gemma-2-9b`, `google/gemma-2-27b`, etc.	✅︎	✅︎	✅︎
`Gemma3ForCausalLM`	Gemma 3	`google/gemma-3-1b-it`, etc.	✅︎	✅︎	✅︎
`Gemma3nForConditionalGeneration`	Gemma 3n	`google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc.			✅︎
`GlmForCausalLM`	GLM-4	`THUDM/glm-4-9b-chat-hf`, etc.	✅︎	✅︎	✅︎
`Glm4ForCausalLM`	GLM-4-0414	`THUDM/GLM-4-32B-0414`, etc.	✅︎	✅︎	✅︎
`GPT2LMHeadModel`	GPT-2	`gpt2`, `gpt2-xl`, etc.		✅︎	✅︎
`GPTBigCodeForCausalLM`	StarCoder, SantaCoder, WizardCoder	`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc.	✅︎	✅︎	✅︎
`GPTJForCausalLM`	GPT-J	`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.		✅︎	✅︎
`GPTNeoXForCausalLM`	GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM	`EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.		✅︎	✅︎
`GraniteForCausalLM`	Granite 3.0, Granite 3.1, PowerLM	`ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc.	✅︎	✅︎	✅︎
`GraniteMoeForCausalLM`	Granite 3.0 MoE, PowerMoE	`ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc.	✅︎	✅︎	✅︎
`GraniteMoeHybridForCausalLM`	Granite 4.0 MoE Hybrid	`ibm-granite/granite-4.0-tiny-preview`, etc.	✅︎	✅︎	✅︎
`GraniteMoeSharedForCausalLM`	Granite MoE Shared	`ibm-research/moe-7b-1b-active-shared-experts` (test model)	✅︎	✅︎	✅︎
`GritLM`	GritLM	`parasail-ai/GritLM-7B-vllm`.	✅︎	✅︎
`Grok1ModelForCausalLM`	Grok1	`hpcai-tech/grok-1`.	✅︎	✅︎	✅︎
`HunYuanDenseV1ForCausalLM`	Hunyuan-7B-Instruct-0124	`tencent/Hunyuan-7B-Instruct-0124`	✅︎		✅︎
`HunYuanMoEV1ForCausalLM`	Hunyuan-80B-A13B	`tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc.	✅︎		✅︎
`HCXVisionForCausalLM`	HyperCLOVAX-SEED-Vision-Instruct-3B	`naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B`			✅︎
`InternLMForCausalLM`	InternLM	`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.	✅︎	✅︎	✅︎
`InternLM2ForCausalLM`	InternLM2	`internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.	✅︎	✅︎	✅︎
`InternLM3ForCausalLM`	InternLM3	`internlm/internlm3-8b-instruct`, etc.	✅︎	✅︎	✅︎
`JAISLMHeadModel`	Jais	`inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc.		✅︎	✅︎
`JambaForCausalLM`	Jamba	`ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc.	✅︎	✅︎
`LlamaForCausalLM`	Llama 3.1, Llama 3, Llama 2, LLaMA, Yi	`meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc.	✅︎	✅︎	✅︎
`MambaForCausalLM`	Mamba	`state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc.		✅︎
`Mamba2ForCausalLM`	Mamba2	`mistralai/Mamba-Codestral-7B-v0.1`, etc.		✅︎	✅︎
`MiMoForCausalLM`	MiMo	`XiaomiMiMo/MiMo-7B-RL`, etc.	✅︎	✅︎	✅︎
`MiniCPMForCausalLM`	MiniCPM	`openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc.	✅︎	✅︎	✅︎
`MiniCPM3ForCausalLM`	MiniCPM3	`openbmb/MiniCPM3-4B`, etc.	✅︎	✅︎	✅︎
`MistralForCausalLM`	Mistral, Mistral-Instruct	`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.	✅︎	✅︎	✅︎
`MixtralForCausalLM`	Mixtral-8x7B, Mixtral-8x7B-Instruct	`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.	✅︎	✅︎	✅︎
`MPTForCausalLM`	MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter	`mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc.		✅︎	✅︎
`NemotronForCausalLM`	Nemotron-3, Nemotron-4, Minitron	`nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.	✅︎	✅︎	✅︎
`NemotronHForCausalLM`	Nemotron-H	`nvidia/Nemotron-H-8B-Base-8K`, `nvidia/Nemotron-H-47B-Base-8K`, `nvidia/Nemotron-H-56B-Base-8K`, etc.	✅︎	✅︎	✅︎
`OLMoForCausalLM`	OLMo	`allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc.		✅︎	✅︎
`OLMo2ForCausalLM`	OLMo2	`allenai/OLMo-2-0425-1B`, etc.		✅︎	✅︎
`OLMoEForCausalLM`	OLMoE	`allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc.		✅︎	✅︎
`OPTForCausalLM`	OPT, OPT-IML	`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.		✅︎	✅︎
`OrionForCausalLM`	Orion	`OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc.		✅︎	✅︎
`PhiForCausalLM`	Phi	`microsoft/phi-1_5`, `microsoft/phi-2`, etc.	✅︎	✅︎	✅︎
`Phi3ForCausalLM`	Phi-4, Phi-3	`microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.	✅︎	✅︎	✅︎
`PhiMoEForCausalLM`	Phi-3.5-MoE	`microsoft/Phi-3.5-MoE-instruct`, etc.	✅︎	✅︎	✅︎
`Phi4FlashForCausalLM`	Phi-4-mini-flash-reasoning	`microsoft/Phi-4-mini-flash-reasoning`, etc.
`PersimmonForCausalLM`	Persimmon	`adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc.		✅︎	✅︎
`Plamo2ForCausalLM`	PLaMo2	`pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc.
`QWenLMHeadModel`	Qwen	`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.	✅︎	✅︎	✅︎
`Qwen2ForCausalLM`	QwQ, Qwen2	`Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc.	✅︎	✅︎	✅︎
`Qwen2MoeForCausalLM`	Qwen2MoE	`Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.	✅︎	✅︎	✅︎
`Qwen3ForCausalLM`	Qwen3	`Qwen/Qwen3-8B`, etc.	✅︎	✅︎	✅︎
`Qwen3MoeForCausalLM`	Qwen3MoE	`Qwen/Qwen3-30B-A3B`, etc.	✅︎	✅︎	✅︎
`StableLmForCausalLM`	StableLM	`stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.			✅︎
`Starcoder2ForCausalLM`	Starcoder2	`bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc.		✅︎	✅︎
`SolarForCausalLM`	Solar Pro	`upstage/solar-pro-preview-instruct`, etc.	✅︎	✅︎	✅︎
`TeleChat2ForCausalLM`	TeleChat2	`Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc.	✅︎	✅︎	✅︎
`TeleFLMForCausalLM`	TeleFLM	`CofeAI/FLM-2-52B-Instruct-2407`, `CofeAI/Tele-FLM`, etc.	✅︎	✅︎	✅︎
`XverseForCausalLM`	XVERSE	`xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.	✅︎	✅︎	✅︎
`MiniMaxM1ForCausalLM`	MiniMax-Text	`MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`, etc.
`MiniMaxText01ForCausalLM`	MiniMax-Text	`MiniMaxAI/MiniMax-Text-01`, etc.
`Zamba2ForCausalLM`	Zamba2	`Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.			✅︎

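Note

Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.

Note

Gemma3nForConditionalGeneration currently supports text input only. To use this model, please upgrade Hugging Face Transformers to version 4.53.0.

Pooling Models

See this page for more information on how to use pooling models.

Important

Since some model architectures support both generative and pooling tasks, you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding

Specified using --task embed.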

Architecture	Models	Example HF Models	LoRA	PP	V1
`BertModel`	BERT-based	`BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc.
`Gemma2Model`	Gemma 2-based	`BAAI/bge-multilingual-gemma2`, etc.	✅︎		✅︎
`GritLM`	GritLM	`parasail-ai/GritLM-7B-vllm`.	✅︎	✅︎
`GteModel`	Arctic-Embed-2.0-M	`Snowflake/snowflake-arctic-embed-m-v2.0`.
`GteNewModel`	mGTE-TRM (see note)	`Alibaba-NLP/gte-multilingual-base`, etc.
`ModernBertModel`	ModernBERT-based	`Alibaba-NLP/gte-modernbert-base`, etc.
`NomicBertModel`	Nomic BERT	`nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc.
`LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.	Llama-based	`intfloat/e5-mistral-7b-instruct`, etc.	✅︎	✅︎	✅︎
`Qwen2Model`, `Qwen2ForCausalLM`	Qwen2-based	`ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.	✅︎	✅︎	✅︎
`Qwen3Model`, `Qwen3ForCausalLM`	Qwen3-based	`Qwen/Qwen3-Embedding-0.6B`, etc.	✅︎	✅︎	✅︎
`RobertaModel`, `RobertaForMaskedLM`	RoBERTa-based	`sentence-transformers/all-roberta-large-v1`, etc.

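Note

ssmits/Qwen2-7B-Instruct-embed-base has an improperly defined Sentence Transformers config. You need to set mean pooling manually by passing --override-pooler-config '{"pooling_type": "MEAN"}'.

Note

For Alibaba-NLP/gte-Qwen2-*, you need to enable --trust-remote-code for the correct tokenizer to be loaded. See the relevant issue on HF Transformers.

Note

jinaai/jina-embeddings-v3 supports multiple tasks through LoRA, while vLLM temporarily supports only the text-matching task by merging LoRA weights.

Note

The second-generation GTE model (mGTE-TRM) is named NewModel. The name NewModel is too generic, so you should set --hf-overrides '{"architectures": ["GteNewModel"]}' to specify the use of the GteNewModel architecture.

If your model is not in the above list, we will try to automatically convert the model using as_embedding_model. By default, the embeddings of the whole prompt are extracted from the normalized hidden state corresponding to the last token.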
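
As an illustration of the default pooling described above, the sketch below takes per-token hidden states as plain Python lists, picks the last token, and L2-normalizes it. It is a minimal sketch of the idea only: the function name last_token_embedding is hypothetical, and using the L2 norm is an assumption about what "normalized" means here.

```python
import math


def last_token_embedding(hidden_states):
    """hidden_states: one vector (list of floats) per token, in prompt order."""
    last = hidden_states[-1]
    norm = math.sqrt(sum(x * x for x in last))
    if norm == 0.0:
        # Avoid dividing by zero; return the vector unchanged.
        return list(last)
    return [x / norm for x in last]


# Two tokens with 2-dimensional hidden states; only the last one is used.
print(last_token_embedding([[1.0, 2.0], [3.0, 4.0]]))  # [0.6, 0.8]
```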

Reward Modeling

Specified using --task reward.

Architecture	Models	Example HF Models	LoRA	PP	V1
`InternLM2ForRewardModel`	InternLM2-based	`internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc.	✅︎	✅︎	✅︎
`LlamaForCausalLM`	Llama-based	`peiyi9979/math-shepherd-mistral-7b-prm`, etc.	✅︎	✅︎	✅︎
`Qwen2ForRewardModel`	Qwen2-based	`Qwen/Qwen2.5-Math-RM-72B`, etc.	✅︎	✅︎	✅︎
`Qwen2ForProcessRewardModel`	Qwen2-based	`Qwen/Qwen2.5-Math-PRM-7B`, etc.	✅︎	✅︎	✅︎

If your model is not in the above list, we will try to automatically convert the model using as_reward_model. By default, we return the hidden states of each token directly.

Important

For process-supervised reward models such as peiyi9979/math-shepherd-mistral-7b-prm, the pooling config should be set explicitly, e.g.: --override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'.
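
Hand-writing JSON inside shell quotes is error-prone, so it can help to generate the argument. This is a minimal sketch using only the standard library; pooler_override_flag is a hypothetical helper, and the values are the placeholder IDs from the example above, not real token IDs for any particular model.

```python
import json
import shlex


def pooler_override_flag(**settings):
    """Build a shell-safe --override-pooler-config argument from keyword settings."""
    return "--override-pooler-config " + shlex.quote(json.dumps(settings))


flag = pooler_override_flag(
    pooling_type="STEP",
    step_tag_id=123,  # placeholder value from the example above
    returned_token_ids=[456, 789],  # placeholder values from the example above
)
print(flag)
```

The same pattern works for other JSON-valued options on this page, such as --hf-overrides, by swapping the option name.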

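Classification

Specified using --task classify.

Architecture	Models	Example HF Models	LoRA	PP	V1
`JambaForSequenceClassification`	Jamba	`ai21labs/Jamba-tiny-reward-dev`, etc.	✅︎	✅︎
`GPT2ForSequenceClassification`	GPT2	`nie3e/sentiment-polish-gpt2-small`			✅︎

If your model is not in the above list, we will try to automatically convert the model using as_seq_cls_model. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.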
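
The softmax step mentioned above can be written out in a few lines. This is a minimal sketch that treats the last token's output as one score per class; the function name class_probabilities is hypothetical and the code is not taken from vLLM.

```python
import math


def class_probabilities(last_token_scores):
    """Numerically stable softmax over the per-class scores of the last token."""
    highest = max(last_token_scores)
    exps = [math.exp(score - highest) for score in last_token_scores]
    total = sum(exps)
    return [value / total for value in exps]


print(class_probabilities([0.0, 0.0]))  # [0.5, 0.5]
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large scores.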

Sentence Pair Scoring

Specified using --task score.

Architecture	Models	Example HF Models	V1
`BertForSequenceClassification`	BERT-based	`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
`GemmaForSequenceClassification`	Gemma-based	`BAAI/bge-reranker-v2-gemma` (see note), etc.
`Qwen2ForSequenceClassification`	Qwen2-based	`mixedbread-ai/mxbai-rerank-base-v2` (see note), etc.	✅︎
`Qwen3ForSequenceClassification`	Qwen3-based	`tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc.	✅︎
`RobertaForSequenceClassification`	RoBERTa-based	`cross-encoder/quora-roberta-base`, etc.
`XLMRobertaForSequenceClassification`	XLM-RoBERTa-based	`BAAI/bge-reranker-v2-m3`, etc.

Note

Load the official original BAAI/bge-reranker-v2-gemma by using the following command.

vllm serve BAAI/bge-reranker-v2-gemma --hf_overrides '{"architectures": ["GemmaForSequenceClassification"],"classifier_from_token": ["Yes"],"method": "no_post_processing"}'

Note

Load the official original mxbai-rerank-v2 by using the following command.

vllm serve mixedbread-ai/mxbai-rerank-base-v2 --hf_overrides '{"architectures": ["Qwen2ForSequenceClassification"],"classifier_from_token": ["0", "1"], "method": "from_2_way_softmax"}'

Note

Load the official original Qwen3 Reranker by using the following command. For more information, see examples/offline_inference/qwen3_reranker.py.

vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'

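List of Multimodal Language Models

The following modalities are supported depending on the model:

Text (T)
Image (I)
Video (V)
Audio (A)

Any combination of modalities joined by + is supported.

For example, T + I means that the model supports text-only, image-only, and text-with-image inputs.

On the other hand, modalities separated by / are mutually exclusive.

For example, T / I means that the model supports text-only and image-only inputs, but not text-with-image inputs.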
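
The notation can be read as "alternatives separated by /, each of which is a set of modalities joined by +". The sketch below encodes that reading in a hypothetical helper, input_allowed, which is not a vLLM API; it ignores the superscript markers used in the tables and only looks at the modality letters.

```python
import re


def input_allowed(spec, modalities):
    """Return True if the non-empty set of modality letters fits the spec.

    An input fits when all of its modalities belong to one '/'-separated alternative.
    """
    if not modalities:
        return False
    for alternative in spec.split("/"):
        letters = set()
        for part in alternative.split("+"):
            match = re.search(r"[A-Z]", part)  # skip superscripts and brackets
            if match:
                letters.add(match.group(0))
        if set(modalities) <= letters:
            return True
    return False


print(input_allowed("T + I", {"T", "I"}))  # True
print(input_allowed("T / I", {"T", "I"}))  # False
print(input_allowed("T / I", {"I"}))  # True
```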

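See this page for details on how to pass multi-modal inputs to the model.

Important

To enable multiple multi-modal items per text prompt in vLLM V0, you have to set limit_mm_per_prompt (offline inference) or --limit-mm-per-prompt (online serving). For example, to enable passing up to 4 images per text prompt:

Offline inference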

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)

Online serving

vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image":4}'

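This is no longer required if you are using vLLM V1.

Note

vLLM currently only supports adding LoRA to the language backbone of multimodal models.

Generative Models

See this page for more information on how to use generative models.

Text Generation

Specified using --task generate.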

Architecture	Models	Inputs	Example HF Models	LoRA	PP	V1
`AriaForConditionalGeneration`	Aria	T + I⁺	`rhymes-ai/Aria`			✅︎
`AyaVisionForConditionalGeneration`	Aya Vision	T + I⁺	`CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc.		✅︎	✅︎
`Blip2ForConditionalGeneration`	BLIP-2	T + I^E	`Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.		✅︎	✅︎
`ChameleonForConditionalGeneration`	Chameleon	T + I	`facebook/chameleon-7b`, etc.		✅︎	✅︎
`DeepseekVLV2ForCausalLM`^{^}	DeepSeek-VL2	T + I⁺	`deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2`, etc.		✅︎	✅︎
`Florence2ForConditionalGeneration`	Florence-2	T + I	`microsoft/Florence-2-base`, `microsoft/Florence-2-large`, etc.
`FuyuForCausalLM`	Fuyu	T + I	`adept/fuyu-8b`, etc.		✅︎	✅︎
`Gemma3ForConditionalGeneration`	Gemma 3	T + I⁺	`google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.	✅︎	✅︎	⚠️
`GLM4VForCausalLM`^{^}	GLM-4V	T + I	`THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220`, etc.	✅︎	✅︎	✅︎
`Glm4vForConditionalGeneration`	GLM-4.1V-Thinking	T + I^E+ + V^E+	`THUDM/GLM-4.1V-9B-Thinking`, etc.	✅︎	✅︎	✅︎
`Glm4MoeForCausalLM`	GLM-4.5	T + I^E+ + V^E+	`THUDM/GLM-4.5`, etc.	✅︎	✅︎	✅︎
`GraniteSpeechForConditionalGeneration`	Granite Speech	T + A	`ibm-granite/granite-speech-3.3-8b`	✅︎	✅︎	✅︎
`H2OVLChatModel`	H2OVL	T + I^E+	`h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.		✅︎	✅︎
`Idefics3ForConditionalGeneration`	Idefics3	T + I	`HuggingFaceM4/Idefics3-8B-Llama3`, etc.	✅︎		✅︎
`InternVLChatModel`	InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0	T + I^E+ + (V^E+)	`OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.	✅︎	✅︎	✅︎
`KeyeForConditionalGeneration`	Keye-VL-8B-Preview	T + I^E+ + V^E+	`Kwai-Keye/Keye-VL-8B-Preview`			✅︎
`KimiVLForConditionalGeneration`	Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking	T + I⁺	`moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking`			✅︎
`Llama4ForConditionalGeneration`	Llama 4	T + I⁺	`meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.		✅︎	✅︎
`Llama_Nemotron_Nano_VL`	Llama Nemotron Nano VL	T + I^E+	`nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1`	✅︎	✅︎	✅︎
`LlavaForConditionalGeneration`	LLaVA-1.5, Pixtral (HF Transformers)	T + I^E+	`llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), `mistral-community/pixtral-12b`, etc.		✅︎	✅︎
`LlavaNextForConditionalGeneration`	LLaVA-NeXT	T + I^E+	`llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.		✅︎	✅︎
`LlavaNextVideoForConditionalGeneration`	LLaVA-NeXT-Video	T + V	`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.		✅︎	✅︎
`LlavaOnevisionForConditionalGeneration`	LLaVA-Onevision	T + I⁺ + V⁺	`llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.		✅︎	✅︎
`MiniCPMO`	MiniCPM-O	T + I^E+ + V^E+ + A^E+	`openbmb/MiniCPM-o-2_6`, etc.	✅︎	✅︎	✅︎
`MiniCPMV`	MiniCPM-V	T + I^E+ + V^E+	`openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.	✅︎		✅︎
`MiniMaxVL01ForConditionalGeneration`	MiniMax-VL	T + I^E+	`MiniMaxAI/MiniMax-VL-01`, etc.		✅︎	✅︎
`Mistral3ForConditionalGeneration`	Mistral3 (HF Transformers)	T + I⁺	`mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc.	✅︎	✅︎	✅︎
`MllamaForConditionalGeneration`	Llama 3.2	T + I⁺	`meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
`MolmoForCausalLM`	Molmo	T + I⁺	`allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.	✅︎	✅︎	✅︎
`NVLM_D_Model`	NVLM-D 1.0	T + I⁺	`nvidia/NVLM-D-72B`, etc.		✅︎	✅︎
`Ovis`	Ovis2, Ovis1.6	T + I⁺	`AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc.		✅︎	✅︎
`PaliGemmaForConditionalGeneration`	PaliGemma, PaliGemma 2	T + I^E	`google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.		✅︎	⚠️
`Phi3VForCausalLM`	Phi-3-Vision, Phi-3.5-Vision	T + I^E+	`microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.		✅︎	✅︎
`Phi4MMForCausalLM`	Phi-4-multimodal	T + I⁺ / T + A⁺ / I⁺ + A⁺	`microsoft/Phi-4-multimodal-instruct`, etc.	✅︎	✅︎	✅︎
`PixtralForConditionalGeneration`	Mistral 3 (Mistral format), Pixtral (Mistral format)	T + I⁺	`mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistralai/Pixtral-12B-2409`, etc.		✅︎	✅︎
`QwenVLForConditionalGeneration`^{^}	Qwen-VL	T + I^E+	`Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.	✅︎	✅︎	✅︎
`Qwen2AudioForConditionalGeneration`	Qwen2-Audio	T + A⁺	`Qwen/Qwen2-Audio-7B-Instruct`		✅︎	✅︎
`Qwen2VLForConditionalGeneration`	QVQ, Qwen2-VL	T + I^E+ + V^E+	`Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.	✅︎	✅︎	✅︎
`Qwen2_5_VLForConditionalGeneration`	Qwen2.5-VL	T + I^E+ + V^E+	`Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc.	✅︎	✅︎	✅︎
`Qwen2_5OmniThinkerForConditionalGeneration`	Qwen2.5-Omni	T + I^E+ + V^E+ + A⁺	`Qwen/Qwen2.5-Omni-7B`		✅︎	✅︎
`SkyworkR1VChatModel`	Skywork-R1V-38B	T + I	`Skywork/Skywork-R1V-38B`		✅︎	✅︎
`SmolVLMForConditionalGeneration`	SmolVLM2	T + I	`SmolVLM2-2.2B-Instruct`	✅︎		✅︎
`TarsierForConditionalGeneration`	Tarsier	T + I^E+	`omni-research/Tarsier-7b`, `omni-research/Tarsier-34b`		✅︎	✅︎
`Tarsier2ForConditionalGeneration`^{^}	Tarsier2	T + I^E+ + V^E+	`omni-research/Tarsier2-Recap-7b`, `omni-research/Tarsier2-7b-0115`		✅︎	✅︎

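Some models are supported only via the Transformers backend. The purpose of the table below is to acknowledge models that we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please open an issue and we will do our best to fix it!

Architecture	Models	Inputs	Example HF Models	LoRA	PP	V1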
`Emu3ForConditionalGeneration`	Emu3	T + I	`BAAI/Emu3-Chat-hf`	✅︎	✅︎	✅︎

^{^} You need to set the architecture name via --hf-overrides to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
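^E Pre-computed embeddings can be inputted for this modality.
⁺ Multiple items can be inputted per text prompt for this modality.

Warning

Both V0 and V1 support Gemma3ForConditionalGeneration for text-only inputs. However, they differ in how they handle text + image inputs:

V0 correctly implements the model's attention pattern:
- Uses bidirectional attention between the image tokens corresponding to the same image
- Uses causal attention for other tokens
- Is implemented via (naive) PyTorch SDPA with masking tensors
- Note: may use significant memory for long prompts that contain images

V1 currently uses a simplified attention pattern:
- Uses causal attention for all tokens, including image tokens
- Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when {"do_pan_and_scan": true}
- Will be updated in the future to support the correct behavior

This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.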
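
The difference between the two patterns is easiest to see on a small mask. The following is a minimal sketch in plain Python, not vLLM's or the model's implementation; mixed_attention_mask is a hypothetical helper, and each token is described by the index of the image it belongs to, or None for a text token.

```python
def mixed_attention_mask(image_ids):
    """mask[q][k] is True when query token q may attend to key token k.

    Image tokens of the same image attend to each other in both directions;
    every other pair follows the usual causal rule (k <= q).
    """
    n = len(image_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = image_ids[q] is not None and image_ids[q] == image_ids[k]
            mask[q][k] = causal or same_image
    return mask


# One text token, two tokens of image 0, then one more text token.
mixed = mixed_attention_mask([None, 0, 0, None])
# Treating every token as text gives the purely causal pattern.
causal_only = mixed_attention_mask([None, None, None, None])

print(mixed[1][2], causal_only[1][2])  # True False
```

The only entries that differ between the two masks are those where an image token looks ahead to a later token of the same image, which is the part the simplified pattern omits.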

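Note

Currently, only InternVLChatModel with a Qwen2.5 text backbone (OpenGVLab/InternVL3-2B, OpenGVLab/InternVL2.5-1B, etc.) supports video inputs.

Note

To use TIGER-Lab/Mantis-8B-siglip-llama3, you have to pass --hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}' when running vLLM.

Warning

The output quality of AllenAI/Molmo-7B-D-0924 (especially in object localization tasks) has deteriorated in recent updates.

For the best results, we recommend using the following dependency versions (tested on A10 and L40):

Dependency versions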

# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
torch==2.5.1
torchvision==0.20.1
transformers==4.48.1
tokenizers==0.21.0
tiktoken==0.7.0
vllm==0.7.0

# Optional but recommended for improved performance and stability
triton==3.1.0
xformers==0.0.28.post3
uvloop==0.21.0
protobuf==5.29.3
openai==1.60.2
opencv-python-headless==4.11.0.86
pillow==10.4.0

# Installed FlashAttention (for float16 only)
flash-attn>=2.5.6  # Not used in float32, but should be documented

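Note: Make sure you understand the security implications of using outdated packages.

Note

The official openbmb/MiniCPM-V-2 does not work yet, so we need to use a fork (HwwwH/MiniCPM-V-2) for now. For more details, please see: Pull Request #4087

Warning

Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.

Note

For Qwen2.5-Omni, reading audio from video pre-processing (--mm-processor-kwargs '{"use_audio_in_video": true}') is currently supported on V0 but not on V1, because V1 does not yet support overlapping modalities.

Transcription

Specified using --task transcription.

Speech2Text models trained specifically for Automatic Speech Recognition.

Architecture	Models	Example HF Models	LoRA	PP	V1
`WhisperForConditionalGeneration`	Whisper	`openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc.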

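Pooling Models

See this page for more information on how to use pooling models.

Important

Since some model architectures support both generative and pooling tasks, you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding

Specified using --task embed.

Any text generation model can be converted into an embedding model by passing --task embed.

Note

To get the best results, you should use pooling models that are specifically trained as such.

The following table lists the models that have been tested in vLLM.

Architecture	Models	Inputs	Example HF Models	LoRA	PP	V1
`LlavaNextForConditionalGeneration`	LLaVA-NeXT-based	T / I	`royokong/e5-v`
`Phi3VForCausalLM`	Phi-3-Vision-based	T + I	`TIGER-Lab/VLM2Vec-Full`	🚧	✅︎

Scoring

Specified using --task score.

Architecture	Models	Inputs	Example HF Models	LoRA	PP	V1
`JinaVLForSequenceClassification`	JinaVL-based	T + I^E+	`jinaai/jina-reranker-m0`, etc.			✅︎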

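Model Support Policy

At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness against the practical limitations of supporting a wide range of models. Here is how we manage third-party model support:

Community-driven support: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than on strict consistency with existing implementations such as those in Transformers. Call for contribution: PRs coming directly from model vendors are greatly appreciated!
Best-effort consistency: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like Transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.

Tip

When comparing the output of model.generate from Hugging Face Transformers with the output of llm.generate from vLLM, note that the former reads the model's generation config file (i.e., generation_config.json) and applies the default generation parameters, whereas the latter only uses the parameters passed to the function. Ensure that all sampling parameters are identical when comparing outputs.
Issue resolution and model updates: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
Monitoring and updates: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
Selective focus: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.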
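
To make the tip about generation_config.json concrete, the sketch below models the two behaviours with plain dictionaries: one side starts from the file's defaults and applies explicit arguments on top, the other uses only explicit arguments. It is a minimal sketch; the helper names and the short list of keys are illustrative assumptions, not an exhaustive description of either library.

```python
# Illustrative subset of sampling-related keys (assumption, not exhaustive).
SAMPLING_KEYS = ("do_sample", "temperature", "top_p", "top_k", "repetition_penalty")


def effective_sampling_params(generation_config, call_kwargs):
    """Defaults from the generation config, overridden by explicit call arguments."""
    merged = {k: generation_config[k] for k in SAMPLING_KEYS if k in generation_config}
    merged.update({k: v for k, v in call_kwargs.items() if k in SAMPLING_KEYS})
    return merged


def differing_params(a, b):
    """Map each key whose value differs to the pair (value in a, value in b)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# One side picks up file defaults; the other only sees what was passed explicitly.
with_defaults = effective_sampling_params({"temperature": 0.6, "top_p": 0.9}, {"top_p": 0.95})
explicit_only = {"top_p": 0.95}
print(differing_params(with_defaults, explicit_only))  # {'temperature': (0.6, None)}
```

An empty result from differing_params is the condition to aim for before comparing generated text.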

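Through this approach, vLLM fosters a collaborative environment in which both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.

Note that, as an inference engine, vLLM does not introduce new models. Therefore, all models supported by vLLM are third-party models in this regard.

We have the following levels of testing for models:

Strict consistency: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to the models tests for the models that have passed this test.
Output sensibility: We check whether the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
Runtime functionality: We check whether the model can be loaded and run without errors. This is the least stringent test. Please refer to the functionality tests and examples for the models that have passed this test.
Community feedback: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.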
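
Two of the test levels above rest on simple computations: comparing token sequences under greedy decoding, and measuring perplexity. The sketch below shows both in plain Python. It illustrates the ideas only and is not the code used in vLLM's test suite; both function names are hypothetical.

```python
import math


def first_divergence(reference_tokens, candidate_tokens):
    """Index of the first differing token, or None if the sequences are identical."""
    for index, (ref, cand) in enumerate(zip(reference_tokens, candidate_tokens)):
        if ref != cand:
            return index
    if len(reference_tokens) != len(candidate_tokens):
        return min(len(reference_tokens), len(candidate_tokens))
    return None


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the generated tokens."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


print(first_divergence([1, 2, 3], [1, 2, 4]))  # 2
print(perplexity([math.log(0.5)] * 4))
```

With every token assigned probability 0.5, the perplexity comes out at approximately 2, matching the intuition of a model choosing between two equally likely options at each step.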