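Pooling Models

vLLM also supports pooling models, such as embedding, classification, and reward models.

In vLLM, pooling models implement the VllmModelForPooling interface. These models use a Pooler to extract the final hidden states of the input before returning them.

Note

We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvement over using Hugging Face Transformers or Sentence Transformers directly.

We plan to optimize pooling models in vLLM. Please comment on Issue #21796 if you have any suggestions!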

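Configuration

Model Runner

Run a model in pooling mode via the option --runner pooling.

Tip

In the vast majority of cases there is no need to set this option, because vLLM can automatically detect the appropriate model runner via --runner auto.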

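Model Conversion

vLLM can adapt models for various pooling tasks via the option --convert <type>.

If --runner pooling has been set (manually or automatically) but the model does not implement the VllmModelForPooling interface, vLLM attempts to convert the model automatically based on the architecture names shown in the table below.

Architecture	`--convert`	Supported pooling tasks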
`ForTextEncoding`, `EmbeddingModel`, `*Model`	`embed`	`token_embed`, `embed`
`ForRewardModeling`, `RewardModel`	`embed`	`token_embed`, `embed`
`ForClassification`, `*ClassificationModel`	`classify`	`token_classify`, `classify`, `score`
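
To make the table concrete, here is a minimal sketch of suffix-based matching on an architecture name. The rule list and the helper `guess_convert_type` are hypothetical and written only for illustration; they mirror the table rows as printed and are not vLLM's actual implementation.

```python
from typing import Optional

# Hypothetical rules copied from the table above (illustration only).
# More specific suffixes come first so that "...ClassificationModel"
# is not swallowed by the generic "Model" rule.
SUFFIX_RULES = [
    ("ForClassification", "classify"),
    ("ClassificationModel", "classify"),
    ("ForTextEncoding", "embed"),
    ("EmbeddingModel", "embed"),
    ("ForRewardModeling", "embed"),
    ("RewardModel", "embed"),
    ("Model", "embed"),
]


def guess_convert_type(architecture: str) -> Optional[str]:
    """Return the --convert type whose suffix matches, or None if none does."""
    for suffix, convert_type in SUFFIX_RULES:
        if architecture.endswith(suffix):
            return convert_type
    return None


# Made-up architecture names, used only to exercise the rules.
print(guess_convert_type("ToyClassificationModel"))
print(guess_convert_type("ToyModel"))
print(guess_convert_type("ToyForCausalLM"))
```

The ordering of the rules matters: with the generic `Model` suffix listed first, every classification model whose name ends in `ClassificationModel` would be treated as an embedding model.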

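Tip

You can explicitly set --convert <type> to specify how to convert the model.

Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks, as determined by Pooler.get_supported_tasks, which enables the corresponding APIs.

Task	API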
`embed`	`LLM.embed(...)`, `LLM.score(...)`*, `LLM.encode(..., pooling_task="embed")`
`classify`	`LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`
`score`	`LLM.score(...)`
`token_classify`	`LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`
`token_embed`	`LLM.encode(..., pooling_task="token_embed")`
`plugin`	`LLM.encode(..., pooling_task="plugin")`

* The LLM.score(...) API falls back to the embed task if the model does not support the score task.
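
The fallback described in the footnote can be sketched as a small selection function. `resolve_score_task` is a hypothetical helper written for this illustration; it takes a plain collection of task names and does not call into vLLM.

```python
from typing import Iterable


def resolve_score_task(supported_tasks: Iterable[str]) -> str:
    """Pick the pooling task that a score request would run on.

    Prefers the dedicated "score" task and falls back to "embed",
    matching the footnote above. Illustration only, not vLLM code.
    """
    tasks = set(supported_tasks)
    if "score" in tasks:
        return "score"
    if "embed" in tasks:
        return "embed"
    raise ValueError("model supports neither the score nor the embed task")


print(resolve_score_task(["token_classify", "classify", "score"]))
print(resolve_score_task(["token_embed", "embed"]))
```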

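Pooler Configuration

Predefined Models

If the Pooler defined by the model accepts pooler_config, you can override some of its attributes via the --pooler-config option.

Converted Models

If the model has been converted via --convert (see above), the pooler assigned to each task has the following attributes by default:

Task	Pooling Type	Normalization	Softmax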
`embed`	`LAST`	✅︎	❌
`classify`	`LAST`	❌	✅︎
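
As a rough illustration of what these defaults mean, the sketch below applies LAST pooling followed by L2 normalization (the embed row) and a softmax over class logits (the classify row) to toy data in plain Python. The helper names are made up for this example, and the class logits are assumed to have already been produced by a classification head.

```python
import math


def last_pool(hidden_states):
    """LAST pooling: keep only the hidden state of the final token."""
    return hidden_states[-1]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy per-token hidden states: 3 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]

# embed: LAST pooling, then normalization.
embedding = l2_normalize(last_pool(hidden_states))

# classify: softmax over (assumed) class logits of the last token.
probs = softmax([2.0, 0.0])

print(embedding)
print(probs)
```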

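When loading Sentence Transformers models, the Sentence Transformers configuration file (modules.json) takes priority over the model's default settings.

You can further customize this via the --pooler-config option, which takes priority over both the model's and Sentence Transformers' defaults.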
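
The precedence order described above (model defaults, then the Sentence Transformers configuration, then --pooler-config) can be pictured as a layered dictionary merge. This is only a sketch: `resolve_pooler_settings` is a hypothetical helper, and the key names `pooling_type` and `normalize` are illustrative assumptions rather than a statement of the exact fields vLLM accepts.

```python
from typing import Any, Dict, Optional


def resolve_pooler_settings(
    model_defaults: Dict[str, Any],
    st_config: Optional[Dict[str, Any]] = None,
    user_override: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge settings so that later layers win over earlier ones."""
    resolved = dict(model_defaults)
    for layer in (st_config, user_override):
        if layer:
            resolved.update(layer)
    return resolved


settings = resolve_pooler_settings(
    model_defaults={"pooling_type": "LAST", "normalize": True},
    st_config={"pooling_type": "MEAN"},   # stands in for modules.json
    user_override={"normalize": False},   # stands in for --pooler-config
)
print(settings)
```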

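Offline Inference

The LLM class provides various methods for offline inference. See Configuration for a list of options when initializing the model.

`LLM.embed`

The embed method outputs an embedding vector for each prompt. It is primarily designed for embedding models.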

from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")

A code example can be found here: examples/offline_inference/basic/embed.py

`LLM.classify`

The classify method outputs a probability vector for each prompt. It is primarily designed for classification models.

from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")

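A code example can be found here: examples/offline_inference/basic/classify.py

`LLM.score`

The score method outputs similarity scores between sentence pairs. It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, while cross-encoder models serve as rerankers between candidate query-document pairs in RAG systems.

Note

vLLM can only perform the model inference component of RAG (e.g. embedding, reranking). To handle RAG at a higher level, you should use integration frameworks such as LangChain.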

from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")

A code example can be found here: examples/offline_inference/basic/score.py
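
For embedding models, the score is a cosine similarity between the two embeddings. The following is a minimal pure-Python sketch of that formula on toy vectors; `cosine_similarity` is a helper written for this example and is not part of the vLLM API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings: identical direction, orthogonal, and opposite.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

When both embeddings are already L2-normalized, the denominator equals 1 and the score reduces to a plain dot product.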

`LLM.reward`

The reward method is available to all reward models in vLLM.

from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.reward("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")

A code example can be found here: examples/offline_inference/basic/reward.py

`LLM.encode`

The encode method is available to all pooling models in vLLM.

Note

Please use one of the more specific methods, or set the task directly when using LLM.encode.

For embeddings, use LLM.embed(...) or pooling_task="embed".
For classification logits, use LLM.classify(...) or pooling_task="classify".
For similarity scores, use LLM.score(...).
For rewards, use LLM.reward(...) or pooling_task="token_classify".
For token classification, use pooling_task="token_classify".
For multi-vector retrieval, use pooling_task="token_embed".
For IO processor plugins, use pooling_task="plugin".

from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")

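Online Serving

Our OpenAI-compatible server provides endpoints that correspond to the offline APIs.

The Embeddings API is similar to LLM.embed and accepts both text and multi-modal inputs for embedding models.
The Classification API is similar to LLM.classify and applies to sequence classification models.
The Score API is similar to LLM.score for cross-encoder models.
The Pooling API is similar to LLM.encode and applies to all types of pooling models.

Note

When using the Pooling API, please use one of the more specific endpoints or set the task directly.

For embeddings, use the Embeddings API or "task":"embed".
For classification logits, use the Classification API or "task":"classify".
For similarity scores, use the Score API.
For rewards, use "task":"token_classify".
For token classification, use "task":"token_classify".
For multi-vector retrieval, use "task":"token_embed".
For IO processor plugins, use "task":"plugin".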

# start a supported embeddings model server with `vllm serve`, e.g.
# vllm serve intfloat/e5-small
import requests

host = "localhost"
port = "8000"
model_name = "intfloat/e5-small"

api_url = f"http://{host}:{port}/pooling"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompt = {"model": model_name, "input": prompts, "task": "embed"}

response = requests.post(api_url, json=prompt)

for output in response.json()["data"]:
    data = output["data"]
    print(f"Data: {data!r} (size={len(data)})")
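
The request body used above can be wrapped in a small helper that rejects unknown task names before anything is sent. This is a sketch under stated assumptions: `build_pooling_request` is a hypothetical helper, and the allowed task names are simply the ones listed in the note above; the server may accept a different set.

```python
import json

# Task names taken from the note above (assumed to be the full set here).
POOLING_TASKS = {"embed", "classify", "token_classify", "token_embed", "plugin"}


def build_pooling_request(model, inputs, task):
    """Build the JSON body for a Pooling API request."""
    if task not in POOLING_TASKS:
        raise ValueError(f"unknown pooling task: {task!r}")
    return {"model": model, "input": list(inputs), "task": task}


payload = build_pooling_request(
    "intfloat/e5-small",
    ["Hello, my name is", "The capital of France is"],
    "embed",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the example above.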

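Matryoshka Embeddings

Matryoshka Embeddings, or Matryoshka Representation Learning (MRL), is a technique used in training embedding models. It allows users to trade off between performance and cost.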
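
To illustrate the trade-off, the sketch below shortens an embedding by keeping its leading components and re-normalizing the result. It assumes the common MRL convention that the first `d` components form a usable lower-dimensional embedding; `truncate_embedding` is a hypothetical helper written for this example, not a description of vLLM internals.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = list(embedding[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0, 0.5]   # toy 4-dimensional embedding
short = truncate_embedding(full, 2)
print(short)
```

A shorter vector costs less to store and compare, at the price of some retrieval quality; how much is lost depends on how the model was trained.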

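Warning

Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the dimensions parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

For example, setting the dimensions parameter while using the BAAI/bge-m3 model results in the following error.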

{"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}

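Manually Enabling Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if is_matryoshka is True in config.json, the output dimension can be changed to an arbitrary value. Use matryoshka_dimensions to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using hf_overrides={"is_matryoshka": True} or hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]} (offline), or --hf-overrides '{"is_matryoshka": true}' or --hf-overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}' (online).

Here is an example of serving a model with Matryoshka Embeddings enabled.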

vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
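
The rules in this section can be summarized as a small validation routine over the model configuration. The sketch below is an assumption-laden illustration: `check_dimensions` is a hypothetical helper operating on a plain dictionary, and it treats the presence of matryoshka_dimensions as enabling support, as the serve command above suggests. It is not vLLM's actual validation code.

```python
from typing import Any, Dict, Optional


def check_dimensions(hf_config: Dict[str, Any], dimensions: Optional[int]) -> None:
    """Raise ValueError if the requested output dimension is not allowed."""
    if dimensions is None:
        return  # nothing requested, keep the model's native size
    allowed = hf_config.get("matryoshka_dimensions")
    supported = bool(hf_config.get("is_matryoshka")) or allowed is not None
    if not supported:
        raise ValueError("model does not support matryoshka representation")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    if allowed is not None and dimensions not in allowed:
        raise ValueError(f"dimensions must be one of {allowed}")


check_dimensions({"is_matryoshka": True}, 32)            # any size accepted
check_dimensions({"matryoshka_dimensions": [256]}, 256)  # listed size accepted
print("both requests accepted")
```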

Offline Inference

You can change the output dimension of embedding models that support Matryoshka Embeddings by using the dimensions parameter in PoolingParams.

from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)

A code example can be found here: examples/pooling/embed/embed_matryoshka_fy.py

Online Inference

Use the following command to start the vLLM server.

vllm serve jinaai/jina-embeddings-v3 --trust-remote-code

You can change the output dimension of embedding models that support Matryoshka Embeddings by using the dimensions parameter.

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'

Expected output

{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}

An OpenAI client example can be found here: examples/pooling/embed/openai_embedding_matryoshka_fy.py

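Deprecated Features

Encode Task

We have split the encode task into two more specific token-wise tasks: token_embed and token_classify.

token_embed is the same as embed, using normalization as the activation.
token_classify is the same as classify, using softmax as the activation by default.

Remove softmax from PoolingParams

We are going to remove softmax and activation from PoolingParams in v0.15. Use use_activation instead, since classify and token_classify are allowed to use any activation function.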
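
When updating call sites, the rename can be applied mechanically to keyword arguments before they reach PoolingParams. The sketch below is one possible way to do that: `migrate_pooling_kwargs` is a hypothetical helper that works on plain dictionaries, and it assumes the legacy flags carried the same boolean meaning as use_activation.

```python
from typing import Any, Dict


def migrate_pooling_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the legacy softmax/activation keys to use_activation.

    An explicitly provided use_activation always wins over legacy keys.
    """
    migrated = dict(kwargs)
    for legacy_key in ("softmax", "activation"):
        if legacy_key in migrated:
            value = migrated.pop(legacy_key)
            if value is not None and migrated.get("use_activation") is None:
                migrated["use_activation"] = value
    return migrated


print(migrate_pooling_kwargs({"softmax": False, "dimensions": 32}))
```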

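as_reward_model

Warning

We are going to remove --convert reward in v0.15; use --convert embed instead.

Pooling models now support all pooling tasks by default, so you can use them without any additional settings.

To extract hidden states, prefer the token_embed task.
For reward models, prefer the token_classify task.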
