OpenAI-Compatible Server¶
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This lets you serve a model and interact with it through an HTTP client.
In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)
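For example, a minimal launch command might look like the following sketch (the model name is illustrative and --api-key is optional):
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123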
To call the server, create a script in your favorite text editor that uses an HTTP client. Include any messages that you want to send to the model, then run the script. Below is an example script using the official OpenAI Python client.
Code
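A minimal sketch of such a script, assuming the server from the previous step is running on localhost:8000 with API key token-abc123:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)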
Tip
vLLM supports some parameters that the OpenAI API does not, such as top_k. You can pass these to vLLM using the extra_body parameter of the OpenAI client request, i.e. extra_body={"top_k": 50} for top_k.
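For instance, such a request might look like this sketch (the model name is illustrative):
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"top_k": 50},
)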
Important
By default, the server applies the generation_config.json file from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, pass --generation-config vllm when starting the server.
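For example (the model name is illustrative):
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --generation-config vllm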
Supported APIs¶
We currently support the following OpenAI APIs:
- Completions API (/v1/completions)
  - Only applicable to text generation models.
  - Note: the suffix parameter is not supported.
- Responses API (/v1/responses)
  - Only applicable to text generation models.
- Chat Completions API (/v1/chat/completions)
- Embeddings API (/v1/embeddings)
  - Only applicable to embedding models.
- Transcriptions API (/v1/audio/transcriptions)
  - Only applicable to automatic speech recognition (ASR) models.
- Translation API (/v1/audio/translations)
  - Only applicable to automatic speech recognition (ASR) models.
In addition, we have the following custom APIs:
- Tokenizer API (/tokenize, /detokenize)
  - Applicable to any model with a tokenizer.
- Pooling API (/pooling)
  - Applicable to all pooling models.
- Classification API (/classify)
  - Only applicable to classification models.
- Score API (/score)
  - Applicable to embedding models and cross-encoder models.
- Re-rank API (/rerank, /v1/rerank, /v2/rerank)
  - Implements Jina AI's v1 re-rank API
  - Also compatible with Cohere's v1 & v2 re-rank APIs
  - Jina and Cohere's APIs are very similar; Jina's API includes extra information in the rerank endpoint's response.
  - Only applicable to cross-encoder models.
Chat Template¶
In order for a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here.
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can manually specify their chat template with the --chat-template parameter, whose value is either the file path to the chat template or the template itself in string form. Without a chat template, the server cannot process chat, and all chat requests will error.
The vLLM community provides a set of chat templates for popular models. You can find them in the examples directory.
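A launch command that points the server at a template file might look like this sketch (the model name and template path are illustrative):
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --chat-template ./examples/template_chatml.jinja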
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts a new message format which specifies both a type and a text field. An example is provided below:
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)
Most chat templates for LLMs expect the content field to be a string, but some newer models like meta-llama/Llama-Guard-3-1B expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support to detect this automatically, which is logged as a string like "Detected the chat template content format to be...", and internally converts incoming requests to match the detected format, which can be one of:
- "string": A string.
  - Example: "Hello world"
- "openai": A list of dictionaries, similar to the OpenAI schema.
  - Example: [{"type": "text", "text": "Hello world!"}]
If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.
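For example, to force the OpenAI-style list-of-dicts format (the model name is illustrative):
vllm serve meta-llama/Llama-Guard-3-1B --chat-template-content-format openai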
Extra Parameters¶
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, you can pass them as extra parameters in the OpenAI client. Or, if you are calling the HTTP API directly, merge them directly into the JSON payload.
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
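The same request expressed as a raw HTTP call, where the extra parameter is merged straight into the JSON payload (host and port assume a local server):
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
        "structured_outputs": {"choice": ["positive", "negative"]}
    }'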
Extra HTTP Headers¶
Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.
Code
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
API Reference¶
Completions API¶
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_completion_client.py
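A minimal call sketch, assuming a locally served model (the model name is illustrative):
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    max_tokens=32,
)
print(completion.choices[0].text)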
Extra Parameters¶
The following sampling parameters are supported.
Code
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
allowed_token_ids: list[int] | None = None
prompt_logprobs: int | None = None
The following extra parameters are supported:
Code
prompt_embeds: bytes | list[bytes] | None = None
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
response_format: AnyResponseFormat | None = Field(
default=None,
description=(
"Similar to chat completion, this parameter specifies the format "
"of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float] | None = Field(
default=None,
description=(
"Additional request parameters with string or "
"numeric values, used by custom extensions."
),
)
Chat API¶
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: the image_url.detail parameter is not supported.
Code example: examples/online_serving/openai_chat_completion_client.py
Extra Parameters¶
The following sampling parameters are supported.
Code
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
prompt_logprobs: int | None = None
allowed_token_ids: list[int] | None = None
bad_words: list[str] = Field(default_factory=list)
The following extra parameters are supported:
Code
echo: bool = Field(
default=False,
description=(
"If true, the new message will be prepended with the last message "
"if they belong to the same role."
),
)
add_generation_prompt: bool = Field(
default=True,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
continue_final_message: bool = Field(
default=False,
description=(
"If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
'This allows you to "prefill" part of the model\'s response for it. '
"Cannot be used at the same time as `add_generation_prompt`."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
documents: list[dict[str, str]] | None = Field(
default=None,
description=(
"A list of dicts representing documents that will be accessible to "
"the model if it is performing RAG (retrieval-augmented generation)."
" If the template does not support RAG, this argument will have no "
"effect. We recommend that each document should be a dict containing "
'"title" and "text" keys.'
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
default=None,
description=(
"Additional request parameters with (list of) string or "
"numeric values, used by custom extensions."
),
)
Responses API¶
Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_responses_client_with_tools.py
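A minimal call sketch using the OpenAI client's responses interface (the model name is illustrative):
response = client.responses.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    input="Tell me a three-sentence bedtime story about a unicorn.",
)
print(response.output_text)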
Extra Parameters¶
The following extra parameters in the request object are supported:
Code
request_id: str = Field(
default_factory=lambda: f"resp_{random_uuid()}",
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
enable_response_messages: bool = Field(
default=False,
description=(
"Dictates whether or not to return messages as part of the "
"response object. Currently only supported for"
"non-background and gpt-oss only. "
),
)
# similar to input_messages / output_messages in ResponsesResponse
# we take in previous_input_messages (ie in harmony format)
# this cannot be used in conjunction with previous_response_id
# TODO: consider supporting non harmony messages as well
previous_input_messages: list[OpenAIHarmonyMessage | dict] | None = None
The following extra parameters in the response object are supported:
Code
# These are populated when enable_response_messages is set to True
# NOTE: custom serialization is needed
# see serialize_input_messages and serialize_output_messages
input_messages: ResponseInputOutputMessage | None = Field(
default=None,
description=(
"If enable_response_messages, we can show raw token input to model."
),
)
output_messages: ResponseInputOutputMessage | None = Field(
default=None,
description=(
"If enable_response_messages, we can show raw token output of model."
),
)
Embeddings API¶
Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.
Code example: examples/pooling/embed/openai_embedding_client.py
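A minimal call sketch with an embedding model (the model name is illustrative):
responses = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["Hello my name is", "The best thing about vLLM is that it supports many different models"],
)
for data in responses.data:
    print(data.embedding)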
If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model. Below is a convenience function for calling the API while retaining the OpenAI type annotations:
Code
from typing import Literal, Union

from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse


def create_chat_embeddings(
    client: OpenAI,
    *,
    messages: list[ChatCompletionMessageParam],
    model: str,
    encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
    return client.post(
        "/embeddings",
        cast_to=CreateEmbeddingResponse,
        body={"messages": messages, "model": model, "encoding_format": encoding_format},
    )
Multi-modal inputs¶
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.
To serve the model:
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec_phi3v.jinja
Important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --runner pooling to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec_phi3v.jinja
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library.
Code
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }
    ],
    encoding_format="float",
)

print("Image embedding output:", response.data[0].embedding)
To serve the model:
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
Important
Like with VLM2Vec, we also have to explicitly pass --runner pooling.
Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.
Full example: examples/pooling/embed/openai_chat_embedding_client_for_multimodal.py
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
dimensions: int | None = None
normalize: bool | None = None
The following extra parameters are supported by default:
Code
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
For chat-like input (i.e. if messages is passed), these extra parameters are supported instead:
Code
add_generation_prompt: bool = Field(
default=False,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
Transcriptions API¶
Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.
Note
To use the Transcriptions API, please install with extra audio dependencies using pip install vllm[audio].
Code example: examples/online_serving/openai_transcription_client.py
API Enforced Limits¶
The maximum audio file size (in MB) that vLLM will accept is set via the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.
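For example, to raise the limit to 50 MB when launching the server (the value and model name are illustrative):
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=50 vllm serve openai/whisper-large-v3-turbo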
Uploading Audio Files¶
The Transcriptions API supports uploading audio files in various formats, including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.
Using the OpenAI Python client:
Code
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Upload audio file from disk
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)
Using curl with a multipart/form-data upload:
Code
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
-H "Authorization: Bearer token-abc123" \
-F "[email protected]" \
-F "model=openai/whisper-large-v3-turbo" \
-F "language=en" \
-F "response_format=verbose_json"
Supported parameters:
- file: The audio file to transcribe (required)
- model: The model to use for transcription (required)
- language: Language code (e.g., "en", "zh") (optional)
- prompt: Optional text to guide the transcription style (optional)
- response_format: Response format ("json", "text") (optional)
- temperature: Sampling temperature, between 0 and 1 (optional)
For the full list of supported parameters, including sampling parameters and vLLM extensions, see the protocol definitions.
Response format:
For the verbose_json response format:
Code
{
"text": "Hello, this is a transcription of the audio file.",
"language": "en",
"duration": 5.42,
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 2.5,
"text": "Hello, this is a transcription",
"tokens": [50364, 938, 428, 307, 275, 28347],
"temperature": 0.0,
"avg_logprob": -0.245,
"compression_ratio": 1.235,
"no_speech_prob": 0.012
}
]
}
Currently, avg_logprob, compression_ratio, and no_speech_prob are not supported in the "verbose_json" response format.
Extra Parameters¶
The following sampling parameters are supported.
Code
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
Higher values like 0.8 will make the output more random, while lower values
like 0.2 will make it more focused / deterministic. If set to 0, the model
will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
to automatically increase the temperature until certain thresholds are hit.
"""
top_p: float | None = None
"""Enables nucleus (top-p) sampling, where tokens are selected from the
smallest possible set whose cumulative probability exceeds `p`.
"""
top_k: int | None = None
"""Limits sampling to the `k` most probable tokens at each step."""
min_p: float | None = None
"""Filters out tokens with a probability lower than `min_p`, ensuring a
minimum likelihood threshold during sampling.
"""
seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
"""The seed to use for sampling."""
frequency_penalty: float | None = 0.0
"""The frequency penalty to use for sampling."""
repetition_penalty: float | None = None
"""The repetition penalty to use for sampling."""
presence_penalty: float | None = 0.0
"""The presence penalty to use for sampling."""
max_completion_tokens: int | None = None
"""The maximum number of tokens to generate."""
The following extra parameters are supported:
Code
# Flattened stream option to simplify form data.
stream_include_usage: bool | None = False
stream_continuous_usage_stats: bool | None = False
vllm_xargs: dict[str, str | int | float] | None = Field(
default=None,
description=(
"Additional request parameters with string or "
"numeric values, used by custom extensions."
),
)
Translations API¶
Our Translation API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from any of the 55 supported non-English languages into English. Please note that the popular openai/whisper-large-v3-turbo model does not support translating.
Note
To use the Translation API, please install with extra audio dependencies using pip install vllm[audio].
Code example: examples/online_serving/openai_translation_client.py
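A minimal call sketch, assuming a Whisper model that supports translation is being served (the model and file names are illustrative):
with open("speech_in_german.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )
print(translation.text)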
Extra Parameters¶
The following sampling parameters are supported.
seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
"""The seed to use for sampling."""
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
Higher values like 0.8 will make the output more random, while lower values
like 0.2 will make it more focused / deterministic. If set to 0, the model
will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
to automatically increase the temperature until certain thresholds are hit.
"""
The following extra parameters are supported:
language: str | None = None
"""The language of the input audio we translate from.
Supplying the input language in
[ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
will improve accuracy.
"""
to_language: str | None = None
"""The language of the input audio we translate to.
Please note that this is not supported by all models, refer to the specific
model documentation for more details.
For instance, Whisper only supports `to_language=en`.
"""
stream: bool | None = False
"""Custom field not present in the original OpenAI definition. When set,
it will enable output to be streamed in a similar fashion as the Chat
Completion endpoint.
"""
# Flattened stream option to simplify form data.
stream_include_usage: bool | None = False
stream_continuous_usage_stats: bool | None = False
max_completion_tokens: int | None = None
"""The maximum number of tokens to generate."""
Tokenizer API¶
Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:
- /tokenize corresponds to calling tokenizer.encode().
- /detokenize corresponds to calling tokenizer.decode().
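A sketch of calling these endpoints with curl (the model name, port, and token IDs are illustrative):
curl -X POST "http://localhost:8000/tokenize" \
    -H "Content-Type: application/json" \
    -d '{"model": "NousResearch/Meta-Llama-3-8B-Instruct", "prompt": "Hello, world!"}'

curl -X POST "http://localhost:8000/detokenize" \
    -H "Content-Type: application/json" \
    -d '{"model": "NousResearch/Meta-Llama-3-8B-Instruct", "tokens": [128000, 9906, 11, 1917, 0]}'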
Pooling API¶
Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.
The input format is the same as the Embeddings API, but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: examples/pooling/pooling/openai_pooling_client.py
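A sketch of calling the Pooling API with the requests library, assuming a locally served pooling model (the model name and response field access are illustrative):
import requests

response = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "vLLM is great!",
    },
)
response.raise_for_status()
print("Pooling output:", response.json()["data"][0]["data"])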
Classification API¶
Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.
We automatically wrap any other transformer model via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.
Code example: examples/pooling/classify/openai_classification_client.py
Example Requests¶
You can classify multiple texts by passing an array of strings:
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
Response:
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
You can also pass a string directly to the input field:
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
Response:
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
Score API¶
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
Code example: examples/pooling/score/openai_cross_encoder_score.py
Single inference¶
You can pass a string to both text_1 and text_2, forming a single sentence pair.
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": "What is the capital of France?",
"text_2": "The capital of France is Paris."
}'
Response:
Batch inference¶
You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs where each pair is built from text_1 and a string in text_2. The total number of pairs is len(text_2).
Request:
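A sketch of such a request (the scores in the response will depend on the model):
curl -X 'POST' \
    'http://127.0.0.1:8000/score' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "encoding_format": "float",
    "text_1": "What is the capital of France?",
    "text_2": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris."
    ]
}'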
Response:
You can pass a list to both text_1 and text_2, forming multiple sentence pairs where each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).
Request:
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
Response:
Multi-modal inputs¶
You can pass multi-modal inputs to scoring models by passing content that includes a list of multi-modal inputs (image, etc.) in the request. Refer to the example below for illustration.
To serve the model:
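A plausible launch command, assuming default settings for this model (additional flags may be needed depending on your setup):
vllm serve jinaai/jina-reranker-m0 --trust-remote-code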
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library.
Code
import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                    },
                },
            ],
        },
    },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
Full example: examples/pooling/score/openai_cross_encoder_score_for_multimodal.py
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
Re-rank API¶
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (image, etc.), on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
The rerank endpoint supports popular re-rank models such as BAAI/bge-reranker-base, as well as other models that support the score task. In addition, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface to ensure compatibility with popular open-source tools.
Code example: examples/pooling/score/openai_reranker.py
Example Requests¶
Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.
Request:
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
Response:
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
Extra Parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
Ray Serve LLM¶
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and adds capabilities such as auto-scaling, load balancing, and back-pressure.
Key capabilities:
- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through the Ray Dashboard and metrics.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.
Learn more about Ray Serve LLM with the official Ray Serve LLM documentation.
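A minimal deployment sketch using Ray Serve LLM's Python API (assumes a recent Ray release with the serve and llm extras installed; the model ID, autoscaling values, and engine kwargs are illustrative):
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Describe the model to serve and how it should scale.
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3-8b-instruct",
        "model_source": "NousResearch/Meta-Llama-3-8B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    engine_kwargs={"max_model_len": 8192},
)

# Build an OpenAI-compatible app and run it on the Ray cluster.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)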