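OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.

In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)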

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

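To call the server, create a script in your preferred text editor that uses an HTTP client. Include any messages you want to send to the model, then run the script. Below is an example script that uses the official OpenAI Python client.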

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

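Tip

vLLM supports some parameters that OpenAI does not, such as top_k. You can pass these to vLLM through the OpenAI client's extra_body request parameter, for example extra_body={"top_k": 50} for top_k.

Important

By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, pass --generation-config vllm when launching the server.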

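Supported APIs

We currently support the following OpenAI APIs:

Completions API (/v1/completions)
- Only applicable to text generation models (--task generate).
- Note: the suffix parameter is not supported.
Chat Completions API (/v1/chat/completions)
- Only applicable to text generation models (--task generate) with a chat template.
- Note: the parallel_tool_calls and user parameters are ignored.
Embeddings API (/v1/embeddings)
- Only applicable to embedding models (--task embed).
Transcriptions API (/v1/audio/transcriptions)
- Only applicable to automatic speech recognition (ASR) models (OpenAI Whisper) (--task generate).
Translations API (/v1/audio/translations)
- Only applicable to automatic speech recognition (ASR) models (OpenAI Whisper) (--task generate).

In addition, we have the following custom APIs:

Tokenizer API (/tokenize, /detokenize)
- Applicable to any model with a tokenizer.
Pooling API (/pooling)
- Applicable to all pooling models.
Classification API (/classify)
- Only applicable to classification models (--task classify).
Score API (/score)
- Applicable to embedding models and cross-encoder models (--task score).
Re-rank API (/rerank, /v1/rerank, /v2/rerank)
- Implements Jina AI's v1 re-rank API.
- Also compatible with Cohere's v1 and v2 re-rank APIs.
- Jina's and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to cross-encoder models (--task score).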
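Chat Template

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

One example is the chat template that ships with NousResearch/Meta-Llama-3-8B-Instruct.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, either as a file path or as a string. Without a chat template, the server cannot process chat, and all chat requests will return an error.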

vllm serve <model> --chat-template ./path-to-chat-template.jinja

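The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.

With the introduction of the multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is provided below: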

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)

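Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM makes a best-effort attempt to detect this automatically, logging a message such as "Detected the chat template content format to be...", and internally converts incoming requests to match the detected format, which is one of:

"string": A string.
- Example: "Hello world"
"openai": A list of dictionaries, similar to the OpenAI schema.
- Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, set the --chat-template-content-format CLI argument to override which format to use.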
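To make the two content formats concrete, here is a minimal sketch of converting between them. The helper names to_openai_format and to_string_format are hypothetical, and joining text parts with a newline is an assumption; this illustrates the idea and is not vLLM's internal conversion code.

```python
from typing import Union

# A list of OpenAI-style content parts, e.g. [{"type": "text", "text": "..."}].
ContentParts = list[dict[str, str]]


def to_openai_format(content: Union[str, ContentParts]) -> ContentParts:
    """Wrap a plain string as a single text part; pass lists through unchanged."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content: Union[str, ContentParts]) -> str:
    """Join the text parts of an OpenAI-style list into one string.

    Assumption: parts are joined with a newline; non-text parts are skipped.
    """
    if isinstance(content, str):
        return content
    return "\n".join(p["text"] for p in content if p.get("type") == "text")


as_parts = to_openai_format("Hello world!")
as_text = to_string_format([{"type": "text", "text": "Hello world"}])
```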

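Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client. Or, if you call the HTTP API directly, merge them directly into the JSON payload.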

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
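For direct HTTP calls, the same request might be assembled as in the sketch below. The helper build_chat_payload is hypothetical; the point is only that the extra parameter sits at the top level of the JSON body rather than under an extra_body key. Nothing is sent over the network here.

```python
import json


def build_chat_payload(model: str, messages: list[dict], **extra_params) -> dict:
    """Build a chat request body with vLLM-specific parameters merged in.

    Extra parameters become top-level keys of the JSON payload.
    """
    payload = {"model": model, "messages": messages}
    payload.update(extra_params)
    return payload


payload = build_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)

# Serialized body, ready to be sent with the HTTP client of your choice.
body = json.dumps(payload)
```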

Extra HTTP Headers

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)
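Without the OpenAI client, the header can be set on a plain HTTP request. The sketch below uses only the standard library and builds the request without sending it; build_request is a hypothetical helper, and the localhost URL and API key are placeholders matching the earlier examples.

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, request_id: str,
                  payload: dict) -> urllib.request.Request:
    """Prepare a POST to the chat completions endpoint with X-Request-Id set."""
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "X-Request-Id": request_id,
        },
        method="POST",
    )


req = build_request(
    "http://localhost:8000/v1",  # placeholder server address
    "token-abc123",
    "sentiment-classification-00001",
    {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```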

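API Reference

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/online_serving/openai_completion_client.py

Extra parameters

The following sampling parameters are supported.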

    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    allowed_token_ids: Optional[list[int]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported:

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    response_format: Optional[AnyResponseFormat] = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description="If specified, the output will follow the JSON schema.",
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be one of "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))

    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))

    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )
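The cache_salt description above asks for a random, unpredictable salt of roughly 256 bits. One way to produce such a value is sketched below with Python's secrets module; make_cache_salt is a hypothetical helper, and the use of the URL-safe base64 alphabet is an assumption, since the description only says "base64-encoded".

```python
import secrets


def make_cache_salt(num_bytes: int = 32) -> str:
    """Return a random URL-safe base64 string built from num_bytes random bytes.

    32 bytes is 256 bits, which encodes to 43 characters without padding.
    """
    return secrets.token_urlsafe(num_bytes)


salt = make_cache_salt()

# Passed alongside the other vLLM-specific parameters, e.g. via extra_body.
extra_body = {"cache_salt": salt}
```

Keep the salt secret and reuse the same value for requests that should share a prefix cache.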

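Chat API

Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both vision- and audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: the image_url.detail parameter is not supported.

Code example: examples/online_serving/openai_chat_completion_client.py

Extra parameters

The following sampling parameters are supported.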

    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None
    allowed_token_ids: Optional[list[int]] = None
    bad_words: list[str] = Field(default_factory=list)

The following extra parameters are supported:

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )
    continue_final_message: bool = Field(
        default=False,
        description=
        ("If this is set, the chat will be formatted so that the final "
         "message in the chat is open-ended, without any EOS tokens. The "
         "model will continue this message rather than starting a new one. "
         "This allows you to \"prefill\" part of the model's response for it. "
         "Cannot be used at the same time as `add_generation_prompt`."),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    documents: Optional[list[dict[str, str]]] = Field(
        default=None,
        description=
        ("A list of dicts representing documents that will be accessible to "
         "the model if it is performing RAG (retrieval-augmented generation)."
         " If the template does not support RAG, this argument will have no "
         "effect. We recommend that each document should be a dict containing "
         "\"title\" and \"text\" keys."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description=("If specified, the output will follow the JSON schema."),
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    structural_tag: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the structural tag schema."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be either "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))
    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))
    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))
    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

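Embeddings API

Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.

If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model.

Code example: examples/online_serving/openai_embedding_client.py

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. The following examples cover two models, VLM2Vec and DSE-Qwen2-MRL.

VLM2Vec

Serve the model: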

vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/template_vlm2vec.jinja

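Important

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --task embed to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec.jinja

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: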

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])

DSE-Qwen2-MRL

Serve the model:

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/template_dse_qwen2_vl.jinja

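Important

Like with VLM2Vec, we have to explicitly pass --task embed.

Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja

Important

MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.

Full example: examples/online_serving/openai_chat_embedding_client_for_multimodal.py

Extra parameters

The following pooling parameters are supported.

The following extra parameters are supported by default: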

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

For chat-like input (i.e. if messages is passed), these extra parameters are supported:

    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

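Transcriptions API

Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.

Note

To use the Transcriptions API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_transcription_client.py

API Enforced Limits

Set the maximum audio file size (in MB) that vLLM will accept via the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.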
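A client could run a rough pre-check before uploading audio, as in the sketch below. The helpers max_audio_bytes and fits_limit are hypothetical, and two things are assumptions: that 1 MB is counted as 1024 * 1024 bytes, and that the client can see the same environment value as the server. The server's own check remains authoritative.

```python
import os
from typing import Mapping

ENV_VAR = "VLLM_MAX_AUDIO_CLIP_FILESIZE_MB"
DEFAULT_LIMIT_MB = 25  # default stated in the documentation


def max_audio_bytes(env: Mapping[str, str] = os.environ) -> int:
    """Return the configured size limit in bytes (assuming 1 MB = 1024 * 1024 bytes)."""
    limit_mb = int(env.get(ENV_VAR, DEFAULT_LIMIT_MB))
    return limit_mb * 1024 * 1024


def fits_limit(num_bytes: int, env: Mapping[str, str] = os.environ) -> bool:
    """Check whether an audio payload of num_bytes is within the limit."""
    return num_bytes <= max_audio_bytes(env)
```

For a file on disk, os.path.getsize(path) gives the byte count to pass to fits_limit.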

Extra parameters

The following sampling parameters are supported.

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: Optional[float] = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: Optional[int] = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: Optional[float] = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: Optional[float] = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: Optional[float] = None
    """The repetition penalty to use for sampling."""

    presence_penalty: Optional[float] = 0.0
    """The presence penalty to use for sampling."""

The following extra parameters are supported:

    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

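Translations API

Our Translations API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from one of the 55 supported non-English languages into English. Note that the popular openai/whisper-large-v3-turbo model does not support translation.

Note

To use the Translations API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_translation_client.py

Extra parameters

The following sampling parameters are supported.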

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

The following extra parameters are supported:

    language: Optional[str] = None
    """The language of the input audio we translate from.

    Supplying the input language in
    [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
    will improve accuracy.
    """

    stream: Optional[bool] = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

Tokenizer API

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

/tokenize corresponds to calling tokenizer.encode().
/detokenize corresponds to calling tokenizer.decode().
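The sketch below shows what request bodies for the two endpoints might look like. The field names prompt and tokens are assumptions not stated on this page, and the helper names are hypothetical; check the server's request schema before relying on them. Nothing is sent over the network.

```python
import json


def tokenize_body(model: str, prompt: str) -> str:
    """JSON body for POST /tokenize (field names are assumptions)."""
    return json.dumps({"model": model, "prompt": prompt})


def detokenize_body(model: str, tokens: list[int]) -> str:
    """JSON body for POST /detokenize (field names are assumptions)."""
    return json.dumps({"model": model, "tokens": tokens})


MODEL = "NousResearch/Meta-Llama-3-8B-Instruct"
encode_request = tokenize_body(MODEL, "Hello world")
decode_request = detokenize_body(MODEL, [1, 2, 3])  # placeholder token IDs
```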

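Pooling API

Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.

The input format is the same as for the Embeddings API, but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.

Code example: examples/online_serving/openai_pooling_client.py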

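Classification API

Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.

We automatically wrap any other transformer via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.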
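The final softmax step can be illustrated in plain Python. This is a sketch of the math only, not vLLM's implementation; the logits are made-up values, and picking the label by highest probability is an assumption about how a label is chosen.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float], labels: list[str]) -> tuple[str, list[float]]:
    """Return the most probable label together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical logits for a two-class model.
label, probs = predict([0.0, 1.0], ["Default", "Spoiled"])
```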

Code example: examples/online_serving/openai_classification_client.py

Example Requests

You can classify multiple texts by passing an array of strings:

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'

Response

{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

You can also pass a string directly to the input field:

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'

Response

{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

Extra parameters

The following pooling parameters are supported.

The following extra parameters are supported:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

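Score API

Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.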
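For reference, cosine similarity between two embedding vectors can be computed as in this minimal sketch. It shows the formula only; the function name and the toy vectors are illustrative, and the server computes the score for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```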

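You can find the documentation for cross-encoder models at sbert.net.

Code example: examples/online_serving/openai_cross_encoder_score.py

Single inference

You can pass a string to both text_1 and text_2, forming a single sentence pair.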

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

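Batch inference

You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs where each pair is built from text_1 and a string in text_2. The total number of pairs is len(text_2).

Request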

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

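You can pass a list to both text_1 and text_2, forming multiple sentence pairs where each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).

Request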

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
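The pairing rules for single and batch inference can be summarized in a short sketch. make_pairs is a hypothetical helper that mirrors the three documented cases; it is not the server's code, and rejecting the undocumented list-plus-string case is a choice made here.

```python
from typing import Union

TextInput = Union[str, list[str]]


def make_pairs(text_1: TextInput, text_2: TextInput) -> list[tuple[str, str]]:
    """Build the sentence pairs implied by text_1 and text_2."""
    if isinstance(text_1, str) and isinstance(text_2, str):
        return [(text_1, text_2)]  # single pair
    if isinstance(text_1, str):
        return [(text_1, t) for t in text_2]  # one query against many texts
    if isinstance(text_2, str):
        raise ValueError("a list text_1 with a string text_2 is not covered here")
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 must have the same length")
    return list(zip(text_1, text_2))  # element-wise pairing


one_to_many = make_pairs("q", ["a", "b"])
many_to_many = make_pairs(["q1", "q2"], ["a", "b"])
```

In both batch cases the number of pairs equals len(text_2).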

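You can pass multi-modal inputs to scoring models by passing content that includes a list of multi-modal inputs (images, etc.) in the request. Refer to the example below for illustration.

JinaVL-Reranker

Serve the model:

vllm serve jinaai/jina-reranker-m0

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: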

import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
          "content": [
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                      },
                  },
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                      },
                  },
              ]
          }
        },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])

Full example: examples/online_serving/openai_cross_encoder_score_for_multimodal.py

Extra parameters

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

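Re-rank API

Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (images, etc.), on a scale of 0 to 1.

You can find the documentation for cross-encoder models at sbert.net.

The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base and other models that support the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface to ensure compatibility with popular open-source tools.

Code example: examples/online_serving/jinaai_rerank_client.py

Example Request

Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.

Request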

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'

Response

{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
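The ordering and top_n behavior described above can be sketched as follows. rank_documents is a hypothetical helper and the scores are made-up values; it only mimics the shape of the results list and is not the server's implementation.

```python
from typing import Optional


def rank_documents(documents: list[str], scores: list[float],
                   top_n: Optional[int] = None) -> list[dict]:
    """Sort documents by descending score, keeping each original index."""
    if top_n is None:
        top_n = len(documents)  # default: return every document
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [
        {
            "index": i,  # position in the original documents list
            "document": {"text": documents[i]},
            "relevance_score": scores[i],
        }
        for i in order[:top_n]
    ]


docs = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals",
]
hypothetical_scores = [0.0006, 0.9985, 0.0001]
results = rank_documents(docs, hypothetical_scores, top_n=2)
```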

Extra parameters

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

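Ray Serve LLM

Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.

Key capabilities:

- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.

The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.

Learn more about Ray Serve LLM in the official Ray Serve LLM documentation.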
