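Tool Calling

vLLM currently supports named function calling, as well as the auto, required (as of vllm>=0.8.3), and none options in the tool_choice field of the chat completion API.

Quickstart

Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3_json tool calling chat template from the vLLM examples directory.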

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template examples/tool_chat_template_llama3.1_json.jinja

Next, make a request that triggers the model to use the available tools.

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."
tool_functions = {"get_weather": get_weather}

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location", "unit"],
            },
        },
    },
]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")

Example output

Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit...

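This example demonstrates:

Setting up the server with tool calling enabled
Defining an actual function to handle tool calls
Making a request with tool_choice="auto"
Handling the structured response and executing the corresponding function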

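You can also specify a particular function using named function calling by setting tool_choice={"type": "function", "function": {"name": "get_weather"}}. Note that this uses the structured outputs backend, so the first request that uses it can incur a delay of several seconds (or more) while the FSM is compiled; the compiled FSM is then cached for subsequent requests.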

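Remember that it is the caller's responsibility to:

Define appropriate tools in the request
Include relevant context in the chat messages
Handle the tool calls in your application logic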
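Handling the tool calls usually ends with sending each result back to the model. The sketch below assumes the OpenAI chat message format used in the quickstart: the assistant turn that requested the tools is echoed back, followed by one tool-role message per call, linked by tool_call_id. The helper build_followup_messages and the dict-shaped tool calls are hypothetical; with the openai client you would first convert message.tool_calls to dicts (for example with model_dump()).

```python
import json
from typing import Any, Callable


def build_followup_messages(
    messages: list[dict[str, Any]],
    tool_calls: list[dict[str, Any]],
    tool_functions: dict[str, Callable[..., Any]],
) -> list[dict[str, Any]]:
    """Return a new message list with the assistant tool calls and their results appended.

    Each tool call is assumed to look like
    {"id": ..., "type": "function", "function": {"name": ..., "arguments": "<json string>"}}.
    """
    followup = list(messages)  # copy so the caller's history is not mutated
    # Echo the assistant turn that requested the tools.
    followup.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
    for call in tool_calls:
        fn = call["function"]
        # Arguments arrive as a JSON string; decode them before calling the local function.
        result = tool_functions[fn["name"]](**json.loads(fn["arguments"]))
        # One "tool" message per call, linked back to the request by tool_call_id.
        followup.append({"role": "tool", "tool_call_id": call["id"], "content": str(result)})
    return followup


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


history = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather",
                 "arguments": '{"location": "San Francisco, CA", "unit": "fahrenheit"}'},
}]
new_history = build_followup_messages(history, calls, {"get_weather": get_weather})
print(new_history[-1]["content"])
```

The returned list can then be passed as messages to the next client.chat.completions.create call so that the model can answer using the tool output.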

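For more advanced usage, including parallel tool calls and model-specific parsers, see the sections below.

Named Function Calling

vLLM supports named function calling in the chat completion API by default. This should work with most structured outputs backends supported by vLLM. You are guaranteed a validly-parsable function call, but not a high-quality one.

vLLM uses structured outputs to ensure that the response matches the tool parameter object defined by the JSON schema in the tools parameter. For best results, we recommend specifying the expected output format/schema in the prompt, so that the generation the model intends aligns with the schema that the structured outputs backend forces it to produce.

To use a named function, define the function in the tools parameter of the chat completion request, and specify the name of one of those tools in the tool_choice parameter of the same request.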
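As a minimal sketch, the request for a named function call can be assembled as below. The helper named_tool_choice is hypothetical; it only checks that the requested tool is also defined in tools and returns the tool_choice value used for get_weather in the quickstart. The request is built but not sent, so no server is needed.

```python
from typing import Any


def named_tool_choice(tools: list[dict[str, Any]], name: str) -> dict[str, Any]:
    """Build the tool_choice value that forces a call to the tool called `name`."""
    available = {t["function"]["name"] for t in tools if t.get("type") == "function"}
    # The named tool must also be defined in the tools parameter of the same request.
    if name not in available:
        raise ValueError(f"tool {name!r} is not defined in tools: {sorted(available)}")
    return {"type": "function", "function": {"name": name}}


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

request_kwargs = {
    "messages": [{"role": "user", "content": "What's the weather like in San Francisco?"}],
    "tools": tools,
    "tool_choice": named_tool_choice(tools, "get_weather"),
}
# With a running server: client.chat.completions.create(model=..., **request_kwargs)
print(request_kwargs["tool_choice"])
```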

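Required Function Calling

vLLM supports the tool_choice='required' option in the chat completion API. Like named function calling, it uses structured outputs, so it is enabled by default and works with any supported model. However, support for alternative decoding backends is on the roadmap for the V1 engine.

When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the tool list specified in the tools parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the tools parameter.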
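Because tool_choice='required' can yield more than one call, client code should iterate over all of them rather than reading only tool_calls[0] as the quickstart does. The sketch below uses SimpleNamespace stand-ins shaped like response.choices[0].message.tool_calls so it runs without a server; run_all_tool_calls is a hypothetical helper.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def run_all_tool_calls(tool_calls, tool_functions: dict[str, Callable[..., Any]]) -> list[Any]:
    """Execute every tool call in order; with tool_choice='required' at least one is expected."""
    if not tool_calls:
        raise RuntimeError("expected at least one tool call with tool_choice='required'")
    results = []
    for call in tool_calls:
        fn = call.function  # attribute access, as on the openai client's objects
        results.append(tool_functions[fn.name](**json.loads(fn.arguments)))
    return results


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


# Stand-ins shaped like response.choices[0].message.tool_calls for two parallel calls.
fake_calls = [
    SimpleNamespace(function=SimpleNamespace(
        name="get_weather", arguments='{"location": "San Francisco, CA", "unit": "celsius"}')),
    SimpleNamespace(function=SimpleNamespace(
        name="get_weather", arguments='{"location": "Seattle, WA", "unit": "celsius"}')),
]
for result in run_all_tool_calls(fake_calls, {"get_weather": get_weather}):
    print(result)
```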

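None Function Calling

vLLM supports the tool_choice='none' option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.

Note

When tools are specified in the request, vLLM includes the tool definitions in the prompt by default, regardless of the tool_choice setting. To exclude the tool definitions when tool_choice='none', use the --exclude-tools-when-tool-choice-none option.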

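Automatic Function Calling

To enable this feature, set the following flags:

--enable-auto-tool-choice -- **mandatory** for auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it deems appropriate.
--tool-call-parser -- selects the tool parser to use (listed below). Additional tool parsers will continue to be added in the future. You can also register your own tool parser with --tool-parser-plugin.
--tool-parser-plugin -- **optional** tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser names can then be specified in --tool-call-parser.
--chat-template -- **optional** for auto tool choice. The path to a chat template that handles tool-role messages and assistant-role messages containing previously generated tool calls. Hermes, Mistral, and Llama models have tool-compatible chat templates in their tokenizer_config.json files, but you can specify a custom template. If your model has a tool-use-specific chat template configured in its tokenizer_config.json, this argument can be set to tool_use; in that case, the template is used as specified by transformers. More information is available in the HuggingFace chat template documentation, and an example can be found in a model's tokenizer_config.json.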

If your favorite tool-calling model is not supported, please feel free to contribute a parser and a tool-use chat template!

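Hermes Models (`hermes`)

All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.

NousResearch/Hermes-2-Pro-*
NousResearch/Hermes-2-Theta-*
NousResearch/Hermes-3-*

Note that the Hermes 2 **Theta** models are known to have degraded tool call quality and capabilities due to the merge step in their creation.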

Flags: --tool-call-parser hermes
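To make the Hermes output format concrete, the sketch below parses non-streaming output in which each call is a JSON object with name and arguments keys wrapped in <tool_call>...</tool_call> tags. That layout is an assumption about Hermes-style output made for illustration; this is not vLLM's Hermes2ProToolParser, it does not handle streaming, and malformed JSON raises an error.

```python
import json
import re

# Assumed layout: each call is one JSON object inside <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split model output into (plain content, list of {"name", "arguments"} dicts)."""
    # Each captured group is one JSON object; json.loads raises on malformed JSON.
    calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(text)]
    # Whatever remains outside the tags is ordinary assistant content.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


output = (
    "Let me check both cities.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris", "unit": "celsius"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Rome", "unit": "celsius"}}\n</tool_call>'
)
content, calls = parse_hermes_tool_calls(output)
print(content)
print([call["arguments"]["location"] for call in calls])
```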

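Mistral Models (`mistral`)

Supported models:

mistralai/Mistral-7B-Instruct-v0.3 (confirmed)
Additional Mistral function-calling models are compatible as well.

Known issues:

Mistral 7B struggles to generate parallel tool calls correctly.
**Transformers tokenization backend only**: Mistral's tokenizer_config.json chat template requires tool call IDs that are exactly 9 digits, which is much shorter than what vLLM generates. Since an exception is thrown when this condition is not met, the following additional chat templates are provided:
- examples/tool_chat_template_mistral.jinja - the "official" Mistral chat template, adjusted so that it works with vLLM's tool call IDs (provided that tool_call_id fields are truncated to the last 9 digits).
- examples/tool_chat_template_mistral_parallel.jinja - a "better" version that adds a tool-use system prompt when tools are provided, which gives higher reliability when working with parallel tool calls.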
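The ID rule mentioned above can be illustrated with a small sketch that keeps only the last 9 characters of every tool call ID in a conversation. truncate_tool_call_ids is a hypothetical helper, the sample IDs are made up, and the sketch does not check that the remaining characters are digits.

```python
def truncate_tool_call_ids(messages: list[dict], length: int = 9) -> list[dict]:
    """Return copies of `messages` with every tool call ID cut to its last `length` characters."""
    truncated = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's history is left untouched
        if msg.get("role") == "tool" and "tool_call_id" in msg:
            msg["tool_call_id"] = msg["tool_call_id"][-length:]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            # Rebuild each call dict with a shortened "id", keeping the other fields.
            msg["tool_calls"] = [{**call, "id": call["id"][-length:]} for call in msg["tool_calls"]]
        truncated.append(msg)
    return truncated


history = [
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_123456789012", "type": "function",
         "function": {"name": "get_weather", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_123456789012", "content": "sunny"},
]
short = truncate_tool_call_ids(history)
print(short[0]["tool_calls"][0]["id"], short[1]["tool_call_id"])
```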

Recommended flags:

To use the official Mistral AI format:

--tool-call-parser mistral

To use the Transformers format when available:

--tokenizer_mode hf --config_format hf --load_format hf --tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja

Note

Models officially released by Mistral AI come in two possible formats:

The official format, used by default with the auto or mistral arguments:

--tokenizer_mode mistral --config_format mistral --load_format mistral

This format uses mistral-common, Mistral AI's tokenizer backend.

The Transformers format, used with the hf arguments when available:

--tokenizer_mode hf --config_format hf --load_format hf --chat-template examples/tool_chat_template_mistral_parallel.jinja

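Llama Models (`llama3_json`)

Supported models:

All Llama 3.1, 3.2, and 4 models should be supported.

meta-llama/Llama-3.1-*
meta-llama/Llama-3.2-*
meta-llama/Llama-4-*

The supported tool calling is JSON-based. For the pythonic tool calling introduced with the Llama-3.2 models, see the pythonic tool parser below. For Llama 4 models, the llama4_pythonic tool parser is recommended.

Other tool calling formats, such as the built-in python tool calling or custom tool calling, are not supported.

Known issues:

Parallel tool calls are not supported for Llama 3, but they are supported for Llama 4 models.
The model may generate arguments in an incorrect format, such as an array serialized as a string instead of an actual array.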
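For the second known issue, one possible client-side workaround is to decode string values whose schema type is "array" before calling your function. This is a hypothetical helper, not something vLLM does for you; it only looks at top-level properties and leaves values that are not valid JSON unchanged.

```python
import json
from typing import Any


def coerce_stringified_arrays(arguments: dict[str, Any], parameters: dict[str, Any]) -> dict[str, Any]:
    """Decode string values whose schema type is "array" back into Python lists."""
    fixed = dict(arguments)
    properties = parameters.get("properties", {})
    for key, value in arguments.items():
        if properties.get(key, {}).get("type") != "array" or not isinstance(value, str):
            continue
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            continue  # not JSON at all: leave it for the caller to reject or repair
        if isinstance(decoded, list):
            fixed[key] = decoded
    return fixed


parameters = {
    "type": "object",
    "properties": {
        "cities": {"type": "array", "items": {"type": "string"}},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["cities", "unit"],
}
# Arguments as a model might emit them: the array arrives as a JSON-encoded string.
raw_arguments = json.loads('{"cities": "[\\"Paris\\", \\"Rome\\"]", "unit": "celsius"}')
print(coerce_stringified_arrays(raw_arguments, parameters))
```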

vLLM provides two JSON-based chat templates for Llama 3.1 and 3.2:

examples/tool_chat_template_llama3.1_json.jinja - the "official" chat template for the Llama 3.1 models, adjusted to work better with vLLM.
examples/tool_chat_template_llama3.2_json.jinja - extends the Llama 3.1 chat template with support for images.

Recommended flags: --tool-call-parser llama3_json --chat-template {see_above}

vLLM also provides pythonic and JSON-based chat templates for Llama 4, but pythonic tool calling is recommended:

examples/tool_chat_template_llama4_pythonic.jinja - based on the official chat template for the Llama 4 models.

For Llama 4 models, use --tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja.

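IBM Granite

Supported models:

ibm-granite/granite-4.0-h-small and other Granite 4.0 models

Recommended flags: --tool-call-parser hermes

ibm-granite/granite-3.0-8b-instruct

Recommended flags: --tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja

examples/tool_chat_template_granite.jinja: a modified version of the original chat template on Hugging Face. Parallel function calls are supported.

ibm-granite/granite-3.1-8b-instruct

Recommended flags: --tool-call-parser granite

The chat template from Hugging Face can be used directly. Parallel function calls are supported.

ibm-granite/granite-20b-functioncalling

Recommended flags: --tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja

examples/tool_chat_template_granite_20b_fc.jinja: a modified version of the original chat template on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as the "Response Generation" mode from the paper. Parallel function calls are supported.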

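InternLM Models (`internlm`)

Supported models:

internlm/internlm2_5-7b-chat (confirmed)
Additional internlm2.5 function-calling models are compatible as well.

Known issues:

Although this implementation also supports InternLM2, the tool call results are not stable when tested with the internlm/internlm2-chat-7b model.

Recommended flags: --tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja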

Jamba Models (`jamba`)

AI21's Jamba-1.5 models are supported.

ai21labs/AI21-Jamba-1.5-Mini
ai21labs/AI21-Jamba-1.5-Large

Flags: --tool-call-parser jamba

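xLAM Models (`xlam`)

The xLAM tool parser is designed to support models that generate tool calls in various JSON formats. It detects function calls in several different output styles:

Direct JSON arrays: output strings that are JSON arrays, starting with [ and ending with ].
Thinking tags: <think>...</think> tags containing JSON arrays.
Code blocks: JSON inside code blocks (```json ... ```).
Tool call tags: [TOOL_CALLS] or <tool_call>...</tool_call> tags.

Parallel function calls are supported, and the parser effectively separates text content from tool calls.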
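A simplified sketch of how one function can cover several of these styles: scan the text for the first position where a JSON array of objects with a name key can be decoded, which covers bare arrays as well as arrays wrapped in think tags, code blocks, or tool call tags. The function find_tool_calls and the sample strings are assumptions for illustration; unlike the xLAM parser described above, it does not separate the surrounding text content and does not handle streaming.

```python
import json


def find_tool_calls(output: str) -> list[dict]:
    """Return the first JSON array of {"name": ...} objects found anywhere in `output`."""
    decoder = json.JSONDecoder()
    start = output.find("[")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(output, start)
        except json.JSONDecodeError:
            value = None  # e.g. the "[" of "[TOOL_CALLS]" does not start valid JSON
        if isinstance(value, list) and value and all(
            isinstance(item, dict) and "name" in item for item in value
        ):
            return value
        start = output.find("[", start + 1)  # try the next candidate bracket
    return []


samples = [
    '[{"name": "get_weather", "arguments": {"location": "Paris"}}]',
    '<think>[{"name": "get_weather", "arguments": {"location": "Paris"}}]</think>',
    '[TOOL_CALLS][{"name": "get_weather", "arguments": {"location": "Paris"}}]',
    '<tool_call>[{"name": "get_weather", "arguments": {"location": "Paris"}}]</tool_call>',
]
for sample in samples:
    print(find_tool_calls(sample))
```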

Supported models:

Salesforce Llama-xLAM models: Salesforce/Llama-xLAM-2-8B-fc-r, Salesforce/Llama-xLAM-2-70B-fc-r
Qwen-xLAM models: Salesforce/xLAM-1B-fc-r, Salesforce/xLAM-3B-fc-r, Salesforce/Qwen-xLAM-32B-fc-r

Flags:

For Llama-based xLAM models: --tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja
For Qwen-based xLAM models: --tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja

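Qwen Models

For Qwen2.5, the chat template in tokenizer_config.json already includes support for Hermes-style tool use, so you can use the hermes parser to enable tool calls for Qwen models. For more details, see the official Qwen documentation.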

Qwen/Qwen2.5-*
Qwen/QwQ-32B

Flags: --tool-call-parser hermes

MiniMax Models (`minimax_m1`)

Supported models:

MiniMaxAi/MiniMax-M1-40k (use with examples/tool_chat_template_minimax_m1.jinja)
MiniMaxAi/MiniMax-M1-80k (use with examples/tool_chat_template_minimax_m1.jinja)

Flags: --tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja

DeepSeek-V3 Models (`deepseek_v3`)

Supported models:

deepseek-ai/DeepSeek-V3-0324 (use with examples/tool_chat_template_deepseekv3.jinja)
deepseek-ai/DeepSeek-R1-0528 (use with examples/tool_chat_template_deepseekr1.jinja)

Flags: --tool-call-parser deepseek_v3 --chat-template {see_above}

DeepSeek-V3.1 Models (`deepseek_v31`)

Supported models:

deepseek-ai/DeepSeek-V3.1 (use with examples/tool_chat_template_deepseekv31.jinja)

Flags: --tool-call-parser deepseek_v31 --chat-template {see_above}

Kimi-K2 Models (`kimi_k2`)

Supported models:

moonshotai/Kimi-K2-Instruct

Flags: --tool-call-parser kimi_k2

Hunyuan Models (`hunyuan_a13b`)

Supported models:

tencent/Hunyuan-A13B-Instruct (the chat template is already included in the Hugging Face model files)

Flags:

For non-reasoning: --tool-call-parser hunyuan_a13b
For reasoning: --tool-call-parser hunyuan_a13b --reasoning-parser hunyuan_a13b

LongCat-Flash-Chat Models (`longcat`)

Supported models:

meituan-longcat/LongCat-Flash-Chat
meituan-longcat/LongCat-Flash-Chat-FP8

Flags: --tool-call-parser longcat

GLM-4.5 Models (`glm45`)

Supported models:

zai-org/GLM-4.5
zai-org/GLM-4.5-Air
zai-org/GLM-4.6

Flags: --tool-call-parser glm45

GLM-4.7 Models (`glm47`)

Supported models:

zai-org/GLM-4.7

Flags: --tool-call-parser glm47

Qwen3-Coder Models (`qwen3_xml`)

Supported models:

Qwen/Qwen3-480B-A35B-Instruct
Qwen/Qwen3-Coder-30B-A3B-Instruct

Flags: --tool-call-parser qwen3_xml

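Olmo 3 Models (`olmo3`)

Olmo 3 models output tool calls in a format very similar to the one expected by the pythonic parser (see below), with a few differences. Each tool call is a pythonic string, but parallel tool calls are newline-delimited, and the calls are wrapped in XML tags as <function_calls>..</function_calls>. In addition, the parser also accepts JSON boolean and null literals (true, false, and null) alongside the pythonic ones (True, False, and None).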
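These differences can be sketched with Python's ast module: take the text inside <function_calls>...</function_calls>, parse each non-empty line as one call, and map the bare names true, false, and null to True, False, and None before evaluating the arguments as literals. parse_olmo3_calls is a hypothetical helper written from the description above, not vLLM's olmo3 parser.

```python
import ast
import re

# Assumed layout: newline-separated pythonic calls inside <function_calls>...</function_calls>.
_BLOCK_RE = re.compile(r"<function_calls>(.*?)</function_calls>", re.DOTALL)
_JSON_LITERALS = {"true": True, "false": False, "null": None}


class _JsonLiteralsToPython(ast.NodeTransformer):
    """Replace bare names true/false/null with the equivalent Python constants."""

    def visit_Name(self, node: ast.Name) -> ast.AST:
        if node.id in _JSON_LITERALS:
            return ast.copy_location(ast.Constant(value=_JSON_LITERALS[node.id]), node)
        return node


def parse_olmo3_calls(text: str) -> list[dict]:
    """Return [{"name": ..., "arguments": {...}}, ...] for each call in the block."""
    match = _BLOCK_RE.search(text)
    if match is None:
        return []
    calls = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between calls
        node = ast.parse(line, mode="eval").body
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError(f"not a simple function call: {line!r}")
        node = _JsonLiteralsToPython().visit(node)
        calls.append({
            "name": node.func.id,
            # literal_eval only accepts literals, so no model output is executed.
            "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords},
        })
    return calls


output = (
    "<function_calls>"
    'get_weather(city="Paris", metric="celsius")\n'
    "get_weather(city='Rome', metric='celsius', include_forecast=true, days=[1, null])"
    "</function_calls>"
)
for call in parse_olmo3_calls(output):
    print(call)
```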

Supported models:

allenai/Olmo-3-7B-Instruct
allenai/Olmo-3-32B-Think

Flags: --tool-call-parser olmo3

Gigachat 3 Models (`gigachat3`)

Uses the chat template from the Hugging Face model files.

Supported models:

ai-sage/GigaChat3-702B-A36B-preview
ai-sage/GigaChat3-702B-A36B-preview-bf16
ai-sage/GigaChat3-10B-A1.8B
ai-sage/GigaChat3-10B-A1.8B-bf16

Flags: --tool-call-parser gigachat3

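Models with Pythonic Tool Calls (`pythonic`)

A growing number of models output a Python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The pythonic tool parser can support such models.

As a concrete example, these models may look up the weather in San Francisco and Seattle by generating: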

[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
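Because this output is a Python list literal, it can be parsed safely with the standard ast module instead of eval. The sketch below is an illustration under that assumption, not the implementation of vLLM's pythonic parser: parse_pythonic_tool_calls is hypothetical, accepts only keyword arguments, and uses ast.literal_eval so that no model-generated code is executed.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list[dict]:
    """Parse "[f(a=1), g(b='x')]" into [{"name": ..., "arguments": {...}}, ...]."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")
    calls = []
    for element in tree.body.elts:
        # Only plain calls such as get_weather(...) are accepted, not attribute calls.
        if not isinstance(element, ast.Call) or not isinstance(element.func, ast.Name):
            raise ValueError(f"unsupported element: {ast.dump(element)}")
        if element.args:
            raise ValueError("only keyword arguments are supported")
        calls.append({
            "name": element.func.id,
            # literal_eval accepts only literals, so no model-generated code runs.
            "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in element.keywords},
        })
    return calls


example = ("[get_weather(city='San Francisco', metric='celsius'), "
           "get_weather(city='Seattle', metric='celsius')]")
for call in parse_pythonic_tool_calls(example):
    print(call["name"], call["arguments"])
```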

Limitations:

The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
Llama's smaller models struggle to use tools effectively.

Example supported models:

meta-llama/Llama-3.2-1B-Instruct ⚠️ (use with examples/tool_chat_template_llama3.2_pythonic.jinja)
meta-llama/Llama-3.2-3B-Instruct ⚠️ (use with examples/tool_chat_template_llama3.2_pythonic.jinja)
Team-ACE/ToolACE-8B (use with examples/tool_chat_template_toolace.jinja)
fixie-ai/ultravox-v0_4-ToolACE-8B (use with examples/tool_chat_template_toolace.jinja)
meta-llama/Llama-4-Scout-17B-16E-Instruct ⚠️ (use with examples/tool_chat_template_llama4_pythonic.jinja)
meta-llama/Llama-4-Maverick-17B-128E-Instruct ⚠️ (use with examples/tool_chat_template_llama4_pythonic.jinja)

Flags: --tool-call-parser pythonic --chat-template {see_above}

Warning

Llama's smaller models frequently fail to emit tool calls in the correct format. Results may vary by model.

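How to Write a Tool Parser Plugin

A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the Hermes2ProToolParser in vllm/tool_parsers/hermes_tool_parser.py.

Here is a summary of a plugin file: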

# import the required packages
# (ToolParser, ToolParserManager, TokenizerLike, ChatCompletionRequest,
# DeltaMessage and ExtractedToolCallInformation; their module paths
# depend on your vLLM version)
from collections.abc import Sequence

# define a tool parser and register it to vllm
# the name registered below can be used
# in --tool-call-parser. you can define as many
# tool parsers as you want here.
class ExampleToolParser(ToolParser):
    def __init__(self, tokenizer: TokenizerLike):
        super().__init__(tokenizer)

    # adjust request. e.g.: set skip special tokens
    # to False for tool call output.
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        return request

    # implement the tool call parse for stream call
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        # placeholder: forward the new text as plain content
        return DeltaMessage(content=delta_text)

    # implement the tool parse for non-stream call
    def extract_tool_calls(
        self,
        model_output: str,
        request: ChatCompletionRequest,
    ) -> ExtractedToolCallInformation:
        # placeholder: report no tool calls and return the output as content
        return ExtractedToolCallInformation(tools_called=False,
                                            tool_calls=[],
                                            content=model_output)


# register the tool parser to ToolParserManager
ToolParserManager.register_lazy_module(
    name="example",
    module_path="vllm.tool_parsers.example",
    class_name="ExampleToolParser",
)

Then you can use this plugin on the command line like this:

    --enable-auto-tool-choice \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser example \
    --chat-template <your chat template>
