vllm bench serve

JSON CLI Arguments

When passing JSON CLI arguments, the following sets of arguments are equivalent:

--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
--json-arg.key1 value1 --json-arg.key2.key3 value2

Additionally, list elements can be passed individually using +:

--json-arg '{"key4": ["value3", "value4", "value5"]}'
--json-arg.key4+ value3 --json-arg.key4+='value4,value5'
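
To make the equivalence concrete, here is a minimal Python sketch of how dotted keys and `+` suffixes could fold into the same nested JSON. The helper `expand_dotted_args` is hypothetical and written only for illustration; it is not vLLM's actual argument parser.

```python
import json


def expand_dotted_args(pairs):
    """Fold (dotted_key, value) pairs into a nested dict.

    A key ending in "+" appends to a list, and a comma-separated
    value contributes several list elements at once.
    """
    result = {}
    for dotted_key, value in pairs:
        append = dotted_key.endswith("+")
        parts = dotted_key.rstrip("+").split(".")
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if append:
            node.setdefault(parts[-1], []).extend(value.split(","))
        else:
            node[parts[-1]] = value
    return result


# Mirrors: --json-arg.key1 value1 --json-arg.key2.key3 value2
nested = expand_dotted_args([("key1", "value1"), ("key2.key3", "value2")])
# Mirrors: --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
listed = expand_dotted_args([("key4+", "value3"), ("key4+", "value4,value5")])

print(json.dumps(nested))
print(json.dumps(listed))
```

Both results match the JSON literals shown in the single-argument form.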

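Arguments

`--seed`

Default: 0

`--num-prompts`

Number of prompts to process.

Default: 1000

`--dataset-name`

Possible choices: sharegpt, burstgpt, sonnet, random, random-mm, random-rerank, hf, custom, prefix_repetition, spec_bench

Name of the dataset to benchmark on.

Default: random

`--no-stream`

Do not load the dataset in streaming mode.

Default: False

`--dataset-path`

Path to the sharegpt/sonnet dataset, or the Hugging Face dataset ID if using an HF dataset.

Default: None

`--no-oversample`

Do not oversample if the dataset has fewer samples than num-prompts.

Default: False

`--skip-chat-template`

Skip applying the chat template for datasets that support it.

Default: False

`--disable-shuffle`

Disable shuffling of dataset samples for deterministic ordering.

Default: False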

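`--label`

The label (prefix) of the benchmark results. If not specified, the value of '--backend' is used as the label.

Default: None

`--backend`

Possible choices: vllm, openai, openai-chat, openai-audio, openai-embeddings, openai-embeddings-chat, openai-embeddings-clip, openai-embeddings-vlm2vec, infinity-embeddings, infinity-embeddings-clip, vllm-rerank

The type of backend or endpoint to use for the benchmark.

Default: openai

`--base-url`

Server or API base URL, if not using an HTTP host and port.

Default: None

`--host`

Default: 127.0.0.1

`--port`

Default: 8000

`--endpoint`

API endpoint.

Default: /v1/completions

`--header`

Key-value pairs (e.g., --header x-additional-info=0.3.3) for headers to be passed with each request. These headers override per-backend constants and values set via environment variables, and are themselves overridden by other arguments (such as request IDs).

Default: None

`--max-concurrency`

Maximum number of concurrent requests. This can be used to simulate an environment where a higher-level component enforces a maximum number of concurrent requests. While the --request-rate argument controls the rate at which requests are initiated, this argument controls how many are actually allowed to execute at a time. This means that when the two are used together, the actual request rate may be lower than the one specified with --request-rate if the server is not processing requests fast enough to keep up.

Default: None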
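
The interaction between a request rate and a concurrency cap can be pictured with a semaphore. The sketch below is an assumption-laden illustration, not the benchmark's own code: `run_with_limit` is a hypothetical helper and `asyncio.sleep` stands in for the HTTP call.

```python
import asyncio


async def run_with_limit(num_requests, max_concurrency):
    """Launch dummy requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def one_request():
        nonlocal in_flight, peak
        # Requests beyond the cap wait here even though they were already started.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            in_flight -= 1

    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return peak


peak_in_flight = asyncio.run(run_with_limit(num_requests=20, max_concurrency=4))
print(peak_in_flight)
```

All 20 requests are initiated at once, but no more than 4 execute at any moment, which is why the effective rate can fall below the requested one.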

`--model`

Name of the model. If not specified, the first model is fetched from the server's /v1/models endpoint.

Default: None

`--input-len`

General input length for datasets. Maps to the dataset-specific input length argument (e.g., --random-input-len, --sonnet-input-len). If not specified, the dataset's default is used.

Default: None

`--output-len`

General output length for datasets. Maps to the dataset-specific output length argument (e.g., --random-output-len, --sonnet-output-len). If not specified, the dataset's default is used.

Default: None

`--tokenizer`

Name or path of the tokenizer, if not using the default tokenizer.

Default: None

`--tokenizer-mode`

Tokenizer mode:

- "auto" will use the tokenizer from `mistral_common` for Mistral models if available, otherwise it will use the "hf" tokenizer.
- "hf" will use the fast tokenizer if available.
- "slow" will always use the slow tokenizer.
- "mistral" will always use the tokenizer from `mistral_common`.
- "deepseek_v32" will always use the tokenizer from `deepseek_v32`.
- Other custom values can be supported via plugins.

Default: auto

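`--use-beam-search`

Default: False

`--logprobs`

Number of logprobs per token to compute and return as part of the request. If not specified, then either (1) if beam search is disabled, no logprobs are computed and a single dummy logprob is returned per token; or (2) if beam search is enabled, 1 logprob per token is computed.

Default: None

`--request-rate`

Number of requests per second. If this is inf, all requests are sent at time 0. Otherwise, a Poisson process or gamma distribution is used to synthesize the request arrival times.

Default: inf

`--burstiness`

Burstiness factor of the request generation. Only takes effect when request_rate is not inf. The default value is 1, which follows a Poisson process. Otherwise, the request intervals follow a gamma distribution. A lower burstiness value (0 < burstiness < 1) results in more bursty requests. A higher burstiness value (burstiness > 1) results in a more uniform arrival of requests.

Default: 1.0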
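
One way to realize this behavior is to draw inter-arrival gaps from a gamma distribution with shape equal to the burstiness and scale chosen so the mean gap stays at 1 / request_rate. The sketch below assumes that parameterization; `sample_intervals` is a hypothetical helper and uses the standard library rather than whatever the benchmark itself uses.

```python
import random
import statistics


def sample_intervals(request_rate, burstiness, count, seed=0):
    """Draw inter-arrival gaps (seconds) with mean 1 / request_rate."""
    rng = random.Random(seed)
    shape = burstiness
    # Assumed scale: keeps shape * scale == 1 / request_rate for any burstiness.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(shape, scale) for _ in range(count)]


def coefficient_of_variation(samples):
    """Standard deviation over mean; higher means burstier arrivals."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


bursty = sample_intervals(request_rate=10.0, burstiness=0.5, count=20000)
poisson = sample_intervals(request_rate=10.0, burstiness=1.0, count=20000)
smooth = sample_intervals(request_rate=10.0, burstiness=4.0, count=20000)

for name, gaps in [("bursty", bursty), ("poisson", poisson), ("smooth", smooth)]:
    print(name, round(statistics.fmean(gaps), 4), round(coefficient_of_variation(gaps), 2))
```

With shape 1 the gamma distribution reduces to the exponential distribution, which is the Poisson-process case. The mean gap is the same in all three runs; only the spread changes.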

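`--trust-remote-code`

Trust remote code from Hugging Face.

Default: False

`--disable-tqdm`

Disable the tqdm progress bar.

Default: False

`--num-warmups`

Number of warmup requests.

Default: 0

`--profile`

Use vLLM profiling. --profiler-config must be provided on the server.

Default: False

`--save-result`

Save the benchmark results to a JSON file.

Default: False

`--save-detailed`

When saving the results, whether to include per-request information such as responses, errors, TTFTs, TPOTs, etc.

Default: False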

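`--append-result`

Append the benchmark result to an existing JSON file.

Default: False

`--metadata`

Key-value pairs (e.g., --metadata version=0.3.3 tp=1) describing this run, saved in the result JSON file for record keeping.

Default: None

`--result-dir`

Directory in which to save the benchmark JSON results. If not specified, results are saved in the current directory.

Default: None

`--result-filename`

Filename for the benchmark JSON results. If not specified, results are saved in the format {label}-{args.request_rate}qps-{base_model_id}-{current_dt}.json.

Default: None

`--ignore-eos`

Set the ignore_eos flag when sending the benchmark request. Warning: ignore_eos is not supported by deepspeed_mii and tgi.

Default: False

`--percentile-metrics`

Comma-separated list of metrics for which to report percentiles. Allowed metric names are "ttft" (time to first token), "tpot" (time per output token), "itl" (inter-token latency), and "e2el" (end-to-end latency). If not specified, defaults to "ttft,tpot,itl" for generative models and "e2el" for pooling models.

Default: None

`--metric-percentiles`

Comma-separated list of percentiles for the selected metrics. To report the 25th, 50th, and 75th percentiles, use "25,50,75". The default value is "99". Use "--percentile-metrics" to select the metrics.

Default: 99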
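
As a rough illustration of what a value such as "25,50,75" asks for, the sketch below parses the comma-separated list and computes each percentile over a handful of made-up latencies. The helpers are hypothetical, and the linear interpolation rule is an assumption; the benchmark's own percentile routine may interpolate differently.

```python
def parse_percentiles(spec):
    """Turn "25,50,75" into [25.0, 50.0, 75.0]."""
    return [float(item) for item in spec.split(",")]


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up TTFT samples in milliseconds, for illustration only.
ttft_ms = [10.0, 20.0, 30.0, 40.0, 50.0]
report = {pct: percentile(ttft_ms, pct) for pct in parse_percentiles("25,50,75")}
print(report)
```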

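`--goodput`

Specify service level objectives (SLOs) for goodput as "KEY:VALUE" pairs, where the key is a metric name and the value is in milliseconds. Multiple "KEY:VALUE" pairs can be provided, separated by spaces. Allowed request-level metric names are "ttft" (time to first token), "tpot" (time per output token), and "e2el" (end-to-end latency). For more context on the definition of goodput, refer to the DistServe paper: https://arxiv.org/pdf/2401.09670 and the blog post: https://hao-ai-lab.github.io/blogs/distserve.

Default: None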
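
The idea behind goodput is to count only the requests that meet every stated SLO. The following sketch shows that counting under a few assumptions: the helper names, the sample measurements, the 10-second duration, and the use of "less than or equal" as the pass condition are all illustrative rather than taken from the benchmark.

```python
def parse_goodput(pairs):
    """Turn ["ttft:200", "e2el:2000"] into {"ttft": 200.0, "e2el": 2000.0}."""
    allowed = {"ttft", "tpot", "e2el"}
    slos = {}
    for pair in pairs:
        name, _, value = pair.partition(":")
        if name not in allowed:
            raise ValueError(f"unknown metric name: {name}")
        slos[name] = float(value)
    return slos


def count_good(requests_ms, slos):
    """Count requests whose every SLO-bound metric is within its limit."""
    return sum(
        all(request[name] <= limit for name, limit in slos.items())
        for request in requests_ms
    )


slos = parse_goodput(["ttft:200", "e2el:2000"])
# Made-up per-request measurements in milliseconds.
requests_ms = [
    {"ttft": 120.0, "tpot": 30.0, "e2el": 1500.0},  # meets both SLOs
    {"ttft": 250.0, "tpot": 28.0, "e2el": 1400.0},  # misses ttft
    {"ttft": 180.0, "tpot": 35.0, "e2el": 2600.0},  # misses e2el
]
good = count_good(requests_ms, slos)
benchmark_duration_s = 10.0  # assumed duration for the example
print(good, good / benchmark_duration_s)
```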

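`--request-id-prefix`

Prefix for request IDs.

Default: bench-3d02ebce-

`--served-model-name`

The model name used in the API. If not specified, the model name is the same as the `--model` argument.

Default: None

`--lora-modules`

A subset of the LoRA module names passed in when launching the server. For each request, the script chooses a LoRA module at random.

Default: None

`--ramp-up-strategy`

Possible choices: linear, exponential

The ramp-up strategy. This is used to ramp up the request rate from the initial RPS to the final RPS (specified by --ramp-up-start-rps and --ramp-up-end-rps) over the duration of the benchmark.

Default: None

`--ramp-up-start-rps`

The starting request rate for ramp-up (RPS). Required when --ramp-up-strategy is used.

Default: None

`--ramp-up-end-rps`

The ending request rate for ramp-up (RPS). Required when --ramp-up-strategy is used.

Default: None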
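
The text above names the two strategies but not their formulas. A plausible reading, sketched here purely as an assumption, is straight-line interpolation for "linear" and geometric interpolation for "exponential", with `progress` running from 0 at the start of the benchmark to 1 at the end. The function `ramp_rate` is hypothetical.

```python
def ramp_rate(strategy, start_rps, end_rps, progress):
    """Request rate at a given progress in [0, 1] (assumed formulas)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if strategy == "linear":
        # Equal RPS increments per unit of progress.
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        # Equal RPS ratios per unit of progress; requires start_rps > 0.
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown strategy: {strategy}")


for progress in (0.0, 0.5, 1.0):
    print(
        progress,
        ramp_rate("linear", 1.0, 16.0, progress),
        ramp_rate("exponential", 1.0, 16.0, progress),
    )
```

Under these formulas the exponential ramp spends more of the run at low rates and climbs steeply near the end, while the linear ramp climbs evenly.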

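`--ready-check-timeout-sec`

Maximum time to wait for the endpoint to become ready, in seconds (default: 600 seconds / 10 minutes). If set to 0, the ready check is skipped.

Default: 600

`--extra-body`

A JSON string representing extra body parameters to include in each request. Example: '{"chat_template_kwargs":{"enable_thinking":false}}'

Default: None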
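
Conceptually, the JSON string is decoded once and its keys are added to every request body. The sketch below shows that merge under stated assumptions: `build_payload` is a hypothetical helper, and the base fields (`model`, `prompt`, `max_tokens`) are typical completion-request fields chosen for illustration, not a statement of what the benchmark sends.

```python
import json


def build_payload(model, prompt, max_tokens, extra_body_json=None):
    """Build a request body and merge in the decoded extra-body keys."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if extra_body_json is not None:
        # Extra keys are layered on top of the base fields.
        payload.update(json.loads(extra_body_json))
    return payload


payload = build_payload(
    model="my-model",  # placeholder name
    prompt="Hello",
    max_tokens=16,
    extra_body_json='{"chat_template_kwargs":{"enable_thinking":false}}',
)
print(json.dumps(payload, indent=2))
```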

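custom dataset options

`--custom-output-len`

Number of output tokens per request, used only for the custom dataset.

Default: 256

spec bench dataset options

`--spec-bench-output-len`

Number of output tokens per request, used only for the spec bench dataset.

Default: 256

`--spec-bench-category`

Category for the spec bench dataset. If None, all categories are used.

Default: None

sonnet dataset options

`--sonnet-input-len`

Number of input tokens per request, used only for the sonnet dataset.

Default: 550

`--sonnet-output-len`

Number of output tokens per request, used only for the sonnet dataset.

Default: 150

`--sonnet-prefix-len`

Number of prefix tokens per request, used only for the sonnet dataset.

Default: 200

sharegpt dataset options

`--sharegpt-output-len`

Output length for each request. Overrides the output length from the ShareGPT dataset.

Default: None

blazedit dataset options

`--blazedit-min-distance`

Minimum distance for the blazedit dataset. Min: 0, Max: 1.0

Default: 0.0

`--blazedit-max-distance`

Maximum distance for the blazedit dataset. Min: 0, Max: 1.0

Default: 1.0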

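random dataset options

`--random-input-len`

Number of input tokens per request, used only for random sampling.

Default: 1024

`--random-output-len`

Number of output tokens per request, used only for random sampling.

Default: 128

`--random-range-ratio`

Range ratio for sampling input/output lengths, used only for random sampling. Must be in the range [0, 1) to define a symmetric sampling range [length * (1 - range_ratio), length * (1 + range_ratio)].

Default: 0.0

`--random-prefix-len`

Number of fixed prefix tokens before the random context in a request. The total input length is the sum of random-prefix-len and a random context length sampled from [input_len * (1 - range_ratio), input_len * (1 + range_ratio)].

Default: 0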
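
The two options above combine as follows: a context length is drawn from the symmetric range, then the fixed prefix is added on top. The sketch below illustrates this with a hypothetical `sample_length` helper; the uniform integer draw and the floor/ceil rounding of the bounds are assumptions, since the text gives only the real-valued interval.

```python
import math
import random


def sample_length(length, range_ratio, rng):
    """Draw an integer length from the symmetric range around `length`."""
    if not 0 <= range_ratio < 1:
        raise ValueError("range_ratio must be in [0, 1)")
    low = math.floor(length * (1 - range_ratio))
    high = math.ceil(length * (1 + range_ratio))
    return rng.randint(low, high)  # inclusive on both ends


rng = random.Random(0)
prefix_len = 32  # plays the role of --random-prefix-len
context_lens = [sample_length(1024, 0.25, rng) for _ in range(1000)]
total_lens = [prefix_len + n for n in context_lens]
fixed_len = sample_length(1024, 0.0, rng)  # a ratio of 0 pins the length

print(min(context_lens), max(context_lens), fixed_len)
```

With input length 1024 and ratio 0.25, every sampled context falls within [768, 1280], and each total input length is that value plus 32.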

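`--random-batch-size`

Batch size for random sampling. Used only for the embeddings benchmark.

Default: 1

`--no-reranker`

Whether the model supports reranking natively. Used only for the reranker benchmark.

Default: False

random multimodal dataset options, extended from the random dataset

`--random-mm-base-items-per-request`

Base number of multimodal items per request for random-mm. The actual per-request count is sampled around this base using --random-mm-num-mm-items-range-ratio.

Default: 1

`--random-mm-num-mm-items-range-ratio`

Range ratio r, in [0, 1], for sampling the number of items per request. The count is sampled uniformly from the closed integer range [floor(n*(1-r)), ceil(n*(1+r))], where n is the base number of items per request. r=0 keeps the count fixed; r=1 allows 0 items. The maximum is clamped to the sum of the per-modality limits from --random-mm-limit-mm-per-prompt. An error is raised if the computed minimum exceeds the maximum.

Default: 0.0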
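
The bounds described above can be worked through directly. The sketch below follows the stated rules (floor for the minimum, ceil for the maximum, clamp to the sum of limits, error when min exceeds max); `item_count_bounds` is a hypothetical helper, not the benchmark's function.

```python
import math


def item_count_bounds(base_items, range_ratio, limit_per_modality):
    """Return the (min, max) item count per request under the stated rules."""
    if not 0 <= range_ratio <= 1:
        raise ValueError("range_ratio must be in [0, 1]")
    low = math.floor(base_items * (1 - range_ratio))
    high = math.ceil(base_items * (1 + range_ratio))
    # Clamp the maximum to the total of the per-modality caps.
    high = min(high, sum(limit_per_modality.values()))
    if low > high:
        raise ValueError(f"minimum {low} exceeds maximum {high}")
    return low, high


limits = {"image": 3, "video": 0}
print(item_count_bounds(2, 0.0, limits))  # fixed count
print(item_count_bounds(2, 0.5, limits))  # spread around the base
print(item_count_bounds(2, 1.0, limits))  # zero items allowed, max clamped
```

A base of 5 with r=0 and the same limits would raise an error, because the minimum of 5 exceeds the clamped maximum of 3.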

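`--random-mm-limit-mm-per-prompt`

Per-modality hard caps on the items attached to each request, e.g. '{"image": 3, "video": 0}'. The sampled per-request item count is clamped to the sum of these limits. When a modality reaches its cap, its buckets are excluded and the probabilities are renormalized. Note: only image sampling is supported for now.

Default: {'image': 255, 'video': 1}

`--random-mm-bucket-config`

The bucket config is a dictionary mapping a multimodal item sampling configuration to a probability. Two modalities are currently allowed: images and videos. A bucket key is a tuple of (height, width, num_frames), and the value is the probability of sampling that specific item. Example: --random-mm-bucket-config {(256, 256, 1): 0.5, (720, 1280, 1): 0.4, (720, 1280, 16): 0.10}

- First item: images with resolution 256x256, with probability 0.5.
- Second item: images with resolution 720x1280, with probability 0.4.
- Third item: videos with resolution 720x1280 and 16 frames, with probability 0.1.

Note: if the probabilities do not sum to 1, they are normalized. Note also that only image sampling is supported for now.

Default: {(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}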
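
Normalization of a bucket config amounts to dividing each weight by the total. The sketch below shows that step plus a draw from the normalized buckets; the helper names are hypothetical, and treating num_frames == 1 as an image follows the example in the text rather than any documented rule.

```python
import random


def normalize_buckets(bucket_config):
    """Scale bucket weights so that they sum to 1."""
    total = sum(bucket_config.values())
    if total <= 0:
        raise ValueError("bucket probabilities must sum to a positive value")
    return {key: weight / total for key, weight in bucket_config.items()}


def modality_of(bucket_key):
    """Assumed rule: a single frame is an image, more frames are a video."""
    _height, _width, num_frames = bucket_key
    return "image" if num_frames == 1 else "video"


# Weights that sum to 2 rather than 1, to show the normalization.
raw = {(256, 256, 1): 1.0, (720, 1280, 1): 1.0, (720, 1280, 16): 0.0}
buckets = normalize_buckets(raw)

rng = random.Random(0)
keys = list(buckets)
picked = rng.choices(keys, weights=[buckets[k] for k in keys], k=5)
print(buckets)
print([(key, modality_of(key)) for key in picked])
```

Because the video bucket has weight 0, every draw in this example is an image, which is consistent with the note that only image sampling is supported for now.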

hf dataset options

`--hf-subset`

Subset of the HF dataset.

Default: None

`--hf-split`

Split of the HF dataset.

Default: None

`--hf-name`

Name of the dataset on Hugging Face (e.g., 'lmarena-ai/VisionArena-Chat'). Specify this if your dataset-path is a local path.

Default: None

`--hf-output-len`

Output length for each request. Overrides the output lengths from the sampled HF dataset.

Default: None

prefix repetition dataset options

`--prefix-repetition-prefix-len`

Number of prefix tokens per request, used only for the prefix repetition dataset.

Default: 256

`--prefix-repetition-suffix-len`

Number of suffix tokens per request, used only for the prefix repetition dataset. The total input length is prefix_len + suffix_len.

Default: 256

`--prefix-repetition-num-prefixes`

Number of prefixes to generate, used only for the prefix repetition dataset. The number of prompts per prefix is num_requests // num_prefixes.

Default: 10
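
The arithmetic behind these options is small enough to check by hand. The sketch below plugs in the defaults shown on this page (1000 prompts, 10 prefixes, prefix and suffix lengths of 256); `prefix_repetition_plan` is a hypothetical helper, and how any remainder from the floor division is handled is not stated here, so it is only reported.

```python
def prefix_repetition_plan(num_requests, num_prefixes, prefix_len, suffix_len):
    """Summarize prompt counts and input length for the given settings."""
    return {
        "prompts_per_prefix": num_requests // num_prefixes,
        "remainder": num_requests % num_prefixes,
        "total_input_len": prefix_len + suffix_len,
    }


default_plan = prefix_repetition_plan(1000, 10, 256, 256)
uneven_plan = prefix_repetition_plan(1000, 3, 256, 256)
print(default_plan)
print(uneven_plan)
```

Each distinct prefix is reused across its group of prompts, so a smaller number of prefixes means more prompts share the same leading tokens.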

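`--prefix-repetition-output-len`

Number of output tokens per request, used only for the prefix repetition dataset.

Default: 128

sampling parameters

`--top-p`

Top-p sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--top-k`

Top-k sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--min-p`

Min-p sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--temperature`

Temperature sampling parameter. Only has an effect on OpenAI-compatible backends. If not specified, defaults to greedy decoding (i.e., temperature==0.0).

Default: None

`--frequency-penalty`

Frequency penalty sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--presence-penalty`

Presence penalty sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--repetition-penalty`

Repetition penalty sampling parameter. Only has an effect on OpenAI-compatible backends.

Default: None

`--common-prefix-len`

Length of the common prefix shared by all prompts (used by the random dataset).

Default: None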
