vLLM CLI 指南¶

vllm 命令列工具用於執行和管理 vLLM 模型。您可以從檢視幫助資訊開始：

vllm --help

可用命令

vllm {chat,complete,serve,bench,collect-env,run-batch}

serve¶

啟動 vLLM 相容 OpenAI 的 API 伺服器。

使用模型啟動

vllm serve meta-llama/Llama-2-7b-hf

指定埠

vllm serve meta-llama/Llama-2-7b-hf --port 8100

透過 Unix 域套接字提供服務

vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock

透過 --help 檢視更多選項

# To list all groups
vllm serve --help=listgroup

# To view a argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max

# To view full help with pager (less/more)
vllm serve --help=page

請參閱 vllm serve 獲取所有可用引數的完整參考。

chat¶

透過執行中的 API 伺服器生成聊天補全。

# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"

請參閱 vllm chat 獲取所有可用引數的完整參考。

complete¶

透過執行中的 API 伺服器根據給定的提示生成文字補全。

# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"

請參閱 vllm complete 獲取所有可用引數的完整參考。

bench¶

執行延遲線上服務吞吐量和離線推理吞吐量的基準測試。

要使用基準測試命令，請使用 pip install vllm[bench] 安裝額外的依賴項。

可用命令

vllm bench {latency, serve, throughput}

latency¶

對單個請求批次的延遲進行基準測試。

vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy

請參閱 vllm bench latency 獲取所有可用引數的完整參考。

serve¶

對線上服務吞吐量進行基準測試。

vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host server-host \
    --port server-port \
    --random-input-len 32 \
    --random-output-len 4  \
    --num-prompts  5

請參閱 vllm bench serve 獲取所有可用引數的完整參考。

throughput¶

基準測試離線推理吞吐量。

vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy

請參閱 vllm bench throughput 獲取所有可用引數的完整參考。

collect-env¶

開始收集環境資訊。

vllm collect-env

run-batch¶

執行批處理提示並將結果寫入檔案。

使用本地檔案執行

vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

使用遠端檔案

vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

請參閱 vllm run-batch 獲取所有可用引數的完整參考。

vLLM CLI 指南¶

serve¶

chat¶

complete¶

bench¶

latency¶

serve¶

throughput¶

collect-env¶

run-batch¶

更多幫助¶