Offline Inference with the OpenAI Batch File Format

Source: https://github.com/vllm-project/vllm/tree/main/examples/offline_inference/openai_batch.

This is a guide to performing batch inference using the OpenAI batch file format, **not** the complete Batch (REST) API.

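File format

The OpenAI batch file format consists of a series of JSON objects, one per line.

See the example file, openai_example_batch.jsonl, in the source directory linked above.

Each line represents a separate request. See the OpenAI package reference for more details.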

We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` endpoints (completions coming soon).
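As a rough illustration of the format described above, the following sketch builds a two-request batch with Python's standard library. The helper names `make_request` and `to_jsonl`, the `SUPPORTED_URLS` set, and the second prompt are illustrative assumptions and not part of vLLM; the field names mirror the example file.

```python
import json

# Endpoints this guide lists as supported (copied by hand, not queried from vLLM).
SUPPORTED_URLS = {"/v1/chat/completions", "/v1/embeddings", "/v1/score"}


def make_request(custom_id, url, body):
    """Build one batch request as a dict (hypothetical helper)."""
    if url not in SUPPORTED_URLS:
        raise ValueError(f"unsupported endpoint: {url}")
    return {"custom_id": custom_id, "method": "POST", "url": url, "body": body}


def to_jsonl(requests):
    """Serialize the requests as one JSON object per line."""
    return "".join(json.dumps(request) + "\n" for request in requests)


MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
prompts = ["Hello world!", "What is vLLM?"]
requests = [
    make_request(
        f"request-{i}",
        "/v1/chat/completions",
        {
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_completion_tokens": 1000,
        },
    )
    for i, prompt in enumerate(prompts, start=1)
]
batch_text = to_jsonl(requests)
print(batch_text, end="")
# To save it: pathlib.Path("my_batch.jsonl").write_text(batch_text, encoding="utf-8")
```

Giving every request a distinct `custom_id` makes it straightforward to match each result line back to the request that produced it.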

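Prerequisites

The examples in this document use meta-llama/Meta-Llama-3-8B-Instruct. To access it:
- Create a user access token.
- Install the token on your machine (run huggingface-cli login).
- Get access to the gated model by visiting the model card and agreeing to the terms and conditions.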

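Example 1: Running with a local file

Step 1: Create your batch file

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.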

wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl

Once you have created your batch file, it should look like this:

cat offline_inference/openai_batch/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}

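Step 2: Run the batch

The batch running tool is designed to be used from the command line.

You can run the batch with the following command, which writes its results to a file called results.jsonl: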

python -m vllm.entrypoints.openai.run_batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

Or use the vllm command-line interface:

vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

Step 3: Check your results

You should now have your results in results.jsonl. You can check them by running cat results.jsonl:

cat results.jsonl
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}
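To post-process the output programmatically rather than reading it by eye, a minimal sketch along the following lines may help. The function name `extract_chat_text` is hypothetical. The sample line is a shortened copy of the output above, and the fallback to a nested `body` is an assumption based on the embeddings and score outputs shown later in this guide.

```python
import json


def extract_chat_text(result_line):
    """Return (custom_id, assistant text or None, error) for one result line."""
    result = json.loads(result_line)
    response = result.get("response") or {}
    # Some outputs nest the completion under "body"; otherwise the
    # completion object is the response itself, as in the sample above.
    completion = response.get("body") or response
    choices = completion.get("choices") or []
    text = choices[0]["message"]["content"] if choices else None
    return result["custom_id"], text, result.get("error")


# A shortened line in the same shape as the sample output above.
sample = json.dumps({
    "id": "vllm-42e3d09b14b04568afa3f1797751a267",
    "custom_id": "request-2",
    "response": {
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "*silence*"},
                "finish_reason": "stop",
            }
        ],
    },
    "error": None,
})
custom_id, text, error = extract_chat_text(sample)
print(custom_id, repr(text), error)
```

To process a whole file, the same function can be applied to each non-empty line of results.jsonl.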

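Example 2: Using remote files

The batch runner supports remote input and output URLs that are accessible via http/https.

For example, to run against our example input file located at https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl, you can run: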

python -m vllm.entrypoints.openai.run_batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

Or use the vllm command-line interface:

vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

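Example 3: Integrating with AWS S3

To integrate with cloud blob storage, we recommend using presigned URLs.

[Learn more about S3 presigned URLs here]

Additional prerequisites

- Create an S3 bucket.
- The awscli package (run pip install awscli), to configure your credentials and interact with S3.
  - Configure your credentials.
- The boto3 Python package (run pip install boto3), to generate presigned URLs.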

Step 1: Upload your input file

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.

wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl

Once you have created your batch file, it should look like this:

cat offline_inference/openai_batch/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}

Now upload your batch file to your S3 bucket.

aws s3 cp offline_inference/openai_batch/openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl

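Step 2: Generate your presigned URLs

Presigned URLs can only be generated via the SDK. You can run the following Python script to generate your presigned URLs. Be sure to replace the MY_BUCKET, MY_INPUT_FILE.jsonl, and MY_OUTPUT_FILE.jsonl placeholders with your bucket and file names.

(The script is adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py)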

import boto3
from botocore.exceptions import ClientError

def generate_presigned_url(s3_client, client_method, method_parameters, expires_in):
    """
    Generate a presigned Amazon S3 URL that can be used to perform an action.

    :param s3_client: A Boto3 Amazon S3 client.
    :param client_method: The name of the client method that the URL performs.
    :param method_parameters: The parameters of the specified client method.
    :param expires_in: The number of seconds the presigned URL is valid for.
    :return: The presigned URL.
    """
    try:
        url = s3_client.generate_presigned_url(
            ClientMethod=client_method,
            Params=method_parameters,
            ExpiresIn=expires_in,
        )
    except ClientError:
        raise
    return url


s3_client = boto3.client("s3")
input_url = generate_presigned_url(
    s3_client,
    "get_object",
    {"Bucket": "MY_BUCKET", "Key": "MY_INPUT_FILE.jsonl"},
    expires_in=3600,
)
output_url = generate_presigned_url(
    s3_client,
    "put_object",
    {"Bucket": "MY_BUCKET", "Key": "MY_OUTPUT_FILE.jsonl"},
    expires_in=3600,
)
print(f"{input_url=}")
print(f"{output_url=}")

The script should output:

input_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
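Because the URLs above are generated with expires_in=3600, a long batch could outlive them. The sketch below reads the `Expires` query parameter from a URL shaped like the sample output and converts it to a timestamp. The function name `presigned_url_expiry` is hypothetical, and the sketch assumes the query-string layout shown above; URLs signed with Signature Version 4 carry `X-Amz-Date` and `X-Amz-Expires` instead, in which case it returns None.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url):
    """Return the expiry encoded in an `Expires=` query parameter, or None."""
    query = parse_qs(urlsplit(url).query)
    if "Expires" not in query:
        return None  # for example, SigV4-style URLs
    # `Expires` is a Unix timestamp in seconds.
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


# The placeholder URL from the sample output above.
input_url = (
    "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl"
    "?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST"
    "&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091"
)
expiry = presigned_url_expiry(input_url)
print("expires at:", expiry.isoformat())
print("still valid:", expiry > datetime.now(timezone.utc))
```

If the output URL might expire before the batch finishes, a larger expires_in value when generating it is the simplest remedy.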

Step 3: Run the batch runner using your presigned URLs

You can now run the batch runner, using the URLs generated in the previous section.

python -m vllm.entrypoints.openai.run_batch \
    -i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    -o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    --model meta-llama/Meta-Llama-3-8B-Instruct

Or use the vllm command-line interface:

vllm run-batch \
    -i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    -o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    --model meta-llama/Meta-Llama-3-8B-Instruct
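Presigned URLs contain `?` and `&`, which a shell interprets unless the URL is quoted. When the URLs come from a Python script, one option is to assemble the command as an argument list, as in this sketch. The helper name `build_run_batch_command` is hypothetical; the flags are the ones used in the commands above.

```python
import shlex


def build_run_batch_command(input_url, output_url, model):
    """Return the argument list for `vllm run-batch` with the flags shown above."""
    return ["vllm", "run-batch", "-i", input_url, "-o", output_url, "--model", model]


# Placeholder URLs in the same shape as the sample output of the previous step.
QUERY = "?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091"
command = build_run_batch_command(
    "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl" + QUERY,
    "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl" + QUERY,
    "meta-llama/Meta-Llama-3-8B-Instruct",
)

# shlex.join quotes each argument, so the printed line can be pasted into a POSIX shell.
print(shlex.join(command))
# To launch it without going through a shell: subprocess.run(command, check=True)
```

Passing the list to subprocess.run avoids shell quoting entirely, since each URL reaches the program as a single argument.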

Step 4: View your results

Your results are now on S3. You can view them in your terminal by running:

aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -

Example 4: Using the embeddings endpoint

Additional prerequisites

Ensure you are using vllm >= 0.5.5.

Step 1: Create your batch file

Add embedding requests to your batch file. The following is an example:

{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}

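You can even mix chat completion and embedding requests in the batch file, as long as the model you are using supports both (note that all requests must use the same model).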
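Since every request in a batch must name the same model, it can be worth checking a file before submitting it. The following is a minimal sketch of such a check; the function name `models_in_batch` is hypothetical, and it assumes each request keeps its model under `body.model` as in the examples above.

```python
import json


def models_in_batch(lines):
    """Return the set of model names used by the requests in a batch file."""
    models = set()
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            models.add(json.loads(line)["body"]["model"])
    return models


# The two embedding requests from the example above.
batch_lines = [
    '{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", '
    '"body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}',
    '{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", '
    '"body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}',
]
models = models_in_batch(batch_lines)
if len(models) != 1:
    raise ValueError(f"expected exactly one model, found: {sorted(models)}")
print("batch uses model:", next(iter(models)))
```

For a file on disk, the open file object can be passed directly, since iterating over it yields one line at a time.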

Step 2: Run the batch

You can run the batch using the same command as in the earlier examples.

Step 3: Check your results

You can check your results by running cat results.jsonl:

cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...
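To pull the vectors out of the results for downstream use, a sketch like the one below may be a starting point. The function name `extract_embeddings` is hypothetical. The sample line follows the shape of the output above but keeps only the first three values of the vector, so the printed dimension is not the model's real embedding size.

```python
import json


def extract_embeddings(result_line):
    """Return (custom_id, list of embedding vectors) for one result line."""
    result = json.loads(result_line)
    if result.get("error") is not None:
        raise RuntimeError(f"{result['custom_id']} failed: {result['error']}")
    data = result["response"]["body"]["data"]
    # Sort by "index" so the vectors line up with the order of the inputs.
    ordered = sorted(data, key=lambda item: item["index"])
    return result["custom_id"], [item["embedding"] for item in ordered]


# Same shape as the output above, with the vector cut to three values.
sample = json.dumps({
    "custom_id": "request-1",
    "response": {
        "status_code": 200,
        "body": {
            "object": "list",
            "model": "intfloat/e5-mistral-7b-instruct",
            "data": [
                {
                    "index": 0,
                    "object": "embedding",
                    "embedding": [0.016204833984375, 0.0092010498046875, 0.0018358230590820312],
                }
            ],
        },
    },
    "error": None,
})
custom_id, vectors = extract_embeddings(sample)
print(custom_id, "vectors:", len(vectors), "dimension:", len(vectors[0]))
```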

Example 5: Using the score endpoint

Additional prerequisites

Ensure you are using vllm >= 0.7.0.

Step 1: Create your batch file

Add score requests to your batch file. The following is an example:

{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}

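You can mix chat completion, embedding, and score requests in the batch file, as long as the model you are using supports all of them (note that all requests must use the same model).

Step 2: Run the batch

You can run the batch using the same command as in the earlier examples.

Step 3: Check your results

You can check your results by running cat results.jsonl: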

cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
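Each entry in `data` carries an `index` that refers back to a position in the request's `text_2` list. As a hedged illustration, the sketch below picks the highest-scoring candidate for one result line. The function name `best_match` is hypothetical, and the sample is a shortened copy of the first output line above.

```python
import json


def best_match(result_line, candidates):
    """Return (candidate text, score) for the highest-scoring entry of one result."""
    result = json.loads(result_line)
    scores = result["response"]["body"]["data"]
    top = max(scores, key=lambda item: item["score"])
    # "index" points back into the request's text_2 list.
    return candidates[top["index"]], top["score"]


# The text_2 list from request-1 in the batch file above.
candidates = ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]

# Shortened copy of the first output line above.
sample = json.dumps({
    "custom_id": "request-1",
    "response": {
        "status_code": 200,
        "body": {
            "object": "list",
            "model": "BAAI/bge-reranker-v2-m3",
            "data": [
                {"index": 0, "object": "score", "score": 0.0010900497436523438},
                {"index": 1, "object": "score", "score": 1.0},
            ],
        },
    },
    "error": None,
})
text, score = best_match(sample, candidates)
print(f"best match: {text!r} (score {score})")
```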

Example materials

openai_example_batch.jsonl

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}