BitsAndBytes

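vLLM now supports BitsAndBytes for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and improve performance without significantly sacrificing accuracy. Unlike other quantization methods, BitsAndBytes does not need input data to calibrate the quantized model.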
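To get a rough sense of the memory reduction, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not something vLLM computes: the helper `weight_memory_gib` is hypothetical, the 7-billion parameter count is an assumption, and the estimate covers weights only, ignoring quantization metadata, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumed parameter count for a "7B" model (illustrative only).
params = 7e9

print(f"16-bit weights: {weight_memory_gib(params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_memory_gib(params, 4):.1f} GiB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights; real savings are somewhat smaller because the quantization constants also need to be stored.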

The steps for using BitsAndBytes with vLLM are described below.

pip install "bitsandbytes>=0.46.1"

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes-quantized models on Hugging Face. These repositories usually contain a config.json file that includes a quantization_config section.
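As a minimal sketch of how to tell whether a downloaded config.json describes a bitsandbytes checkpoint, the snippet below inspects the quantization_config section. The helper `bnb_quantization_info` and the sample JSON are hypothetical, and the `quant_method` and `load_in_4bit` keys are assumptions based on how such checkpoints are commonly serialized; check the actual file in the repository you use.

```python
import json


def bnb_quantization_info(config: dict):
    """Return the quantization_config dict if it names bitsandbytes, else None."""
    qcfg = config.get("quantization_config")
    if not isinstance(qcfg, dict):
        return None  # no quantization section: not a pre-quantized checkpoint
    if qcfg.get("quant_method") != "bitsandbytes":
        return None  # quantized, but with a different method
    return qcfg


# Hypothetical config.json content, for illustration only.
sample_config_text = """
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true
  }
}
"""

config = json.loads(sample_config_text)
info = bnb_quantization_info(config)
print("pre-quantized with bitsandbytes:", info is not None)
```

In practice you would load the real file with `json.load(open("config.json"))` instead of the inline sample.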

Reading a quantized checkpoint

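For pre-quantized checkpoints, vLLM tries to infer the quantization method from the config file, so you don't need to specify the quantization argument explicitly.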

from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
)

In-flight quantization: loading as 4-bit quantization

For in-flight 4-bit quantization with BitsAndBytes, you need to specify the quantization argument explicitly.

from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
)

OpenAI-compatible server

For 4-bit in-flight quantization, append the following to your model arguments:

--quantization bitsandbytes
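To show where the flag fits in a full command line, here is a small sketch that assembles a server launch command. The helper `build_serve_command` is hypothetical, and the use of the `vllm serve` entry point and the `huggyllama/llama-7b` model here are assumptions for illustration; adapt the command to however you normally start the server.

```python
import shlex
from typing import List, Optional


def build_serve_command(model: str, quantization: Optional[str] = None) -> List[str]:
    """Assemble an argv list for launching the server (illustrative only)."""
    cmd = ["vllm", "serve", model]
    if quantization:
        # Only needed for in-flight quantization; pre-quantized
        # checkpoints are detected from their config file.
        cmd += ["--quantization", quantization]
    return cmd


cmd = build_serve_command("huggyllama/llama-7b", quantization="bitsandbytes")
# Print the command as a shell-safe string; run it yourself or pass
# the list to subprocess.run(cmd) to launch the server.
print(shlex.join(cmd))
```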