BitsAndBytes

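vLLM now supports BitsAndBytes for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and improve performance without significantly sacrificing accuracy. Unlike other quantization methods, BitsAndBytes does not require calibrating the quantized model with input data.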

Below are the steps to use BitsAndBytes with vLLM.

pip install "bitsandbytes>=0.46.1"

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes-quantized models on Hugging Face. These repositories usually contain a config.json file that includes a quantization_config section.
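As a rough illustration of what such a config.json looks like to a loader, the sketch below checks whether a parsed config declares a bitsandbytes quantization_config. The helper name `is_bnb_prequantized` and the sample JSON are made up for this example. The `quant_method` and `load_in_4bit` keys follow what Hugging Face Transformers typically writes for bitsandbytes checkpoints. This is not a description of vLLM's internal detection logic.

```python
import json

# Hypothetical excerpt of a config.json from a pre-quantized repository.
SAMPLE_CONFIG = """
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4"
  }
}
"""


def is_bnb_prequantized(config: dict) -> bool:
    """Return True if the config declares a bitsandbytes quantization_config."""
    # A missing or null section is treated as "not quantized".
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method") == "bitsandbytes"


config = json.loads(SAMPLE_CONFIG)
print(is_bnb_prequantized(config))                   # pre-quantized checkpoint
print(is_bnb_prequantized({"model_type": "llama"}))  # plain checkpoint
```

A check like this can help decide whether the quantization argument has to be passed explicitly, as described in the sections that follow.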

Read quantized checkpoint

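For pre-quantized checkpoints, vLLM tries to infer the quantization method from the config file, so you do not need to specify the quantization argument explicitly.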

from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True
)

In-flight quantization: load as 4-bit quantization

For in-flight 4-bit quantization with BitsAndBytes, you need to specify the quantization argument explicitly.

from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes"
)
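To give a sense of scale for the memory reduction, here is a back-of-the-envelope estimate of weight storage at 16 bits versus 4 bits per parameter. It assumes a nominal 7e9 parameters for a "7B" model and counts weights only. Real footprints are higher, because quantization metadata, layers left unquantized, activations, and the KV cache are all ignored here. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # nominal size assumed for a "7B" model, not an exact count

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # bfloat16 weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"bf16 weights : {bf16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
print(f"ratio        : {bf16_gib / int4_gib:.0f}x")
```

The fourfold ratio is an upper bound on the savings for the weights alone; measured GPU memory use will show a smaller overall reduction.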

OpenAI-compatible server

Append the following to your model arguments for 4-bit in-flight quantization:

--quantization bitsandbytes
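For scripted launches, the flag can be added conditionally. The sketch below assembles an argument list, assuming the server is started through the `vllm serve` entry point. The helper name `build_serve_command` is hypothetical, and the model ID is borrowed from the in-flight example above.

```python
import shlex


def build_serve_command(model_id: str, in_flight_bnb: bool) -> list[str]:
    """Build an argument list for `vllm serve`, optionally enabling bitsandbytes."""
    cmd = ["vllm", "serve", model_id]
    if in_flight_bnb:
        # Only needed for in-flight quantization; pre-quantized checkpoints
        # are detected from their config file.
        cmd += ["--quantization", "bitsandbytes"]
    return cmd


cmd = build_serve_command("huggyllama/llama-7b", in_flight_bnb=True)
print(shlex.join(cmd))  # shell-safe string form of the command
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.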