GGUF

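Warning

Please note that GGUF support in vLLM is currently highly experimental and under-optimized, and it may be incompatible with other features. For now, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.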

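Warning

Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the gguf-split tool to merge the files into a single-file model.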
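As a rough illustration of the merge step, the sketch below checks whether a file name looks like one shard of a split model and builds the merge command as an argument list. It assumes the llama.cpp shard naming convention (`<prefix>-00001-of-00003.gguf`) and the tool's `--merge <first-shard> <output>` form. The helper names `is_split_gguf` and `build_merge_command` and the path `models/my-model-...` are hypothetical and exist only for this example.

```python
import re
from pathlib import Path

# llama.cpp's split naming convention: <prefix>-00001-of-00003.gguf
# (assumed here; adjust the pattern if your files are named differently).
SHARD_PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$"
)


def is_split_gguf(path: str) -> bool:
    """Return True if the file name looks like one shard of a split GGUF model."""
    return SHARD_PATTERN.match(Path(path).name) is not None


def build_merge_command(first_shard: str, tool: str = "gguf-split") -> list[str]:
    """Build an argv list that merges all shards into <prefix>.gguf.

    `tool` is the executable name. Recent llama.cpp builds ship it as
    `llama-gguf-split`, so pass that name if `gguf-split` is not found.
    """
    shard = Path(first_shard)
    match = SHARD_PATTERN.match(shard.name)
    if match is None:
        raise ValueError(f"{first_shard!r} does not look like a split GGUF shard")
    if int(match["index"]) != 1:
        raise ValueError("pass the first shard (index 00001) to the merge tool")
    # Write the merged file next to the shards, named after the shared prefix.
    merged = shard.with_name(match["prefix"] + ".gguf")
    return [tool, "--merge", str(shard), str(merged)]


# Hypothetical shard path, used only to show the resulting command.
cmd = build_merge_command("models/my-model-00001-of-00003.gguf")
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the tool is installed. Building the command separately from running it makes it easy to inspect before anything touches the disk.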

To run a GGUF model with vLLM, you can download the GGUF file from TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF and serve it from the local path with the following commands:

wget https://huggingface.tw/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the base model's tokenizer to avoid slow and error-prone tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0

You can also add --tensor-parallel-size 2 to enable tensor-parallel inference across 2 GPUs:

# We recommend using the base model's tokenizer to avoid slow and error-prone tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
   --tensor-parallel-size 2
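Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server listens on `localhost` at the default port 8000 and that the served model name is the path passed to `vllm serve` (the default when no other name is configured). The helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request

# Assumptions: the server runs locally on the default port 8000, and the
# served model name defaults to the path passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no network access; sending it requires a live server.
request = build_chat_request("Hello")
print(request.full_url)
```

With the server up, `print(send_chat_request(request))` prints the model's reply. If the server was started with a different host, port, or served model name, adjust `BASE_URL` and `MODEL_NAME` to match.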

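Warning

We recommend using the base model's tokenizer instead of the GGUF model's tokenizer. Converting the tokenizer from GGUF is slow and unstable, especially for models with a large vocabulary.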

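GGUF loading assumes that Hugging Face can convert the GGUF metadata into a config file. If Hugging Face does not support your model, you can create a config manually and pass its path via --hf-config-path: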

# If your model is not supported by Hugging Face, you can manually provide a Hugging Face-compatible config path.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
   --hf-config-path TinyLlama/TinyLlama-1.1B-Chat-v1.0
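For the manual route, a config can be written by hand as a `config.json` in the usual Hugging Face layout. The sketch below assumes that `--hf-config-path` accepts a local directory containing such a file. Every value in the dictionary is an illustrative assumption for a Llama-style model, not something taken from this page, and should be replaced with the real values from your model's metadata. `write_hf_config` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values for a Llama-style model. These numbers are assumptions
# for the sake of the example; copy the real ones from your model's metadata.
config = {
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "hidden_size": 2048,
    "intermediate_size": 5632,
    "num_hidden_layers": 22,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "vocab_size": 32000,
    "max_position_embeddings": 2048,
    "rms_norm_eps": 1e-05,
}


def write_hf_config(directory: str, config: dict) -> Path:
    """Write `config.json` into `directory` and return the file path."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    config_file = target / "config.json"
    config_file.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_file


# A temporary directory keeps the example self-contained; in practice you
# would pick a stable location and pass that directory to --hf-config-path.
with tempfile.TemporaryDirectory() as tmp:
    path = write_hf_config(tmp, config)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, loaded["model_type"])
```

Reading the file back right after writing it is a cheap check that the JSON is well-formed before the directory is handed to the server.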

You can also use the GGUF model directly through the LLM entrypoint:

Code

from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
   {
      "role": "system",
      "content": "You are a helpful assistant",
   },
   {
      "role": "user",
      "content": "Hello",
   },
   {
      "role": "assistant",
      "content": "Hello! How can I assist you today?",
   },
   {
      "role": "user",
      "content": "Write an essay about the importance of higher education.",
   },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
   model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
   tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
# Generate a reply to the conversation. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)

# Print the outputs.
for output in outputs:
   prompt = output.prompt
   generated_text = output.outputs[0].text
   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")