SkyPilot¶

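vLLM can be run and scaled to multiple service replicas on clouds and Kubernetes with SkyPilot, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the SkyPilot AI gallery.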

Prerequisites¶

Go to the HuggingFace model page and request access to the model meta-llama/Meta-Llama-3-8B-Instruct.
Check that you have installed SkyPilot (docs).
Check that sky check shows that clouds or Kubernetes are enabled.

pip install skypilot-nightly
sky check

Run on a single instance¶

See the vLLM SkyPilot YAML for serving, serving.yaml.

Yaml

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  PYTHONUNBUFFERED: 1
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log &

  echo 'Waiting for vllm api server to start...'
  while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001
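
The run section above waits for the API server by polling api_server.log until Uvicorn's startup line appears. The same idea can be sketched in Python with a timeout added, so that a server that never starts does not block forever. The function name, the timeout, and the poll interval are assumptions for illustration and are not part of the recipe.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, needle="Uvicorn running on",
                      timeout_s=600.0, poll_s=1.0):
    """Return True once `needle` appears in the log file, False on timeout.

    Hypothetical helper: mirrors the shell loop that greps api_server.log.
    """
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        # The log may not exist yet if the server has not written anything.
        if path.exists() and needle in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, wait_for_log_line("api_server.log", timeout_s=900) would give the server up to 15 minutes to load the model before giving up.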

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

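Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the Llama model for text completion.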

(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

Optional: serve the 70B model instead of the default 8B and use more GPUs:

HF_TOKEN="your-huggingface-token" \
  sky launch serving.yaml \
  --gpus A100:8 \
  --env HF_TOKEN \
  --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
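
As a rough guide to why the 70B model needs more GPUs, the weights alone can be sized with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), nominal parameter counts of 8e9 and 70e9, and GPU memory of 24 GB (L4), 40 GB (A100), or 80 GB (A100-80GB). It ignores the KV cache and activations, so real requirements are higher than these lower bounds.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb, bytes_per_param=2):
    """Lower bound on the GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed nominal sizes: 8e9 and 70e9 parameters, 16-bit weights.
print(weight_memory_gb(8e9))           # 16.0 GB, fits on one 24 GB L4
print(weight_memory_gb(70e9))          # 140.0 GB, exceeds any single GPU listed
print(min_gpus_for_weights(70e9, 40))  # 4 A100 40GB at the very least
```

Under these assumptions, requesting A100:8 leaves room for the KV cache on top of the weights.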

Scale up to multiple replicas¶

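SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a service section to the YAML file.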

Yaml

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

Full YAML with the service section added:

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  PYTHONUNBUFFERED: 1
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log

Start serving the Llama-3 8B model on multiple replicas:

HF_TOKEN="your-huggingface-token" \
  sky serve up -n vllm serving.yaml \
  --env HF_TOKEN

Wait until the service is ready:

watch -n10 sky serve status vllm

Example output:

Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4

After the service is READY, you can find a single endpoint for the service and access the service through that endpoint:

Commands

ENDPOINT=$(sky serve status --endpoint vllm)
curl -L http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "stop_token_ids": [128009, 128001]
  }'
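
The same request can be sent from Python. The sketch below uses only the standard library and builds the same JSON body as the curl command. The function names are hypothetical, and the response parsing assumes the OpenAI-compatible chat completion format (choices[0].message.content). Redirect handling in urllib may differ from curl -L.

```python
import json
import urllib.request


def build_chat_request(endpoint, model, user_message):
    """Build the POST request for /v1/chat/completions (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(endpoint, model, user_message, timeout_s=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(endpoint, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With ENDPOINT set as above, chat(os.environ["ENDPOINT"], "meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?") would return the reply from whichever replica the load balancer selects (this call needs import os and a running service).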

To enable autoscaling, you could replace replicas in service with the following configuration:

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2

With this policy, the service scales up, to at most 4 replicas, when the QPS exceeds 2 per replica.
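
The arithmetic behind a QPS target can be sketched as follows. This is an illustration of the idea only, not SkyPilot's autoscaler, which may also apply delays or smoothing before it changes the replica count. The function name is hypothetical, and the defaults mirror the replica_policy above.

```python
import math


def desired_replicas(total_qps, target_qps_per_replica=2.0,
                     min_replicas=2, max_replicas=4):
    """Replicas needed to keep per-replica QPS at or below the target,
    clamped to the [min_replicas, max_replicas] range."""
    needed = math.ceil(total_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(3))    # 2: already covered by min_replicas
print(desired_replicas(5))    # 3: ceil(5 / 2)
print(desired_replicas(100))  # 4: capped at max_replicas
```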

Full YAML with autoscaling enabled:

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_completion_tokens: 1

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

envs:
  PYTHONUNBUFFERED: 1
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    2>&1 | tee api_server.log

To update the service with the new configuration:

HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN

To stop the service:

sky serve down vllm

Optional: Connect a GUI to the endpoint¶

It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across the replicas.

Yaml

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

resources:
  cpus: 2

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm

  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate vllm
  export PATH=$PATH:/sbin

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 \
    --stop-token-ids 128009,128001 | tee ~/gradio.log

Start the chat web UI:

sky launch \
  -c gui ./gui.yaml \
  --env ENDPOINT=$(sky serve status --endpoint vllm)

Then, we can access the GUI at the returned gradio link:

| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live