Using Kubernetes

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

Alternatively, you can deploy vLLM to Kubernetes using other tools and frameworks.

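Deployment with CPUs

Note

The use of CPUs here is for demonstration and testing purposes only; performance will not be on par with GPUs.

First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

Config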

cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
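stringData:
  # Expanded by the shell from the HF_TOKEN environment variable before kubectl reads the manifest.
  token: "${HF_TOKEN}"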
EOF
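
Kubernetes expects values under a Secret's `data` field to be base64-encoded, while `stringData` takes plain text. If you prefer the `data` field, the following is a minimal sketch of producing the encoded value. The token shown is a placeholder for illustration, not a real credential.

```shell
# Placeholder token for illustration only; substitute your own Hugging Face token.
HF_TOKEN="hf_example_token"

# Encode without a trailing newline and strip any line wrapping base64 may add.
HF_TOKEN_B64="$(printf '%s' "$HF_TOKEN" | base64 | tr -d '\n')"

# Round-trip check: decoding should give back the original token.
DECODED="$(printf '%s' "$HF_TOKEN_B64" | base64 -d)"

echo "encoded: $HF_TOKEN_B64"
echo "decoded: $DECODED"
```

The encoded string is what would go under `data.token` in the Secret manifest.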

Next, start the vLLM server as a Kubernetes Deployment and Service:

Config

cat <<EOF |kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes while the model downloads):

kubectl logs -l app.kubernetes.io/name=vllm
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
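
Beyond reading the logs, the rollout can be waited on and the API queried from a workstation. The sketch below assumes the `vllm-server` Deployment and Service from the manifests above, that local port 8000 is free, and that `kubectl` and `curl` are installed. The function name `check_vllm_server` is illustrative.

```shell
# Wait for the vllm-server Deployment, then query the OpenAI-compatible API
# through a temporary port-forward. Assumes local port 8000 is free.
check_vllm_server() {
  # Block until the Deployment reports a successful rollout, or time out.
  kubectl rollout status deployment/vllm-server --timeout=600s || return 1

  # Forward local port 8000 to the Service in the background.
  kubectl port-forward service/vllm-server 8000:8000 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3  # give the port-forward a moment to establish

  # List the served models; a JSON response indicates the server is up.
  status=0
  curl --silent --fail http://localhost:8000/v1/models || status=$?

  # Stop the background port-forward before returning.
  kill "$pf_pid" 2>/dev/null || true
  return "$status"
}

# Usage: check_vllm_server
```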

Deployment with GPUs

Prerequisite: Ensure that you have a running Kubernetes cluster with GPUs.
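
One way to check this prerequisite is to list the GPU resources each node advertises. The sketch below assumes the `nvidia.com/gpu` and `amd.com/gpu` resource names used in the manifests in this guide; the function name `list_gpu_capacity` is illustrative.

```shell
# Print each node with its allocatable GPU count. A value of <none> means the
# node does not advertise that resource (for example, no device plugin is installed).
list_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,NVIDIA:.status.allocatable.nvidia\.com/gpu,AMD:.status.allocatable.amd\.com/gpu'
}

# Usage: list_gpu_capacity
```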

Create a PVC, Secret and Deployment for vLLM

The PVC is used to store the model cache. It is optional; you can use hostPath or other storage options instead.

Yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem

The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.

apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"

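Next, create the deployment file for vLLM to run the model server. The following example deploys the Mistral-7B-Instruct-v0.3 model.

Here are two examples, one using NVIDIA GPUs and one using AMD GPUs.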

NVIDIA GPU

Yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 6G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5

AMD GPU

If you are using an AMD ROCm GPU such as the MI300X, you can refer to the deployment.yaml below.

Yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      # PVC
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          runAsGroup: 44
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm

You can get the full example with steps and sample YAML files from https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.

Create a Kubernetes Service for vLLM

Next, create a Kubernetes Service file to expose the mistral-7b deployment:

Yaml

apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector must match the pod labels set in the Deployment template; it is also useful for the prefix caching feature.
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP

Deploy and Test

Apply the deployment and service configurations using kubectl apply -f <filename>:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

To test the deployment, run the following curl command:

curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'

If the service is correctly deployed, you should receive a response from the vLLM model.
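
The hostname `mistral-7b.default.svc.cluster.local` resolves only inside the cluster. One way to send the request from inside the cluster is a short-lived pod. The sketch below assumes the public `curlimages/curl` image can be pulled by your cluster; the pod name `curl-test` and the function name are arbitrary choices.

```shell
# Run the completion request from a temporary pod inside the cluster.
# --rm deletes the pod once the command exits.
test_mistral_in_cluster() {
  kubectl run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl --silent http://mistral-7b.default.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
}

# Usage: test_mistral_in_cluster
```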

Troubleshooting

Startup probe or readiness probe failure, container log contains "KeyboardInterrupt: terminated"

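If the failureThreshold of the startup or readiness probe is too low for the time the server needs to start, the kubelet will kill the container. A couple of indications that this has happened: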

The container log contains "KeyboardInterrupt: terminated"
kubectl get events shows the message Container $NAME failed startup probe, will be restarted
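
Both indications can be checked with standard kubectl commands. The following sketch takes a pod name as its only argument; the function name `inspect_probe_failures` is illustrative.

```shell
# Gather evidence for a probe-related restart. Pass the pod name as $1.
inspect_probe_failures() {
  pod="$1"
  # Probe failures are recorded as events with reason "Unhealthy".
  kubectl get events --field-selector "involvedObject.name=${pod},reason=Unhealthy"
  # Logs from the previous (killed) container instance, where the
  # "KeyboardInterrupt: terminated" message would appear.
  kubectl logs "$pod" --previous --tail=50
  # Restart count and last termination state.
  kubectl describe pod "$pod"
}

# Usage: inspect_probe_failures <pod-name>
```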

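To mitigate this, increase the failureThreshold to allow more time for the model server to start serving. You can identify an ideal failureThreshold by removing the probes from the manifest and measuring how long it takes for the model server to show that it is ready to serve.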
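
A startup probe gives the container roughly failureThreshold × periodSeconds seconds to come up before it is killed. The sketch below derives a threshold from a measured startup time. The 420-second measurement and the 50% safety margin are made-up example values, and the function name is illustrative.

```shell
# Compute a startupProbe failureThreshold from a measured startup time.
# Usage: failure_threshold <startup_seconds> <period_seconds>
failure_threshold() {
  startup_seconds="$1"
  period_seconds="$2"
  # Add a 50% safety margin (an arbitrary choice), then round up to whole probes.
  budget=$(( startup_seconds * 3 / 2 ))
  echo $(( (budget + period_seconds - 1) / period_seconds ))
}

# Example: the server took 420 s to become ready and the probe runs every 10 s.
failure_threshold 420 10   # prints 63
```

The result would be set as the probe's `failureThreshold` alongside `periodSeconds: 10`.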

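Conclusion

Deploying vLLM with Kubernetes allows for efficient scaling and management of machine learning models that use GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.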