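Helm

A Helm chart to deploy vLLM on Kubernetes

Helm is a package manager for Kubernetes. It helps automate the deployment of vLLM applications on Kubernetes. With Helm, you can deploy the same architecture with different configurations to multiple namespaces by overriding variable values.

This guide walks you through deploying vLLM with Helm, including the necessary prerequisites, the Helm installation steps, and documentation of the architecture and the values file.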

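Prerequisites

Before you begin, ensure that you have the following:

A running Kubernetes cluster
NVIDIA Kubernetes Device Plugin (k8s-device-plugin): available at https://github.com/NVIDIA/k8s-device-plugin
Available GPU resources in your cluster
(Optional) An S3 bucket or other storage holding the model weights, if you use automatic model download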

Installing the chart

To install the chart with the release name test-vllm:

helm upgrade --install --create-namespace \
  --namespace=ns-vllm test-vllm . \
  -f values.yaml \
  --set secrets.s3endpoint=$ACCESS_POINT \
  --set secrets.s3bucketname=$BUCKET \
  --set secrets.s3accesskeyid=$ACCESS_KEY \
  --set secrets.s3accesskey=$SECRET_KEY
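
The four --set flags above could also be collected in a separate override file. The fragment below is a minimal sketch, not part of the chart: the file name secrets-values.yaml and every value in it are placeholders assumed for illustration, while the key names are the ones used in the command above.

```yaml
# secrets-values.yaml (hypothetical file name; all values are placeholders)
secrets:
  s3endpoint: "https://s3.example.com"   # replaces --set secrets.s3endpoint
  s3bucketname: "my-model-bucket"        # replaces --set secrets.s3bucketname
  s3accesskeyid: "REPLACE_ME"            # replaces --set secrets.s3accesskeyid
  s3accesskey: "REPLACE_ME"              # replaces --set secrets.s3accesskey
```

Such a file would be passed with an additional -f secrets-values.yaml after -f values.yaml; when several -f files are given, Helm lets later files override earlier ones. A file like this holds credentials, so it should stay out of version control.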

Uninstalling the chart

To uninstall the test-vllm deployment:

helm uninstall test-vllm --namespace=ns-vllm

This command removes all Kubernetes components associated with the chart, including persistent volumes, and deletes the release.

Architecture

Values

The following table describes the configurable parameters of the chart in values.yaml:

Key	Type	Default	Description
autoscaling	object	{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}	Autoscaling configuration
autoscaling.enabled	bool	false	Enable autoscaling
autoscaling.maxReplicas	int	100	Maximum number of replicas
autoscaling.minReplicas	int	1	Minimum number of replicas
autoscaling.targetCPUUtilizationPercentage	int	80	Target CPU utilization for autoscaling
configs	object	{}	Configmap
containerPort	int	8000	Container port
customObjects	list	[]	Custom objects configuration
deploymentStrategy	object	{}	Deployment strategy configuration
externalConfigs	list	[]	External configuration
extraContainers	list	[]	Additional containers configuration
extraInit	object	{"modelDownload":{"enabled":true},"initContainers":[],"pvcStorage":"1Gi"}	Additional configuration for init containers
extraInit.modelDownload	object	{"enabled":true}	Model download configuration
extraInit.modelDownload.enabled	bool	true	Enable the automatic model download job and wait container
extraInit.modelDownload.image	object	{"repository":"amazon/aws-cli","tag":"2.6.4","pullPolicy":"IfNotPresent"}	Image for model download operations
extraInit.modelDownload.waitContainer	object	{}	Wait container configuration (command, args, env)
extraInit.modelDownload.downloadJob	object	{}	Download job configuration (command, args, env)
extraInit.initContainers	list	[]	Custom init containers (appended after the model download containers if model download is enabled)
extraInit.pvcStorage	string	"1Gi"	Storage size of the PVC
extraInit.s3modelpath	string	"relative_s3_model_path/opt-125m"	(Optional) Path of the model on S3
extraInit.awsEc2MetadataDisabled	bool	true	(Optional) Disable the AWS EC2 metadata service
extraPorts	list	[]	Additional ports configuration
gpuModels	list	["TYPE_GPU_USED"]	Type of GPU used
image	object	{"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}	Image configuration
image.command	list	["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]	Container launch command
image.repository	string	"vllm/vllm-openai"	Image repository
image.tag	string	"latest"	Image tag
livenessProbe	object	{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}	Liveness probe configuration
livenessProbe.failureThreshold	int	3	Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not alive
livenessProbe.httpGet	object	{"path":"/health","port":8000}	Configuration of the kubelet HTTP request to the server
livenessProbe.httpGet.path	string	"/health"	Path to access on the HTTP server
livenessProbe.httpGet.port	int	8000	Name or number of the port to access on the container, on which the server is listening
livenessProbe.initialDelaySeconds	int	15	Number of seconds after the container has started before the liveness probe is initiated
livenessProbe.periodSeconds	int	10	How often (in seconds) to perform the liveness probe
maxUnavailablePodDisruptionBudget	string	""	Pod disruption budget configuration
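readinessProbe	object	{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}	Readiness probe configuration
readinessProbe.failureThreshold	int	3	Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not ready
readinessProbe.httpGet	object	{"path":"/health","port":8000}	Configuration of the kubelet HTTP request to the server
readinessProbe.httpGet.path	string	"/health"	Path to access on the HTTP server
readinessProbe.httpGet.port	int	8000	Name or number of the port to access on the container, on which the server is listening
readinessProbe.initialDelaySeconds	int	5	Number of seconds after the container has started before the readiness probe is initiated
readinessProbe.periodSeconds	int	5	How often (in seconds) to perform the readiness probe
replicaCount	int	1	Number of replicas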
resources	object	{"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}	Resource configuration
resources.limits."nvidia.com/gpu"	int	1	Number of GPUs used
resources.limits.cpu	int	4	Number of CPUs
resources.limits.memory	string	"16Gi"	CPU memory configuration
resources.requests."nvidia.com/gpu"	int	1	Number of GPUs used
resources.requests.cpu	int	4	Number of CPUs
resources.requests.memory	string	"16Gi"	CPU memory configuration
secrets	object	{}	Secrets configuration
serviceName	string	""	Service name
servicePort	int	80	Service port
labels.environment	string	"test"	Environment name
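
To illustrate how several of these keys combine in an override file, here is a minimal sketch. The key names and structure come from the table above; the specific numbers (four replicas at most, a 120-second liveness delay, two GPUs) are assumptions chosen for illustration, not recommendations from the chart.

```yaml
# Example override values (illustrative numbers, not chart defaults)
autoscaling:
  enabled: true                        # default is false
  minReplicas: 1
  maxReplicas: 4                       # default is 100
  targetCPUUtilizationPercentage: 80

resources:
  requests:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 2                  # default is 1
  limits:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 2

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120             # default is 15; assumed longer to allow for model loading
  periodSeconds: 10
  failureThreshold: 3
```

The liveness delay is raised here on the assumption that loading a larger model takes longer than the default 15 seconds; the appropriate value depends on the model and the hardware.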

Configuration examples

Using S3 model download (default)

extraInit:
  modelDownload:
    enabled: true
  pvcStorage: "10Gi"
  s3modelpath: "models/llama-7b"
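
The default image.command in the table above serves /data/ under the model name opt-125m. When s3modelpath points at a different model, the served name may need to change as well. The following is a sketch under two assumptions: that the download step places the weights under /data/, as the default command suggests, and that llama-7b is the desired served name.

```yaml
image:
  repository: "vllm/vllm-openai"
  tag: "latest"
  # Same shape as the default command; only the served model name differs.
  command:
    - "vllm"
    - "serve"
    - "/data/"
    - "--served-model-name"
    - "llama-7b"        # assumed name, matching the s3modelpath example above
    - "--host"
    - "0.0.0.0"
    - "--port"
    - "8000"
```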

Using custom init containers only

For use cases such as llm-d, where you need custom sidecars without model download:

extraInit:
  modelDownload:
    enabled: false
  initContainers:
    - name: llm-d-routing-proxy
      image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8080
          name: proxy
      securityContext:
        runAsUser: 1000
      restartPolicy: Always
  pvcStorage: "10Gi"
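
According to the values table, custom init containers are appended after the model download containers when model download is enabled, so the two examples above can in principle be combined. The fragment below is a sketch of that combination; the container name, image, and command are hypothetical placeholders and not part of the chart.

```yaml
extraInit:
  modelDownload:
    enabled: true                 # keep the S3 download job and wait container
  initContainers:
    # Hypothetical container: name, image, and command are placeholders.
    - name: post-download-step
      image: busybox:1.36
      command: ["sh", "-c", "echo 'model download step finished'"]
  pvcStorage: "10Gi"
  s3modelpath: "models/llama-7b"
```

Whether such a container can see the downloaded files depends on the volume mounts defined in the chart templates, which this page does not describe.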