KServe
vLLM can be deployed with KServe on Kubernetes for highly scalable distributed model serving.

Please see this guide for more details on using vLLM with KServe.
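Below is a minimal sketch of what such a deployment can look like, created with the Kubernetes Python client. It assumes KServe's Hugging Face serving runtime (which can use vLLM as its backend) is installed on the cluster; the service name, model ID, and resource values are illustrative, not prescriptive.

```python
# Minimal sketch: create a KServe InferenceService that serves a model via
# the Hugging Face runtime (vLLM backend). All names/values are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "vllm-demo", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},
                # Arguments are passed through to the serving runtime.
                "args": [
                    "--model_name=demo",
                    "--model_id=meta-llama/Meta-Llama-3-8B-Instruct",
                ],
                "resources": {
                    "limits": {"nvidia.com/gpu": "1"},
                    "requests": {"nvidia.com/gpu": "1"},
                },
            }
        }
    },
}

# InferenceService is a custom resource, so it is created through the
# CustomObjectsApi rather than a typed client.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)
```

The same resource can of course be written as a YAML manifest and applied with kubectl; see the guide linked above for the authoritative setup steps.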