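Profiling vLLM

Warning

Profiling is intended only for vLLM developers and maintainers, to understand the proportion of time spent in different parts of the codebase. vLLM end users should never turn on profiling, as it significantly slows down inference.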

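Profiling with PyTorch Profiler

We support tracing vLLM workers using the torch.profiler module. You can enable tracing by setting the VLLM_TORCH_PROFILER_DIR environment variable to the directory where you want the trace files saved: VLLM_TORCH_PROFILER_DIR=/mnt/traces/

The OpenAI server must also be started with the VLLM_TORCH_PROFILER_DIR environment variable set.

When using benchmarks/benchmark_serving.py, you can enable profiling by passing the --profile flag.

Trace files can be visualized using https://ui.perfetto.dev/.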

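Tip

Send only a few requests to vLLM while profiling, because trace files can become very large. There is also no need to decompress the trace files; they can be viewed directly.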
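
Because traces can grow quickly, it may help to check how much has been written after a short run before sending more requests. The following is a minimal sketch using only the Python standard library. The helper name trace_dir_size_mb is hypothetical (not part of vLLM), and the ./vllm_profile fallback is just the directory used in the examples on this page.

```python
import os
from pathlib import Path


def trace_dir_size_mb(trace_dir: str) -> float:
    """Return the total size, in MiB, of all regular files under trace_dir."""
    total_bytes = sum(
        p.stat().st_size for p in Path(trace_dir).rglob("*") if p.is_file()
    )
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Assumption: traces were written to the directory named by
    # VLLM_TORCH_PROFILER_DIR; fall back to ./vllm_profile otherwise.
    trace_dir = os.environ.get("VLLM_TORCH_PROFILER_DIR", "./vllm_profile")
    if Path(trace_dir).is_dir():
        print(f"{trace_dir}: {trace_dir_size_mb(trace_dir):.1f} MiB")
    else:
        print(f"{trace_dir} does not exist yet")
```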

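Tip

Stopping the profiler flushes all profile trace files to the directory, which takes time. For example, flushing about 100 requests' worth of data for a Llama 70B model takes roughly 10 minutes on an H100. Before starting the server, set the VLLM_RPC_TIMEOUT environment variable to a large value, for example 30 minutes: export VLLM_RPC_TIMEOUT=1800000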
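
The example value above equals 30 minutes only if VLLM_RPC_TIMEOUT is interpreted in milliseconds, which is what 1800000 implies (30 × 60 × 1000). The following small sketch derives the value from a duration in minutes. The helper name rpc_timeout_ms is hypothetical, and the millisecond unit is an assumption inferred from the example value.

```python
def rpc_timeout_ms(minutes: float) -> int:
    """Convert minutes to milliseconds (the unit assumed for VLLM_RPC_TIMEOUT)."""
    return int(minutes * 60 * 1000)


if __name__ == "__main__":
    # Prints a line that can be pasted into the shell before starting the server.
    print(f"export VLLM_RPC_TIMEOUT={rpc_timeout_ms(30)}")
```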

Example commands and usage

Offline inference

Refer to examples/offline_inference/simple_profiling.py for an example.

OpenAI server

VLLM_TORCH_PROFILER_DIR=./vllm_profile \
    python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B

benchmark_serving.py

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-70B \
    --dataset-name sharegpt \
    --dataset-path sharegpt.json \
    --profile \
    --num-prompts 2

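Profiling with NVIDIA Nsight Systems

Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.

Install nsight-systems using your package manager. The following is an example for Ubuntu.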

apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli

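Example commands and usage

Offline inference

For basic usage, simply prepend nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node to any existing offline inference script invocation.

The following is an example using the benchmarks/benchmark_latency.py script: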

nsys profile -o report.nsys-rep \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-iters-warmup 5 \
    --num-iters 1 \
    --batch-size 16 \
    --input-len 512 \
    --output-len 8

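OpenAI server

To profile the server, prepend nsys profile to the vllm serve command just as for offline inference, but you must also specify --delay XX --duration YY according to the needs of your benchmark. The server is killed once the duration elapses.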

# server
nsys profile -o report.nsys-rep \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    --delay 30 \
    --duration 60 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct

# client
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 1 \
    --dataset-name random \
    --random-input 1024 \
    --random-output 512

In practice, you should set the --duration argument to a large value. Whenever you want the server to stop profiling, run:

nsys sessions list

to get the session ID in the form profile-XXXXX, then run:

nsys stop --session=profile-XXXXX

to manually stop the profiler and generate your nsys-rep report.

Analysis

You can view these profiles either as summaries in the CLI, using nsys stats [profile-file], or in the GUI after installing Nsight Systems locally following NVIDIA's installation instructions.

CLI example

nsys stats report1.nsys-rep
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):

Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
--------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
    46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
    14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
    12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
     9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
     5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
     4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
     2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
     1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
     0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
...
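
One way to read a summary like the one above is to group kernels by family and add up their Time (%) values. The sketch below does this for the nine rows shown. The percentages are copied from the sample output, while the group labels are shortened name prefixes chosen here for illustration and the helper share_by_group is hypothetical. In this sample the four sm90_xmma_gemm kernels together account for about 82.7% of GPU kernel time.

```python
from collections import defaultdict

# (Time %, label) pairs copied from the sample output above. Labels are
# shortened prefixes of the (truncated) kernel names, used only for grouping.
ROWS = [
    (46.3, "sm90_xmma_gemm"),
    (14.8, "sm90_xmma_gemm"),
    (12.1, "sm90_xmma_gemm"),
    (9.5, "sm90_xmma_gemm"),
    (5.0, "vllm::act_and_mul_kernel"),
    (4.1, "flash::FlashAttnFwdSm90"),
    (2.6, "vllm::fused_add_rms_norm"),
    (1.9, "vllm::rotary_embedding_kernel"),
    (0.7, "vllm::reshape_and_cache_flash_kernel"),
]


def share_by_group(rows):
    """Sum the time percentages per label, rounded to one decimal place."""
    totals = defaultdict(float)
    for percent, label in rows:
        totals[label] += percent
    return {label: round(total, 1) for label, total in totals.items()}


if __name__ == "__main__":
    shares = share_by_group(ROWS)
    # Print groups from largest to smallest share.
    for label, percent in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{percent:5.1f}%  {label}")
```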

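GUI example

Profiling vLLM Python code

The Python standard library includes cProfile for profiling Python code. vLLM includes a couple of helpers that make it easy to apply cProfile to a specific section of vLLM. Both vllm.utils.cprofile and vllm.utils.cprofile_context can be used to profile a section of code.

Example usage - decorator

The first helper is a Python decorator that can be used to profile a function. If a filename is specified, the profile data is saved to that file. If no filename is specified, the profile data is printed to stdout.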

import vllm.utils

@vllm.utils.cprofile("expensive_function.prof")
def expensive_function():
    # some expensive code
    pass

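Example usage - context manager

The second helper is a context manager that can be used to profile a block of code. As with the decorator, the filename is optional.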

import vllm.utils

def another_function():
    # more expensive code
    pass

with vllm.utils.cprofile_context("another_function.prof"):
    another_function()
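
To make the behavior described above more concrete, here is a minimal sketch of how helpers of this kind could be built on the standard library's cProfile and pstats modules. This is an illustration under assumptions, not vLLM's actual implementation; the names profile_block and profile_function are hypothetical.

```python
import cProfile
import contextlib
import functools
import pstats
from typing import Callable, Optional


@contextlib.contextmanager
def profile_block(save_file: Optional[str] = None):
    """Profile the enclosed block; dump to save_file, or print stats if None."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        if save_file:
            # Binary format readable by pstats and viewers such as snakeviz.
            profiler.dump_stats(save_file)
        else:
            # No file given: print the ten most expensive entries to stdout.
            pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)


def profile_function(save_file: Optional[str] = None) -> Callable:
    """Decorator form: profile every call of the wrapped function."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with profile_block(save_file):
                return func(*args, **kwargs)

        return wrapper

    return decorator
```

In this sketch the decorator simply reuses the context manager, so both forms share the same save-or-print behavior.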

Analyzing profile results

Several tools can help analyze the profile results. One example is snakeviz.

pip install snakeviz
snakeviz expensive_function.prof
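
If installing a viewer is not an option, the standard library's pstats module can read the same .prof files. The sketch below writes a small profile and prints the most expensive entries by cumulative time. The busy_work function, the top_entries helper, and the busy_work.prof file name are placeholders for illustration.

```python
import cProfile
import io
import pstats


def busy_work(n: int = 50_000) -> int:
    """Placeholder workload so there is something to profile."""
    return sum(i * i for i in range(n))


def top_entries(prof_path: str, limit: int = 5) -> str:
    """Return a text report of the `limit` entries with the highest cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.runcall(busy_work)
    profiler.dump_stats("busy_work.prof")
    print(top_entries("busy_work.prof"))
```

The same top_entries call should work on a file written by the decorator or context manager shown earlier, for example expensive_function.prof.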