多模態支援¶
本文將引導您完成擴充套件基礎模型以使其接受 多模態輸入 的步驟。
1. 更新基礎 vLLM 模型¶
假定您已按照 這些步驟 在 vLLM 中實現了該模型。請進一步更新模型,如下所示:
-
實現 get_placeholder_str 以定義在文字提示中用於表示多模態專案的佔位符字串。這應與模型的聊天模板一致。
-
在 forward 中為每個對應於多模態輸入的輸入張量保留一個關鍵字引數,如以下示例所示:
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
更方便地,您可以直接將 **kwargs
傳遞給 forward 方法,並從中檢索多模態輸入的關鍵字引數。
-
實現 get_multimodal_embeddings,該方法透過模型的多模態分詞器執行多模態輸入並返回其嵌入。下面我們提供了一個典型實現模式的樣板,但您可以根據自己的需要進行調整。
程式碼
class YourModelForImage2Seq(nn.Module): ... def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor: assert self.vision_encoder is not None image_features = self.vision_encoder(image_input) return self.multi_modal_projector(image_features) def get_multimodal_embeddings( self, **kwargs: object) -> Optional[MultiModalEmbeddings]: # Validate the multimodal input keyword arguments image_input = self._parse_and_validate_image_input(**kwargs) if image_input is None: return None # Run multimodal inputs through encoder and projector vision_embeddings = self._process_image_input(image_input) return vision_embeddings
重要
返回的 multimodal_embeddings
必須是形狀為 (num_items, feature_size, hidden_size)
的 3D torch.Tensor,或是形狀為 (feature_size, hidden_size)
的 2D torch.Tensor 的列表/元組,以便 multimodal_embeddings[i]
檢索從請求的第 i
個多模態資料項(例如影像)生成的嵌入。
-
實現 get_input_embeddings 以將
multimodal_embeddings
與來自input_ids
的文字嵌入合併。如果模型的輸入處理實現正確(參見下面的章節),則可以利用我們提供的實用函式輕鬆合併嵌入。程式碼
from .utils import merge_multimodal_embeddings class YourModelForImage2Seq(nn.Module): ... def get_input_embeddings( self, input_ids: torch.Tensor, multimodal_embeddings: Optional[MultiModalEmbeddings] = None, ) -> torch.Tensor: # `get_input_embeddings` should already be implemented for the language # model as one of the requirements of basic vLLM model implementation. inputs_embeds = self.language_model.get_input_embeddings(input_ids) if multimodal_embeddings is not None: inputs_embeds = merge_multimodal_embeddings( input_ids=input_ids, inputs_embeds=inputs_embeds, multimodal_embeddings=multimodal_embeddings, placeholder_token_id=self.config.image_token_index) return inputs_embeds
-
實現 get_language_model getter 以提供對底層語言模型的穩定訪問。
-
完成上述步驟後,使用 SupportsMultiModal 介面更新模型類。
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
注意
模型類不必命名為 *ForCausalLM
。檢視 HuggingFace Transformers 文件 以獲取一些示例。
2. 指定處理資訊¶
接下來,建立 BaseProcessingInfo 的子類,以提供與 HF 處理相關的基本資訊。
最大輸入項數¶
您需要覆蓋抽象方法 get_supported_mm_limits,以返回模型支援的每種模態的最大輸入項數。
例如,如果模型支援任意數量的影像,但每個提示只支援一個影片
def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
return {"image": None, "video": 1}
3. 指定虛擬輸入¶
然後,繼承 BaseDummyInputsBuilder 以構建用於 HF 處理和記憶體分析的虛擬輸入。
用於記憶體分析¶
覆蓋抽象方法 get_dummy_text 和 get_dummy_mm_data 以構建用於記憶體分析的虛擬輸入。這些虛擬輸入應導致模型在最壞情況下的記憶體使用量,以便 vLLM 可以為其保留正確的記憶體量。
假設記憶體使用量隨令牌數量增加,則可以構建虛擬輸入以最大化輸出嵌入的數量,這與佔位符特徵令牌的數量相同。
檢視 HF 的 LlavaForConditionalGeneration
程式碼
程式碼
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]
if n_image_tokens != n_image_features:
raise ValueError(
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
)
special_image_mask = (
(input_ids == self.config.image_token_index)
.unsqueeze(-1)
.expand_as(inputs_embeds)
.to(inputs_embeds.device)
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
每張影像的佔位符特徵令牌數量是 image_features.shape[1]
。image_features
在 get_image_features
方法內部計算。
程式碼
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
selected_image_feature = selected_image_feature
else:
raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
image_features = self.multi_modal_projector(selected_image_feature)
return image_features
我們可以推斷 image_features.shape[1]
基於視覺塔(針對 llava-hf/llava-1.5-7b-hf
模型,是 CLIPVisionModel
)的 image_outputs.hidden_states.shape[1]
輸出。此外,我們只需要序列長度(張量的第二個維度)即可獲取 image_features.shape[1]
。序列長度由 CLIPVisionTransformer
中的初始隱藏狀態決定,因為注意力機制不改變輸出隱藏狀態的序列長度。
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
hidden_states = self.pre_layrnorm(hidden_states)
encoder_outputs = self.encoder(
inputs_embeds=hidden_states,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
為了找到序列長度,我們檢視 CLIPVisionEmbeddings
的程式碼
程式碼
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
target_dtype = self.patch_embedding.weight.dtype
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
class_embeds = self.class_embedding.expand(batch_size, 1, -1)
embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
if interpolate_pos_encoding:
embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
else:
embeddings = embeddings + self.position_embedding(self.position_ids)
return embeddings
我們可以推斷 embeddings.shape[1] == self.num_positions
,其中
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches + 1
總的來說,一張影像的佔位符特徵令牌數量可以計算為:
程式碼
def get_num_image_tokens(
self,
*,
image_width: int,
image_height: int,
) -> int:
hf_config = self.get_hf_config()
hf_processor = self.get_hf_processor()
image_size = hf_config.vision_config.image_size
patch_size = hf_config.vision_config.patch_size
num_image_tokens = (image_size // patch_size) ** 2 + 1
if hf_processor.vision_feature_select_strategy == "default":
num_image_tokens -= 1
return num_image_tokens
請注意,影像令牌的數量不依賴於影像的寬度和高度。我們可以簡單地使用虛擬的 image_size
來計算多模態分析資料。
程式碼
# NOTE: In actuality, this is usually implemented as part of the
# model's subclass of `BaseProcessingInfo`, but we show it as is
# here for simplicity.
def get_image_size_with_most_features(self) -> ImageSize:
hf_config = self.get_hf_config()
width = height = hf_config.image_size
return ImageSize(width=width, height=height)
def get_dummy_mm_data(
self,
seq_len: int,
mm_counts: Mapping[str, int],
) -> MultiModalDataDict:
num_images = mm_counts.get("image", 0)
target_width, target_height = \
self.info.get_image_size_with_most_features()
return {
"image":
self._get_dummy_images(width=target_width,
height=target_height,
num_images=num_images)
}
對於文字,我們只需從模型配置中擴充套件多模態影像令牌,以匹配所需的影像數量。
檢視 HF 的 FuyuForCausalLM
程式碼
程式碼
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
if image_patches is not None and past_key_values is None:
patch_embeddings = [
self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
.squeeze(0)
.to(inputs_embeds.device)
for patch in image_patches
]
inputs_embeds = self.gather_continuous_embeddings(
word_embeddings=inputs_embeds,
continuous_embeddings=patch_embeddings,
image_patch_input_indices=image_patches_indices,
)
批處理中第 i
個項的佔位符特徵令牌數量是 patch_embeddings[i].shape[0]
,與 image_patches[i].shape[0]
相同,即 num_total_patches
。
與 LLaVA 不同,Fuyu 未在建模檔案中定義補丁數量。我們可以在哪裡獲取更多資訊?考慮到模型輸入來自 FuyuProcessor
的輸出,我們來檢視預處理檔案。
影像輸出是透過在 FuyuProcessor
內部呼叫 FuyuImageProcessor.preprocess
然後 FuyuImageProcessor.preprocess_with_tokenizer_info
獲得的。
在 FuyuImageProcessor.preprocess
中,影像被調整大小並填充到目標 FuyuImageProcessor.size
,並返回調整大小後(但填充前)的尺寸作為元資料。
程式碼
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
batch_images = image_encoding["images"]
image_unpadded_heights = image_encoding["image_unpadded_heights"]
image_unpadded_widths = image_encoding["image_unpadded_widths"]
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
if do_resize:
batch_images = [
[self.resize(image, size=size, input_data_format=input_data_format) for image in images]
for images in batch_images
]
image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
if do_pad:
batch_images = [
[
self.pad_image(
image,
size=size,
mode=padding_mode,
constant_values=padding_value,
input_data_format=input_data_format,
)
for image in images
]
for images in batch_images
]
在 FuyuImageProcessor.preprocess_with_tokenizer_info
中,影像根據此元資料被分割成補丁:
程式碼
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
image_input=tensor_batch_images,
image_present=image_present,
image_unpadded_h=image_unpadded_heights,
image_unpadded_w=image_unpadded_widths,
image_placeholder_id=image_placeholder_id,
image_newline_id=image_newline_id,
variable_sized=True,
)
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
image_height, image_width = image.shape[1], image.shape[2]
if variable_sized: # variable_sized=True
new_h = min(
image_height,
math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
)
new_w = min(
image_width,
math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
)
image = image[:, :new_h, :new_w]
image_height, image_width = new_h, new_w
num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
tensor_of_image_ids = torch.full(
[num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
)
patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
assert num_patches == patches.shape[0]
補丁的數量又由 FuyuImageProcessor.get_num_patches
定義
程式碼
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
patch_size = patch_size if patch_size is not None else self.patch_size
patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
if image_height % patch_height != 0:
raise ValueError(f"{image_height=} must be divisible by {patch_height}")
if image_width % patch_width != 0:
raise ValueError(f"{image_width=} must be divisible by {patch_width}")
num_patches_per_dim_h = image_height // patch_height
num_patches_per_dim_w = image_width // patch_width
num_patches = num_patches_per_dim_h * num_patches_per_dim_w
這些影像補丁對應於佔位符令牌(|SPEAKER|
)。因此,我們只需最大化影像補丁的數量。由於輸入影像首先被調整大小以適應 image_processor.size
,我們可以透過輸入尺寸等於 image_processor.size
的影像來最大化影像補丁的數量。
def get_image_size_with_most_features(self) -> ImageSize:
image_processor = self.get_image_processor()
return ImageSize(width=image_processor.size["width"],
height=image_processor.size["height"])
Fuyu 不期望 HF 處理器輸入中包含影像佔位符,因此無論影像數量如何,虛擬提示文字都是空的。
對於多模態影像分析資料,其邏輯與 LLaVA 非常相似
程式碼
def get_dummy_mm_data(
self,
seq_len: int,
mm_counts: Mapping[str, int],
) -> MultiModalDataDict:
target_width, target_height = \
self.info.get_image_size_with_most_features()
num_images = mm_counts.get("image", 0)
return {
"image":
self._get_dummy_images(width=target_width,
height=target_height,
num_images=num_images)
}
4. 指定處理詳情¶
之後,建立 BaseMultiModalProcessor 的子類,以補充 HF 處理中缺失的詳細資訊。
資訊
多模態欄位¶
覆蓋 _get_mm_fields_config,以返回 HF 處理器輸出的與輸入多模態專案相關的張量模式。
CLIPImageProcessor
的輸出是一個簡單張量,形狀為 (num_images, num_channels, image_height, image_width)
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
images = [
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
for image in all_images
]
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
因此,我們覆蓋 _get_mm_fields_config 如下所示:
def _get_mm_fields_config(
self,
hf_inputs: BatchFeature,
hf_processor_mm_kwargs: Mapping[str, object],
) -> Mapping[str, MultiModalFieldConfig]:
return dict(
pixel_values=MultiModalFieldConfig.batched("image"),
)
注意
我們的 實際程式碼 還支援預計算的影像嵌入,可以透過 image_embeds
引數傳遞給模型。
FuyuImageProcessor.preprocess_with_tokenizer_info
的 image_patches
輸出將批處理中每個專案對應的影像補丁進行拼接
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
image_input_ids.append(tensor_of_image_ids)
image_patches.append(patches)
else:
image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
batch_image_input_ids.append(image_input_ids)
batch_image_patches.append(image_patches)
FuyuImageProcessor
輸出的 image_patches
的形狀因此是 (1, num_images, num_patches, patch_width * patch_height * num_channels)
。
為了支援像 LLaVA 中使用 MultiModalFieldConfig.batched,我們透過覆蓋 BaseMultiModalProcessor._call_hf_processor 來移除額外的批處理維度
程式碼
def _call_hf_processor(
self,
prompt: str,
mm_data: Mapping[str, object],
mm_kwargs: Mapping[str, object],
tok_kwargs: Mapping[str, object],
) -> BatchFeature:
processed_outputs = super()._call_hf_processor(
prompt=prompt,
mm_data=mm_data,
mm_kwargs=mm_kwargs,
tok_kwargs=tok_kwargs,
)
image_patches = processed_outputs.get("image_patches")
if image_patches is not None:
images = mm_data["images"]
assert isinstance(images, list)
# Original output: (1, num_images, Pn, Px * Py * C)
# New output: (num_images, Pn, Px * Py * C)
assert (isinstance(image_patches, list)
and len(image_patches) == 1)
assert (isinstance(image_patches[0], torch.Tensor)
and len(image_patches[0]) == len(images))
processed_outputs["image_patches"] = image_patches[0]
return processed_outputs
注意
我們的 實際程式碼 對純文字輸入有特殊處理,以防止 HF 處理器發出不必要的警告。
注意
_call_hf_processor
方法為處理指定了 mm_kwargs
和 tok_kwargs
。mm_kwargs
用於初始化和呼叫 huggingface 處理器,而 tok_kwargs
僅用於呼叫 huggingface 處理器。
這使得我們能夠如下覆蓋 _get_mm_fields_config:
提示詞更新¶
覆蓋 _get_prompt_updates 以返回 PromptUpdate 例項的列表。
每個 PromptUpdate 例項指定了 HF 處理器執行的更新操作(例如:插入、替換)。
檢視 HF 的 LlavaProcessor
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
prompt_strings = []
for sample in text:
sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
prompt_strings.append(sample)
它簡單地將每個輸入 image_token
重複多次,次數等於佔位符特徵令牌的數量(num_image_tokens
)。基於此,我們如下覆蓋 _get_prompt_updates:
程式碼
def _get_prompt_updates(
self,
mm_items: MultiModalDataItems,
hf_processor_mm_kwargs: Mapping[str, object],
out_mm_kwargs: MultiModalKwargs,
) -> Sequence[PromptUpdate]:
hf_config = self.info.get_hf_config()
image_token_id = hf_config.image_token_index
def get_replacement(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
num_image_tokens = self.info.get_num_image_tokens(
image_width=image_size.width,
image_height=image_size.height,
)
return [image_token_id] * num_image_tokens
return [
PromptReplacement(
modality="image",
target=[image_token_id],
replacement=get_replacement,
),
]
回顧步驟 2 中的特徵令牌佈局
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
...
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
我們定義一個輔助函式直接返回 ncols
和 nrows
程式碼
def get_image_feature_grid_size(
self,
*,
image_width: int,
image_height: int,
) -> tuple[int, int]:
image_processor = self.get_image_processor()
target_width = image_processor.size["width"]
target_height = image_processor.size["height"]
patch_width = image_processor.patch_size["width"]
patch_height = image_processor.patch_size["height"]
if not (image_width <= target_width and image_height <= target_height):
height_scale_factor = target_height / image_height
width_scale_factor = target_width / image_width
optimal_scale_factor = min(height_scale_factor, width_scale_factor)
image_height = int(image_height * optimal_scale_factor)
image_width = int(image_width * optimal_scale_factor)
ncols = math.ceil(image_width / patch_width)
nrows = math.ceil(image_height / patch_height)
return ncols, nrows
基於此,我們最初可以將替換令牌定義為:
程式碼
def get_replacement(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
# `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
# `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
然而,這並不完全正確。在呼叫 FuyuImageProcessor.preprocess_with_tokenizer_info
後,一個 BOS 令牌(<s>
)也會被新增到提示中
程式碼
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
image_input=tensor_batch_images,
image_present=image_present,
image_unpadded_h=image_unpadded_heights,
image_unpadded_w=image_unpadded_widths,
image_placeholder_id=image_placeholder_id,
image_newline_id=image_newline_id,
variable_sized=True,
)
prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
tokenizer=self.tokenizer,
prompts=prompts,
scale_factors=scale_factors,
max_tokens_to_generate=self.max_tokens_to_generate,
max_position_embeddings=self.max_position_embeddings,
add_BOS=True,
add_beginning_of_answer_token=True,
)
為了僅將視覺嵌入分配給影像令牌,您可以返回一個 PromptUpdateDetails 例項,而不是字串。
程式碼
hf_config = self.info.get_hf_config()
bos_token_id = hf_config.bos_token_id # `<s>`
assert isinstance(bos_token_id, int)
def get_replacement_fuyu(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
[_NEWLINE_TOKEN_ID]) * nrows
return PromptUpdateDetails.select_token_id(
image_tokens + [bos_token_id],
embed_token_id=_IMAGE_TOKEN_ID,
)
最後,注意到 HF 處理器從分詞後的提示中移除了 |ENDOFTEXT|
令牌,我們可以搜尋它以在字串的開頭進行替換
程式碼
def _get_prompt_updates(
self,
mm_items: MultiModalDataItems,
hf_processor_mm_kwargs: Mapping[str, object],
out_mm_kwargs: MultiModalKwargs,
) -> Sequence[PromptUpdate]:
hf_config = self.info.get_hf_config()
bos_token_id = hf_config.bos_token_id
assert isinstance(bos_token_id, int)
tokenizer = self.info.get_tokenizer()
eot_token_id = tokenizer.bos_token_id
assert isinstance(eot_token_id, int)
def get_replacement_fuyu(item_idx: int):
images = mm_items.get_items("image", ImageProcessorItems)
image_size = images.get_image_size(item_idx)
ncols, nrows = self.info.get_image_feature_grid_size(
image_width=image_size.width,
image_height=image_size.height,
)
image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
[_NEWLINE_TOKEN_ID]) * nrows
return PromptUpdateDetails.select_token_id(
image_tokens + [bos_token_id],
embed_token_id=_IMAGE_TOKEN_ID,
)
return [
PromptReplacement(
modality="image",
target=[eot_token_id],
replacement=get_replacement_fuyu,
)
]
5. 註冊處理器相關類¶
在您定義了 BaseProcessingInfo(步驟 2)、 BaseDummyInputsBuilder(步驟 3)和 BaseMultiModalProcessor(步驟 4)之後,使用 MULTIMODAL_REGISTRY.register_processor 裝飾模型類,將它們註冊到多模態登錄檔。
from vllm.model_executor.models.interfaces import SupportsMultiModal
+ from vllm.multimodal import MULTIMODAL_REGISTRY
+ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
+ info=YourProcessingInfo,
+ dummy_inputs=YourDummyInputsBuilder)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
備註¶
不替換地插入特徵令牌¶
某些 HF 處理器直接插入特徵令牌,而不替換原始提示中的任何內容。在這種情況下,您可以在 _get_prompt_updates 內部使用 PromptInsertion 而不是 PromptReplacement。
示例
- BLIP-2(在提示開頭插入): vllm/model_executor/models/blip2.py
- Florence2(在提示開頭插入): vllm/model_executor/models/florence2.py
- Molmo(在
<|endoftext|>
令牌後插入): vllm/model_executor/models/molmo.py
處理與多模態資料無關的提示詞更新¶
_get_prompt_updates 假設每次提示更新都對應一個多模態專案。如果 HF 處理器執行額外的處理,而不管有多少多模態專案,您應該覆蓋 _apply_hf_processor_tokens_only,以便處理後的令牌輸入與在文字輸入上應用 HF 處理器得到的結果一致。這是因為根據 我們的設計,令牌輸入會繞過 HF 處理器。
示例
- Chameleon(附加
sep_token
): vllm/model_executor/models/chameleon.py - Fuyu(附加
boa_token
): vllm/model_executor/models/fuyu.py - Molmo(應用了未在其他地方定義的聊天模板): vllm/model_executor/models/molmo.py
自定義 HF 處理器¶
某些模型在 HF Hub 上沒有定義 HF 處理器類。在這種情況下,您可以定義一個與 HF 處理器具有相同呼叫簽名的自定義 HF 處理器,並將其傳遞給 _call_hf_processor。
示例
- DeepSeek-VL2: vllm/model_executor/models/deepseek_vl2.py
- InternVL: vllm/model_executor/models/internvl.py
- Qwen-VL: vllm/model_executor/models/qwen_vl.py