A Deeper Look at the Multimodal Large Model Qwen2

Author: 一起AI技术

Published 2024-11-15 15:09


Preface

In this chapter we take a deeper look at Qwen2-VL and try out its multimodal video-processing capability.

References

Paper title: "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution"

Paper URL: https://arxiv.org/pdf/2409.12191

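Reading the Paper

Key points of the paper

According to the Qwen2-VL paper, the model introduces three key upgrades to strengthen its perception and understanding of visual information, including video:

Naive Dynamic Resolution: lets the model process images of any resolution without changing the model structure.
Multimodal Rotary Position Embedding (M-RoPE): encodes position along three dimensions (time, height, width), which models the positional information of multimodal inputs.
Unified image and video understanding: mixed training on image and video data gives the model proficiency in both image and video understanding.

Upgrade 1: Naive Dynamic Resolution

Model architecture

[Figure: Qwen2-VL model architecture]

Original text from the paper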

Naive Dynamic Resolution: A key architectural improvement in Qwen2-VL is the introduction of naive dynamic resolution support (Dehghani et al., 2024). Unlike Qwen-VL, Qwen2-VL can now process images of any resolution, dynamically converting them into a variable number of visual tokens. To support this feature, we modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE (Su et al., 2024; Su, 2021) to capture the two-dimensional positional information of images. At the inference stage, images of varying resolutions are packed into a single sequence, with the packed length controlled to limit GPU memory usage. Furthermore, to reduce the visual tokens of each image, a simple MLP layer is employed after the ViT to compress adjacent 2 × 2 tokens into a single token, with the special <|vision_start|> and <|vision_end|> tokens placed at the beginning and end of the compressed visual tokens. As a result, an image with a resolution of 224 × 224, encoded with a ViT using patch_size=14, will be compressed to 66 tokens before entering LLM.

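Paraphrase

Naive Dynamic Resolution is one of the key architectural improvements in Qwen2-VL. Unlike its predecessor, Qwen2-VL can process images of any resolution and dynamically convert them into a variable number of visual tokens. To support this, the authors modified the ViT, removing the original absolute position embeddings and introducing 2D-RoPE to capture the two-dimensional positional information of images. At inference time, images of different resolutions are packed into a single sequence, with the packed length capped to limit GPU memory usage. To reduce the number of visual tokens per image, a simple MLP layer after the ViT compresses each group of adjacent 2×2 tokens into one token, and the special <|vision_start|> and <|vision_end|> tokens are placed at the start and end of the compressed visual tokens. As a result, a 224×224 image encoded with patch_size = 14 is compressed to 66 tokens before entering the LLM.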

Interpretation

Image patches: in a Vision Transformer (ViT), the image is split into small patches. patch_size = 14 means each patch is 14×14 pixels.

Image resolution: suppose the input image is 224×224 pixels.
Number of patches:

Horizontal: 224 / 14 = 16

Vertical: 224 / 14 = 16, so the total is 16 × 16 = 256 patches.

Compressing the visual tokens: to reduce the number of visual tokens fed to the language model, Qwen2-VL uses a simple MLP layer that merges each adjacent 2×2 group of patches into one visual token. Each 2×2 group contains 4 patches, so the 256 patches become 256 / 4 = 64 visual tokens.
Special tokens: two special tokens, <|vision_start|> and <|vision_end|>, are added around the compressed visual token sequence to mark where the visual information begins and ends. The final count is therefore 64 + 2 = 66 tokens.
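The arithmetic above can be written as a small helper. This is only a sketch of the counting rule described in the paper; the function name `visual_token_count` and its parameters are made up for illustration, and it assumes the image sides are already multiples of patch_size × merge_size.

```python
def visual_token_count(height, width, patch_size=14, merge_size=2, special_tokens=2):
    """Count the tokens one image contributes, following the 224x224 example.

    Hypothetical helper: assumes height and width are already multiples of
    patch_size * merge_size (28 by default).
    """
    block = patch_size * merge_size
    if height % block or width % block:
        raise ValueError("height and width must be multiples of patch_size * merge_size")
    patches = (height // patch_size) * (width // patch_size)  # 16 * 16 = 256
    merged = patches // (merge_size * merge_size)             # 256 / 4 = 64
    return merged + special_tokens                            # 64 + 2 = 66


print(visual_token_count(224, 224))  # 66
```

Doubling each side to 448×448 quadruples the merged tokens (256) while the two special tokens stay fixed, giving 258.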

Upgrade 2: Multimodal Rotary Position Embedding

Model architecture

[Figure: illustration of M-RoPE]

Original text from the paper

Multimodal Rotary Position Embedding (M-RoPE): Another key architectural enhancement is the innovation of Multimodal Rotary Position Embedding (M-RoPE). Unlike the traditional 1D-RoPE in LLMs, which is limited to encoding one-dimensional positional information, M-RoPE effectively models the positional information of multimodal inputs. This is achieved by deconstructing the original rotary embedding into three components: temporal, height, and width. For text inputs, these components utilize identical position IDs, making M-RoPE functionally equivalent to 1D-RoPE (Su, 2024). When processing images, the temporal IDs of each visual token remain constant, while distinct IDs are assigned to the height and width components based on the token's position in the image. For videos, which are treated as sequences of frames, the temporal ID increments for each frame, while the height and width components follow the same ID assignment pattern as images. In scenarios where the model's input encompasses multiple modalities, position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one. An illustration of M-RoPE is shown in Figure 3. M-RoPE not only enhances the modeling of positional information but also reduces the value of position IDs for images and videos, enabling the model to extrapolate to longer sequences during inference.

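Paraphrase

Multimodal Rotary Position Embedding (M-RoPE) is the other key architectural enhancement. The traditional 1D-RoPE used in large language models can only encode one-dimensional position, whereas M-RoPE models the positional information of multimodal inputs. It does this by decomposing the original rotary embedding into three components: temporal, height, and width. For text inputs, the three components use identical position IDs, which makes M-RoPE functionally equivalent to 1D-RoPE. For images, the temporal ID of every visual token stays constant, while the height and width components get different IDs depending on the token's position in the image. Videos are treated as sequences of frames: the temporal ID increments with each frame, and the height and width components follow the same assignment pattern as for images. When the input contains several modalities, the position numbering of each modality starts at the maximum position ID of the preceding modality plus one. Figure 3 of the paper illustrates M-RoPE. Besides improving the modeling of positional information, M-RoPE keeps the numeric values of the position IDs for images and videos small, which lets the model extrapolate to longer sequences at inference time.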

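Interpretation

Position embedding: a position embedding tells the model where each element of the input sits. For text, for example, the model needs to know that in "I love you", "I" is the first word and "love" is the second.
M-RoPE: the M-RoPE introduced by Qwen2-VL is a richer scheme that handles images and video as well as text. It splits the position embedding into three components:

Temporal: used for video or other sequential data; represents the order of the frames.

Height and width: used for images; represent the position (row and column) of each visual token within the image.

Handling of different data types:

Text input:

Identical IDs: for each text token the three components share the same position ID, and that ID increases with the token's order in the sentence, so M-RoPE behaves like 1D-RoPE.

Image input:
Constant temporal ID: every visual token in an image keeps the same temporal ID, while the height and width IDs vary with the token's position in the image. For example, the top-left token might be (1, 1) and the bottom-right token (16, 16).
Video input:
Incrementing temporal ID: each frame gets a different temporal ID reflecting its order in the sequence, while the height and width components of each frame are assigned in the same way as for an image.

ID initialization across modalities: when the input mixes modalities, such as text and images together, M-RoPE gives each modality its own starting position ID. For example, the text that follows an image starts numbering at the image's maximum position ID plus one, which avoids collisions.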
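To make the ID assignment rules concrete, here is a simplified sketch in plain Python. The function `mrope_position_ids` and its segment format are hypothetical and exist only for this illustration; the actual implementation in the model code works on tensors and may differ in details such as grid sizes after patch merging.

```python
def mrope_position_ids(segments):
    """Assign (temporal, height, width) position IDs to a mixed sequence.

    segments is a list of ("text", n_tokens) or ("vision", frames, rows, cols).
    Simplified illustration of the rules described above, not the real code.
    """
    ids = []
    start = 0  # first free position ID for the next segment
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: all three components share the same, increasing ID.
            ids.extend((start + i,) * 3 for i in range(n))
            start += n
        else:
            _, frames, rows, cols = seg
            # Vision: temporal ID per frame, height/width IDs per row/column.
            for t in range(frames):
                for h in range(rows):
                    for w in range(cols):
                        ids.append((start + t, start + h, start + w))
            # Next modality starts at (max ID of this segment) + 1.
            start += max(frames, rows, cols)
    return ids


example = mrope_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 2)])
for triple in example:
    print(triple)
```

In this example the four image tokens occupy only two new ID values (2 and 3) instead of four, which is the "smaller position IDs" effect the paper mentions.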

Upgrade 3: Unified Image and Video Understanding

Original text from the paper

Unified Image and Video Understanding: Qwen2-VL employs a mixed training regimen incorporating both image and video data, ensuring proficiency in image understanding and video comprehension. To preserve video information as completely as possible, we sampled each video at two frames per second. Additionally, we integrated 3D convolutions (Carreira and Zisserman, 2017) with a depth of two to process video inputs, allowing the model to handle 3D tubes instead of 2D patches, thus enabling it to process more video frames without increasing the sequence length (Arnab et al., 2021). For consistency, each image is treated as two identical frames. To balance the computational demands of long video processing with overall training efficiency, we dynamically adjust the resolution of each video frame, limiting the total number of tokens per video to 16384. This training approach strikes a balance between the model's ability to comprehend long videos and training efficiency.

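Paraphrase

Unified image and video understanding: Qwen2-VL is trained on a mix of image and video data so that it is proficient at both. To preserve as much video information as possible, each video is sampled at two frames per second. A 3D convolution with a depth of two processes video inputs, letting the model handle 3D tubes rather than 2D patches and so process more video frames without increasing the sequence length. For consistency, each image is treated as two identical frames. To balance the compute cost of long videos against overall training efficiency, the resolution of each video frame is adjusted dynamically so that the total number of tokens per video stays within 16384. This approach balances the model's long-video comprehension against training efficiency.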
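The effect of the depth-two 3D convolution and the 16384-token cap can be estimated with a little arithmetic. The sketch below is an assumption-laden back-of-the-envelope calculation, not the training pipeline: the function names are invented here, and the paper does not spell out exactly how the per-frame resolution is chosen.

```python
import math


def video_token_count(num_frames, height, width,
                      patch_size=14, merge_size=2, temporal_patch_size=2):
    """Visual tokens for a clip, ignoring the two special tokens.

    Every temporal_patch_size frames are fused into one temporal step, then
    each step is cut into 14x14 patches that are merged 2x2.
    """
    temporal_steps = math.ceil(num_frames / temporal_patch_size)
    tokens_per_step = (height // patch_size // merge_size) * (width // patch_size // merge_size)
    return temporal_steps * tokens_per_step


def max_frame_pixels(num_frames, token_limit=16384,
                     patch_size=14, merge_size=2, temporal_patch_size=2):
    """Largest per-frame pixel area that keeps the clip within token_limit."""
    temporal_steps = math.ceil(num_frames / temporal_patch_size)
    tokens_per_step = token_limit // temporal_steps
    return tokens_per_step * (patch_size * merge_size) ** 2


frames = 60 * 2  # a 60-second clip sampled at 2 frames per second
print(video_token_count(frames, 448, 448))  # 15360, under the 16384 limit
print(max_frame_pixels(frames))             # 214032 pixels per frame
```

An image treated as two identical frames gives `video_token_count(2, 224, 224) == 64`, which matches the 64 merged tokens computed for the 224×224 example earlier.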

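Model Deployment (with flash_attention)

In the previous chapter, "[Course Summary] Day 31: A First Look at Multimodal Large Models", we deployed the Qwen2-VL model. Multimodal large models use a lot of GPU memory, so here we use flash_attention to speed up inference and reduce memory usage.

Prepare the environment

Step 1: start a GPU environment in PAI-DSW on the ModelScope platform.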

# Check the CUDA version (run in a shell)
nvcc --version

# Check the PyTorch version (run in Python)
import torch
print(torch.__version__)
print(torch.cuda.is_available())

Output:

[Screenshot: version check output]

The environment has CUDA 12.1 and PyTorch 2.3.1.

Fetch the model

Step 2: download the Qwen2-VL-2B-Instruct model (通义千问2-VL-2B-Instruct).

# Make sure git lfs is installed
git lfs install

# Download the model
git clone https://www.modelscope.cn/Qwen/Qwen2-VL-2B-Instruct.git

Install flash_attention

Step 3: install flash_attention.

pip install flash-attn

Output:

[Screenshot: flash-attn installation output]

Import the libraries

from transformers import Qwen2VLForConditionalGeneration
from transformers import AutoTokenizer
from transformers import AutoProcessor
import torch
from qwen_vl_utils import process_vision_info

Load the model

# Set the model path
model_dir = "Qwen2-VL-2B-Instruct"

# Load the model with flash-attention
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Output:

[Screenshot: model loading output]
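If flash-attn fails to install on a given machine, loading with attn_implementation="flash_attention_2" will raise an error. One possible way to keep the script usable is to check whether the `flash_attn` package is importable and otherwise fall back to PyTorch's built-in scaled-dot-product attention ("sdpa"). The helper name `pick_attn_implementation` is an assumption made for this sketch.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return preferred if the flash_attn package can be imported, else fallback.

    Hypothetical helper: it only checks that the package is installed, not
    that the GPU actually supports FlashAttention.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return preferred
    return fallback


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The returned string can then be passed as `attn_implementation=attn_impl` in the `from_pretrained` call above. Falling back to "sdpa" keeps the code running but may use more memory than flash attention.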

Model structure

After loading, printing model shows the structure of Qwen2-VL:

Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionFlashAttention2(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=5120, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=1536, bias=True)
      )
    )
  )
  (model): Qwen2VLModel(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2VLDecoderLayer(
        (self_attn): Qwen2VLFlashAttention2(
          (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1536,), eps=1e-06)
  )
  (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
)

Notes:

The Qwen2-VL model has two main parts: a vision encoder and a language model.
Vision encoder (Qwen2VisionTransformerPretrainedModel):

Patch Embedding: a Conv3d embeds the image, cutting it into patches and extracting features. The kernel size is (2, 14, 14) and the stride is also (2, 14, 14).

Rotary Positional Embedding: as described in the paper, rotary position embedding gives the vision model positional information.

Transformer Blocks: 32 Qwen2VLVisionBlock modules, each with two LayerNorm layers, an attention module, and an MLP; the attention module uses a Linear layer for the QKV (query, key, value) projection.

Patch Merger: merges the extracted features using a LayerNorm and an MLP (multilayer perceptron).

Language model (Qwen2VLModel):
Token Embedding: an Embedding layer maps input text tokens to dense vectors of dimension 1536.
Decoder Layers: 28 Qwen2VLDecoderLayer modules, each with self-attention and an MLP; self-attention (Qwen2VLFlashAttention2) computes attention from linear projections of Q, K, and V and uses rotary embeddings to encode sequence position.
Norm Layer: Qwen2RMSNorm normalizes activations, which helps keep training stable.
Output layer (lm_head):
A final linear layer maps the model output back to the vocabulary size (151936) to generate text.
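Several numbers in the printed structure line up with the paper's description, and a few lines of arithmetic make that visible. The sketch below only uses values read from the printout; the function name `structure_checks` is made up, and the reading of the narrow k_proj/v_proj as grouped-query attention is an inference, since the printout does not show head counts.

```python
def structure_checks():
    """Relate layer sizes from the printed model to the paper's description."""
    in_channels, temporal_patch, patch = 3, 2, 14  # Conv3d(3, ..., kernel_size=(2, 14, 14))
    vision_dim, llm_dim = 1280, 1536               # ViT width and LLM hidden size
    merge_size = 2                                 # 2x2 token merge from the paper

    return {
        # Kernel equals stride, so each non-overlapping 2x14x14 block of pixels
        # is projected to one 1280-dimensional patch embedding.
        "values_per_patch": in_channels * temporal_patch * patch * patch,
        # The merger's first Linear takes 2x2 concatenated patch features.
        "merger_in_features": vision_dim * merge_size ** 2,
        # The merger's last Linear outputs the LLM hidden size.
        "merger_out_features": llm_dim,
        # q_proj is 6x wider than k_proj/v_proj, which suggests K/V heads
        # are shared across query heads (an inference, not shown in the print).
        "q_to_kv_ratio": 1536 // 256,
    }


for name, value in structure_checks().items():
    print(name, value)
```

The merger input of 5120 is exactly 4 × 1280, which is the "compress adjacent 2 × 2 tokens into a single token" step from Upgrade 1.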

Load the processor

processor = AutoProcessor.from_pretrained(model_dir)

Processor configuration

Printing the processor gives the following:

Qwen2VLProcessor:
- image_processor:Qwen2VLImageProcessor{
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean":[
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type":"Qwen2VLImageProcessor",
"image_std":[
0.26862954,
0.26130258,
0.27577711
],
"max_pixels":12845056,
"merge_size":2,
"min_pixels":3136,
"patch_size":14,
"processor_class":"Qwen2VLProcessor",
"resample":3,
"rescale_factor":0.00392156862745098,
"size":{
"max_pixels":12845056,
"min_pixels":3136
},
"temporal_patch_size":2
}

- tokenizer:Qwen2TokenizerFast(name_or_path='Qwen2-VL-2B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token':'<|im_end|>','pad_token':'<|endoftext|>','additional_special_tokens':['<|im_start|>','<|im_end|>','<|object_ref_start|>','<|object_ref_end|>','<|box_start|>','<|box_end|>','<|quad_start|>','<|quad_end|>','<|vision_start|>','<|vision_end|>','<|vision_pad|>','<|image_pad|>','<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
151643:AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644:AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645:AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151646:AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151647:AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151648:AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151649:AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151650:AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151651:AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151652:AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151653:AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151654:AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151655:AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151656:AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

{
"chat_template":"{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
"processor_class":"Qwen2VLProcessor"
}

Notes:

Image processor (Qwen2VLImageProcessor)

Convert to RGB - do_convert_rgb: true, meaning input images are converted to RGB so the color channels are consistent.
Normalize - do_normalize: true, meaning images are standardized so that the mean and variance of the pixel values match what the model expects.
Rescale - do_rescale: true, meaning pixel values are scaled to the [0, 1] range (rescale_factor is 1/255).
Resize - do_resize: true, meaning images are resized to dimensions the model accepts.
Mean and standard deviation: image_mean: [0.48145466, 0.4578275, 0.40821073] is the per-channel mean used for normalization. image_std: [0.26862954, 0.26130258, 0.27577711] is the per-channel standard deviation.
Pixel limits: max_pixels: 12845056 is the maximum number of pixels in a processed image. min_pixels: 3136 is the minimum.
Patch size - patch_size: 14, the side length of the patches the image is split into.
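The resize step has to produce sides that divide evenly into 2×2 groups of 14-pixel patches while keeping the area between min_pixels and max_pixels. The following is a sketch of one way to do that, modeled on how the processor is understood to behave; the function name, the rounding choices, and the factor of 28 (patch_size × merge_size) are assumptions for illustration rather than a copy of the library code.

```python
import math


def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=12845056):
    """Pick a target size: multiples of factor, area within [min_pixels, max_pixels].

    Sketch only: keeps the aspect ratio approximately and assumes
    factor = patch_size * merge_size = 28.
    """
    # Round each side to the nearest multiple of factor (at least one block).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: scale down, rounding toward smaller sizes.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: scale up, rounding toward larger sizes.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


print(smart_resize(224, 224))    # already valid, stays (224, 224)
print(smart_resize(1000, 750))   # rounded to multiples of 28
print(smart_resize(8000, 6000))  # scaled down to respect max_pixels
```

With the default limits, min_pixels = 3136 is 56 × 56 (a 2×2 grid of merged tokens) and max_pixels = 12845056 corresponds to 16384 merged tokens of 28 × 28 pixels each.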

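Tokenizer (Qwen2TokenizerFast)

Vocabulary size - vocab_size: 151643, the number of entries in the tokenizer's base vocabulary; the added special tokens listed above take IDs 151643 to 151656.
Maximum length - model_max_length: 32768, the maximum text length the model can handle.
Fast mode - is_fast: True, meaning the fast tokenizer is used for efficiency.
Padding and truncation:

padding_side: 'left', meaning padding is added on the left of the text.

truncation_side: 'right', meaning text is truncated on the right.

Special tokens - special_tokens: several special tokens, for example:
<|vision_start|> and <|vision_end|> mark the start and end of visual input.
<|vision_pad|>, <|image_pad|>, and <|video_pad|> are placeholder (padding) tokens for visual, image, and video content.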

Build the conversation

messages = [
{
"role":"user",
"content":[
{
"type":"image",
"image":"https://17aitech.com/wp-content/uploads/2024/10/missile.jpeg",
},
{"type":"text","text":"Describe this image and, if possible, give the specific model and parameters."},
],
}
]

Notes:

The image URL is https://17aitech.com/wp-content/uploads/2024/10/missile.jpeg
qwen_vl_utils downloads the image from that URL automatically.
The image looks like this:

[Image: a missile]

Data preprocessing

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

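Notes:

Printing text shows the prompt produced by the chat template: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image and, if possible, give the specific model and parameters.<|im_end|>\n<|im_start|>assistant\n'
Here <|image_pad|> is the placeholder for the image; the processor later fills this slot so that it lines up with the image's visual tokens.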
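To see where that prompt string comes from, the chat template's logic can be re-implemented in a few lines of plain Python. This is a simplified sketch: `render_prompt` and `expand_image_pad` are hypothetical names, the `add_vision_id` and `image_url` branches of the real template are omitted, and the expansion step reflects my understanding that the processor repeats `<|image_pad|>` once per merged visual token.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Simplified plain-Python version of the chat template shown earlier."""
    parts = []
    for i, message in enumerate(messages):
        # A default system prompt is inserted if the first message is not one.
        if i == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                if item.get("type") == "image" or "image" in item:
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif item.get("type") == "video" or "video" in item:
                    parts.append("<|vision_start|><|video_pad|><|vision_end|>")
                elif "text" in item:
                    parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def expand_image_pad(prompt, grid_thw, merge_size=2):
    """Repeat the first <|image_pad|> once per merged visual token.

    grid_thw is (temporal steps, patch rows, patch columns) for one image.
    """
    t, h, w = grid_thw
    n_tokens = (t * h * w) // (merge_size ** 2)
    return prompt.replace("<|image_pad|>", "<|image_pad|>" * n_tokens, 1)


demo = [{"role": "user", "content": [
    {"type": "image", "image": "example.jpg"},  # placeholder path for the demo
    {"type": "text", "text": "Describe this image."},
]}]
prompt = render_prompt(demo)
print(prompt)
print(expand_image_pad(prompt, (1, 16, 16)).count("<|image_pad|>"))  # 64
```

For a 224×224 image the patch grid is 1 × 16 × 16, so the single placeholder becomes 64 copies, matching the 64 merged tokens computed in Upgrade 1.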

Model inference

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Output:

[Screenshot: the model's description of the image]
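The list comprehension in the inference code is there because `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be sliced off before decoding. A toy version with plain lists shows the idea; `trim_prompt` is a hypothetical name and the token IDs are made up.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt tokens from each generated sequence in a batch."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompt_ids = [[101, 102, 103]]          # made-up prompt token IDs
full_ids = [[101, 102, 103, 7, 8, 9]]   # prompt followed by new tokens
print(trim_prompt(prompt_ids, full_ids))  # [[7, 8, 9]]
```

Because the tokenizer pads on the left, every prompt in a batch has the same length, so the same slice works for each row.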

Recognizing a GIF

messages = [
{
"role":"user",
"content":[
{
"type":"image",
"image":"https://17aitech.com/wp-content/uploads/2024/09/%E6%A3%80%E7%B4%A2%E5%88%B0%E7%AD%94%E6%A1%88.gif",
},
{"type":"text","text":"Describe this image."},
],
}
]

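Original GIF:

[Image: the animated GIF]

Recognition result:

[Screenshot: the model's description of the GIF]

Recognizing a video

First, we download an .mp4 video to the local machine; the video comes from Haokan Video (好看视频).

Note: I once worked on a project that measured software launch speed by counting video frames, so let's see whether the large model can produce that result easily.

Next, we upload the video to the server.

[Screenshot: the uploaded video file on the server]

Then change the messages as follows: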

messages =[
{
"role":"user",
"content":[
{
"type":"video",
"video":"file://start_speed.mp4",
"max_pixels":360*420,
"fps":1.0,
},
{"type":"text","text":"Describe this video, and also count the number of frames each of the two phones takes from launch to display, then output the results."},
],
}
]

Keep the rest of the code unchanged and run it. The result is:

[Screenshot: the model's description of the video]