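Chatting with Images Using the Llama 3.2-Vision Multimodal LLM

Author: 二旺, 2024-12-16 07:00:00

This article shows how to set up Llama 3.2-Vision locally in a chat-like mode and explores its multimodal skills in a Colab notebook.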

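I. Introduction

Combining vision capabilities with large language models (LLMs) is reshaping computer vision through multimodal LLMs (MLLMs). These models accept both text and visual input and show strong image understanding and reasoning. Until recently they were mostly reachable only through APIs, but open-source releases can now run locally, which makes them more attractive for production use.

In this tutorial, we will learn how to chat with images using the open-source Llama 3.2-Vision model and look at its OCR, image understanding, and reasoning abilities. All of the code is provided in a single Colab notebook.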

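II. Background

Llama, short for "Large Language Model Meta AI", is a family of LLMs developed by Meta. The latest release, Llama 3.2, adds vision capabilities. The vision variants come in two sizes, 11B and 90B parameters; the models in the same release that target edge devices are the smaller text-only 1B and 3B variants. With a context window of up to 128k tokens and support for high-resolution images of up to 1120x1120 pixels, Llama 3.2 can process complex visual and textual information.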

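III. Architecture

The Llama models are decoder-only Transformers. Llama 3.2-Vision builds on the pre-trained, text-only Llama 3.1 model, which uses a standard dense autoregressive Transformer architecture that does not deviate significantly from its predecessors, Llama and Llama 2.

To support vision tasks, Llama 3.2 uses a pre-trained vision encoder (ViT-H/14) to extract image representation vectors and feeds them into the frozen language model through a vision adapter. The adapter consists of a series of cross-attention layers that let the model attend to the parts of the image relevant to the text being processed [1].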
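To make the cross-attention idea concrete, here is a minimal sketch in plain Python. It is a toy, not Meta's implementation: it uses a single head, no learned projection matrices, and made-up two-dimensional vectors, and the function names are hypothetical. It only illustrates the direction of the data flow: queries come from the text tokens, while keys and values come from the image patches.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_states):
    """Toy single-head cross-attention without learned projections.

    Each text token (query) scores every image patch (key), and the
    resulting weights mix the image patches (values) into one vector.
    """
    dim = len(image_states[0])
    outputs, all_weights = [], []
    for query in text_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_states
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, image_states))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up data: two text tokens and three image patches, dimension 2.
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_states = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]

outputs, weights = cross_attention(text_states, image_states)
for token_index, row in enumerate(weights):
    print(token_index, [round(w, 3) for w in row])
```

In this toy data the first text token puts most of its weight on the first patch and the second token on the second patch, which is the sense in which the model can focus on the image regions that match the text.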

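The adapter is trained on text-image pairs to align image representations with language representations. During adapter training, the image encoder's parameters are updated while the language model's parameters stay frozen, which preserves the existing language capabilities.

Figure: Llama 3.2-Vision architecture. The vision module (green) is integrated into the frozen language model (pink).

This design lets Llama 3.2 perform well on multimodal tasks while keeping its text-only performance. The resulting model handles tasks that require both image and language understanding, and lets users interact with their visual input conversationally. With the architecture covered, we can move on to the implementation, but first some preparation is needed.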

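IV. Preparation

Before running Llama 3.2-Vision 11B on Google Colab, complete the following steps:

(1) GPU setup:

A high-end GPU with at least 22 GB of VRAM is recommended for efficient inference [2].
For Google Colab users: go to "Runtime" > "Change runtime type" and select "A100 GPU". Note that high-end GPUs may not be available to free-tier Colab users.

(2) Model access: request access to the Llama 3.2 models.

(3) Hugging Face setup:

If you do not have a Hugging Face account yet, create one.
If you do not have an access token yet, generate one from your Hugging Face account.
For Google Colab users, store the Hugging Face token in Google Colab Secrets as a secret environment variable named "HF_TOKEN".

(4) Install the required libraries.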
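The 22 GB figure in step (1) is consistent with a back-of-the-envelope estimate of the memory needed for the weights alone. The sketch below assumes a nominal 11 billion parameters (the real count may differ slightly) and ignores activations, the KV cache, and image inputs, so it should be read as a lower bound rather than a precise requirement. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 11e9  # assumption: the nominal "11B" size of the model

# Bytes per parameter for a few common storage types.
for dtype_name, num_bytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{dtype_name:>8}: {weight_memory_gb(NUM_PARAMS, num_bytes):.0f} GB")
```

At two bytes per parameter (bfloat16), the weights come to about 22 GB, which is why a GPU in the A100 class is a comfortable choice.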

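V. Loading the model

With the environment set up and access granted, we use the Hugging Face Transformers library to instantiate the model and its associated processor. The processor prepares the inputs for the model and formats its outputs.

import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the weights in bfloat16 and let Transformers place them on the available device(s)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto")

processor = AutoProcessor.from_pretrained(model_id)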

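1. The expected chat template

The chat template keeps context by storing the conversation history between the "user" (us) and the "assistant" (the AI model). The history is structured as a list named messages, in which each dictionary represents one turn, from either the user or the model. User turns can contain image-text or text-only input, with {"type": "image"} marking an image input. For example, after a few chat iterations the messages list might look like this: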

messages = [
    {"role": "user",      "content": [{"type": "image"}, {"type": "text", "text": prompt1}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts1}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt2}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts2}]},
    {"role": "user",      "content": [{"type": "text", "text": prompt3}]},
    {"role": "assistant", "content": [{"type": "text", "text": generated_texts3}]}
]

This messages list is later passed to the apply_chat_template() method, which converts the conversation into a single tokenizable string in the format the model expects.
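To give a feel for what that conversion produces, the sketch below renders a messages list by hand. It is only an approximation based on my understanding of the Llama 3 prompt format: the function render_chat is hypothetical, the special-token strings are assumptions, and the real template ships with the processor and may differ in detail. Use processor.apply_chat_template() for actual inference.

```python
def render_chat(messages, add_generation_prompt=True):
    """Approximate, hand-written rendering of a messages list.

    Assumption: the special tokens below follow the Llama 3 prompt format;
    the authoritative template is the one bundled with the processor.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        for item in message["content"]:
            if item["type"] == "image":
                parts.append("<|image|>")  # placeholder where image features go
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|eot_id|>")  # end of this turn
    if add_generation_prompt:
        # Open an assistant turn so that the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo_messages = [
    {"role": "user", "content": [{"type": "image"},
                                 {"type": "text", "text": "Describe the image."}]},
]
rendered = render_chat(demo_messages)
print(rendered)
```

The point to notice is that the whole history, including the image placeholder, is flattened into one string, and that add_generation_prompt=True leaves an open assistant turn for the model to complete.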

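2. The main function

For this tutorial I provide a chat_with_mllm function that supports a dynamic conversation with the Llama 3.2 MLLM. It loads the images, preprocesses the image and text inputs, generates the model's response, and manages the conversation history to enable chat-style interaction.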

from IPython.display import display
from tqdm import tqdm
from transformers.image_utils import load_image

def chat_with_mllm(model, processor, prompt, images_path=None, do_sample=False,
                   temperature=0.1, show_image=False, max_new_tokens=512,
                   messages=None, images=None):

    # Use None defaults so that lists are never shared between calls
    images_path = [] if images_path is None else images_path
    messages = [] if messages is None else messages
    images = [] if images is None else images

    # Ensure images_path is a list
    if not isinstance(images_path, list):
        images_path = [images_path]

    # Load the images only if none were passed in
    if len(images) == 0 and len(images_path) > 0:
        for image_path in tqdm(images_path):
            image = load_image(image_path)
            images.append(image)
            if show_image:
                display(image)

    # If starting a new conversation about an image
    if len(messages) == 0:
        messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]

    # If continuing the conversation about the image
    else:
        messages.append({"role": "user", "content": [{"type": "text", "text": prompt}]})

    # Process the input data
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=images, text=text, return_tensors="pt").to(model.device)

    # Generate the response; temperature only applies when sampling is enabled
    generation_args = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    if do_sample:
        generation_args["temperature"] = temperature
    generate_ids = model.generate(**inputs, **generation_args)

    # Keep only the newly generated tokens and drop special tokens such as the end-of-turn marker
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    generated_texts = processor.decode(generate_ids[0], skip_special_tokens=True,
                                       clean_up_tokenization_spaces=False)

    # Append the model's response to the conversation history
    messages.append({"role": "assistant", "content": [{"type": "text", "text": generated_texts}]})

    return generated_texts, messages, images
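One Python detail is worth keeping in mind for functions that take list parameters, such as messages and images in chat_with_mllm: a list used as a default value is created once, when the function is defined, and is then shared by every call that omits the argument. The sketch below uses two hypothetical helper functions (not part of the tutorial code) to show the effect and the usual None-based alternative.

```python
def append_turn_shared(prompt, messages=[]):
    # The default list is created once and reused across calls.
    messages.append(prompt)
    return messages

def append_turn_fresh(prompt, messages=None):
    # A new list is created on every call that omits the argument.
    if messages is None:
        messages = []
    messages.append(prompt)
    return messages

first = append_turn_shared("Describe the image.")
second = append_turn_shared("Describe another image.")
print(first is second, len(second))  # True 2: state leaked between calls

fresh_a = append_turn_fresh("Describe the image.")
fresh_b = append_turn_fresh("Describe another image.")
print(fresh_a is fresh_b, len(fresh_b))  # False 1: each call starts clean
```

Passing messages and images explicitly on every call, as the examples in this tutorial do, avoids the problem regardless of how the defaults are written.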

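VI. Chatting with Llama

1. Butterfly image example

In our first example, we chat with Llama 3.2 about an image of a butterfly emerging from its chrysalis. Because Llama 3.2-Vision does not support system prompts when images are used, we place the instructions directly in the user prompt to guide the model's response. Setting do_sample=True and temperature=0.2 allows slight randomness while keeping the responses consistent. For deterministic answers, set do_sample=False. The messages parameter, which holds the chat history, starts out empty, as does the images parameter.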
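The effect of temperature can be illustrated with a small worked example. When sampling, the model's scores (logits) for the candidate next tokens are divided by the temperature before the softmax, so a low temperature concentrates the probability on the top-scoring token. The logits below are made up for illustration and the function name is hypothetical; this is a sketch of the general mechanism, not of the library's internal code.

```python
import math

def token_probabilities(logits, temperature):
    """Softmax over logits scaled by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.5, 0.2):
    probs = token_probabilities(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With these numbers the top token's probability rises from roughly two thirds at temperature 1.0 to more than 0.99 at temperature 0.2, which is why a setting like 0.2 gives nearly, but not fully, repeatable answers.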

instructions = "Respond concisely in one sentence."
prompt = instructions + " Describe the image."

# img_path must hold the path or URL of the butterfly image
response, messages, images = chat_with_mllm(model, processor, prompt,
                                            images_path=[img_path],
                                            do_sample=True,
                                            temperature=0.2,
                                            show_image=True,
                                            messages=[],
                                            images=[])

# Output:  "The image depicts a butterfly emerging from its chrysalis, 
#           with a row of chrysalises hanging from a branch above it."

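As we can see, the output is accurate and concise, which shows that the model understood the image. In the next chat iteration we pass a new prompt together with the chat history (messages) and the loaded images (images). The new prompt is meant to probe Llama 3.2's reasoning ability:

prompt = instructions + " What would happen to the chrysalis in the near future?"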
response, messages, images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path,],
                                             do_sample=True,
                                             temperature=0.2,
                                             show_image=False,
                                             messages=messages,
                                             images=images)

# Output: "The chrysalis will eventually hatch into a butterfly."

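We continued this conversation in the accompanying Colab notebook.

The exchange highlights the model's ability to understand the image by describing the scene accurately. It also shows reasoning ability: the model connects the information logically, correctly infers what will happen to the chrysalis, and explains why some chrysalises are brown while others are green.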
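Since the whole exchange lives in the messages list, it can be printed as a readable transcript at any point. The helper below is a hypothetical convenience function, not part of the tutorial's notebook; the example data reuses the prompts and responses shown above.

```python
def format_transcript(messages):
    """Render a messages list as one 'role: text' line per turn."""
    lines = []
    for message in messages:
        pieces = []
        for item in message["content"]:
            if item["type"] == "image":
                pieces.append("[image]")
            elif item["type"] == "text":
                pieces.append(item["text"])
        lines.append(f"{message['role']}: {' '.join(pieces)}")
    return "\n".join(lines)

example_messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Respond concisely in one sentence. Describe the image."}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The image depicts a butterfly emerging from its chrysalis, "
                                 "with a row of chrysalises hanging from a branch above it."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "Respond concisely in one sentence. "
                                 "What would happen to the chrysalis in the near future?"}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The chrysalis will eventually hatch into a butterfly."}]},
]

transcript = format_transcript(example_messages)
print(transcript)
```

Calling format_transcript(messages) on the list returned by chat_with_mllm should work the same way, because it has the same structure.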

2. Meme image example

In this example, I show the model a meme I created myself, to evaluate Llama's OCR ability and to find out whether it gets my sense of humor.

instructions = "You are a computer vision engineer with sense of humor."
prompt = instructions + " Can you explain this meme to me?"


response, messages,images= chat_with_mllm ( model, processor, prompt,
                                             images_path=[img_path,],
                                             do_sample=True,
                                             temperature=0.5,
                                             show_image=True,
                                             messages=[],
                                             images=[])

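[Figure: the input meme]

[Figure: the model's response]

As we can see, the model shows strong OCR ability and understands the meaning of the text in the image. As for its sense of humor: what do you think, did it get it? Did you?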

Editor: 趙寧寧. Source: 小白玩轉Python

