Images Can Now Join the Knowledge Base Through RAG!

Published 2025-05-14 00:37


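As we know, retrieval-augmented generation (RAG) combines an external knowledge base with a generative model, which effectively mitigates the knowledge limits of large models in specialized domains. Traditional knowledge bases are mostly text and usually rely on pure text embeddings for semantic search and content retrieval.

However, as demand for multimodal data grows and complex document-processing scenarios multiply, traditional methods often hit performance bottlenecks on mixed-format documents (such as PDFs containing text, images, and tables) or on long-context content. Cohere Embed v4 offers a new answer to these challenges: its multimodal embedding and long-context support significantly improve the performance and applicability of RAG systems.

Cohere Embed v4 is a multimodal embedding model aimed at enterprise needs, released on April 15, 2025. It handles text, images, and mixed formats (such as PDF), which makes it a good fit for scenarios involving complex documents. Its key features are: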

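Multimodal support: embeds documents containing both text and images, such as PDFs and presentation slides, in a unified way.
Long context: supports a context length of up to 128K tokens, roughly 200 pages, which suits long documents.
Multilingual: covers more than 100 languages and supports cross-lingual search without language identification or translation.
Security and efficiency: optimized for industries such as finance and healthcare, deployable in a virtual private cloud or on premises, and offers compressed embeddings that save up to 83% of storage cost.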
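The post does not say how the 83% figure is arrived at. One reading that produces a number of about that size is shrinking each vector from 1536 dimensions (the width printed later in this post) to 256. The 256 here is an assumption picked for illustration, not something the post states. A back-of-the-envelope sketch for float32 vectors:

```python
# Rough storage estimate for float32 embeddings.
# Assumption: 256 is a hypothetical reduced dimension chosen for illustration;
# 1536 is the embedding width printed later in this post.
BYTES_PER_FLOAT32 = 4


def storage_bytes(num_vectors: int, dim: int) -> int:
    """Bytes needed to store num_vectors float32 vectors of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def savings_ratio(full_dim: int, reduced_dim: int) -> float:
    """Fraction of storage saved by going from full_dim to reduced_dim."""
    return 1.0 - reduced_dim / full_dim


full = storage_bytes(1_000_000, 1536)
small = storage_bytes(1_000_000, 256)
saved_percent = round(savings_ratio(1536, 256) * 100, 1)
print(full, small, saved_percent)
```

For one million vectors that is about 6.1 GB versus about 1.0 GB, a saving of roughly 83%.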

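Next, let's try out Cohere Embed v4. As an embedding model, it needs to be paired with a large model to get anything done, for example Gemini Flash 2.5.

First, let's sort out how Cohere Embed v4 and Gemini Flash 2.5 relate to each other in this task and how exactly they work together.

We want to build a vision-based retrieval-augmented generation (RAG) system. In this system, Cohere Embed v4 and Gemini Flash 2.5 play different roles and complete the task together: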

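Cohere Embed v4 handles retrieval. It converts images and text into vector representations (embeddings), then uses those embeddings to search for the images most relevant to the user's question.
Gemini Flash 2.5 handles generation. It is a powerful vision-language model (VLM) that understands images and text and generates answers from them.

How do they work together? The workflow is as follows: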

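Image embedding: first, Cohere Embed v4 encodes every image into an image embedding, and the embeddings are stored.
Question embedding: when the user asks a question, Cohere Embed v4 encodes the question into an embedding as well.
Retrieval: the system compares the question embedding with the image embeddings to find the image most relevant to the question.
Answer generation: the retrieved image and the user's question are sent together to Gemini Flash 2.5, which generates the final answer from both.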
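The four steps above can be sketched with toy data before any real API is involved. Everything in this snippet is a stand-in: the three-dimensional vectors, the file names, and the retrieve and generate helpers are hypothetical, and a real system would get its vectors from the embedding model and its answer from the VLM.

```python
import numpy as np

# Step 1 (stand-in): pretend these are stored image embeddings.
image_vectors = {
    "goose.png": np.array([0.9, 0.1, 0.0]),
    "cat.png": np.array([0.1, 0.9, 0.0]),
    "chart.png": np.array([0.0, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(query_vector, store):
    """Step 3: return the name of the image closest to the query."""
    return max(store, key=lambda name: cosine(query_vector, store[name]))


def generate(question, image_name):
    """Step 4 placeholder: a real system would send image + question to a VLM."""
    return f"[answer to {question!r} based on {image_name}]"


# Step 2 (stand-in): pretend this is the embedding of a question about a cat.
query_vector = np.array([0.2, 0.8, 0.1])

best = retrieve(query_vector, image_vectors)
print(generate("What does the cat picture show?", best))
```

The point is only the data flow: images and questions land in the same vector space, so picking an image reduces to a nearest-neighbor lookup.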

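Summary

In short, Cohere Embed v4 acts as the information retriever that finds images relevant to the question, while Gemini Flash 2.5 acts as the answer generator that produces an answer from the retrieved image and the question. Working together, they form a vision-based RAG system that lets users get information out of images by asking questions in natural language.

The experiment code below is mainly meant as a starting point for building a knowledge base from images, PDFs, and the like in practice.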

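Experiment code

The following code demonstrates a purely vision-based RAG approach that works even for complex infographics. It consists of two parts:

Embed v4, Cohere's state-of-the-art text and image retrieval model. It lets us embed and search complex images, such as infographics, without any preprocessing.
A vision LLM: we use Google's Gemini Flash 2.5. It accepts an image and a text question as input and answers the question based on them.

First, let's look at a question-answering example on the finished setup.

Code: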

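# Define the query ("Please explain, in Chinese, the picture with the goose")
question = "請用中文解釋一下有鵝的圖"
# Search for the most relevant image
top_image_path = search(question)
# Answer the query using the retrieved image
answer(question, top_image_path)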

The answer based on the retrieved image:

[Screenshot: the retrieved image and the model's answer]

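Not a bad answer: it even noticed that the image had been flipped upside down. Finding the right image in the library for the question is Cohere's doing; interpreting that image is Gemini's.

Let's try another one.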

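# Define the query ("I remember a picture with a cat in it; what was that picture about?")
question = "我記得有個圖里有貓，請解釋一下那個圖是講什么來著？"
# Search for the most relevant image
top_image_path = search(question)
# Answer the query using the retrieved image
answer(question, top_image_path)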

The answer:

[Screenshot: the retrieved image and the model's answer]

Below are the installation steps and the full code.

Go to cohere.com, sign up, and get an API key.

pip install -q cohere

# Create the Cohere API client. Get your API key from cohere.com
import cohere
cohere_api_key = "<<YOUR_COHERE_KEY>>" #Replace with your Cohere API key
co = cohere.ClientV2(api_key=cohere_api_key)

Generate an API key for Gemini in Google AI Studio, then install the Google Generative AI SDK.

pip install -q google-genai

from google import genai
gemini_api_key = "<<YOUR_GEMINI_KEY>>"  #Replace with your Gemini API key
client = genai.Client(api_key=gemini_api_key)
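Pasting keys straight into the script is fine for a quick notebook run. If the code is going to be shared, one option is to read the keys from environment variables instead. A minimal sketch follows; the load_key helper and the variable names COHERE_API_KEY and GEMINI_API_KEY are assumptions made for this example, not names the post or this code requires.

```python
import os


def load_key(name, env=os.environ):
    """Return the API key stored under `name`, or raise a clear error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# Usage (variable names are an assumption for this sketch):
# cohere_api_key = load_key("COHERE_API_KEY")
# gemini_api_key = load_key("GEMINI_API_KEY")
```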

import requests
import os
import io
import base64
import PIL.Image  # load the Image submodule explicitly; a bare "import PIL" does not
import tqdm
import time
import numpy as np


# Some helper functions to resize images and to convert them to base64 format
max_pixels = 1568*1568  #Max resolution for images


# Resize too large images
def resize_image(pil_image):
    org_width, org_height = pil_image.size


    # Resize image if too large
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))


# Convert images to a base64 string before sending it to the API
def base64_from_image(img_path):
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"


    resize_image(pil_image)


    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64,"+base64.b64encode(img_buffer.read()).decode("utf-8")


    return img_data


# Image list: some are local files, some are fetched from the web.
images = {
    "test1.webp": "./img/test1.webp",
    "test2.webp": "./img/test2.webp",
    "test3.webp": "./img/test3.webp",
    "tesla.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef936e6-3efa-43b3-88d7-7ec620cdb33b_2744x1539.png",
    "netflix.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bd84c9-5b62-4526-b467-3088e27e4193_2744x1539.png",
    "nike.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5cd33ba-ae1a-42a8-a254-d85e690d9870_2741x1541.png",
    "google.png": "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395dd3b9-b38e-4d1f-91bc-d37b642ee920_2741x1541.png",
    "accenture.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08b2227c-7dc8-49f7-b3c5-13cab5443ba6_2741x1541.png",
    "tecent.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ec8448c-c4d1-4aab-a8e9-2ddebe0c95fd_2741x1541.png"
}


# Download the images and compute an embedding for each one
img_folder = "img"
os.makedirs(img_folder, exist_ok=True)


img_paths = []
doc_embeddings = []
for name, url in tqdm.tqdm(images.items()):
    img_path = os.path.join(img_folder, name)
    img_paths.append(img_path)


    # Download the image
    if not os.path.exists(img_path):
        response = requests.get(url)
        response.raise_for_status()


        with open(img_path, "wb") as fOut:
            fOut.write(response.content)


    # Get the base64 representation of the image
    api_input_document = {
        "content": [
            {"type": "image", "image": base64_from_image(img_path)},
        ]
    }


    # Call the Embed v4.0 model with the image information
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[api_input_document],
    )


    # Append the embedding to our doc_embeddings list
    emb = np.asarray(api_response.embeddings.float[0])
    doc_embeddings.append(emb)


doc_embeddings = np.vstack(doc_embeddings)
print("\n\nEmbeddings shape:", doc_embeddings.shape)

Here is the shape of the image embeddings: Embeddings shape: (9, 1536).
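The workflow described earlier says the image embeddings get stored, but the script above only keeps them in memory, so every new session has to call the API again. A minimal sketch of one way to persist them, assuming a local .npz file is good enough. The save_index and load_index helpers and the file name are made up for this illustration, and random numbers stand in for real embeddings.

```python
import os
import tempfile

import numpy as np


def save_index(path, embeddings, paths):
    """Store the embedding matrix and the matching image paths side by side."""
    np.savez(path, embeddings=embeddings, paths=np.array(paths))


def load_index(path):
    """Load the embedding matrix and the list of image paths back."""
    with np.load(path) as data:
        return data["embeddings"], data["paths"].tolist()


# Demo with random data standing in for real embeddings
demo_embeddings = np.random.rand(9, 1536).astype(np.float32)
demo_paths = [f"img/demo_{i}.png" for i in range(9)]

index_file = os.path.join(tempfile.mkdtemp(), "image_index.npz")
save_index(index_file, demo_embeddings, demo_paths)

loaded_embeddings, loaded_paths = load_index(index_file)
print(loaded_embeddings.shape, loaded_paths[:2])
```

Keeping the paths in the same file as the matrix avoids the two getting out of sync, since row i of the matrix only means something together with path i.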

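The following shows a simple vision-based RAG (retrieval-augmented generation) pipeline.

search() runs first: we compute an embedding for the question, use it to search our pre-embedded image library for the most relevant image, and return that image's path.
answer() then sends the question and the image together to Gemini to get the final answer.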

# Search allows us to find relevant images for a given question using Cohere Embed v4
def search(question, max_img_size=800):
    # Compute the embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )


    query_emb = np.asarray(api_response.embeddings.float[0])


    # Compute cosine similarities
    cos_sim_scores = np.dot(query_emb, doc_embeddings.T)


    # Get the most relevant image
    top_idx = np.argmax(cos_sim_scores)


    # Show the images
    print("Question:", question)


    hit_img_path = img_paths[top_idx]


    print("Most relevant image:", hit_img_path)
    image = PIL.Image.open(hit_img_path)
    max_size = (max_img_size, max_img_size)  # Adjust the size as needed
    image.thumbnail(max_size)
    display(image)
    return hit_img_path


# Answer the question based on the information from the image
# Here we use Gemini 2.5 as powerful Vision-LLM
def answer(question, img_path):
    prompt = [f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.


Question: {question}""", PIL.Image.open(img_path)]


    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt
    )


    answer = response.text
    print("LLM Answer:", answer)
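search() above returns only the single best image and scores with a plain dot product, which matches cosine similarity only when the vectors have unit length. Here is a hedged sketch of a variant that normalizes explicitly and returns the top k hits. The top_k_cosine name and the toy two-dimensional vectors are made up for this example and are not part of the original code.

```python
import numpy as np


def top_k_cosine(query_emb, doc_embs, k=3):
    """Return (index, score) pairs for the k rows of doc_embs most similar to query_emb."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per row
    order = np.argsort(-scores)[:k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy data: three "documents" and one query
docs = np.array([[1.0, 0.0], [1.0, 2.0], [3.0, 3.0]])
hits = top_k_cosine(np.array([1.0, 1.0]), docs, k=2)
print(hits)
```

Returning several candidates makes it possible to pass more than one image to the generation step, or to look at the scores and decide that nothing in the library is relevant enough.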

Then ask questions about the images.

# Define the query
question = "請用中文解釋一下 Nike 的數據"  # "Please explain the Nike data in Chinese"


# Search for the most relevant image
top_image_path = search(question)


# Use the image to answer the query
answer(question, top_image_path)

Here is the answer:

[Screenshot: the retrieved image and the model's answer]