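Building a Multimodal RAG System with CLIP and an LLM

Author: Anonymous, 2024-01-11 16:24:12

In this article we explore how to build a retrieval-augmented generation (RAG) system using open-source large multimodal language models. The focus is on doing this without relying on LangChain or LlamaIndex, which keeps framework dependencies to a minimum.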

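What is RAG

In artificial intelligence, retrieval-augmented generation (RAG) is a transformative technique that extends the capabilities of large language models (LLMs). In essence, RAG makes a model's responses more specific by letting it dynamically retrieve up-to-date information from external sources.

The architecture combines generation with a dynamic retrieval step, so the system can keep up with changing information across domains. Unlike fine-tuning or retraining, RAG is a cost-effective way to give a model current, relevant information without changing the model itself.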
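The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not the pipeline built later in the article: word overlap stands in for embedding similarity, the "generation" step only assembles the prompt that would be sent to a model, and the function names and the two-document knowledge base are hypothetical.

```python
# Toy retrieve-then-generate loop. Word overlap stands in for embedding
# similarity, and the generation step only assembles the prompt.

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def retrieve(query, documents, n_results=1):
    """Return the n_results documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:n_results]

def build_prompt(query, context_docs):
    """Place the retrieved context ahead of the question for the model."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical two-document knowledge base.
knowledge_base = [
    "Tulips are spring-blooming bulbs.",
    "Roses are woody perennial plants with prickly stems.",
]

query = "What kind of plants are roses?"
top_docs = retrieve(query, knowledge_base)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Updating the knowledge base changes what the model sees at answer time, with no change to the model's weights. That is the property that sets RAG apart from fine-tuning.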

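What RAG provides

1. Improved accuracy and reliability

RAG addresses the unpredictability of large language models (LLMs) by directing them to authoritative knowledge sources. This lowers the risk of false or outdated information and leads to more accurate, reliable responses.

2. Greater transparency and trust

Generative models such as LLMs often lack transparency, which makes their output hard to trust. RAG gives organizations more control over the generated text, which addresses concerns about bias, reliability, and compliance.

3. Fewer hallucinations

LLMs are prone to hallucination: responses that are coherent but inaccurate or fabricated. By grounding responses in authoritative sources, RAG reduces the risk of misleading advice in critical sectors.

4. Cost-effective adaptability

RAG offers a cost-effective way to improve model output without extensive retraining or fine-tuning. Specific details are fetched on demand, so the information stays current and relevant as it changes.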

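Multimodal models

Multimodal models take several kinds of input and combine them into a single output. Take CLIP as an example: it is trained on text-image pairs, and through contrastive learning it learns which texts and images match.

For different inputs that represent the same thing, the model produces the same (or very similar) embedding vectors.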
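"Very similar" is usually measured with cosine similarity. The sketch below shows the idea with made-up 3-dimensional vectors; real CLIP embeddings have hundreds of dimensions, and the values here are hypothetical, chosen only so that the matching pair points in nearly the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, for illustration only.
text_tulip = [0.9, 0.1, 0.2]   # embedding of the text "tulip"
image_tulip = [0.8, 0.2, 0.1]  # embedding of a tulip photo
image_rose = [0.1, 0.9, 0.3]   # embedding of a rose photo

sim_match = cosine_similarity(text_tulip, image_tulip)
sim_mismatch = cosine_similarity(text_tulip, image_rose)
print(sim_match > sim_mismatch)  # the matching pair scores higher
```

Because text and images share one embedding space, the same comparison works for text-to-image, image-to-image, and image-to-text retrieval.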

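Multimodal large language models

GPT-4V and Gemini Vision are multimodal large language models (MLLMs) that integrate several data types, including images, text, language, and audio. Large language models such as GPT-3, BERT, and RoBERTa perform well on text-based tasks but struggle to understand and process other data types. Multimodal models address this limitation by combining modalities, which gives them a fuller understanding of heterogeneous data.

Multimodal large language models go beyond the traditional text-only approach. GPT-4, for example, can handle both images and text, which lets it build a more complete understanding of the input.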

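Combining with RAG

Here we use CLIP to embed images and text and store the embeddings in the ChromaDB vector database. A large model then takes part in the user's chat session, using the retrieved information.

We will use images from Kaggle and information from Wikipedia to create a flower-expert chatbot.

First, install the packages: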

!pip install -q timm einops wikipedia chromadb open_clip_torch
!pip install -q transformers==4.36.0
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0

Preprocessing is simple: just put the images and text files in one folder.

Any vector database will do; here we use ChromaDB.

import chromadb

from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
from chromadb.config import Settings

# Persist the database in the local "DB" folder.
client = chromadb.PersistentClient(path="DB")

embedding_function = OpenCLIPEmbeddingFunction()  # CLIP embeddings for text and images
image_loader = ImageLoader()  # required when adding images by URI

ChromaDB also lets you supply a custom embedding function by subclassing EmbeddingFunction:

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Embed the documents (or images) here and return one vector per input item.
        raise NotImplementedError("compute and return the embeddings here")

Here we create two collections, one for text and one for images.

import os

collection_images = client.create_collection(
    name='multimodal_collection_images',
    embedding_function=embedding_function,
    data_loader=image_loader)

collection_text = client.create_collection(
    name='multimodal_collection_text',
    embedding_function=embedding_function,
    )

# Get the images: every file in the folder that is not a .txt file
IMAGE_FOLDER = '/kaggle/working/all_data'

image_uris = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if not image_name.endswith('.txt')])
ids = [str(i) for i in range(len(image_uris))]

collection_images.add(ids=ids, uris=image_uris)  # now we have the images collection
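The file-splitting logic above can be checked without touching the file system. The sketch below applies the same rule (anything ending in `.txt` is text, everything else is treated as an image) to a hypothetical list of file names; `split_listing` and the sample names are illustrative and do not come from the original notebook.

```python
import os

def split_listing(folder, file_names):
    """Split file names into sorted image URIs and text paths, plus image ids."""
    image_uris = sorted(
        os.path.join(folder, name) for name in file_names if not name.endswith(".txt")
    )
    text_paths = sorted(
        os.path.join(folder, name) for name in file_names if name.endswith(".txt")
    )
    # Ids are positions in the sorted image list, as in the article's code.
    ids = [str(i) for i in range(len(image_uris))]
    return image_uris, text_paths, ids

# Hypothetical folder contents.
names = ["rose.txt", "tulip_1.jpg", "rose_1.jpg", "tulip.txt"]
image_uris, text_paths, ids = split_listing("all_data", names)
print(image_uris, text_paths, ids)
```

Because the ids are positional, adding files and re-running the cell can assign an existing id to a different image, so a fresh collection is the safer choice after the folder changes.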

With CLIP, we can retrieve images using text like this:

from matplotlib import pyplot as plt

# Retrieve the three images closest to the text query and display them.
retrieved = collection_images.query(query_texts=["tulip"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()

Images can also be used to retrieve related images.

The text collection is built as follows:

# Now the text DB
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()

text_pth = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if image_name.endswith('.txt')])

# Read every text file into memory, one document per file.
list_of_text = []
for path in text_pth:
    with open(path, 'r') as f:
        list_of_text.append(f.read())

ids_txt_list = ['id'+str(i) for i in range(len(list_of_text))]
ids_txt_list

collection_text.add(
    documents=list_of_text,
    ids=ids_txt_list
)

We can then query the text collection above:

results = collection_text.query(
    query_texts=["What is the bellflower?"],
    n_results=1
)

results

The result is as follows:

{'ids': [['id0']],
  'distances': [[0.6072186183744086]],
  'metadatas': [[None]],
  'embeddings': None,
  'documents': [['Campanula () is the type genus of the Campanulaceae family of flowering plants. Campanula are commonly known as bellflowers and take both their common and scientific names from the bell-shaped flowers—campanula is Latin for "little bell".\nThe genus includes over 500 species and several subspecies, distributed across the temperate and subtropical regions of the Northern Hemisphere, with centers of diversity in the Mediterranean region, Balkans, Caucasus and mountains of western Asia. The range also extends into mountains in tropical regions of Asia and Africa.\nThe species include annual, biennial and perennial plants, and vary in habit from dwarf arctic and alpine species under 5 cm high, to large temperate grassland and woodland species growing to 2 metres (6 ft 7 in) tall.']],
  'uris': None,
  'data': None}
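The result is a dictionary of nested lists: the outer list has one entry per query and the inner list one entry per hit. A small helper can pull out the best match. This is a sketch based only on the shape of the output shown above; `top_hit` is a hypothetical name and the document text is abbreviated.

```python
def top_hit(results, query_index=0):
    """Return (id, distance, document) of the best match for one query."""
    return (
        results["ids"][query_index][0],
        results["distances"][query_index][0],
        results["documents"][query_index][0],
    )

# Abbreviated copy of the result shown above.
results = {
    "ids": [["id0"]],
    "distances": [[0.6072186183744086]],
    "metadatas": [[None]],
    "embeddings": None,
    "documents": [["Campanula () is the type genus of the Campanulaceae family of flowering plants."]],
    "uris": None,
    "data": None,
}

doc_id, distance, document = top_hit(results)
print(doc_id, round(distance, 3))  # id0 0.607
```

The same double indexing explains the `['documents'][0][0]` expression used later in the article: first query, first hit.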

Or use an image to retrieve text:

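import numpy as np
from PIL import Image

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

# Embed the image pixels (not the path string) so CLIP's image encoder is used.
doc = collection_text.query(
    query_embeddings=embedding_function([np.array(raw_image)]),
    n_results=1,
)['documents'][0][0]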

For the rose image queried above, the result is:

A rose is either a woody perennial flowering plant of the genus Rosa (), in the family Rosaceae (), or the flower it bears. There are over three hundred species and tens of thousands of cultivars. They form a group of plants that can be erect shrubs, climbing, or trailing, with stems that are often armed with sharp prickles. Their flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwestern Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Roses have acquired cultural significance in many societies. Rose plants range in size from compact, miniature roses, to climbers that can reach seven meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.

This completes the matching of text and images, all of which is CLIP's work. Next we bring in the LLM.

from huggingface_hub import hf_hub_download

# Download the custom model, configuration, and processor code from the model repository.
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)

We use visheratin/LLaVA-3b.

from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
model = model.to("cuda")

Load the tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")

Then define the processor so it is easy to call later.

from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)

Now everything can be used directly.

import numpy as np
from PIL import Image

question = 'Answer with organized answers: What type of rose is in the picture? Mention some of its characteristics and how to take care of it?'

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

# Retrieve the article that best matches the query image.
doc = collection_text.query(
    query_embeddings=embedding_function([np.array(raw_image)]),
    n_results=1,
)['documents'][0][0]

# Show the query image, then similar images from the collection.
plt.imshow(raw_image)
plt.show()
imgs = collection_images.query(query_uris=query_image, include=['data'], n_results=3)
# [1:] skips the top hit, presumably the query image itself.
for img in imgs['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()

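The results are as follows:

The results contain most of the information we need.

That completes the integration. The last step is to create the chat template.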

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant is an expert in flowers, and gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
{question} Use the following article as an answer source. Do not write outside its scope unless you find your answer better {article} if you think your answer is better add it after document.<|im_end|>
<|im_start|>assistant
""".format(question=question, article=doc)

We won't go into detail on how to build the chat loop here; the complete code is available at:

https://github.com/nadsoft-opensource/RAG-with-open-source-multi-modal
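For readers who want to reuse the chat template with different questions and retrieved articles, it can be wrapped in a small function. This is a sketch under assumptions, not the repository's code: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, the instruction wording is shortened, and only the chat markers and the `<image>` placeholder are taken from the template shown earlier.

```python
# Shortened version of the article's chat template (wording abbreviated).
PROMPT_TEMPLATE = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant is an expert in flowers.<|im_end|>
<|im_start|>user
<image>
{question} Use the following article as an answer source: {article}<|im_end|>
<|im_start|>assistant
"""

def build_prompt(question, article):
    """Fill the template with the user's question and the retrieved article."""
    return PROMPT_TEMPLATE.format(question=question, article=article)

prompt = build_prompt(
    question="What type of rose is in the picture?",
    article="A rose is a woody perennial flowering plant of the genus Rosa.",
)
print(prompt)
```

Passing the question and article as arguments to `format` also avoids a common slip, substituting a literal string where the variable was intended. Braces inside the retrieved article are safe, since only the template itself is parsed for placeholders.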

Editor: 華軒 | Source: DeepHub IMBA
