Building a Custom Q&A Application with LangChain and the Pinecone Vector Database

Author: iron guo, 2023-11-10 14:46:41

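This article explores how to build a custom Q&A application with LangChain and the Pinecone vector database. It introduces the basic concepts, from an overview of Q&A applications to the capabilities of the Pinecone vector database, and combines an OpenAI-based semantic search pipeline with Pinecone's indexing and retrieval and a Streamlit front end to build a Q&A solution that answers from your own data.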

Build a custom chatbot: develop a Q&A application over any data source with LangChain, OpenAI, and PineconeDB

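Introduction

The arrival of large language models (LLMs) is one of the most exciting technological developments of our time. It has opened up a wide range of possibilities in artificial intelligence and offers solutions to real-world problems across industries. One of the most interesting applications of these models is building custom Q&A applications or chatbots over personal or organizational data sources. However, because LLMs are trained on publicly available, general-purpose data, their answers are not always specific or useful to the end user. To address this, we can use a framework such as LangChain to build a custom chatbot that gives specific answers grounded in our own data. In this article, we will learn how to build a custom Q&A application and deploy it on Streamlit Cloud. Let's get started!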

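Learning objectives

Understand why a custom Q&A application can be a better choice than fine-tuning a language model
Learn to develop a semantic search pipeline with OpenAI and Pinecone
Develop a custom Q&A application and deploy it on Streamlit Cloud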

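Overview of Q&A applications

Question answering, or "chatting over your data", is a popular use case for LLMs and LangChain. LangChain provides a series of components for loading whatever data source your use case involves. It supports a large number of data sources, along with transformers that convert the data into a series of strings to be stored in a vector database. Once the data is stored in the database, it can be queried with a component called a retriever. With an LLM on top, we can then get accurate, chatbot-style answers without having to work through large numbers of documents ourselves.

LangChain supports the data sources shown below. As the figure shows, it offers more than 120 integrations for connecting to the data sources you are likely to have.

[Image: data sources supported by LangChain]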

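Q&A application workflow

Having seen the data sources LangChain supports, we can build a Q&A pipeline from the components LangChain provides. The following components handle document loading, storage, retrieval, and output generation with an LLM.

Document loaders: load the user's documents for vectorization and storage
Text splitters: document transformers that split documents into fixed-length chunks so they can be stored efficiently
Vector stores: vector database integrations that store the vector embeddings of the input text
Document retrieval: retrieves text from the database based on the user's query, using similarity search to find the most closely matching content
Model output: the final model output for the user's query, generated from a prompt that combines the query with the retrieved text

This is the high-level workflow of a Q&A pipeline, and it can address many kinds of real-world problems. I do not go into each LangChain component in depth here.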
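To make the five stages concrete, here is a minimal, dependency-free sketch of the same flow. Everything in it is a stand-in chosen for illustration and is not LangChain's API: the function names are hypothetical, the "embedding" is a simple bag-of-words count, and the "vector store" is a Python list. A real application would swap each stand-in for the corresponding LangChain component.

```python
import re
from collections import Counter


def load_documents(raw_texts):
    """Document loader stand-in: wrap raw strings with a source id."""
    return [{"source": f"doc-{i}", "text": t} for i, t in enumerate(raw_texts)]


def split_documents(documents, chunk_size=60):
    """Text splitter stand-in: cut each document into pieces of at most chunk_size characters."""
    chunks = []
    for doc in documents:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            chunks.append({"source": doc["source"], "text": text[start:start + chunk_size]})
    return chunks


def embed(text):
    """Embedding stand-in: bag-of-words counts (a real app would call an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(store, query, k=2):
    """Retriever stand-in: rank stored chunks by how many words they share with the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: sum((item["vector"] & q).values()), reverse=True)
    return [item["chunk"] for item in ranked[:k]]


def build_prompt(chunks, query):
    """Model-output stage, up to the point where the prompt would be sent to an LLM."""
    context = "\n".join(c["text"] for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


raw = ["Dogs and cats are the most common pets.", "Pinecone stores vector embeddings."]
chunks = split_documents(load_documents(raw))
store = [{"chunk": c, "vector": embed(c["text"])} for c in chunks]  # vector store stand-in
top = retrieve(store, "Which pets are common?", k=1)
prompt = build_prompt(top, "Which pets are common?")
print(prompt)
```

The sketch stops at the prompt because the last stage, calling the LLM, needs an API key; the point is only to show how the output of each stage feeds the next.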

[Image]

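Advantages of custom Q&A over model fine-tuning

Answers specific to the context at hand
Adapts to new input documents
No model fine-tuning required, which saves the cost of model training
More accurate and specific answers than generic ones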

What is the Pinecone vector database?

Pinecone

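Pinecone is a popular vector database for building LLM-powered applications. It is versatile and scalable, and suited to high-performance AI applications. It is a fully managed, cloud-native vector database, so users have no infrastructure to maintain.

LLM-based applications involve large amounts of unstructured data and need sophisticated long-term memory to retrieve information as accurately as possible. Generative AI applications rely on semantic search over vector embeddings to return suitable context for a user's input.

Pinecone is well suited to this kind of application and is optimized to store and query large numbers of vectors with low latency, which helps in building user-friendly applications. Let's see how to set up the Pinecone vector database for our Q&A application.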

# install pinecone-client (pinecone.init below is the pinecone-client 2.x API)
pip install pinecone-client


# import pinecone and initialize it with your API key and environment name
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")


# create your first index to start storing vectors
# (index names may contain only lowercase letters, digits, and hyphens)
pinecone.create_index("first-index", dimension=8, metric="cosine")


# connect to the index, then upsert sample data (five 8-dimensional vectors)
index = pinecone.Index("first-index")
index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
])


# use list_indexes() to list the indexes available in the database
pinecone.list_indexes()


[Output]>>> ['first-index']

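In the demo above, we installed the Pinecone client so that we could initialize the vector database in our project environment. Once it is initialized, we can create an index with the required dimension and metric and insert vector embeddings into it. In the next section, we will use Pinecone and LangChain to develop a semantic search pipeline for our application.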
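The "cosine" metric chosen for the index compares the direction of two vectors and ignores their length. The short sketch below computes it by hand in plain Python, with no Pinecone call involved. It uses two of the sample vectors from the demo plus one made-up vector (vec_x, an assumption added for contrast). One thing it makes visible: the demo vectors A to E all point in the same direction, so under cosine similarity they are indistinguishable from one another.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vec_a = [0.1] * 8                                   # sample vector "A" from the demo
vec_b = [0.2] * 8                                   # sample vector "B" from the demo
vec_x = [0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0]    # hypothetical vector, not in the demo

sim_ab = cosine_similarity(vec_a, vec_b)   # same direction, so the similarity is 1 (up to rounding)
sim_ax = cosine_similarity(vec_a, vec_x)   # different direction, so the similarity is lower

print(round(sim_ab, 4), round(sim_ax, 4))
```

For real text embeddings this is the behavior you usually want: two passages about the same topic should match regardless of how long each passage is.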

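Building a semantic search pipeline with OpenAI and Pinecone

We saw that the Q&A application workflow has five steps. In this section we carry out the first four: document loading, text splitting, vector storage, and document retrieval.

To run these steps in a local environment or a cloud-based notebook environment such as Google Colab, you need to install a few libraries and create accounts with OpenAI and Pinecone to obtain their respective API keys. Let's start with the environment setup:

Install the required libraries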

# install langchain and openai with other dependencies
!pip install --upgrade langchain openai -q
!pip install pillow==6.2.2
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install poppler-utils
!pip install pinecone-client -q
!pip install tiktoken -q




# setup openai environment
import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"


# importing libraries
import os
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

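Once the setup is complete, import all the libraries listed in the code snippet above. Then follow the steps below:

Load the documents

In this step, we load documents from a directory as the starting point of the AI project pipeline. Our directory contains 2 documents, which we load into the project environment.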

#load the documents from content/data dir
directory = '/content/data'


# load_docs functions to load documents using langchain function
def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents


documents = load_docs(directory)
len(documents)
[Output]>>> 5

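Split the text data

Text embeddings and LLMs work better when each piece of input text has a bounded, roughly uniform length, so splitting text into chunks is a necessary step in most LLM use cases. We use "RecursiveCharacterTextSplitter" to split the documents into chunks of at most chunk_size characters, with chunk_overlap characters shared between neighboring chunks.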

# split the docs using recursive text splitter
def split_docs(documents, chunk_size=200, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs


# split the docs
docs = split_docs(documents)
print(len(docs))
[Output]>>>12
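To show what chunk_size and chunk_overlap control, here is a simplified sliding-window splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally: the LangChain splitter also tries to break on separators such as paragraph breaks, newlines, and spaces, whereas this sketch (with the hypothetical name chunk_text) cuts at fixed character offsets only.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=20):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap      # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                          # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5                  # 50 characters of dummy text
pieces = chunk_text(sample, chunk_size=20, chunk_overlap=5)

for piece in pieces:
    print(len(piece), piece)
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is still fully present in at least one chunk.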

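Store the data in a vector store

Once the documents have been split, we use OpenAI embeddings to compute their embeddings and store them in the vector database.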

# create the OpenAI embeddings object
embeddings = OpenAIEmbeddings()


# initialize Pinecone
pinecone.init(
    api_key="YOUR-API-KEY",
    environment="YOUR-ENV"
)


# define index name
index_name = "langchain-project"


# store the data and embeddings into pinecone index
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
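The snippet above writes into an index named "langchain-project" but never creates it. As far as I can tell, the LangChain releases from around the time of this article expect the index to exist before Pinecone.from_documents is called, so a small guard like the one below may be useful. This is a sketch under assumptions: ensure_index is a hypothetical helper, FakePineconeClient is an in-memory stand-in so the example runs without an account, and the default dimension of 1536 assumes the text-embedding-ada-002 embedding model.

```python
def ensure_index(client, name, dimension=1536, metric="cosine"):
    """Create the index only if it is missing; return True when it was created."""
    if name in client.list_indexes():
        return False
    client.create_index(name, dimension=dimension, metric=metric)
    return True


class FakePineconeClient:
    """Hypothetical in-memory stand-in exposing the two calls the helper needs."""

    def __init__(self):
        self.indexes = {}

    def list_indexes(self):
        return list(self.indexes)

    def create_index(self, name, dimension, metric="cosine"):
        self.indexes[name] = {"dimension": dimension, "metric": metric}


client = FakePineconeClient()
created_first = ensure_index(client, "langchain-project")   # index is missing, so it is created
created_again = ensure_index(client, "langchain-project")   # index exists, so nothing happens
print(created_first, created_again)
```

With pinecone-client 2.x, the pinecone module itself offers list_indexes() and create_index(), so after pinecone.init(...) it should be possible to pass the module in place of the fake client, as in ensure_index(pinecone, index_name).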

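Retrieve data from the vector database

At this stage, we use semantic search to retrieve documents from the vector database. The vectors are stored in an index named "langchain-project"; when we run a query like the one below, we get back the most similar documents from the database.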

# An example query to our database
query = "What are the different types of pet animals are there?"


# do a similarity search and store the documents in result variable 
result = index.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
---------------------------------[Output]--------------------------------------
result
[Document(page_content='Small mammals like hamsters, guinea pigs, 
and rabbits are often chosen for their
low maintenance needs. Birds offer beauty and song,
and reptiles like turtles and lizards can make intriguing pets.', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='Pet animals come in all shapes and sizes, each suited 
to different lifestyles and home environments. Dogs and cats are the most 
common, known for their companionship and unique personalities. Small', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='intriguing pets. Even fish, with their calming presence
, can be wonderful pets.', 
metadata={'source': '/content/data/Different Types of Pet Animals.txt'})]

In this way, we can retrieve documents from the vector store based on similarity search.
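The retrieved chunks still have to be turned into a single context string before they can be handed to the model. A minimal sketch of that step follows. The Doc class is a hypothetical stand-in for LangChain's Document (only page_content and metadata are modeled), build_context is an illustrative helper rather than part of any library, and the sample texts are shortened versions of the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(docs, max_chars=1000):
    """Join chunk texts with blank lines, skipping exact duplicates.

    The combined chunk text is capped at max_chars characters (separators not counted).
    """
    seen = set()
    parts = []
    used = 0
    for doc in docs:
        text = doc.page_content.strip()
        if not text or text in seen:
            continue                      # skip empty or repeated chunks
        if used + len(text) > max_chars:
            break                         # stop before exceeding the budget
        seen.add(text)
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)


result = [
    Doc("Birds offer beauty and song.", {"source": "pets.txt"}),
    Doc("Dogs and cats are the most common.", {"source": "pets.txt"}),
    Doc("Birds offer beauty and song.", {"source": "pets.txt"}),
]
context = build_context(result)
print(context)
```

Capping the context length matters because the prompt, the context, and the answer all have to fit within the model's context window.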

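Custom Q&A application with Streamlit

In the final stage of the Q&A application, we integrate every component of the workflow into a custom Q&A application that lets users supply various data sources (web articles, PDFs, CSVs, and so on) and chat with them, which can make their day-to-day work more productive. We need to create a GitHub repository and add the following files to it.

[Image]

GitHub repository structure

Project files to add:

main.py - the Python file containing the Streamlit front-end code
qanda.py - prompt design and model output functions that return the answer to the user's query
utils.py - utility functions for loading and splitting the input documents
vector_search.py - text embedding and vector storage functions
requirements.txt - project dependencies for running the application on Streamlit's public cloud

This project demo supports two types of data source:

Text data from a web URL
Online PDF files

These two types cover a wide range of text data and are the most common in many use cases. See the main.py Python code below for the application's user interface.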

# import necessary libraries
import streamlit as st
import openai
import qanda
from vector_search import *
from utils import *
from io  import StringIO


# take openai api key in
api_key = st.sidebar.text_input("Enter your OpenAI API key:", type='password')
# open ai key
openai.api_key = str(api_key)


# header of the app
_ , col2,_ = st.columns([1,7,1])
with col2:
    col2 = st.header("Simplchat: Chat with your data")
    url = False
    query = False
    pdf = False
    data = False
    # select option based on user need
    options = st.selectbox("Select the type of data source",
                            options=['Web URL','PDF','Existing data source'])
    #ask a query based on options of data sources
    if options == 'Web URL':
        url = st.text_input("Enter the URL of the data source")
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'PDF':
        pdf = st.text_input("Enter your PDF link here") 
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'Existing data source':
        data= True
        query = st.text_input("Enter your query")
        button = st.button("Submit") 


# write code to get the output based on given query and data sources   
if button and url:
    with st.spinner("Updating the database..."):
        corpusData = scrape_text(url)
        encodeaddData(corpusData,url=url,pdf=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)


# write a code to get output on given query and data sources
if button and pdf:
    with st.spinner("Updating the database..."):
        corpusData = pdf_text(pdf=pdf)
        encodeaddData(corpusData,pdf=pdf,url=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
if button and data:
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
        
# delete the vectors from the database
st.expander("Delete the indexes from the database")
button1 = st.button("Delete the current vectors")
if button1 == True:
    index.delete(deleteAll='true')
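main.py calls qanda.prompt(context, query), but the article does not show qanda.py. The sketch below is a guess at what such a prompt builder might look like, not the author's code: build_prompt is a hypothetical name, and the instruction wording is an assumption. It simply places the retrieved context and the user's question into a fixed template.

```python
def build_prompt(context, query):
    """Hypothetical equivalent of qanda.prompt(context, query): fill a fixed template."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


example = build_prompt(
    "Dogs and cats are the most common pets.",
    "Which pets are common?",
)
print(example)
```

Telling the model to rely only on the supplied context is a common way to keep answers tied to the user's own data rather than to the model's general training data.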

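Deploying the Q&A application on Streamlit Cloud

[Image]

Application user interface

Streamlit provides Community Cloud for hosting applications free of charge. Streamlit is also easy to use thanks to its automated CI/CD pipeline.

Conclusion

In conclusion, we explored how to build a custom Q&A application with LangChain and the Pinecone vector database. The article covered the basic concepts, from an overview of Q&A applications to the capabilities of the Pinecone vector database. By combining an OpenAI-based semantic search pipeline with Pinecone's efficient indexing and retrieval, and using Streamlit for the front end, we put together a Q&A solution that answers from our own data.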

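Frequently asked questions

Q1: What are Pinecone and LangChain?

A: Pinecone is a scalable, long-term-memory vector database for storing the text embeddings of LLM-powered applications, while LangChain is a framework that lets developers build LLM-powered applications.

Q2: What are the applications of NLP question answering?

A: Q&A applications are used in customer-support chatbots, academic research, e-learning, and more.

Q3: Why use LangChain?

A: Working with LLMs can be complicated. LangChain lets developers integrate these LLMs through a variety of components in a developer-friendly way, so products ship faster.

Q4: What are the steps for building a Q&A application?

A: The steps are: document loading, text splitting, vector storage, retrieval, and model output.

Q5: What tools does LangChain have?

A: LangChain has the following tools: document loaders, document transformers, vector stores, chains, memory, and agents.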

Editor: 武曉燕. Source: HELLO程序員

