The Secret to Building an Enterprise-Grade Intelligent Q&A System: Vector Retrieval with Cloud Database PostgreSQL Edition

Authors: 王海洋, 王龍. 2023-11-15 14:28:51

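In this article you will learn how an interactive Q&A system works, how PostgreSQL stores and retrieves vectors, and how to interact with a large language model.

This article describes how to build an enterprise-grade interactive Q&A system using Volcano Engine's cloud database PostgreSQL edition together with large language model (Large Language Model, LLM) technology.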

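Background

Riding the big-data wave, many enterprises have built their own knowledge bases for information retrieval and knowledge lookup. As those knowledge bases grow, however, traditional retrieval becomes inefficient: it often takes considerable time and effort and still returns unsatisfactory results. With the arrival of generative AI (AI Generated Content, AIGC), a smarter approach has emerged: a question-and-answer interface improves the efficiency, accuracy, and user experience of knowledge acquisition.

Even so, for enterprises in specific vertical domains, the limits of generative AI are starting to show. Large models take a long time to train and often lack specialized knowledge of a given domain, which frequently leads to AI "hallucination" (the model confidently talking nonsense). Two approaches are commonly used to address this: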

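Fine Tune approach: "tame" the large language model.

Use domain knowledge to apply supervised fine-tuning (Supervised Fine Tune) and distillation (Distillation) to the large language model. This approach is highly malleable, but it needs substantial compute and skilled people, so the overall cost is high. The enterprise must also keep monitoring and updating the model so that it stays in sync with changing domain knowledge.

Prompt Engineering approach: change "yourself".

This approach relies on a vector database to supply enough conversational context and reference material to refine the question (the prompt) sent to the large language model. In essence, it combines the model's reasoning and summarization ability with vectorized information retrieval, so that a Q&A system that understands a specific context and logic can be built quickly. Its implementation cost is comparatively low.

The rest of this article follows the Prompt Engineering approach and demonstrates how to use cloud database PostgreSQL edition as the vector database.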

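Core concepts and principles

Embedding vectors (Embedding Vectors)

Vector embeddings are widely used in natural language processing and machine learning. Text, images, and other signals can all be converted into vector embeddings by an algorithm. In the vector space, similar words or signals lie closer together, and this property can be used to represent the relationships and similarity between them. For example, suppose a vectorization model converts the three sentences below into two-dimensional vectors (x, y). Plotting them on a coordinate plane shows their similarity: the closer two points are, the more similar their content. See the figure below:

"The weather is lovely today, let's go fly a kite"
"The weather is lovely today, let's go for a walk"
"It's raining so hard, let's just stay home"

[Image]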
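
To make the distance idea concrete, here is a minimal sketch that assigns made-up 2-D coordinates to the three sentences and compares them. The coordinates are assumptions picked by hand for illustration, not the output of any embedding model, and the helper names are hypothetical.

```python
import math

# Hypothetical 2-D coordinates chosen by hand for illustration only;
# a real embedding model returns hundreds or thousands of dimensions.
sentences = {
    "kite": (0.90, 0.80),   # "... let's go fly a kite"
    "walk": (0.85, 0.90),   # "... let's go for a walk"
    "rain": (-0.70, 0.20),  # "... let's just stay home"
}


def euclidean(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for left, right in [("kite", "walk"), ("kite", "rain")]:
    d = euclidean(sentences[left], sentences[right])
    s = cosine_similarity(sentences[left], sentences[right])
    print(f"{left} vs {right}: distance={d:.3f}, cosine={s:.3f}")
```

With these assumed coordinates, the two fair-weather sentences end up much closer to each other than either is to the rainy-day sentence, which is the property the retrieval step relies on.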

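How Prompt Engineering works

As mentioned above, the user has to keep adjusting the input prompt to get a professional answer in the relevant domain. The closer the prompt content is to the question itself, the closer the model's output comes to a professional level. Put simply, the model extracts the answer from the prompt information it is given and summarizes it into a professional-quality reply. The overall Prompt Engineering workflow is shown below:

[Image]

It can be divided into two stages: a vector construction stage and a question-answering stage. In the vector construction stage, every document in the enterprise knowledge base is split into suitably sized chunks, each chunk is converted into embedding data by an embeddings algorithm such as OpenAI's model API (https://platform.openai.com/docs/guides/embeddings/what-are-embeddings), and the result is stored in the cloud database PostgreSQL edition vector database. The detailed flow is shown below:

[Image]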
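
The splitting step of the vector construction stage can be sketched as follows. This is not the article's generate-embeddings.ts; it is a simplified illustration that assumes markdown input, splits on headings first, and uses a character budget (max_chars, an assumed parameter) as a stand-in for a token budget. The function name split_markdown is hypothetical.

```python
import re


def split_markdown(text, max_chars=1000):
    """Split markdown into chunks: first by heading, then by size."""
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        heading = ""
        if section.startswith("#"):
            heading = section.splitlines()[0].lstrip("#").strip()
        # Pack paragraphs into chunks without exceeding max_chars.
        # A single paragraph longer than max_chars is kept whole.
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"heading": heading, "content": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"heading": heading, "content": current})
    return chunks


if __name__ == "__main__":
    sample = "# A\n\nalpha\n\n## B\n\nbeta one\n\nbeta two"
    for chunk in split_markdown(sample, max_chars=20):
        print(chunk)
```

Each resulting chunk would then be sent to the embedding API and stored alongside its heading.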

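In the question-answering stage, the system first receives the user's question and converts it into embedding data, then runs a similarity search against the vectorized knowledge chunks to fetch the TOP N most relevant knowledge units. Next, a prompt template combines the question, the TOP N knowledge units, and the chat history into a new question. The large language model interprets and refines this question and returns a result, which the system finally passes back to the asker. The flow is shown below:

[Image]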
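
The prompt-template step can be sketched roughly as below. The function name build_prompt, the template wording, and the max_context_chars budget are all assumptions made for illustration; the article's own template appears later in queryGPT.py.

```python
def build_prompt(question, chunks, history=(), max_context_chars=4000):
    """Merge chat history, retrieved chunks, and the question into one prompt."""
    context_parts = []
    used = 0
    for chunk in chunks:  # assumed to be ordered most-relevant first
        text = " ".join(chunk.split())  # collapse newlines and repeated spaces
        if used + len(text) > max_context_chars:
            break  # stop before the context budget is exceeded
        context_parts.append(text)
        used += len(text)
    history_text = "\n".join(f"{role}: {text}" for role, text in history) or "(none)"
    clean_question = " ".join(question.split())
    parts = [
        "Answer the question using only the context sections below.",
        "Conversation so far:\n" + history_text,
        "Context sections:\n" + "\n---\n".join(context_parts),
        'Question: """' + clean_question + '"""',
        "Answer:",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "How do I\ncreate a table?",
        ["CREATE TABLE defines a new table."],
        history=[("user", "hi")],
    ))
```

Capping the context size is one simple way to keep the assembled prompt within the model's input limit; a production system would more likely count tokens than characters.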

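Implementation

The following sections show how to use the pgvector extension provided by cloud database PostgreSQL edition to build a vector database for efficient vector storage and retrieval.

Prerequisites

An ECS instance has been created, or a local host with a Linux environment is available, to act as the client machine that accesses the database.
Make sure you have an OpenAI Secret API Key and that your network environment can reach OpenAI.

Procedure

This article builds an enterprise-specific "database consultant" Q&A system as an example and walks through the whole process. The sample knowledge base is https://www.postgresql.org/docs/15/index.html; see the end of the article for how to obtain the scripts.

The environment is based on Debian 9.13. The setup below is for reference only; dependency installation differs across environments.

The process involves two main script files: generate-embeddings.ts, which builds the knowledge base, and queryGPT.py, the Q&A script. A suggested project layout is: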

.
├── package.json                              // TypeScript dependencies
├── docs
│   ├── PostgreSQL15.mdx                      // knowledge base document
├── script
│   ├── generate-embeddings.ts                // builds the knowledge base
│   ├── queryGPT.py                           // Q&A script

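1. Learning stage (vector construction)

1. Create a PostgreSQL instance

Log in to the cloud database PostgreSQL edition console (https://console.volcengine.com/db/rds-pg) to create an instance, then create a database and an account. For details on creating the instance, database, and account, see the cloud database PostgreSQL edition quick start (https://www.volcengine.com/docs/6438/79234).

2. Create the extension

Connect to the test database and create the pgvector extension (its name in SQL is vector).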

create extension if not exists vector;

Create the corresponding tables. The embedding column of the doc_chunks table holds the vector that represents each knowledge chunk.

-- Document metadata
create table docs (
  id bigserial primary key,
  -- Parent document ID
  parent_doc bigint references docs,
  -- Document path
  path text not null unique,
  -- Document checksum
  checksum text
);
-- Chunk data
create table doc_chunks (
  id bigserial primary key,
  doc_id bigint not null references docs on delete cascade, -- document ID
  content text, -- chunk content
  token_count int, -- number of tokens in the chunk
  embedding vector(1536), -- embedding vector generated from the chunk
  slug text, -- unique identifier generated for the heading
  heading text -- heading
);
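
pgvector accepts a vector as a text literal of the form '[v1,v2,...]', and the embedding column above is declared with 1536 dimensions. The sketch below shows one way a Python client might format an embedding and guard against a dimension mismatch before inserting. The helper name to_vector_literal and the INSERT_CHUNK_SQL constant are hypothetical; the article's own loader is the TypeScript script described next.

```python
EMBEDDING_DIM = 1536  # matches vector(1536) in doc_chunks


def to_vector_literal(embedding, dim=EMBEDDING_DIM):
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,0.2]'."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Parameterized insert for a DB-API driver such as psycopg2; the %s::vector
# cast turns the text literal into a pgvector value on the server side.
INSERT_CHUNK_SQL = """
insert into doc_chunks (doc_id, content, token_count, embedding, slug, heading)
values (%s, %s, %s, %s::vector, %s, %s)
"""

if __name__ == "__main__":
    print(to_vector_literal([0.1, 2, -3.5], dim=3))
```

With a psycopg2 cursor, the statement would be run as cur.execute(INSERT_CHUNK_SQL, (doc_id, content, token_count, to_vector_literal(vec), slug, heading)).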

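3. Build the vector knowledge base

On the client machine, split the knowledge base documents into suitably sized chunks, convert them into embedding vectors through OpenAI's embedding API, and store them in the database. See the end of the article for how to obtain the reference script.

Note that the script can only process files in markdown format.

Install pnpm:

curl -fsSL https://get.pnpm.io/install.sh | sh -

Install nodejs (see https://github.com/nodesource/distributions):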

sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg
NODE_MAJOR=16
echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_$NODE_MAJOR.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list
sudo apt-get update
sudo apt-get install nodejs -y

Install the TypeScript dependencies (see the end of the article for how to obtain the package.json configuration file):

pnpm run setup
pnpm install tsx

Edit generate-embeddings.ts to set the OpenAI key, the PG connection string, and the markdown document directory:

// Replace user, passwd, 127.0.0.1, and 5432 with the actual database user, password, address, and port
const postgresql_url = 'pg://user:passwd@127.0.0.1:5432/database';
const openai_key = '-------------';
const SOURCE_DIR = path.join(__dirname, 'document path');

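Run the script to generate the document embedding vectors and insert them into the database:

pnpm tsx script/generate-embeddings.ts

Run output:

[Image]

After the script finishes, inspect the knowledge base it built. Querying the docs table:

[Image]

Querying the doc_chunks table shows that the vectors were imported in bulk successfully:

[Image]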

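2. Question-answering stage

1. Create the similarity function

For the application's convenience, use PostgreSQL's user-defined function feature to create a function that lives inside the database. The application only has to call this function to get the vector matching results. The example uses the inner product to measure vector similarity.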

create or replace function match_chunks(chunk_embedding vector(1536), threshold float, count int, min_length int)
returns table (id bigint, content text, similarity float)
language plpgsql
as $$
begin
  return query
  select
    doc_chunks.id,
    doc_chunks.content,
    (doc_chunks.embedding <#> chunk_embedding) * -1 as similarity
  from doc_chunks

  -- Keep only chunks whose content is at least min_length characters long
  where length(doc_chunks.content) >= min_length

  -- The dot product is negative because of a Postgres limitation, so we negate it
  and (doc_chunks.embedding <#> chunk_embedding) * -1 > threshold
  -- Ascending order on <#> puts the most similar chunks first
  order by doc_chunks.embedding <#> chunk_embedding

  limit count;
end;
$$;
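
The function's logic can be mirrored in plain Python to see why the sign flips matter. pgvector's <#> operator returns the negative inner product, so the SQL multiplies by -1 to get a similarity and sorts ascending on the raw operator value to put the best matches first. The sketch below is an in-memory illustration only; the row layout and sample vectors are assumptions.

```python
def negative_inner_product(a, b):
    """What pgvector's <#> operator returns: the inner product, negated."""
    return -sum(x * y for x, y in zip(a, b))


def match_chunks(rows, query_embedding, threshold, count, min_length):
    """rows is an iterable of (id, content, embedding) tuples."""
    candidates = []
    for row_id, content, embedding in rows:
        distance = negative_inner_product(embedding, query_embedding)
        similarity = -distance  # negate back, like "* -1" in the SQL
        if len(content) >= min_length and similarity > threshold:
            candidates.append((distance, row_id, content, similarity))
    candidates.sort(key=lambda c: c[0])  # ascending <#> means most similar first
    return [(row_id, content, sim) for _, row_id, content, sim in candidates[:count]]


if __name__ == "__main__":
    rows = [
        (1, "aaaaa", (1.0, 0.0)),
        (2, "bbbbb", (0.6, 0.8)),
        (3, "cc", (1.0, 0.0)),     # too short, filtered by min_length
        (4, "ddddd", (0.0, 1.0)),  # similarity 0, below the threshold
    ]
    for match in match_chunks(rows, (1.0, 0.0), threshold=0.5, count=5, min_length=3):
        print(match)
```

OpenAI documents its embeddings as normalized to length 1, in which case the inner product coincides with cosine similarity, so a threshold such as 0.78 can be read as a cosine threshold.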

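2. Ask and answer

The following Python program accepts the asker's question, implements the "question-answering stage" of Prompt Engineering described above, and finally sends the asker an answer that combines "logical reasoning" with "deep domain knowledge".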

import os

import openai  # uses the pre-1.0 openai SDK interface (openai.Embedding, openai.ChatCompletion)
import psycopg2


def query_handler(query=None):
    if query is None or query.strip() == "":
        print('Please enter a valid question')
        return
    query = query.strip().replace('\n', ' ')
    embedding = None
    try:
        # Convert the question into an embedding vector through the OpenAI API
        response = openai.Embedding.create(
            engine="text-embedding-ada-002",  # fixed to text-embedding-ada-002
            input=[query],
        )
        embedding = response.data[0].embedding
    except Exception as ex:
        print(ex)
        return
    content = ""
    con = None
    try:
        # Parse the postgres settings and connect to the database
        # host:127.0.0.1,port:5432,user:test,password:test,database:test
        params = postgresql_url.split(',')
        database, user, password, host, port = "test", "test", "test", "127.0.0.1", "5432"
        for param in params:
            # Split on the first ':' only, so a value may itself contain ':'
            pair = param.split(':', 1)
            if len(pair) != 2:
                print('POSTGRESQL_URL error: ' + postgresql_url)
                return
            k, v = pair[0].strip(), pair[1].strip()
            if k == 'database':
                database = v
            elif k == 'user':
                user = v
            elif k == 'password':
                password = v
            elif k == 'host':
                host = v
            elif k == 'port':
                port = v
        # connect postgres
        con = psycopg2.connect(database=database, user=user, password=password, host=host, port=port)
        cur = con.cursor()
        # Fetch the chunks closest to the question. Selecting the content column
        # directly avoids parsing the composite row text, and the vector is passed
        # as a bound parameter instead of being concatenated into the SQL.
        vector_literal = '[' + ','.join(str(x) for x in embedding) + ']'
        cur.execute("select content from match_chunks(%s::vector, 0.78, 5, 50)", (vector_literal,))
        rows = cur.fetchall()
        for row in rows:
            content = content + row[0].strip() + "\n---\n"
    except Exception as ex:
        print(ex)
        return
    finally:
        if con is not None:
            con.close()
    try:
        # Combine the question with the chunk contents and send them to GPT
        prompt = '''Pretend you are GPT-4 model , Act as a database expert.
        I will introduce a database scenario for which you will provide advice and related sql commands.
        Please only provide advice related to this scenario. Based on the specific scenario from the documentation,
        answer the question only using that information. Please note that if there are any updates to the database
        syntax or usage rules, the latest content shall prevail. If you are uncertain or the answer is not explicitly
        written in the documentation, please respond with "I'm sorry, I cannot assist with this."\n\n''' + "Context sections:\n" + \
        content.strip().replace('\n', ' ') + '\n\nQuestion: """' + query.replace('\n', ' ') + '"""\n\nAnswer:'
        print('\nProcessing, please wait...\n')
        response = openai.ChatCompletion.create(
            engine="gpt_openapi",  # fixed to gpt_openapi
            messages=[
                {"role": "user", "content": prompt}
            ],
            model="gpt-35-turbo",
            temperature=0,
        )
        print('Answer:')
        print(response['choices'][0]['message']['content'])

    except Exception as ex:
        print(ex)
        return


os.environ['OPENAI_KEY'] = '-----------------------'
os.environ['POSTGRESQL_URL'] = 'host:127.0.0.1,port:5432,user:test,password:test,database:test'
openai_key = os.getenv('OPENAI_KEY')
postgresql_url = os.getenv('POSTGRESQL_URL')
# openai config
openai.api_type = "azure"
openai.api_base = "https://example-endpoint.openai.azure.com"
openai.api_version = "2023-XX"
openai.api_key = openai_key


def main():
    if openai_key is None or postgresql_url is None:
        print('Missing environment variable OPENAI_KEY, POSTGRESQL_URL(host:127.XX.XX.XX,port:5432,user:XX,password:XX,database:XX)')
        return
    print('I am your PostgreSQL AI assistant. Enter the question you want to ask, for example:\n1. How do I create a table?\n2. Explain the select statement to me.\n3. How do I create a stored procedure?')
    while True:
        query = input("\nEnter your question:")
        query_handler(query)


if __name__ == "__main__":
    main()
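
The script's POSTGRESQL_URL uses a custom host:..,port:.. format rather than a standard connection URI. As a sketch, the parsing could be pulled into a small helper that is easier to test in isolation. The name parse_pg_url is hypothetical and not part of the article's script.

```python
def parse_pg_url(url):
    """Parse 'host:..,port:..,user:..,password:..,database:..' into a dict."""
    allowed = {"host", "port", "user", "password", "database"}
    params = {}
    for item in url.split(","):
        # partition splits on the first ':' only, so values may contain ':'
        key, sep, value = item.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in allowed:
            raise ValueError(f"bad POSTGRESQL_URL entry: {item!r}")
        params[key] = value
    return params


if __name__ == "__main__":
    print(parse_pg_url("host:127.0.0.1,port:5432,user:test,password:test,database:test"))
```

The resulting dict uses keyword names that psycopg2.connect accepts, so it could be passed as psycopg2.connect(**parse_pg_url(postgresql_url)). Values containing a comma would still break this format.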

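First, replace the OpenAI key and the PG connection string (the two os.environ assignments, lines 90 and 91 of the source script) with your actual key and connection address:

os.environ['OPENAI_KEY'] = '-----------------------'
os.environ['POSTGRESQL_URL'] = 'host:127.0.0.1,port:5432,user:test,password:test,database:test'

Then adjust the GPT parameters:

openai.api_type = "azure"
openai.api_base = "https://example-endpoint.openai.azure.com"
openai.api_version = "2023-XX"

Next, edit the bot's self-introduction so that askers can quickly see what the Q&A bot specializes in. The self-introduction here states that the bot plays the role of a database expert.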

prompt = '''Pretend you are GPT-4 model , Act as a database expert.
        I will introduce a database scenario for which you will provide advice and related sql commands.
        Please only provide advice related to this scenario. Based on the specific scenario from the documentation,
        answer the question only using that information. Please note that if there are any updates to the database
        syntax or usage rules, the latest content shall prevail. If you are uncertain or the answer is not explicitly
        written in the documentation, please respond with "I'm sorry, I cannot assist with this."\n\n''' + "Context sections:\n" + \
        content.strip().replace('\n', ' ') + '\n\nQuestion: """' + query.replace('\n', ' ') + '"""\n\nAnswer:'

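Finally, install the script's dependencies:

# The script calls openai.Embedding and openai.ChatCompletion, which are only available in openai releases before 1.0
pip install psycopg2-binary
pip install 'openai<1.0'
pip install 'openai[datalib]<1.0'

Test run:

[Image]

At this point you have an enterprise-specific intelligent Q&A system.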

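Advantages of the solution

Compared with other vector databases, a vector database built on the pgvector extension provided by Volcano Engine's cloud database PostgreSQL edition has the following advantages:

Easy to pick up: no AI specialist is needed and no other large, complex distributed cluster has to be built; a single database instance is enough to create a dedicated vector database. The interface is compatible with existing SQL syntax, so no custom scheduling framework or client is required.
Cost-effective: an existing database instance can be used, with no need to buy additional large cluster resources.
Data is available in real time: vector data can be inserted and updated within milliseconds and remains transactional, so there is no risk of data inconsistency.
High concurrency and easy scaling: vector workloads can sustain several thousand TPS; when performance becomes a bottleneck, read-only nodes can be added with one click to raise overall throughput quickly.
High-dimensional vectors: pgvector supports vectors of up to 16,000 dimensions, which covers the vast majority of vector storage and usage scenarios.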

Editor: 龐桂玉. Source: ByteDance technical team (字節跳動技術團隊)

