Has a formidable rival to vector databases arrived? Is yet another batch of startups in this space about to fold?
...
These are some of the reactions heard in tech circles after OpenAI launched the Assistant retrieval feature. The reason: this feature gives users RAG (Retrieval-Augmented Generation) capabilities for question answering over a knowledge base. Until now, most teams have treated the vector database as a core component of a RAG solution, used to reduce LLM "hallucinations".
So the question is: which comes out on top, OpenAI's built-in Assistant retrieval or an open-source RAG solution built on a vector database?
In the spirit of rigorous inquiry, we ran a quantitative evaluation of this question, and the result is quite interesting: OpenAI really is strong!
It just falls a little short of the open-source RAG solution built on a vector database!
Below, I will walk through the entire evaluation. It is worth stressing that running this evaluation properly is not easy: a handful of test samples cannot meaningfully measure the different aspects of a RAG application.
We therefore need a fair, objective RAG evaluation tool, applied to a suitable dataset, so that the assessment is quantitative, analytical, and reproducible.
Without further ado, here is the process!
I. Evaluation Tool
Ragas (https://docs.ragas.io/en/latest/) is an open-source framework dedicated to evaluating the effectiveness of RAG applications. You only need to supply some of the artifacts of the RAG process, such as the question, contexts, and answer, and Ragas uses them to score a number of metrics quantitatively. Install Ragas with pip, and the evaluation takes just a few lines of code:
Python
from ragas import evaluate
from datasets import Dataset

# prepare your Hugging Face dataset in this format:
# dataset = Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_precision': 0.817,
#  'faithfulness': 0.892, 'answer_relevancy': 0.874}
Ragas breaks its scoring down into a number of metric sub-categories, for example:
?From the generation perspective: Faithfulness, which measures how trustworthy the answer is, and Answer relevancy, which measures how relevant the answer is to the question
?From the retrieval perspective: Context precision, which measures the precision of the retrieved knowledge; Context recall, which measures its recall; and Context relevancy, which measures how relevant the retrieved content is
?From the perspective of comparing the answer against the ground truth: Answer semantic similarity, which measures how close the answer is, and Answer correctness, which measures whether it is correct
?From the answer itself: the various Aspect Critique metrics
[Figure: overview of the Ragas metrics. Source: https://docs.ragas.io/en/latest/concepts/metrics/index.html]
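If you only care about a subset of these metrics, you can pass them to evaluate() explicitly. The sketch below is a minimal example of that; the exact import paths and the presence of an overall ragas_score vary between Ragas versions, and the one-row dataset is just a toy to show the expected schema (running it also requires an OpenAI API key, since the metrics themselves call an LLM).
Python
from datasets import Dataset
from ragas import evaluate
# Metric objects live under ragas.metrics; exact names may differ by version.
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# A one-row toy dataset that shows the expected column layout.
dataset = Dataset.from_dict({
    "question": ["What is an IRA?"],
    "contexts": [["An IRA is an individual retirement account in the US."]],
    "answer": ["An IRA is a US individual retirement account."],
    "ground_truths": [["An IRA is an individual retirement account."]],
})

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # one score per requested metric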
Each of these metrics looks at the application from a different angle. Take answer correctness as an example: it is outcome-oriented and directly measures whether the RAG application's answer is correct. Below is a comparison of a high-scoring and a low-scoring answer on answer correctness:
Plain Text
Ground truth: Einstein was born in 1879 at Germany.
High answer correctness: In 1879, in Germany, Einstein was born.
Low answer correctness: In Spain, Einstein was born in 1879.
Details of the other metrics can be found in the official documentation (https://docs.ragas.io/en/latest/concepts/metrics/index.html).
What matters is that each metric measures a different aspect, so users can evaluate the quality of a RAG application from every angle.
II. Evaluation Dataset
We use the Financial Opinion Mining and Question Answering (fiqa) dataset (https://sites.google.com/view/fiqa/) as the test dataset, mainly for the following reasons:
?It is a specialized financial-domain dataset with a very diverse corpus, and it includes human-written answers. It covers fairly obscure financial knowledge that is unlikely to appear in GPT's training data, which makes it well suited to serve as an external knowledge base and to contrast with an LLM that has never seen this knowledge.
?It was originally built to evaluate Information Retrieval (IR), so it ships with annotated knowledge passages that can serve directly as the ground truth for retrieval.
?Ragas itself treats it as a standard getting-started dataset (https://docs.ragas.io/en/latest/getstarted/evaluation.html#the-data) and provides a script for building it (https://github.com/explodinggradients/ragas/blob/main/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb). It therefore has a community footprint and broad acceptance, which makes it a good baseline.
We first run the conversion script to turn the raw fiqa dataset into a format that Ragas can work with. A quick look at the resulting evaluation dataset shows 647 finance-related query questions; for each question, the list of original knowledge passages forms the ground_truths, usually 1 to 4 passages.
[Figure: sample rows from the fiqa dataset]
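For reference, the Ragas project also publishes a converted copy of fiqa on the Hugging Face Hub, so you can load it directly instead of rerunning the conversion notebook. A minimal sketch follows; the explodinggradients/fiqa dataset name comes from the Ragas materials, but the config and split names below are assumptions to verify against the dataset card.
Python
from datasets import load_dataset

# Config/split names are assumptions based on the Ragas fiqa baseline notebook;
# check the dataset card before relying on them.
fiqa = load_dataset("explodinggradients/fiqa", "main")

sample = fiqa["train"][0]
print(sample["question"])        # a finance question
print(sample["ground_truths"])   # 1 to 4 annotated knowledge passages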
At this point the test data is ready. All that remains is to take the question column, pose each question to the RAG application, merge the application's answer and retrieved contexts with the ground truths, and hand everything to Ragas for scoring.
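Concretely, the evaluation loop looks roughly like the sketch below. rag_app.ask() is a hypothetical stand-in for whichever RAG application is under test, assumed to return the answer text plus the retrieved context strings; the real glue code lives in the repository linked later in this article.
Python
from datasets import Dataset
from ragas import evaluate

records = {"question": [], "contexts": [], "answer": [], "ground_truths": []}

for row in fiqa["train"]:                     # the converted fiqa rows from the previous sketch
    question = row["question"]
    answer, contexts = rag_app.ask(question)  # hypothetical: returns (str, list[str])
    records["question"].append(question)
    records["answer"].append(answer)
    records["contexts"].append(contexts)
    records["ground_truths"].append(row["ground_truths"])

eval_dataset = Dataset.from_dict(records)
scores = evaluate(eval_dataset)               # score every supported Ragas metric
print(scores)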
III. RAG Setups Under Comparison
The next step is to build the two RAG applications to compare and run the benchmark: the OpenAI assistant, and a custom RAG pipeline based on a vector database.
1. OpenAI assistant
We follow OpenAI's official assistant retrieval guide (https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) to create the assistant and upload the knowledge, and use the officially documented message annotations (https://platform.openai.com/docs/assistants/how-it-works/message-annotations) to extract the answer and the retrieved contexts; everything else stays at the default settings.
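The setup boils down to a handful of API calls. Below is a minimal sketch based on the openai Python SDK and the Assistants beta as documented at the time of writing; the retrieval tool name, the file_ids parameter, and the annotation fields are assumptions that may have changed in later API versions, and fiqa_corpus.txt is a placeholder for the uploaded knowledge file.
Python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the knowledge file and attach it to an assistant with the retrieval tool.
knowledge_file = client.files.create(file=open("fiqa_corpus.txt", "rb"),
                                     purpose="assistants")
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[knowledge_file.id],
)

def ask(question: str):
    """Ask one question; return (answer, cited context snippets)."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(thread_id=thread.id, role="user",
                                        content=question)
    run = client.beta.threads.runs.create(thread_id=thread.id,
                                          assistant_id=assistant.id)
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

    message = client.beta.threads.messages.list(thread_id=thread.id).data[0]
    text = message.content[0].text
    # File citations in the annotations carry the retrieved snippets.
    contexts = [a.file_citation.quote for a in text.annotations
                if getattr(a, "file_citation", None)]
    return text.value, contexts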
2. RAG pipeline based on a vector database
Next we build a RAG pipeline based on vector retrieval: the Milvus (https://milvus.io/) vector database stores the knowledge, the BAAI/bge-base-en model from HuggingFaceEmbeddings (https://python.langchain.com/docs/integrations/platforms/huggingface) produces the embeddings, and LangChain (https://python.langchain.com/docs/get_started/introduction) components handle document ingestion and Agent construction.
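A stripped-down sketch of this pipeline is below. For brevity it wires the retriever into a plain RetrievalQA chain rather than the Agent-based setup used in the actual benchmark code, the chunking parameters are illustrative, and the connection_args assume a local Milvus deployment; see the repository linked below for the real implementation.
Python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus

# 1. Load and chunk the knowledge corpus (chunk sizes here are illustrative).
docs = TextLoader("fiqa_corpus.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512,
                                        chunk_overlap=64).split_documents(docs)

# 2. Embed the chunks with BAAI/bge-base-en and store them in Milvus.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")
vectorstore = Milvus.from_documents(
    chunks,
    embeddings,
    connection_args={"host": "localhost", "port": "19530"},  # assumed local Milvus
)

# 3. Answer with the same LLM as the OpenAI assistant, grounded on retrieved chunks.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,  # keep the contexts for Ragas scoring
)
result = qa({"query": "Are personal finance classes taught in high school?"})
print(result["result"])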
The two setups are compared side by side below.
Note that both use the same LLM, gpt-4-1106-preview; since the OpenAI assistant is closed, its other strategies presumably differ from ours in many respects. For reasons of space we will not go into the implementation details here; see our implementation code (https://github.com/milvus-io/bootcamp/tree/master/evaluation).
IV. Results and Analysis
1. Experimental results
We scored both applications with several Ragas metrics and obtained the following per-metric comparison:
[Figure: per-metric comparison]
Across the 5 metrics we measured, the OpenAI assistant beats the custom RAG pipeline only on answer_similarity; on every other metric it scores slightly below the custom RAG pipeline.
In addition, Ragas can take the harmonic mean of the individual metrics to produce an overall score, the Ragas score. The point of using the harmonic mean is that it penalizes low-scoring items. Overall, the OpenAI assistant's Ragas score is also lower than the custom RAG pipeline's.
[Figure: Ragas score comparison]
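To see why the harmonic mean punishes a single weak metric harder than a plain average would, consider the toy numbers below (illustrative values, not the scores from our run):
Python
from statistics import harmonic_mean

scores = {"faithfulness": 0.9, "answer_relevancy": 0.9, "context_recall": 0.5}

print(sum(scores.values()) / len(scores))  # arithmetic mean: ~0.77
print(harmonic_mean(scores.values()))      # harmonic mean:   ~0.71, dragged down by the 0.5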
So why do the results look like this? After comparing a large number of individual results, we found that the vector-database RAG pipeline has the following advantages over the OpenAI assistant:
1) The OpenAI assistant prefers to answer from its own knowledge and makes little use of the retrieved knowledge.
This is probably a consequence of its internal Agent configuration: when it hits a fact that GPT likely already knows, it grows more confident and relies on the knowledge from its training rather than on the retrieved knowledge, even when the knowledge base happens to contradict what it learned in training. For example:
Question:
Plain Text
Are personal finance / money management classes taught in high school, anywhere?
The OpenAI assistant's answer:
Plain Text
Personal finance and money management classes are indeed taught in some high schools, although the availability and requirement of such classes can vary greatly depending on the country and even within regions of a country. In the United States, for example, the Council for Economic Education (CEE) conducts surveys and reports on the state of financial education in K-12 schools. Their findings suggest that as of my last update, a growing number of states have integrated personal finance into their educational standards, and some have mandated that students take a personal finance course in order to graduate.
Although I don't have access to a live database or the most recent studies, you can often find information about financial education in schools by:
1. Checking government education websites.
2. Looking at reports from financial education nonprofits and councils.
3. Contacting local school boards or departments of education.
If you are interested in specific locations or have a particular focus in mind, you could specify that so I could provide more directed information, if available.
Is this answer correct? Taken on its own it seems fine, but note the passage where the assistant stresses that it has no access to a live database or the most recent studies: that tells us it did not use the uploaded knowledge documents at all.
Now look at the ground truth, i.e., the correct answer annotated in the dataset:
Plain Text
["In Houston, Texas USA where I went to a private high school they had a half-semester class in personal finance, but it was optional and didn't give you any credits towards graduation. You are right though, it should be a standard class. After all, who doesn't need that information in their adult lives, and not everyone goes to college."
...
'"We had a ""civics"" class when I was a freshman in high school. This was in the Ann Arbor, MI public schools. It covered the very basics (how to balance your checkbook, what are stocks, how do income taxes work, what is interest, etc.) of money management along with an overview of politics and the legal system. It was a really light class, though, and didn\'t go deeply into personal finance and money management. I agree that such a class would be very valuable, as would cooking, nutrition, and basic home and car repair."']
The annotated passages contain many concrete examples that answer this question, and OpenAI's answer clearly did not use any of them. The assistant judged the question simple enough to answer directly and ignored the information available in the knowledge base.
Next, let's look at the answer from the vector-database RAG:
Plain Text
Personal finance and money management classes are indeed taught in some high schools, but the availability and requirement of such classes can vary greatly. For example, in Houston, Texas, a private high school offered an optional half-semester class in personal finance that did not provide credits towards graduation. Staples High School in Westport, Connecticut, introduced a personal financial management class focused on using mathematics to develop financial literacy skills, covering topics such as earnings, banking, credit cards, loans, taxes, insurance, investing, budgeting, and buying personal property. In Ann Arbor, Michigan, public schools offered a 'civics' class that included basic money management topics. Additionally, in the UK, the 'Every Child Matters' framework includes 'achieving economic well-being' as an important element, and some schools use a School Rewards System to teach kids about finance and economy. While these examples show that such classes exist, they are not universally required or standardized, and the depth and focus of the curriculum can vary.
It is easy to spot the parts of this answer that draw on the ground truth: the pipeline weaves the knowledge content into its answer, which is exactly what the user needs.
2) OpenAI's chunking and retrieval of the knowledge leave room for improvement; the open-source custom setup does better.
By inspecting the intermediate passages the assistant retrieves, we can analyze how it chunks the knowledge documents, or infer how well its embedding model performs. For example:
Question:
JSON
Pros / cons of being more involved with IRA investments [duplicate]
The OpenAI assistant's retrieved passage:
JSON
['PROS: CONS']
This is clearly a bad retrieval result, and it is the only passage retrieved. First, the chunking is poor: the content that should follow has been cut off. Second, the embedding model failed to retrieve the more important passages that could actually answer the question; it only retrieved a fragment whose wording resembles the question.
The custom RAG pipeline's retrieved passages:
Plain Text
['in the tax rate, there\'s also a significant difference in the amount being taxed. Thus, withdrawing from IRA is generally not a good idea, and you will never be better off with withdrawing from IRA than with cashing out taxable investments (from tax perspective). That\'s by design."'
"Sounds like a bad idea. The IRA is built on the power of compounding. Removing contributions will hurt your retirement savings, and you will never be able to make that up. Instead, consider tax-free investments. State bonds, Federal bonds, municipal bonds, etc. For example, I invest in California muni bonds fund which gives me ~3-4% annual dividend income - completely tax free. In addition - there's capital appreciation of your fund holdings. There are risks, of course, for example rate changes will affect yields and capital appreciation, so consult with someone knowledgeable in this area (or ask another question here, for the basics). This will give you the same result as you're expecting from your Roth IRA trick, without damaging your retirement savings potential."
"In addition to George Marian's excellent advice, I'll add that if you're hitting the limits on IRA contributions, then you'd go back to your 401(k). So, put enough into your 401(k) to get the match, then max out IRA contributions to give you access to more and better investment options, then go back to your 401(k) until you top that out as well, assuming you have that much available to invest for retirement."
"While tax deferral is a nice feature, the 401k is not the Holy Grail. I've seen plenty of 401k's where the investment options are horrible: sub-par performance, high fees, limited options. That's great that you've maxed out your Roth IRA. I commend you for that. As long as the investment options in your 401k are good, then I would stick with it."
"retirement plans which offer them good cheap index funds. These people probably don't need to worry quite as much. Finally, having two accounts is more complicated. Please contact someone who knows more about taxes than I am to figure out what limitations apply for contributing to both IRAs and 401(k)s in the same year."]
As you can see, the custom RAG pipeline retrieved many passages about IRA investing, and this content is effectively woven into the LLM's final answer.
It is also worth noting that vector retrieval shows an effect similar to term-based retrieval such as BM25: the retrieved passages do contain the key term "IRA". So vector retrieval is not only effective at the level of overall semantics; at the level of individual terms its recall is no worse than term-frequency retrieval.
2. Other aspects
Beyond the experimental results, the OpenAI assistant has some other clear disadvantages compared with the more flexible custom open-source RAG approach:
?The OpenAI assistant exposes no parameters of the RAG process; internally it is a black box, which also means it cannot be tuned. A custom RAG setup lets you adjust components and parameters such as top_k, chunk size, and the embedding model, so it can be optimized for your specific data (see the sketch below).
?OpenAI limits how much you can store, whereas a vector database can hold massive amounts of knowledge. A single uploaded file is capped at 512 MB and must not exceed 2,000,000 tokens.
As a result, OpenAI cannot support RAG services with more complex business logic, larger data volumes, or heavier customization.
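To make the tunability point above concrete, here is a small sketch of the knobs a custom pipeline exposes, reusing the LangChain components from the earlier pipeline sketch; the specific values are illustrative, not recommendations.
Python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking: trade retrieval granularity against context length.
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)

# Embedding model: swap in a larger or domain-tuned model if needed.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")

# Retrieval depth: how many chunks to hand the LLM per question.
# vectorstore is the Milvus store built in the earlier pipeline sketch.
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})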
V. Conclusion
Using the Ragas evaluation tool, we compared and analyzed the OpenAI assistant against an open-source, vector-database-based RAG approach in detail. The finding: although the OpenAI assistant does perform reasonably well, it falls behind the vector-based RAG approach in answer quality, recall, and other respects, and the Ragas metrics reflect this conclusion quantitatively.
So, to build stronger, better-performing RAG applications, developers can consider building custom retrieval on top of a vector database such as Milvus (https://zilliz.com/what-is-milvus) or Zilliz Cloud (https://cloud.zilliz.com.cn/signup), which offers better results and more flexible choices.