A Hands-On Analysis of the Reliability of Large Language Model Responses

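Author: Zhu Xianzhong 2025-02-28 08:00:00

This article compares trustworthiness (reliability) scores generated for large language model responses produced in two ways: direct prompting and retrieval-augmented generation.

Translator | Zhu Xianzhong

Reviewer | Chonglou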

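The basic principle of a large language model (LLM) is very simple: predict the next word (or token) in a sequence based on statistical patterns in its training data. Yet this seemingly simple capability turns out to be remarkably sophisticated, because it can carry out many impressive tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs have no memory and do not actually "understand" anything; they simply stick to their basic function of predicting the next word.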

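Next-word prediction is probabilistic: the LLM must pick each word from a probability distribution. In the process, LLMs often generate false, fabricated, or inconsistent content as they try to produce a coherent response and fill gaps with plausible-sounding but incorrect information. This phenomenon is called hallucination. It is a well-known and unavoidable characteristic of LLMs, which is why their outputs need to be validated and corroborated.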
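To make the idea of picking a word from a probability distribution concrete, here is a minimal sketch. The candidate words and their raw scores are made up for illustration and do not come from any real model; the sketch only shows how scores become probabilities (softmax) and how one word is then drawn at random.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the prompt "The capital of France is"
vocab = ["Paris", "Lyon", "London"]
logits = [4.0, 1.5, 0.5]

probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(vocab, logits, rng=rng) for _ in range(1000)]

print({tok: round(p, 3) for tok, p in zip(vocab, probs)})
print({tok: samples.count(tok) for tok in vocab})
```

Because the choice is random, even a low-probability word is picked occasionally, which is one way a fluent but wrong continuation can end up in the output.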

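Retrieval-augmented generation (RAG) lets an LLM work together with an external knowledge source, which reduces hallucinations to some extent but cannot eliminate them. Although advanced RAG setups can provide in-text citations and URLs, verifying those references can be tedious and time-consuming. We therefore need an objective criterion for assessing the reliability, or trustworthiness, of an LLM response, whether it was generated from the model's own knowledge or from an external knowledge base (RAG).

In this article, we discuss how to assess the trustworthiness of LLM output using a trustworthy language model, which assigns a score to the output. We first discuss how to use a trustworthy language model to score an LLM's answers and explain the score. We then build an example RAG with LlamaParse and LlamaIndex to assess the trustworthiness of RAG answers.

The complete code for this article is available in a Jupyter notebook on GitHub.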

Assigning a trustworthiness score to an LLM's answers

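To demonstrate how to assign a trustworthiness score to an LLM's reply, I will use Cleanlab's Trustworthy Language Model (TLM). TLM combines uncertainty quantification and consistency analysis to compute a trustworthiness score and an explanation for an LLM response.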
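The intuition behind consistency analysis can be illustrated with a toy example. The sketch below is not Cleanlab's algorithm; it only assumes that the same question has been sampled several times and treats the share of samples that agree with the most common answer as a crude confidence signal. The function name and the sample answers are hypothetical.

```python
from collections import Counter

def consistency_score(answers):
    """Return the most common answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    return top_answer, votes / len(normalized)

# Hypothetical answers from asking the same question five times
sampled = ["5", "6", "5", "5", "6"]
answer, score = consistency_score(sampled)
print(answer, score)
```

A model that keeps changing its answer across samples gets a lower agreement score, which is the kind of inconsistency a trustworthiness explanation can point to.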

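Cleanlab offers a free trial API, which you can get by creating an account on its website. We first need to install Cleanlab's Python client:

pip install --upgrade cleanlab-studio

Cleanlab supports several proprietary models, such as "gpt-4o", "gpt-4o-mini", "o1-preview", "claude-3-sonnet", "claude-3.5-sonnet", and "claude-3.5-sonnet-v2". The following shows how TLM assigns a trustworthiness score to GPT-4o's answer. The score ranges from 0 to 1, with higher values indicating greater trustworthiness.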

from cleanlab_studio import Studio

studio = Studio("<CLEANLAB_API_KEY>")  # Use the API key from your Cleanlab account
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"})  # GPT, Claude, etc.
# Set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")
# The TLM output contains the model's response, the trustworthiness score, and an explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")

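The code above tests GPT-4o's response to the question "How many vowels are there in the word 'Abracadabra'?" TLM's output contains the model's answer (response), the trustworthiness score, and an explanation. Here is the output of this code.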

Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
5.

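As you can see, even a state-of-the-art language model hallucinates on such a simple task and produces incorrect output. Below are the answer and trustworthiness score from claude-3.5-sonnet-v2 for the same question.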

Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a

The vowels are: A, a, a, a, a

There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.
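The expected answer is easy to check independently of any model. A minimal sketch (the helper name is hypothetical):

```python
def count_vowels(word):
    """Count the letters a, e, i, o, u in a word, ignoring case."""
    return sum(1 for ch in word.lower() if ch in "aeiou")

print(count_vowels("Abracadabra"))  # prints 5
```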

claude-3.5-sonnet-v2 produced the correct output. Let's compare the two models' answers to another question.

from cleanlab_studio import Studio
from IPython.display import display, Markdown

# Initialize Cleanlab Studio with your API key
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key

# Models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop over each model and evaluate it
for model in models:
    tlm = studio.TLM(options={"log": ["explanation"], "model": model})
    out = tlm.prompt(prompt_text)

    md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}

---
"""
    display(Markdown(md_content))

Here are the responses from the two models:

Incorrect outputs generated by GPT-4o and Claude-3.5-Sonnet-V2, reflected in low trustworthiness scores
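The arithmetic behind this prompt can be confirmed in a couple of lines. The sketch below uses Python's decimal module so the two values are compared as exact decimal numbers; the helper name is hypothetical.

```python
from decimal import Decimal

def bigger(a, b):
    """Return whichever decimal string represents the larger number."""
    return a if Decimal(a) > Decimal(b) else b

print(bigger("9.11", "9.9"))  # prints 9.9
```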

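We can also generate trustworthiness scores for open-source LLMs. Let's look at a recently much-hyped open-source LLM: DeepSeek-R1. I will use DeepSeek-R1-Distill-Llama-70B, which is based on Meta's Llama-3.3-70B-Instruct model and was distilled from DeepSeek's larger 671-billion-parameter mixture-of-experts (MoE) model. Knowledge distillation is a machine learning technique that transfers what a large pre-trained "teacher" model has learned to a smaller "student" model.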

import os

import streamlit as st
from cleanlab_studio import Studio
from IPython.display import display, Markdown
from langchain_groq.chat_models import ChatGroq

os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]
# Initialize the DeepSeek-R1 distilled Llama model served by Groq
model_name = "deepseek-r1-distill-llama-70b"
groq_llm = ChatGroq(model=model_name, temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"
# Get the response from the model
response = groq_llm.invoke(prompt)
# Initialize Cleanlab Studio
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # Request an explanation
# Get the output containing the trustworthiness score and the explanation
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())
md_content = f"""
## Model: {model_name}
**Response:** {response.content.strip()}
**Trustworthiness Score:** {output['trustworthiness_score']}
**Explanation:** {output['log']['explanation']}
---
"""
display(Markdown(md_content))

Below is the output of the deepseek-r1-distill-llama-70b model.

Correct output from the deepseek-r1-distill-llama-70b model, with a high trustworthiness score
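The teacher-to-student transfer behind knowledge distillation, mentioned above, can be sketched with a classic soft-label loss. This is a generic illustration, not a description of how DeepSeek produced its distilled models: the logits are made up, and the loss shown is the KL divergence between temperature-softened teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over three candidate tokens
teacher = [4.0, 1.5, 0.5]
close_student = [3.8, 1.6, 0.4]  # nearly matches the teacher
far_student = [0.5, 1.5, 4.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

Training the student to minimize this kind of loss pushes its predicted distribution toward the teacher's, so a student that already agrees with the teacher incurs a much smaller loss than one that does not.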

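Developing a trustworthy RAG

We will now build a RAG to demonstrate how to measure the trustworthiness of LLM responses in a RAG setting. The RAG is built by scraping data from a given link, parsing it into Markdown, and creating a vector store.

The code that follows requires the following libraries: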

pip install llama-parse llama-index-core llama-index-embeddings-huggingface \
    llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

To render HTML to PDF, we also need to install the wkhtmltopdf command-line tool from its website.

The following libraries are imported:

import os
from typing import ClassVar, Dict, List

import nest_asyncio
import pdfkit
import requests
from bs4 import BeautifulSoup
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.cleanlab import CleanlabTLM
from llama_parse import LlamaParse

# Allow nested event loops, which LlamaParse needs when run inside a Jupyter notebook
nest_asyncio.apply()

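The next steps use Python's BeautifulSoup library to scrape data from a given URL, save the scraped data as a PDF file with pdfkit, and then parse the PDF into a Markdown file with LlamaParse, a GenAI-native document parsing platform built with LLMs and designed for LLM use cases.

We first configure the LLM that CleanlabTLM will use and the embedding model (the HuggingFace embedding model BAAI/bge-small-en-v1.5), which computes embeddings of the scraped data to create the vector store.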

options = {
    "model": "gpt-4o",
    "max_tokens": 512,
    "log": ["explanation"],
}
llm = CleanlabTLM(api_key="<CLEANLAB_API_KEY>", options=options)  # Get a free API key from https://cleanlab.ai/
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

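Next, we define a custom event handler, GetTrustworthinessScore, which inherits from a base event handler class. The handler is triggered when an LLM completion ends and extracts the trustworthiness score from the response metadata. A helper function, display_response, displays the LLM's response along with its trustworthiness score.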

# Event handler that captures the trustworthiness score
class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> Dict:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs.get("trustworthiness_score", 0.0)
            self.events.append(event)
        return {}

# Register the handler on the root dispatcher so that it receives all events
root_dispatcher = get_dispatcher()
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)

# Helper function that displays the LLM's response
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

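Next, we generate a PDF by scraping data from a given URL. For demonstration purposes, we scrape only this Wikipedia article on large language models (licensed under Creative Commons Attribution-ShareAlike 4.0).

Note: Readers are advised to always double-check the status of the content and data they are about to scrape and to make sure they are allowed to do so.

The code snippet below scrapes data from the given URLs by making an HTTP request and parsing the HTML content with Python's BeautifulSoup library. The HTML content is cleaned by converting protocol-relative URLs to absolute URLs. The scraped content is then converted into a PDF file with pdfkit.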

##########################################
# Generate PDFs from multiple URLs
##########################################
# Configure the wkhtmltopdf path (adjust it for your system)
wkhtml_path = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=wkhtml_path)
# Define the URLs and assign document names
urls = {
    "LLMs": "https://en.wikipedia.org/wiki/Large_language_model"
}
# Directory in which to save the PDFs
pdf_directory = "PDFs"
os.makedirs(pdf_directory, exist_ok=True)
pdf_paths = {}
for doc_name, url in urls.items():
    try:
        print(f"Processing {doc_name} from {url} ...")
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Stop early on HTTP errors
        soup = BeautifulSoup(response.text, "html.parser")
        main_content = soup.find("div", {"id": "mw-content-text"})
        if main_content is None:
            raise ValueError("Main content not found")
        # Replace protocol-relative URLs (starting with //) with absolute URLs
        html_string = str(main_content).replace('src="//', 'src="https://').replace('href="//', 'href="https://')
        pdf_file_path = os.path.join(pdf_directory, f"{doc_name}.pdf")
        pdfkit.from_string(
            html_string,
            pdf_file_path,
            options={'encoding': 'UTF-8', 'quiet': ''},
            configuration=config
        )
        pdf_paths[doc_name] = pdf_file_path
        print(f"Saved PDF for {doc_name} at {pdf_file_path}")
    except Exception as e:
        print(f"Error processing {doc_name}: {e}")
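The protocol-relative URL cleanup described above can be isolated into a small function and tried on a sample string. This is a minimal sketch: the function name and the sample HTML are hypothetical, and it relies on plain string replacement rather than an HTML parser.

```python
def absolutize_protocol_relative(html, scheme="https"):
    """Prefix protocol-relative src/href URLs (those starting with //) with a scheme."""
    for attr in ("src", "href"):
        html = html.replace(f'{attr}="//', f'{attr}="{scheme}://')
    return html

# Hypothetical fragment: one protocol-relative image and one site-relative link
snippet = '<img src="//upload.wikimedia.org/a.png"><a href="/wiki/GPT">GPT</a>'
print(absolutize_protocol_relative(snippet))
```

Only URLs that begin with two slashes are rewritten; site-relative links such as /wiki/GPT are left unchanged, so they will not resolve inside the generated PDF.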

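After generating PDFs from the scraped data, we parse them with LlamaParse. We set parsing instructions to extract the content in Markdown format and to parse the document page by page, together with the document name and page number. These extracted entities (pages) are called nodes. The code iterates over the extracted nodes and updates each one by adding metadata and prepending a citation header to its text, which makes later referencing easier.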

##########################################
# Parse the PDFs with LlamaParse and inject metadata
##########################################

# Define the parsing instructions (if your parser supports them)
parsing_instructions = """Extract the document content in markdown format.
Split the document into nodes (for example, by page).
Ensure each node has metadata for the document name and page number."""

# Create a LlamaParse instance
parser = LlamaParse(
    api_key="<LLAMACLOUD_API_KEY>",  # Replace with your actual key
    parsing_instruction=parsing_instructions,
    result_type="markdown",
    premium_mode=True,
    max_timeout=600
)
# Directory for the combined Markdown files (one per PDF)
output_md_dir = os.path.join(pdf_directory, "markdown_docs")
os.makedirs(output_md_dir, exist_ok=True)
# List that collects all updated nodes for indexing
all_nodes = []
for doc_name, pdf_path in pdf_paths.items():
    try:
        print(f"Parsing PDF for {doc_name} from {pdf_path} ...")
        nodes = parser.load_data(pdf_path)  # Returns a list of nodes
        updated_nodes = []
        # Process each node: update its metadata and inject a citation header into its text.
        for i, node in enumerate(nodes, start=1):
            # Copy the existing metadata (if any) and add our own keys.
            new_metadata = dict(node.metadata) if node.metadata else {}
            new_metadata["document_name"] = doc_name
            if "page_number" not in new_metadata:
                new_metadata["page_number"] = str(i)
            # Build the citation header.
            citation_header = f"[{new_metadata['document_name']}, page {new_metadata['page_number']}]\n\n"
            # Prepend the citation header to the node's text.
            updated_text = citation_header + node.text
            new_node = node.__class__(text=updated_text, metadata=new_metadata)
            updated_nodes.append(new_node)
        # Save one combined Markdown file for the document using the updated node texts.
        combined_texts = [node.text for node in updated_nodes]
        combined_md = "\n\n---\n\n".join(combined_texts)
        md_filename = f"{doc_name}.md"
        md_filepath = os.path.join(output_md_dir, md_filename)
        with open(md_filepath, "w", encoding="utf-8") as f:
            f.write(combined_md)
        print(f"Saved combined markdown for {doc_name} to {md_filepath}")
        # Add the updated nodes to the global list for indexing.
        all_nodes.extend(updated_nodes)
        print(f"Parsed {len(updated_nodes)} nodes from {doc_name}.")
    except Exception as e:
        print(f"Error parsing {doc_name}: {e}")
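The citation-header step can be tried without a LlamaParse key. The following sketch mirrors the loop above with plain strings and dictionaries standing in for parsed nodes; the function name and the sample page texts are hypothetical.

```python
def add_citation_headers(doc_name, page_texts):
    """Prefix each page's text with a "[document, page N]" header and attach metadata."""
    cited = []
    for page_number, text in enumerate(page_texts, start=1):
        header = f"[{doc_name}, page {page_number}]\n\n"
        cited.append({
            "text": header + text,
            "metadata": {"document_name": doc_name, "page_number": str(page_number)},
        })
    return cited

# Hypothetical page contents standing in for parsed Markdown
pages = ["# Large language model", "## Mixture of experts"]
for node in add_citation_headers("LLMs", pages):
    print(repr(node["text"]))
```

Because the header becomes part of the node text, it is embedded and retrieved together with the page content, so the document name and page number travel with the passage into the LLM's context.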

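Now we create a vector store and a query engine. We define a custom prompt template to guide the LLM's behavior when answering questions. Finally, we create a query engine that uses the index to answer questions. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity to the query. The LLM uses these retrieved nodes to generate the final answer.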

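##########################################
# Create the index and query engine
##########################################
from llama_index.core import PromptTemplate

# Create an index from all nodes.
index = VectorStoreIndex.from_documents(documents=all_nodes)

# Define a custom prompt template that restricts answers to the provided context.
prompt_template = """
You are an AI assistant with subject-matter expertise.
Answer the question using only the provided context.
Where needed, answer in well-formatted Markdown with bullet points and sections.
If the provided context does not support an answer, reply "I don't know."
Context:
{context_str}
Question:
{query_str}
Answer:
"""
# Create the query engine with the custom prompt.
# The template is passed as text_qa_template, the question-answering prompt of the query engine.
query_engine = index.as_query_engine(
    similarity_top_k=3,
    llm=llm,
    text_qa_template=PromptTemplate(prompt_template),
)
print("Combined index and query engine created successfully!")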
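What "retrieve the top 3 nodes by semantic similarity" means can be sketched with cosine similarity over toy vectors. This is a simplified illustration, not LlamaIndex's implementation: the three-dimensional vectors are made up, whereas a real embedding model such as BAAI/bge-small-en-v1.5 produces much longer ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, node_vecs, k=3):
    """Return the names of the k nodes most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in node_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings for four pages and one query
node_vecs = {
    "page 1": [0.9, 0.1, 0.0],
    "page 2": [0.7, 0.7, 0.1],
    "page 3": [0.0, 0.2, 0.9],
    "page 4": [0.1, 0.9, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, node_vecs, k=3))  # prints ['page 1', 'page 2', 'page 4']
```

Only the selected pages are placed into the prompt's context, so a relevant page that falls outside the top k never reaches the LLM.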

Now let's test a few queries and look at their trustworthiness scores.

query = "When is mixture of experts approach used?"
response = query_engine.query(query)
display_response(response)

Answer to the question "When is mixture of experts approach used?" (image by the author)

query = "How do you compare Deepseek model with OpenAI's models?"
response = query_engine.query(query)
display_response(response)

Answer to the question "How do you compare the Deepseek model with OpenAI's models?" (image by the author)
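One simple way to act on these scores is to route low-scoring answers to a person for checking. The sketch below assumes a cutoff of 0.8, which is an arbitrary illustrative value and not a recommendation from Cleanlab or the original article; the function name is hypothetical as well.

```python
def needs_human_review(trustworthiness_score, threshold=0.8):
    """Flag a response for manual checking when its score falls below the threshold."""
    if not 0.0 <= trustworthiness_score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return trustworthiness_score < threshold

# Scores taken from the two 'Abracadabra' outputs shown earlier in the article
for score in (0.6842228802750124, 0.9378276048845285):
    print(round(score, 2), "review" if needs_human_review(score) else "accept")
```

A suitable threshold depends on the application; a stricter cutoff sends more answers to review but lets fewer unreliable ones through.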

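In summary, assigning a trustworthiness score to an LLM's response, whether it is generated by direct inference or through RAG, helps gauge the reliability of AI output and prioritize human verification where needed. This is especially important in critical domains, where wrong or unreliable responses can have serious consequences.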

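About the translator

Zhu Xianzhong is a 51CTO community editor, a 51CTO expert blogger and lecturer, a computer science teacher at a university in Weifang, and a veteran of the freelance programming world.

Original title: How to Measure the Reliability of a Large Language Model's Response, by Umair Ali Khan

Editor: Hua Xuan | Source: 51CTO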
