
Is Traditional Chunking Dead? Agentic Chunking Fixes Semantic Fragmentation: RAG Accuracy Up 40% in Testing, a Must-Read for LLM Developers

Published 2025-2-24 09:40

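Recently a colleague working on an LLM project at my company came to me with a problem: a proper noun appeared many times in a document, yet RAG kept missing key information about it. After some digging, the culprit turned out to be the traditional chunking method: sentences that sat several pages apart but were closely related had been split up. I offered some general advice, such as using hybrid retrieval instead of semantic retrieval alone, or generating QA pairs from each chunk. He then asked whether a chunking technique could reduce this kind of problem. I suggested he try a recently proposed chunking strategy: Agentic Chunking.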

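Why does chunking matter so much?

In a RAG pipeline, text chunking is the first step and arguably the most critical one. Traditional methods such as recursive character splitting are simple to use, but they have an obvious weakness: they split according to a fixed token length, so a single topic can end up spread across different chunks, which breaks the continuity of the context.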
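To make the weakness concrete, here is a minimal sketch of fixed-length splitting. It is a simplification that cuts on character count only (real recursive splitters first try separators such as paragraph and sentence breaks), and the function name split_fixed and the sample text are illustrative assumptions.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = (
    "Transformers rely on self-attention. "
    "Self-attention lets every token attend to every other token."
)

# The length budget is applied without regard to topic or sentence boundaries.
pieces = split_fixed(text, chunk_size=50)
for i, piece in enumerate(pieces):
    print(i, repr(piece))
```

With a budget of 50 characters the second sentence is cut in the middle, so the explanation of self-attention is divided between two chunks.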

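Another common method is semantic splitting, which splits where it detects a semantic shift between sentences. It is smarter than recursive character splitting, but it has limits too. For example, when a document switches back and forth between topics, semantic splitting may place related content in different chunks, leaving the information disjointed.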
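The back-and-forth problem can be illustrated with a toy version of semantic splitting. This is only a sketch: word-overlap (Jaccard) similarity stands in for embedding similarity, and the threshold of 0.2 and the sample sentences are assumptions chosen for the illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used here as a stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks


sentences = [
    "the transformer architecture uses attention",
    "ticket prices rise in summer",
    "ticket prices drop in winter",
    "the transformer architecture relies on self attention",
]
chunks = semantic_split(sentences)
for chunk in chunks:
    print(chunk)
```

Because only adjacent sentences are compared, the two Transformer sentences land in separate chunks even though they cover the same topic.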

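Both methods break down in a scenario like this:

"Xiao Ming introduced the Transformer architecture... (five paragraphs of other content) ... Finally, he stressed that the core of the Transformer is the self-attention mechanism."

Traditional methods either put these two sentences in different chunks or let the intervening content disrupt the semantics. A person chunking by hand would naturally file both under a "model principles" group. This kind of association across textual distance is exactly what Agentic Chunking aims to capture.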

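How Agentic Chunking works

The core idea of Agentic Chunking is to have a large language model (LLM) actively evaluate each sentence and assign it to the most suitable chunk. Unlike traditional methods, Agentic Chunking does not depend on a fixed token length or on semantic shifts between adjacent sentences; it relies on the LLM's judgment to place sentences that are far apart in the document but topically related into the same group.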
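The assignment loop itself is simple and can be sketched without an LLM. In the sketch below, keyword_judge is a hypothetical stand-in for the model's decision (it picks the first group that shares at least two words with the sentence); the names agentic_group and Judge are likewise assumptions made for the illustration.

```python
from typing import Callable, Optional

# A judge sees the sentence and the current groups, and returns the index of
# the group to join, or None to open a new group.
Judge = Callable[[str, list[list[str]]], Optional[int]]


def agentic_group(sentences: list[str], judge: Judge) -> list[list[str]]:
    """Assign each sentence to an existing group or start a new one."""
    groups: list[list[str]] = []
    for sentence in sentences:
        idx = judge(sentence, groups)
        if idx is None:
            groups.append([sentence])
        else:
            groups[idx].append(sentence)
    return groups


def keyword_judge(sentence: str, groups: list[list[str]]) -> Optional[int]:
    """Stand-in for the LLM: first group sharing at least two words wins."""
    words = set(sentence.lower().split())
    for i, group in enumerate(groups):
        group_words = set(" ".join(group).lower().split())
        if len(words & group_words) >= 2:
            return i
    return None


sentences = [
    "the transformer architecture uses attention",
    "ticket prices rise in summer",
    "the transformer architecture relies on self attention",
]
groups = agentic_group(sentences, keyword_judge)
print(groups)
```

Since every sentence is compared against all existing groups rather than only its neighbor, the first and third sentences end up together despite the unrelated sentence between them.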

For example, suppose we have the following text:

On July 20, 1969, astronaut Neil Armstrong walked on the moon. He was leading the NASA’s Apollo 11 mission. Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.

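In Agentic Chunking, the LLM first applies propositioning to these sentences: each sentence is made self-contained, with its own explicit subject. The processed text looks like this: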

On July 20, 1969, astronaut Neil Armstrong walked on the moon.
Neil Armstrong was leading the NASA’s Apollo 11 mission.
Neil Armstrong famously said, “That’s one small step for man, one giant leap for mankind” as he stepped onto the lunar surface.

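The LLM can then examine each sentence on its own and assign it to the most suitable chunk.

Propositioning can be thought of as a sentence-level makeover for the document, ensuring that every sentence is independent and complete.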

How do you implement Agentic Chunking?

The key to implementing Agentic Chunking is propositioning plus the dynamic creation and updating of chunks. Tools such as Langchain and Pydantic can be used for this.

(Figure: flowchart of the Agentic Chunking process.)


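1. Propositioning the text

First, each sentence in the text needs to go through propositioning. We can use a prompt template available through Langchain and let the LLM do this automatically. Here is a simple code example: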

from typing import List

from langchain import hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

# Pull the propositioning prompt from the Langchain hub and set up the model.
obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model="gpt-4o")

# Schema for the structured output: a list of standalone sentences.
class Sentences(BaseModel):
    sentences: List[str]

extraction_llm = llm.with_structured_output(Sentences)
extraction_chain = obj | extraction_llm

# Returns a Sentences object; the propositions are in sentences.sentences.
sentences = extraction_chain.invoke(
    """
    On July 20, 1969, astronaut Neil Armstrong walked on the moon.
    He was leading the NASA's Apollo 11 mission.
    Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
    """
)

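2. Creating and updating chunks

Next, we need a function that dynamically creates and updates chunks. Each chunk holds propositions on a similar topic, and as new propositions are added, the chunk's title and summary are updated as well.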

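# ChunkMeta and chunks are used below but not defined in the original snippet;
# the definitions here are assumed.
class ChunkMeta(BaseModel):
    title: str
    summary: str

chunks = {}  # chunk_id -> {"summary", "title", "propositions"}

def create_new_chunk(chunk_id, proposition):
    # Ask the LLM for a title and summary, returned as a ChunkMeta object.
    summary_llm = llm.with_structured_output(ChunkMeta)
    summary_prompt_template = ChatPromptTemplate.from_messages([
        ("system", "Generate a new summary and a title based on the propositions."),
        ("user", "propositions:{propositions}"),
    ])
    summary_chain = summary_prompt_template | summary_llm
    chunk_meta = summary_chain.invoke({"propositions": [proposition]})
    # Register the new chunk with its first proposition.
    chunks[chunk_id] = {
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }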
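add_proposition is called later in the walkthrough but is not shown. A minimal sketch of what it might look like, assuming the chunk dictionary layout above, is given here. To keep it runnable without an API key, the LLM call is replaced by a summarize callback; both that parameter and fake_summarize are hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical signature: takes all propositions of a chunk and
# returns a (title, summary) pair.
Summarizer = Callable[[list[str]], tuple[str, str]]


def add_proposition(chunks: dict, chunk_id, proposition: str,
                    summarize: Summarizer) -> None:
    """Append a proposition to a chunk, then refresh its title and summary."""
    chunk = chunks[chunk_id]
    chunk["propositions"].append(proposition)
    chunk["title"], chunk["summary"] = summarize(chunk["propositions"])


def fake_summarize(propositions: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM summary call."""
    return f"{len(propositions)} propositions", " ".join(propositions)


chunks = {
    1: {
        "title": "1 propositions",
        "summary": "Wings of Time is a night show.",
        "propositions": ["Wings of Time is a night show."],
    }
}
add_proposition(chunks, 1, "Wings of Time is located at Beach Station.", fake_summarize)
print(chunks[1]["title"])
```

In a real pipeline the summarize callback would wrap the same summary chain used when a chunk is created, so that the title and summary keep pace with the chunk's contents.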

3. Pushing a proposition to the right chunk

Finally, we need an AI agent to decide which chunk a new proposition should be added to. If no suitable chunk exists, the agent creates a new one.

from pydantic import BaseModel, Field

def find_chunk_and_push_proposition(proposition):
    # Structured output schema: the LLM returns only the target chunk id.
    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")
    allocation_llm = llm.with_structured_output(ChunkID)
    allocation_prompt = ChatPromptTemplate.from_messages([
        ("system", "Find the chunk that best matches the proposition. If no chunk matches, return a new chunk id."),
        ("user", "proposition:{proposition} chunks_summaries:{chunks_summaries}"),
    ])
    allocation_chain = allocation_prompt | allocation_llm
    # Only the summaries are sent, not every proposition in every chunk.
    chunks_summaries = {chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()}
    best_chunk_id = allocation_chain.invoke({"proposition": proposition, "chunks_summaries": chunks_summaries}).chunk_id
    if best_chunk_id not in chunks:
        # Unknown id: the LLM asked for a new chunk.
        create_new_chunk(best_chunk_id, proposition)
    else:
        # add_proposition (not shown in the original) appends to an existing chunk.
        add_proposition(best_chunk_id, proposition)
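The snippet above trusts the LLM to invent an unused id when no chunk matches. A more defensive variant, sketched here as an assumption rather than as the original author's code, treats any id the model returns that is not already in chunks as a request for a new chunk and generates the id locally. The helper name resolve_chunk_id and the 5-character hex ids are illustrative choices.

```python
import uuid


def resolve_chunk_id(returned_id, chunks: dict) -> tuple:
    """Return (chunk_id, is_new); unknown ids are replaced by a fresh local id."""
    if returned_id in chunks:
        return returned_id, False
    # Generate a short id locally and retry on the rare collision.
    new_id = uuid.uuid4().hex[:5]
    while new_id in chunks:
        new_id = uuid.uuid4().hex[:5]
    return new_id, True


chunks = {"a641f": {"title": "Show info", "summary": "...", "propositions": []}}

existing_id, existing_is_new = resolve_chunk_id("a641f", chunks)
fresh_id, fresh_is_new = resolve_chunk_id(7, chunks)
print(existing_id, existing_is_new)
print(len(fresh_id), fresh_is_new)
```

The caller would then branch on is_new to decide between creating a chunk and appending to one, which keeps id allocation out of the model's hands.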

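How does it perform in practice?

I chose the introductory text for Wings of Time, a well-known attraction on Sentosa in Singapore, as the test input and processed it with a GPT-4 model. The text covers the attraction description, ticketing information, opening hours, and more, which makes it a good test sample.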

Product Name: Wings of Time

Product Description: Wings of Time is one of Sentosa's most breathtaking attractions, combining water, laser, fire, and music to create a mesmerizing night show about friendship and courage. Situated on the scenic  (https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/) Siloso Beach , this award-winning spectacle is staged nightly, promising an unforgettable experience for visitors of all ages. Be wowed by spellbinding laser, fire, and water effects set to a majestic soundtrack, complete with a jaw-dropping fireworks display. A fitting end to your day out at Sentosa, it’s possibly the only place in Singapore where you can witness such an awe-inspiring performance.  Get ready for an even better experience starting 1 February 2025 ! Wings of Time Fireworks Symphony, Singapore’s only daily fireworks show, now features a fireworks display that is four times longer!   Important Note: Please visit  (https://www.sentosa.com.sg/sentosa-reservation) here if you need to change your visit date. All changes must be made at least 1 day prior to the visit date.

Product Category: Shows

Product Type: Attraction

Keywords: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets

Meta Description: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.


Product Tags: Family Fun,Popular experiences,Frequently Bought

Locations: Beach Station

[Tickets]

Name: Wings of Time (Std)
Terms: • All Wings of Time (WOT) Open-Dated tickets require prior redemption at Singapore Cable Car Ticketing counters and are subjected to seats availability on a first come first serve basis. • This is a rain or shine event. Tickets are non-exchangeable or nonrefundable under any circumstances. • Once timeslot is confirmed, no further amendments are allowed. Please proceed to WOT admission gates to scan your issued QR code via mobile or physical printout for admission. • Gates will open 15 minutes prior to the start of the show. • Show Duration: 20 minutes per show. • Please be punctual for your booked time slot. • Admission will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host. • Standard seats are applicable to guest aged 4 years and above. • No outside Food & Drinks are allowed. • Refer to https://www.mountfaberleisure.com/attraction/wings-of-time/ for more information on Wings of Time.
Pax Type: Standard
Promotion A: Enjoy $1.90 off when you purchase online! Discount will automatically be applied upon checkout.
Price: 19





Opening Hours: Daily  Show 1: 7.40pm  Show 2: 8.40pm




Accessibilities: Wheelchair



[Information]

Title: Terms & Conditions
Description: For more information, click  (https://www.sentosa.com.sg/en/promotional-general-store-terms-and-conditions) here for Terms & Conditions


Title: Getting Here
Description: By Sentosa Express: Alight at Beach Station  By Public Bus: Board Bus 123 and alight at Beach Station  By Intra-Island Bus: Board Sentosa Bus A or B and alight at Beach Station     Nearest Car Park   Beach Station Car Park


Title: Contact Us
Description: Beach Station  +65 6361 0088   (mailto:guestrelations@mflg.com.sg) guestrelations@mflg.com.sg

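The system first converted the source text into more than 50 standalone statements (propositions). Interestingly, in the process it anchored nearly every sentence to "Wings of Time" as the subject, which suggests the model had a firm grasp of the text's main topic.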

[
    "Wings of Time is one of Sentosa's most breathtaking attractions.",
    'Wings of Time combines water, laser, fire, and music to create a mesmerizing night show.',
    'The night show of Wings of Time is about friendship and courage.',
    'Wings of Time is situated on the scenic Siloso Beach.',
    'Wings of Time is an award-winning spectacle staged nightly.',
    'Wings of Time promises an unforgettable experience for visitors of all ages.',
    'Wings of Time features spellbinding laser, fire, and water effects set to a majestic soundtrack.',
    'Wings of Time includes a jaw-dropping fireworks display.',
    'Wings of Time is a fitting end to a day out at Sentosa.',
    'Wings of Time is possibly the only place in Singapore where such an awe-inspiring performance can be witnessed.',
    'Wings of Time will offer an even better experience starting 1 February 2025.',
    'Wings of Time Fireworks Symphony is Singapore’s only daily fireworks show.',
    'Wings of Time Fireworks Symphony now features a fireworks display that is four times longer.',
    'Visitors should visit the provided link if they need to change their visit date to Wings of Time.',
    'All changes to the visit date must be made at least 1 day prior to the visit date.',
    'Wings of Time is categorized as a show.',
    'Wings of Time is a type of attraction.',
    'Keywords for Wings of Time include: Wings of Time, Sentosa night show, Sentosa attractions, laser show Sentosa, water show Singapore, Sentosa events, family activities Sentosa, Singapore night shows, outdoor night show Sentosa, book Wings of Time tickets.',
    'The meta description for Wings of Time is: Experience Wings of Time at Sentosa! A breathtaking night show featuring water, laser, and fire effects. Perfect for a memorable evening.',
    'Product tags for Wings of Time include: Family Fun, Popular experiences, Frequently Bought.',
    'Wings of Time is located at Beach Station.',
    'Wings of Time (Std) tickets require prior redemption at Singapore Cable Car Ticketing counters.',
    'Wings of Time (Std) tickets are subjected to seats availability on a first come first serve basis.',
    'Wings of Time is a rain or shine event.',
    'Tickets for Wings of Time are non-exchangeable or nonrefundable under any circumstances.',
    'Once the timeslot for Wings of Time is confirmed, no further amendments are allowed.',
    'Visitors should proceed to Wings of Time admission gates to scan their issued QR code via mobile or physical printout for admission.',
    'Gates for Wings of Time will open 15 minutes prior to the start of the show.',
    'The show duration for Wings of Time is 20 minutes per show.',
    'Visitors should be punctual for their booked time slot for Wings of Time.',
    'Admission to Wings of Time will be on a first come first serve basis within the allocated timeslot or at the discretion of the attraction host.',
    'Standard seats for Wings of Time are applicable to guests aged 4 years and above.',
    'No outside food and drinks are allowed at Wings of Time.',
    'More information on Wings of Time can be found at the provided link.',
    'The pax type for Wings of Time is Standard.',
    'Promotion A for Wings of Time offers $1.90 off when purchased online.',
    'The discount for Promotion A will automatically be applied upon checkout.',
    'The price for Wings of Time is 19.',
    'Wings of Time has opening hours daily with Show 1 at 7.40pm and Show 2 at 8.40pm.',
    'Wings of Time is accessible by wheelchair.',
    "The title for terms and conditions is 'Terms & Conditions'.",
    'More information on terms and conditions can be found at the provided link.',
    "The title for getting to Wings of Time is 'Getting Here'.",
    'Visitors can get to Wings of Time by Sentosa Express by alighting at Beach Station.',
    'Visitors can get to Wings of Time by Public Bus by boarding Bus 123 and alighting at Beach Station.',
    'Visitors can get to Wings of Time by Intra-Island Bus by boarding Sentosa Bus A or B and alighting at Beach Station.',
    'The nearest car park to Wings of Time is Beach Station Car Park.',
    "The title for contacting Wings of Time is 'Contact Us'.",
    'The contact location for Wings of Time is Beach Station.',
    'The contact phone number for Wings of Time is +65 6361 0088.',
    'The contact email for Wings of Time is guestrelations@mflg.com.sg.']

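After agentic chunking, the text was divided into four main parts:

  1. Main information chunk: the core description of Wings of Time, its features, location, and other general information
  2. Scheduling policy chunk: information about changes to bookings
  3. Pricing and discount chunk: discounts and payment-related content
  4. Legal terms chunk: the terms and conditions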

Chunk (a641f): Sentosa's Wings of Time Show & Visitor Information
Summary: This chunk contains comprehensive details about the Wings of Time attraction in Sentosa, including its features, themes, location, visitor experience, ticketing and admission procedures, future enhancements, promotions, classification as a show and attraction, unique fireworks display, daily show schedule, accessibility options, importance of punctuality and ticket redemption, extended fireworks display in the Fireworks Symphony, transportation options to reach the venue, and the necessity of adhering to non-exchangeable ticket policies, with a focus on the standard ticketing process and visitor guidelines, and the recent update on the extended fireworks display, as well as the contact information and accessibility details, and the new experience starting February 2025.

Chunk (ae2b8): Scheduling Policies
Summary: This chunk contains information about policies regarding changes to scheduled dates and times.

Chunk (dadbb): Retail & Discounts
Summary: This chunk contains information about the application of discounts during the checkout process.

Chunk (3347c): Legal Terms & Conditions
Summary: This chunk contains information about terms and conditions, including their titles and where to find more information.

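After chunking this way, each chunk has a clear topic with no overlap, the important information comes first, and supporting information is filed by category. Keeping related information together like this should also help the vector store's recall, and in turn RAG accuracy.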
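Before the chunks go into a vector store, each one has to be turned into a piece of text to embed. One possible way, sketched below, is to concatenate the title, the summary, and the propositions. The function name chunk_to_document, the output dict layout, and the assignment of the sample proposition to this chunk are assumptions for illustration, not something the original run shows.

```python
def chunk_to_document(chunk_id: str, chunk: dict) -> dict:
    """Flatten a chunk into one text field plus metadata for indexing."""
    text = "\n".join([chunk["title"], chunk["summary"], *chunk["propositions"]])
    return {"id": chunk_id, "text": text, "metadata": {"title": chunk["title"]}}


chunk = {
    "title": "Scheduling Policies",
    "summary": "This chunk contains information about policies regarding "
               "changes to scheduled dates and times.",
    # Assumed membership: the source does not list which propositions
    # ended up in which chunk.
    "propositions": [
        "All changes to the visit date must be made at least 1 day prior to the visit date.",
    ],
}
doc = chunk_to_document("ae2b8", chunk)
print(doc["text"])
```

Putting the title and summary in front of the propositions means the embedded text carries the chunk's topic even when an individual proposition is terse.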

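Summary

Agentic Chunking is a powerful chunking technique: it groups sentences that are far apart in a document but topically related, which improves RAG results. The trade-off is relatively high cost and latency. After my colleague tried Agentic chunking, he reported a 40% gain in accuracy, but at 3 times the cost. So when should we use Agentic chunking?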

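In my project experience, the following scenarios are a particularly good fit:

  • Unstructured text (such as customer service conversation logs)
  • Content that jumps back and forth between topics (transcripts of tech meetups)
  • QA systems that need to connect information across paragraphs

For clearly structured documents such as papers and manuals, traditional chunking and semantic chunking remain the cost-effective choice.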
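The cost side of this trade-off follows from how many LLM calls the pipeline makes. As a back-of-the-envelope sketch, assume one propositioning call per passage, one allocation call per proposition, and one title/summary call per proposition (whether it opens a new chunk or updates an existing one; the update call is an assumption, since add_proposition is not shown). The function name estimate_llm_calls is hypothetical. For the 51 propositions listed in the test above, processed as a single passage:

```python
def estimate_llm_calls(num_passages: int, num_propositions: int) -> int:
    """Rough call count for the propositioning + allocation + summary pipeline."""
    propositioning = num_passages    # one extraction call per passage
    allocation = num_propositions    # one find-the-chunk call per proposition
    summarizing = num_propositions   # one title/summary call per proposition
    return propositioning + allocation + summarizing


total_calls = estimate_llm_calls(num_passages=1, num_propositions=51)
print(total_calls)  # 103
```

The count grows linearly with the number of propositions, whereas recursive character splitting needs no LLM calls at all, which is why the method pays off mainly on text where cross-paragraph grouping matters.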


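This article is reposted from the WeChat public account AI 博物院. Author: longyunfeigu

Original link: https://mp.weixin.qq.com/s/NyDnQCvq_cpCz_SwWivewQ

© Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal action may be taken.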