Advanced RAG 09: A Survey of Prompt Compression Techniques

Published on 2024-6-29 11:08

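Editor's note: How can we make the most of LLMs while keeping inference costs under control? This is currently an active research topic in the industry.
To address this question, this issue features a survey article on prompt compression. As the author puts it, the core goal of prompt compression is to compress the context fed to LLMs, removing non-essential content while preserving the semantic core, so that inference costs fall without hurting model performance.
The article covers the principles and implementation details of several prompt compression algorithms, including Selective Context (based on information entropy), AutoCompressor (based on soft prompt tuning), LLMLingua-2 (which introduces data distillation), and LongLLMLingua (which makes use of the question's semantics). The author also includes code examples so that readers can try the algorithms hands-on and deepen their understanding.
Have you struggled with long prompts or been held back by high inference costs? Follow along with this article to learn about prompt compression; one of these algorithms may give you the idea you need to get past a bottleneck.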

Author | Florian June

Translator | Yue Yang (岳揚)

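RAG approaches may face two major challenges:

LLMs usually have a context length limit, and as the input grows longer, processing becomes both slower and more expensive.
The retrieved contexts are not always useful. Sometimes only a small part of a retrieved chunk helps answer the question, and in some cases answering a specific question requires combining information from several chunks. Even with re-ranking, this problem remains.

Prompt compression for LLMs was proposed to address these problems. In essence, it aims to distill the key information in the prompt so that every input token carries more value, which improves model efficiency while keeping costs under control. This idea is illustrated in the bottom right of Figure 1.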

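Figure 1: Prompt compression in the RAG architecture (bottom right). As the purple dashed line shows, some compressors can be applied directly to the retrieved contexts. Figure by the author.

As the purple dashed line in Figure 1 indicates, some compression methods operate directly on the contexts returned by the retriever, before they reach the LLM.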

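Overall, prompt compression methods fall into four main categories:

Methods based on information entropy, such as Selective Context[1], LLMLingua[2], and LongLLMLingua[3]. These methods use a small language model to compute the self-information or perplexity of each token in the original prompt, and then remove the tokens with lower perplexity to achieve compression. (Translator's note: self-information, also called surprisal or information content, is a core concept in information theory that quantifies how much information an event conveys.)
Methods based on soft prompt tuning, such as AutoCompressor[4] and GIST[5]. These methods require fine-tuning the LLM's parameters to fit a specific domain, so they cannot be applied directly to black-box LLMs. (Translator's note: soft prompt tuning does not modify the model's weights directly. Instead, it introduces a set of learnable continuous vectors, usually called "soft prompts", which lets the model adapt to different downstream tasks without changing its core structure while retaining the general knowledge learned during pre-training.)
Methods that first perform data distillation and then train a model to generate more interpretable text summaries. These methods transfer across language models and can be applied to black-box LLMs without gradient updates. Representative methods include LLMLingua-2[6] and RECOMP[7].
Methods based on token merging or token pruning, such as ToMe[8] and AdapLeR[9]. These methods usually require fine-tuning the model or generating intermediate results during inference.

Since the fourth category was originally proposed for smaller models such as ViT or BERT, this article focuses on the principles of representative algorithms from the first three categories.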

01 Selective Context

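1.1 Insight

Figure 2 shows that an LLM can respond to a user's query even without the full context or conversation history. Even when some relevant details are omitted, the LLM still gives the answer the user expects. This may be because the LLM can infer the missing information from the remaining context and from knowledge acquired during pre-training.

Figure 2: An LLM can still answer correctly after some non-essential information is removed. Source: Selective Context[1]

This suggests that we can shorten the context by filtering out non-essential information without hurting overall performance. That is the key idea behind Selective Context.

Selective Context uses a small language model (SLM) to compute the self-information of each lexical unit (such as a sentence, phrase, or token) in a given context, and then uses these values to assess how informative each unit is. By keeping only the content with higher self-information, Selective Context gives the LLM a more concise and efficient context representation (Translator's note: the context after it has been turned into a machine-processable form). The aim is to do this without degrading the LLM's performance across tasks.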

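1.2 Self-Information

Selective Context uses self-information to measure the value of content.

Self-information, also known as surprisal or information content, is a core concept in information theory. It quantifies the amount of information conveyed by an event. Specifically, it is the negative logarithm of the probability of a token:

$$I(x) = -\log P(x)$$

Here, I(x) is the self-information of token x, and P(x) is the probability of that token.

In information theory, self-information reflects how surprising or uncertain an event is. Rare events carry more new information and therefore have higher self-information, whereas frequent events carry little new information and have lower self-information.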
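To make the definition concrete, here is a minimal sketch that evaluates the formula for a few tokens. The probabilities are made up for illustration, and the logarithm is taken in base 2 so that the result is in bits (the library code shown later in this article uses the natural logarithm instead, which only changes the unit).

```python
import math


def self_information(probability: float) -> float:
    """Return the self-information, in bits, of an event with this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


# Hypothetical next-token probabilities, chosen only for illustration.
examples = {"the": 0.5, "model": 0.125, "Rontgen": 2 ** -10}
for token, p in examples.items():
    print(f"{token!r}: P = {p:.6f}, I = {self_information(p):.1f} bits")
```

The rarer the token, the larger its self-information, which is why a filter based on this quantity tends to keep rare, content-bearing tokens and drop predictable ones.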

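1.3 Algorithm

To explain the underlying principle more concretely, let's walk through the source code.

The first step is to set up the environment: install the required Python library and download the spaCy model.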

(base) Florian:~ Florian$ conda create -n "selective_context" python=3.10 
(base) Florian:~ Florian$ conda activate selective_context
(selective_context) Florian:~ Florian$ pip install selective-context
(selective_context) Florian:~ Florian$ python -m spacy download en_core_web_sm

After installation, the version is:

(selective_context) Florian:~ Florian$ pip list | grep selective
selective-context   0.1.4

The test code is as follows:

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt2', lang='en')
text = "INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .]. Ideal CL models in the real world should be deal with domain shifts , researchers have recently started to sample tasks from two different datasets . For instance , proposed to train and evaluate a model on Imagenet first and then challenge its performance on the Places365 dataset . considers more scenarios , starting with Imagenet or Places365 , and then moving on to the VOC/CUB/Scenes datasets. Few works propose more advanced scenarios built on top of more than two datasets."
context, reduced_content = sc(text)

# We can also adjust the reduce ratio
# context_ratio, reduced_content_ratio = sc(text, reduce_ratio = 0.5)

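On first run, the GPT-2 model is downloaded automatically; it is about 500 MB. Figure 3 shows the output of the test code.

Figure 3: Output of the Selective Context test code. Screenshot by the author.

Next, let's look at the sc(text) call. Its implementation[10] is as follows: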

class SelectiveContext:
    ...
    ...
    def __call__(self, text: str, reduce_ratio: float = 0.35, reduce_level: str = 'phrase') -> List[str]:
        context = self.beautify_context(text)

        self.mask_ratio = reduce_ratio

        sents = [sent.strip() for sent in re.split(self.sent_tokenize_pattern, context) if sent.strip()]

        # You want the reduce happen at sentence level, phrase level, or token level?
        assert reduce_level in ['sent', 'phrase', 'token'], f"reduce_level should be one of ['sent', 'phrase', 'token'], got {reduce_level}"
        sent_lus, phrase_lus, token_lus = self._lexical_unit(sents)
        lexical_level = {
            'sent': sent_lus,
            'phrase': phrase_lus,
            'token': token_lus
        }

        # context is the reduced context, masked_sents denotes what context has been filtered out
        context, masked_sents = self.self_info_mask(lexical_level[reduce_level].text, lexical_level[reduce_level].self_info, reduce_level)
        return context, masked_sents

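The code performs three main steps:

First, compute the self-information of every token in the context.
Next, merge tokens and their self-information into lexical units (such as phrases or sentences).
Finally, selectively retain the informative context.

Step 1: Computing self-information

Given a context C = x_0, x_1, …, x_n, where each x_i is a token, we can use a causal language model (such as GPT-2, OPT, or LLaMA) to compute the self-information of each token x_i:

$$I(x_i) = -\log P(x_i \mid x_0, x_1, \ldots, x_{i-1})$$

If you use GPT-2, the corresponding code[11] is: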

class SelectiveContext:
    ...
    ...
    def _get_self_info_via_gpt2(self, text: str) -> Tuple[List[str], List[float]]:
        if self.lang == 'en':
            text = f"<|endoftext|>{text}"
        elif self.lang == 'zh':
            text = f"[CLS]{text}"
        with torch.no_grad():
            encoding = self.tokenizer(text, add_special_tokens=False, return_tensors='pt')
            encoding = encoding.to(self.device)
            outputs = self.model(**encoding)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1)
            self_info = -torch.log(probs)

        input_ids = encoding['input_ids']
        input_ids_expaned = input_ids[:, 1:].unsqueeze(-1)
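
The excerpt above stops right after the token ids are shifted by one position. As a rough illustration of what that shift is for, here is a dependency-free sketch with plain lists instead of tensors. It is not the library's code: `softmax`, `token_self_info`, the three-word vocabulary, and the logits are all made up, and the only point is that the distribution produced at position i is used to score the token that actually appears at position i + 1.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def token_self_info(logits_per_position: List[List[float]], input_ids: List[int]) -> List[float]:
    """Self-information (in nats) of every token after the first one.

    logits_per_position[i] is assumed to hold the model's scores for the token
    at position i + 1, as in a causal language model.
    """
    infos = []
    for position, next_id in enumerate(input_ids[1:]):
        probs = softmax(logits_per_position[position])
        infos.append(-math.log(probs[next_id]))
    return infos


# Toy setup: a 3-word vocabulary and hand-written logits (hypothetical values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
ids = [0, 0, 2]  # the first id stands in for the start-of-text token
print(token_self_info(logits, ids))
```

The last row of logits predicts a token beyond the end of the input, so it is never used; the first token has no prediction because nothing precedes it.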

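Step 2: Merging into lexical units

Performing selective context filtering only at the token level can make the resulting context incoherent. For example, the number "2009" in the original prompt might become "209" after compression. (Translator's note: selective context filtering means identifying and keeping the information most relevant to the current task or query while filtering out less relevant or redundant parts.)

Therefore, in addition to filtering at the token level, it is important to filter at the phrase and sentence levels as well. The basic unit of filtering, the lexical unit, can be a token, a phrase, or a sentence.

How do we compute the self-information of a lexical unit u = (x_t, …, x_{t+α})? Because self-information is additive, we can sum the self-information of the tokens that make up u:

$$I(u) = \sum_{i=t}^{t+\alpha} I(x_i)$$

The implementation[12] is shown below. To make debugging easier, I have added the debugger output for some variables as comments: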

class SelectiveContext:
    ...
    ...
    def _lexical_unit(self, sents):

        if self.sent_level_self_info:
            sent_self_info = []
            all_noun_phrases = []
            all_noun_phrases_info = []
            all_tokens = []
            all_token_self_info = []

            for sent in sents:
                # print(sent)
                tokens, self_info = self.get_self_information(sent)
                '''
                ipdb> sent
                'INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .].'

                ipdb> tokens
                ['IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lif', 'elong', ' Learning', ',', ' is', ' a', ' promising', ' learning', ' paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple', ' tasks', ' across', ' different', ' environments', ' over', ' their', ' lifetime', ' [', 'To', ' uniform', ' the', ' language', ' and', ' enhance', ' the', ' read', 'ability', ' of', ' the', ' paper', ' we', ' adopt', ' the', ' unique', ' term', ' continual', ' learning', ' (', ' CL', ' )', '.', '].']

                ipdb> self_info
                [7.514791011810303, 1.632637619972229, 0.024813441559672356, 0.006853647995740175, 12.09920597076416, 2.1144468784332275, 9.457701683044434, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 10.071824073791504, 0.6905602216720581, 0.01698811538517475, 1.5882389545440674, 0.4495090842247009, 0.45371606945991516, 6.932497978210449, 6.087430477142334, 3.66465425491333, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 4.6389899253845215, 0.33642446994781494, 4.918881416320801, 2.076707601547241, 3.3553669452667236, 5.5081071853637695, 5.625778675079346, 0.7966060638427734, 6.347291946411133, 12.772034645080566, 13.792041778564453, 4.11267614364624, 6.583715915679932, 3.3618998527526855, 8.434362411499023, 1.2423189878463745, 5.8330583572387695, 0.0013973338063806295, 0.3090735077857971, 1.1139129400253296, 4.160390853881836, 3.744772434234619, 7.2841596603393555, 1.4088190793991089, 7.86871337890625, 4.305004596710205, 9.69282341003418, 0.08665203303098679, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 6.892032623291016]
                '''
                sent_self_info.append(np.mean(self_info))

                all_tokens.extend(tokens)
                all_token_self_info.extend(self_info)

                noun_phrases, noun_phrases_info = self._calculate_lexical_unit(tokens, self_info)
                '''
                ipdb> noun_phrases
                ['INTRODUCTION Continual Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lifelong Learning', ',', ' is', ' a promising learning paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple tasks', ' across', ' different environments', ' over', ' their lifetime', ' [', 'To', ' uniform', ' the language', ' and', ' enhance', ' the readability', ' of', ' the paper', ' we', ' adopt', ' the unique term continual learning', ' (', ' CL', ' )', '.', ']', '.']
 
                ipdb> noun_phrases_info
                [4.692921464797109, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 3.5931241369495788, 1.5882389545440674, 0.4495090842247009, 4.284574694931507, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 2.487707197666168, 4.918881416320801, 2.7160372734069824, 5.5081071853637695, 3.2111923694610596, 6.347291946411133, 12.772034645080566, 13.792041778564453, 5.348196029663086, 3.3618998527526855, 8.434362411499023, 2.3589248929638416, 0.3090735077857971, 2.6371518969535828, 3.744772434234619, 7.2841596603393555, 4.672402499616146, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 3.446016311645508, 3.446016311645508]
                '''

                # We need to add a space before the first noun phrase for every sentence except the first one
                if all_noun_phrases:
                    noun_phrases[0] = f" {noun_phrases[0]}"
                all_noun_phrases.extend(noun_phrases)
                all_noun_phrases_info.extend(noun_phrases_info)

            return [
                LexicalUnits('sent', text=sents, self_info=sent_self_info),
                LexicalUnits('phrase', text=all_noun_phrases, self_info=all_noun_phrases_info),
                LexicalUnits('token', text=all_tokens, self_info=all_token_self_info)
            ]
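
One detail is worth checking against the debugger output in the listing. The text describes summing token self-information over a lexical unit, and the sentence-level score in the code uses np.mean. The short check below takes the first seven token values from the listing, which make up the phrase 'INTRODUCTION Continual Learning', and compares their sum and their mean with the first entry of noun_phrases_info. It only inspects the numbers printed above; it does not call the library.

```python
from statistics import fmean

# Self-information of 'IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning',
# copied from the debugger output in the listing above.
token_info = [
    7.514791011810303,
    1.632637619972229,
    0.024813441559672356,
    0.006853647995740175,
    12.09920597076416,
    2.1144468784332275,
    9.457701683044434,
]

# First entry of noun_phrases_info in the same debugger output.
reported_phrase_info = 4.692921464797109

phrase_sum = sum(token_info)
phrase_mean = fmean(token_info)
print(f"sum = {phrase_sum:.4f}, mean = {phrase_mean:.4f}, reported = {reported_phrase_info:.4f}")
```

For this phrase the reported value matches the mean rather than the sum, so the printed phrase scores appear to be length-normalized. That keeps long phrases from outranking short ones purely because they contain more tokens.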

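Step 3: Selectively retaining informative context

Once the self-information of each lexical unit has been computed, the question is how to decide which units are informative enough to keep. The paper proposes a percentile-based filtering strategy that adaptively selects the most informative content, which is more flexible than using a fixed threshold or keeping a fixed top-k set of units.

The procedure is to sort the lexical units by self-information in descending order, compute the p-th percentile of the self-information values across all lexical units, and then keep only the units whose self-information is greater than or equal to that percentile. (Translator's note: the p-th percentile is the value below which p% of the data falls. For example, a student whose math score is at the 90th percentile scored at least as high as 90% of the class.)

The relevant code[13] is as follows: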

class SelectiveContext:
    ...
    ...

    def self_info_mask(self, sents: List[str], self_info: List[float], mask_level):
        # mask_level: mask sentences, phrases, or tokens
        sents_after_mask = []
        masked_sents = []

        self.ppl_threshold = np.nanpercentile(self_info, self.mask_ratio * 100)

        # if title is not None:
        #     with open(os.path.join(self.path, title+'_prob_token.tsv'), 'w', encoding='utf-8') as f:
        #         for token, info in zip(tokens, self_info):
        #             f.write(f"{token}\t{info}\n")
        #     with open(os.path.join(self.path, title+'_prob_sent.tsv'), 'w', encoding='utf-8') as f:
        #         for sent, info in zip(sents, sent_self_info):
        #             f.write(f"{sent}\n{info}\n\n")

        for sent, info in zip(sents, self_info):
            if info < self.ppl_threshold:
                masked_sents.append(sent)
                sents_after_mask.append(self.mask_a_sent(sent, mask_level))
            else:
                sents_after_mask.append(sent)
        masked_context = " ".join(sents_after_mask) if mask_level == 'sent' else "".join(sents_after_mask)

        return masked_context, masked_sents
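
To see the percentile filter in isolation, here is a small standard-library sketch. It is a simplification rather than the library's implementation: `percentile` and `filter_units` are names introduced here, the percentile uses linear interpolation between neighboring ranks, dropped units are simply removed instead of being passed through mask_a_sent, and the five scores are rounded values loosely based on the debugger output shown earlier.

```python
from typing import List, Tuple


def percentile(values: List[float], q: float) -> float:
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def filter_units(units: List[str], self_info: List[float], reduce_ratio: float) -> Tuple[List[str], List[str]]:
    """Keep units at or above the percentile threshold; return (kept, dropped)."""
    threshold = percentile(self_info, reduce_ratio * 100)
    kept = [unit for unit, info in zip(units, self_info) if info >= threshold]
    dropped = [unit for unit, info in zip(units, self_info) if info < threshold]
    return kept, dropped


# Illustrative phrases and rounded scores.
units = [" a promising learning paradigm", " to", " design", " models", " that"]
info = [4.28, 3.40, 7.34, 5.88, 1.73]
kept, dropped = filter_units(units, info, reduce_ratio=0.4)
print("kept:   ", "".join(kept))
print("dropped:", dropped)
```

Because the threshold is derived from the distribution of scores in the current text, the same reduce_ratio removes roughly the same share of units whether the text is dense or repetitive, which is the flexibility the paper argues for over a fixed cutoff.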

02 LLMLingua

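2.1 Overview

LLMLingua[2] argues that Selective Context[1] often ignores the interdependence among compressed contents, as well as the relationship between the target LLM and the small language model used for prompt compression. LLMLingua addresses both issues.

Specifically, as shown in Figure 4, LLMLingua uses a budget controller to dynamically allocate different compression ratios to the components of the original prompt (the instruction, the demonstrations, and the question). It also performs coarse-grained, demonstration-level compression to preserve semantic integrity even at high compression ratios. (Translator's note: demonstration-level compression keeps or drops whole demonstrations rather than processing smaller parts such as words or phrases.) In addition, LLMLingua introduces a token-level iterative algorithm for fine-grained prompt compression.

Figure 4: Overview of the LLMLingua framework. Source: LLMLingua[2]

Compared with Selective Context, LLMLingua preserves the key information in the prompt more effectively while taking the conditional dependencies between tokens into account, and it can compress prompts by up to 20x.

2.2 Budget controller

The budget controller is a key component of LLMLingua. It dynamically allocates different compression ratios to the different parts of the original prompt.

Different parts of the prompt differ in their sensitivity to compression. For example, the question needs to keep a high information density, whereas the demonstrations can be compressed more. The budget controller therefore applies a lower compression ratio to the instruction and the question so that the core information is preserved, and a higher compression ratio to the demonstrations to remove redundant information.

The budget controller algorithm is given in Figure 5.

Figure 5: The budget controller algorithm. Source: LLMLingua[2]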

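The main variables are:

M_s: a small language model, such as GPT-2 or LLaMA.
x = (x^ins, x^dems, x^que): the original prompt, consisting of the instruction, the demonstrations, and the question.
L, L_ins, L_dems, and L_que: the number of tokens in x, x^ins, x^dems, and x^que, respectively.
τ_dems: the compression ratio for the demonstrations, determined from the overall target compression ratio τ and the predefined ratios τ_ins and τ_que for the instruction and the question.
D: the set that will hold the demonstrations kept after compression.

The main steps are:

Compute the compression ratio for the demonstrations.
Use a small language model (such as GPT-2 or LLaMA) to compute the perplexity of each demonstration in the original demonstration set.
Sort the demonstrations by perplexity in descending order.
Iteratively select demonstrations and add them to D.
After the demonstrations have been compressed, allocate the remaining budget to the instruction and the question. (Translator's note: under the overall compression-ratio constraint, the algorithm first makes sure the demonstrations are sufficiently compressed, and then spends whatever budget is left on the instruction and the question, balancing information retention against resource use.)
Output the set D after coarse-grained compression.

Through demonstration-level compression, the budget controller keeps the key information while reducing the prompt size. This is particularly suited to prompts that contain many demonstrations.

The corresponding code is in the control_context_budget function[14].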
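
The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code of control_context_budget: `Demo`, `demo_ratio`, and `select_demos` are names introduced here, each ratio τ is read as the fraction of tokens to keep, the demonstration ratio is derived by subtracting the tokens reserved for the instruction and the question from the overall target, the scaling factor k is a hypothetical knob, and the perplexities are invented numbers standing in for a small language model's output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    name: str
    n_tokens: int
    perplexity: float  # assumed to come from a small language model


def demo_ratio(tau: float, tau_ins: float, tau_que: float,
               len_ins: int, len_dems: int, len_que: int) -> float:
    """Fraction of demonstration tokens that may be kept, given the overall
    target tau and the fractions reserved for instruction and question."""
    total = len_ins + len_dems + len_que
    return (tau * total - (tau_ins * len_ins + tau_que * len_que)) / len_dems


def select_demos(demos: List[Demo], tau_dems: float, k: float = 1.0) -> List[Demo]:
    """Take demonstrations in descending perplexity order while they fit the budget."""
    budget = k * tau_dems * sum(demo.n_tokens for demo in demos)
    selected, used = [], 0
    for demo in sorted(demos, key=lambda d: d.perplexity, reverse=True):
        if used + demo.n_tokens > budget:
            break
        selected.append(demo)
        used += demo.n_tokens
    return selected


# Hypothetical prompt: 20 instruction tokens, 30 question tokens, 450 demo tokens.
demos = [Demo("A", 100, 12.0), Demo("B", 150, 30.0), Demo("C", 200, 8.0)]
tau_dems = demo_ratio(tau=0.4, tau_ins=0.9, tau_que=0.9,
                      len_ins=20, len_dems=450, len_que=30)
selected = select_demos(demos, tau_dems)
print(f"tau_dems = {tau_dems:.3f}, selected = {[demo.name for demo in selected]}")
```

In this toy case almost all of the instruction and question survive, so the demonstrations must absorb most of the compression, and only the highest-perplexity demonstration fits the remaining budget.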

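2.3 Iterative Token-level Prompt Compression (ITPC)

Using perplexity as the compression criterion has an inherent limitation: the independence assumption. This assumption treats each token in the prompt as independent; in other words, the probability of a token depends only on the token immediately before it and is unrelated to the other tokens. (Translator's note: that is, each token's probability is assumed to depend only on the one or few tokens that precede it, not on more distant tokens in the sequence.)

However, this assumption ignores the complex dependencies between tokens in natural language, which are essential for understanding context and preserving semantic integrity.

Ignoring these dependencies can lose important information during compression. For example, at a high compression ratio, if a token carries a key reasoning step or logical connection in the context, deciding whether to keep it based solely on its own perplexity may break the reasoning chain.

To address this, LLMLingua introduces the Iterative Token-level Prompt Compression (ITPC) algorithm. Rather than judging each token by its independent probability alone, ITPC evaluates each token's contribution more precisely during compression. It iterates over the segments of the prompt and considers the conditional probability of each token within the current context, which better preserves the dependencies between tokens and keeps the compressed prompt semantically and logically coherent.

Figure 6 shows the detailed steps of the ITPC algorithm:

Figure 6: Detailed steps of the ITPC algorithm. Figure by the original author.

Through this process, ITPC shortens the prompt while preserving its semantic content, which in turn reduces the LLM's inference cost.

The corresponding code is in the iterative_compress_prompt function[15].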
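
The control flow of the iterative idea can be sketched as follows. This is an illustration only and makes several assumptions: `iterative_compress` and `toy_score` are names introduced here, the threshold is fixed rather than derived from the compression ratio, and `toy_score` is a hand-written stand-in for a small language model's conditional perplexity. What matters is that each segment is scored given the prefix that has already been compressed.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[Sequence[str], Sequence[str]], List[float]]

STOPWORDS = {"the", "on", "and"}


def toy_score(prefix: Sequence[str], segment: Sequence[str]) -> List[float]:
    """Stand-in for conditional perplexity: tokens already present in the
    context are unsurprising, function words even more so."""
    seen = set(prefix)
    scores = []
    for token in segment:
        if token in STOPWORDS:
            scores.append(0.5)
        elif token in seen:
            scores.append(1.0)
        else:
            scores.append(5.0)
        seen.add(token)
    return scores


def iterative_compress(tokens: List[str], segment_size: int,
                       threshold: float, score_fn: ScoreFn) -> List[str]:
    """Compress segment by segment, conditioning each segment's scores on
    the compressed output of all previous segments."""
    compressed: List[str] = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        scores = score_fn(compressed, segment)
        compressed.extend(t for t, s in zip(segment, scores) if s > threshold)
    return compressed


tokens = "the cat sat on the mat and the cat slept".split()
result = iterative_compress(tokens, segment_size=5, threshold=2.0, score_fn=toy_score)
print(result)
```

The second "cat" is dropped only because the first one survived compression of the earlier segment, which is the kind of dependency a single pass over independent token scores would miss.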

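2.4 Instruction Tuning

As Figure 4 shows, instruction tuning also plays an important role in the LLMLingua framework. Its purpose is to reduce the distribution gap between the small language model used for prompt compression and the target LLM.

Figure 7 shows the detailed steps of instruction tuning:

Figure 7: Detailed steps of instruction tuning. Figure by the original author.

2.5 Code Demonstration

Now let's run the code. The first step is to set up the environment.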

(base) Florian:~ Florian$ conda create -n "llmlingua" python=3.11

(base) Florian:~ Florian$ conda activate llmlingua

(llmlingua) Florian:~ Florian$ pip install llmlingua

The installed version is:

llmlingua          0.2.1

The test code is as follows:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. \
How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"

llm_lingua = PromptCompressor()

## Or use the phi-2 model,
# llm_lingua = PromptCompressor("microsoft/phi-2")

## Or use the quantation model, like TheBloke/Llama-2-7b-Chat-GPTQ, only need <8GB GPU memory.
## Before that, you need to pip install optimum auto-gptq
# llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

compressed_prompt = llm_lingua.compress_prompt(GSM8K_PROMPT.split("\n\n")[0], instruction="", question="", target_token=200)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The default model is downloaded automatically on first run. Alternatively, you can use a quantized model. Figure 8 shows the output:

Figure 8: Output of the LLMLingua test code. Screenshot by the original author.

03 LongLLMLingua

The problem with LLMLingua is that it ignores the user's question during compression, so information that is irrelevant to the question may be retained.

LongLLMLingua[3] is designed to address this by taking the user's question into account during compression.

Figure 9: The LongLLMLingua framework. Content in gray italics is the same as in LLMLingua. Source: LongLLMLingua[3]

As shown in Figure 9, LongLLMLingua introduces four new components to improve the LLM's ability to identify key information:

Question-aware coarse-grained and fine-grained compression;
Document reordering mechanism;
Dynamic compression ratio;
Subsequence recovery algorithm.

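3.1 Question-Aware Coarse-Grained Compression

LongLLMLingua proposes using the perplexity of the question x^que conditioned on each document x^doc_k to measure how closely the two are related. A restrictive statement, x^restrict = "We can get the answer to this question in the given documents", can be appended after x^que. This strengthens the connection between x^que and x^doc_k, and it acts as a regularization term that reduces the effect of hallucination. This can be expressed as:

$$r_k = -\frac{1}{N_c} \sum_{i=1}^{N_c} \log p\left(x_i^{que,restrict} \mid x_k^{doc}\right), \quad k \in \{1, 2, \ldots, K\}$$

where x_i^{que,restrict} is the i-th token of the question followed by the restrictive statement, and N_c is the number of tokens in that sequence.

Why not compute the perplexity of the document conditioned on the question x^que instead? Because documents often contain a large amount of information that is irrelevant to the question. Even when conditioned on x^que, the perplexity computed over the whole document may not be distinctive enough, which makes it a poor metric for document-level compression.

The corresponding code is in the get_distance_longllmlingua function[16].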

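3.2 Question-Aware Fine-Grained Compression

LongLLMLingua introduces the concept of contrastive perplexity:

$$s_i = \text{perplexity}(x_i \mid x_{<i}) - \text{perplexity}(x_i \mid x^{que}, x_{<i})$$

First, we compute the perplexity of a token without the question, denoted perplexity(x_i | x_{<i}). Then we compute it again with the question included, denoted perplexity(x_i | x^que, x_{<i}), which measures how surprising token x_i is given the question x^que and all preceding tokens.

The goal is to determine how much each token's surprisal changes when the question is taken into account. If a token becomes less surprising once the question is included, it is likely to be highly relevant to the question.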
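
As a small numeric illustration, the sketch below takes the difference between the two perplexities described above for each token and ranks the tokens by it. The function name `contrastive_scores` and the per-token numbers are made up; in practice both lists would come from two forward passes of a small language model, one with and one without the question in the context.

```python
from typing import List


def contrastive_scores(ppl_without_question: List[float],
                       ppl_with_question: List[float]) -> List[float]:
    """Per-token drop in perplexity once the question is added to the context."""
    if len(ppl_without_question) != len(ppl_with_question):
        raise ValueError("both lists must describe the same tokens")
    return [without - with_q
            for without, with_q in zip(ppl_without_question, ppl_with_question)]


# Hypothetical values for a document sentence, given the question
# "What is the capital of France?".
tokens = ["Paris", "is", "the", "capital"]
without_q = [9.0, 1.5, 1.0, 6.0]
with_q = [2.0, 1.4, 1.0, 5.5]

scores = contrastive_scores(without_q, with_q)
ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Tokens whose perplexity drops the most once the question is known are the ones the question "explains", so they are the ones worth protecting during fine-grained compression.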

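3.3 Document Reordering Mechanism

As shown in Figure 10, during inference LLMs tend to use the content at the beginning and end of the prompt and to neglect the content in the middle. This is known as the "Lost in the Middle" problem.

Figure 10: An LLM's ability to capture relevant information depends on where that information sits in the prompt. A document reordering mechanism is introduced to reduce information loss in the middle. Source: LongLLMLingua[3]

Figure 10 also shows that LLMs perform best when the key information is placed at the beginning. LongLLMLingua therefore arranges the documents according to the coarse-grained compression results, ordering them from front to back by descending score.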
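
In code, the reordering step amounts to a sort by relevance. The sketch below assumes that each document already has a score from the coarse-grained stage and that a higher score means more relevant; `reorder_documents` and the sample scores are illustrative, not part of the library.

```python
from typing import List


def reorder_documents(documents: List[str], relevance: List[float]) -> List[str]:
    """Place the most relevant documents first (the sort is stable for ties)."""
    if len(documents) != len(relevance):
        raise ValueError("each document needs exactly one score")
    order = sorted(range(len(documents)), key=lambda k: relevance[k], reverse=True)
    return [documents[k] for k in order]


# Hypothetical retrieved documents with hypothetical relevance scores.
docs = ["doc about weather", "doc answering the question", "doc about sports"]
scores = [0.2, 0.9, 0.4]
print(reorder_documents(docs, scores))
```

Putting the highest-scoring document first places the most useful content where, according to Figure 10, the model uses it best.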


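3.4 Dynamic Compression Ratio

Because the density of key information differs across documents, documents that are more relevant to the question should be allocated more budget (that is, a lower compression ratio).

LongLLMLingua uses the importance scores from coarse-grained compression to guide budget allocation in fine-grained compression.

Specifically, LLMLingua's budget controller first sets the initial budget for the retained documents. Then, during fine-grained compression, a budget is dynamically assigned to each document according to the rank of its importance score from the coarse-grained stage.

LongLLMLingua uses a linear scheduler for this adaptive allocation. The budget of each token x_i is computed as:

$$\tau_i = \tau_k^{doc}, \quad \tau_k^{doc} = \max\left(\min\left(\left(1 - \frac{2 I(r_k)}{N_d}\right)\delta\tau + \tau^{doc},\ 1\right),\ 0\right)$$

where x_i belongs to document k, I(r_k) is the rank index of that document's importance score, τ^doc is the initial budget set by the budget controller, N_d is the number of documents, and δτ is a hyperparameter that controls the overall budget for dynamic allocation.

The corresponding code is in the get_dynamic_compression_ratio function[17].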
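
A clipped linear schedule of this kind is easy to sketch. The version below is an assumption-laden illustration, not the code of get_dynamic_compression_ratio: it assumes the budget is the fraction of tokens kept, that ranks start at 0 for the most relevant document, and that the budget is the base value plus (1 - 2 * rank / N_d) times δτ, clipped to the range [0, 1]. The name `dynamic_budgets` and the numbers are made up.

```python
from typing import List


def dynamic_budgets(num_docs: int, base_ratio: float, delta: float) -> List[float]:
    """Per-document budget indexed by rank (0 = most relevant), clipped to [0, 1]."""
    return [
        max(min((1 - 2 * rank / num_docs) * delta + base_ratio, 1.0), 0.0)
        for rank in range(num_docs)
    ]


# Hypothetical setting: 4 documents, base budget 0.5, spread of 0.3.
budgets = dynamic_budgets(num_docs=4, base_ratio=0.5, delta=0.3)
for rank, budget in enumerate(budgets):
    print(f"rank {rank}: keep about {budget:.0%} of the tokens")
```

Under these assumptions the most relevant document keeps noticeably more of its tokens than the least relevant one, while the average stays near the base budget.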

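3.5 Subsequence Recovery Algorithm

As shown in Figure 11, during fine-grained token-level compression, some tokens of key entities may be dropped. For example, "2009" in the original prompt may be compressed to "209", and "Wilhelm Conrad Rontgen" may be compressed to "Wilhelmgen".

Figure 11: An example of subsequence recovery. Red text is the original content and blue text is the compressed result. Source: LongLLMLingua[3]

LongLLMLingua provides a subsequence recovery algorithm that restores the original content in the LLM's response, as shown in Figure 12.

Figure 12: Flow of the subsequence recovery algorithm. Source: LongLLMLingua[3]

The main steps are:

Iterate over the tokens y_l in the LLM's response and select the longest substring ỹ_{key,l} that appears in the compressed prompt x̃;
Find the maximum common shortest subsequence x_{i,j} in the original prompt x that corresponds to ỹ_{key,l};
Replace the matched tokens ỹ_{key,l} in the LLM's response with the corresponding x_{i,j} from the original prompt.

The corresponding code is in the recover function[18].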
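
The three steps can be illustrated on lists of tokens. The sketch below is a simplified reading of the procedure, not the library's recover function: `contains_span`, `shortest_window`, and `recover` are names introduced here, tokens are plain strings, and the "maximum common shortest subsequence" is interpreted as the shortest window of the original prompt that starts and ends with the same tokens as the matched span and contains it in order.

```python
from typing import List, Optional, Tuple


def contains_span(tokens: List[str], span: List[str]) -> bool:
    """Return True if `span` occurs as a contiguous run inside `tokens`."""
    n = len(span)
    return any(tokens[i:i + n] == span for i in range(len(tokens) - n + 1))


def shortest_window(original: List[str], key: List[str]) -> Optional[Tuple[int, int]]:
    """Shortest window of `original` that starts with key[0], ends with
    key[-1], and contains `key` as a subsequence (inclusive indices)."""
    best = None
    for start, token in enumerate(original):
        if token != key[0]:
            continue
        matched = 0
        for end in range(start, len(original)):
            if original[end] == key[matched]:
                matched += 1
                if matched == len(key):
                    if best is None or end - start < best[1] - best[0]:
                        best = (start, end)
                    break
    return best


def recover(original: List[str], compressed: List[str], response: List[str]) -> List[str]:
    """Replace spans of the response copied from the compressed prompt with
    the corresponding, uncompressed span of the original prompt."""
    recovered: List[str] = []
    i = 0
    while i < len(response):
        # Step 1: longest span starting at i that appears in the compressed prompt.
        length = 0
        for candidate in range(len(response) - i, 0, -1):
            if contains_span(compressed, response[i:i + candidate]):
                length = candidate
                break
        if length == 0:
            recovered.append(response[i])
            i += 1
            continue
        key = response[i:i + length]
        # Step 2: locate the matching window in the original prompt.
        window = shortest_window(original, key)
        # Step 3: substitute the original tokens for the compressed span.
        recovered.extend(key if window is None else original[window[0]:window[1] + 1])
        i += length
    return recovered


original = ["Wilhelm", "Conrad", "Ront", "gen", "received", "the", "prize", "in", "1901"]
compressed = ["Wilhelm", "gen", "received", "prize", "1901"]
response = ["It", "was", "Wilhelm", "gen"]
print(recover(original, compressed, response))
```

Here the fragment "Wilhelm gen" that the LLM copied from the compressed prompt is expanded back to the full name, while tokens the LLM produced on its own are left untouched.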

3.6 Code Demonstration

The environment setup is the same as for LLMLingua. The test code is as follows:

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. \
How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"
QUESTION = "Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?"



llm_lingua = PromptCompressor()

compressed_prompt = llm_lingua.compress_prompt(
    GSM8K_PROMPT.split("\n\n")[0],
    question = QUESTION,
    # ratio=0.55
    # Set the special parameter for LongLLMLingua
    condition_in_question = "after_condition",
    reorder_context = "sort",
    dynamic_context_compression_ratio = 0.3, # or 0.4
    condition_compare = True,
    context_budget = "+100",
    rank_method = "longllmlingua",
)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])


print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The output is shown in Figure 13:

Figure 13: Output of the LongLLMLingua test code. Screenshot by the author.

04 AutoCompressor

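Unlike the methods above, AutoCompressor[4] takes a soft-prompt-based approach.

It fine-tunes an existing model by extending its vocabulary and using "summary tokens" and "summary vectors" to condense long contexts.

Figure 14: AutoCompressor processes long documents by recursively generating summary vectors, which are passed as soft prompts to all subsequent segments. Source: AutoCompressor[4]

Figure 14 illustrates how AutoCompressor works. The steps are:

Expand vocabulary: "summary tokens" are added to the model's existing vocabulary. These tokens help the model condense large amounts of information into compact vector representations.
Split document: the document is split into smaller segments, and summary tokens are appended to each segment. These tokens also carry the summary information from all previous segments, which creates summary accumulation.
Fine-tuning: the model is fine-tuned in an unsupervised manner with the "next word prediction" task. The goal is to predict the next word from the tokens that precede it in the current segment and the summary vectors of the previous segments.
Backpropagation: AutoCompressor applies backpropagation through time (BPTT) and gradient checkpointing to each segment to reduce the size of the computational graph. Backpropagation is performed over the entire document, so the model learns the dependencies across the whole context. (Translator's note: BPTT computes the gradient of the loss with respect to the parameters at the current and all previous time steps and propagates it back through the network to update the parameters. Gradient checkpointing trades extra computation for lower memory use: instead of storing every intermediate result of the forward pass, it recomputes some of them during the backward pass.)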

4.1 Code Demonstration

AutoCompressor's source code[19] is publicly available; interested readers can give it a try.

import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel, AutoCompressorModel

# Load AutoCompressor trained by compressing 6k tokens in 4 compression steps
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
# Need bfloat16 + cuda to run Llama model with flash attention
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k", torch_dtype=torch.bfloat16).eval().cuda()

prompt = 'The first name of the current US president is "'
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

context = """Joe Biden, born in Scranton, Pennsylvania, on November 20, 1942, had a modest upbringing in a middle-class family. He attended the University of Delaware, where he double-majored in history and political science, graduating in 1965. Afterward, he earned his law degree from Syracuse University College of Law in 1968.\nBiden's early political career began in 1970 when he was elected to the New Castle County Council in Delaware. In 1972, tragedy struck when his wife Neilia and 1-year-old daughter Naomi were killed in a car accident, and his two sons, Beau and Hunter, were injured. Despite this devastating loss, Biden chose to honor his commitment and was sworn in as a senator by his sons' hospital bedsides.\nHe went on to serve as the United States Senator from Delaware for six terms, from 1973 to 2009. During his time in the Senate, Biden was involved in various committees and was particularly known for his expertise in foreign affairs, serving as the chairman of the Senate Foreign Relations Committee on multiple occasions.\nIn 2008, Joe Biden was selected as the running mate for Barack Obama, who went on to win the presidential election. As Vice President, Biden played an integral role in the Obama administration, helping to shape policies and handling issues such as economic recovery, foreign relations, and the implementation of the Affordable Care Act (ACA), commonly known as Obamacare.\nAfter completing two terms as Vice President, Joe Biden decided to run for the presidency in 2020. He secured the Democratic nomination and faced the incumbent President Donald Trump in the general election. Biden campaigned on a platform of unity, promising to heal the divisions in the country and tackle pressing issues, including the COVID-19 pandemic, climate change, racial justice, and economic inequality.\nIn the November 2020 election, Biden emerged victorious, and on January 20, 2021, he was inaugurated as the 46th President of the United States. 
At the age of 78, Biden became the oldest person to assume the presidency in American history.\nAs President, Joe Biden has worked to implement his agenda, focusing on various initiatives, such as infrastructure investment, climate action, immigration reform, and expanding access to healthcare. He has emphasized the importance of diplomacy in international relations and has sought to rebuild alliances with global partners.\nThroughout his long career in public service, Joe Biden has been recognized for his commitment to bipartisanship, empathy, and his dedication to working-class issues. He continues to navigate the challenges facing the nation, striving to bring the country together and create positive change for all Americans."""
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

summary_vectors = model(context_tokens, output_softprompt=True).softprompt
print(f"Compressing {context_tokens.size(1)} tokens to {summary_vectors.size(1)} summary vectors")
# >>> Compressing 660 tokens to 50 summary vectors

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors, max_new_tokens=12)[0]
print("Generation w/ summary vectors:\n" + tokenizer.decode(generation_with_summary_vecs))
# >>> The first name of the current US president is "Joe" and the last name is "Biden".

next_tokens_without_context = model.generate(prompt_tokens, do_sample=False, max_new_tokens=11)[0]
print("Generation w/o context:\n" + tokenizer.decode(next_tokens_without_context))
# >>> The first name of the current US president is "Donald" and the last name is "Trump".

05 LLMLingua-2

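LLMLingua-2[6] observes that compressing prompts by removing tokens or lexical units according to information entropy computed by a causal language model (such as LLaMa-7B) faces two major challenges:

(1) The small language model used to compute information entropy is not aligned with the actual objective of prompt compression.

(2) The approach relies only on unidirectional context, which may not capture all the information that prompt compression needs.

At the root of both problems is that information entropy may not be the optimal basis for prompt compression.

The overall architecture of LLMLingua-2 is shown in Figure 15: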

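Figure 15: Overview of the LLMLingua-2 architecture. Source: LLMLingua-2[6]

To address the first problem, LLMLingua-2 introduces a data distillation procedure. The procedure extracts knowledge from a large language model to compress prompts without losing key information, and in doing so builds an extractive text compression dataset (translator's note: extractive compression selects the most important sentences, phrases, or words from the original text and assembles them directly into a shorter version that preserves the main information and meaning; it generally does not generate new sentences to summarize the original). Training on such a dataset helps the small language model align more closely with the needs of prompt compression.

To address the second problem, LLMLingua-2 reformulates prompt compression as a token classification task. This formulation ensures that the compressed prompt stays faithful to the original prompt. It uses a Transformer encoder as the underlying architecture, which can draw on the full bidirectional context to capture the essential details that prompt compression needs.

5.1 How to build an effective prompt compression dataset?

Data distillation

Data distillation extracts knowledge from a large language model (such as GPT-4) so that prompts can be compressed effectively without losing essential information.

In LLMLingua-2, the instructions were carefully designed, as shown in Figure 16. They direct GPT-4 to compress the text by removing redundant words from the original text without introducing any new words into the generated text.

At the same time, the instructions do not impose a compression ratio. Instead, GPT-4 is encouraged to shrink the original text as much as possible, provided that the original information is preserved.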

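Figure 16: The instruction used for data distillation in LLMLingua-2.

As Figure 17 shows, GPT-4 tends to compress very long texts at a high ratio, possibly because of its limited ability to handle long contexts. Such aggressive compression is often accompanied by substantial information loss, which can seriously hurt performance on downstream tasks.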

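Figure 17: GPT-4's compression ratio on the MeetingBank dataset as a function of original text length. The study used GPT-4-32k with the output token limit set to 4096. Source: LLMLingua-2[6].

To mitigate this problem, LLMLingua-2 introduces chunk compression: the long text is first split into chunks of no more than 512 tokens, and GPT-4 then compresses each chunk separately.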
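
The chunking step can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace-separated words stand in for real tokens, sentence-ending punctuation is used as the preferred split point, and `chunk_text` and `max_tokens` are hypothetical names that do not come from the LLMLingua-2 code base. Each chunk returned here would then be sent to GPT-4 separately.

```python
import re


def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words approximate tokens. Sentences are kept whole where
    possible; a sentence longer than max_tokens is split by word count.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Hard-split a sentence that cannot fit into any single chunk.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    demo = "One two three. Four five six seven. Eight nine."
    for chunk in chunk_text(demo, max_tokens=5):
        print(chunk)
```

A real pipeline would count tokens with the target model's tokenizer rather than by whitespace, so the chunk boundaries would differ.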

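Data annotation

Data distillation has now given us pairs of original texts and their compressed versions. The goal of data annotation is to assign a binary label to every token in the original text, indicating whether that token should be kept after compression.

Because GPT-4 may not follow the instructions perfectly, LLMLingua-2 uses a sliding window to limit the search range, and fuzzy matching to handle the small changes GPT-4 may make to the original words during compression.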
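
To make the idea concrete, here is a simplified sketch of annotation with a sliding window and fuzzy matching. It is not the paper's exact algorithm: the function name `annotate`, the `window` and `threshold` parameters, and the use of Python's `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def annotate(original: list[str], compressed: list[str],
             window: int = 10, threshold: float = 0.8) -> list[bool]:
    """Return one keep (True) / discard (False) label per original word.

    Each compressed word is searched for only inside a window of the
    original text that starts right after the previous match.
    """
    labels = [False] * len(original)
    start = 0  # the search resumes after the previously matched word
    for word in compressed:
        end = min(len(original), start + window)
        best_idx, best_score = -1, 0.0
        for i in range(start, end):
            score = SequenceMatcher(None, word.lower(), original[i].lower()).ratio()
            if score > best_score:
                best_idx, best_score = i, score
            if score == 1.0:
                break  # exact match: stop at the earliest one
        # Accept exact matches and close-enough fuzzy matches.
        if best_idx >= 0 and best_score >= threshold:
            labels[best_idx] = True
            start = best_idx + 1
    return labels


if __name__ == "__main__":
    original = "So um I think we should consider revising the timeline".split()
    compressed = "think consider revising timelines".split()
    for word, keep in zip(original, annotate(original, compressed, window=5)):
        print(f"{word:10s} {'keep' if keep else 'discard'}")
```

In the demo, "timelines" in the compressed text is a small rewording of "timeline"; the fuzzy matcher still aligns the two, so the original word is labeled as kept.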

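Quality control

For quality control, LLMLingua-2 uses two metrics to assess the compressed texts distilled from GPT-4 and the automatically annotated labels: Variation Rate (VR) and Alignment Gap (AG).

Variation Rate (VR) measures the proportion of words in the compressed text that have changed relative to the original text, while Alignment Gap (AG) measures the accuracy of the automatically annotated labels.

With these metrics, LLMLingua-2 can filter out low-quality samples and so safeguard the quality of the whole dataset.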
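
As a rough illustration of Variation Rate, the sketch below counts the share of words in the compressed text that do not occur in the original text. It assumes whitespace tokenization and case-insensitive comparison, and `variation_rate` is a hypothetical helper name. Alignment Gap is not sketched because this article does not give its formula.

```python
def variation_rate(original: str, compressed: str) -> float:
    """Fraction of compressed-text words that are absent from the original.

    Assumptions: whitespace tokenization, case-insensitive comparison.
    """
    original_words = set(original.lower().split())
    compressed_words = compressed.lower().split()
    if not compressed_words:
        return 0.0
    changed = sum(1 for w in compressed_words if w not in original_words)
    return changed / len(compressed_words)


if __name__ == "__main__":
    src = "we should extend the timeline"
    print(variation_rate(src, "extend timeline"))       # no new words
    print(variation_rate(src, "extend timeline soon"))  # one new word
```

A sample whose VR is high contains many words that GPT-4 introduced or altered, which makes it a poor fit for an extractive dataset.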

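5.2 Compressor

Treating it as a binary classification problem

In essence, prompt compression can be recast as a binary classification problem. The basic idea is to treat each lexical unit as an independent entity and assign it a label: "preserve" or "discard". This approach preserves the integrity of the compressed prompt's content while also simplifying the model design.

Model architecture

A Transformer encoder serves as the feature encoder, with a linear classification layer added on top.

This architecture lets the model capture the bidirectional context of every lexical unit, which provides a solid basis for the compression task.

Compression strategy

The original prompt $x$ is compressed in three steps. The target compression ratio is $1/\tau$, where $\tau$ is the quotient of the number of words in the compressed prompt and the number of words in the original prompt $x$.

First, compute the number of tokens to retain in the compressed prompt $\tilde{x}$: $\tilde{N} = \tau N$, where $N$ is the number of tokens in $x$.
Next, use the token classification model to predict the probability $p_i$ that each word $x_i$ is labeled "preserve".
Finally, keep the $\tilde{N}$ words of $x$ with the highest $p_i$, in their original order, to form the compressed prompt $\tilde{x}$.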
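
The three steps above can be sketched in a few lines. The probabilities here are made-up numbers standing in for the classifier's output, `compress` is a hypothetical helper, and rounding the token budget down with `int()` is an assumption, since the text does not say how a fractional budget is handled.

```python
def compress(tokens: list[str], keep_probs: list[float], tau: float) -> list[str]:
    """Keep the top tau * N tokens by "preserve" probability, in original order."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    # Step 1: number of tokens to retain (assumption: round down).
    n_keep = int(tau * len(tokens))
    # Steps 2-3: rank token positions by probability and take the top n_keep.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original order
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    tokens = ["I", "think", "um", "we", "should", "extend", "it"]
    probs = [0.20, 0.90, 0.05, 0.40, 0.70, 0.95, 0.60]  # illustrative values
    print(compress(tokens, probs, tau=0.5))
```

Sorting the selected positions before reading the tokens back is what keeps the compressed prompt in its original word order.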

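5.3 Code demonstration

As the above shows, the main work in LLMLingua-2 is building the compressor. Once we have the compressor, what comes next?

See the code below (the environment setup is the same as for LLMLingua). The main processing logic lives in the compress_prompt_llmlingua2[20] function.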

from llmlingua import PromptCompressor

PROMPT = "John: So, um, I've been thinking about the project, you know, and I believe we need to, uh, make some changes. I mean, we want the project to succeed, right? So, like, I think we should consider maybe revising the timeline.\n\nSarah: I totally agree, John. I mean, we have to be realistic, you know. The timeline is, like, too tight. You know what I mean? We should definitely extend it."

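# Load the LLMLingua-2 compressor (a token classifier trained on MeetingBank).
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

## Or use LLMLingua-2-small model
# llm_lingua = PromptCompressor(
#     model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
#     use_llmlingua2=True,
# )

# rate=0.33 targets a compressed prompt about one third the original length;
# force_tokens lists tokens that should always be kept.
compressed_prompt = llm_lingua.compress_prompt(PROMPT, rate=0.33, force_tokens=['\n', '?'])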

print('-' * 100)
print("original:")
print(PROMPT)

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

The output is shown in Figure 18:

Figure 18: Output of the LLMLingua-2 test code. Screenshot by the original author.

06 RECOMP

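RECOMP[7] introduces two kinds of trained compressors: extractive and abstractive. The extractive compressor selects useful sentences from the retrieved documents, while the abstractive compressor generates a summary by synthesizing information from multiple documents.

Figure 19 shows where the compressor sits in the RECOMP architecture.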

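Figure 19: The RECOMP architecture. Source: RECOMP[7]

6.1 Extractive compressor

Given the n sentences [s1, s2, ..., sn] in the input document set, RECOMP trains a dual encoder model that maps each sentence si and the input sequence x to fixed-dimensional embeddings. The inner product of these embeddings reflects how much adding si to the input sequence x helps the LLM generate the target output sequence.

The final summary s produced by the compressor consists of the top N sentences, ranked by their inner product with the input sequence.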
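
The selection step can be illustrated with a small sketch. The embeddings below are toy vectors chosen by hand; in RECOMP they would come from the trained dual encoder, which is not reproduced here. `select_sentences` and its parameters are hypothetical names.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_sentences(sentences: list[str], sentence_vecs: list[list[float]],
                     query_vec: list[float], top_n: int) -> list[str]:
    """Return the top_n sentences ranked by inner product with the input vector."""
    scores = [dot(vec, query_vec) for vec in sentence_vecs]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]


if __name__ == "__main__":
    sentences = ["Sentence A", "Sentence B", "Sentence C"]
    # Toy embeddings (assumption): real ones come from the dual encoder.
    sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    query_vec = [1.0, 0.2]
    print(select_sentences(sentences, sentence_vecs, query_vec, top_n=2))
```

Because sentence and input embeddings are computed independently, the sentence vectors could be cached and only the inner products recomputed for each new input.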

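6.2 Abstractive compressor

The abstractive compressor is an encoder-decoder model. It takes the concatenation of the input sequence x and the retrieved documents and produces a summary s.

The method has three steps: first, use a large language model (such as GPT-3) to generate a training dataset; next, filter the dataset; finally, train the encoder-decoder model on the filtered dataset.

6.3 Code demonstration

Since RECOMP is still at an early stage of development, we do not demonstrate it here. Interested readers may wish to try it out themselves.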

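07 Conclusion

This article covered prompt compression: the categories of methods, the principles of the algorithms, and code demonstrations.

Of the methods discussed, LongLLMLingua may be the better choice, and we have applied it in our own projects. The original author promises to update the original article (https://ai.gopubby.com/advanced-rag-09-prompt-compression-95a589f7b554) if they find shortcomings in LongLLMLingua or a better alternative (translator's note: if you notice that the original has been updated, please leave a comment below and we will do our best to update this translation promptly. Thank you.). LLMLingua-2 is also worth trying: it performs well in both speed and memory usage.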

Thanks for reading!

Florian June

An artificial intelligence researcher who mainly writes articles about Large Language Models, data structures and algorithms, and NLP.

END

References

[1] https://arxiv.org/pdf/2304.12102.pdf

[2] https://arxiv.org/pdf/2310.05736.pdf

[3] https://arxiv.org/pdf/2310.06839.pdf

[4] https://arxiv.org/pdf/2305.14788.pdf

[5] https://arxiv.org/pdf/2304.08467.pdf

[6] https://arxiv.org/pdf/2403.12968.pdf

[7] https://arxiv.org/pdf/2310.04408.pdf