Five Common Token Masking Techniques for LLMs and Their PyTorch Implementations
This article introduces the different token masking techniques used in large language models, compares their strengths, and implements them in PyTorch to understand how they work under the hood.
Token masking is a strategy widely used for training both the classification variants of language models and generative models. It was first used by the BERT language model and later adopted by many of its variants (RoBERTa, ALBERT, DeBERTa, ...).
Text corruption, meanwhile, is a broader family of token masking strategies. The BART research paper ran extensive experiments training encoder-decoder generative models with different corruption strategies.
Before getting into the details, let's go over some background on masking strategies in large language models (LLMs).
From supervised to self-supervised
The initial training of a language model uses huge amounts of text, with the goal of teaching the model to represent language correctly and to store that knowledge implicitly in its parameter weights.
This huge amount of text needs labels for training, because the loss (cross-entropy) must be computed after processing the model's input and comparing it against reference data. Annotating such a large quantity of data, however, is not feasible, so the only option is to turn the problem from supervised learning into a self-supervised one in which the labels are generated automatically.
In this setting, a corrupted text sequence serves as the model's training input, while all or part of the original sequence serves as the label. With these automatically generated labels, the model learns the label associated with each training example, and no manually annotated data is needed.
In text corruption (specifically in Token Masking, Token Deletion, and Text Infilling), each word may be masked with a fixed probability (usually around 15-20%). This probability is kept low so that the model can still learn the context of each sentence even though the sequence is corrupted.
There are also techniques, such as Sentence Permutation or Document Rotation, that do not focus on masking words with a certain probability; we will cover them later.
When training a language model, the labels change depending on whether it is a classification model (encoder-only) or a generative model (encoder-decoder). In a classification model, the labels only cover the masked regions of the input: if a single word is masked in a whole sentence, the label is just that single word. In a generative model, since it must be able to generate text continuously, the output label is the original uncorrupted sequence, covering the whole sequence itself, as the sketch below illustrates.
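To make this difference concrete, here is a minimal sketch with hypothetical token IDs (101 and 102 are BERT's [CLS] and [SEP], 103 is [MASK], and -100 is the value Hugging Face uses for positions the loss should ignore):
import torch

# Hypothetical token IDs of an uncorrupted sequence and of the same
# sequence with one token replaced by the mask token (103 in BERT)
original_ids = torch.tensor([101, 1996, 4937, 2938, 102])
masked_ids   = torch.tensor([101, 1996,  103, 2938, 102])

# Encoder-only (classification) model: the label covers only the masked
# position; every other position is set to -100 so the loss ignores it
mlm_labels = torch.where(masked_ids == 103, original_ids,
                         torch.full_like(original_ids, -100))

# Encoder-decoder (generative) model: the label is the whole uncorrupted sequence
seq2seq_labels = original_ids.clone()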
Environment setup
Now that we have briefly covered the background of training language models with text corruption, let's walk through the different text corruption techniques with example code.
We will use Stanza, a library developed by the Stanford NLP group that bundles a variety of NLP tools, which are very useful for our preprocessing.
import stanza
stanza.download('en')
# Text used in our examples
text = "Huntington's disease is a neurodegenerative autosomal disease
results due to expansion of polymorphic CAG repeats in the huntingtin gene.
Phosphorylation of the translation initiation factor 4E-BP results in the
alteration of the translation control leading to unwanted protein synthesis
and neuronal function. Consequences of mutant huntington (mhtt) gene
transcription are not well known. Variability of age of onset is an
important factor of Huntington's disease separating adult and juvenile types.
The factors which are taken into account are-genetic modifiers, maternal
protection i.e excessive paternal transmission, superior ageing genes
and environmental threshold. A major focus has been given to the molecular
pathogenesis which includes-motor disturbance, cognitive disturbance and
neuropsychiatric disturbance. The diagnosis part has also been taken care of.
This includes genetic testing and both primary and secondary symptoms.
The present review also focuses on the genetics and pathology of Huntington's
disease."
# We will use a stanza model for getting each different sentence
# as an element of the list
nlp = stanza.Pipeline('en', use_gpu=False)
doc = nlp(text)
sentences = [sentence.text for sentence in doc.sentences]
Token Masking
Token masking replaces random words in the text with <mask>.
This is the strategy introduced with BERT: the input sequence is corrupted by masking random words, and those words are used as output labels during training.
For classification models, we can directly use Hugging Face's DataCollatorForLanguageModeling class to generate the necessary labels, which lets us train models such as BERT or RoBERTa.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=AutoTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('google-bert/bert-base-uncased')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)

    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,
        mlm_probability=mlm_probability
    )

    """The collator expects a tuple of tensors, so you have to split
    the input tensors and then remove the first dimension and pass it
    to a tuple. """
    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = load_dataset_mlm(sentences)
"""
input_ids[0]:
tensor([ 101, 16364, 1005, 1055, 103, 2003, 1037, 103, 10976, 3207,
103, 25284, 103, 25426, 16870, 4295, 3463, 2349, 2000, 103,
1997, 26572, 18078, 6187, 2290, 17993, 1999, 1996, 5933, 7629,
103, 103, 102, 0, 0])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
labels[0]:
tensor([ -100, -100, -100, -100, 4295, -100, -100, 11265, -100, -100,
6914, -100, 8285, -100, 2389, -100, -100, -100, -100, 4935,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
4962, 1012, -100, -100, -100])
"""
The generated input_ids hold an integer for every token of the original text, and a special token marks each masked word (in BERT this token is 103). This special token changes with the language model used, so different tokenizers will return different identifiers for the mask token.
Hugging Face also assigns distinct token values for different operations in its models: labels set to -100 indicate tokens the model should ignore when computing the loss, as shown below.
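Rather than hard-coding these values, you can read the mask identifier directly from the tokenizer; a quick check (note that -100 is simply the default ignore_index of PyTorch's cross-entropy loss, which Hugging Face relies on):
from transformers import AutoTokenizer
import torch.nn as nn

tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')
print(tokenizer.mask_token, tokenizer.mask_token_id)   # [MASK] 103

# Positions labeled -100 are skipped when the loss is computed
loss_fn = nn.CrossEntropyLoss()   # ignore_index defaults to -100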
For a generative model like BART, we can implement the token masking strategy with the same DataCollatorForLanguageModeling class, but a few small changes are needed to adapt the labels to a generative model.
from transformers import BartTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=BartTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)

    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,                          # True for Masked Language Modelling
        mlm_probability=mlm_probability   # Chance for every token to get masked
    )

    """The collator expects a tuple of tensors, so you have to split
    the input tensors and then remove the first dimension and pass it
    to a tuple. """
    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    # For a generative model the label is the full, uncorrupted sequence
    batch['labels'] = inputs['input_ids']
    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = load_dataset_mlm(sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 50264, 50264, 50264,
4, 2])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2])
"""
Every input token now has a corresponding label token, whether or not it is masked. This is because, unlike a classification model, a generative model must be able to produce a text sequence conditioned on the sequence it is given. In BART's case, the ID of the mask token is 50264.
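A quick check, assuming the facebook/bart-base checkpoint used throughout this article; this is also where the hard-coded 50264 in the next section comes from:
from transformers import BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
print(bart_tokenizer.mask_token, bart_tokenizer.mask_token_id)   # <mask> 50264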
Token Deletion
With Token Deletion, the model must learn both the exact position and the identity of the missing word, so it has to learn more features than with Token Masking alone.
This strategy uses a different form of masking: with a certain probability, a word is removed from the original text sequence, and the model has to find both the missing word and its position. The standard masking approach does not force the model to learn positions, because the mask is already indicated in the model's input.
import torch.nn.functional as F

def token_deletion(sentences, tokenizer_class=BartTokenizer,
                   collator_class=DataCollatorForLanguageModeling,
                   mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,
        mlm_probability=mlm_probability
    )

    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    batch = data_collator(tuple_ids)

    # We use the masked inputs (with <mask> marking the positions to delete) as labels
    batch['labels'] = batch['input_ids'].clone()

    # We remove tokens with the mask identifier and thus perform token deletion.
    # Change the value to the mask identifier of the specific model:
    # 50264 == tokenizer.mask_token_id for facebook/bart-base
    mask = batch['input_ids'] != 50264
    initial_size = batch['input_ids'].size(1)
    total_sentences = batch['input_ids'].size(0)

    # When we remove the specific token, we must fill with the padding
    # token (1 for BART), otherwise the tensor size is not respected.
    for i in range(total_sentences):
        new_tensor = batch['input_ids'][i][mask[i]]
        new_tensor = F.pad(new_tensor, (0, initial_size - new_tensor.size(0)), value=1)
        batch['input_ids'][i] = new_tensor
        attention_mask = batch['input_ids'][i] == 1
        inputs['attention_mask'][i][attention_mask] = 0

    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = token_deletion(sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 2199, 14913, 28904, 3693, 32226, 38868,
2199, 775, 528, 7, 2919, 9, 23404, 636, 230, 35315,
11, 5, 24276, 10596, 4, 2, 1, 1, 1, 1,
1, 1])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 50264, 2199, 50264, 50264, 14913, 28904,
50264, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
23404, 636, 230, 50264, 35315, 11, 5, 50264, 24276, 10596,
4, 2])
"""
When BART is trained with Token Deletion, there is some improvement on long sequences for question answering, summary generation, and conversational tasks.
Text Infilling
Text Infilling allows the model to learn how many words each masked position can hold, whereas the previous approaches assume exactly one word per masked position.
Text Infilling is similar to Token Masking in that we mask the original text with a certain probability, but it differs in that each mask can cover several words. In BART, the masking is done with a Poisson distribution with lambda = 3: on average, every time a span of the sentence is masked, three words are collapsed into a single <mask> token, although, since this is a probability distribution, more or fewer words may be masked, as the short demo below shows.
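As a quick illustration of these span lengths (a minimal sketch, separate from the training code): sampling from a Poisson distribution with lambda = 3 produces spans that average three words but vary from draw to draw; in the BART paper, a span of length 0 corresponds to inserting a <mask> token without removing any word.
import numpy as np

rng = np.random.default_rng()
span_lengths = rng.poisson(lam=3, size=10)
print(span_lengths)          # ten span lengths, typically between 0 and 6 or so
print(span_lengths.mean())   # approaches 3 as the sample size grows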
We will implement Text Infilling using the NumPy library and the tokenizer specific to our language model (BART in this case).
import numpy as np
from transformers import BartTokenizer

def text_infilling(sentence, probability=0.2, poisson_lambda=3):
    # We'll use a binary mask to determine which words to replace
    mask = np.random.choice([0, 1], size=len(sentence), p=[1-probability, probability])

    # Now we'll replace the chosen words with a mask token
    # We'll also use a Poisson distribution to determine the length of the spans to mask
    for i in range(len(mask)):
        if mask[i] == 1:
            span_length = np.random.poisson(poisson_lambda)
            for j in range(span_length):
                if i + j < len(sentence):
                    sentence[i + j] = "<mask>"

    # Collapse runs of consecutive <mask> tokens into a single <mask>
    infilled_sentence = []
    for token in range(len(sentence)):
        if sentence[token] == "<mask>":
            if token < len(sentence)-1:
                if sentence[token+1] == "<mask>":
                    continue
                else:
                    infilled_sentence.append(sentence[token])
            else:
                infilled_sentence.append(sentence[token])
        else:
            infilled_sentence.append(sentence[token])
    return " ".join(infilled_sentence)

def text_infilling_input(masked_sentences, sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(masked_sentences, return_tensors='pt', padding=True, truncation=True)
    labels = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    return inputs['input_ids'], inputs['attention_mask'], labels['input_ids']

# text_infilling expects each sentence as a list of words, so we split first
masked_sentences = [text_infilling(sentence.split()) for sentence in sentences]
input_ids, attention_mask, labels = text_infilling_input(masked_sentences, sentences)
"""
input_ids[0]:
tensor([ 0, 50264, 16, 50264, 2199, 775, 528, 50264, 48052, 636,
50264, 8217, 24276, 10596, 4, 2, 1, 1, 1, 1,
1, 1, 1])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2])
"""
Text Infilling improves the BART language model's results more than Token Deletion, producing better generations for question answering, text summarization, and conversational tasks.
Sentence Permutation
The model's input text is split into sentences that are reordered at random, and the model has to work out the original order.
With Sentence Permutation, it is essential to consider how many sentences fit into the model's input sequence (for small models the input sequence is between 512 and 1024 tokens). Once we have determined how many sentences fit, we need to separate them into a list or array and pick them at random without repeating any of them.
import random

# It selects the first "number_sentences" within a given set of "sentences"
# and returns those sentences in a random order.
# sentence_joiner (defined in the Document Rotation section below) joins the
# sentences back into a single string.
def sentence_permutation(sentences, number_sentences):
    new_sentences = sentences[:number_sentences]
    random.shuffle(new_sentences)
    new_sentences = sentence_joiner(new_sentences)
    return new_sentences

def permuted_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    training_labels = []
    sentences_copy = sentences.copy()

    # We can apply sentence_permutation a number of times equal to the
    # size of the list - 1 to get an example with each new sentence in
    # the text, removing the oldest one.
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentence_permutation(sentences_copy, total_sentences)
        joined_sentences = sentence_joiner(sentences_copy[:total_sentences])
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
        training_labels.append(joined_sentences)

    return training_sentences, training_labels

def permutation_training(sentences: list, sentences_labels: list,
                         tokenizer_class=BartTokenizer,
                         collator_class=DataCollatorForLanguageModeling,
                         mlm=True, mlm_probability=0.0):
    # We get input_ids and attention mask from the permuted sentences
    input, attention_mask, _ = load_dataset_mlm(sentences, tokenizer_class, collator_class, mlm, mlm_probability)

    # Labels from the original sentences
    labels, _, _ = load_dataset_mlm(sentences_labels, tokenizer_class, collator_class, mlm, mlm_probability)

    return input.squeeze(0), attention_mask.squeeze(0), labels.squeeze(0)

# Group 3 sentences per training example (the same grouping used for
# Document Rotation below)
training_sentences, training_labels_sentences = permuted_data_generation(sentences, 3)
input_ids, attention_mask, labels = permutation_training(training_sentences, training_labels_sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2585, 33430, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 4129,
33839, 4405, 35019, 9, 5, 19850, 34939, 3724, 204, 717,
12, 21792, 775, 11, 5, 39752, 9, 5, 19850, 797,
981, 7, 15067, 8276, 37423, 8, 46282, 5043, 4, 2])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 4129, 33839, 4405, 35019, 9, 5, 19850, 34939, 3724,
204, 717, 12, 21792, 775, 11, 5, 39752, 9, 5,
19850, 797, 981, 7, 15067, 8276, 37423, 8, 46282, 5043,
4, 2585, 33430, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 2])
"""
For each data input to the model, we remove the first sentence of the original sequence and append the next one before permuting the fixed number of selected sentences. This reorders the sentences within the input sequence while keeping a sliding context window: each new example introduces one new sentence and drops the oldest one.
Document Rotation
When a document is rotated, a specific word is chosen as the new starting word, and all the words that preceded it are appended to the end of the text.
To apply Document Rotation, we must take into account the dimensions used in each batch: if padding is applied, it must not be rotated along with the rest of the document but must keep its original position while the whole document rotates.
def sentence_joiner(sentences: list):
    return ' '.join(sentences)

# With this function we gather as many sentences as we want to form the input data to the tokenizer.
def rotated_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    sentences_copy = sentences.copy()
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentences_copy[:total_sentences]
        new_sentences = sentence_joiner(new_sentences)
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
    return training_sentences

# Apply this function over the rotated sentences from previous function
def document_rotation_training(sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    tokens = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    tokens['input_ids'] = tokens['input_ids'].squeeze(0)
    tokens['labels'] = tokens['input_ids'].clone()

    iterations = tokens['input_ids'].size(0)
    for i in range(iterations):
        # Get the attention mask and convert to list
        attention_mask = tokens['attention_mask'][i].tolist()
        # Calculate the position where padding starts
        if 0 in attention_mask:
            padding_start_position = attention_mask.index(0)
        else:
            padding_start_position = False

        # We take into account the position of the padding so as not to rotate it along with the rest of the document.
        if padding_start_position:
            random_token = torch.randint(1, padding_start_position-1, (1,))
            tokens['input_ids'][i] = torch.cat((tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                                                tokens['input_ids'][i][random_token.item():padding_start_position-1],  # from random to padding
                                                tokens['input_ids'][i][1:random_token.item()],  # from 1 to random
                                                tokens['input_ids'][i][padding_start_position-1:-1],
                                                tokens['input_ids'][i][-1].unsqueeze(0)), 0)
        # If there is no padding, we rotate the document without taking the padding into account.
        else:
            random_token = torch.randint(1, tokens['input_ids'].size(1)-1, (1,))
            tokens['input_ids'][i] = torch.cat((tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                                                tokens['input_ids'][i][random_token.item():-1],  # from random to end
                                                tokens['input_ids'][i][1:random_token.item()],
                                                tokens['input_ids'][i][-1].unsqueeze(0)), 0)

    return tokens['input_ids'], tokens['attention_mask'].squeeze(0), tokens['labels']
data = rotated_data_generation(sentences, 3)
input_ids, attention_mask, labels = document_rotation_training(data)
"""
input_ids[2]:
tensor([ 0, 2433, 61, 32, 551, 88, 1316, 32, 12, 4138,
15557, 47605, 6, 22835, 2591, 939, 4, 242, 10079, 38422,
9235, 6, 10295, 22540, 14819, 8, 3039, 11543, 4, 347,
37347, 8457, 9, 41419, 8217, 1054, 36, 119, 49491, 43,
10596, 37118, 32, 45, 157, 684, 4, 41058, 4484, 9,
1046, 9, 23808, 16, 41, 505, 3724, 9, 18073, 18,
2199, 18246, 4194, 8, 13430, 3505, 4, 20, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
attention_mask[2]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
labels[2]:
tensor([ 0, 347, 37347, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 41058,
4484, 9, 1046, 9, 23808, 16, 41, 505, 3724, 9,
18073, 18, 2199, 18246, 4194, 8, 13430, 3505, 4, 20,
2433, 61, 32, 551, 88, 1316, 32, 12, 4138, 15557,
47605, 6, 22835, 2591, 939, 4, 242, 10079, 38422, 9235,
6, 10295, 22540, 14819, 8, 3039, 11543, 4, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
"""
As with Sentence Permutation, for each data input we can remove the oldest sentence and add a new one, keeping the sliding context window.
Summary
This article has covered the different token masking strategies used to train language models. Although they are all fairly common, most models use only Token Masking.
For short text sequences, Sentence Permutation and Document Rotation may not help and can even reduce accuracy, whereas Token Masking, Token Deletion, and Text Infilling can be used on both short and long text sequences.