Five Common Token Masking Techniques for LLMs and Their PyTorch Implementations
This article introduces the different token masking techniques used in large language models, compares their strengths, and implements them in PyTorch to understand how they work under the hood.
Token masking is a strategy widely used for training both the classification variants of language models and generative models. It was first used by the BERT language model and later adopted by many of its variants (RoBERTa, ALBERT, DeBERTa, ...).
Text corruption, meanwhile, is a broader family of token masking strategies. The BART research paper ran extensive experiments training encoder-decoder generative models with different corruption strategies.
Before getting into the details, let's go over some background on masking strategies in large language models (LLMs).
From supervised to self-supervised
The initial training of a language model uses huge amounts of text, with the goal of teaching the model to represent language correctly and to store that knowledge implicitly in its parameter weights.
This huge amount of text needs labels for training, because the loss (cross-entropy) must be computed after processing the model's input and comparing it against reference data. Annotating such a large quantity of data, however, is not feasible, so the only option is to turn the problem from supervised learning into a self-supervised one in which the labels are generated automatically.
In this setting, a corrupted text sequence serves as the model's training input, while all or part of the original sequence serves as the label. With these automatically generated labels, the model learns the label associated with each training example, and no manually annotated data is needed.
In text corruption (specifically in Token Masking, Token Deletion, and Text Infilling), each word may be masked with a fixed probability (usually around 15-20%). This probability is kept low so that the model can still learn the context of each sentence even though the sequence is corrupted.
There are also techniques, such as Sentence Permutation or Document Rotation, that do not focus on masking words with a certain probability; we will cover them later.
When training a language model, the labels change depending on whether it is a classification model (encoder-only) or a generative model (encoder-decoder). In a classification model, the labels only cover the masked regions of the input: if a single word is masked in a whole sentence, the label is just that single word. In a generative model, since it must be able to generate text continuously, the output label is the original uncorrupted sequence, covering the whole sequence itself, as the sketch below illustrates.
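To make this difference concrete, here is a minimal sketch with hypothetical token IDs (101 and 102 are BERT's [CLS] and [SEP], 103 is [MASK], and -100 is the value Hugging Face uses for positions the loss should ignore):
import torch

# Hypothetical token IDs of an uncorrupted sequence and of the same
# sequence with one token replaced by the mask token (103 in BERT)
original_ids = torch.tensor([101, 1996, 4937, 2938, 102])
masked_ids   = torch.tensor([101, 1996,  103, 2938, 102])

# Encoder-only (classification) model: the label covers only the masked
# position; every other position is set to -100 so the loss ignores it
mlm_labels = torch.where(masked_ids == 103, original_ids,
                         torch.full_like(original_ids, -100))

# Encoder-decoder (generative) model: the label is the whole uncorrupted sequence
seq2seq_labels = original_ids.clone()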
Environment setup
Now that we have briefly covered the background of training language models with text corruption, let's walk through the different text corruption techniques with example code.
We will use Stanza, a library developed by the Stanford NLP group that bundles a variety of NLP tools, which are very useful for our preprocessing.
import stanza
stanza.download('en')
# Text used in our examples
text = "Huntington's disease is a neurodegenerative autosomal disease
results due to expansion of polymorphic CAG repeats in the huntingtin gene.
Phosphorylation of the translation initiation factor 4E-BP results in the
alteration of the translation control leading to unwanted protein synthesis
and neuronal function. Consequences of mutant huntington (mhtt) gene
transcription are not well known. Variability of age of onset is an
important factor of Huntington's disease separating adult and juvenile types.
The factors which are taken into account are-genetic modifiers, maternal
protection i.e excessive paternal transmission, superior ageing genes
and environmental threshold. A major focus has been given to the molecular
pathogenesis which includes-motor disturbance, cognitive disturbance and
neuropsychiatric disturbance. The diagnosis part has also been taken care of.
This includes genetic testing and both primary and secondary symptoms.
The present review also focuses on the genetics and pathology of Huntington's
disease."
# We will use a stanza model for getting each different sentence
# as an element of the list
nlp = stanza.Pipeline('en', use_gpu=False)
doc = nlp(text)
sentences = [sentence.text for sentence in doc.sentences]
Token Masking
Token masking replaces random words in the text with <mask>.
This is the strategy introduced with BERT: the input sequence is corrupted by masking random words, and those words are used as output labels during training.
For classification models, we can directly use Hugging Face's DataCollatorForLanguageModeling class to generate the necessary labels, which lets us train models such as BERT or RoBERTa.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=AutoTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('google-bert/bert-base-uncased')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)

    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,
        mlm_probability=mlm_probability
    )

    """The collator expects a tuple of tensors, so you have to split
    the input tensors and then remove the first dimension and pass it
    to a tuple. """
    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = load_dataset_mlm(sentences)
"""
input_ids[0]:
tensor([ 101, 16364, 1005, 1055, 103, 2003, 1037, 103, 10976, 3207,
103, 25284, 103, 25426, 16870, 4295, 3463, 2349, 2000, 103,
1997, 26572, 18078, 6187, 2290, 17993, 1999, 1996, 5933, 7629,
103, 103, 102, 0, 0])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
labels[0]:
tensor([ -100, -100, -100, -100, 4295, -100, -100, 11265, -100, -100,
6914, -100, 8285, -100, 2389, -100, -100, -100, -100, 4935,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
4962, 1012, -100, -100, -100])
"""
The generated input_ids hold an integer for every token of the original text, and a special token marks each masked word (in BERT this token is 103). This special token changes with the language model used, so different tokenizers will return different identifiers for the mask token.
Hugging Face also assigns distinct token values for different operations in its models: labels set to -100 indicate tokens the model should ignore when computing the loss, as shown below.
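Rather than hard-coding these values, you can read the mask identifier directly from the tokenizer; a quick check (note that -100 is simply the default ignore_index of PyTorch's cross-entropy loss, which Hugging Face relies on):
from transformers import AutoTokenizer
import torch.nn as nn

tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')
print(tokenizer.mask_token, tokenizer.mask_token_id)   # [MASK] 103

# Positions labeled -100 are skipped when the loss is computed
loss_fn = nn.CrossEntropyLoss()   # ignore_index defaults to -100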
For a generative model like BART, we can implement the token masking strategy with the same DataCollatorForLanguageModeling class, but a few small changes are needed to adapt the labels to a generative model.
from transformers import BartTokenizer, DataCollatorForLanguageModeling
import torch

def load_dataset_mlm(sentences, tokenizer_class=BartTokenizer,
                     collator_class=DataCollatorForLanguageModeling,
                     mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True,
                       truncation=True)

    # Random masking configuration
    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,                          # True for Masked Language Modelling
        mlm_probability=mlm_probability   # Chance for every token to get masked
    )

    """The collator expects a tuple of tensors, so you have to split
    the input tensors and then remove the first dimension and pass it
    to a tuple. """
    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    # Get input_ids, attention_masks and labels for each sentence.
    batch = data_collator(tuple_ids)
    # For a generative model the label is the full, uncorrupted sequence
    batch['labels'] = inputs['input_ids']
    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = load_dataset_mlm(sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 50264, 50264, 50264,
4, 2])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2])
"""
Every input token now has a corresponding label token, whether or not it is masked. This is because, unlike a classification model, a generative model must be able to produce a text sequence conditioned on the sequence it is given. In BART's case, the ID of the mask token is 50264.
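A quick check, assuming the facebook/bart-base checkpoint used throughout this article; this is also where the hard-coded 50264 in the next section comes from:
from transformers import BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
print(bart_tokenizer.mask_token, bart_tokenizer.mask_token_id)   # <mask> 50264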
Token Deletion
With Token Deletion, the model must learn both the exact position and the identity of the missing word, so it has to learn more features than with Token Masking alone.
This strategy uses a different form of masking: with a certain probability, a word is removed from the original text sequence, and the model has to find both the missing word and its position. The standard masking approach does not force the model to learn positions, because the mask is already indicated in the model's input.
import torch.nn.functional as F

def token_deletion(sentences, tokenizer_class=BartTokenizer,
                   collator_class=DataCollatorForLanguageModeling,
                   mlm=True, mlm_probability=0.20):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

    data_collator = collator_class(
        tokenizer=tokenizer,
        mlm=mlm,
        mlm_probability=mlm_probability
    )

    tuple_ids = torch.split(inputs['input_ids'], 1, dim=0)
    tuple_ids = list(tuple_ids)
    for tensor in range(len(tuple_ids)):
        tuple_ids[tensor] = tuple_ids[tensor].squeeze(0)
    tuple_ids = tuple(tuple_ids)

    batch = data_collator(tuple_ids)

    # We use the masked inputs (with <mask> marking the positions to delete) as labels
    batch['labels'] = batch['input_ids'].clone()

    # We remove tokens with the mask identifier and thus perform token deletion.
    # Change the value to the mask identifier of the specific model:
    # 50264 == tokenizer.mask_token_id for facebook/bart-base
    mask = batch['input_ids'] != 50264
    initial_size = batch['input_ids'].size(1)
    total_sentences = batch['input_ids'].size(0)

    # When we remove the specific token, we must fill with the padding
    # token (1 for BART), otherwise the tensor size is not respected.
    for i in range(total_sentences):
        new_tensor = batch['input_ids'][i][mask[i]]
        new_tensor = F.pad(new_tensor, (0, initial_size - new_tensor.size(0)), value=1)
        batch['input_ids'][i] = new_tensor
        attention_mask = batch['input_ids'][i] == 1
        inputs['attention_mask'][i][attention_mask] = 0

    return batch['input_ids'], inputs['attention_mask'], batch['labels']
input_ids, attention_mask, labels = token_deletion(sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 2199, 14913, 28904, 3693, 32226, 38868,
2199, 775, 528, 7, 2919, 9, 23404, 636, 230, 35315,
11, 5, 24276, 10596, 4, 2, 1, 1, 1, 1,
1, 1])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 50264, 2199, 50264, 50264, 14913, 28904,
50264, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
23404, 636, 230, 50264, 35315, 11, 5, 50264, 24276, 10596,
4, 2])
"""
When BART is trained with Token Deletion, there is some improvement on long sequences for question answering, summary generation, and conversational tasks.
Text Infilling
Text Infilling allows the model to learn how many words each masked position can hold, whereas the previous approaches assume exactly one word per masked position.
Text Infilling is similar to Token Masking in that we mask the original text with a certain probability, but it differs in that each mask can cover several words. In BART, the masking is done with a Poisson distribution with lambda = 3: on average, every time a span of the sentence is masked, three words are collapsed into a single <mask> token, although, since this is a probability distribution, more or fewer words may be masked, as the short demo below shows.
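As a quick illustration of these span lengths (a minimal sketch, separate from the training code): sampling from a Poisson distribution with lambda = 3 produces spans that average three words but vary from draw to draw; in the BART paper, a span of length 0 corresponds to inserting a <mask> token without removing any word.
import numpy as np

rng = np.random.default_rng()
span_lengths = rng.poisson(lam=3, size=10)
print(span_lengths)          # ten span lengths, typically between 0 and 6 or so
print(span_lengths.mean())   # approaches 3 as the sample size grows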
We will implement Text Infilling using the NumPy library and the tokenizer specific to our language model (BART in this case).
import numpy as np
from transformers import BartTokenizer

def text_infilling(sentence, probability=0.2, poisson_lambda=3):
    # We'll use a binary mask to determine which words to replace
    mask = np.random.choice([0, 1], size=len(sentence), p=[1-probability, probability])

    # Now we'll replace the chosen words with a mask token
    # We'll also use a Poisson distribution to determine the length of the spans to mask
    for i in range(len(mask)):
        if mask[i] == 1:
            span_length = np.random.poisson(poisson_lambda)
            for j in range(span_length):
                if i + j < len(sentence):
                    sentence[i + j] = "<mask>"

    # Collapse runs of consecutive <mask> tokens into a single <mask>
    infilled_sentence = []
    for token in range(len(sentence)):
        if sentence[token] == "<mask>":
            if token < len(sentence)-1:
                if sentence[token+1] == "<mask>":
                    continue
                else:
                    infilled_sentence.append(sentence[token])
            else:
                infilled_sentence.append(sentence[token])
        else:
            infilled_sentence.append(sentence[token])
    return " ".join(infilled_sentence)

def text_infilling_input(masked_sentences, sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    inputs = tokenizer(masked_sentences, return_tensors='pt', padding=True, truncation=True)
    labels = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    return inputs['input_ids'], inputs['attention_mask'], labels['input_ids']

# text_infilling expects each sentence as a list of words, so we split first
masked_sentences = [text_infilling(sentence.split()) for sentence in sentences]
input_ids, attention_mask, labels = text_infilling_input(masked_sentences, sentences)
"""
input_ids[0]:
tensor([ 0, 50264, 16, 50264, 2199, 775, 528, 50264, 48052, 636,
50264, 8217, 24276, 10596, 4, 2, 1, 1, 1, 1,
1, 1, 1])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2])
"""
Text Infilling improves the BART language model's results more than Token Deletion, producing better generations for question answering, text summarization, and conversational tasks.
Sentence Permutation
The model's input text is split into sentences that are reordered at random, and the model has to work out the original order.
With Sentence Permutation, it is essential to consider how many sentences fit into the model's input sequence (for small models the input sequence is between 512 and 1024 tokens). Once we have determined how many sentences fit, we need to separate them into a list or array and pick them at random without repeating any of them.
import random

# It selects the first "number_sentences" within a given set of "sentences"
# and returns those sentences in a random order.
# sentence_joiner (defined in the Document Rotation section below) joins the
# sentences back into a single string.
def sentence_permutation(sentences, number_sentences):
    new_sentences = sentences[:number_sentences]
    random.shuffle(new_sentences)
    new_sentences = sentence_joiner(new_sentences)
    return new_sentences

def permuted_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    training_labels = []
    sentences_copy = sentences.copy()

    # We can apply sentence_permutation a number of times equal to the
    # size of the list - 1 to get an example with each new sentence in
    # the text, removing the oldest one.
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentence_permutation(sentences_copy, total_sentences)
        joined_sentences = sentence_joiner(sentences_copy[:total_sentences])
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
        training_labels.append(joined_sentences)

    return training_sentences, training_labels

def permutation_training(sentences: list, sentences_labels: list,
                         tokenizer_class=BartTokenizer,
                         collator_class=DataCollatorForLanguageModeling,
                         mlm=True, mlm_probability=0.0):
    # We get input_ids and attention mask from the permuted sentences
    input, attention_mask, _ = load_dataset_mlm(sentences, tokenizer_class, collator_class, mlm, mlm_probability)

    # Labels from the original sentences
    labels, _, _ = load_dataset_mlm(sentences_labels, tokenizer_class, collator_class, mlm, mlm_probability)

    return input.squeeze(0), attention_mask.squeeze(0), labels.squeeze(0)

# Group 3 sentences per training example (the same grouping used for
# Document Rotation below)
training_sentences, training_labels_sentences = permuted_data_generation(sentences, 3)
input_ids, attention_mask, labels = permutation_training(training_sentences, training_labels_sentences)
"""
input_ids[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 2585, 33430, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 4129,
33839, 4405, 35019, 9, 5, 19850, 34939, 3724, 204, 717,
12, 21792, 775, 11, 5, 39752, 9, 5, 19850, 797,
981, 7, 15067, 8276, 37423, 8, 46282, 5043, 4, 2])
attention_mask[0]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
labels[0]:
tensor([ 0, 38831, 2577, 1054, 18, 2199, 16, 10, 14913, 28904,
5777, 3693, 32226, 38868, 2199, 775, 528, 7, 2919, 9,
48052, 636, 230, 3450, 35315, 11, 5, 8217, 24276, 10596,
4, 4129, 33839, 4405, 35019, 9, 5, 19850, 34939, 3724,
204, 717, 12, 21792, 775, 11, 5, 39752, 9, 5,
19850, 797, 981, 7, 15067, 8276, 37423, 8, 46282, 5043,
4, 2585, 33430, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 2])
"""
For each data input to the model, we remove the first sentence of the original sequence and append the next one before permuting the fixed number of selected sentences. This reorders the sentences within the input sequence while keeping a sliding context window: each new example introduces one new sentence and drops the oldest one.
Document Rotation
When a document is rotated, a specific word is chosen as the new starting word, and all the words that preceded it are appended to the end of the text.
To apply Document Rotation, we must take into account the dimensions used in each batch: if padding is applied, it must not be rotated along with the rest of the document but must keep its original position while the whole document rotates.
def sentence_joiner(sentences: list):
    return ' '.join(sentences)

# With this function we gather as many sentences as we want to form the input data to the tokenizer.
def rotated_data_generation(sentences: list, total_sentences: int):
    training_sentences = []
    sentences_copy = sentences.copy()
    for _ in range(len(sentences)-total_sentences+1):
        new_sentences = sentences_copy[:total_sentences]
        new_sentences = sentence_joiner(new_sentences)
        sentences_copy = sentences_copy[1:]
        training_sentences.append(new_sentences)
    return training_sentences

# Apply this function over the rotated sentences from previous function
def document_rotation_training(sentences, tokenizer_class=BartTokenizer):
    tokenizer = tokenizer_class.from_pretrained('facebook/bart-base')
    tokens = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
    tokens['input_ids'] = tokens['input_ids'].squeeze(0)
    tokens['labels'] = tokens['input_ids'].clone()

    iterations = tokens['input_ids'].size(0)
    for i in range(iterations):
        # Get the attention mask and convert to list
        attention_mask = tokens['attention_mask'][i].tolist()
        # Calculate the position where padding starts
        if 0 in attention_mask:
            padding_start_position = attention_mask.index(0)
        else:
            padding_start_position = False

        # We take into account the position of the padding so as not to rotate it along with the rest of the document.
        if padding_start_position:
            random_token = torch.randint(1, padding_start_position-1, (1,))
            tokens['input_ids'][i] = torch.cat((tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                                                tokens['input_ids'][i][random_token.item():padding_start_position-1],  # from random to padding
                                                tokens['input_ids'][i][1:random_token.item()],  # from 1 to random
                                                tokens['input_ids'][i][padding_start_position-1:-1],
                                                tokens['input_ids'][i][-1].unsqueeze(0)), 0)
        # If there is no padding, we rotate the document without taking the padding into account.
        else:
            random_token = torch.randint(1, tokens['input_ids'].size(1)-1, (1,))
            tokens['input_ids'][i] = torch.cat((tokens['input_ids'][i][0].unsqueeze(0),  # initial token
                                                tokens['input_ids'][i][random_token.item():-1],  # from random to end
                                                tokens['input_ids'][i][1:random_token.item()],
                                                tokens['input_ids'][i][-1].unsqueeze(0)), 0)

    return tokens['input_ids'], tokens['attention_mask'].squeeze(0), tokens['labels']
data = rotated_data_generation(sentences, 3)
input_ids, attention_mask, labels = document_rotation_training(data)
"""
input_ids[2]:
tensor([ 0, 2433, 61, 32, 551, 88, 1316, 32, 12, 4138,
15557, 47605, 6, 22835, 2591, 939, 4, 242, 10079, 38422,
9235, 6, 10295, 22540, 14819, 8, 3039, 11543, 4, 347,
37347, 8457, 9, 41419, 8217, 1054, 36, 119, 49491, 43,
10596, 37118, 32, 45, 157, 684, 4, 41058, 4484, 9,
1046, 9, 23808, 16, 41, 505, 3724, 9, 18073, 18,
2199, 18246, 4194, 8, 13430, 3505, 4, 20, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
attention_mask[2]:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
labels[2]:
tensor([ 0, 347, 37347, 8457, 9, 41419, 8217, 1054, 36, 119,
49491, 43, 10596, 37118, 32, 45, 157, 684, 4, 41058,
4484, 9, 1046, 9, 23808, 16, 41, 505, 3724, 9,
18073, 18, 2199, 18246, 4194, 8, 13430, 3505, 4, 20,
2433, 61, 32, 551, 88, 1316, 32, 12, 4138, 15557,
47605, 6, 22835, 2591, 939, 4, 242, 10079, 38422, 9235,
6, 10295, 22540, 14819, 8, 3039, 11543, 4, 2, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
"""
As with Sentence Permutation, for each data input we can remove the oldest sentence and add a new one, keeping the sliding context window.
Summary
This article has covered the different token masking strategies used to train language models. Although they are all fairly common, most models use only Token Masking.
For short text sequences, Sentence Permutation and Document Rotation may not help and can even reduce accuracy, whereas Token Masking, Token Deletion, and Text Infilling can be used on both short and long text sequences.