Implementing a character-based recurrent neural network in PyTorch

Author: LCTT, translated by zxp, 2020-12-19 11:05:57

Over the past few weeks, I spent a lot of time implementing a version of char-rnn in PyTorch. I had never trained a neural network before, so this seemed like a fun place to start.

The idea (from The Unreasonable Effectiveness of Recurrent Neural Networks) is that you can train a character-based recurrent neural network (RNN) on text and get surprisingly good results.


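I didn't get the results I was hoping for, but I still want to share some example code and results, in the hope that they help other people who are getting started with PyTorch and RNNs.

Here is the code as a Jupyter notebook: char-rnn in PyTorch.ipynb. If you click the "Open in Colab" button at the top of that page, it opens in Google's Colab service, where you can train on a free GPU. The whole thing is about 75 lines of code, and I'll explain it in as much detail as I can in this post.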

Step 1: Prepare the data

First, we download the data. I used this text from Project Gutenberg: Hans Christian Andersen's fairy tales.

!wget -O fairy-tales.txt

This is the code that prepares the data. I used the Vocab class from the fastai library, which turns a bunch of letters into a "vocabulary" and then uses that vocabulary to turn the letters into numbers.

That gives us one big array of numbers (training_set) that we can use to train the model.

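import torch
import unidecode
from fastai.text import *

# Read the book and transliterate non-ASCII characters to plain ASCII.
text = unidecode.unidecode(open('fairy-tales.txt').read())
# Build a character-level vocabulary, then map every character to its index.
v = Vocab.create((x for x in text), max_vocab=400, min_freq=1)
training_set = torch.Tensor(v.numericalize([x for x in text])).type(torch.LongTensor).cuda()
num_letters = len(v.itos)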
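For readers who don't have fastai installed, here is a minimal sketch in plain Python of what a character vocabulary does. The helpers build_vocab, numericalize, and textify are hypothetical names made up for this illustration. They are not the fastai API, which also deals with special tokens and frequency cutoffs.

```python
def build_vocab(text):
    """Return (itos, stoi): an index-to-character list and a character-to-index dict."""
    itos = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(itos)}
    return itos, stoi

def numericalize(text, stoi):
    """Turn a string into a list of integer indices."""
    return [stoi[ch] for ch in text]

def textify(indices, itos):
    """Turn a list of integer indices back into a string."""
    return "".join(itos[i] for i in indices)

sample = "once upon a time"          # stand-in for the full book text
itos, stoi = build_vocab(sample)
encoded = numericalize(sample, stoi)
num_letters = len(itos)
print(num_letters, encoded[:4])
print(textify(encoded, itos))
```

The round trip through numericalize and textify should give back the original string, which is a quick sanity check worth running on real data too.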

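Step 2: Define the model

This is a wrapper around PyTorch's LSTM class. Besides wrapping the LSTM class, it does three things:

One-hot encode the input vectors so that they have the right dimension.
Add a linear transformation after the LSTM layer, because the LSTM outputs a vector of length hidden_size and we need a vector of length input_size so that it can be turned into a character.
Save the LSTM's hidden-state output (actually 2 vectors) as an instance variable, and call .detach() on it after every round. (I struggle to explain what .detach() does, but my understanding is that it sort of "ends" the derivative calculation for the model.) (LCTT translator's note: detach() returns a tensor that shares the same data but has requires_grad set to False, so backpropagation stops when it reaches that tensor.)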

class MyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # h2o: maps the hidden state back to one score per character
        self.h2o = nn.Linear(hidden_size, input_size)
        self.input_size = input_size
        self.hidden = None

    def forward(self, input):
        # One-hot encode the character indices and add a batch dimension of 1.
        input = torch.nn.functional.one_hot(input, num_classes=self.input_size).type(torch.FloatTensor).cuda().unsqueeze(0)
        if self.hidden is None:
            l_output, self.hidden = self.lstm(input)
        else:
            l_output, self.hidden = self.lstm(input, self.hidden)
        # Keep the hidden state values but cut them off from the graph.
        self.hidden = (self.hidden[0].detach(), self.hidden[1].detach())

        return self.h2o(l_output)
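Since .detach() is the hardest part of the model to explain, here is a small sketch of what it does, using a one-element tensor instead of an LSTM. The variable names are made up for the illustration. The point is that a detached tensor keeps its values but is treated as a constant when gradients are computed.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
h = w * 3             # h is part of the graph that leads back to w
h_cut = h.detach()    # same values as h, but no longer connected to w

# h_cut acts as a constant here, so the gradient only flows through
# the explicit factor w and not through the earlier "w * 3" step.
loss = (h_cut * w).sum()
loss.backward()

print(h.requires_grad, h_cut.requires_grad)
print(w.grad)
```

Without the detach, the loss would be 3 * w * w and the gradient would be twice as large. In the model, this is why gradients from one chunk of text do not flow back into the chunks that came before it.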

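This code also does something a bit magical that isn't obvious. If your input is a vector (like [1,2,3,4,5,6]) corresponding to six letters, my understanding is that nn.LSTM steps through them internally, updating its hidden vector 6 times, and that gradients then flow back through all 6 steps (backpropagation through time).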
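One way to check that understanding is to compare a single call on a six-step sequence with six calls on one step each, carrying the hidden state forward by hand. This is only a sketch: the sizes (5 input features, 8 hidden units) and the random input are arbitrary choices for the demonstration, and it runs on the CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
x = torch.randn(1, 6, 5)    # batch of 1, six time steps, 5 features each

# One call over the whole sequence.
out_all, (h_all, c_all) = lstm(x)

# Six calls, one time step each, passing the hidden state along manually.
hidden = None
step_outputs = []
for t in range(6):
    out_t, hidden = lstm(x[:, t:t + 1, :], hidden)
    step_outputs.append(out_t)
out_steps = torch.cat(step_outputs, dim=1)

print(out_all.shape)
print(torch.allclose(out_all, out_steps, atol=1e-5))
```

If the two versions agree, the single call really is doing the six hidden-state updates internally.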

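Step 3: Write the training code

The model won't train itself!

At first I tried using a helper class from the fastai library (which is also a wrapper around PyTorch). I found it confusing because I didn't know what it was doing, so in the end I wrote my own training code.

The code below (the epoch() method) is basically everything there is to one round of training. It boils down to repeating these steps:

Pass a string to the RNN model, such as and they ought not to teas (as a vector of numbers).
Get the prediction for the next letter.
Compute the loss between the RNN's prediction and the real next letter (e, because tease ends in e).
Compute the gradients (with loss.backward()).
Change the model's weights in the direction of gradient descent (with self.optimizer.step()).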

import random
import torch.nn.functional as F

# input_size, hidden_size and lr are assumed to be set earlier in the notebook.
class Trainer():
    def __init__(self):
        self.rnn = MyLSTM(input_size, hidden_size).cuda()
        self.optimizer = torch.optim.Adam(self.rnn.parameters(), amsgrad=True, lr=lr)

    def epoch(self):
        i = 0
        while i < len(training_set) - 40:
            seq_len = random.randint(10, 40)
            input, target = training_set[i:i+seq_len], training_set[i+1:i+1+seq_len]
            i += seq_len
            # forward pass
            output = self.rnn(input)
            loss = F.cross_entropy(output.squeeze()[-1:], target[-1:])
            # compute gradients and take optimizer step
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
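To make the slicing and the loss concrete, here is a plain-Python sketch on the "tease" example. It uses a string instead of a tensor, and the cross_entropy helper is a hand-written stand-in for illustration rather than PyTorch's F.cross_entropy. The scores passed to it are made up.

```python
import math

text = "and they ought not to tease"
seq_len = len(text) - 1

# Same slicing as in epoch(): the target is the input shifted by one character.
inp, target = text[0:seq_len], text[1:1 + seq_len]

def cross_entropy(scores, target_index):
    """Negative log-probability that softmax(scores) assigns to target_index."""
    biggest = max(scores)
    log_sum = biggest + math.log(sum(math.exp(s - biggest) for s in scores))
    return log_sum - scores[target_index]

# Made-up scores over a toy 4-letter alphabet; index 2 stands in for "e".
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], 2)
unsure = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)

print(repr(inp), repr(target[-1]))
print(confident, unsure)
```

The loss is small when the model puts most of its weight on the right letter and larger when it spreads its guess evenly, which is what the optimizer step then tries to improve.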

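Let nn.LSTM do backpropagation through time, don't write it yourself

At first I wrote my own code that passed one letter at a time into the LSTM layer and then computed the derivative periodically, like this: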

for i in range(20):
    input, target = next(iter)
    output, hidden = self.lstm(input, hidden)
loss = F.cross_entropy(output, target)
# nn.LSTM returns its hidden state as a (h, c) tuple, so detach both parts
hidden = tuple(h.detach() for h in hidden)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

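This code passes in 20 letters, one at a time, and then does one training step at the end. This procedure is called backpropagation through time, and it's the approach Karpathy uses in his blog post.

This approach sort of worked: my loss would go down for a while, but then it would spike. I don't know why that happened, but once I switched to passing 20 characters at a time into the LSTM (along the seq_len dimension) and then doing backpropagation, things got better.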

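Step 4: Train the model!

I re-ran this training code on the same data about 300 times, until the model started producing text that looked like English. It took a bit over an hour.

In this case I didn't care whether the model was overfitting, but if you're training a model for a real use case, you should validate it on a validation set.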
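The post doesn't use a validation set, so the following is only a sketch of one common way to make one for text data. The function name split_train_valid and the 10% default are assumptions for the illustration. The split is contiguous rather than shuffled, so that the held-out part is still readable text.

```python
def split_train_valid(data, valid_fraction=0.1):
    """Hold out the last valid_fraction of a sequence as a validation set."""
    cut = len(data) - int(len(data) * valid_fraction)
    return data[:cut], data[cut:]

numbers = list(range(1000))    # stand-in for the numericalized text
train_part, valid_part = split_train_valid(numbers)
print(len(train_part), len(valid_part))
```

The same slicing works on a tensor such as training_set. The loss on the held-out part would then be computed inside torch.no_grad(), without calling the optimizer, and compared with the training loss after each pass.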

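Step 5: Generate output!

The last thing to do is generate some output from the model. I wrote helper methods to generate text from the trained model (make_preds and next_pred). Mostly it's about getting the vector dimensions to line up, but the important part is: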

output = rnn(input)
prediction_vector = F.softmax(output/temperature, dim=-1)
letter = v.textify(torch.multinomial(prediction_vector, 1).flatten(), sep='').replace('_', ' ')

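Basically, what this does is:

The RNN outputs a vector of numbers (output), one for each letter or symbol in the alphabet.
The output vector isn't a probability vector yet, so F.softmax(output/temperature) turns it into probabilities (that is, numbers that add up to 1). temperature controls how much weight is given to the higher probabilities: in the limit, if you set temperature=0.0000001, it will always pick the letter with the highest probability.
torch.multinomial(prediction_vector, 1) takes the probability vector and uses those probabilities to pick an index in the vector (like 12).
v.textify turns 12 into a letter.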
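The effect of temperature is easier to see on a tiny example. The sketch below reimplements the softmax and the sampling step in plain Python so it runs without PyTorch. The function names and the three scores are made up for the illustration.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)                       # subtract the max for stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(scores, temperature):
    """Pick one index at random, weighted by the softmax probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

scores = [1.0, 3.0, 2.0]    # made-up scores for a 3-letter alphabet
warm = softmax_with_temperature(scores, 1.0)
cold = softmax_with_temperature(scores, 0.0000001)

print(warm)
print(cold)
print(sample_index(scores, 0.0000001))
```

At temperature 1 every letter keeps some chance of being picked, while at a tiny temperature all of the probability lands on the highest-scoring letter, so sampling behaves like taking the maximum.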

If we want to generate 300 characters of text, we just repeat this process 300 times.

Results!

Setting temperature = 1 in the prediction function gave the model-generated output below. It looks kind of like English, which is pretty good given that the model had to "learn" English from scratch, and at the level of character sequences.

None of it makes any sense, but then again it's not clear what output we should have expected.

“An who was you colotal said that have to have been a little crimantable and beamed home the beetle. “I shall be in the head of the green for the sound of the wood. The pastor. “I child hand through the emperor’s sorthes, where the mother was a great deal down the conscious, which are all the gleam of the wood they saw the last great of the emperor’s forments, the house of a large gone there was nothing of the wonded the sound of which she saw in the converse of the beetle. “I shall know happy to him. This stories herself and the sound of the young mons feathery in the green safe.”

“That was the pastor. The some and hand on the water sound of the beauty be and home to have been consider and tree and the face. The some to the froghesses and stringing to the sea, and the yellow was too intention, he was not a warm to the pastor. The pastor which are the faten to go and the world from the bell, why really the laborer’s back of most handsome that she was a caperven and the confectioned and thoughts were seated to have great made

The output below was generated with temperature=0.1, which picks characters in a way that is closer to "always choose the most likely character". That makes the output a lot more repetitive.

ole the sound of the beauty of the beetle. “She was a great emperor of the sea, and the sun was so warm to the confectioned the beetle. “I shall be so many for the beetle. “I shall be so many for the beetle. “I shall be so standen for the world, and the sun was so warm to the sea, and the sun was so warm to the sea, and the sound of the world from the bell, where the beetle was the sea, and the sound of the world from the bell, where the beetle was the sea, and the sound of the wood flowers and the sound of the wood, and the sound of the world from the bell, where the world from the wood, and the sound of the

This output has a weird obsession with the words beetles, confectioners, sun, and sea.

Wrapping up!

So far my results are nowhere near as good as Karpathy's, probably for a few reasons: