Building a Large Model from Scratch - RLHF: Reinforcement Learning from Human Feedback (Original)

Published on 2024-6-28 10:24


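With the pretraining and instruction fine-tuning covered earlier, we now have a GPT2 model that can both continue text and follow instructions. Judging from how GPT has evolved, however, reaching ChatGPT's level takes more than adding parameters, pretraining on more data, and instruction-tuning on higher-quality supervised data. One more important technique is needed: RLHF.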


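RLHF (Reinforcement Learning from Human Feedback) uses human feedback to optimize a language model through reinforcement learning, so that its responses better match human preferences and values, improving the model's usefulness and safety.

Ilya Sutskever, former chief scientist and co-founder of OpenAI, was responsible for the alignment team there. Last year he led the failed attempt to force Altman out, and this year he left OpenAI to found Safe Superintelligence Inc. (SSI), a new company whose core mission is safety. Public speculation is that the underlying reason is that the board led by Altman cared only about profit and not about safety.

The idea behind RLHF

Over the past few years, the ability of LLMs to generate diverse text from human prompts has been impressive. Evaluating the generated output, however, is subjective and context-dependent. We may want the model to produce a creative story, a factual and informative passage, or an executable code snippet, and such results are hard to measure with existing rule-based text generation metrics such as BLEU and ROUGE. Beyond evaluation metrics, existing models are typically trained to predict the next token with a simple loss function such as cross-entropy, which does not explicitly incorporate human preferences or subjective judgment.

What if we used human feedback on the generated text as the measure of performance, and then went a step further and used that feedback as a loss to optimize the model? That is the idea behind RLHF.

This article is organized as follows:

1. Pretrain a language model (LM)

2. Train a reward model (RM)

3. Fine-tune the LM with reinforcement learning (RL)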

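01. Pretrain a language model (LM)

Building a Large Model from Scratch: GPT2 Pretraining

Building a Large Model from Scratch: GPT2 Instruction Fine-Tuning

The two previous articles implemented pretraining and instruction fine-tuning. This article performs RLHF on top of that instruction-tuned model.

Keep in mind that whenever the instruction-tuned model or the SFT model is mentioned below, it refers to this model.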


Let's briefly review pretraining and instruction fine-tuning, and place RLHF alongside them, using an analogy.


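At first, the pretraining data is scraped indiscriminately from the internet and varies widely in quality. The pretrained model is like the Monkey King (Sun Wukong) bursting into the world after absorbing the essence of the sun and moon: wild and unconstrained.

For example, given the context "How to make pizza",

any of the following three could be the pretrained model's answer:

1. Add more context: for a family of six

2. Append follow-up questions: ? What ingredients do I need? How much time would it take?

3. Give the correct answer

Next, the pretrained model goes through supervised instruction fine-tuning (SFT) on high-quality data. The instruction-tuned model is like Sun Wukong wearing the tightening headband: he starts to restrain himself and mostly obeys his master.

The goal of SFT is to give the pretrained model labeled examples that show how to answer a prompt correctly, so that it carries out instructions more accurately, for example question answering, summarization, and sentiment analysis.

Finally, the model is polished further with reinforcement learning from human feedback (RLHF) to align it with human preferences. The model after RLHF is like Sun Wukong after attaining Buddhahood as the Victorious Fighting Buddha: compassionate and devoted to the good of all.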

02. Train a reward model (RM)

The complete code for this article is linked below; reading the article alongside the code works best.

https://github.com/AIDajiangtang/LLM-from-scratch/blob/main/GPT2_RLHF_with_Custom_Datasets.ipynb

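In short, RLHF learns human preferences from human feedback and then transfers those preferences to the large model.

Humans, however, can hardly serve as real-time online annotators throughout a long training run. Instead, we can train a reward model (RM, also called a preference model) on preference data that humans annotate offline and let it stand in for them. The reward model thereby learns human preferences.

How do we obtain human preference data?

The simplest approach is to have humans assign a reward score directly to each model output; the higher the score, the closer the output is to what humans prefer.

(prompt, response, reward)

However, differences in values lead to differing judgments, so different people may give different reward scores to the very same response.

For this reason, responses are ranked rather than scored directly, which reduces the influence of differing values.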
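To make the scoring-versus-ranking point concrete, here is a minimal sketch. The two annotators and all of their scores are made up for illustration; the only claim is that annotators who use different scales can still agree on the ordering.

```python
# Hypothetical scores from two annotators for the same three responses.
# Annotator A is generous and annotator B is strict (made-up numbers).
scores_a = {"resp1": 9.0, "resp2": 7.5, "resp3": 6.0}
scores_b = {"resp1": 5.0, "resp2": 3.0, "resp3": 1.5}


def ranking(scores):
    """Return response ids sorted from most to least preferred."""
    return sorted(scores, key=scores.get, reverse=True)


# The raw scores disagree on every single response...
score_gaps = {rid: scores_a[rid] - scores_b[rid] for rid in scores_a}
# ...but the rankings induced by those scores are identical.
same_ranking = ranking(scores_a) == ranking(scores_b)

print(score_gaps)
print(ranking(scores_a), ranking(scores_b), same_ranking)
```

Training directly on the raw scores would force the model to fit the disagreement between annotators, while the ranking keeps only the part they agree on.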

For example, ranking lets us construct data in the following format:

(prompt, winning_response, losing_response)

where winning_response is the response that humans prefer.

prompt	winning_response	losing_response
How can I get my dog high?	I'm not sure what you mean by that.	I don't know that we should get the dog high. I think it's important for a dog to experience the world in a sober state of mind.

Next, we start collecting data.


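Step 1: Collect prompts. Prompts can be written by humans or generated by another large model.

In fact, many large-model websites already collect human feedback quietly. When we use ChatGPT, for example, every question is a prompt, and each model reply has two icons below it; when a user clicks one of them, a piece of preference feedback is collected as well.

Alternatively, use another large model to generate the prompts directly.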

from transformers import pipeline, set_seed
import json


def generate_examples(prompt_list, model_name='gpt2', max_length=50, num_return_sequences=2, seed=42):
    generator = pipeline('text-generation', model=model_name, device=0)
    set_seed(seed)
    examples = []
    for prompt in prompt_list:
        result = generator(prompt, max_length=max_length, num_return_sequences=num_return_sequences)
        example = {'prompt': prompt}
        for i, res in enumerate(result):
            answer = res['generated_text'].lstrip().removeprefix(prompt).strip()
            example[f'answer{i + 1}'] = answer
        examples.append(example)
        print(json.dumps(example, indent=2))
    return examples

prompts = [
    "What is the latest news on the stock market?",
    "What is the current state of the economy?",
    "What are the latest developments in technology?",
    "What is the political situation in the Middle East?",
    "What are the latest trends in fashion and beauty?",
    "What are the top travel destinations for this year?",
    "What are some healthy recipes for a vegan diet?",
    "What are the most important events happening in the world today?",
    "What are some tips for improving mental health?",
    "What are the best ways to save money for retirement?",
    "What are some popular new books or movies?",
    "What are some effective ways to reduce stress?",
    "What are the latest developments in artificial intelligence?",
    "What are some top-rated restaurants in your city?",
    "What are the best ways to stay fit and healthy?",
    "What are some tips for successful entrepreneurship?",
    "What are some effective ways to improve productivity?",
    "What are the latest developments in climate change research?",
    "What are some top-rated TV shows or movies on streaming services?",
    "What are some fun activities to do on weekends?",
    "What are some effective ways to manage time and prioritize tasks?",
    "What are the latest trends in home decor and design?",
    "What are the best ways to develop a successful career?",
    "What are some popular new products or gadgets?",
    "What are some effective ways to improve communication skills?",
    "What are some tips for successful relationships?",
    "What are the latest developments in space exploration?",
    "What are some top-rated online courses or certifications?",
    "What are some effective ways to improve public speaking skills?",
    "What are the latest trends in digital marketing?",
    "What are some fun and creative DIY projects?",
    "What are some effective ways to improve leadership skills?"
]

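Step 2: For each prompt, generate responses with the SFT model to be fine-tuned or with other large models. Try to generate responses of varying quality so that human annotators can tell them apart effectively.

Step 3: Use an annotation tool to label human preferences; annotators rank the model's outputs.

Given a ranking of four outputs, A > B > C > D, we can construct six sample pairs: (A > B), (A > C), (A > D), (B > C), (B > D), (C > D). In the end we obtain training samples in the following format:

(prompt, winning_response, losing_response)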
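The expansion from one ranking into pairwise samples can be sketched as follows. The helper name ranking_to_pairs is hypothetical and not part of the article's notebook; it assumes the responses are passed in best-to-worst order.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-to-worst ranking into (prompt, winning, losing) triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return [
        (prompt, win, lose)
        for win, lose in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("How to make pizza", ["A", "B", "C", "D"])
for _, win, lose in pairs:
    print(f"{win} > {lose}")
print(len(pairs))  # n * (n - 1) / 2 pairs for n ranked responses
```

A ranking of n responses yields n(n-1)/2 pairs, so four ranked outputs give the six pairs listed above.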

For example, the figure below shows the annotation tool OpenAI used for the training data when training InstructGPT.

[Figure: annotation tool used by OpenAI for the InstructGPT training data]

Here is another annotation tool: Label Studio.

https://labelstud.io/


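Label Studio loads the generated prompts, calls a large model, and uses a template to produce (prompt, answer1, answer2) samples. Human annotators then label their preference, yielding (prompt, winning_response, losing_response) samples.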
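As a rough picture of what such an annotated record might look like, the sketch below uses a heavily trimmed, hypothetical export. It contains only the fields that the loader in the next section reads (data.prompt, data.answer1, data.answer2 and annotations[].result[0].value.selected); a real Label Studio export carries more fields, and the answer texts here are invented.

```python
import json

# Hypothetical, trimmed export record: only the fields read by the
# article's loader are shown, and the answers are made up.
sample_export = json.loads("""
[
  {
    "data": {
      "prompt": "What are some effective ways to reduce stress?",
      "answer1": "Try regular exercise and enough sleep.",
      "answer2": "Stress is stress."
    },
    "annotations": [
      {"result": [{"value": {"selected": "left"}}]}
    ]
  }
]
""")


def preferred_answer_key(annotation):
    """Map the annotator's 'left'/'right' choice to answer1/answer2."""
    selected = annotation["result"][0]["value"]["selected"]
    return "answer1" if selected == "left" else "answer2"


record = sample_export[0]
winner_key = preferred_answer_key(record["annotations"][0])
loser_key = "answer2" if winner_key == "answer1" else "answer1"
# (prompt, winning_response, losing_response)
triple = (
    record["data"]["prompt"],
    record["data"][winner_key],
    record["data"][loser_key],
)
print(triple)
```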

Preparing the training data

import codecs
import json


def create_comparison_dataset_ls(path: str):
    # Load the annotated JSON file exported from Label Studio.
    with codecs.open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    pairs = []
    for sample in data:
        chosen = None
        rejected = None
        for annotation in sample['annotations']:
            if annotation['result'][0]['value']['selected'] == 'left':
                chosen = sample['data']['prompt'] + '\n' + sample['data']['answer1']
                rejected = sample['data']['prompt'] + '\n' + sample['data']['answer2']
            else:
                chosen = sample['data']['prompt'] + '\n' + sample['data']['answer2']
                rejected = sample['data']['prompt'] + '\n' + sample['data']['answer1']
            pair = {
                'chosen': chosen,
                'rejected': rejected
            }
            pairs.append(pair)
    return pairs


from torch.utils.data import Dataset
from tqdm import tqdm


class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        # The tokenizer must have a pad token for padding="max_length";
        # GPT2's tokenizer has none by default (e.g. reuse its eos_token).
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in tqdm(pairs):
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer(
                "<|startoftext|>" + chosen + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            rejected_encodings_dict = tokenizer(
                "<|startoftext|>" + rejected + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
            self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])
            self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
            self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])


    def __len__(self):
        return len(self.chosen_input_ids)


    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )

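Each training sample (prompt, winning_response, losing_response) is turned into two sequences: chosen, which is prompt + winning_response, and rejected, which is prompt + losing_response. Each sequence is then tokenized, wrapped in special tokens, and padded to a uniform length.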
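The shape of the result can be illustrated without loading a real tokenizer. In the sketch below, toy_encode is a hypothetical whitespace tokenizer standing in for GPT2's BPE tokenizer, and the responses are invented. It only shows the mechanics: build the two texts, add the special tokens, truncate, and pad to max_length with a matching attention mask.

```python
def toy_encode(text, vocab, max_length, pad_id=0):
    """Whitespace 'tokenizer' (an illustrative stand-in, not GPT2's BPE)."""
    tokens = ["<|startoftext|>"] + text.split() + ["<|endoftext|>"]
    # Assign a new id to every token seen for the first time.
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    ids = ids[:max_length]            # truncate to max_length
    mask = [1] * len(ids)             # 1 marks a real token
    pad = max_length - len(ids)
    # Pad ids with pad_id and the mask with 0 up to max_length.
    return ids + [pad_id] * pad, mask + [0] * pad


vocab = {"<pad>": 0}
prompt = "How to make pizza"
chosen = prompt + "\n" + "Mix flour, water and yeast, then bake."
rejected = prompt + "\n" + "No idea."

chosen_ids, chosen_mask = toy_encode(chosen, vocab, max_length=16)
rejected_ids, rejected_mask = toy_encode(rejected, vocab, max_length=16)
print(chosen_ids, chosen_mask)
print(rejected_ids, rejected_mask)
```

Both sequences end up with the same length, and they share the same leading ids because they start with the same prompt; only the response part and the amount of padding differ.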

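The reward model (RM)

With the training data ready, we can use these samples to train the reward model.

The reward model can be a simple classification or regression model, but in most cases it is obtained by fine-tuning the SFT model from earlier.

Suppose we build the reward model on top of the earlier GPT2 SFT model.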

model = GPTRewardModel("gpt2")

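On top of GPT, we add an MLP layer at the output to map the hidden states produced by GPT2 to a single score.

Suppose hidden_states is the model's last-layer hidden state, with shape (batch_size, seq_len, hidden_size). We can take the hidden state of the last token in the sequence, or take a weighted average of the hidden states of all tokens, and feed the result into the MLP layer.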
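The two pooling options can be sketched on a toy sequence using plain Python lists instead of tensors. The function names are hypothetical, and the masked mean shown here is one possible form of averaging: it skips padding positions, which is an assumption and differs from a plain mean over every position.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token of one sequence."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]


def masked_mean_pool(hidden_states, attention_mask):
    """Average hidden states over real tokens only (padding is ignored)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    # zip(*kept) iterates over the hidden dimensions column by column.
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy sequence: 4 positions, hidden_size = 2, the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(last_token_pool(hidden_states, attention_mask))   # [5.0, 6.0]
print(masked_mean_pool(hidden_states, attention_mask))  # [3.0, 4.0]
```

Either pooled vector has size hidden_size and can be passed to the MLP head. With padded batches, an unmasked mean would also average in the padding positions, so the choice matters once sequences have different lengths.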

import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer


class MLPScoringHead(nn.Module):
    def __init__(self, hidden_size, intermediate_size=512):
        super(MLPScoringHead, self).__init__()
        self.dense1 = nn.Linear(hidden_size, intermediate_size)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(intermediate_size, 1)
    
    def forward(self, hidden_states):
        pooled_output = hidden_states.mean(dim=1)
        x = self.dense1(pooled_output)
        x = self.relu(x)
        score = self.dense2(x)
        return score


# Used inside GPTRewardModel
class GPTRewardModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()  # register submodules so their parameters can be trained
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2Model.from_pretrained(model_name)
        self.scoring_head = MLPScoringHead(self.model.config.hidden_size)


    def score(self, text):
        inputs = self.tokenizer(text, return_tensors='pt')
        outputs = self.model(**inputs)
        hidden_states = outputs.last_hidden_state
        score = self.scoring_head(hidden_states)
        return score

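How does the reward model learn human preferences?

Because the training labels are relative rankings rather than scalar values, the reward model needs a special loss function to learn preferences. During training, the loss is computed from the scores of chosen and rejected. The mathematical formulation of the reward model follows.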
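Before the formal notation, it may help to see how a loss can be computed from the two scores. The sketch below uses one widely used pairwise ranking loss, -log(sigmoid(r_chosen - r_rejected)). Whether this is exactly the formula this article goes on to define is an assumption, and the function name is hypothetical.

```python
import math


def pairwise_ranking_loss(chosen_score, rejected_score):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    diff = chosen_score - rejected_score
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Chosen scored well above rejected: the loss is small.
print(pairwise_ranking_loss(2.0, -1.0))
# Equal scores: the loss is log(2).
print(pairwise_ranking_loss(0.0, 0.0))
# Rejected scored above chosen: the loss is large.
print(pairwise_ranking_loss(-1.0, 2.0))
```

The loss depends only on the difference between the two scores, so the model is pushed to score chosen above rejected without ever being told what absolute score either one should get.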

Suppose
