Why RLHF?
Reinforcement Learning from Human Feedback (RLHF) marks a significant leap in AI development, especially for Large Language Models (LLMs). The approach is central to the success of models like OpenAI's ChatGPT and InstructGPT, DeepMind's Sparrow, and Anthropic's Claude. By training models to follow instructions and generate responses that people actually prefer, RLHF turns a raw next-word predictor into a far more intuitive, responsive, and useful assistant.
Step 1: Commencing with Unsupervised Pre-training
The initial phase of training an RLHF model is unsupervised pre-training. A language model such as GPT-3 is trained on vast amounts of text to predict the next token, without any explicit human guidance, and in the process it acquires grammar, vocabulary, factual knowledge, and basic reasoning ability. This pre-trained model is the foundation that the later RLHF stages build on.
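To make the starting point concrete, here is a minimal sketch that loads a publicly available pre-trained model and samples a continuation. It uses Hugging Face transformers with GPT-2 as a small stand-in for a model like GPT-3; the prompt text is just an illustration.

```python
# Minimal sketch: load a pre-trained causal language model as the RLHF starting point.
# GPT-2 is a small, publicly available stand-in for a much larger model like GPT-3.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The pre-trained model already produces fluent continuations of arbitrary text,
# purely from next-token prediction on its training corpus.
inputs = tokenizer("Harry Potter picked up his wand and", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```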
Step 2: Progressing to Supervised Fine-tuning
The second stage of RLHF training is Supervised Fine-tuning. A dataset of <prompt, ideal response> pairs is assembled, for example the prompt “generate a story about Harry Potter” paired with a response written by a human. Companies use labeling platforms like Surge AI to gather this data. The pre-trained model is then fine-tuned on these pairs so that it learns to imitate the human-authored responses, as sketched below.
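The sketch below shows the core of supervised fine-tuning under simple assumptions: the two <prompt, response> pairs are invented placeholders, GPT-2 again stands in for the larger pre-trained model, and the loop processes one example at a time with the standard next-token loss.

```python
# Minimal sketch of supervised fine-tuning on <prompt, ideal response> pairs.
# The two example pairs below are illustrative placeholders, not a real dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

pairs = [
    ("Generate a story about Harry Potter.", "Harry Potter stared at the letter on the doormat..."),
    ("Explain photosynthesis simply.", "Plants turn sunlight, water, and carbon dioxide into food..."),
]

model.train()
for prompt, response in pairs:
    # Concatenate prompt and human-written response; train with the usual
    # next-token (causal LM) objective so the model imitates the demonstration.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the prompt tokens are usually masked out of the loss so the model only imitates the response, and training runs over many thousands of examples in batches.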
Step 3: Building a 'Human Feedback' Reward Model
The third step is developing a 'human feedback' reward model that evaluates the quality of an LLM's responses. It is a secondary model, often another LLM with its final layers replaced by a scalar head, that takes a prompt and a generation as input and outputs a single scalar reward. New prompts are collected, the model generates responses to them, and human labelers (for example, from Surge AI) rate or rank those responses for quality. This preference data trains the reward model to score any <prompt, generation> pair.
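A common recipe, the one popularized by InstructGPT, trains the reward model on pairwise comparisons: labelers pick the better of two generations for the same prompt, and the model learns to give the preferred one a higher score. The sketch below follows that recipe; the RewardModel class, the GPT-2 backbone, and the ranking_loss helper are illustrative choices, not a fixed API.

```python
# Minimal sketch of a reward model: a pre-trained transformer backbone with its
# LM head replaced by a single linear head that outputs one scalar per sequence.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

class RewardModel(nn.Module):
    def __init__(self, backbone_name="gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the sequence using the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # one scalar reward per sequence

reward_model = RewardModel()

# Pairwise ranking loss: for each prompt, labelers mark one generation as better.
# The reward model is trained so the preferred generation gets the higher score.
def ranking_loss(reward_model, chosen_texts, rejected_texts):
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(chosen["input_ids"], chosen["attention_mask"])
    r_rejected = reward_model(rejected["input_ids"], rejected["attention_mask"])
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```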
Step 4: Training a Reinforcement Learning Policy Optimized for the Reward Model
In the final phase, a reinforcement learning policy is trained against the reward model, typically with an algorithm such as Proximal Policy Optimization (PPO). The policy starts from the fine-tuned LLM of Step 2: it generates responses to prompts, the reward model scores those responses, and the policy's weights are updated to increase the expected reward. Over many iterations the model learns to produce text that humans prefer.
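The full algorithm used in practice is PPO with a KL penalty that keeps the policy close to the Step-2 model, which is more involved than fits here. The sketch below shows the skeleton of the loop with a plain REINFORCE-style update instead: sample a response, score it with the frozen reward model from the previous sketch, and push up the log-probability of high-reward text. The prompt, the hyperparameters, and the reuse of reward_model and tokenizer from the Step-3 sketch are assumptions for illustration only.

```python
# Highly simplified sketch of the RL step: the policy (the fine-tuned LLM from
# Step 2) generates a response, the frozen reward model scores it, and the
# policy is nudged toward higher-reward text with a REINFORCE-style update.
# Production systems use PPO with a KL penalty against the Step-2 model.
# `reward_model` and `tokenizer` are assumed from the Step-3 sketch above.
import torch
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the SFT model
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

prompt = "Generate a story about Harry Potter."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# 1. The policy samples a response to the prompt.
generated = policy.generate(prompt_ids, max_new_tokens=40, do_sample=True,
                            pad_token_id=tokenizer.eos_token_id)
response_ids = generated[:, prompt_ids.size(1):]

# 2. The reward model scores the full <prompt, generation> pair.
with torch.no_grad():
    reward = reward_model(generated, torch.ones_like(generated))

# 3. Policy-gradient update: scale the log-probability of the sampled
#    response tokens by the scalar reward and take a gradient step.
logits = policy(generated).logits[:, prompt_ids.size(1) - 1 : -1, :]
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
loss = -(reward * token_log_probs.sum(dim=1)).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```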