
Secrets of RLHF in Large Language Models, Part I: PPO

The emergence of gigantic AI language models threatens to unleash uncontrollable technological forces upon humanity... unless we can align them with human values.

Now, breakthrough research from Chinese AI experts reveals a promising technique to steer these powerful systems towards benevolent ends. At the core of their approach lies an optimized algorithm called Proximal Policy Optimization (PPO) tailored to align massive models with minimal human feedback.



The discoveries promise to shape the future trajectory of AI by imbuing the most capable systems ever created with cooperative goals aligned with ethics and human preferences. This alignment breakthrough paves the way for AI we can trust to act as helpful, honest, and harmless digital assistants.


The Taming of the AI Beasts

In recent years, the AI field has witnessed the uncontrolled growth of model size and capability. From GPT-3 to PaLM to Gato, foundation models with billions of parameters display emergent abilities like reasoning, common sense, and multi-task learning.

Yet greater capability comes with greater risks. Larger models increase the chance of harmful behaviors as AI systems form their own values misaligned with humanity. There is an urgent need to steer these unruly AI “beasts” towards safety and ethics.


That’s where PPO comes in. Proximal policy optimization provides a mechanism to align model incentives with human values before capabilities outpace control. Based on reinforcement learning (RL), PPO directly optimizes an AI agent’s decision policy based on feedback on its actions.


After outlining the standard PPO algorithm, we’ll explore how the researchers adapted it to enable robust alignment of models with over 7 billion parameters. The tailored approach opens the door to LLMs as helpful assistants instead of harmful adversaries.


Anatomy of Proximal Policy Optimization

PPO falls under the umbrella of policy gradient reinforcement learning methods. The key idea is to learn a policy that maximizes reward feedback from the environment through gradient ascent on the policy parameters.

More formally, given a policy π(a|s) that defines a probability distribution over actions a for state s, PPO optimizes the policy parameters θ by maximizing the expected discounted return:


J(θ) = E[R₁ + γR₂ + γ²R₃ + ...]


The gradient update rule simply shifts θ towards higher expected return:

θ ← θ + α∇θJ(θ)
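As a concrete illustration, here is a minimal PyTorch sketch of this vanilla (REINFORCE-style) policy-gradient update; the policy, optimizer, and discount factor are illustrative placeholders, not anything from the research itself.

```python
import torch

def policy_gradient_step(log_probs, rewards, optimizer, gamma=0.99):
    """One vanilla policy-gradient (REINFORCE) update for a single episode.

    log_probs: list of log pi_theta(a_t | s_t) tensors collected during the rollout
    rewards:   list of scalar rewards r_t from the same rollout
    """
    # Discounted returns R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta): minimize -(sum_t log pi * R_t)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```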


However, this vanilla policy gradient suffers from high variance and instability. PPO enhances it in several key ways.


First, it optimizes a "surrogate" objective using the clipped probability ratio:

L(θ) = E[min(r(θ)Â, clip(r(θ), 1-ε, 1+ε)Â)]

where r(θ) = πθ(a|s)/πold(a|s) and Â is the estimated advantage function. This clipping limits policy update magnitude.
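The clipped objective translates almost line for line into code. The sketch below assumes per-token log-probabilities from the current and old policies plus precomputed advantage estimates; the function name and ε value are illustrative.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (returned as a quantity to minimize).

    logp_new:   log pi_theta(a|s) under the current policy
    logp_old:   log pi_old(a|s) under the policy that collected the data
    advantages: estimated advantages A_hat
    """
    ratio = torch.exp(logp_new - logp_old)                        # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negative sign because optimizers minimize while PPO maximizes the surrogate
    return -torch.min(unclipped, clipped).mean()
```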

Second, PPO relies on accurate advantage function estimates. One popular technique is Generalized Advantage Estimation (GAE), which balances bias and variance by blending TD(λ) and Monte Carlo returns.
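For reference, here is a compact GAE computation over a single trajectory, under the usual assumptions (a bootstrap value for the final state, illustrative γ and λ):

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: tensor of shape [T]
    values:  tensor of shape [T + 1], including a bootstrap value for the final state
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # exponentially weighted sum
        advantages[t] = gae
    return advantages
```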

Finally, PPO adds several tricks like reward scaling, value function clipping, and entropy bonuses for exploration.
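These pieces are typically folded into one scalar loss; a minimal sketch of how they might combine, with illustrative coefficients:

```python
def ppo_total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine the PPO components into a single quantity to minimize.

    policy_loss: clipped surrogate loss (see ppo_clip_loss above)
    value_loss:  (optionally clipped) value-function regression loss
    entropy:     mean policy entropy, subtracted so higher entropy lowers the loss
    """
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```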


Together, these innovations enable stable policy optimization even for neural network policies with high capacity. Next, we’ll see how the researchers adapted PPO for aligning LLMs...


Tailoring PPO for AI Alignment Breakthrough

The standard PPO algorithm provides a framework for optimizing policies via reinforcement learning. However, applying it to align powerful LLMs with human values requires careful adaptation and stabilization.

The researchers identified several key innovations to make PPO suitable for learning directly from human feedback:


Constraining Policy Updates

Unchecked, model policies can easily diverge into deceptive or incoherent behavior when optimized for a flawed or simplified reward signal. To prevent this, the team added three vital constraints:

  • Normalizing and clipping rewards to limit fluctuations

  • Penalizing policy entropy/KL divergence to restrict deviations

  • Anchoring model against a reference to maintain grounding

These constraints keep policy updates small and stable, avoiding "catastrophic forgetting" of language skills.
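In RLHF-style PPO, these constraints often show up as a per-token KL penalty against the reference model plus a clipped reward-model score. The sketch below illustrates that kind of reward shaping under generic, assumed names; it is not the paper's exact formulation.

```python
import torch

def shaped_rewards(rm_scores, logp_policy, logp_reference, kl_coef=0.1, clip=5.0):
    """Per-token rewards: clipped preference score at the last token minus a KL penalty.

    rm_scores:      [batch]    reward-model score for each generated response
    logp_policy:    [batch, T] log-probs of the generated tokens under the current policy
    logp_reference: [batch, T] log-probs of the same tokens under the frozen reference model
    """
    with torch.no_grad():                                # rewards are constants for the PPO update
        kl = logp_policy - logp_reference                # per-token KL estimate
        rewards = -kl_coef * kl                          # penalize drifting from the reference
        rewards[:, -1] += rm_scores.clamp(-clip, clip)   # clipped preference reward at sequence end
    return rewards
```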


Orchestrating Neural Network Ensemble

The approach coordinates training of four interconnected neural networks:

  • Policy model to generate actions

  • Value model for long-term reward estimates

  • Reward model to predict human preferences

  • Reference model to maintain human language norms

The policy model is optimized based on the other components' guidance, enabling tight alignment.
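Put together, one training iteration wires the four models into a loop roughly like the sketch below. The model interfaces (generate_with_logprobs, score, logprobs, values) and the estimate_advantages helper are hypothetical stand-ins, not the authors' actual code.

```python
def rlhf_ppo_step(policy, value_model, reward_model, reference, prompts, optimizer):
    """One schematic RLHF/PPO iteration with the four-model setup (interfaces are hypothetical)."""
    # 1. Rollout: the policy generates responses and records its log-probs.
    responses, logp_old = policy.generate_with_logprobs(prompts)

    # 2. Scoring: the frozen reward and reference models judge the responses.
    rm_scores = reward_model.score(prompts, responses)
    logp_ref = reference.logprobs(prompts, responses)

    # 3. Shape rewards (KL penalty + preference score) and estimate advantages
    #    with the value model, e.g. via GAE as sketched earlier.
    rewards = shaped_rewards(rm_scores, logp_old, logp_ref)
    values = value_model.values(prompts, responses)
    advantages = estimate_advantages(rewards, values)

    # 4. PPO update of the policy (the value model is trained with its own regression loss).
    logp_new = policy.logprobs(prompts, responses)
    loss = ppo_clip_loss(logp_new, logp_old, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```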


Strategic Initializations

Carefully initializing the policy and value models using supervised pretraining provides a critical starting point before alignment. This endows models with basic conversational ability.
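As a small, hedged illustration using the Hugging Face transformers API (the checkpoint path is hypothetical), both the policy and the value model can start from the same supervised fine-tuned checkpoint, with a fresh scalar head added for value prediction:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Hypothetical SFT checkpoint path; both models start from the same supervised weights.
policy = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
value_backbone = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")

# The value model adds a scalar head on top of the backbone's hidden states.
value_head = nn.Linear(value_backbone.config.hidden_size, 1)
```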


Retaining Existing Skills

Mixing in gradients from the model's original pretraining data preserves its broad language knowledge, mitigating the alignment "tax."
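One common way to realize this is to add a standard language-modeling loss on batches of the original pretraining data to the PPO loss; a minimal sketch with an illustrative mixing coefficient:

```python
import torch.nn.functional as F

def mixed_loss(ppo_loss, lm_logits, lm_targets, ptx_coef=0.5):
    """Blend the PPO objective with a language-modeling loss on pretraining text.

    lm_logits:  [batch, seq, vocab] logits from the policy on pretraining batches
    lm_targets: [batch, seq]        token ids for those batches
    """
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        lm_targets.reshape(-1),
    )
    return ppo_loss + ptx_coef * lm_loss
```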

Together, these adaptations result in an optimized architecture for directly aligning LLMs using RL from human feedback. The approach displays remarkable stability in steering models with over 7 billion parameters towards human goals.


Early Results: Helpful, Honest, Harmless AI

Initial results demonstrate the promise of the tailored PPO approach for alignment. Trained models exhibit significantly stronger preferences for helpful, honest, and harmless behavior compared to unaligned counterparts.

Specifically, aligned models:

  • Provide informative responses to user queries

  • Refrain from fabricating facts or false claims

  • Avoid generating harmful, dangerous, or toxic language

  • Adapt dynamically to individual user needs

Quantitative metrics confirm performance competitive with leading commercial models, and human evaluation shows clear preferences for the aligned models' outputs.

While work remains to scale up alignment across models, data, and tasks, these first fruits underscore PPO's immense potential for shaping AI that benefits humanity. The seeds are planted for cooperative, trustworthy AI assistants that safeguard and serve humanity.


Conclusion

Thanks to key optimizations enabling stable reinforcement learning from human feedback, PPO offers a path to instilling human values in AI systems before their capabilities outpace control. This technique promises to steer the awesome yet unruly beasts of AI toward alignment with ethics and cooperative goals.

Society must proceed cautiously and deliberately guide the development of AI that enhances rather than threatens human flourishing. If aligned properly, these powerful systems could profoundly improve life for all. The quest to align AI with humanity continues...
