Reinforcement Learning from Human Feedback (RLHF)
Training technique where humans rank model outputs, a reward model is trained to predict those rankings, and the language model is then optimized (typically with a reinforcement learning algorithm such as PPO) to maximize the learned reward. Used to align LLMs with helpfulness and harmlessness goals.
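The core of the technique is the reward-modeling step: fitting a scoring function so that outputs humans preferred score higher than outputs they rejected. Here is a minimal, hedged sketch of that step using a linear reward model over toy feature vectors and the standard Bradley-Terry preference loss. The feature vectors, learning rate, and data are illustrative assumptions; a real system would score text with an LLM's hidden states, not two hand-made numbers.

```python
import math
import random

random.seed(0)

def reward(w, x):
    # Linear reward model: r(x) = w . x (a stand-in for a neural scorer)
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit w so that reward(chosen) > reward(rejected) on labeled pairs.

    pairs: list of (preferred_features, rejected_features) tuples,
    i.e. the human ranking data collected in RLHF.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Bradley-Terry model: P(chosen beats rejected)
            # = sigmoid(r_chosen - r_rejected)
            margin = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on the log-likelihood of the preference
            g = 1.0 - p
            w = [wi + lr * g * (c - r)
                 for wi, c, r in zip(w, chosen, rejected)]
    return w

# Toy preference data: annotators prefer outputs whose first feature
# (say, "helpfulness") is higher.
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.5], [0.3, 0.4]),
    ([0.9, 0.1], [0.2, 0.8]),
]

w = train_reward_model(pairs, dim=2)
print("learned weights:", w)
```

After this step, the learned reward replaces the human in the loop: the policy is fine-tuned to produce outputs the reward model scores highly, which is what "maximize human preferences" means in practice.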
What this means in simple words
Reinforcement Learning from Human Feedback (RLHF) is a core idea in modern AI work. The definition above gives the direct meaning: instead of hand-writing a reward function, the system learns one from human preference judgments. In daily work, the term describes how feedback data is collected, how the reward model is trained, and who controls each step of the loop. Good teams use one clear meaning so everyone stays aligned.
Why this matters
Clear language improves execution. When a team agrees on the meaning of Reinforcement Learning from Human Feedback (RLHF), planning gets faster, handoffs get cleaner, and technical decisions stay consistent. It also helps writing, interviews, and product docs. This term connects closely with Alignment and Fine-Tuning; knowing these links builds stronger technical judgment.
Simple example
Imagine a small team shipping one feature in one sprint. They add a short note in their docs with the meaning of Reinforcement Learning from Human Feedback (RLHF) and one real use in their stack. Designers, engineers, and founders then use the same language in meetings. That removes confusion, cuts rework, and improves delivery quality.
Common mistake
A common mistake is using Reinforcement Learning from Human Feedback (RLHF) as a buzzword. Buzzwords sound smart but hide weak thinking. Keep the term tied to a real user problem, a real workflow, and a real technical choice. If the explanation feels vague, simplify it until every sentence is direct.