Direct Preference Optimization (DPO) is an alignment technique that fine-tunes language models directly on preference data without a separate reward model, simplifying the RLHF pipeline while achieving comparable results. Standard RLHF involves three stages: collecting preference data, training a reward model on those preferences, and then using reinforcement learning to optimize the language model against the reward model. DPO collapses this into a single supervised learning step by deriving a loss function directly from the preference data.

Given pairs of preferred and dispreferred responses to the same prompt, DPO increases the probability of preferred responses relative to dispreferred ones. Its loss, derived from the RLHF objective under a Bradley-Terry preference model, scores each response by the log-ratio of the policy's probability to a frozen reference model's probability. The loss implicitly optimizes the same objective as RLHF but avoids the instabilities of reinforcement learning and the compounding errors of a learned reward model. Training is more stable, requires less hyperparameter tuning, and runs on standard supervised learning infrastructure. The main requirement is paired preference data: examples where humans indicated which of two responses they prefer.

DPO has become popular in alignment research because it is simpler to implement and debug than full RLHF. Variants such as IPO and KTO further refine the approach. DPO represents a shift toward treating alignment as a supervised learning problem with carefully designed loss functions.
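The loss described above can be sketched for a single preference pair. This is a minimal illustration, not a production implementation: it assumes the per-response log-probabilities (summed over tokens) under the policy and the frozen reference model have already been computed, and the function name and `beta` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs.

    beta controls how strongly the policy is kept close to the
    reference model (larger beta = tighter constraint).
    """
    # Implicit rewards: scaled log-ratio of policy to reference probability.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin: the loss shrinks as the
    # policy favors the preferred response more than the reference does.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical probabilities, the margin is zero and the loss equals log 2; as the policy shifts probability toward the preferred response, the loss decreases. In practice this is computed over batches with automatic differentiation (e.g. in PyTorch), but the scalar form above captures the objective.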