
Reward Model

A reward model is a neural network trained to predict human preferences, serving as the learned objective function in reinforcement learning from human feedback (RLHF). Human labelers compare pairs of model outputs and indicate which they prefer. The reward model learns to predict these preferences, assigning higher scores to outputs humans would rate more favorably. Once trained, the reward model can evaluate millions of outputs without further human involvement, providing the signal that guides fine-tuning toward human-preferred behavior.

The reward model is typically initialized from the same base model being aligned, which gives it an existing understanding of language and context, then trained on the preference dataset. The training objective is to maximize the probability that the preferred output receives a higher score than the rejected one.

Reward model quality is critical because errors compound through training. If the reward model systematically prefers verbose responses or confident-sounding mistakes, the aligned model will learn those behaviors. Reward hacking occurs when the policy model finds ways to achieve high reward scores without actually satisfying human intent, exploiting blind spots in the reward model. Constitutional AI and Direct Preference Optimization emerged partly to address these limitations. Despite the challenges, reward models remain central to producing helpful, harmless AI assistants.
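The pairwise objective described above is commonly formulated as a Bradley-Terry style loss: minimize the negative log probability that the preferred output outscores the rejected one. A minimal sketch in plain Python, where the two scalar scores stand in for the reward model's outputs on a chosen and a rejected response (in practice these come from a neural scoring head, and the loss is averaged over a batch of preference pairs):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is near zero when the preferred output already scores much
    higher than the rejected one, and grows when the ranking is wrong,
    pushing the reward model to separate the two scores.
    """
    margin = score_chosen - score_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)

# Correct ranking (chosen scores higher) yields a small loss;
# an inverted ranking yields a much larger one.
low = pairwise_preference_loss(2.0, 0.0)
high = pairwise_preference_loss(0.0, 2.0)
print(f"correct ranking loss: {low:.4f}, wrong ranking loss: {high:.4f}")
```

Gradient descent on this loss over many labeled pairs is what teaches the model to assign higher scores to human-preferred outputs.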