
Reward Model

A reward model is a neural network trained to predict human preferences, serving as the learned objective function in reinforcement learning from human feedback (RLHF). Human labelers compare pairs of model outputs and indicate which they prefer. The reward model learns to predict these preferences, assigning higher scores to outputs humans would rate more favorably. Once trained, the reward model can evaluate millions of outputs without further human involvement, providing the signal that guides fine-tuning toward human-preferred behavior.

The reward model is typically initialized from the same base model being aligned, which gives it an existing understanding of language and context, then trained on the preference dataset. The training objective is to maximize the probability that the preferred output receives a higher score than the rejected one.

Reward model quality is critical because errors compound through training. If the reward model systematically prefers verbose responses or confident-sounding mistakes, the aligned model will learn those behaviors. Reward hacking occurs when the policy model finds ways to achieve high reward scores without actually satisfying human intent, exploiting blind spots in the reward model. Constitutional AI and Direct Preference Optimization emerged partly to address these limitations. Despite the challenges, reward models remain central to producing helpful, harmless AI assistants.
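The pairwise objective described above is commonly formulated as a Bradley-Terry style loss: minimize the negative log probability that the preferred output outscores the rejected one. A minimal sketch in plain Python, where the two scalar scores stand in for the reward model's outputs on a chosen and a rejected response (in practice these come from a neural scoring head, and the loss is averaged over a batch of preference pairs):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is near zero when the preferred output already scores much
    higher than the rejected one, and grows when the ranking is wrong,
    pushing the reward model to separate the two scores.
    """
    margin = score_chosen - score_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)

# Correct ranking (chosen scores higher) yields a small loss;
# an inverted ranking yields a much larger one.
low = pairwise_preference_loss(2.0, 0.0)
high = pairwise_preference_loss(0.0, 2.0)
print(f"correct ranking loss: {low:.4f}, wrong ranking loss: {high:.4f}")
```

Gradient descent on this loss over many labeled pairs is what teaches the model to assign higher scores to human-preferred outputs.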