Cross-entropy loss is the standard objective function for training classification and language models, measuring the discrepancy between a predicted probability distribution and the true labels. For a single prediction, cross-entropy equals the negative logarithm of the probability assigned to the correct class. High confidence in a correct answer yields low loss; low confidence yields high loss; a confident wrong answer yields very high loss. The logarithmic penalty creates strong gradients for incorrect predictions, accelerating learning.

For language models, cross-entropy is computed at each token position: how much probability did the model assign to the token that actually appeared? The total loss is averaged across all positions. Minimizing cross-entropy during training encourages the model to assign high probability to correct tokens.

In information-theoretic terms, cross-entropy measures the expected number of bits needed to encode data from the true distribution using a code optimized for the predicted distribution. When the predictions perfectly match reality, cross-entropy equals entropy, the theoretical minimum encoding length.

Cross-entropy is differentiable and pairs naturally with softmax outputs, making it computationally tractable for gradient descent. Nearly all modern language model training uses cross-entropy loss, often called 'language modeling loss' or 'next-token prediction loss' in that context.
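The definitions above can be sketched in a few lines of plain Python. This is an illustrative toy, not a production implementation (real frameworks compute the loss directly from logits for numerical stability); the function names here are our own:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log of the probability assigned to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[target])

def sequence_loss(logits_per_position, targets):
    """Average per-token cross-entropy, as in language-model training."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# Confident and correct: low loss.
print(cross_entropy([5.0, 0.0, 0.0], target=0))
# Confident and wrong: very high loss.
print(cross_entropy([5.0, 0.0, 0.0], target=1))
```

With logits [5.0, 0.0, 0.0], the model puts almost all probability on class 0, so the loss is near zero when class 0 is correct and large when it is not, matching the asymmetry described above.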