Perplexity is a metric measuring how well a language model predicts a test dataset, calculated as the exponentiated average negative log-likelihood per token. Intuitively, it represents the effective number of equally-likely choices the model faces at each position. A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 options.
A perplexity of 100 indicates much greater uncertainty. Lower perplexity indicates better predictive performance because the model assigns higher probability to the actual words that appear. Perplexity provides a standardized way to compare language models: train two models, evaluate perplexity on the same held-out test set, and the lower-perplexity model is typically better.
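To make the definition concrete, here is a minimal sketch (with made-up per-token probabilities) that computes perplexity as the exponential of the average negative log-likelihood per token, exp(-(1/N) · Σ log p(token_i)):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each token that actually appeared.
confident_model = [0.7, 0.6, 0.8, 0.5]    # puts high probability on what occurred
uncertain_model = [0.1, 0.2, 0.05, 0.1]   # spreads probability thinly

print(perplexity(confident_model))  # ~1.56 (low perplexity, better predictions)
print(perplexity(uncertain_model))  # 10.0  (as uncertain as choosing among 10 options)
```

The second model scores a perplexity of 10, matching the intuition above: it is, on average, as uncertain as a uniform choice among 10 options.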
Standard benchmarks like WikiText-103 and Penn Treebank provide consistent evaluation datasets. However, perplexity has limitations. It measures prediction of the test distribution, not necessarily usefulness for downstream tasks. A model with excellent Wikipedia perplexity might perform poorly on dialogue or code.
Perplexity is also sensitive to tokenization: models with different tokenizers split the same text into different numbers of tokens, so their per-token perplexity scores are not directly comparable. Perplexity is a necessary but not sufficient measure of model quality; downstream task performance, human evaluation, and safety testing provide complementary signals.
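One common workaround, sketched below with made-up numbers, is to renormalize by a tokenizer-independent unit such as words, characters, or bytes: the total negative log-likelihood each model assigns to the text is divided by a shared denominator, so the resulting scores can be compared across vocabularies.

```python
import math

def renormalized_perplexity(total_nll, num_units):
    """Exponentiate the total negative log-likelihood per chosen unit
    (words, characters, or bytes) instead of per tokenizer-specific token."""
    return math.exp(total_nll / num_units)

# Hypothetical numbers: both models assign the same total NLL to a 1,000-word
# passage, but their tokenizers split it into different numbers of tokens.
total_nll = 4500.0
tokens_a, tokens_b, num_words = 1300, 1800, 1000

print(renormalized_perplexity(total_nll, tokens_a))   # per-token PPL under tokenizer A
print(renormalized_perplexity(total_nll, tokens_b))   # per-token PPL under tokenizer B (not comparable to A)
print(renormalized_perplexity(total_nll, num_words))  # per-word PPL, comparable across tokenizers
```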
Interactive Visualizer

Watch how different language models predict the next word. Lower perplexity means better predictions and less uncertainty.

[Visualization: while predicting "The" at position 1, bars show the probability the model assigns to each candidate next word (correct 70.0%, alternative 20.0%, wrong 10.0%), along with the current perplexity.]
The good model shows higher confidence (taller green bars) in correct predictions, resulting in lower perplexity. The poor model is more uncertain across all choices, leading to higher perplexity.
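To put rough numbers on that comparison (the 70% figure comes from the bars above; the 34% for the poor model is an assumption for illustration), a model that puts 70% of its probability on the correct word at every position has perplexity 1/0.7 ≈ 1.43, while one that puts only about a third there scores roughly 2.94:

```python
import math

# Probability each model places on the word that actually appears, per position.
# 70% matches the bars shown above; 34% is an assumed, roughly uniform poor model.
good_probs = [0.70, 0.70, 0.70]
poor_probs = [0.34, 0.34, 0.34]

def perplexity(probs):
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(round(perplexity(good_probs), 2))  # 1.43 -> confident, low perplexity
print(round(perplexity(poor_probs), 2))  # 2.94 -> uncertain, high perplexity
```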