Perplexity is a metric measuring how well a language model predicts a test dataset, calculated as the exponentiated average negative log-likelihood per token. Intuitively, it represents the effective number of equally likely choices the model faces at each position. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options; a perplexity of 100 indicates much greater uncertainty. Lower perplexity indicates better predictive performance, because the model assigns higher probability to the tokens that actually appear.

Perplexity provides a standardized way to compare language models: train two models, evaluate perplexity on the same held-out test set, and the lower-perplexity model is typically better. Standard benchmarks such as WikiText-103 and Penn Treebank provide consistent evaluation datasets.

However, perplexity has limitations. It measures how well the model predicts the test distribution, not necessarily how useful the model is for downstream tasks: a model with excellent Wikipedia perplexity might still perform poorly on dialogue or code. It is also sensitive to tokenization, since perplexity is computed per token; models that use different tokenizers segment the same text differently, so their scores are not directly comparable. Perplexity is therefore a necessary but not sufficient measure of model quality; downstream task performance, human evaluation, and safety testing provide complementary signals.
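The definition above can be sketched in a few lines. This minimal example assumes you already have the probability the model assigned to each actual token in the test text; it computes the exponentiated average negative log-likelihood and confirms the uniform-choice intuition:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token,
    given the model's probability for each token that actually appeared."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 1/10 to every actual token is exactly
# as uncertain as choosing uniformly among 10 options: perplexity 10.
print(perplexity([0.1] * 5))  # → 10.0 (up to floating-point rounding)

# Higher probabilities on the actual tokens yield lower perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

In practice, libraries work with log-probabilities directly (avoiding underflow on long sequences), but the quantity computed is the same.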