Top-k sampling is a text generation strategy that restricts the model's next-token choice to the k most probable tokens, preventing the selection of unlikely tokens that could derail coherent generation. After the model computes logits for all vocabulary tokens, top-k keeps only the k highest-scoring tokens, sets all others to negative infinity, and applies softmax to get a probability distribution over the remaining k tokens. The next token is sampled from this truncated distribution.

The hyperparameter k controls the tradeoff between diversity and quality: k=1 is greedy decoding (always pick the most likely token; deterministic but prone to repetition), k=50 allows moderate diversity, and k=1000 permits significant creativity but risks incoherent output.

Top-k's main limitation is that k is fixed regardless of the shape of the distribution. When the model is confident (one token holds 95% of the probability), k=50 includes 49 near-zero-probability tokens that only add noise. When the model is uncertain (many tokens have similar probability), k=50 might exclude plausible options. Top-p (nucleus) sampling addresses this by dynamically adjusting the cutoff based on cumulative probability. In practice, top-k and top-p are often combined, with top-k providing a hard cap on the candidate set and top-p applying adaptive truncation within that cap.
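The keep-k, mask-to-negative-infinity, then softmax procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not any particular library's implementation; the function name `top_k_sample` and the toy logits are made up for the example:

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a token index from the k highest-scoring logits.

    Illustrative sketch of top-k sampling, not a production decoder.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # Find the k-th largest logit; mask everything below it to -inf
    # so that softmax assigns those tokens zero probability.
    # (Ties at the threshold may keep slightly more than k tokens.)
    kth = np.sort(logits)[-k]
    masked = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the survivors; subtract the max for numerical stability.
    exp = np.exp(masked - masked.max())
    probs = exp / exp.sum()
    return rng.choice(len(logits), p=probs)

toy_logits = [2.0, 1.0, 0.5, -1.0]
top_k_sample(toy_logits, 1)  # k=1: always index 0 (greedy decoding)
top_k_sample(toy_logits, 2)  # k=2: index 0 or 1, never the low-probability tail
```

With k=1 this reduces to greedy decoding, as described above; combining it with top-p would mean applying a further cumulative-probability cutoff inside the surviving top-k set before sampling.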