Softmax is a mathematical function that transforms a vector of arbitrary real numbers into a probability distribution: every output is positive and the outputs sum to 1. For each input value, softmax computes the exponential of that value divided by the sum of the exponentials of all values, which normalizes the outputs into valid probabilities.

Because the exponential grows quickly, softmax preserves relative ordering (larger inputs yield larger probabilities) while amplifying differences: a small gap between raw scores becomes a much larger gap between probabilities, so the highest-scoring option dominates. This sharpening is desirable for classification, where the model's most confident prediction should stand out clearly.

In language models, softmax converts the raw output scores (logits) into a probability distribution over the vocabulary, from which the next token is sampled or selected. A temperature parameter controls this behavior: the logits are divided by the temperature before softmax is applied. Temperature 1 gives standard softmax; temperature below 1 sharpens the distribution, making confident predictions more dominant; temperature above 1 flattens it, making all options more equally likely. As the temperature approaches 0, softmax approaches argmax and effectively always selects the highest-scoring option.

Softmax is differentiable everywhere, making it suitable for gradient-based training. Combined with cross-entropy loss, it forms the standard training objective for classification and language modeling.
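The behavior described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name, the example logits, and the max-subtraction trick (a standard guard against overflow in the exponentials) are choices made here, not part of the definition itself:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide by temperature first: T < 1 sharpens, T > 1 flattens.
    z = np.asarray(logits, dtype=float) / temperature
    # Subtract the max so the largest exponent is 0 (numerical stability).
    z = z - z.max()
    exp_z = np.exp(z)
    # Normalize so the outputs are positive and sum to 1.
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores
print(softmax(logits))                    # standard softmax (T = 1)
print(softmax(logits, temperature=0.5))   # sharper: top option more dominant
print(softmax(logits, temperature=2.0))   # flatter: options more equal
```

Running this shows the properties from the entry: the outputs sum to 1, the ordering of the inputs is preserved, and lowering the temperature pushes more probability mass onto the highest-scoring option.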