Softmax is a mathematical function that transforms a vector of arbitrary real numbers into a probability distribution: every output is positive and the outputs sum to 1. For each input value, softmax computes the exponential of that value divided by the sum of the exponentials of all values, which normalizes the outputs into valid probabilities.

Because the exponential grows quickly, softmax preserves relative ordering (larger inputs yield larger probabilities) while amplifying differences: a small gap between raw scores becomes a much larger gap between probabilities, so the highest-scoring option dominates. This sharpening is desirable for classification, where the model's most confident prediction should stand out clearly.

In language models, softmax converts the raw output scores (logits) into a probability distribution over the vocabulary, from which the next token is sampled or selected. A temperature parameter controls this behavior: the logits are divided by the temperature before softmax is applied. Temperature 1 gives standard softmax; temperature below 1 sharpens the distribution, making confident predictions more dominant; temperature above 1 flattens it, making all options more equally likely. As the temperature approaches 0, softmax approaches argmax and effectively always selects the highest-scoring option.

Softmax is differentiable everywhere, making it suitable for gradient-based training. Combined with cross-entropy loss, it forms the standard training objective for classification and language modeling.
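The behavior described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name, the example logits, and the max-subtraction trick (a standard guard against overflow in the exponentials) are choices made here, not part of the definition itself:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide by temperature first: T < 1 sharpens, T > 1 flattens.
    z = np.asarray(logits, dtype=float) / temperature
    # Subtract the max so the largest exponent is 0 (numerical stability).
    z = z - z.max()
    exp_z = np.exp(z)
    # Normalize so the outputs are positive and sum to 1.
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores
print(softmax(logits))                    # standard softmax (T = 1)
print(softmax(logits, temperature=0.5))   # sharper: top option more dominant
print(softmax(logits, temperature=2.0))   # flatter: options more equal
```

Running this shows the properties from the entry: the outputs sum to 1, the ordering of the inputs is preserved, and lowering the temperature pushes more probability mass onto the highest-scoring option.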