
Attention Score

An attention score quantifies how much one position in a sequence should attend to another position when computing its output representation. In the transformer attention mechanism, each position computes a Query vector asking 'what am I looking for?', and every position computes a Key vector answering 'what do I contain?' and a Value vector carrying 'what is my content?'

The attention score between positions i and j is the dot product of position i's Query with position j's Key, measuring their compatibility; higher scores mean stronger relevance. The scores are scaled by the square root of the key dimension to prevent the dot products from growing too large, which would push the softmax toward hard attention on single positions. After computing all pairwise scaled scores, softmax normalization converts them into weights that sum to 1. Each position's output is then a weighted sum of all Value vectors, weighted by these attention weights, so positions with high attention scores contribute more to the output.

Attention scores are also interpretable: visualizing them reveals which input positions the model considers when processing each output position, which is valuable for debugging and understanding model behavior. In 'The cat sat on the mat,' the word 'sat' might attend strongly to 'cat' (its subject) and 'mat' (its location).
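A minimal sketch of this computation in NumPy, assuming single-head attention with toy random Q, K, and V matrices (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Raw attention scores: dot product of each Query with every Key,
    # scaled by sqrt(d_k) so magnitudes stay in a softmax-friendly range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis turns each query's scores into
    # weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of the Value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (6, 8): one output vector per position
print(w.sum(axis=-1)) # each row of weights sums to ~1
```

The `weights` matrix is what attention visualizations plot: entry (i, j) is how strongly position i attends to position j.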