
Attention Score


An attention score quantifies how much one position in a sequence should attend to another position when computing its output representation. The attention score between positions i and j is the dot product of position i's Query vector with position j's Key vector, measuring their compatibility; higher scores mean stronger relevance.
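A minimal sketch of this score computation, using made-up query and key vectors (the numbers are illustrative, not from any trained model):

```python
import numpy as np

# Hypothetical Query vector for position i and Key vector for position j.
q_i = np.array([0.80, 0.20, 0.30])
k_j = np.array([0.70, 0.10, 0.40])

# The raw attention score is their dot product: higher = more compatible.
score = np.dot(q_i, k_j)
print(score)  # 0.8*0.7 + 0.2*0.1 + 0.3*0.4 = 0.70
```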

After computing all pairwise scores, softmax normalization converts them into weights that sum to 1. Each position's output is then a weighted sum of all Value vectors, weighted by these normalized attention weights. Positions with high attention scores contribute more to the output.
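The normalization and weighted-sum steps can be sketched as follows; the raw scores and toy value vectors here are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

# Illustrative raw scores of one query position against three positions.
scores = np.array([0.640, 0.410, 0.650])
weights = softmax(scores)  # non-negative, sums to 1; weights ~ [0.357, 0.283, 0.360]

# Toy Value vectors, one row per position; the output is their weighted sum.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
output = weights @ values
```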

Attention scores are interpretable: visualizing them reveals which input positions the model considers when processing each output position. This interpretability is valuable for debugging and understanding model behavior. In 'The cat sat on the mat,' the word 'sat' might attend strongly to 'cat' (its subject) and 'mat' (its location).

Before the softmax, attention scores are divided by the square root of the key dimension to prevent dot products from growing too large, which would push the softmax toward hard attention that concentrates nearly all weight on a single position. This scaling is the "scaled" in scaled dot-product attention.
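Putting the pieces together, a minimal scaled dot-product attention sketch in NumPy (random matrices stand in for real projections of an input sequence):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scale by sqrt of key dimension
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights              # weighted sum of Values

# Illustrative inputs: 4 positions, key/value dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```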

Interactive Visualizer

Attention Score Mechanism

Explore how each token queries and attends to other tokens in the sequence

Temperature: 1.0

Query vector for "The": [0.80, 0.20, 0.30]

Attention Scores & Weights (query = "The"):

Token   Score   Weight
The     0.640   0.357
cat     0.410   0.283
sat     0.650   0.360
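The scores and weights shown in the visualizer are related by a temperature-scaled softmax; a quick check (at temperature 1.0 this is the plain softmax):

```python
import numpy as np

scores = np.array([0.640, 0.410, 0.650])
temperature = 1.0  # higher T flattens the distribution, lower T sharpens it

weights = np.exp(scores / temperature)
weights /= weights.sum()
# weights ~ [0.357, 0.283, 0.360], matching the displayed values
```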
Click tokens to change the query position. Hover over attention scores to see the key vectors and dot product calculation. Adjust temperature to see how it affects the attention distribution.