
Attention Score

An attention score quantifies how much one position in a sequence should attend to another position when computing its output representation. In the transformer attention mechanism, each position computes a Query vector asking 'what am I looking for?', and every position computes a Key vector answering 'what do I contain?' and a Value vector carrying 'what is my content?'

The attention score between positions i and j is the dot product of position i's Query with position j's Key, measuring their compatibility; higher scores mean stronger relevance. The scores are scaled by the square root of the key dimension to prevent the dot products from growing too large, which would push the softmax toward hard attention on single positions. After computing all pairwise scaled scores, softmax normalization converts them into weights that sum to 1. Each position's output is then a weighted sum of all Value vectors, weighted by these attention weights, so positions with high attention scores contribute more to the output.

Attention scores are also interpretable: visualizing them reveals which input positions the model considers when processing each output position, which is valuable for debugging and understanding model behavior. In 'The cat sat on the mat,' the word 'sat' might attend strongly to 'cat' (its subject) and 'mat' (its location).
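A minimal sketch of this computation in NumPy, assuming single-head attention with toy random Q, K, and V matrices (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Raw attention scores: dot product of each Query with every Key,
    # scaled by sqrt(d_k) so magnitudes stay in a softmax-friendly range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis turns each query's scores into
    # weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of the Value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (6, 8): one output vector per position
print(w.sum(axis=-1)) # each row of weights sums to ~1
```

The `weights` matrix is what attention visualizations plot: entry (i, j) is how strongly position i attends to position j.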