
Self-Attention

Self-attention is an attention mechanism in which a sequence attends to itself: each position gathers information from all other positions in the same sequence to compute a context-aware representation. The Query, Key, and Value vectors all derive from the same input sequence. In effect, each token asks "what in my sequence is relevant to understanding me?" and receives a weighted combination of information from all positions.

This captures dependencies regardless of distance: a word can attend to another word hundreds of positions away as easily as to its neighbor. Self-attention also replaced the sequential processing of RNNs with parallel computation, since all positions attend simultaneously. This enables efficient training on GPUs and better captures long-range dependencies.

In language models, self-attention learns syntactic relationships (subject-verb agreement), semantic relationships (pronoun references), and complex reasoning patterns. The mechanism is remarkably general: the same architecture works for language, images, proteins, and code.

Self-attention's complexity is quadratic in sequence length: computing attention between all pairs of N positions requires N² operations. This limits context length and has driven research into efficient attention variants such as sparse attention, linear attention, and sliding-window attention. Despite this limitation, self-attention's expressive power and parallelism made transformers the dominant architecture in modern AI.
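The computation described above can be sketched in a few lines of numpy. This is a minimal single-head illustration, not a production implementation: the weight matrices are random stand-ins for learned parameters, and the function names and dimensions are chosen for this example only. Note how Q, K, and V all come from the same input X, and how the N×N score matrix makes the quadratic cost explicit.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (N, d_model) token representations; Wq/Wk/Wv project X into
    queries, keys, and values -- all derived from the same sequence.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (N, N) pairwise scores: the N^2 term
    # softmax over positions: each row becomes a distribution over the sequence
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                  # weighted combination of all positions

# Toy example with random (untrained) projections
rng = np.random.default_rng(0)
N, d_model, d_k = 5, 8, 4
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): one context-aware vector per input position
```

Real transformers run many such heads in parallel and add learned positional information, but the core mechanism is exactly this pairwise score-then-average pattern.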