
Self-Attention

Self-attention is an attention mechanism in which a sequence attends to itself: each position gathers information from all other positions in the same sequence to compute a context-aware representation. The Query, Key, and Value vectors all derive from the same input sequence. In effect, each token asks "what in my sequence is relevant to understanding me?" and receives a weighted combination of information from all positions.

This captures dependencies regardless of distance: a word can attend to another word hundreds of positions away as easily as to its neighbor. Self-attention also replaced the sequential processing of RNNs with parallel computation, since all positions attend simultaneously. This enables efficient training on GPUs and better captures long-range dependencies.

In language models, self-attention learns syntactic relationships (subject-verb agreement), semantic relationships (pronoun references), and complex reasoning patterns. The mechanism is remarkably general: the same architecture works for language, images, proteins, and code.

Self-attention's complexity is quadratic in sequence length: computing attention between all pairs of N positions requires N² operations. This limits context length and has driven research into efficient attention variants such as sparse attention, linear attention, and sliding-window attention. Despite this limitation, self-attention's expressive power and parallelism made transformers the dominant architecture in modern AI.
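The computation described above can be sketched in a few lines of numpy. This is a minimal single-head illustration, not a production implementation: the weight matrices are random stand-ins for learned parameters, and the function names and dimensions are chosen for this example only. Note how Q, K, and V all come from the same input X, and how the N×N score matrix makes the quadratic cost explicit.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (N, d_model) token representations; Wq/Wk/Wv project X into
    queries, keys, and values -- all derived from the same sequence.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (N, N) pairwise scores: the N^2 term
    # softmax over positions: each row becomes a distribution over the sequence
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                  # weighted combination of all positions

# Toy example with random (untrained) projections
rng = np.random.default_rng(0)
N, d_model, d_k = 5, 8, 4
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 4): one context-aware vector per input position
```

Real transformers run many such heads in parallel and add learned positional information, but the core mechanism is exactly this pairwise score-then-average pattern.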