The attention mechanism is the core innovation inside transformer models that allows them to weigh the importance of different parts of an input sequence when generating each output token. Before attention, recurrent neural networks processed sequences step by step, compressing everything seen so far into a fixed-size hidden state and losing context over long distances. Attention solves this by letting every token directly attend to every other token, regardless of distance.

The mechanism computes three vectors for each token: Query, Key, and Value. The Query asks 'what am I looking for?' The Key says 'what do I contain?' The Value carries the content to be passed along. The dot products between a token's Query and every Key are scaled and passed through a softmax to produce attention weights, which are then used to form a weighted sum of the Value vectors as that token's output.

Self-attention allows a model to understand context and relationships. In the sentence 'The bank by the river was muddy,' attention connects 'bank' with 'river' and 'muddy,' disambiguating which meaning applies. Multi-head attention runs several attention operations in parallel, each head learning to capture a different type of relationship, such as syntactic, semantic, or positional. The result is rich contextual understanding that makes LLMs capable of detailed reasoning across long contexts.
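The Query/Key/Value computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention, not a production implementation: the names `attention`, `X`, `Wq`, `Wk`, and `Wv` are illustrative, and real transformers add learned biases, masking, and multiple heads on top of this core.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Dot product of each Query with every Key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into attention weights (each row sums to 1).
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted sum of the Value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 4
X = rng.standard_normal((3, d_model))          # 3 tokens, d_model features each
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Project the same tokens into Query, Key, and Value spaces (self-attention).
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```

Here `out` has the same shape as the input (one contextualized vector per token), and each row of `weights` shows how much that token attended to every token in the sequence. Multi-head attention would repeat this with separate, smaller projection matrices per head and concatenate the results.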