
Attention Head

An attention head is one of multiple parallel attention mechanisms within a transformer layer, each independently learning different types of relationships between tokens in a sequence. In multi-head attention, the model doesn't compute attention just once; it computes it multiple times simultaneously through separate heads.

Each head has its own Query, Key, and Value projection matrices, allowing it to specialize in different patterns. One head might track syntactic dependencies like subject-verb agreement. Another might learn semantic relationships like pronoun coreference. A third might focus on positional patterns or long-range dependencies. This diverse specialization emerges naturally through training, without being explicitly programmed.

After each head computes attention independently, the outputs of all heads are concatenated and projected through a linear layer. This aggregation lets the model combine multiple relationship types into a unified representation.

The number of attention heads scales with model architecture: GPT-2 uses 12 to 25 heads per layer depending on model size, while GPT-3 uses 96 heads per layer. More heads increase representational capacity but also computational cost.

Research has shown that attention heads are often interpretable: visualizing their attention weights reveals patterns that correspond to linguistic phenomena. Some heads become specialized for specific tasks while others remain general-purpose.
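The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not code from any particular library: the function name multi_head_attention, the single fused projection matrices, and all shapes are assumptions chosen for clarity. Each head gets its own slice of the projected Query, Key, and Value vectors, attends independently, and the head outputs are concatenated and passed through a final linear projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over a sequence x of shape (seq_len, d_model).

    Wq, Wk, Wv, Wo are (d_model, d_model) matrices; each head works on a
    d_model // n_heads slice of the projected vectors (a common convention,
    assumed here for simplicity).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split into heads: (n_heads, seq_len, d_head).
    def project_and_split(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = project_and_split(Wq)
    K = project_and_split(Wk)
    V = project_and_split(Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, seq, d_head)

    # Concatenate head outputs, then apply the final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage: 10 tokens, model width 64, 8 heads of width 8 each.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (10, 64): same shape as the input sequence
```

Note that the output has the same shape as the input, which is what allows transformer layers to be stacked; the per-head specialization lives entirely in the learned projection matrices.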