
Attention Head

An attention head is one of multiple parallel attention mechanisms within a transformer layer, each independently learning different types of relationships between tokens in a sequence. In multi-head attention, the model doesn't compute attention just once; it computes it multiple times simultaneously through separate heads.

Each head has its own Query, Key, and Value projection matrices, allowing it to specialize in different patterns. One head might track syntactic dependencies like subject-verb agreement. Another might learn semantic relationships like pronoun coreference. A third might focus on positional patterns or long-range dependencies. This diverse specialization emerges naturally through training, without being explicitly programmed.

After each head computes attention independently, the outputs of all heads are concatenated and projected through a linear layer. This aggregation lets the model combine multiple relationship types into a unified representation.

The number of attention heads scales with model architecture: GPT-2 uses 12 to 25 heads per layer depending on model size, while GPT-3 uses 96 heads per layer. More heads increase representational capacity but also computational cost.

Research has shown that attention heads are often interpretable: visualizing their attention weights reveals patterns that correspond to linguistic phenomena. Some heads become specialized for specific tasks while others remain general-purpose.
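The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not code from any particular library: the function name multi_head_attention, the single fused projection matrices, and all shapes are assumptions chosen for clarity. Each head gets its own slice of the projected Query, Key, and Value vectors, attends independently, and the head outputs are concatenated and passed through a final linear projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over a sequence x of shape (seq_len, d_model).

    Wq, Wk, Wv, Wo are (d_model, d_model) matrices; each head works on a
    d_model // n_heads slice of the projected vectors (a common convention,
    assumed here for simplicity).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split into heads: (n_heads, seq_len, d_head).
    def project_and_split(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = project_and_split(Wq)
    K = project_and_split(Wk)
    V = project_and_split(Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, seq, d_head)

    # Concatenate head outputs, then apply the final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage: 10 tokens, model width 64, 8 heads of width 8 each.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (10, 64): same shape as the input sequence
```

Note that the output has the same shape as the input, which is what allows transformer layers to be stacked; the per-head specialization lives entirely in the learned projection matrices.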