The attention mechanism is the core innovation inside transformer models that allows them to weigh the importance of different parts of an input sequence when generating each output token. Before attention, recurrent neural networks processed sequences step by step, compressing everything seen so far into a fixed-size hidden state and losing context over long distances. Attention solves this by letting every token directly attend to every other token, regardless of distance.

The mechanism computes three vectors for each token: Query, Key, and Value. The Query asks 'what am I looking for?' The Key says 'what do I contain?' The Value carries the content to be passed along. The dot products between a token's Query and every Key are scaled and passed through a softmax to produce attention weights, which are then used to form a weighted sum of the Value vectors as that token's output.

Self-attention allows a model to understand context and relationships. In the sentence 'The bank by the river was muddy,' attention connects 'bank' with 'river' and 'muddy,' disambiguating which meaning applies. Multi-head attention runs several attention operations in parallel, each head learning to capture a different type of relationship, such as syntactic, semantic, or positional. The result is rich contextual understanding that makes LLMs capable of detailed reasoning across long contexts.
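The Query/Key/Value computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention, not a production implementation: the names `attention`, `X`, `Wq`, `Wk`, and `Wv` are illustrative, and real transformers add learned biases, masking, and multiple heads on top of this core.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # Dot product of each Query with every Key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into attention weights (each row sums to 1).
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted sum of the Value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 4
X = rng.standard_normal((3, d_model))          # 3 tokens, d_model features each
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Project the same tokens into Query, Key, and Value spaces (self-attention).
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
```

Here `out` has the same shape as the input (one contextualized vector per token), and each row of `weights` shows how much that token attended to every token in the sequence. Multi-head attention would repeat this with separate, smaller projection matrices per head and concatenate the results.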