Positional encoding is a technique for injecting sequence position information into transformer models, which otherwise process all tokens in parallel with no inherent notion of order. Without positional information, a transformer would treat 'the dog bit the man' identically to 'the man bit the dog': it would see the same set of tokens with no understanding of their arrangement.

The original transformer paper introduced sinusoidal positional encodings: sine and cosine functions at different frequencies that give each position a unique signature the model can distinguish. Each position gets a distinct encoding pattern, and the model learns to interpret these patterns as positional information.

Learned positional embeddings, an alternative approach, let the model discover its own position representations during training. These work well but require fixing a maximum sequence length in advance.

Relative positional encodings such as RoPE (Rotary Position Embedding) encode distances between tokens rather than absolute positions, helping models generalize to sequences longer than those seen during training. ALiBi (Attention with Linear Biases) instead adds linear penalties for distant tokens directly to attention scores, so far-away tokens contribute less.

The choice of positional encoding significantly affects a model's ability to handle long contexts and generalize to new sequence lengths, making it an active research area as context windows expand.
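The sinusoidal scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name is illustrative. It follows the original formulation: even dimensions use sine, odd dimensions use cosine, with wavelengths forming a geometric progression from 2π up to 10000·2π.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]     # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Each row of `pe` is simply added to the corresponding token embedding before the first attention layer; because no parameters are learned, the same function extends to any position index.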
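ALiBi's distance penalty can likewise be sketched directly, assuming the geometric per-head slope schedule from the ALiBi paper (slope 2^(-8i/n) for head i of n); the function name is illustrative. Each head gets a bias matrix proportional to the (negative) distance between query and key positions, which is added to the raw attention scores.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Per-head slopes: a geometric sequence, steeper heads penalize
    # distance more sharply (assumes the paper's 2^(-8i/n) schedule).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    # bias[i, j] = j - i: zero on the diagonal, increasingly negative
    # for keys further in the past (the upper triangle is removed by
    # the causal mask in practice).
    distances = pos[np.newaxis, :] - pos[:, np.newaxis]
    return slopes[:, None, None] * distances[None, :, :]  # (num_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # (4, 8, 8)
```

Because the penalty depends only on relative distance, no positional embedding is added to the tokens at all, which is what lets ALiBi-trained models run on longer sequences than they saw in training.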