Positional encoding is a technique for injecting sequence position information into transformer models, which otherwise process all tokens in parallel with no inherent notion of order. Without positional information, a transformer would treat 'the dog bit the man' identically to 'the man bit the dog': it would see the same set of tokens with no understanding of their arrangement.

The original transformer paper introduced sinusoidal positional encodings: sine and cosine functions at different frequencies that give each position a unique signature the model can distinguish. Each position gets a distinct encoding pattern, and the model learns to interpret these patterns as positional information.

Learned positional embeddings, an alternative approach, let the model discover its own position representations during training. These work well but require fixing a maximum sequence length in advance.

Relative positional encodings such as RoPE (Rotary Position Embedding) encode distances between tokens rather than absolute positions, helping models generalize to sequences longer than those seen during training. ALiBi (Attention with Linear Biases) instead adds linear penalties for distant tokens directly to attention scores, so far-away tokens contribute less.

The choice of positional encoding significantly affects a model's ability to handle long contexts and generalize to new sequence lengths, making it an active research area as context windows expand.
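The sinusoidal scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name is illustrative. It follows the original formulation: even dimensions use sine, odd dimensions use cosine, with wavelengths forming a geometric progression from 2π up to 10000·2π.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]     # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Each row of `pe` is simply added to the corresponding token embedding before the first attention layer; because no parameters are learned, the same function extends to any position index.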
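ALiBi's distance penalty can likewise be sketched directly, assuming the geometric per-head slope schedule from the ALiBi paper (slope 2^(-8i/n) for head i of n); the function name is illustrative. Each head gets a bias matrix proportional to the (negative) distance between query and key positions, which is added to the raw attention scores.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Per-head slopes: a geometric sequence, steeper heads penalize
    # distance more sharply (assumes the paper's 2^(-8i/n) schedule).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    # bias[i, j] = j - i: zero on the diagonal, increasingly negative
    # for keys further in the past (the upper triangle is removed by
    # the causal mask in practice).
    distances = pos[np.newaxis, :] - pos[:, np.newaxis]
    return slopes[:, None, None] * distances[None, :, :]  # (num_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # (4, 8, 8)
```

Because the penalty depends only on relative distance, no positional embedding is added to the tokens at all, which is what lets ALiBi-trained models run on longer sequences than they saw in training.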