Causal masking is an attention mechanism constraint that prevents each position from attending to future positions, enforcing the left-to-right information flow essential for autoregressive language generation. In a sequence of N tokens, position i can attend only to positions 0 through i, never to positions i+1 through N-1. This is implemented by adding negative infinity to the attention scores for future positions before the softmax, which drives those attention weights to zero.

Without causal masking, a model generating text could 'cheat' by looking at the answer it's supposed to predict, making training trivially easy but the model useless for generation, where future tokens don't exist yet. During training on a sequence like 'The cat sat,' the model simultaneously predicts 'cat' given 'The,' predicts 'sat' given 'The cat,' and so on. Causal masking ensures each prediction uses only legitimate context.

The attention pattern forms a lower triangular matrix: full attention for the last position, progressively less for earlier positions. GPT and other decoder-only language models use causal masking throughout. BERT doesn't use causal masking because it's designed for understanding, not generation. Prefix-tuning and some multimodal models use prefix causal masking: full bidirectional attention within a prefix, then causal attention for generation.
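The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function and variable names are our own, and it shows only the masking step (building a lower-triangular boolean mask, setting disallowed scores to negative infinity, and applying softmax), plus the prefix-causal variant mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) = 0, so masked positions vanish.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_causal_mask(n, prefix_len):
    # Bidirectional attention within the first prefix_len positions,
    # ordinary causal attention for the rest (hypothetical helper).
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

def masked_attention_weights(scores, mask):
    # Add -inf to future positions before softmax, zeroing their weights.
    masked = np.where(mask, scores, -np.inf)
    return softmax(masked, axis=-1)

n = 4
scores = np.zeros((n, n))  # uniform scores make the pattern easy to read
weights = masked_attention_weights(scores, causal_mask(n))
print(np.round(weights, 2))  # lower-triangular: row i spreads weight over 0..i
```

With uniform scores, row i assigns weight 1/(i+1) to each visible position and exactly zero to every future position, which is the lower-triangular pattern described above.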