Causal masking is an attention mechanism constraint that prevents each position from attending to future positions, enforcing the left-to-right information flow essential for autoregressive language generation. In a sequence of N tokens, position i can attend only to positions 0 through i, never to positions i+1 through N-1. This is implemented by adding negative infinity to the attention scores for future positions before the softmax, which drives those attention weights to zero.

Without causal masking, a model generating text could 'cheat' by looking at the answer it's supposed to predict, making training trivially easy but the model useless for generation, where future tokens don't exist yet. During training on a sequence like 'The cat sat,' the model simultaneously predicts 'cat' given 'The,' predicts 'sat' given 'The cat,' and so on. Causal masking ensures each prediction uses only legitimate context.

The attention pattern forms a lower triangular matrix: full attention for the last position, progressively less for earlier positions. GPT and other decoder-only language models use causal masking throughout. BERT doesn't use causal masking because it's designed for understanding, not generation. Prefix-tuning and some multimodal models use prefix causal masking: full bidirectional attention within a prefix, then causal attention for generation.
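The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function and variable names are our own, and it shows only the masking step (building a lower-triangular boolean mask, setting disallowed scores to negative infinity, and applying softmax), plus the prefix-causal variant mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) = 0, so masked positions vanish.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_causal_mask(n, prefix_len):
    # Bidirectional attention within the first prefix_len positions,
    # ordinary causal attention for the rest (hypothetical helper).
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

def masked_attention_weights(scores, mask):
    # Add -inf to future positions before softmax, zeroing their weights.
    masked = np.where(mask, scores, -np.inf)
    return softmax(masked, axis=-1)

n = 4
scores = np.zeros((n, n))  # uniform scores make the pattern easy to read
weights = masked_attention_weights(scores, causal_mask(n))
print(np.round(weights, 2))  # lower-triangular: row i spreads weight over 0..i
```

With uniform scores, row i assigns weight 1/(i+1) to each visible position and exactly zero to every future position, which is the lower-triangular pattern described above.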