Cross-attention is an attention mechanism in which one sequence attends to a different sequence, enabling information flow between distinct representations, such as a translated sentence attending to its source or an image caption attending to image features. The Query vectors come from one sequence while the Key and Value vectors come from another. This asymmetry distinguishes cross-attention from self-attention, where all three sets of vectors derive from the same sequence.

In encoder-decoder transformer architectures for machine translation, the decoder uses cross-attention to look at the encoded source sentence when generating each target word. The decoder's Query effectively asks "what in the source is relevant for generating this word?" and receives a weighted combination of encoder outputs in return.

Cross-attention is central to multimodal models. Diffusion-based image generators such as Stable Diffusion and DALL-E 2 use cross-attention to condition image generation on text prompts, and many vision-language models use cross-attention to inject visual features into a language model. (CLIP, by contrast, aligns text and image embeddings with a contrastive objective over two separate encoders rather than through cross-attention.) In retrieval-augmented generation, cross-attention lets a language model attend to retrieved documents while generating a response.

The mechanism creates a flexible bridge between any two representation spaces, enabling neural networks to combine information from different modalities, sources, or processing stages. Its cost scales with the product of the two sequence lengths (roughly O(n × m) attention scores for sequences of length n and m), which can be expensive when one sequence is very long.
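The asymmetry described above can be made concrete with a minimal single-head sketch in NumPy: the queries are projected from the decoder states, while the keys and values are projected from the encoder states. The shapes, variable names, and random inputs here are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head cross-attention: Queries from one sequence, Keys/Values from another."""
    Q = decoder_states @ Wq            # (tgt_len, d_k), from the decoder
    K = encoder_states @ Wk            # (src_len, d_k), from the encoder
    V = encoder_states @ Wv            # (src_len, d_v), from the encoder
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (tgt_len, src_len): one score per source token
    weights = softmax(scores, axis=-1) # each target position's distribution over the source
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
src = rng.standard_normal((5, d_model))  # encoded source sentence, 5 tokens
tgt = rng.standard_normal((3, d_model))  # decoder states, 3 tokens
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out, weights = cross_attention(tgt, src, Wq, Wk, Wv)
print(out.shape)      # (3, 8): one output vector per target token
print(weights.shape)  # (3, 5): the O(n × m) score matrix mentioned above
```

Note that the attention-weight matrix has shape `(tgt_len, src_len)`, which is exactly the product-of-lengths cost the entry describes: each row is a probability distribution over source tokens for one target position.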