Cross-attention is an attention mechanism in which one sequence attends to a different sequence, enabling information flow between distinct representations, such as a translated sentence attending to its source or an image caption attending to image features. The Query vectors come from one sequence while the Key and Value vectors come from another. This asymmetry distinguishes cross-attention from self-attention, where all three sets of vectors derive from the same sequence.

In encoder-decoder transformer architectures for machine translation, the decoder uses cross-attention to look at the encoded source sentence when generating each target word. The decoder's Query effectively asks "what in the source is relevant for generating this word?" and receives a weighted combination of encoder outputs in return.

Cross-attention is central to multimodal models. Diffusion-based image generators such as Stable Diffusion and DALL-E 2 use cross-attention to condition image generation on text prompts, and many vision-language models use cross-attention to inject visual features into a language model. (CLIP, by contrast, aligns text and image embeddings with a contrastive objective over two separate encoders rather than through cross-attention.) In retrieval-augmented generation, cross-attention lets a language model attend to retrieved documents while generating a response.

The mechanism creates a flexible bridge between any two representation spaces, enabling neural networks to combine information from different modalities, sources, or processing stages. Its cost scales with the product of the two sequence lengths (roughly O(n × m) attention scores for sequences of length n and m), which can be expensive when one sequence is very long.
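The asymmetry described above can be made concrete with a minimal single-head sketch in NumPy: the queries are projected from the decoder states, while the keys and values are projected from the encoder states. The shapes, variable names, and random inputs here are illustrative assumptions, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head cross-attention: Queries from one sequence, Keys/Values from another."""
    Q = decoder_states @ Wq            # (tgt_len, d_k), from the decoder
    K = encoder_states @ Wk            # (src_len, d_k), from the encoder
    V = encoder_states @ Wv            # (src_len, d_v), from the encoder
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (tgt_len, src_len): one score per source token
    weights = softmax(scores, axis=-1) # each target position's distribution over the source
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
src = rng.standard_normal((5, d_model))  # encoded source sentence, 5 tokens
tgt = rng.standard_normal((3, d_model))  # decoder states, 3 tokens
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out, weights = cross_attention(tgt, src, Wq, Wk, Wv)
print(out.shape)      # (3, 8): one output vector per target token
print(weights.shape)  # (3, 5): the O(n × m) score matrix mentioned above
```

Note that the attention-weight matrix has shape `(tgt_len, src_len)`, which is exactly the product-of-lengths cost the entry describes: each row is a probability distribution over source tokens for one target position.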