
CLIP (Contrastive Language-Image Pre-training) is a multimodal model from OpenAI that learns to connect images and text in a shared embedding space through contrastive learning on 400 million image-text pairs scraped from the internet.

The architecture consists of separate image and text encoders (typically a Vision Transformer and a text transformer) trained to produce aligned embeddings: an image and its caption should map to nearby points, while unrelated image-text pairs should map far apart. This enables powerful zero-shot capabilities.
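As a sketch of what that alignment objective looks like in code, here is a minimal NumPy version of a CLIP-style symmetric contrastive (InfoNCE) loss. The function name and the fixed temperature are illustrative stand-ins; CLIP actually learns its temperature as a trainable parameter.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # correct pair sits on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Training drives the diagonal (matched pairs) toward high similarity relative to every other entry in the batch, in both directions at once.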

To classify an image, encode the image and encode text descriptions of each class ('a photo of a dog', 'a photo of a cat'), then choose the class whose text embedding is closest to the image embedding. No task-specific training required. CLIP achieves competitive accuracy on many benchmarks without seeing any labeled examples from those datasets.
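The zero-shot procedure reduces to a cosine-similarity argmax. A minimal sketch, with toy vectors standing in for real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is closest (cosine) to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                        # cosine similarity per class prompt
    return class_names[int(np.argmax(sims))]

# Toy demo: pretend these embeddings came from CLIP's text and image encoders
prompts = ["a photo of a dog", "a photo of a cat"]
text_embs = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.9, 0.0]])
image_emb = np.array([0.2, 0.8, 0.1])       # closest to the cat prompt
print(zero_shot_classify(image_emb, text_embs, prompts))  # → a photo of a cat
```

In practice you would obtain the embeddings from a pretrained CLIP model; the classification step itself is exactly this comparison.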

Beyond classification, CLIP enables semantic image search (find images matching a text query), guidance for image generation (the original DALL-E used CLIP to rerank candidate images, and DALL-E 2 generates directly from CLIP image embeddings), and broader multimodal understanding. Natural language provides richer supervision than discrete labels: 'a golden retriever playing fetch in a sunny park' contains far more information than the label 'dog'.
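Semantic image search is the same cosine machinery applied as ranking. A toy sketch, assuming precomputed embeddings; real systems embed the gallery with CLIP offline and serve queries through an approximate-nearest-neighbor index:

```python
import numpy as np

def search_images(query_emb, gallery_embs, top_k=2):
    """Rank gallery images by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:top_k]        # indices of best matches, best first

# Toy gallery of three "image" embeddings and a query leaning toward image 0
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])
print(search_images(np.array([1.0, 0.1]), gallery))  # → [0 2]
```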

CLIP's web-scale training captures this richness. Limitations remain: the model inherits biases from its web-scraped data, and it struggles with fine-grained distinctions (such as telling apart aircraft models or flower species) and with systematic tasks like counting objects.

Interactive Visualizer

CLIP: Contrastive Language-Image Pre-training

Interactive visualization of how CLIP learns to align images and text in a shared embedding space

Image Encoder (Vision Transformer)

🐱
Cat Photo
🐶
Dog Photo
🚗
Car Photo
🌳
Tree Photo

Shared Embedding Space

Text Encoder (Transformer)

"A cute cat sitting"
"A happy dog playing"
"A red sports car"
"A green oak tree"

Contrastive Learning Process

Each of the four images is compared against each of the four captions, giving a 4 × 4 similarity grid. Matched pairs (the diagonal) are positives; every other combination is a negative:

                "A cute cat sitting"   "A happy dog playing"   "A red sports car"   "A green oak tree"
🐱 Cat Photo     1.00 (positive)        0.80                    0.93                 0.89
🐶 Dog Photo     0.80                   1.00 (positive)         0.95                 0.38
🚗 Car Photo     0.92                   0.98                    1.00 (positive)      0.60
🌳 Tree Photo    0.82                   0.39                    0.61                 1.00 (positive)

CLIP learns by maximizing similarity for correct image-text pairs (the positives) while minimizing similarity for incorrect pairs (the negatives).
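Reading the visualization above as a 4 × 4 similarity matrix, a quick check confirms that every image already prefers its own caption, i.e. each row's argmax lands on the diagonal:

```python
import numpy as np

# Similarity grid from the visualization above
# (rows: cat, dog, car, tree images; columns: cat, dog, car, tree captions)
sims = np.array([
    [1.00, 0.80, 0.93, 0.89],
    [0.80, 1.00, 0.95, 0.38],
    [0.92, 0.98, 1.00, 0.60],
    [0.82, 0.39, 0.61, 1.00],
])

# Each image is matched to the caption with the highest similarity;
# training pushes the diagonal (positives) up and everything else down.
matches = sims.argmax(axis=1)
print(matches)  # → [0 1 2 3]: every image picks its own caption
```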