CLIP (Contrastive Language-Image Pre-training) is a multimodal model from OpenAI that learns to connect images and text in a shared embedding space through contrastive learning on 400 million image-text pairs scraped from the internet. The architecture consists of two separate encoders, typically a Vision Transformer for images and a transformer for text, trained to produce aligned embeddings: an image and its caption should map to nearby points, while mismatched image-text pairs should map far apart.

This alignment enables powerful zero-shot classification. To classify an image, encode it along with a text description of each candidate class ('a photo of a dog', 'a photo of a cat'), then pick the class whose text embedding is closest to the image embedding; no task-specific training is required. CLIP achieves competitive accuracy on many benchmarks without seeing a single labeled example from those datasets.

Beyond classification, CLIP enables semantic image search (finding images that match a text query), image generation guidance (OpenAI used CLIP to rerank DALL-E outputs, and CLIP guidance can steer diffusion models toward prompts), and broader multimodal understanding. The key insight is that natural language provides richer supervision than discrete labels: 'a golden retriever playing fetch in a sunny park' carries far more information than the label 'dog', and CLIP's web-scale training captures that richness. Limitations include biases inherited from web data, difficulty with fine-grained distinctions (e.g., telling apart similar car models or bird species), and weakness on systematic tasks such as counting objects.
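The zero-shot procedure described above can be sketched in a few lines of NumPy. This is a toy illustration, not real CLIP: the random vectors below stand in for encoder outputs (actual CLIP embeddings are produced by the trained image and text encoders, typically 512-dimensional), and the temperature of 100 approximates CLIP's learned logit scale.

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit sphere so dot products
    # equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim = 8  # toy dimension; real CLIP uses e.g. 512

# Stand-ins for the text encoder's embeddings of each class prompt.
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text_emb = normalize(rng.normal(size=(len(class_prompts), dim)))

# Stand-in for the image encoder's output: we fabricate an embedding
# near the 'cat' prompt so the toy example has a known answer.
image_emb = normalize(text_emb[1] + 0.1 * rng.normal(size=dim))

# Zero-shot classification: cosine similarity between the image
# embedding and each prompt embedding, softmax over scaled logits,
# then pick the closest prompt.
sims = text_emb @ image_emb                      # cosine similarities
logits = 100.0 * sims                            # CLIP-style temperature (~100)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = class_prompts[int(np.argmax(sims))]
print(pred)  # the image embedding was constructed near the 'cat' prompt
```

With real CLIP, the only change is where the embeddings come from: the model's `encode_image` and `encode_text` outputs replace the random vectors, and the same similarity-plus-argmax logic applies.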