CLIP (Contrastive Language-Image Pre-training) is a multimodal model from OpenAI that learns to connect images and text in a shared embedding space through contrastive learning on 400 million image-text pairs scraped from the internet. The architecture consists of two separate encoders, typically a Vision Transformer for images and a transformer for text, trained to produce aligned embeddings: an image and its caption should map to nearby points, while mismatched image-text pairs should map far apart.

This alignment enables powerful zero-shot classification. To classify an image, encode it along with a text description of each candidate class ('a photo of a dog', 'a photo of a cat'), then pick the class whose text embedding is closest to the image embedding; no task-specific training is required. CLIP achieves competitive accuracy on many benchmarks without seeing a single labeled example from those datasets.

Beyond classification, CLIP enables semantic image search (finding images that match a text query), image generation guidance (OpenAI used CLIP to rerank DALL-E outputs, and CLIP guidance can steer diffusion models toward prompts), and broader multimodal understanding. The key insight is that natural language provides richer supervision than discrete labels: 'a golden retriever playing fetch in a sunny park' carries far more information than the label 'dog', and CLIP's web-scale training captures that richness. Limitations include biases inherited from web data, difficulty with fine-grained distinctions (e.g., telling apart similar car models or bird species), and weakness on systematic tasks such as counting objects.
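The zero-shot procedure described above can be sketched in a few lines of NumPy. This is a toy illustration, not real CLIP: the random vectors below stand in for encoder outputs (actual CLIP embeddings are produced by the trained image and text encoders, typically 512-dimensional), and the temperature of 100 approximates CLIP's learned logit scale.

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit sphere so dot products
    # equal cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim = 8  # toy dimension; real CLIP uses e.g. 512

# Stand-ins for the text encoder's embeddings of each class prompt.
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text_emb = normalize(rng.normal(size=(len(class_prompts), dim)))

# Stand-in for the image encoder's output: we fabricate an embedding
# near the 'cat' prompt so the toy example has a known answer.
image_emb = normalize(text_emb[1] + 0.1 * rng.normal(size=dim))

# Zero-shot classification: cosine similarity between the image
# embedding and each prompt embedding, softmax over scaled logits,
# then pick the closest prompt.
sims = text_emb @ image_emb                      # cosine similarities
logits = 100.0 * sims                            # CLIP-style temperature (~100)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = class_prompts[int(np.argmax(sims))]
print(pred)  # the image embedding was constructed near the 'cat' prompt
```

With real CLIP, the only change is where the embeddings come from: the model's `encode_image` and `encode_text` outputs replace the random vectors, and the same similarity-plus-argmax logic applies.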