
Vision Transformer

The Vision Transformer (ViT) applies the transformer architecture to images by splitting each image into fixed-size patches and treating each patch as a token, demonstrating that the attention mechanism developed for language can match or exceed CNNs on computer vision tasks. The process: divide an image into non-overlapping patches (typically 16x16 pixels), flatten each patch into a vector, project it to the model dimension, add learnable positional embeddings, and process the resulting sequence of patch embeddings through standard transformer encoder layers. A learnable classification token aggregates information across patches for the final prediction.

ViT challenged the assumption that image-specific inductive biases (translation equivariance, locality) are necessary for visual understanding. The attention mechanism has no built-in notion of spatial locality: every patch attends to every other patch regardless of distance. Yet ViT matches CNN performance when trained on sufficient data, and exceeds it when pretrained on very large datasets. The original ViT paper showed that scale matters: smaller ViTs underperform CNNs, but larger ViTs trained on more data dominate, mirroring the scaling patterns of language models.

ViT enabled unified architectures across vision and language, leading to multimodal models such as CLIP and GPT-4V. Variants include DeiT (improved training efficiency), Swin Transformer (hierarchical representations), and BEiT (masked image modeling).
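The patch-to-token pipeline described above can be sketched in NumPy. This is a minimal illustration: the 224x224 image size, 16-pixel patches, the model width `d_model = 256`, and the random projection weights are assumptions for the example, not trained ViT parameters.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and flatten each."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C) -> (N, patch*patch*C)
    x = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
d_model = 256                                  # assumed model width for illustration
img = rng.standard_normal((224, 224, 3))       # stand-in for a real image

tokens = patchify(img)                         # (196, 768): 14x14 patches, 16*16*3 values each
w_proj = rng.standard_normal((tokens.shape[1], d_model)) * 0.02  # learned in a real model
cls_token = np.zeros((1, d_model))             # learnable classification token (here fixed)
pos_embed = rng.standard_normal((tokens.shape[0] + 1, d_model)) * 0.02  # learnable positions

# Prepend the class token, then add positional embeddings: this (197, 256)
# sequence is what the standard transformer encoder layers consume.
x = np.concatenate([cls_token, tokens @ w_proj], axis=0) + pos_embed
print(x.shape)  # (197, 256)
```

In a full model, `x[0]` (the class token's final representation after the encoder) feeds the classification head.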