veda.ng
Back to Glossary

Computer Vision

Computer vision is the field of artificial intelligence that enables machines to interpret and understand visual information from images and video, extracting structured knowledge from pixel data the way humans extract meaning from sight. Core tasks span a hierarchy of understanding: image classification assigns labels to entire images; object detection locates and identifies multiple objects with bounding boxes; semantic segmentation assigns class labels to every pixel; instance segmentation distinguishes individual objects; pose estimation identifies body positions; and depth estimation reconstructs 3D structure from 2D images. Convolutional neural networks dominated computer vision for a decade, learning hierarchical features from edges to textures to parts to objects through their translation-invariant architecture. Vision Transformers now achieve comparable or better results by treating images as sequences of patches and applying attention mechanisms, demonstrating that image-specific inductive biases aren't strictly necessary given sufficient scale. Modern multimodal models like CLIP and GPT-4V connect vision and language, enabling zero-shot image understanding through natural language. Applications permeate modern technology: autonomous vehicles interpret road scenes; medical imaging detects tumors and analyzes scans; manufacturing inspection finds defects; security systems recognize faces and activities; augmented reality understands environments; and agriculture monitors crops from drones and satellites. Computer vision has progressed from research curiosity to essential infrastructure underlying countless products and services.