A convolutional neural network (CNN) is a neural architecture designed for grid-structured data such as images. Unlike fully connected networks, where every input connects to every neuron, CNNs apply small learned filters that slide across the entire input to detect local patterns. Each filter learns to detect a specific pattern such as an edge, a texture, or a color gradient. Because the same filter scans the whole image, convolution is translation-equivariant: a feature that shifts in the input produces a correspondingly shifted activation, and pooling then makes detection approximately invariant to position. A cat in the corner and a cat in the center activate the same feature detectors.

A typical CNN alternates convolutional layers with pooling layers that reduce spatial dimensions, building a hierarchy of increasingly abstract features. Early layers detect edges and simple textures; middle layers combine these into parts such as eyes and ears; deep layers recognize complete objects. The final layers are typically fully connected, mapping the learned features to output classes.

Landmark architectures include AlexNet, which won the ImageNet 2012 competition and sparked the deep learning revolution; VGGNet, with its uniform stacks of small convolutions; ResNet, whose skip connections enable very deep networks; and EfficientNet, which optimizes the accuracy-efficiency tradeoff. Although Vision Transformers now match or exceed CNNs on many tasks, CNNs remain the standard approach for many computer vision applications and are favored for edge deployment because of their efficiency.
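The sliding-filter and pooling operations described above can be sketched in a few lines of NumPy. This is an illustrative toy, not a real framework API: `conv2d`, `max_pool2d`, and the hand-written vertical-edge filter are made-up names, and in an actual CNN the filter weights are learned by training rather than written by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (the 'convolution' used in CNNs):
    slide the kernel over the image, taking a dot product at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: shrinks spatial dims,
    keeping only the strongest activation in each window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A hand-written vertical-edge filter (in a real CNN these weights are learned).
edge = np.array([[1., 0., -1.],
                 [1., 0., -1.],
                 [1., 0., -1.]])

# Toy 6x6 image: bright left half, dark right half -> one vertical edge.
img = np.zeros((6, 6))
img[:, :3] = 1.0

fmap = conv2d(img, edge)    # 4x4 feature map: strong response along the edge
pooled = max_pool2d(fmap)   # 2x2 map after pooling

# Translation equivariance: move the edge one column right and the
# response in the feature map moves one column right as well.
img2 = np.zeros((6, 6))
img2[:, :4] = 1.0
fmap2 = conv2d(img2, edge)
```

Stacking several such convolution/pooling stages, then flattening into fully connected layers, gives the classic CNN hierarchy the entry describes.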