veda.ng

Transformer Architecture

The Transformer is a neural-network architecture introduced in 2017 that processes sequences of data without relying on recurrent connections. It replaced step-by-step recurrent processing with attention mechanisms that let every element of the input attend to every other element simultaneously. The original model pairs an encoder stack, which converts raw inputs into rich representations, with a decoder stack that generates target sequences from those representations.

The key innovation is self-attention. Instead of processing tokens one at a time and passing a hidden state forward (which creates a bottleneck), the Transformer lets each token directly examine all other tokens in parallel. This makes it far better at capturing long-range dependencies in text, and training is much faster because the operations parallelize across GPUs.

Transformers are the backbone of virtually all modern language AI. They power translation services, chat assistants, search engines that rank results by contextual meaning, and code-generation tools. The same architecture also works for images (replacing convolutional layers), protein structure prediction, and video understanding. Companies adopt Transformers because a single pre-trained model can be fine-tuned for many downstream tasks, saving enormous compute compared to training separate architectures from scratch.
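The self-attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention for a single head: the projection matrices `w_q`, `w_k`, and `w_v` are random placeholders standing in for trained weights, and the shapes are chosen only for demonstration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors.

    x:            (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices (random here, trained in practice)
    """
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers for matching
    v = x @ w_v  # values: the content that gets mixed together
    d_k = q.shape[-1]
    # Every token scores its affinity with every other token at once --
    # this (seq_len, seq_len) matrix is what removes the recurrent bottleneck.
    scores = (q @ k.T) / np.sqrt(d_k)
    # Softmax over each row turns scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all tokens' values.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # one contextualized vector per input token: (4, 8)
```

Note that the score matrix is computed for all token pairs in one matrix multiply, which is why the computation parallelizes so well on GPUs; a full Transformer repeats this across multiple heads and layers, with feed-forward sublayers in between.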