BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that transformed natural language processing in 2018 by learning deep bidirectional representations through masked language modeling. Unlike GPT, which reads text left to right, BERT attends to all tokens simultaneously, so the representation of each word draws on context from both directions.

During pretraining, BERT randomly masks 15% of input tokens and learns to predict them from the surrounding context; this masked language modeling objective forces the model to develop rich bidirectional understanding. A second objective, next sentence prediction, trains BERT to judge whether two sentences follow each other, encouraging document-level comprehension. After pretraining on the BooksCorpus and English Wikipedia, BERT can be fine-tuned for specific tasks with relatively little labeled data.

Fine-tuned BERT achieved state-of-the-art results at the time on question answering, sentiment analysis, named entity recognition, and text classification, often by large margins. The key insight is that bidirectional pretraining yields representations well suited to understanding tasks, while GPT's left-to-right approach better suits generation. BERT spawned a family of variants: RoBERTa with improved training, ALBERT with parameter sharing, DistilBERT distilled for efficiency, and domain-specific versions such as BioBERT and SciBERT. Although large language models have superseded BERT for many applications, BERT-style models remain important for embedding, classification, and information extraction tasks where bidirectional understanding matters more than generation.
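The masking procedure can be sketched in a few lines of Python. This is a simplified illustration, not BERT's implementation: `mask_tokens` is a hypothetical helper, and the constants assume `bert-base-uncased`'s vocabulary ([MASK] id 103, 30,522 tokens). It follows the refinement from the BERT paper: of the ~15% of positions selected for prediction, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged.

```python
import random

MASK_ID = 103        # [MASK] token id in bert-base-uncased (assumed here)
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size (assumed here)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for masked language modeling.

    labels holds the original token id at each selected position and
    -100 (a common "ignore" index in loss functions) everywhere else.
    """
    rng = rng or random.Random()
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:       # select ~15% of positions
            labels[i] = tok                # model must recover the original
            roll = rng.random()
            if roll < 0.8:                 # 80%: replace with [MASK]
                masked[i] = MASK_ID
            elif roll < 0.9:               # 10%: replace with a random token
                masked[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token, but still predict it
    return masked, labels
```

Keeping 10% of selected tokens unchanged matters because [MASK] never appears at fine-tuning time; the model must therefore learn useful representations for real tokens too, not just for the artificial mask symbol.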