BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that transformed natural language processing in 2018 by learning deep bidirectional representations through masked language modeling. Unlike GPT, which reads text left-to-right, BERT processes all tokens simultaneously and uses context from both directions when representing each word.

During pretraining, BERT randomly selects 15% of input tokens and learns to predict them from the surrounding context; most of the selected tokens are replaced with a [MASK] token, while a small fraction are swapped for a random token or left unchanged. This masked language modeling objective forces the model to develop rich bidirectional representations. BERT is also pretrained on next sentence prediction, learning to judge whether two sentences follow each other, which encourages document-level comprehension.
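The masking procedure can be sketched in a few lines. This is a toy illustration, not the original implementation: the vocabulary and split ratios follow the commonly cited 80/10/10 scheme (80% [MASK], 10% random token, 10% unchanged among the selected 15%).

```python
import random

MASK = "[MASK]"
VOCAB = ["dog", "tree", "run", "blue"]  # toy vocabulary (assumption)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Every selected position becomes a prediction target."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: token left unchanged but still predicted
    return out, labels

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
```

Predicting unchanged tokens too keeps the model from learning that only [MASK] positions matter, since [MASK] never appears at fine-tuning time.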

After pretraining on the BooksCorpus and English Wikipedia, BERT can be fine-tuned on specific tasks with relatively little labeled data. Fine-tuned BERT achieved state-of-the-art results at the time on question answering, sentiment analysis, named entity recognition, and text classification, often by large margins.
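The fine-tuning recipe for classification attaches a small task head on top of the encoder's sentence representation. The sketch below captures only that idea, under loud assumptions: a toy bag-of-words `encode` stands in for the pretrained encoder, and only the head is trained (fine-tuning proper also updates the encoder's weights).

```python
import math

def encode(sentence):
    """Stand-in for BERT's [CLS] sentence vector (assumption:
    a real setup would run the pretrained encoder here)."""
    feats = ["good", "great", "bad", "awful"]
    words = sentence.split()
    return [float(words.count(w)) for w in feats]

def train_head(data, epochs=200, lr=0.5):
    """Logistic-regression head over the frozen encoder,
    trained by plain gradient descent on log loss."""
    dim = len(encode(""))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, y in data:
            x = encode(text)
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, text):
    x = encode(text)
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

Because the heavy lifting is done during pretraining, the task-specific part that must be learned from labeled data stays small, which is why a few thousand examples often suffice.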

Bidirectional pretraining creates representations useful for understanding tasks, while GPT's left-to-right approach better suits generation tasks. BERT spawned a family of variants: RoBERTa with improved training, ALBERT with parameter sharing, DistilBERT with distillation for efficiency, and domain-specific versions like BioBERT and SciBERT.
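The bidirectional-versus-causal distinction comes down to the attention mask. A minimal sketch (toy representation, with 1 meaning "may attend" and 0 "blocked"):

```python
def attention_mask(n, causal):
    """n x n mask: entry [i][j] = 1 if position i may attend to
    position j. BERT uses the full (non-causal) mask; GPT uses
    the causal one so each token sees only earlier positions."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

bert_mask = attention_mask(4, causal=False)  # every token sees all tokens
gpt_mask = attention_mask(4, causal=True)    # token i sees only j <= i
```

The full mask is what lets BERT condition each representation on both left and right context, while the causal mask is what makes GPT-style next-token generation well-defined.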

Although large language models have superseded BERT for many applications, BERT-style models remain important for embedding, classification, and information extraction tasks where bidirectional understanding matters more than generation.

Interactive Visualizer

BERT: Bidirectional Language Understanding

Explore how BERT uses context from both directions and masked language modeling. The visualizer has three panels: Bidirectional Context Processing (click any word in a sample sentence to see context flow from both sides), a step-through of Masked Language Modeling with 15% masking, and an Attention Weights Matrix.