BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that transformed natural language processing in 2018 by learning deep bidirectional representations through masked language modeling. Unlike GPT, which reads text left-to-right, BERT processes all tokens simultaneously and uses context from both directions when representing each word.

During pretraining, BERT randomly selects 15% of input tokens and learns to predict them from the surrounding context; most of the selected tokens are replaced with a [MASK] token, while a small fraction are swapped for a random token or left unchanged. This masked language modeling objective forces the model to develop rich bidirectional representations. BERT is also pretrained on next sentence prediction, learning to judge whether two sentences follow each other, which encourages document-level comprehension.
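The masking procedure can be sketched in a few lines. This is a toy illustration, not the original implementation: the vocabulary and split ratios follow the commonly cited 80/10/10 scheme (80% [MASK], 10% random token, 10% unchanged among the selected 15%).

```python
import random

MASK = "[MASK]"
VOCAB = ["dog", "tree", "run", "blue"]  # toy vocabulary (assumption)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Every selected position becomes a prediction target."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: token left unchanged but still predicted
    return out, labels

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
```

Predicting unchanged tokens too keeps the model from learning that only [MASK] positions matter, since [MASK] never appears at fine-tuning time.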

After pretraining on the BooksCorpus and English Wikipedia, BERT can be fine-tuned on specific tasks with relatively little labeled data. Fine-tuned BERT achieved state-of-the-art results at the time on question answering, sentiment analysis, named entity recognition, and text classification, often by large margins.
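The fine-tuning recipe for classification attaches a small task head on top of the encoder's sentence representation. The sketch below captures only that idea, under loud assumptions: a toy bag-of-words `encode` stands in for the pretrained encoder, and only the head is trained (fine-tuning proper also updates the encoder's weights).

```python
import math

def encode(sentence):
    """Stand-in for BERT's [CLS] sentence vector (assumption:
    a real setup would run the pretrained encoder here)."""
    feats = ["good", "great", "bad", "awful"]
    words = sentence.split()
    return [float(words.count(w)) for w in feats]

def train_head(data, epochs=200, lr=0.5):
    """Logistic-regression head over the frozen encoder,
    trained by plain gradient descent on log loss."""
    dim = len(encode(""))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, y in data:
            x = encode(text)
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, text):
    x = encode(text)
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

Because the heavy lifting is done during pretraining, the task-specific part that must be learned from labeled data stays small, which is why a few thousand examples often suffice.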

Bidirectional pretraining creates representations useful for understanding tasks, while GPT's left-to-right approach better suits generation tasks. BERT spawned a family of variants: RoBERTa with improved training, ALBERT with parameter sharing, DistilBERT with distillation for efficiency, and domain-specific versions like BioBERT and SciBERT.
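The bidirectional-versus-causal distinction comes down to the attention mask. A minimal sketch (toy representation, with 1 meaning "may attend" and 0 "blocked"):

```python
def attention_mask(n, causal):
    """n x n mask: entry [i][j] = 1 if position i may attend to
    position j. BERT uses the full (non-causal) mask; GPT uses
    the causal one so each token sees only earlier positions."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

bert_mask = attention_mask(4, causal=False)  # every token sees all tokens
gpt_mask = attention_mask(4, causal=True)    # token i sees only j <= i
```

The full mask is what lets BERT condition each representation on both left and right context, while the causal mask is what makes GPT-style next-token generation well-defined.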

Although large language models have superseded BERT for many applications, BERT-style models remain important for embedding, classification, and information extraction tasks where bidirectional understanding matters more than generation.

Interactive Visualizer

BERT: Bidirectional Language Understanding

Explore how BERT uses context from both directions and masked language modeling. The visualizer has three panels: Bidirectional Context Processing (click any word in a sample sentence to see context flow from both sides), a step-through of Masked Language Modeling with 15% masking, and an Attention Weights Matrix.