BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al. (2018)
Why It Matters
Introduced masked language modeling for bidirectional pre-training, reached state of the art on 11 NLP benchmarks simultaneously, and established the pre-train/fine-tune paradigm.
Key Ideas
- Pre-train a deep bidirectional Transformer encoder on large unlabeled text, then fine-tune the same model on many downstream NLP tasks.
- Use masked language modeling so the model learns context from both left and right instead of only predicting the next token autoregressively.
- Add task-specific heads for classification, question answering, and sentence-pair tasks, showing one backbone can adapt to many benchmarks.
- Next Sentence Prediction helped the original paper’s setup, though later work showed MLM was the more durable contribution.
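The masked language modeling objective above can be sketched in a few lines. This is a minimal illustration of the corruption scheme the paper describes (select ~15% of positions; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged); the function name, toy vocabulary, and seed handling are my own for the example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=0):
    """Sketch of BERT-style MLM corruption.

    Selects ~mask_prob of positions as prediction targets; of those,
    80% become [MASK], 10% a random vocabulary token, 10% stay as-is.
    The model is trained to recover the original token at each selected
    position using context from both sides.
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat"]  # toy vocab for the sketch
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # original token is the prediction target
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (10% of selected positions)
    return corrupted, targets
```

Keeping 10% of selected tokens unchanged matters because [MASK] never appears at fine-tuning time; the model must still produce useful representations for unmasked inputs.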
Notes
- BERT is encoder-only. That makes it strong for understanding tasks, while autoregressive decoder-only models later dominated generation.
- The paper established the modern pre-train/fine-tune workflow for NLP and drove the transition from task-specific architectures to foundation backbones.
- Tokenization, special tokens ([CLS], [SEP]), and fine-tuning stability became practical concerns in almost every later Transformer NLP system.
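The special-token framing is mechanical enough to show directly. This is a sketch, not the library tokenizer: it illustrates how BERT packs a single sentence or a sentence pair into one sequence ([CLS] a [SEP] b [SEP]) with segment ids distinguishing the two sentences; the helper name is mine.

```python
def build_pair_input(tokens_a, tokens_b=None):
    """Frame one sentence or a sentence pair the way BERT expects:
    [CLS] tokens_a [SEP] tokens_b [SEP], with segment id 0 for the
    first sentence (including [CLS] and its [SEP]) and 1 for the second.
    The final hidden state at [CLS] feeds classification heads."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Example:
# build_pair_input(["how", "are", "you"], ["fine", "thanks"])
# → (['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]'],
#    [0, 0, 0, 0, 0, 1, 1, 1])
```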