BERT - Pre-training of Deep Bidirectional Transformers

Devlin et al. (2018)

Why It Matters

Introduced masked language modeling for bidirectional pre-training. Achieved state-of-the-art results on 11 NLP benchmarks simultaneously. Established the pre-train/fine-tune paradigm.

Key Ideas

  1. Pre-train a deep bidirectional Transformer encoder on large unlabeled text, then fine-tune the same model on many downstream NLP tasks.
  2. Use masked language modeling so the model learns context from both left and right instead of only predicting the next token autoregressively.
  3. Add task-specific heads for classification, question answering, and sentence-pair tasks, showing one backbone can adapt to many benchmarks.
  4. Next Sentence Prediction helped the original paper’s setup, though later work showed MLM was the more durable contribution.
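The masking recipe behind idea 2 can be sketched in a few lines. This is a minimal, framework-free illustration of the paper's 15% / 80-10-10 scheme (15% of positions selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged); the function name and toy vocabulary are my own, not from the paper.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat"]  # toy vocabulary

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption: select ~mask_prob of positions as prediction
    targets; 80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted tokens, {position: original token} targets)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                      # model must recover the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% left unchanged, so the model can't assume
            # every unmasked token is correct
    return corrupted, targets
```

Leaving 10% of selected tokens unchanged matters: at fine-tuning time there is no [MASK] token, so the encoder must produce useful representations for ordinary tokens too.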

Notes

  • BERT is encoder-only. That makes it strong for understanding tasks, while autoregressive decoder-only models later dominated generation.
  • The paper established the modern pre-train/fine-tune workflow for NLP and drove the transition from task-specific architectures to foundation backbones.
  • Tokenization, special tokens ([CLS], [SEP]), and fine-tuning stability became practical concerns in almost every later Transformer NLP system.
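The special-token layout mentioned above is easy to get wrong in practice. Below is a minimal sketch of how BERT packs a sentence pair: [CLS], segment A, [SEP], segment B, [SEP], with segment (token-type) ids 0 for the first segment and 1 for the second. The function name is illustrative; real pipelines also handle WordPiece tokenization, truncation, and padding.

```python
def pack_pair(tokens_a, tokens_b):
    """Assemble a BERT sentence-pair input.

    Layout: [CLS] A... [SEP] B... [SEP]
    Segment ids: 0 for [CLS], segment A, and the first [SEP];
                 1 for segment B and the final [SEP].
    The [CLS] position's final hidden state feeds the classification head.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"] + list(tokens_b) + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

For single-sentence tasks the same layout applies with segment B omitted, which is why one backbone serves both single-sentence and sentence-pair benchmarks.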