BERT and Masked Language Models
What
BERT (Bidirectional Encoder Representations from Transformers): a pretrained transformer encoder that represents each token using context from both the left and the right.
Pretraining
- Masked Language Modeling (MLM): randomly mask 15% of input tokens and train the model to predict them from the surrounding context
- “The [MASK] sat on the mat” → “cat”
- This forces the model to understand context deeply
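The masking step can be sketched in plain Python. In the original BERT recipe, of the 15% of tokens selected, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged; the toy vocabulary below is illustrative only:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # illustrative only

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: select ~mask_prob of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns (corrupted, labels); labels is the original token at predicted
    positions and None elsewhere (ignored in the loss)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))
            else:
                corrupted.append(tok)  # kept as-is, but still predicted
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged prevents the model from assuming that every `[MASK]`-free token is correct.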
Fine-tuning
Take pretrained BERT, add a small head, fine-tune on your task:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a fresh 2-class head
)

inputs = tokenizer("This movie is great!", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits has shape (1, num_labels)
```
Use cases
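To read the classifier's output, softmax converts the raw logits into class probabilities. A minimal sketch in plain Python; the label names are an assumption for a sentiment task (the head is untrained until you fine-tune):

```python
import math

def predict_label(logits, labels=("negative", "positive")):
    """Map raw classifier logits to (label, probability) via softmax + argmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

label, score = predict_label([-1.2, 3.4])  # label == "positive"
```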
- Text classification (sentiment, topic, intent)
- Named entity recognition
- Question answering
- Sentence similarity
- Any task that requires understanding text rather than generating it
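For sentence similarity in particular, a common pattern is to pool BERT's token vectors into a single sentence embedding and compare embeddings with cosine similarity. A minimal sketch of the metric itself; the vectors below stand in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
```

Scores near 1.0 mean semantically similar sentences; near 0.0, unrelated ones.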
Variants
| Model | Notes |
|---|---|
| BERT | Original, English |
| RoBERTa | Improved training recipe: no NSP, dynamic masking, more data |
| DistilBERT | 40% smaller, 60% faster, ~97% of BERT's performance; good fit for production |
| DeBERTa | Disentangled attention, often best |