BERT and Masked Language Models

What

BERT (Bidirectional Encoder Representations from Transformers): a pretrained transformer encoder that understands text by looking at context from both directions.

Pretraining

  • Masked Language Modeling (MLM): randomly mask 15% of the tokens and predict them from the surrounding context
  • “The [MASK] sat on the mat” → “cat”
  • Next Sentence Prediction (NSP): given two sentences, predict whether the second actually followed the first in the corpus
  • Filling in masked tokens using context from both sides forces the model to understand text deeply
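
The masking step has a detail worth knowing: of the 15% of tokens selected as prediction targets, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A toy, dependency-free sketch of that selection logic (the word-level tokens and vocabulary here are stand-ins, not BERT's real WordPiece vocabulary):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy version of BERT's masking: pick ~15% of positions as
    prediction targets; of those, 80% -> [MASK], 10% -> a random
    vocabulary token, 10% left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK          # usual case: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)  # sometimes: a random token
            # else: keep the original token, but still predict it
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked, targets)
```

The random-token and keep-unchanged cases exist because [MASK] never appears at fine-tuning time; they keep the model from relying on seeing the mask token itself.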

Fine-tuning

Take the pretrained BERT weights, add a small task-specific head on top, and fine-tune on labeled data for your task:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pretrained encoder plus a fresh 2-label classification head
# (the head is randomly initialized until you fine-tune it)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize one example and run a forward pass
inputs = tokenizer("This movie is great!", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # shape (1, 2): one raw score per label
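
The snippet above only runs a forward pass; fine-tuning then updates weights against labeled examples. Below is a schematic of that pattern using stand-in linear layers instead of the real encoder, so it runs without downloading anything. Full BERT fine-tuning usually updates all weights at a small learning rate; this sketch shows the cheaper variant that freezes the body and trains only the head:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

body = nn.Linear(16, 8)   # stand-in for the pretrained encoder
head = nn.Linear(8, 2)    # the small task head added on top

for p in body.parameters():  # freeze the "pretrained" part
    p.requires_grad = False

# Only the head's parameters are given to the optimizer
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 16)            # a fake batch of 4 examples
y = torch.tensor([0, 1, 0, 1])    # fake labels

logits = head(body(x))            # forward through body, then head
loss = loss_fn(logits, y)
loss.backward()                   # gradients flow only into the head
optimizer.step()
print(float(loss))
```

With real BERT you would replace `body` with the model from the snippet above and loop this step over batches of your dataset.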

Use cases

  • Text classification (sentiment, topic, intent)
  • Named entity recognition
  • Question answering
  • Sentence similarity
  • Any task that requires understanding text rather than generating it
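
For sentence similarity, the usual recipe is to pool BERT's token embeddings into one vector per sentence and compare the vectors with cosine similarity. A dependency-free sketch of the comparison step, assuming you already have embedding vectors (the toy 3-dimensional vectors below stand in for real 768-dimensional BERT embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors:
    dot product divided by the product of the norms, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for pooled sentence embeddings
v1 = [0.2, 0.9, 0.1]
v2 = [0.25, 0.85, 0.05]
print(round(cosine_similarity(v1, v2), 3))
```

Scores near 1 mean the sentences point in nearly the same direction in embedding space, i.e. similar meaning.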

Variants

Model         Notes
BERT          Original, English
RoBERTa       Better training recipe, no NSP, more data
DistilBERT    About 60% of the size, ~97% of the performance; good for production
DeBERTa       Disentangled attention, often the best accuracy