BERT and Masked Language Models

What

BERT (Bidirectional Encoder Representations from Transformers): a pretrained transformer encoder that understands text by looking at context from both directions.

Pretraining

  • Masked Language Modeling (MLM): randomly mask 15% of the tokens and predict them from the surrounding context
  • “The [MASK] sat on the mat” → “cat”
  • Next Sentence Prediction (NSP): given two sentences, predict whether the second actually followed the first in the corpus
  • Filling in masked tokens using context from both sides forces the model to understand text deeply
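
The masking step has a detail worth knowing: of the 15% of tokens selected as prediction targets, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A toy, dependency-free sketch of that selection logic (the word-level tokens and vocabulary here are stand-ins, not BERT's real WordPiece vocabulary):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy version of BERT's masking: pick ~15% of positions as
    prediction targets; of those, 80% -> [MASK], 10% -> a random
    vocabulary token, 10% left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK          # usual case: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)  # sometimes: a random token
            # else: keep the original token, but still predict it
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked, targets)
```

The random-token and keep-unchanged cases exist because [MASK] never appears at fine-tuning time; they keep the model from relying on seeing the mask token itself.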

Fine-tuning

Take the pretrained BERT weights, add a small task-specific head on top, and fine-tune on labeled data for your task:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pretrained encoder plus a fresh 2-label classification head
# (the head is randomly initialized until you fine-tune it)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize one example and run a forward pass
inputs = tokenizer("This movie is great!", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # shape (1, 2): one raw score per label
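
The snippet above only runs a forward pass; fine-tuning then updates weights against labeled examples. Below is a schematic of that pattern using stand-in linear layers instead of the real encoder, so it runs without downloading anything. Full BERT fine-tuning usually updates all weights at a small learning rate; this sketch shows the cheaper variant that freezes the body and trains only the head:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

body = nn.Linear(16, 8)   # stand-in for the pretrained encoder
head = nn.Linear(8, 2)    # the small task head added on top

for p in body.parameters():  # freeze the "pretrained" part
    p.requires_grad = False

# Only the head's parameters are given to the optimizer
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 16)            # a fake batch of 4 examples
y = torch.tensor([0, 1, 0, 1])    # fake labels

logits = head(body(x))            # forward through body, then head
loss = loss_fn(logits, y)
loss.backward()                   # gradients flow only into the head
optimizer.step()
print(float(loss))
```

With real BERT you would replace `body` with the model from the snippet above and loop this step over batches of your dataset.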

Use cases

  • Text classification (sentiment, topic, intent)
  • Named entity recognition
  • Question answering
  • Sentence similarity
  • Any task that requires understanding text rather than generating it
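
For sentence similarity, the usual recipe is to pool BERT's token embeddings into one vector per sentence and compare the vectors with cosine similarity. A dependency-free sketch of the comparison step, assuming you already have embedding vectors (the toy 3-dimensional vectors below stand in for real 768-dimensional BERT embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors:
    dot product divided by the product of the norms, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for pooled sentence embeddings
v1 = [0.2, 0.9, 0.1]
v2 = [0.25, 0.85, 0.05]
print(round(cosine_similarity(v1, v2), 3))
```

Scores near 1 mean the sentences point in nearly the same direction in embedding space, i.e. similar meaning.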

Variants

Model         Notes
BERT          Original, English
RoBERTa       Better training recipe, no NSP, more data
DistilBERT    About 60% of the size, ~97% of the performance; good for production
DeBERTa       Disentangled attention, often the best accuracy