Knowledge Distillation

What

Train a smaller “student” model to mimic a larger “teacher” model. The student learns from the teacher’s soft probability outputs — not just hard labels.

Why soft targets help

Teacher outputs: [cat: 0.7, dog: 0.2, horse: 0.1]
Hard label:      [cat: 1,   dog: 0,   horse: 0]

The soft targets carry dark knowledge: “this looks a bit like a dog too” — richer signal than a one-hot label. The teacher assigns non-trivial probability to incorrect classes, revealing the model’s understanding of conceptual similarity.

Process

Teacher (large, accurate) → generate soft labels on training data
Student (small, fast) → train on both:
  - Soft labels from teacher (KL divergence loss)
  - Hard labels from data (cross-entropy loss)
  - Total loss = α × soft_loss + (1-α) × hard_loss
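The blended objective above can be sketched in PyTorch. This is a minimal sketch; the `alpha` and temperature defaults are illustrative, not prescribed values:

```python
import torch
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Soft loss: student matches the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * T**2
    # Hard loss: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Setting `alpha=1.0` recovers pure distillation; `alpha=0.0` recovers ordinary supervised training.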

Distillation temperature

The softmax temperature T controls how “soft” the teacher’s distribution is:

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    T = temperature
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Scale by T**2 so soft-loss gradients stay comparable to the hard loss
    return T**2 * F.kl_div(soft_student, soft_teacher, reduction='batchmean')
  • T=1: standard softmax
  • T>1: softer probability distribution over more classes
  • High T amplifies dark knowledge from teacher
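A quick numeric check of the bullets above; the logits are made up for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.0, 0.2])  # e.g. cat, dog, horse

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    # Higher T spreads probability mass across the non-argmax classes,
    # making the "dark knowledge" in the small logits visible
    print(f"T={T}: {probs.tolist()}")
```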

Types of distillation

Response distillation

Student learns to match teacher’s final output layer. Simplest form — used in DistilBERT.

Feature distillation

Student learns to match intermediate representations. The teacher's intermediate layers serve as hints:

# Feature matching: align hidden states
feature_loss = F.mse_loss(student_hidden, teacher_hidden)
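In practice the student and teacher hidden sizes usually differ, so a learned projection maps student features to teacher width before the MSE (the "hint/regressor" idea from FitNets). The dimensions below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHint(nn.Module):
    """Project student hidden states to teacher width, then match with MSE."""
    def __init__(self, student_dim=384, teacher_dim=768):  # dims are illustrative
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```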

Relationship distillation

Student learns the relationships between teacher’s representations — attention maps, similarity matrices.
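One hedged sketch of relational matching: compare pairwise cosine-similarity matrices of student and teacher embeddings, so only the relationships between samples must agree and the embedding dimensions can differ:

```python
import torch
import torch.nn.functional as F

def similarity_loss(student_emb, teacher_emb):
    # Shapes: (batch, dim) -> pairwise similarity matrices of shape (batch, batch)
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)
```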

Modern distillation techniques

1. Self-distillation (Born-Again Networks)

A model distilled into an identical architecture. Iterative self-distillation often improves performance without a larger teacher — the student becomes its own teacher.
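The born-again procedure can be sketched as a loop; `make_model` and `train_with_teacher` are hypothetical placeholders, not a real API:

```python
def born_again(make_model, train_with_teacher, generations=3):
    # Hypothetical sketch: each generation trains against the previous
    # generation's soft outputs, then becomes the next teacher.
    teacher = None
    for _ in range(generations):
        student = make_model()                # identical architecture each time
        train_with_teacher(student, teacher)  # teacher=None -> plain supervised training
        teacher = student                     # the student becomes its own teacher
    return teacher
```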

2. Language model distillation (LLM compression)

Distilling large language models into smaller ones:

  • Special loss for token-level knowledge
  • Logit matching at the final layer
  • Intermediate layer matching for deeper architectures
  • Example: TinyLlama (1.1B) distilled from Llama 2 (7B+)
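Token-level logit matching is the response loss applied at every sequence position; a minimal sketch, with batch, sequence, and vocabulary sizes purely illustrative:

```python
import torch
import torch.nn.functional as F

def token_level_kd(student_logits, teacher_logits, T=2.0):
    # Shapes: (batch, seq_len, vocab); KL is averaged over all token positions
    B, L, V = student_logits.shape
    s = F.log_softmax(student_logits.reshape(-1, V) / T, dim=-1)
    t = F.softmax(teacher_logits.reshape(-1, V) / T, dim=-1)
    return T**2 * F.kl_div(s, t, reduction='batchmean')
```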

3. Task-specific distillation

Fine-tune a general teacher on a specific domain, then distill. For example, GPT-4 outputs can be distilled into a 7B model specialized for code generation or instruction following.

4. Data-free distillation

When you don’t have access to the original training data, generate synthetic data from the teacher (use the teacher to label generated samples), or use an adversarial setup to create informative samples.
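A hedged sketch of the simplest variant: sample random inputs and let the teacher label them. In practice a trained generator or adversarial search produces far more informative samples than random noise; `input_dim` and the batch size are illustrative:

```python
import torch

@torch.no_grad()
def make_synthetic_batch(teacher, batch_size=32, input_dim=128):
    # Hypothetical sketch: random inputs stand in for a synthetic-data generator
    x = torch.randn(batch_size, input_dim)
    soft_labels = torch.softmax(teacher(x), dim=-1)  # teacher labels its own samples
    return x, soft_labels
```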

Distillation vs other compression techniques

|             | Distillation                       | Quantization                  | Pruning                       |
|-------------|------------------------------------|-------------------------------|-------------------------------|
| Mechanism   | Train small from large             | Reduce weight precision       | Remove weights                |
| Quality     | Best (leverages teacher knowledge) | Good (small loss at INT8)     | Variable                      |
| Speed       | Smaller model = faster             | Faster matmuls                | Depends on sparsity           |
| Combination | Can combine with quantization      | Can combine with distillation | Can combine with distillation |

Applications

  • Deploy smaller models in production (DistilBERT: 60% the size of BERT, ~97% of its quality)
  • Mobile/edge deployment
  • Using GPT-4 outputs to train smaller open models
  • Specialized models from general models

Key papers

  • Distilling the Knowledge in a Neural Network (Hinton et al., 2015) — arXiv:1503.02531
  • Born Again Networks (Furlanello et al., 2018) — self-distillation
  • TinyBERT (Jiao et al., 2020) — BERT distillation
  • MiniLM (Wang et al., 2020) — deep self-distillation for language models
  • A Survey on Knowledge Distillation (Gou et al., 2021) — arXiv:2006.05525