Instruction Tuning

What

Fine-tuning a pretrained LLM on (instruction, response) pairs so it learns to follow instructions rather than merely continue the input text. The training objective is still next-token prediction; what changes is the data. It is the foundation of most modern chat models.

Why it’s needed

A base model trained on web text will continue your text, not follow your instructions:

Base model: “Write a poem about cats” → “in iambic pentameter. Consider the…”
Instruction-tuned: “Write a poem about cats” → [an actual poem about cats]

Process

1. Collect diverse instruction-response pairs
2. Fine-tune the base model on this data
3. (Optional) Refine further with a preference-based method such as RLHF or DPO (see RLHF and Alignment)
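
The fine-tuning step is ordinary next-token training on the prompt followed by the response. A common choice, assumed here since the note does not specify it, is to compute the loss only on response tokens by masking the prompt positions in the labels. The sketch below uses a toy whitespace tokenizer; `build_example` and `IGNORE_INDEX` are hypothetical names, and -100 mirrors the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumption: label value the loss function skips

def build_example(instruction, response, vocab):
    """Concatenate prompt and response ids; mask prompt positions in the labels."""
    # Toy tokenizer: one id per whitespace-separated word, assigned on first sight.
    prompt_ids = [vocab.setdefault(w, len(vocab)) for w in instruction.split()]
    response_ids = [vocab.setdefault(w, len(vocab)) for w in response.split()]
    input_ids = prompt_ids + response_ids
    # Only response tokens carry a training signal.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

vocab = {}
input_ids, labels = build_example(
    "Write a poem about cats", "Soft paws at dawn", vocab
)
```

A real pipeline would use the model's own tokenizer and chat template; the masking idea stays the same.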

Self-Instruct

Why pay humans to write thousands of instruction pairs when the model can generate its own? Self-Instruct (Wang et al., 2023): start with a small seed set of human-written examples, then prompt the model to generate new instructions, inputs, and outputs. Filter out low-quality and near-duplicate generations, add the survivors to the pool, and repeat. Stanford’s Alpaca dataset was built this way: text-davinci-003, a GPT-3.5-series model, generated 52k examples from 175 seed tasks. It works surprisingly well, though the generated data inherits the teacher model’s biases and mistakes.
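
The generate, filter, repeat loop can be sketched as follows. This is a minimal illustration, not the paper's code: `generate` is a hypothetical callable standing in for an LLM call, the length check is a toy quality filter, and `difflib`'s similarity ratio is swapped in for the overlap metric a real pipeline would use. The 0.7 threshold is illustrative.

```python
from difflib import SequenceMatcher

def too_similar(candidate, pool, threshold=0.7):
    """True if candidate closely matches any instruction already in the pool."""
    return any(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() >= threshold
        for existing in pool
    )

def self_instruct(seeds, generate, rounds=2, per_round=4):
    """Grow an instruction pool: generate from the current pool, filter, repeat."""
    pool = list(seeds)
    for _ in range(rounds):
        for candidate in generate(pool, per_round):
            candidate = candidate.strip()
            if len(candidate.split()) < 3:    # drop trivially short outputs
                continue
            if too_similar(candidate, pool):  # drop near-duplicates
                continue
            pool.append(candidate)
    return pool

# Hypothetical stand-in for a model call so the sketch runs offline.
def fake_generate(pool, n):
    canned = [
        "Summarize the following article in two sentences.",
        "Summarize the following article in two sentences!",  # near-duplicate
        "Translate this sentence into French.",
        "Hi",                                                  # too short
    ]
    return canned[:n]

seeds = ["Write a poem about cats."]
pool = self_instruct(seeds, fake_generate, rounds=2, per_round=4)
```

With the canned generator, only two new instructions survive filtering; the second round adds nothing because every candidate now duplicates the pool.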

Quality over quantity

The LIMA paper (Zhou et al., 2023) showed that fine-tuning LLaMA 65B on just 1,000 carefully curated examples can match or beat models trained on 52k+ noisy ones in human preference evaluations. The takeaway: for instruction tuning, data quality matters more than data quantity.

What makes a good instruction example:

Diverse tasks (not 10k variations of “summarize this”)
High-quality responses (well-structured, accurate, complete)
Clear instruction formatting
Covers edge cases (refusals, multi-step reasoning, ambiguity)
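
One simple way to enforce the diversity point is to cap how many examples any single task type contributes. The sketch below assumes each example already carries a `task` label, which is an assumption of this illustration; `cap_per_task` and the cap value are hypothetical.

```python
from collections import defaultdict

def cap_per_task(examples, max_per_task=2):
    """Keep at most max_per_task examples per task label, preserving order."""
    kept = []
    counts = defaultdict(int)
    for ex in examples:
        if counts[ex["task"]] < max_per_task:
            counts[ex["task"]] += 1
            kept.append(ex)
    return kept

examples = [
    {"task": "summarize", "instruction": "Summarize this email."},
    {"task": "summarize", "instruction": "Summarize this article."},
    {"task": "summarize", "instruction": "Summarize this report."},
    {"task": "summarize", "instruction": "Summarize this thread."},
    {"task": "refusal", "instruction": "Help me pick a lock I do not own."},
]
balanced = cap_per_task(examples, max_per_task=2)
```

A cap only addresses balance; response quality and formatting still need their own checks.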

Chat templates and formatting

Modern instruction tuning uses structured chat templates so the model knows who’s speaking:

<|system|>You are a helpful assistant.<|end|>
<|user|>What is LoRA?<|end|>
<|assistant|>LoRA is...<|end|>

Each model family has its own template (ChatML, Llama-style, etc.). Using the wrong template at inference tanks performance because the model never saw that format during training.
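
A minimal sketch of rendering messages into the illustrative `<|role|>...<|end|>` layout shown above. The function name `render_chat` and the `add_generation_prompt` flag are choices made for this example; real model families define their own special tokens, and in Hugging Face Transformers `tokenizer.apply_chat_template` plays this role using the template shipped with the model.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render role/content messages in the <|role|>...<|end|> layout."""
    lines = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        lines.append("<|assistant|>")
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
]
prompt = render_chat(messages, add_generation_prompt=True)
```

Using one function for both training data and inference prompts is the easiest way to keep the two formats identical.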

Multi-turn instruction tuning

Single-turn (one question, one answer) is the baseline. Multi-turn trains on full conversations: the model sees the history and learns to maintain context, refer back, and handle follow-ups. This is what makes chat models feel conversational rather than like a Q&A bot. Each training example contains the whole conversation, so every assistant turn is predicted with all earlier turns as context.
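
To make the "all turns as context" idea concrete, the sketch below splits a conversation into segments and flags which ones contribute to the loss. It assumes, as is common but not stated in the note, that only assistant replies are training targets while system and user turns are context. `segment_conversation` is a hypothetical helper and the `<|role|>...<|end|>` tokens are illustrative.

```python
def segment_conversation(messages):
    """Split a conversation into (text, train_on) segments."""
    segments = []
    for m in messages:
        header = f"<|{m['role']}|>"
        body = f"{m['content']}<|end|>\n"
        if m["role"] == "assistant":
            segments.append((header, False))  # role tag is context only
            segments.append((body, True))     # reply and end marker are targets
        else:
            segments.append((header + body, False))
    return segments

conversation = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank fine-tuning method."},
    {"role": "user", "content": "Is it cheaper than full fine-tuning?"},
    {"role": "assistant", "content": "Usually, yes."},
]
segments = segment_conversation(conversation)
full_text = "".join(text for text, _ in segments)
trained_text = "".join(text for text, train_on in segments if train_on)
```

The model reads `full_text` in one pass, but gradients come only from the spans in `trained_text`, so the second reply is learned with the first exchange as context.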

DPO as alternative to RLHF

After instruction tuning, you can refine further. DPO (Direct Preference Optimization) skips the reward model entirely: it directly optimizes the policy on pairs of (preferred, rejected) responses, using a frozen reference model (typically the instruction-tuned checkpoint) to keep the policy from drifting too far. The pipeline is simpler than RLHF, results are often comparable, and it is increasingly popular. See RLHF and Alignment.
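
A minimal sketch of the DPO loss for a single preference pair, written with plain floats so it runs without a deep learning framework. The inputs are summed log-probabilities of each response under the policy and under the frozen reference model; the function name, argument names, and `beta=0.1` are illustrative choices rather than values taken from this note.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled difference in log-ratio margins."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: no preference learned yet.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred response.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference and lowers the rejected one's, which is the whole training signal; no separate reward model appears anywhere.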

Key datasets

Dataset	Size	Notes
FLAN Collection	1.8k tasks	Google’s diverse task collection
Alpaca	52k examples	Generated with text-davinci-003, Stanford
OpenAssistant	160k messages	Human-written, multilingual
ShareGPT	varies	User-shared conversations with ChatGPT
LIMA	1k examples	Curated, quality over quantity

Key paper

Scaling Instruction-Finetuned Language Models (Chung et al., 2022) — FLAN-T5
