Instruction Tuning
What
Fine-tuning a pretrained LLM on (instruction, response) pairs so it learns to follow instructions rather than merely continue the text. The training objective is still next-token prediction; what changes is the data. The foundation of every modern chat model.
Why it’s needed
A base model trained on web text will continue your text, not follow your instructions:
- Base model: “Write a poem about cats” → “in iambic pentameter. Consider the…”
- Instruction-tuned: “Write a poem about cats” → [an actual poem about cats]
Process
- Collect diverse instruction-response pairs
- Fine-tune the base model on this data
- (Optional) Apply RLHF or DPO for further refinement (see RLHF and Alignment)
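As a rough illustration of the second step, here is how one (instruction, response) pair might be turned into a training example. This is a minimal sketch, not a full training loop. It assumes the common convention of masking prompt tokens with the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss) so that only response tokens contribute to the loss; the note itself does not prescribe this. `build_sft_example` and the whitespace tokenizer are hypothetical stand-ins for a real tokenizer and data pipeline.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def make_toy_tokenizer():
    """Hypothetical whitespace tokenizer; a real setup would use the model's own tokenizer."""
    vocab = {}

    def tokenize(text):
        # Assign each new word the next free integer id.
        return [vocab.setdefault(word, len(vocab)) for word in text.split()]

    return tokenize


def build_sft_example(instruction, response, tokenize):
    """Concatenate prompt and response; mask the prompt so only the response is learned."""
    prompt_ids = tokenize(instruction)
    response_ids = tokenize(response)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


tokenize = make_toy_tokenizer()
example = build_sft_example("Write a poem about cats", "Soft paws at dawn", tokenize)
print(example["labels"])  # [-100, -100, -100, -100, -100, 5, 6, 7, 8]
```

The model still sees the instruction as input; the mask only controls which positions are scored.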
Self-Instruct
Why pay humans to write thousands of instruction pairs when the model can generate its own? Self-Instruct (Wang et al., 2023): start with a small seed set of examples, then have the model generate new instructions, inputs, and outputs. Filter out low-quality and near-duplicate ones, add the survivors to the pool, and repeat. Stanford’s Alpaca dataset was built this way: text-davinci-003 (a GPT-3.5-series model) generated 52k examples from 175 seed tasks. It works surprisingly well, though the generated data inherits the teacher model’s biases and mistakes.
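The generate-filter-repeat loop can be sketched in a few lines. This is a simplified illustration under stated assumptions: it generates instructions only (no inputs or outputs), the `generate` callable is a hypothetical stand-in for a real LLM call, and similarity is measured with `difflib.SequenceMatcher` rather than the ROUGE-L overlap used in the paper. The 0.7 threshold mirrors the paper's cutoff but applies to a different metric here.

```python
import random
from difflib import SequenceMatcher


def is_novel(candidate, pool, max_similarity=0.7):
    """Keep a candidate only if it is not too similar to anything already in the pool."""
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < max_similarity
        for existing in pool
    )


def self_instruct(seeds, generate, rounds=3, prompts_per_round=4, n_demos=2, seed=0):
    """Grow an instruction pool: sample demos, ask the model for a new one, filter, repeat."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        for _ in range(prompts_per_round):
            demos = rng.sample(pool, k=min(n_demos, len(pool)))
            candidate = generate(demos).strip()  # hypothetical LLM call
            if candidate and is_novel(candidate, pool):
                pool.append(candidate)
    return pool


def make_fake_model(outputs):
    """Deterministic stub that replays canned outputs instead of calling a model."""
    it = iter(outputs)
    return lambda demos: next(it, "")


seeds = ["Summarize the following article.", "Translate this sentence into French."]
model = make_fake_model([
    "Summarize the following article.",     # duplicate of a seed: rejected
    "Write a haiku about autumn rain.",     # novel: kept
    "",                                     # empty generation: rejected
    "List three prime numbers above 100.",  # novel: kept
])
pool = self_instruct(seeds, model, rounds=1, prompts_per_round=4)
print(len(pool))  # 4
```

Because accepted candidates go back into the pool, later rounds sample them as demonstrations, which is also how the teacher's mistakes can compound.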
Quality over quantity
The LIMA paper (Zhou et al., 2023) showed that a 65B LLaMA model fine-tuned on just 1,000 carefully curated examples can match or beat models trained on 52k+ noisier ones in human preference evaluations. The takeaway: for instruction tuning, data quality matters more than data quantity.
What makes a good instruction dataset:
- Diverse tasks (not 10k variations of “summarize this”)
- High-quality responses (well-structured, accurate, complete)
- Clear instruction formatting
- Coverage of edge cases (refusals, multi-step reasoning, ambiguity)
Chat templates and formatting
Modern instruction tuning uses structured chat templates so the model knows who’s speaking:
<|system|>You are a helpful assistant.<|end|>
<|user|>What is LoRA?<|end|>
<|assistant|>LoRA is...<|end|>
Each model family has its own template (ChatML, Llama-style, etc.). Using the wrong template at inference tanks performance because the model never saw that format during training.
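To make the format concrete, the sketch below renders a list of messages into the illustrative `<|role|>...<|end|>` layout shown above. `render_chat` is a hypothetical helper written for this note, not any model's official template. In practice the template shipped with the model's tokenizer should be used, for example through Hugging Face's `tokenizer.apply_chat_template`.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def render_chat(messages, add_generation_prompt=False):
    """Render messages in the illustrative <|role|>content<|end|> layout."""
    lines = []
    for message in messages:
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {message['role']!r}")
        lines.append(f"<|{message['role']}|>{message['content']}<|end|>")
    if add_generation_prompt:
        # At inference, end with an open assistant tag so the model writes the reply.
        lines.append("<|assistant|>")
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
]
prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
# <|system|>You are a helpful assistant.<|end|>
# <|user|>What is LoRA?<|end|>
# <|assistant|>
```

The same function serves training (render the full conversation, including the assistant reply) and inference (render up to the open assistant tag), which keeps the two formats from drifting apart.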
Multi-turn instruction tuning
Single-turn (one question, one answer) is the basic case. Multi-turn tuning trains on full conversations: the model sees the history and learns to maintain context, refer back, and handle follow-ups. This is what makes chat models feel conversational rather than like a Q&A bot. Each training example contains the full conversation, with all earlier turns serving as context.
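One way to picture multi-turn training data: every assistant turn becomes a target, with all earlier turns as its context. The sketch below makes that explicit by expanding one conversation into (context, target) samples. `expand_conversation` is a hypothetical helper. Real pipelines often get the same effect differently, by feeding the whole conversation as one sequence and computing the loss only on assistant tokens; that is an assumption about common practice, not something this note specifies.

```python
def expand_conversation(messages):
    """Turn one conversation into one training sample per assistant turn."""
    samples = []
    for i, message in enumerate(messages):
        if message["role"] == "assistant":
            samples.append({
                "context": messages[:i],       # everything said before this reply
                "target": message["content"],  # the reply the model should produce
            })
    return samples


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA is a low-rank adaptation method."},
    {"role": "user", "content": "Why is it cheaper than full fine-tuning?"},
    {"role": "assistant", "content": "It trains far fewer parameters."},
]
samples = expand_conversation(conversation)
print([len(s["context"]) for s in samples])  # [2, 4]
```

The second sample's context includes the first exchange, which is what lets the model learn that "it" in the follow-up refers to LoRA.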
DPO as alternative to RLHF
After instruction tuning, you can refine further. DPO (Direct Preference Optimization) skips the separate reward model and the RL loop: it optimizes the policy directly on pairs of (preferred, rejected) responses, using a frozen reference model to keep the policy from drifting too far. Simpler pipeline than RLHF, often comparable results, increasingly popular. See RLHF and Alignment.
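For a single preference pair, the DPO objective reduces to a logistic loss on a difference of log-probability ratios. The sketch below computes it in plain Python under stated assumptions: the four inputs are summed token log-probabilities of each full response under the policy and under a frozen reference model, and `beta=0.1` is just an illustrative default. A real implementation would compute these in batches from model outputs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy raises the preferred response and lowers the rejected one: loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)

print(improved < baseline)  # True
```

No reward model appears anywhere: the log-ratio against the reference plays the role of an implicit reward.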
Key datasets
| Dataset | Size | Notes |
|---|---|---|
| FLAN Collection | 1.8k tasks | Google’s diverse task collection |
| Alpaca | 52k examples | Generated with text-davinci-003, Stanford |
| OpenAssistant | ~160k messages | Human-written, multilingual |
| ShareGPT | varies | User-shared conversations with ChatGPT |
| LIMA | 1k examples | Curated, quality over quantity |
Key paper
- Scaling Instruction-Finetuned Language Models (Chung et al., 2022): introduces Flan-T5 and Flan-PaLM
Links
- RLHF and Alignment — next step after instruction tuning
- Fine-Tuning LLMs — the mechanics of fine-tuning
- LoRA and PEFT — efficient fine-tuning for instruction tuning
- Language Models