Fine-Tuning LLMs
What
Adapting a pretrained large language model to a specific task or domain by continuing training on your data. Unlike prompt engineering (which conditions the model without changing weights), fine-tuning modifies the model’s weights directly.
Why Fine-Tune
| Limitation of Prompting | What Fine-Tuning Adds |
|---|---|
| Context window is limited | Compressed into weights |
| Same prompt repeated = same cost | Zero marginal inference cost |
| Model must learn task from examples in context | Task internalized in weights |
| Prompt injection risk | Harder to manipulate |
| Slow for complex formats | Fast, consistent output |
When to Fine-Tune vs Prompt
Start with prompting → add RAG for knowledge → fine-tune for behavior/style
| Scenario | Best Approach |
|---|---|
| General tasks, few examples | Prompt engineering |
| Specific format/output structure | Fine-tuning |
| Domain-specific knowledge (facts) | RAG (fresher, easier) |
| Writing style, persona, tone | Fine-tuning |
| Task-specific reasoning patterns | Fine-tuning |
| Large labeled dataset available | Fine-tuning |
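The decision table above can be encoded as a tiny helper for quick triage. This is an illustrative sketch of the prompt → RAG → fine-tune ordering, not a library function; the flag names are assumptions.

```python
def choose_approach(needs_fresh_facts: bool,
                    needs_specific_format: bool,
                    needs_style_or_persona: bool,
                    has_large_labeled_dataset: bool) -> str:
    """Encode the decision table: prompting first, RAG for knowledge,
    fine-tuning for format, style, or when lots of labels exist."""
    if needs_fresh_facts:
        return "RAG"                 # factual knowledge: retrieval stays fresher
    if needs_specific_format or needs_style_or_persona or has_large_labeled_dataset:
        return "fine-tuning"         # behavior/style/format lives in the weights
    return "prompt engineering"      # default: cheapest, no training needed
```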
Fine-Tuning Approaches
Full Fine-Tuning
Update all parameters. Produces the best results but requires:
- Significant GPU memory (a 70B model needs ~140GB in fp16 for the weights alone; with gradients and Adam optimizer state, over 1TB)
- Large dataset (10K+ examples recommended)
- Careful learning rate (typically 10x lower than pretraining)
- Distributed training across GPUs
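The memory requirement follows from simple per-parameter arithmetic. A rough sketch, assuming bf16 weights with standard mixed-precision AdamW (~16 bytes per parameter, ignoring activations and framework overhead):

```python
def full_ft_memory_gb(n_params: float) -> dict:
    """Rough per-component memory for full fine-tuning with AdamW
    in mixed precision. Excludes activations and framework overhead."""
    bytes_per_param = {
        "weights (bf16)": 2,
        "gradients (bf16)": 2,
        "adam m and v (fp32)": 8,
        "fp32 master weights": 4,
    }
    return {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}

est = full_ft_memory_gb(70e9)
# Weights alone: 140 GB; total training state: ~1.1 TB before activations
```

This is why full fine-tuning of 70B-class models requires sharding the optimizer state across many GPUs (e.g. ZeRO/FSDP), while LoRA below avoids most of these terms entirely.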
LoRA (Low-Rank Adaptation)
Train only small low-rank adapter matrices while freezing the original weights, cutting trainable parameters and optimizer memory by 10-100x.
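The savings follow from a parameter count: a frozen d×k weight gains two adapter matrices B (d×r) and A (r×k), so only r·(d+k) parameters train instead of d·k. A quick check with an illustrative 4096-dim attention projection (the dimensions are an assumption for illustration):

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable params added by one LoRA adapter on a d x k weight:
    B is d x r and A is r x k, so r * (d + k) in total."""
    return r * (d + k)

d = k = 4096                                 # illustrative projection size
full = d * k                                 # 16,777,216 frozen params
lora = lora_trainable_params(d, k, r=16)     # 131,072 trainable params
ratio = lora / full                          # ~0.78% of the original matrix
```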
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                    # rank of adapter matrices
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)  # base_model: any loaded causal LM
# Only ~0.1-1% of parameters are trainable
```

QLoRA (Quantized LoRA)
LoRA on a 4-bit quantized base model. Fine-tune 70B models on a single 48GB GPU.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize base model to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```

DoRA (Weight-Decomposed LoRA, 2024)
Decomposes weights into magnitude + direction, applies LoRA only to direction. Consistently outperforms LoRA at same parameter budget.
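The decomposition can be sketched per weight column: w = m · v/‖v‖, where DoRA trains the magnitude m directly and applies the low-rank update only to the direction v before renormalizing. A minimal pure-Python sketch of one column update (not the PEFT implementation):

```python
import math

def dora_column(v, delta, m):
    """DoRA update for one weight column: apply the low-rank directional
    update delta, renormalize, then scale by the learned magnitude m:
    w' = m * (v + delta) / ||v + delta||"""
    u = [vi + di for vi, di in zip(v, delta)]
    norm = math.sqrt(sum(x * x for x in u))
    return [m * x / norm for x in u]

col = dora_column(v=[3.0, 4.0], delta=[0.0, 0.0], m=5.0)
# With no directional update, the column keeps exactly magnitude m
```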
Dataset Preparation
Dataset sizes
| Task Complexity | Minimum | Good | Excellent |
|---|---|---|---|
| Simple classification | 100 | 1K | 10K |
| Entity extraction | 500 | 5K | 50K |
| Instruction following | 1K | 10K | 100K |
| Code generation | 1K | 10K | 100K |
Format: Instruction tuning
```json
{
  "instruction": "Extract the name and price from this product description.",
  "input": "Apple MacBook Pro 14-inch costs $1999.",
  "output": "Name: Apple MacBook Pro 14-inch\nPrice: $1999"
}
```

Quality matters more than quantity
- 1K high-quality examples > 100K noisy examples
- Remove duplicates, fix errors, be consistent in output format
- Include edge cases the model struggles with
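The deduplication and consistency checks above are easy to script. A minimal cleaning pass, assuming one JSON record per line in the instruction-tuning format shown above:

```python
import json

def clean_dataset(lines):
    """Drop exact duplicates and records missing instruction or output."""
    seen, cleaned = set(), []
    for line in lines:
        rec = json.loads(line)
        if not all(rec.get(k) for k in ("instruction", "output")):
            continue                     # skip incomplete records
        key = (rec["instruction"], rec.get("input", ""), rec["output"])
        if key in seen:
            continue                     # skip exact duplicates
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    '{"instruction": "a", "input": "", "output": "x"}',
    '{"instruction": "a", "input": "", "output": "x"}',  # duplicate
    '{"instruction": "b", "input": "", "output": ""}',   # empty output
]
# clean_dataset(raw) keeps only the first record
```

In practice you would extend this with near-duplicate detection (e.g. MinHash) and format validation, but even exact dedup catches a surprising amount of noise in scraped datasets.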
Tools
- Argilla: collaborative data labeling
- FastChat: conversation format conversion
- LlamaFactory: data preprocessing
Training Configuration
Key hyperparameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,              # 1-3 for LoRA, 1-2 for full fine-tuning
    per_device_train_batch_size=4,   # Reduce if OOM
    gradient_accumulation_steps=4,   # Effective batch = 16
    learning_rate=1e-4,              # LoRA: 1e-4 to 3e-4; full: 1e-6 to 5e-6
    warmup_ratio=0.03,               # 3% of steps
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                       # Better than fp16; use if supported
    gradient_checkpointing=True,     # Trade compute for memory
)
```

LoRA-specific settings
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                                  # Higher = more capacity, more params
    lora_alpha=32,                         # Higher = stronger adaptation
    target_modules=["q_proj", "v_proj"],   # Minimal set; enough for most tasks
    modules_to_save=None,                  # Set if you also train embeddings/lm_head
    bias="none",                           # Don't train biases
    task_type="CAUSAL_LM"
)
```

Recommended Models (2025)
| Model | Params | VRAM (QLoRA) | Quality | Notes |
|---|---|---|---|---|
| Qwen 2.5 3B | 3B | ~6GB | Good | Most capable small model |
| Llama 3.2 3B | 3B | ~6GB | Good | Good community support |
| Gemma 2 2B | 2B | ~4GB | Good | Google’s model, well-documented |
| Qwen 2.5 7B | 7B | ~10GB | Very good | Strong for price |
| Llama 3.1 8B | 8B | ~12GB | Very good | Open weights, large community |
| Mistral Nemo 12B | 12B | ~16GB | Excellent | Best quality under 20B |
| Qwen 2.5 14B | 14B | ~20GB | Excellent | Needs high-end consumer GPU |
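The VRAM column roughly tracks 4-bit weight storage (~0.5 bytes per parameter) plus adapters, optimizer state, activations, and CUDA overhead. A hedged back-of-envelope helper for the weight term only; the remaining overhead varies with batch size and sequence length and is not captured here:

```python
def qlora_weight_memory_gb(n_params: float) -> float:
    """Memory for the 4-bit (NF4) base weights alone: ~0.5 bytes/param.
    Actual training VRAM is higher: LoRA adapters, optimizer state,
    activations, and KV cache add several GB on top (workload-dependent)."""
    return n_params * 0.5 / 1e9

# 7B base weights in 4-bit: ~3.5 GB; the table's ~10GB includes training overhead
```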
Training Tools
| Tool | Best For |
|---|---|
| Axolotl | All-in-one, yaml configs, active community |
| LLaMA Factory | Web UI, many algorithms, easy |
| Unsloth | 2-5x faster, 50% less memory, free |
| TRL (HuggingFace) | SFT, DPO, PPO trainers |
| PEFT | LoRA/QLoRA/DoRA implementations |
Evaluation
Benchmarks
- MT-Bench: Multi-turn dialogue (8 categories)
- AlpacaEval: Instruction following vs GPT-4
- HumanEval: Code generation
- MMLU: 57-task knowledge benchmark
Practical evaluation
```python
from datasets import load_dataset
from evaluate import load

# Load test set
test_set = load_dataset("json", data_files="test.jsonl")["train"]

# Generate and compare (generate() is your model's inference helper)
predictions, references = [], []
for example in test_set:
    pred = generate(example["instruction"], example["input"])
    predictions.append(pred)
    references.append(example["output"])

# Compute metrics
exact_match = load("exact_match")
results = exact_match.compute(predictions=predictions, references=references)
```

LLM-as-judge
Use GPT-4 or Claude to evaluate quality when ground truth is unavailable:
Rate this response from 1-5 on:
- Helpfulness: Does it answer the user's question?
- Accuracy: Is the information correct?
- Coherence: Is it well-structured and clear?
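Judge models return free text, so the scores need parsing before aggregation. A small parser, assuming the judge replies with "Criterion: N" lines matching the rubric above (the reply format is an assumption; constrain it in your judge prompt):

```python
import re

def parse_judge_scores(text: str) -> dict:
    """Extract 'Criterion: N' scores (1-5) from a judge model's reply."""
    scores = {}
    for m in re.finditer(r"(Helpfulness|Accuracy|Coherence)\s*:\s*([1-5])", text):
        scores[m.group(1)] = int(m.group(2))
    return scores

reply = "Helpfulness: 4\nAccuracy: 5\nCoherence: 4\nOverall a solid answer."
# parse_judge_scores(reply) -> {"Helpfulness": 4, "Accuracy": 5, "Coherence": 4}
```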
Full Pipeline Example
```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

# 1. Load and prepare data
dataset = load_dataset("json", data_files="training_data.jsonl")["train"]
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Input:\n{x['input']}\n\n### Response:\n{x['output']}"
})

# 2. Load quantized base model (same bnb_config as the QLoRA section above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# 3. Prepare for k-bit training and add LoRA
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# 4. Train (training_args as defined in Training Configuration above)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=512,
)
trainer.train()

# 5. Save adapter weights
model.save_pretrained("./fine_tuned_model")
```

Key Papers
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021, ICLR) · arXiv:2106.09685
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
- DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024, ICML) · arXiv:2404.10192
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023) · arXiv:2302.13971
Links
- LoRA and PEFT — detailed LoRA/QLoRA/DoRA explanation
- Transfer Learning — general transfer learning concepts
- Language Models — foundation model architecture
- Prompt Engineering — when to use prompting instead
- Retrieval Augmented Generation — for knowledge-heavy tasks