Fine-Tuning LLMs

What

Adapting a pretrained large language model to a specific task or domain by continuing training on your data. Unlike prompt engineering (which conditions the model without changing weights), fine-tuning modifies the model’s weights directly.

Why Fine-Tune

| Limitation of Prompting | What Fine-Tuning Adds |
| --- | --- |
| Context window is limited | Compressed into weights |
| Same prompt repeated = same cost | Zero marginal inference cost |
| Model must learn task from examples in context | Task internalized in weights |
| Prompt injection risk | Harder to manipulate |
| Slow for complex formats | Fast, consistent output |

When to Fine-Tune vs Prompt

Start with prompting → add RAG for knowledge → fine-tune for behavior/style
| Scenario | Best Approach |
| --- | --- |
| General tasks, few examples | Prompt engineering |
| Specific format/output structure | Fine-tuning |
| Domain-specific knowledge (facts) | RAG (fresher, easier) |
| Writing style, persona, tone | Fine-tuning |
| Task-specific reasoning patterns | Fine-tuning |
| Large labeled dataset available | Fine-tuning |

Fine-Tuning Approaches

Full Fine-Tuning

Update all parameters. Typically gives the best quality but requires:

  • Significant GPU memory (a 70B model is ~140GB in fp16 for the weights alone; gradients and Adam optimizer states push training memory well past 1TB)
  • Large dataset (10K+ examples recommended)
  • Careful learning rate (typically ~10x lower than pretraining)
  • Distributed training across GPUs

LoRA (Low-Rank Adaptation)

Freeze the original weights and train only small low-rank adapter matrices injected alongside them. Optimizer and gradient memory drops 10-100x because only the adapters are trainable.

from peft import LoraConfig, get_peft_model
 
config = LoraConfig(
    r=16,                    # rank of adapter matrices
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
# Only ~0.1-1% of parameters are trainable
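The trainable-parameter fraction in the comment above follows from simple counting; a back-of-the-envelope sketch (pure Python, the sizes are illustrative):

```python
# Parameter count for one d x d projection with a rank-r LoRA adapter
d, r = 4096, 16                    # hidden size, adapter rank (illustrative)
full_params = d * d                # frozen base weight matrix
lora_params = 2 * d * r            # adapter A (r x d) plus adapter B (d x r)
fraction = lora_params / full_params
print(f"{fraction:.4%} of this layer's parameters are trainable")
```

With r=16 and d=4096 that works out to under 1% per adapted layer, which is where the memory savings come from.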

QLoRA (Quantized LoRA)

LoRA applied on top of a 4-bit (NF4) quantized base model. The QLoRA paper fine-tuned a 65B model on a single 48GB GPU; 70B-class models fit similarly with care.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
 
# Quantize base model to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
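The single-GPU claim is mostly weight-footprint arithmetic; a rough sketch that ignores adapters, activations, and quantization constants:

```python
# Approximate 4-bit weight footprint of a 70B-parameter model
n_params = 70e9                    # 70B parameters
bytes_per_param = 0.5              # 4 bits = half a byte
weight_gib = n_params * bytes_per_param / 2**30
print(f"{weight_gib:.1f} GiB for the 4-bit weights alone")  # ~32.6 GiB
```

The remaining headroom on a 48GB card goes to LoRA adapters, activations, and optimizer state, which is why long sequences or large batches can still run out of memory.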

DoRA (Weight-Decomposed LoRA, 2024)

Decomposes each weight into a magnitude and a direction component, and applies LoRA only to the direction. The authors report it consistently outperforms LoRA at the same parameter budget.
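In peft, DoRA is enabled by a flag on the same LoraConfig used above (available in recent peft releases); a config sketch:

```python
from peft import LoraConfig

# Same adapter hyperparameters as plain LoRA; use_dora=True switches on the
# magnitude/direction decomposition (requires a recent peft release)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```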

Dataset Preparation

Dataset sizes

| Task Complexity | Minimum | Good | Excellent |
| --- | --- | --- | --- |
| Simple classification | 100 | 1K | 10K |
| Entity extraction | 500 | 5K | 50K |
| Instruction following | 1K | 10K | 100K |
| Code generation | 1K | 10K | 100K |

Format: Instruction tuning

{
  "instruction": "Extract the name and price from this product description.",
  "input": "Apple MacBook Pro 14-inch costs $1999.",
  "output": "Name: Apple MacBook Pro 14-inch\nPrice: $1999"
}
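Records like the one above are typically rendered into a single prompt string before tokenization; a minimal formatter using the common Alpaca-style template (one convention among several, not the only valid one):

```python
def format_example(ex: dict) -> str:
    # Alpaca-style section markers; the exact strings are a convention
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex['input']}\n\n"
        f"### Response:\n{ex['output']}"
    )

sample = {
    "instruction": "Extract the name and price from this product description.",
    "input": "Apple MacBook Pro 14-inch costs $1999.",
    "output": "Name: Apple MacBook Pro 14-inch\nPrice: $1999",
}
print(format_example(sample))
```

Whatever template you pick, use it identically for every example; the model learns the markers along with the task.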

Quality matters more than quantity

  • 1K high-quality examples > 100K noisy examples
  • Remove duplicates, fix errors, be consistent in output format
  • Include edge cases the model struggles with
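Exact-duplicate removal from the checklist above can be done with a content hash; a minimal stdlib-only sketch:

```python
import hashlib
import json

def dedupe(examples: list[dict]) -> list[dict]:
    # Keep the first occurrence of each example, keyed by a hash of its JSON form
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

This catches only byte-identical duplicates; near-duplicates (paraphrases, whitespace variants) need fuzzier matching, e.g. embedding similarity.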

Tools

  • Argilla: collaborative data labeling
  • FastChat: conversation format conversion
  • LlamaFactory: data preprocessing

Training Configuration

Key hyperparameters

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,           # 1-3 for LoRA, 1-2 for full ft
    per_device_train_batch_size=4,  # Reduce if OOM
    gradient_accumulation_steps=4,  # Effective batch = 16
    learning_rate=1e-4,           # LoRA: 1e-4 to 3e-4
                                  # Full:  1e-6 to 5e-6
    warmup_ratio=0.03,           # 3% of steps
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                   # Better than fp16, use if supported
    gradient_checkpointing=True,  # Trade compute for memory
)
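The "Effective batch = 16" comment above is just the product of the batch knobs; worth computing explicitly when you change any of them:

```python
# Effective batch size = per-device batch x accumulation steps x GPU count
per_device_batch = 4
grad_accum_steps = 4
num_gpus = 1                       # single-GPU assumption
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)             # 16
```

Gradient accumulation trades wall-clock time for memory: the optimizer steps once per 4 forward/backward passes, so the gradient statistics match a true batch of 16.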

LoRA-specific settings

config = LoraConfig(
    r=16,                        # Higher = more capacity, more params
    lora_alpha=32,               # Higher = stronger adaptation
    target_modules=["q_proj", "v_proj"],  # Minimal for most tasks
    modules_to_save=None,        # Add if you modify embeddings/lm_head
    bias="none",                 # Don't train biases
    task_type="CAUSAL_LM"
)

Model selection

| Model | Params | VRAM (QLoRA) | Quality | Notes |
| --- | --- | --- | --- | --- |
| Qwen 2.5 3B | 3B | ~6GB | Good | Best sub-3B, most capable |
| Llama 3.2 3B | 3B | ~6GB | Good | Good community support |
| Gemma 2 2B | 2B | ~4GB | Good | Google’s model, well-documented |
| Qwen 2.5 7B | 7B | ~10GB | Very good | Strong for price |
| Llama 3.1 8B | 8B | ~12GB | Very good | Open weights, large community |
| Mistral Nemo 12B | 12B | ~16GB | Excellent | Best quality under 20B |
| Qwen 2.5 14B | 14B | ~20GB | Excellent | Needs high-end consumer GPU |

Training Tools

| Tool | Best For |
| --- | --- |
| Axolotl | All-in-one, YAML configs, active community |
| LLaMA Factory | Web UI, many algorithms, easy |
| Unsloth | 2-5x faster, 50% less memory, free |
| TRL (HuggingFace) | SFT, DPO, PPO trainers |
| PEFT | LoRA/QLoRA/DoRA implementations |

Evaluation

Benchmarks

  • MT-Bench: Multi-turn dialogue (8 categories)
  • AlpacaEval: Instruction following vs GPT-4
  • HumanEval: Code generation
  • MMLU: 57-task knowledge benchmark

Practical evaluation

from evaluate import load
from datasets import load_dataset
 
# Load test set
test_set = load_dataset("json", data_files="test.jsonl")["train"]
 
# Generate and compare
predictions = []
references = []
for example in test_set:
    pred = generate(example["instruction"], example["input"])  # your inference wrapper (not shown)
    predictions.append(pred)
    references.append(example["output"])
 
# Compute metrics
exact_match = load("exact_match")
results = exact_match.compute(predictions=predictions, references=references)

LLM-as-judge

Use GPT-4 or Claude to evaluate quality when ground truth is unavailable:

Rate this response from 1-5 on:
- Helpfulness: Does it answer the user's question?
- Accuracy: Is the information correct?
- Coherence: Is it well-structured and clear?

Full Pipeline Example

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
import torch
 
# 1. Load and prepare data
dataset = load_dataset("json", data_files="training_data.jsonl")["train"]
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Input:\n{x['input']}\n\n### Response:\n{x['output']}"
})
 
# 2. Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
 
# 3. Prepare for k-bit training and add LoRA
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
))
 
# 4. Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,          # TrainingArguments from the section above
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
 
# 5. Save (for a PEFT model this writes only the LoRA adapter weights)
model.save_pretrained("./fine_tuned_model")

Key Papers

  • LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021; ICLR 2022) · arXiv:2106.09685
  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
  • DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024, ICML) · arXiv:2402.09353
  • LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023) · arXiv:2302.13971