Fine-Tuning LLMs

What

Adapting a pretrained large language model to a specific task or domain by continuing training on your data. Unlike prompt engineering (which conditions the model without changing weights), fine-tuning modifies the model’s weights directly.

Why Fine-Tune

| Limitation of Prompting | What Fine-Tuning Adds |
| --- | --- |
| Context window is limited | Compressed into weights |
| Same prompt repeated = same cost | Zero marginal inference cost |
| Model must learn task from examples in context | Task internalized in weights |
| Prompt injection risk | Harder to manipulate |
| Slow for complex formats | Fast, consistent output |

When to Fine-Tune vs Prompt

Start with prompting → add RAG for knowledge → fine-tune for behavior/style
| Scenario | Best Approach |
| --- | --- |
| General tasks, few examples | Prompt engineering |
| Specific format/output structure | Fine-tuning |
| Domain-specific knowledge (facts) | RAG (fresher, easier) |
| Writing style, persona, tone | Fine-tuning |
| Task-specific reasoning patterns | Fine-tuning |
| Large labeled dataset available | Fine-tuning |

Fine-Tuning Approaches

Full Fine-Tuning

Update all parameters. Typically gives the best quality but requires:

  • Significant GPU memory (a 70B model is ~140GB in fp16 for the weights alone; gradients and Adam optimizer states push training memory well past 1TB)
  • Large dataset (10K+ examples recommended)
  • Careful learning rate (typically ~10x lower than pretraining)
  • Distributed training across GPUs

LoRA (Low-Rank Adaptation)

Freeze the original weights and train only small low-rank adapter matrices injected alongside them. Optimizer and gradient memory drops 10-100x because only the adapters are trainable.

from peft import LoraConfig, get_peft_model
 
config = LoraConfig(
    r=16,                    # rank of adapter matrices
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
# Only ~0.1-1% of parameters are trainable
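The trainable-parameter fraction in the comment above follows from simple counting; a back-of-the-envelope sketch (pure Python, the sizes are illustrative):

```python
# Parameter count for one d x d projection with a rank-r LoRA adapter
d, r = 4096, 16                    # hidden size, adapter rank (illustrative)
full_params = d * d                # frozen base weight matrix
lora_params = 2 * d * r            # adapter A (r x d) plus adapter B (d x r)
fraction = lora_params / full_params
print(f"{fraction:.4%} of this layer's parameters are trainable")
```

With r=16 and d=4096 that works out to under 1% per adapted layer, which is where the memory savings come from.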

QLoRA (Quantized LoRA)

LoRA applied on top of a 4-bit (NF4) quantized base model. The QLoRA paper fine-tuned a 65B model on a single 48GB GPU; 70B-class models fit similarly with care.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
 
# Quantize base model to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
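The single-GPU claim is mostly weight-footprint arithmetic; a rough sketch that ignores adapters, activations, and quantization constants:

```python
# Approximate 4-bit weight footprint of a 70B-parameter model
n_params = 70e9                    # 70B parameters
bytes_per_param = 0.5              # 4 bits = half a byte
weight_gib = n_params * bytes_per_param / 2**30
print(f"{weight_gib:.1f} GiB for the 4-bit weights alone")  # ~32.6 GiB
```

The remaining headroom on a 48GB card goes to LoRA adapters, activations, and optimizer state, which is why long sequences or large batches can still run out of memory.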

DoRA (Weight-Decomposed LoRA, 2024)

Decomposes each weight into a magnitude and a direction component, and applies LoRA only to the direction. The authors report it consistently outperforms LoRA at the same parameter budget.
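In peft, DoRA is enabled by a flag on the same LoraConfig used above (available in recent peft releases); a config sketch:

```python
from peft import LoraConfig

# Same adapter hyperparameters as plain LoRA; use_dora=True switches on the
# magnitude/direction decomposition (requires a recent peft release)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
```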

Dataset Preparation

Dataset sizes

| Task Complexity | Minimum | Good | Excellent |
| --- | --- | --- | --- |
| Simple classification | 100 | 1K | 10K |
| Entity extraction | 500 | 5K | 50K |
| Instruction following | 1K | 10K | 100K |
| Code generation | 1K | 10K | 100K |

Format: Instruction tuning

{
  "instruction": "Extract the name and price from this product description.",
  "input": "Apple MacBook Pro 14-inch costs $1999.",
  "output": "Name: Apple MacBook Pro 14-inch\nPrice: $1999"
}
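Records like the one above are typically rendered into a single prompt string before tokenization; a minimal formatter using the common Alpaca-style template (one convention among several, not the only valid one):

```python
def format_example(ex: dict) -> str:
    # Alpaca-style section markers; the exact strings are a convention
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex['input']}\n\n"
        f"### Response:\n{ex['output']}"
    )

sample = {
    "instruction": "Extract the name and price from this product description.",
    "input": "Apple MacBook Pro 14-inch costs $1999.",
    "output": "Name: Apple MacBook Pro 14-inch\nPrice: $1999",
}
print(format_example(sample))
```

Whatever template you pick, use it identically for every example; the model learns the markers along with the task.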

Quality matters more than quantity

  • 1K high-quality examples > 100K noisy examples
  • Remove duplicates, fix errors, be consistent in output format
  • Include edge cases the model struggles with
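Exact-duplicate removal from the checklist above can be done with a content hash; a minimal stdlib-only sketch:

```python
import hashlib
import json

def dedupe(examples: list[dict]) -> list[dict]:
    # Keep the first occurrence of each example, keyed by a hash of its JSON form
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

This catches only byte-identical duplicates; near-duplicates (paraphrases, whitespace variants) need fuzzier matching, e.g. embedding similarity.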

Tools

  • Argilla: collaborative data labeling
  • FastChat: conversation format conversion
  • LlamaFactory: data preprocessing

Training Configuration

Key hyperparameters

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,           # 1-3 for LoRA, 1-2 for full ft
    per_device_train_batch_size=4,  # Reduce if OOM
    gradient_accumulation_steps=4,  # Effective batch = 16
    learning_rate=1e-4,           # LoRA: 1e-4 to 3e-4
                                  # Full:  1e-6 to 5e-6
    warmup_ratio=0.03,           # 3% of steps
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                   # Better than fp16, use if supported
    gradient_checkpointing=True,  # Trade compute for memory
)
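The "Effective batch = 16" comment above is just the product of the batch knobs; worth computing explicitly when you change any of them:

```python
# Effective batch size = per-device batch x accumulation steps x GPU count
per_device_batch = 4
grad_accum_steps = 4
num_gpus = 1                       # single-GPU assumption
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)             # 16
```

Gradient accumulation trades wall-clock time for memory: the optimizer steps once per 4 forward/backward passes, so the gradient statistics match a true batch of 16.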

LoRA-specific settings

config = LoraConfig(
    r=16,                        # Higher = more capacity, more params
    lora_alpha=32,               # Higher = stronger adaptation
    target_modules=["q_proj", "v_proj"],  # Minimal for most tasks
    modules_to_save=None,        # Add if you modify embeddings/lm_head
    bias="none",                 # Don't train biases
    task_type="CAUSAL_LM"
)

Model selection

| Model | Params | VRAM (QLoRA) | Quality | Notes |
| --- | --- | --- | --- | --- |
| Qwen 2.5 3B | 3B | ~6GB | Good | Best sub-3B, most capable |
| Llama 3.2 3B | 3B | ~6GB | Good | Good community support |
| Gemma 2 2B | 2B | ~4GB | Good | Google’s model, well-documented |
| Qwen 2.5 7B | 7B | ~10GB | Very good | Strong for price |
| Llama 3.1 8B | 8B | ~12GB | Very good | Open weights, large community |
| Mistral Nemo 12B | 12B | ~16GB | Excellent | Best quality under 20B |
| Qwen 2.5 14B | 14B | ~20GB | Excellent | Needs high-end consumer GPU |

Training Tools

| Tool | Best For |
| --- | --- |
| Axolotl | All-in-one, YAML configs, active community |
| LLaMA Factory | Web UI, many algorithms, easy |
| Unsloth | 2-5x faster, 50% less memory, free |
| TRL (HuggingFace) | SFT, DPO, PPO trainers |
| PEFT | LoRA/QLoRA/DoRA implementations |

Evaluation

Benchmarks

  • MT-Bench: Multi-turn dialogue (8 categories)
  • AlpacaEval: Instruction following vs GPT-4
  • HumanEval: Code generation
  • MMLU: 57-task knowledge benchmark

Practical evaluation

from evaluate import load
from datasets import load_dataset
 
# Load test set
test_set = load_dataset("json", data_files="test.jsonl")["train"]
 
# Generate and compare
predictions = []
references = []
for example in test_set:
    pred = generate(example["instruction"], example["input"])  # your inference wrapper (not shown)
    predictions.append(pred)
    references.append(example["output"])
 
# Compute metrics
exact_match = load("exact_match")
results = exact_match.compute(predictions=predictions, references=references)

LLM-as-judge

Use GPT-4 or Claude to evaluate quality when ground truth is unavailable:

Rate this response from 1-5 on:
- Helpfulness: Does it answer the user's question?
- Accuracy: Is the information correct?
- Coherence: Is it well-structured and clear?

Full Pipeline Example

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
import torch
 
# 1. Load and prepare data
dataset = load_dataset("json", data_files="training_data.jsonl")["train"]
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Input:\n{x['input']}\n\n### Response:\n{x['output']}"
})
 
# 2. Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
 
# 3. Prepare for k-bit training and add LoRA
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
))
 
# 4. Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,          # TrainingArguments from the section above
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
 
# 5. Save (for a PEFT model this writes only the LoRA adapter weights)
model.save_pretrained("./fine_tuned_model")

Key Papers

  • LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021; ICLR 2022) · arXiv:2106.09685
  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
  • DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024, ICML) · arXiv:2402.09353
  • LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023) · arXiv:2302.13971