LoRA and PEFT
What
Parameter-Efficient Fine-Tuning (PEFT) methods adapt large models by training only a tiny fraction of their parameters. LoRA is the most popular: freeze the base weights, inject small trainable low-rank matrices, and recover roughly 90-95% of full fine-tuning quality at a fraction of the cost.
The problem
Full fine-tuning updates every weight, so each task needs its own full copy of the model. A 70B model needs ~140GB just for the weights in fp16 (2 bytes per parameter). Training also requires gradients and optimizer states, easily 3-4x that in memory. Want to fine-tune for 5 tasks? That's 5 separate 140GB models.
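The arithmetic is worth making concrete. A back-of-envelope sketch, assuming fp16 weights and gradients, fp32 Adam moments (8 bytes per trainable parameter), and the ~0.8% trainable fraction from the LoRA example below; real numbers vary with activations, sequence length, and optimizer choice:

```python
# Rough memory math for full fine-tuning vs. LoRA (assumptions in comments).

def full_finetune_gb(params: float) -> float:
    weights = params * 2    # fp16 weights: 2 bytes/param
    grads = params * 2      # fp16 gradients for every param
    optimizer = params * 8  # Adam: fp32 momentum + variance (4 + 4 bytes)
    return (weights + grads + optimizer) / 1e9

def lora_finetune_gb(params: float, trainable_frac: float = 0.008) -> float:
    weights = params * 2                  # frozen fp16 base: no grads, no optimizer
    trainable = params * trainable_frac   # only the adapters train
    grads = trainable * 2
    optimizer = trainable * 8
    return (weights + grads + optimizer) / 1e9

print(f"full fine-tune, 70B: {full_finetune_gb(70e9):.0f} GB")  # 840 GB
print(f"LoRA fine-tune, 70B: {lora_finetune_gb(70e9):.0f} GB")  # 146 GB
```

Even before quantization, freezing the base collapses the gradient and optimizer cost to nearly nothing; the frozen weights themselves become the dominant term (which is what QLoRA then attacks).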
LoRA
Freeze the pretrained weight matrix W. Add two small matrices A (d x r) and B (r x d) where r is much smaller than d:
```python
W_new = W + A @ B   # W is frozen; only A and B are trained

# W: (4096 x 4096) = 16M params (frozen)
# A: (4096 x 16)   = 65k params (trainable)
# B: (16 x 4096)   = 65k params (trainable)
# Total trainable: 130k vs 16M = 0.8% of the original
```
Typical ranks are r = 4 to 64; higher rank means more capacity but more trainable parameters. LoRA is usually applied to the attention projections (Q, K, V, O), where adaptation matters most.
At inference, merge: W_final = W + A @ B. Single matrix, no extra latency. Swap adapters by swapping A and B — one base model serves many tasks.
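The whole mechanism fits in a few lines of numpy. A minimal sketch (the lora_alpha / r scaling used by real implementations is omitted, and B is given a small random value here so the merge check is non-trivial):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 16

W = rng.standard_normal((d, d))          # pretrained weight, frozen
A = rng.standard_normal((d, r)) * 0.01   # trainable
B = rng.standard_normal((r, d)) * 0.01   # trainable (real LoRA inits B to zero)

x = rng.standard_normal(d)

# Training-time forward pass: frozen path plus low-rank path
y_lora = x @ W + (x @ A) @ B

# Inference-time merge: fold the adapter into the base weight
W_merged = W + A @ B
y_merged = x @ W_merged

assert np.allclose(y_lora, y_merged)     # same outputs, no extra latency

trainable = A.size + B.size
print(trainable, W.size, f"{trainable / W.size:.1%}")  # 131072 16777216 0.8%
```

Note that the low-rank path costs two thin matmuls during training, but after merging, inference is a single d x d matmul, exactly as if the model had been fully fine-tuned.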
QLoRA
Quantize the frozen base model to 4-bit (the NF4 data type), then train LoRA adapters in fp16 on top. This lets you fine-tune a 65B model on a single 48GB GPU, as demonstrated in the QLoRA paper. The quantization adds minimal degradation because the adapters, which stay in higher precision, compensate.
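A toy illustration of the idea, using uniform blockwise absmax quantization rather than the actual NF4 codebook (NF4's levels are spaced to match a normal weight distribution, but the structure is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(w, block=64):
    """Blockwise absmax quantization to 4-bit signed integers (-8..7).
    Simplified stand-in for NF4, which uses a non-uniform codebook."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7   # one scale per block
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

W = rng.standard_normal((256, 256)).astype(np.float32)  # toy "frozen base" weight
q, scale = quantize_4bit(W)
W_hat = dequantize(q, scale, W.shape)

err = np.abs(W - W_hat).mean()
print(f"mean abs quantization error: {err:.4f}")  # small but nonzero
# In QLoRA, the fp16 adapters A @ B train on top of W_hat, absorbing the
# task signal (and some of this quantization error) in full precision.
```

Storage-wise, each weight shrinks from 16 bits to 4 bits plus a shared per-block scale, which is where the ~4x memory reduction on the frozen base comes from.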
Other PEFT methods
| Method | Idea | Trainable params | Notes |
|---|---|---|---|
| LoRA | Low-rank matrices on attention | ~0.1-1% | Most popular, no inference cost |
| Prefix Tuning | Learnable tokens prepended to keys/values | ~0.1% | No weight modification |
| Adapters | Small bottleneck layers inserted after attention | ~1-3% | Adds slight latency |
| IA3 | Learned scaling vectors on K, V, and FFN | ~0.01% | Even fewer params than LoRA |
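For contrast with LoRA's low-rank matrices, IA3's entire mechanism is elementwise rescaling. A single-head sketch (toy dimensions; at real model scale the two learned vectors are a vanishingly small fraction of the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Frozen attention projections
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# IA3's only trainable parameters here: per-dimension scaling vectors,
# initialized to ones so training starts from the unmodified frozen model
l_k = np.ones(d)
l_v = np.ones(d)

x = rng.standard_normal((10, d))   # 10 tokens

q = x @ W_q
k = (x @ W_k) * l_k                # rescale keys elementwise
v = (x @ W_v) * l_v                # rescale values elementwise

scores = q @ k.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v

# Trainable: 2*d scaling entries vs. 3*d*d frozen projection weights
print(2 * d, "vs", 3 * d * d)
```

The full method also scales the FFN activations the same way; the point of the sketch is that at init (all-ones vectors) the frozen model is recovered exactly, and training only moves the scales.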
Code example
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                  # rank
    lora_alpha=32,                         # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: ~0.12
# (32 layers x 2 target modules x (4096*16 + 16*4096) adapter params)
```

Beyond LoRA
DoRA (Weight-Decomposed Low-Rank Adaptation, 2024, ICML):
- Decomposes weights into magnitude + direction, applies LoRA only to direction
- Consistently outperforms LoRA at same parameter budget
- Same training cost, better results
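The decomposition is easy to state in numpy. A sketch using column-wise magnitudes, with B zero-initialized so the adapted model starts exactly at the frozen weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

W = rng.standard_normal((d, d))                  # pretrained weight, frozen

# DoRA: decompose into per-column magnitude and unit-norm direction
m = np.linalg.norm(W, axis=0, keepdims=True)     # trainable magnitude (1 x d)
A = rng.standard_normal((d, r)) * 0.01           # trainable low-rank factors...
B = np.zeros((r, d))                             # ...applied to the direction only

V = W + A @ B                                    # low-rank update, pre-normalization
direction = V / np.linalg.norm(V, axis=0, keepdims=True)
W_dora = m * direction                           # recombine magnitude x direction

# At init (B = 0) the decomposition reproduces the frozen weight exactly
assert np.allclose(W_dora, W)
```

Separating magnitude from direction lets training adjust the two independently, which is the paper's explanation for why DoRA's updates track full fine-tuning more closely than plain LoRA's.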
QDoRA: DoRA on a 4-bit quantized base, QLoRA-style. Trains ~10,000x fewer parameters than full fine-tuning while maintaining ~99% of its quality.
Other advances:
- LoRA+: different learning rates for A and B matrices
- GaLore: gradient low-rank projection (operates on gradients, not weights)
- Adaptive rank allocation: different ranks per layer based on importance
Practical guidance (2025):
- LoRA achieves 90-95% of full fine-tuning quality
- Minimum viable dataset: 1,000-5,000 high-quality examples
- Production baseline: 10,000-50,000 examples
Key paper
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
Links
- Fine-Tuning LLMs — full fine-tuning that LoRA replaces
- Quantization — QLoRA combines quantization with LoRA
- Knowledge Distillation — another efficiency technique
- Key Papers