LoRA and PEFT

What

Parameter-Efficient Fine-Tuning (PEFT) methods adapt large models by training only a tiny fraction of their parameters. LoRA is the most popular: freeze the base weights, inject small trainable low-rank matrices, and recover roughly 90-95% of full fine-tuning quality at a fraction of the cost.

The problem

Full fine-tuning produces a complete copy of ALL model weights per task. A 70B model needs ~140GB just for the weights in fp16, and training adds gradients and optimizer states on top, easily 3-4x that in memory. Want to fine-tune for 5 tasks? That's 5 separate 140GB checkpoints.
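
The arithmetic behind those numbers, as a quick sketch (byte counts assume 16-bit everywhere; fp32 master weights and optimizer states in the usual mixed-precision setup push the training figure higher still):

```python
params = 70e9   # 70B parameters

# fp16 weights: 2 bytes each
weights_gb = params * 2 / 1e9
print(f"weights: {weights_gb:.0f} GB")    # 140 GB

# Training adds gradients plus Adam's two moment buffers. Keeping everything
# in 16-bit gives 2 + 2 + 2 + 2 = 8 bytes per parameter.
train_gb = params * 8 / 1e9
print(f"training: {train_gb:.0f} GB")     # 560 GB, 4x the weights alone
```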

LoRA

Freeze the pretrained weight matrix W. Add two small trainable matrices A (d x r) and B (r x d), where the rank r is much smaller than d. Initializing B to zero makes the adapter a no-op at the start of training:

W_new = W + A @ B    # W is frozen, only A and B are trained
# W: (4096 x 4096) = 16M params (frozen)
# A: (4096 x 16)   = 65k params (trainable)
# B: (16 x 4096)   = 65k params (trainable)
# Total trainable: 130k vs 16M = 0.8% of original

Typical rank r = 4 to 64. Higher rank means more capacity but more trainable parameters. LoRA is classically applied to the attention projections (Q, K, V, O), where most of the adaptation benefit is found, though many recipes now also target the MLP layers.
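
A minimal NumPy sketch of a LoRA forward pass, including the alpha/r scaling that libraries like PEFT apply to the low-rank path (the zero init of B follows the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 32

W = rng.standard_normal((d, d)).astype(np.float32)         # frozen pretrained weight
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01  # trainable, small random init
B = np.zeros((r, d), dtype=np.float32)                     # trainable, zero init

def lora_forward(x):
    # y = x W + (alpha / r) * x A B; only A and B receive gradients
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.standard_normal((1, d)).astype(np.float32)
# With B = 0 the adapter is inactive: output matches the base model exactly.
assert np.allclose(lora_forward(x), x @ W)

trainable = A.size + B.size
print(trainable, f"{trainable / W.size:.2%}")   # 131072 0.78%
```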

At inference, merge: W_final = W + A @ B. Single matrix, no extra latency. Swap adapters by swapping A and B — one base model serves many tasks.
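
The merge can be checked numerically. A toy example (the alpha/r scaling factor is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 512, 8
W = rng.standard_normal((d, d)).astype(np.float32)
A = rng.standard_normal((d, r)).astype(np.float32)
B = rng.standard_normal((r, d)).astype(np.float32)
x = rng.standard_normal((3, d)).astype(np.float32)

# Adapter path during training: base matmul plus the low-rank detour.
y_adapter = x @ W + (x @ A) @ B

# Merged path for inference: fold A @ B into W once, then a single matmul.
W_merged = W + A @ B
y_merged = x @ W_merged
assert np.allclose(y_adapter, y_merged, atol=1e-3)

# Unmerge to swap in another task's adapter on the same base model.
W_restored = W_merged - A @ B
assert np.allclose(W_restored, W, atol=1e-3)
```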

QLoRA

Quantize the frozen base model to 4-bit (the NF4 data type), then train LoRA adapters in 16-bit precision on top. The original QLoRA paper fine-tuned a 65B model on a single 48GB GPU this way. The quantization adds minimal degradation because the trainable adapters can compensate for it.
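
A sketch of the idea behind blockwise 4-bit quantization. This uses uniform int4 levels for simplicity; real QLoRA uses the non-uniform NF4 levels (fit to a normal distribution) plus double quantization of the per-block scales, via bitsandbytes:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal(4096).astype(np.float32)   # a flattened weight tensor

block = 64                                          # quantize in small blocks
blocks = W.reshape(-1, block)
scales = np.abs(blocks).max(axis=1, keepdims=True) / 7   # one fp scale per block
q = np.round(blocks / scales).astype(np.int8)            # symmetric range: -7..7

W_deq = (q * scales).reshape(-1).astype(np.float32)      # dequantize on the fly
err = np.abs(W - W_deq).max()
print(f"max abs error: {err:.3f}")   # bounded by half a quantization step
```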

Other PEFT methods

| Method | Idea | Trainable params | Notes |
|---|---|---|---|
| LoRA | Low-rank matrices on attention | ~0.1-1% | Most popular, no inference cost |
| Prefix Tuning | Learnable tokens prepended to keys/values | ~0.1% | No weight modification |
| Adapters | Small bottleneck layers inserted after attention | ~1-3% | Adds slight latency |
| IA3 | Learned scaling vectors on K, V, and FFN | ~0.01% | Even fewer params than LoRA |
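
To make the comparison concrete, a quick count of trainable parameters for one transformer layer, LoRA vs IA3 (the dimensions are Llama-2-7B-like assumptions; r=16 is illustrative):

```python
d, ffn, r = 4096, 11008, 16   # hidden size, FFN size, LoRA rank (assumed dims)

# LoRA on the Q and V projections of one layer: a (d x r) + (r x d) pair each
lora_params = 2 * (d * r + r * d)

# IA3 on one layer: a scaling vector each for K, V, and the FFN activation
ia3_params = d + d + ffn

print(lora_params, ia3_params, f"{lora_params / ia3_params:.0f}x")  # 262144 19200 14x
```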

Code example

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
 
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
 
config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243

Beyond LoRA

DoRA (Weight-Decomposed Low-Rank Adaptation, ICML 2024):

  • Decomposes weights into magnitude + direction, applies LoRA only to direction
  • Consistently outperforms LoRA at same parameter budget
  • Same training cost, better results
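
A NumPy sketch of the DoRA decomposition, following the paper's column-wise norm convention (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 256, 8
W = rng.standard_normal((d, d)).astype(np.float32)

# Decompose W into a per-column magnitude m and a unit-norm direction V.
m = np.linalg.norm(W, axis=0, keepdims=True)   # trainable magnitudes, shape (1, d)
V = W / m                                      # direction

# LoRA is applied to the direction only; m is trained directly.
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01
B = np.zeros((r, d), dtype=np.float32)         # zero init, as in LoRA

V_new = V + A @ B
W_new = m * (V_new / np.linalg.norm(V_new, axis=0, keepdims=True))

# At init (B = 0) the decomposition reconstructs W exactly.
assert np.allclose(W_new, W, atol=1e-4)
```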

QDoRA: DoRA on a 4-bit-quantized base model (the QLoRA recipe applied to DoRA). Reported to cut trainable parameters by ~10,000x while retaining ~99% of full fine-tuning quality.

Other advances:

  • LoRA+: uses a larger learning rate for B than for A, for faster and more stable convergence
  • GaLore: gradient low-rank projection (operates on gradients, not weights)
  • Adaptive rank allocation: different ranks per layer based on importance
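
A toy illustration of the LoRA+ idea on a simple squared-error objective; the 16x learning-rate ratio is one commonly cited setting, not a fixed constant, and plain SGD stands in for Adam:

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 64, 4
A = rng.standard_normal((d, r)) * 0.01   # small random init
B = np.zeros((r, d))                     # zero init, as in LoRA
T = rng.standard_normal((d, d))          # toy regression target for A @ B

def loss():
    return 0.5 * np.sum((A @ B - T) ** 2)

lr_A = 1e-3
lr_B = 16 * lr_A   # LoRA+: a larger step size for B (the ratio is a tunable knob)

loss_before = loss()
for _ in range(100):
    err = A @ B - T
    A -= lr_A * (err @ B.T)
    B -= lr_B * (A.T @ err)
loss_after = loss()
print(round(loss_before, 1), "->", round(loss_after, 1))
```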

Practical guidance (2025):

  • LoRA achieves 90-95% of full fine-tuning quality
  • Minimum viable dataset: 1,000-5,000 high-quality examples
  • Production baseline: 10,000-50,000 examples

Key paper

  • LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)