Quantization
What
Quantization reduces the precision of neural network weights and activations from 32-bit or 16-bit floating-point to lower bit-width representations (INT8, INT4, or even INT2). This shrinks model size, reduces memory bandwidth, and speeds up inference — often with minimal quality degradation.
The core challenge: neural network weights and activations naturally take a wide range of values. Reducing this to a small number of discrete levels introduces quantization error. The goal is to minimize this error while maximizing compression.
Why Quantization Matters
Modern LLMs are too large to run on consumer hardware:
- GPT-4 (estimated 1.8T params): Would need 3.6TB just to store weights in FP16
- Llama 2 70B: 140GB in FP16 — requires multiple A100s
- Llama 2 7B: 14GB in FP16 — barely fits on a single consumer GPU
Quantization enables:
| Model | FP16 | INT8 | INT4 | INT4 + QLoRA |
|---|---|---|---|---|
| 7B | 14GB | 7GB | 3.5GB | ~5GB VRAM |
| 13B | 26GB | 13GB | 6.5GB | ~8GB VRAM |
| 70B | 140GB | 70GB | 35GB | ~40GB VRAM |
A 7B model in INT4 fits on a laptop with integrated graphics. A 70B model in INT4 fits on a single 80GB A100.
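These figures follow directly from parameter count × bits per weight ÷ 8. A quick sanity-check sketch (the helper name is mine; real deployments add overhead for activations, KV cache, and per-group scale factors on top of this):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only storage estimate in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
# 7B @ 16-bit: 14.0 GB
# 7B @ 8-bit: 7.0 GB
# 7B @ 4-bit: 3.5 GB
```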
Number Formats
Floating-Point Formats
| Format | Bits | Range | Use |
|---|---|---|---|
| FP32 | 32 | ±3.4e38 | Full precision (training) |
| FP16 | 16 | ±65504 | Mixed precision (inference) |
| BF16 | 16 | ±3.4e38 | Better range than FP16 (mixed precision training) |
BF16 (Brain Float) keeps FP32's 8-bit exponent (hence the same range) but has only 7 mantissa bits versus FP32's 23, trading precision for range. It was designed specifically for ML and is now standard for training.
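The range difference is easy to see with the Python stdlib: `struct`'s `'e'` format is IEEE binary16 (FP16), and BF16 can be simulated by zeroing the low 16 bits of an FP32 encoding (a sketch, not a real BF16 type):

```python
import struct

def to_bf16(x):
    """Simulate BF16: keep only the top 16 bits of the FP32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

big = 3.0e38                  # fine for FP32/BF16, far above FP16's 65504
print(to_bf16(big))           # survives, with only ~2-3 significant digits
try:
    struct.pack("<e", big)    # 'e' = IEEE binary16 (FP16)
except OverflowError:
    print("overflows FP16")
```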
Integer Formats
| Format | Bits | Range | Use |
|---|---|---|---|
| UINT8 | 8 | 0 to 255 | Activations (non-negative, e.g. post-ReLU) |
| INT8 | 8 | -128 to 127 | Weights (can be negative) |
| INT4 | 4 | -8 to 7 | Weights, extreme compression |
| NF4 | 4 | 16 discrete values | QLoRA, optimized for LLMs |
Quantization Granularity
Per-Tensor (Global)
One scale factor for the entire weight tensor. Fast but coarse — outliers (unusually large values) force the scale up, crushing precision for most values.
Per-Channel (Channel-Wise)
One scale factor per output channel (i.e., per row of the weight matrix). Handles outliers within a tensor better than a single global scale. Standard in modern quantization methods.
Per-Group (Grouped)
One scale factor per group of elements (e.g., 128 elements). The best quality/speed tradeoff for LLMs. Groups can share scale factors with minimal quality loss.
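A pure-Python toy (illustrative values, not a real layer) showing why granularity matters: a single outlier forces a huge per-tensor scale and wipes out the small weights, while per-group scales confine the damage to the outlier's own group:

```python
def quant_dequant(vals, bits=8):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

# Four small weights plus a group of large weights (the outliers)
weights = [0.01, -0.02, 0.015, -0.01, 50.0, 60.0, -55.0, 100.0]

per_tensor = quant_dequant(weights)
per_group = []
for i in range(0, len(weights), 4):            # group size 4
    per_group += quant_dequant(weights[i:i + 4])

rel_err = lambda q: max(abs(a - w) / abs(w) for a, w in zip(q, weights))
print(f"per-tensor worst relative error: {rel_err(per_tensor):.1%}")
print(f"per-group  worst relative error: {rel_err(per_group):.1%}")
```

With one global scale, every small weight rounds to zero (100% relative error); with group-wise scales, the worst relative error stays under 1%.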
Post-Training Quantization (PTQ)
Quantize an already-trained model without any retraining. Fast and simple but introduces quantization error that can’t be corrected.
INT8 Weight-Only Quantization
The simplest approach: quantize weights to INT8 and keep activations in FP16. Only the weights are compressed; they are dequantized back to FP16 on the fly for the actual matmuls.
```python
# Example: naive per-tensor quantization
import torch

def quantize_tensor(x, bits=8):
    # Find scale: map [-max, max] onto the signed integer range
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    x_int = (x / scale).round().clamp(-qmax - 1, qmax)
    return x_int.to(torch.int8), scale

def dequantize_tensor(x_int, scale):
    return x_int.float() * scale
```
GPTQ (2023)
One-shot weight quantization with minimal accuracy loss:
- Uses second-order information (Hessian) to correct quantization error
- Per-channel quantization
- 4-bit weight-only quantization (W4)
- Works on individual weight matrices independently
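The flavor of GPTQ's error correction can be shown with a pure-Python toy. Real GPTQ weights the update by the inverse Hessian estimated from calibration data; this sketch just folds each weight's rounding error into the next weight before it is quantized, so errors cancel instead of accumulating:

```python
def round_naive(w, scale):
    return [round(v / scale) * scale for v in w]

def round_with_feedback(w, scale):
    """Fold each rounding error into the next weight (toy, NOT real GPTQ)."""
    out, carry = [], 0.0
    for v in w:
        q = round((v + carry) / scale) * scale
        carry = (v + carry) - q          # error handed to later weights
        out.append(q)
    return out

w = [0.4] * 6
target = sum(w)                          # layer output for an all-ones input
naive = round_naive(w, scale=1.0)        # coarse 1-unit grid
fb = round_with_feedback(w, scale=1.0)
print(f"naive output error:    {abs(sum(naive) - target):.2f}")   # 2.40
print(f"feedback output error: {abs(sum(fb) - target):.2f}")      # 0.40
```

Naive rounding sends every 0.4 to 0 and the errors pile up; error feedback makes some weights round up so the layer output is preserved much more closely.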
```python
# Using auto-gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("llama-7b", quantize_config)
# ... calibration: pass a list of tokenized representative examples
model.quantize(examples)
```
AWQ (Activation-Aware Weight Quantization, 2024)
Observation: not all weights matter equally. Weights that multiply large activations contribute more to the output, so quantization error there hurts quality most.
- Identifies important weight channels based on activation magnitudes
- Protects these channels with higher precision
- Generally gives better quality than GPTQ at the same bit-width
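The effect is easy to demonstrate in pure Python (toy numbers and my own helper names; real AWQ searches for the per-channel scale on calibration data). Scaling a salient channel's weight up before quantization, and its activation down by the same factor, preserves the product while shrinking the relative rounding error:

```python
def quant_dequant(vals, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

w = [1.0, 0.01]          # channel 1's weight is tiny...
x = [0.1, 100.0]         # ...but multiplies a huge activation
target = sum(wi * xi for wi, xi in zip(w, x))        # true output: 1.1

# Naive INT4: the small-but-important weight rounds to zero
naive = sum(wi * xi for wi, xi in zip(quant_dequant(w), x))

# AWQ-style: scale channel 1's weight up by s, its activation down by s
s = 100.0
w_scaled = [w[0], w[1] * s]
x_scaled = [x[0], x[1] / s]
awq = sum(wi * xi for wi, xi in zip(quant_dequant(w_scaled), x_scaled))

print(abs(naive - target), abs(awq - target))
```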
```python
# Using autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("llama-7b")
tokenizer = AutoTokenizer.from_pretrained("llama-7b")
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
```
SpQR (2023)
Outlier-aware quantization: a small fraction of outlier weights is kept in higher precision as a sparse side structure, while the remaining weights are aggressively quantized (3-4 bits) with near-lossless quality.
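A minimal sketch of the outlier-splitting idea (toy threshold and values; real SpQR identifies outliers per-group via their effect on reconstruction error). Large weights go into a sparse high-precision side table; everything else is quantized hard:

```python
def quant_dequant(vals, bits=3):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

weights = [0.1, -0.2, 0.15, 8.0, -0.1, 0.05]
threshold = 1.0                          # assumed outlier cutoff

# Split: sparse full-precision outliers + densely quantized remainder
outliers = {i: w for i, w in enumerate(weights) if abs(w) > threshold}
dense = [0.0 if i in outliers else w for i, w in enumerate(weights)]
recon = quant_dequant(dense)
for i, w in outliers.items():
    recon[i] = w                         # outliers restored at full precision

plain = quant_dequant(weights)           # no outlier handling
err = lambda q: max(abs(a - b) for a, b in zip(q, weights))
print(err(plain), err(recon))
```

Without the split, the 8.0 outlier sets the scale and every small weight rounds to zero; with it, the small weights keep fine-grained levels.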
Quantization-Aware Training (QAT)
Instead of quantizing after training, simulate quantization effects during training. The model learns to be robust to quantization, resulting in better post-quantization quality.
```python
# In PyTorch (eager-mode QAT API)
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

model.qconfig = get_default_qat_qconfig("fbgemm")
model.train()
prepare_qat(model, inplace=True)   # insert fake quantization nodes
# ... train normally — gradients adjust to be robust to quantization
model.eval()
model_int8 = convert(model)        # convert to a real quantized model
```
QAT quality is better than PTQ but requires full training — expensive for large models.
QLoRA: Quantization + Fine-Tuning
The breakthrough that enabled fine-tuning 70B models on consumer GPUs:
1. Quantize base model to 4-bit NF4 (doesn't require retraining)
2. Freeze quantized weights
3. Add LoRA adapter layers (trained in FP16)
4. Fine-tune adapter on downstream task
Key insight: LoRA adapters are small (1-5% of model size) and can be trained in FP16. The quantized base model stays frozen.
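The four steps above reduce to a simple forward pass. A pure-Python sketch with tiny made-up matrices (real QLoRA dequantizes NF4 blocks on the fly inside CUDA kernels; `alpha`/`r` follow the usual LoRA scaling convention):

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Frozen base weight, stored as integer codes plus one scale
scale = 0.5
W_codes = [[1, -2], [3, 0]]
W = [[c * scale for c in row] for row in W_codes]   # dequantized view

# Trainable rank-1 LoRA adapter: the only weights that receive gradients
r, alpha = 1, 2
A = [[0.1, 0.2]]          # r x in_features
B = [[0.3], [-0.1]]       # out_features x r

x = [1.0, 2.0]
base = matvec(W, x)                       # frozen quantized path
update = matvec(B, matvec(A, x))          # low-rank trainable path
y = [b + (alpha / r) * u for b, u in zip(base, update)]
print(y)
```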
NF4: Normal Float 4-bit
NF4 is specifically designed for normally-distributed values (trained neural network weights are approximately normal). It places its 16 quantization levels non-uniformly, matched to the normal distribution, rather than uniformly like INT4.
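The construction can be sketched with the stdlib's `statistics.NormalDist`: place the 16 levels at evenly spaced quantiles of a standard normal, then renormalize into [-1, 1]. This shows only the idea; the real NF4 codebook in bitsandbytes builds its positive and negative halves separately so that exact zero is one of the 16 values, and the `eps` tail cutoff below is my own:

```python
from statistics import NormalDist

nd = NormalDist()
eps = 0.015                # stay away from the infinite-tail quantiles (assumed)
probs = [eps + (1 - 2 * eps) * i / 15 for i in range(16)]
levels = [nd.inv_cdf(p) for p in probs]
m = max(abs(v) for v in levels)
levels = [v / m for v in levels]          # normalize into [-1, 1]

# Levels cluster near zero, where normally-distributed weights are dense
print([round(v, 3) for v in levels])
```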
```python
# In transformers with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,     # quantize the scale factors too
    bnb_4bit_quant_type="nf4",          # Normal Float 4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("llama-70b", quantization_config=bnb_config)
```
GGUF: llama.cpp Format
GGUF (successor to the older GGML format) is the quantization format used by llama.cpp and tools built on it (Ollama, Jan, etc.):
| Type | Bits | Memory (7B) | Quality | Notes |
|---|---|---|---|---|
| Q8_0 | 8 | ~7GB | ~99% | Near-lossless, large |
| Q6_K | 6 | ~5.5GB | ~97% | Good balance |
| Q5_K_M | 5 | ~4.8GB | ~95% | Good for most use |
| Q4_K_M | 4 | ~4GB | ~92% | Most popular choice |
| Q4_0 | 4 | ~3.9GB | ~90% | Simpler, slightly worse |
| Q3_K_M | 3 | ~3.3GB | ~87% | Memory constrained |
| Q2_K | 2 | ~2.8GB | ~85% | Extreme compression |
The _K_M variants use group-wise quantization with mixed precision (some sensitive layers kept at higher precision).
Inference Speed
Quantization speeds up inference through:
- Reduced memory bandwidth (fewer bytes to read)
- Faster integer arithmetic (INT8 multiply-accumulate)
- Better cache utilization (more weights fit in cache)
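The memory-bandwidth point gives a useful back-of-the-envelope bound: autoregressive decoding at batch size 1 must stream every weight once per token, so tokens/sec is capped at bandwidth ÷ model bytes (my helper name; this ignores compute, KV cache, and batching):

```python
def decode_tokens_per_sec_bound(model_gb, bandwidth_gb_per_s):
    """Upper bound on single-stream decode speed for a memory-bound model."""
    return bandwidth_gb_per_s / model_gb

bw = 100.0                                      # assumed ~100 GB/s memory system
print(decode_tokens_per_sec_bound(14.0, bw))    # 7B in FP16
print(decode_tokens_per_sec_bound(3.5, bw))     # 7B in INT4: 4x the ceiling
```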
Speedup depends on hardware:
| Hardware | INT8 Speedup vs FP16 |
|---|---|
| CPU (AVX2) | 2-4x |
| CPU (AMX/AVX-512) | 3-6x |
| NVIDIA T4 | 2-3x |
| NVIDIA A100/H100 | 2-4x (with TensorCores) |
Practical Guide: Choosing Quantization
For local inference (llama.cpp/Ollama)
Use Q4_K_M as the default. Good balance of quality and size. Q5_K_M if you have the extra RAM and want ~3% quality improvement.
For GPU inference with limited VRAM
- 24GB VRAM (3090/A6000): Q5_K_M for 13B, Q4_K_M for 33B
- 16GB VRAM (3080): Q4_K_M for 7B, Q3_K_M for 13B
- 8GB VRAM (3070/2080): Q4_K_M for 3B, Q3_K_M for 7B
For fine-tuning with QLoRA
Always use NF4 (bitsandbytes) for the base model. Train LoRA adapters in FP16 or BF16.
For server deployment
GPTQ or AWQ at INT4 with tensor parallelism across multiple GPUs. AWQ generally gives better quality per bit.
Key Papers
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2023, ICLR) · arXiv:2210.17323
- AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2024, MLSys) · arXiv:2306.00978
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., 2023) · arXiv:2306.03078
Links
- Fine-Tuning LLMs — QLoRA combines quantization with fine-tuning
- Knowledge Distillation — another model compression technique
- LoRA and PEFT — the fine-tuning adapter approach