Quantization

What

Quantization reduces the precision of neural network weights and activations from 32-bit or 16-bit floating-point to lower bit-width representations (INT8, INT4, or even INT2). This shrinks model size, reduces memory bandwidth, and speeds up inference — often with minimal quality degradation.

The core challenge: neural network weights and activations naturally take a wide range of values. Reducing this to a small number of discrete levels introduces quantization error. The goal is to minimize this error while maximizing compression.

Why Quantization Matters

Modern LLMs are too large to run on consumer hardware:

  • GPT-4 (estimated 1.8T params): Would need 3.6TB just to store weights in FP16
  • Llama 2 70B: 140GB in FP16 — requires multiple A100s
  • Llama 2 7B: 14GB in FP16 — barely fits on a single consumer GPU

Quantization enables:

| Model | FP16  | INT8 | INT4  | INT4 + QLoRA |
|-------|-------|------|-------|--------------|
| 7B    | 14GB  | 7GB  | 3.5GB | ~5GB VRAM    |
| 13B   | 26GB  | 13GB | 6.5GB | ~8GB VRAM    |
| 70B   | 140GB | 70GB | 35GB  | ~40GB VRAM   |

A 7B model in INT4 fits on a laptop with integrated graphics. A 70B model in INT4 fits on a single 80GB A100.
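The table values are simple arithmetic (parameter count × bits per weight); a quick sketch for sanity-checking memory requirements:

```python
# Back-of-envelope memory for storing weights alone; ignores KV cache,
# activations, and runtime overhead (which is why the QLoRA column
# adds roughly 1.5GB on top of the raw 4-bit weights).
def weight_memory_gb(params_billion, bits_per_weight):
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9  # decimal GB, matching the table

print(weight_memory_gb(7, 16))   # 7B in FP16  -> 14.0
print(weight_memory_gb(7, 4))    # 7B in INT4  -> 3.5
print(weight_memory_gb(70, 4))   # 70B in INT4 -> 35.0
```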

Number Formats

Floating-Point Formats

| Format | Bits | Range   | Use                                              |
|--------|------|---------|--------------------------------------------------|
| FP32   | 32   | ±3.4e38 | Full precision (training)                        |
| FP16   | 16   | ±65504  | Mixed precision (inference)                      |
| BF16   | 16   | ±3.4e38 | Better range than FP16 (mixed-precision training) |

BF16 (Brain Float) keeps the same 8-bit exponent as FP32 but only 7 mantissa bits, trading precision for FP32's full dynamic range. It was designed specifically for ML and is now standard for training.

Integer Formats

| Format | Bits | Range              | Use                           |
|--------|------|--------------------|-------------------------------|
| UINT8  | 8    | 0 to 255           | Activations (always positive) |
| INT8   | 8    | -128 to 127        | Weights (can be negative)     |
| INT4   | 4    | -8 to 7            | Weights, extreme compression  |
| NF4    | 4    | 16 discrete values | QLoRA, optimized for LLMs     |

Quantization Granularity

Per-Tensor (Global)

One scale factor for the entire weight tensor. Fast but coarse — outliers (unusually large values) force the scale up, crushing precision for most values.

Per-Channel (Channel-Wise)

One scale factor per output channel (e.g., per row of the weight matrix). Better handles outliers within a tensor. Standard in modern quantization methods.

Per-Group (Grouped)

One scale factor per group of elements (e.g., 128 elements). The best quality/speed tradeoff for LLMs. Groups can share scale factors with minimal quality loss.
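A minimal sketch of symmetric per-group quantization, in pure Python for clarity (real kernels pack two 4-bit values per byte and store scales in FP16):

```python
# Symmetric per-group quantization: each group of `group_size` values
# shares one scale, so an outlier only degrades its own group.
def quantize_grouped(weights, group_size=128, qmax=7):  # qmax=7 -> INT4
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard all-zero group
        scales.append(scale)
        q.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return q, scales

def dequantize_grouped(q, scales, group_size=128):
    return [v * scales[i // group_size] for i, v in enumerate(q)]

# One outlier (8.0) lands in its own group, leaving the first group's
# small scale (and hence its precision) untouched.
w = [0.01 * i for i in range(-64, 64)] + [8.0]
q, scales = quantize_grouped(w)
w_hat = dequantize_grouped(q, scales)
max_err = max(abs(a - b) for a, b in zip(w[:128], w_hat[:128]))
print(max_err)  # bounded by scale/2, about 0.046 here
```

With per-tensor scaling, the 8.0 outlier would force scale = 8/7 on all 129 values and wipe out most of the small values' precision.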

Post-Training Quantization (PTQ)

Quantize an already-trained model without any retraining. Fast and simple but introduces quantization error that can’t be corrected.

INT8 Weight-Only Quantization

The simplest approach: quantize weights to INT8, keep activations in FP16. Only weights benefit from compression.

# Example: naive symmetric per-tensor quantization
import torch

def quantize_tensor(x, bits=8):
    # Map [-max|x|, max|x|] onto the signed integer grid,
    # e.g. [-127, 127] for 8 bits
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    x_int = (x / scale).round().clamp(-qmax - 1, qmax)
    return x_int.to(torch.int8), scale

def dequantize_tensor(x_int, scale):
    return x_int.float() * scale

GPTQ (2023)

One-shot weight quantization with minimal accuracy loss:

  • Uses second-order information (Hessian) to correct quantization error
  • Per-channel quantization
  • 4-bit weight-only quantization (W4)
  • Works on individual weight matrices independently
# Using auto-gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("llama-7b", quantize_config)
# Calibrate on a small set of representative tokenized examples
model.quantize(examples)

AWQ (Activation-Aware Weight Quantization, 2024)

Observation: not all weights matter equally. Weight channels whose input activations are consistently large contribute most to model quality.

  • Identifies important weight channels based on activation magnitudes
  • Protects these channels with higher precision
  • Typically yields better quality than GPTQ at the same bit-width
# Using autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("llama-7b")
tokenizer = AutoTokenizer.from_pretrained("llama-7b")
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model.quantize(tokenizer, quant_config=quant_config)

SpQR (2023)

Outlier-aware quantization: a small fraction of outlier weights is stored at higher precision, while the rest of the weights are aggressively quantized to 3-4 bits, approaching near-lossless compression.

Quantization-Aware Training (QAT)

Instead of quantizing after training, simulate quantization effects during training. The model learns to be robust to quantization, resulting in better post-quantization quality.

# QAT in PyTorch (eager-mode quantization)
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

# Attach a QAT config and insert fake-quantization observers
model.qconfig = get_default_qat_qconfig("fbgemm")
model = prepare_qat(model.train())
# ... train normally: gradients adjust weights to be robust to quantization
model_int8 = convert(model.eval())  # convert to a real quantized model

QAT quality is better than PTQ but requires full training — expensive for large models.
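The fake-quantization nodes inserted during QAT round values in the forward pass but pretend rounding is the identity in the backward pass (the straight-through estimator, STE). A scalar sketch of the idea, numerical rather than autograd-based:

```python
# Fake quantization as used in QAT: the forward pass snaps values onto
# the integer grid, so the loss sees realistic quantization error.
def fake_quant(x, scale, qmin=-128, qmax=127):
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

# Straight-through estimator: rounding has zero gradient almost everywhere,
# so QAT approximates d(fake_quant)/dx as 1 inside the representable range
# and 0 outside, letting gradients flow through the rounding step.
def fake_quant_grad_ste(x, scale, qmin=-128, qmax=127):
    return 1.0 if qmin * scale <= x <= qmax * scale else 0.0
```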

QLoRA: Quantization + Fine-Tuning

The breakthrough that enabled fine-tuning 70B models on consumer GPUs:

1. Quantize base model to 4-bit NF4 (doesn't require retraining)
2. Freeze quantized weights
3. Add LoRA adapter layers (trained in FP16)
4. Fine-tune adapter on downstream task

Key insight: LoRA adapters are small (1-5% of model size) and can be trained in FP16. The quantized base model stays frozen.
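The small adapter size follows from the LoRA construction: a rank-r adapter on a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters next to d_in·d_out frozen ones. A quick check with a hypothetical 4096×4096 projection:

```python
# Fraction of a single weight matrix that a rank-r LoRA adapter adds.
def lora_fraction(d_in, d_out, r):
    return r * (d_in + d_out) / (d_in * d_out)

print(f"{lora_fraction(4096, 4096, 16):.2%}")   # 0.78% at rank 16
print(f"{lora_fraction(4096, 4096, 64):.2%}")   # ~3.1% at rank 64
```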

NF4: Normal Float 4-bit

NF4 is specifically designed for normally distributed data, which trained neural network weights approximately follow. It uses non-uniform quantization levels optimized for the normal distribution, rather than the uniform levels of INT4.
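The idea can be sketched with the standard library's NormalDist: place the 16 levels at equally probable quantiles of N(0, 1) and normalize to [-1, 1]. (The actual bitsandbytes grid is constructed slightly differently, reserving an exact zero level, so treat this as an illustration rather than the real NF4 table.)

```python
from statistics import NormalDist

# Illustration of normal-float quantization levels: equally probable
# quantiles of N(0, 1), rescaled so the grid spans [-1, 1]. Levels
# cluster near zero, where normally distributed weights concentrate.
def normal_float_levels(n_levels=16):
    nd = NormalDist()
    # offset by 0.5/n to avoid the infinite 0% and 100% quantiles
    qs = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

levels = normal_float_levels()
# Spacing near zero is much finer than near the extremes:
print(levels[8] - levels[7], levels[15] - levels[14])
```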

# Loading in 4-bit with transformers + bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,    # Quantize the scale factors too
    bnb_4bit_quant_type="nf4",         # Normal Float 4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("llama-70b", quantization_config=bnb_config)

GGUF: llama.cpp Format

GGUF (the successor to the older GGML format) is the quantization format used by llama.cpp and tools built on it (Ollama, Jan, etc.):

| Type   | Bits | Memory (7B) | Quality | Notes                  |
|--------|------|-------------|---------|------------------------|
| Q8_0   | 8    | ~7GB        | ~99%    | Near-lossless, large   |
| Q6_K   | 6    | ~5.5GB      | ~97%    | Good balance           |
| Q5_K_M | 5    | ~4.8GB      | ~95%    | Good for most use      |
| Q4_K_M | 4    | ~4GB        | ~92%    | Most popular choice    |
| Q4_0   | 4    | ~3.9GB      | ~90%    | Simpler, slightly worse |
| Q3_K_M | 3    | ~3.3GB      | ~87%    | Memory constrained     |
| Q2_K   | 2    | ~2.8GB      | ~85%    | Extreme compression    |

The _K_M variants use group-wise quantization with mixed precision (some sensitive layers kept at higher precision).

Inference Speed

Quantization speeds up inference through:

  • Reduced memory bandwidth (fewer bytes to read)
  • Faster integer arithmetic (INT8 multiply-accumulate)
  • Better cache utilization (more weights fit in cache)

Speedup depends on hardware:

| Hardware          | INT8 speedup vs FP16      |
|-------------------|---------------------------|
| CPU (AVX2)        | 2-4x                      |
| CPU (AMX/AVX-512) | 3-6x                      |
| NVIDIA T4         | 2-3x                      |
| NVIDIA A100/H100  | 2-4x (with Tensor Cores)  |
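For batch-1 autoregressive decoding, the first bullet dominates: every generated token reads essentially all the weights once, so throughput is roughly memory bandwidth divided by weight bytes. A back-of-envelope sketch with a hypothetical ~1000 GB/s GPU:

```python
# Upper-bound decode speed when memory-bandwidth bound (batch size 1):
# each token streams the whole weight tensor through the memory bus.
def decode_ceiling_tok_s(bandwidth_gb_s, params_billion, bits_per_weight):
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb

# Hypothetical ~1000 GB/s GPU running a 7B model:
print(decode_ceiling_tok_s(1000, 7, 16))  # FP16 ceiling: ~71 tok/s
print(decode_ceiling_tok_s(1000, 7, 4))   # INT4 ceiling: ~286 tok/s
```

Real throughput lands below these ceilings (compute, KV-cache reads, dequantization overhead), but the 4x ratio between FP16 and INT4 is why quantization speeds up local inference so dramatically.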

Practical Guide: Choosing Quantization

For local inference (llama.cpp/Ollama)

Use Q4_K_M as the default. Good balance of quality and size. Q5_K_M if you have the extra RAM and want ~3% quality improvement.

For GPU inference with limited VRAM

  • 24GB VRAM (3090/A6000): Q5_K_M for 13B, Q4_K_M for 33B
  • 16GB VRAM (3080): Q4_K_M for 7B, Q3_K_M for 13B
  • 8GB VRAM (3070/2080): Q4_K_M for 3B, Q3_K_M for 7B

For fine-tuning with QLoRA

Always use NF4 (bitsandbytes) for the base model. Train LoRA adapters in FP16 or BF16.

For server deployment

GPTQ or AWQ at INT4 with tensor parallelism across multiple GPUs. AWQ generally gives better quality per bit.

Key Papers

  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2023, ICLR) · arXiv:2210.17323
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2024, MLSys) · arXiv:2306.00978
  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., 2023) · arXiv:2306.03078