Quantization
What
Quantization reduces the precision of neural network weights and activations from 32-bit or 16-bit floating-point to lower bit-width representations (INT8, INT4, or even INT2). This shrinks model size, reduces memory bandwidth, and speeds up inference — often with minimal quality degradation.
The core challenge: neural network weights and activations naturally take a wide range of values. Reducing this to a small number of discrete levels introduces quantization error. The goal is to minimize this error while maximizing compression.
Why Quantization Matters
Modern LLMs are too large to run on consumer hardware:
- GPT-4 (estimated 1.8T params): Would need 3.6TB just to store weights in FP16
- Llama 2 70B: 140GB in FP16 — requires multiple A100s
- Llama 2 7B: 14GB in FP16 — barely fits on a single consumer GPU
Quantization enables:
| Model | FP16 | INT8 | INT4 | INT4 + QLoRA |
|---|---|---|---|---|
| 7B | 14GB | 7GB | 3.5GB | ~5GB VRAM |
| 13B | 26GB | 13GB | 6.5GB | ~8GB VRAM |
| 70B | 140GB | 70GB | 35GB | ~40GB VRAM |
A 7B model in INT4 fits on a laptop with integrated graphics. A 70B model in INT4 fits on a single 80GB A100.
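These figures follow directly from parameter count × bits per weight ÷ 8. A quick sanity-check sketch (the helper name is mine; real deployments add overhead for activations, KV cache, and per-group scale factors on top of this):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only storage estimate in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
# 7B @ 16-bit: 14.0 GB
# 7B @ 8-bit: 7.0 GB
# 7B @ 4-bit: 3.5 GB
```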
Number Formats
Floating-Point Formats
| Format | Bits | Range | Use |
|---|---|---|---|
| FP32 | 32 | ±3.4e38 | Full precision (training) |
| FP16 | 16 | ±65504 | Mixed precision (inference) |
| BF16 | 16 | ±3.4e38 | Better range than FP16 (mixed precision training) |
BF16 (Brain Float) keeps FP32's 8-bit exponent (hence the same range) but has only 7 mantissa bits versus FP32's 23, trading precision for range. It was designed specifically for ML and is now standard for training.
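The range difference is easy to see with the Python stdlib: `struct`'s `'e'` format is IEEE binary16 (FP16), and BF16 can be simulated by zeroing the low 16 bits of an FP32 encoding (a sketch, not a real BF16 type):

```python
import struct

def to_bf16(x):
    """Simulate BF16: keep only the top 16 bits of the FP32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

big = 3.0e38                  # fine for FP32/BF16, far above FP16's 65504
print(to_bf16(big))           # survives, with only ~2-3 significant digits
try:
    struct.pack("<e", big)    # 'e' = IEEE binary16 (FP16)
except OverflowError:
    print("overflows FP16")
```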
Integer Formats
| Format | Bits | Range | Use |
|---|---|---|---|
| UINT8 | 8 | 0 to 255 | Activations (non-negative, e.g. post-ReLU) |
| INT8 | 8 | -128 to 127 | Weights (can be negative) |
| INT4 | 4 | -8 to 7 | Weights, extreme compression |
| NF4 | 4 | 16 discrete values | QLoRA, optimized for LLMs |
Quantization Granularity
Per-Tensor (Global)
One scale factor for the entire weight tensor. Fast but coarse — outliers (unusually large values) force the scale up, crushing precision for most values.
Per-Channel (Channel-Wise)
One scale factor per output channel (i.e., per row of the weight matrix). Handles outliers within a tensor better than a single global scale. Standard in modern quantization methods.
Per-Group (Grouped)
One scale factor per group of elements (e.g., 128 elements). The best quality/speed tradeoff for LLMs. Groups can share scale factors with minimal quality loss.
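A pure-Python toy (illustrative values, not a real layer) showing why granularity matters: a single outlier forces a huge per-tensor scale and wipes out the small weights, while per-group scales confine the damage to the outlier's own group:

```python
def quant_dequant(vals, bits=8):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

# Four small weights plus a group of large weights (the outliers)
weights = [0.01, -0.02, 0.015, -0.01, 50.0, 60.0, -55.0, 100.0]

per_tensor = quant_dequant(weights)
per_group = []
for i in range(0, len(weights), 4):            # group size 4
    per_group += quant_dequant(weights[i:i + 4])

rel_err = lambda q: max(abs(a - w) / abs(w) for a, w in zip(q, weights))
print(f"per-tensor worst relative error: {rel_err(per_tensor):.1%}")
print(f"per-group  worst relative error: {rel_err(per_group):.1%}")
```

With one global scale, every small weight rounds to zero (100% relative error); with group-wise scales, the worst relative error stays under 1%.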
Post-Training Quantization (PTQ)
Quantize an already-trained model without any retraining. Fast and simple but introduces quantization error that can’t be corrected.
INT8 Weight-Only Quantization
The simplest approach: quantize weights to INT8 and keep activations in FP16. Only the weights are compressed; they are dequantized back to FP16 on the fly for the actual matmuls.
```python
# Example: naive per-tensor quantization
import torch

def quantize_tensor(x, bits=8):
    # Find scale: map [-max, max] onto the signed integer range
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    x_int = (x / scale).round().clamp(-qmax - 1, qmax)
    return x_int.to(torch.int8), scale

def dequantize_tensor(x_int, scale):
    return x_int.float() * scale
```
GPTQ (2023)
One-shot weight quantization with minimal accuracy loss:
- Uses second-order information (Hessian) to correct quantization error
- Per-channel quantization
- 4-bit weight-only quantization (W4)
- Works on individual weight matrices independently
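The flavor of GPTQ's error correction can be shown with a pure-Python toy. Real GPTQ weights the update by the inverse Hessian estimated from calibration data; this sketch just folds each weight's rounding error into the next weight before it is quantized, so errors cancel instead of accumulating:

```python
def round_naive(w, scale):
    return [round(v / scale) * scale for v in w]

def round_with_feedback(w, scale):
    """Fold each rounding error into the next weight (toy, NOT real GPTQ)."""
    out, carry = [], 0.0
    for v in w:
        q = round((v + carry) / scale) * scale
        carry = (v + carry) - q          # error handed to later weights
        out.append(q)
    return out

w = [0.4] * 6
target = sum(w)                          # layer output for an all-ones input
naive = round_naive(w, scale=1.0)        # coarse 1-unit grid
fb = round_with_feedback(w, scale=1.0)
print(f"naive output error:    {abs(sum(naive) - target):.2f}")   # 2.40
print(f"feedback output error: {abs(sum(fb) - target):.2f}")      # 0.40
```

Naive rounding sends every 0.4 to 0 and the errors pile up; error feedback makes some weights round up so the layer output is preserved much more closely.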
```python
# Using auto-gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("llama-7b", quantize_config)
# ... calibration: pass a list of tokenized representative examples
model.quantize(examples)
```
AWQ (Activation-Aware Weight Quantization, 2024)
Observation: not all weights matter equally. Weights that multiply large activations contribute more to the output, so quantization error there hurts quality most.
- Identifies important weight channels based on activation magnitudes
- Protects these channels with higher precision
- Generally gives better quality than GPTQ at the same bit-width
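The effect is easy to demonstrate in pure Python (toy numbers and my own helper names; real AWQ searches for the per-channel scale on calibration data). Scaling a salient channel's weight up before quantization, and its activation down by the same factor, preserves the product while shrinking the relative rounding error:

```python
def quant_dequant(vals, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

w = [1.0, 0.01]          # channel 1's weight is tiny...
x = [0.1, 100.0]         # ...but multiplies a huge activation
target = sum(wi * xi for wi, xi in zip(w, x))        # true output: 1.1

# Naive INT4: the small-but-important weight rounds to zero
naive = sum(wi * xi for wi, xi in zip(quant_dequant(w), x))

# AWQ-style: scale channel 1's weight up by s, its activation down by s
s = 100.0
w_scaled = [w[0], w[1] * s]
x_scaled = [x[0], x[1] / s]
awq = sum(wi * xi for wi, xi in zip(quant_dequant(w_scaled), x_scaled))

print(abs(naive - target), abs(awq - target))
```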
```python
# Using autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("llama-7b")
tokenizer = AutoTokenizer.from_pretrained("llama-7b")
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
```
SpQR (2023)
Outlier-aware quantization: a small fraction of outlier weights is kept in higher precision as a sparse side structure, while the remaining weights are aggressively quantized (3-4 bits) with near-lossless quality.
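A minimal sketch of the outlier-splitting idea (toy threshold and values; real SpQR identifies outliers per-group via their effect on reconstruction error). Large weights go into a sparse high-precision side table; everything else is quantized hard:

```python
def quant_dequant(vals, bits=3):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

weights = [0.1, -0.2, 0.15, 8.0, -0.1, 0.05]
threshold = 1.0                          # assumed outlier cutoff

# Split: sparse full-precision outliers + densely quantized remainder
outliers = {i: w for i, w in enumerate(weights) if abs(w) > threshold}
dense = [0.0 if i in outliers else w for i, w in enumerate(weights)]
recon = quant_dequant(dense)
for i, w in outliers.items():
    recon[i] = w                         # outliers restored at full precision

plain = quant_dequant(weights)           # no outlier handling
err = lambda q: max(abs(a - b) for a, b in zip(q, weights))
print(err(plain), err(recon))
```

Without the split, the 8.0 outlier sets the scale and every small weight rounds to zero; with it, the small weights keep fine-grained levels.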
Quantization-Aware Training (QAT)
Instead of quantizing after training, simulate quantization effects during training. The model learns to be robust to quantization, resulting in better post-quantization quality.
```python
# In PyTorch (eager-mode QAT API)
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

model.qconfig = get_default_qat_qconfig("fbgemm")
model.train()
prepare_qat(model, inplace=True)   # insert fake quantization nodes
# ... train normally — gradients adjust to be robust to quantization
model.eval()
model_int8 = convert(model)        # convert to a real quantized model
```
QAT quality is better than PTQ but requires full training — expensive for large models.
QLoRA: Quantization + Fine-Tuning
The breakthrough that enabled fine-tuning 70B models on consumer GPUs:
1. Quantize base model to 4-bit NF4 (doesn't require retraining)
2. Freeze quantized weights
3. Add LoRA adapter layers (trained in FP16)
4. Fine-tune adapter on downstream task
Key insight: LoRA adapters are small (1-5% of model size) and can be trained in FP16. The quantized base model stays frozen.
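The four steps above reduce to a simple forward pass. A pure-Python sketch with tiny made-up matrices (real QLoRA dequantizes NF4 blocks on the fly inside CUDA kernels; `alpha`/`r` follow the usual LoRA scaling convention):

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Frozen base weight, stored as integer codes plus one scale
scale = 0.5
W_codes = [[1, -2], [3, 0]]
W = [[c * scale for c in row] for row in W_codes]   # dequantized view

# Trainable rank-1 LoRA adapter: the only weights that receive gradients
r, alpha = 1, 2
A = [[0.1, 0.2]]          # r x in_features
B = [[0.3], [-0.1]]       # out_features x r

x = [1.0, 2.0]
base = matvec(W, x)                       # frozen quantized path
update = matvec(B, matvec(A, x))          # low-rank trainable path
y = [b + (alpha / r) * u for b, u in zip(base, update)]
print(y)
```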
NF4: Normal Float 4-bit
NF4 is specifically designed for normally-distributed values (trained neural network weights are approximately normal). It places its 16 quantization levels non-uniformly, matched to the normal distribution, rather than uniformly like INT4.
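The construction can be sketched with the stdlib's `statistics.NormalDist`: place the 16 levels at evenly spaced quantiles of a standard normal, then renormalize into [-1, 1]. This shows only the idea; the real NF4 codebook in bitsandbytes builds its positive and negative halves separately so that exact zero is one of the 16 values, and the `eps` tail cutoff below is my own:

```python
from statistics import NormalDist

nd = NormalDist()
eps = 0.015                # stay away from the infinite-tail quantiles (assumed)
probs = [eps + (1 - 2 * eps) * i / 15 for i in range(16)]
levels = [nd.inv_cdf(p) for p in probs]
m = max(abs(v) for v in levels)
levels = [v / m for v in levels]          # normalize into [-1, 1]

# Levels cluster near zero, where normally-distributed weights are dense
print([round(v, 3) for v in levels])
```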
```python
# In transformers with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,     # quantize the scale factors too
    bnb_4bit_quant_type="nf4",          # Normal Float 4
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("llama-70b", quantization_config=bnb_config)
```
GGUF: llama.cpp Format
GGUF (successor to the older GGML format) is the quantization format used by llama.cpp and tools built on it (Ollama, Jan, etc.):
| Type | Bits | Memory (7B) | Quality | Notes |
|---|---|---|---|---|
| Q8_0 | 8 | ~7GB | ~99% | Near-lossless, large |
| Q6_K | 6 | ~5.5GB | ~97% | Good balance |
| Q5_K_M | 5 | ~4.8GB | ~95% | Good for most use |
| Q4_K_M | 4 | ~4GB | ~92% | Most popular choice |
| Q4_0 | 4 | ~3.9GB | ~90% | Simpler, slightly worse |
| Q3_K_M | 3 | ~3.3GB | ~87% | Memory constrained |
| Q2_K | 2 | ~2.8GB | ~85% | Extreme compression |
The _K_M variants use group-wise quantization with mixed precision (some sensitive layers kept at higher precision).
Inference Speed
Quantization speeds up inference through:
- Reduced memory bandwidth (fewer bytes to read)
- Faster integer arithmetic (INT8 multiply-accumulate)
- Better cache utilization (more weights fit in cache)
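The memory-bandwidth point gives a useful back-of-the-envelope bound: autoregressive decoding at batch size 1 must stream every weight once per token, so tokens/sec is capped at bandwidth ÷ model bytes (my helper name; this ignores compute, KV cache, and batching):

```python
def decode_tokens_per_sec_bound(model_gb, bandwidth_gb_per_s):
    """Upper bound on single-stream decode speed for a memory-bound model."""
    return bandwidth_gb_per_s / model_gb

bw = 100.0                                      # assumed ~100 GB/s memory system
print(decode_tokens_per_sec_bound(14.0, bw))    # 7B in FP16
print(decode_tokens_per_sec_bound(3.5, bw))     # 7B in INT4: 4x the ceiling
```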
Speedup depends on hardware:
| Hardware | INT8 Speedup vs FP16 |
|---|---|
| CPU (AVX2) | 2-4x |
| CPU (AMX/AVX-512) | 3-6x |
| NVIDIA T4 | 2-3x |
| NVIDIA A100/H100 | 2-4x (with TensorCores) |
Practical Guide: Choosing Quantization
For local inference (llama.cpp/Ollama)
Use Q4_K_M as the default. Good balance of quality and size. Q5_K_M if you have the extra RAM and want ~3% quality improvement.
For GPU inference with limited VRAM
- 24GB VRAM (3090/A6000): Q5_K_M for 13B, Q4_K_M for 33B
- 16GB VRAM (3080): Q4_K_M for 7B, Q3_K_M for 13B
- 8GB VRAM (3070/2080): Q4_K_M for 3B, Q3_K_M for 7B
For fine-tuning with QLoRA
Always use NF4 (bitsandbytes) for the base model. Train LoRA adapters in FP16 or BF16.
For server deployment
GPTQ or AWQ at INT4 with tensor parallelism across multiple GPUs. AWQ generally gives better quality per bit.
Key Papers
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) · arXiv:2305.14314
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2023, ICLR) · arXiv:2210.17323
- AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2024, MLSys) · arXiv:2306.00978
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., 2023) · arXiv:2306.03078
Links
- Fine-Tuning LLMs — QLoRA combines quantization with fine-tuning
- Knowledge Distillation — another model compression technique
- LoRA and PEFT — the fine-tuning adapter approach