Diffusion Models

What

Diffusion models are a class of generative models that learn to generate data by reversing a gradual noising process. Starting from pure Gaussian noise, they iteratively denoise to produce images, audio, video, or other data. They are currently the dominant paradigm for image generation, having surpassed GANs in both sample quality and training stability.

The Two Processes

Forward Process (q)

Adds Gaussian noise to data over T timesteps until the data is indistinguishable from noise:

x_0 (real image) → x_1 → x_2 → ... → x_T (pure noise)

At each step: x_t = sqrt(1-β_t) * x_{t-1} + sqrt(β_t) * ε, with ε ~ N(0, I)

The variance β_t increases from ~10⁻⁴ to ~0.02 over the schedule. After ~1000 steps, x_T is approximately isotropic Gaussian noise.

Key property: because the noise schedule is known, any x_t can be sampled directly from x_0 in closed form: x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε, where ᾱ_t = ∏_{s≤t} (1-β_s). No iterative simulation is needed.
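
This closed-form jump can be sketched in a few lines of numpy (a toy linear schedule; ᾱ is written alpha_bar):

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule beta_1..beta_T plus cumulative alpha_bar."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to x_t: x_t = sqrt(ab_t)*x_0 + sqrt(1-ab_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))            # toy "image"
xt, eps = q_sample(x0, 999, alpha_bar, rng)
# By t = 999 almost all signal is gone: alpha_bar[-1] is below 1e-4.
```

Because this jump is O(1), training can pick a random t per example without ever simulating the chain.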

Reverse Process (p_θ)

A neural network learns to reverse this process — denoising:

x_T (noise) → x_{T-1} → x_{T-2} → ... → x_0 (generated image)

The network is trained to predict the noise ε_θ(x_t, t) that was added at timestep t. Given this prediction, we can compute x_{t-1}.
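
Given the schedule from the forward process, one reverse update can be sketched as follows; eps_hat is a placeholder for the network output ε_θ(x_t, t), so this is an illustrative DDPM ancestral step, not reference code:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # same schedule as the forward process
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_step(xt, eps_hat, t, rng):
    """One ancestral step x_t -> x_{t-1}, given the predicted noise eps_hat."""
    beta_t = betas[t]
    # Posterior mean: subtract the (scaled) predicted noise, then rescale.
    mean = (xt - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean                         # final step adds no fresh noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4))             # stand-in for a noisy sample
x_prev = ddpm_step(x, np.zeros_like(x), t=500, rng=rng)
```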

Training Objective

The network is trained to minimize:

L = E_{t, x_0, ε} [||ε - ε_θ(x_t, t)||²]

This is essentially mean squared error between true noise and predicted noise. Surprisingly simple — works extremely well.
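
A minimal sketch of that objective, with a dummy predictor standing in for ε_θ (any real model would be a U-Net or DiT):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffusion_loss(predict_eps, x0_batch):
    """L = E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2 over one minibatch."""
    t = rng.integers(0, T, size=len(x0_batch))             # random timestep per sample
    eps = rng.standard_normal(x0_batch.shape)
    ab = alpha_bar[t][:, None]
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps  # closed-form forward jump
    return np.mean((eps - predict_eps(xt, t)) ** 2)

x0 = rng.standard_normal((16, 32))                         # 16 flattened toy "images"
mse = diffusion_loss(lambda xt, t: np.zeros_like(xt), x0)  # predictor that outputs 0
# Predicting zero leaves ||eps||^2, so the loss sits near 1.0 per dimension.
```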

Key Architectural Choices

U-Net backbone (original DDPM)

The denoising network is a U-Net with:

  • Downsampling / upsampling path with skip connections
  • Self-attention layers at low resolutions
  • Time embedding (sinusoidal or learned) that conditions the network on the timestep t
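
The sinusoidal variant can be sketched as follows (same construction as transformer positional encodings; the dimension is illustrative):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Map an integer timestep to a dim-dimensional sinusoidal vector."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # geometric frequencies
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(250)
# Nearby timesteps get similar but distinguishable vectors.
```

In the U-Net this vector is typically pushed through a small MLP and injected into each residual block.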

DiT: Diffusion Transformer (2023)

The shift from U-Net to transformer backbone:

  • Patchify the image (as in ViT) into a sequence of tokens
  • Apply standard transformer blocks over those tokens
  • Inject the timestep and conditioning via adaptive layer norm (AdaLN); the DiT paper found AdaLN-Zero worked better than cross-attention conditioning

DiT scales better with compute than U-Net, and larger DiT models consistently outperform smaller ones. This parallels the transformer scaling story in NLP.
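
The patchify step is just a reshape; a minimal sketch (patch size and channel count are illustrative):

```python
import numpy as np

def patchify(img, p=8):
    """Split an HxWxC array into (H/p * W/p) flattened p*p*C patch tokens."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)          # gather each patch's pixels together
    return x.reshape((H // p) * (W // p), p * p * C)

tokens = patchify(np.zeros((64, 64, 4)))    # e.g. a 64x64x4 latent
# -> 64 tokens of dimension 256; DiT then linearly projects each token.
```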

Paper: Scalable Diffusion Models with Transformers

Latent Diffusion (Stable Diffusion)

Running diffusion in pixel space is expensive at high resolution (512×512 ≈ 262K spatial positions). Latent diffusion first compresses images into a smaller latent space:

Image → Encoder → latent z (e.g., 64×64×4 for SD) → diffuse in latent → Decoder → image

The VAE encoder compresses 8× spatially. Diffusion operates on 64×64 latents instead of 512×512 pixels, so each step touches ~64× fewer spatial positions: a massive speedup.
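
The arithmetic behind that speedup:

```python
pixel_positions = 512 * 512      # denoising targets in pixel space
latent_positions = 64 * 64       # after the VAE's 8x spatial compression
ratio = pixel_positions // latent_positions
# ratio == 64: each denoising step touches 64x fewer spatial positions.
```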

Stable Diffusion 1.x: VAE (8× compression) + CLIP text encoder + U-Net denoiser + 50 steps.

Text Conditioning

CLIP Guidance

Text prompts are encoded with a frozen CLIP text encoder. The CLIP embedding guides the denoising process via cross-attention:

Text embedding → cross-attention in U-Net/DiT → "denoise toward this text"

CLIP was trained to align image and text embeddings, so its latent space naturally connects prompts to generated images.

Classifier-Free Guidance (CFG)

During training, the text conditioning is randomly dropped (replaced with a null embedding), so a single network learns both conditional and unconditional predictions. Their difference acts as an implicit classifier direction:

ε_θ(x_t | text) - ε_θ(x_t | ∅)

The difference between conditioned and unconditioned predictions indicates how strongly the text influences the generation. At inference, this difference is scaled and added:

ε_guided = ε_θ(x_t | ∅) + w * (ε_θ(x_t | text) - ε_θ(x_t | ∅))

Higher guidance weight w = more prompt adherence, less diversity. Typical values: w=7-12 for SD.
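
The guidance combination itself is one line; here as a function over the two network outputs (the arrays are stand-ins for real ε predictions):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)              # prediction with the null embedding
eps_c = np.ones(4)               # prediction with the text embedding
guided = cfg(eps_u, eps_c, 7.5)
# w = 1 recovers the plain conditional prediction; w > 1 amplifies the text direction.
```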

Sampling Strategies

DDPM (original)

1000 steps, stochastic. High quality but slow.

DDIM (Denoising Diffusion Implicit Models)

Deterministic sampling with fewer steps (50-100). Same trained network; what changes is the sampling procedure (a non-Markovian, deterministic update rule), not the architecture.
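
A deterministic (η = 0) DDIM update, sketched with the same schedule notation as the forward process; eps_hat stands in for ε_θ(x_t, t):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_hat, t, t_prev):
    """Deterministic jump x_t -> x_{t_prev}; t_prev may skip many timesteps."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (xt - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)  # estimate x_0
    # Re-noise the x_0 estimate to the (lower) noise level of t_prev.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat

x = np.random.default_rng(2).standard_normal((4, 4))
a = ddim_step(x, np.zeros_like(x), t=999, t_prev=949)   # a 50-timestep jump
```

Because no fresh noise is injected, the same starting x_T always maps to the same image, which is what makes short 50-step schedules viable.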

EDM (Elucidating the Design Space of Diffusion Models)

Re-parameterization of the noise schedule and sampler design for better quality at any step count.

DPM-Solver++

Fast high-order solver. 10-20 steps for good quality, 25-50 for near-DDPM quality.

Summary: Step count vs quality

Steps | DDPM      | DDIM      | DPM-Solver++
10    | poor      | decent    | good
20    | decent    | good      | excellent
50    | good      | very good | excellent
100   | very good | excellent | excellent

Modern Diffusion Models

Stable Diffusion 3 (2024)

  • Uses MMDiT (Multimodal DiT): separate transformer weight streams for text and image tokens, joined by shared attention
  • Rectified-flow (flow matching) training objective instead of noise prediction
  • Significantly improved text rendering and prompt following

FLUX.1 (2024)

  • 12B-parameter model from Black Forest Labs, founded by former Stability AI researchers
  • T5 text encoder alongside CLIP, giving much larger text capacity than CLIP alone
  • State-of-the-art quality, prompt adherence, and compositionality

Imagen 2/3 (Google)

  • Cascaded diffusion: a base 64×64 model followed by super-resolution upsampling stages
  • Better photorealism than earlier Imagen versions
  • Not publicly available

DALL-E 3 (OpenAI)

  • User prompts are rewritten and expanded by GPT-4 before generation so they are faithfully followed
  • Trained on synthetic image-caption pairs where the caption describes the actual image
  • Closed, API-only access

Video Diffusion

Extending the framework to video (time + 2D space):

W.A.L.T (Transformer-based)

  • Causal encoder maps images and videos into a shared compressed latent space (future frames don't affect past ones)
  • Window-restricted attention in the transformer keeps compute manageable
  • Cascaded pipeline: a base model followed by super-resolution stages for high-resolution text-to-video

Sora (OpenAI, 2024)

  • Video compression network (spacetime VAE)
  • Diffusion on compressed spacetime latent
  • Variable duration, resolution, aspect ratio

Lumiere (Google)

  • Space-Time U-Net (STUNet) that downsamples in both space and time
  • Text-to-video and image-to-video
  • Generates the full clip duration in a single pass, rather than sparse keyframes followed by temporal super-resolution

Evaluation Metrics

Metric | What it measures | Notes
FID (Fréchet Inception Distance) | Distribution similarity to real images | Lower is better; standard for unconditional generation
IS (Inception Score) | Quality + diversity | Higher is better; largely superseded by FID
CLIP Score | Prompt-image alignment | How well the generated image matches the text prompt
Human preference | Subjective quality | Still the ground truth
DrawBench, HPSv2 | Prompt adherence | Specialized text-to-image benchmarks
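
At its core, CLIP score is a cosine similarity between the prompt's and the image's CLIP embeddings (the vectors below are toy stand-ins; real implementations often scale the score by 100 and clip negatives to 0):

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """Cosine similarity between CLIP image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

s = clip_score(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
# s == cos(45°) ≈ 0.707
```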

Current State (2025)

Diffusion models are the dominant image generation paradigm:

  • GANs are still used for style transfer and real-time applications (single-pass, hence faster)
  • Autoregressive models (e.g., LlamaGen) are competitive but typically require more compute
  • Diffusion transformers (DiT) are the standard backbone going forward
  • Scaling laws are established: compute, model size, and dataset size all improve quality predictably

Key Papers

  • Denoising Diffusion Probabilistic Models (Ho et al., 2020, NeurIPS) — DDPM, the foundational paper · arXiv:2006.11239
  • Denoising Diffusion Implicit Models (Song et al., 2021, ICLR) — DDIM, faster sampling · arXiv:2010.02502
  • High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022, CVPR) — Stable Diffusion · arXiv:2112.10752
  • Scalable Diffusion Models with Transformers (Peebles & Xie, 2023, ICCV) — DiT · arXiv:2212.09748
  • Comprehensive exploration of diffusion models in image generation: a survey (Zhang, 2025, AI Review) — survey covering 2020-2025 · doi:10.1007/s10462-025-11110-3
  • Conditional Image Synthesis with Diffusion Models: A Survey (Zhan et al., 2024) — conditioning mechanisms taxonomy · arXiv:2409.19365