Diffusion Models

What

Diffusion models are a class of generative models that learn to generate data by reversing a gradual noising process. Starting from pure Gaussian noise, they iteratively denoise to produce images, audio, video, or other data. They are currently the dominant paradigm for image generation, having surpassed GANs in both sample quality and training stability.

The Two Processes

Forward Process (q)

Adds Gaussian noise to data over T timesteps until the data is indistinguishable from noise:

x_0 (real image) → x_1 → x_2 → ... → x_T (pure noise)

At each step: x_t = sqrt(1-β_t) * x_{t-1} + sqrt(β_t) * ε, with ε ~ N(0, I)

The variance β_t increases from ~10⁻⁴ to ~0.02 over the schedule. After ~1000 steps, x_T is approximately isotropic Gaussian noise.

Key property: because the noise schedule is known, any x_t can be sampled directly from x_0 in closed form: x_t = sqrt(ᾱ_t) * x_0 + sqrt(1-ᾱ_t) * ε, where ᾱ_t = ∏_{s≤t} (1-β_s). No iterative simulation is needed.
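
This closed-form jump can be sketched in a few lines of numpy (a toy linear schedule; ᾱ is written alpha_bar):

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear noise schedule beta_1..beta_T plus cumulative alpha_bar."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to x_t: x_t = sqrt(ab_t)*x_0 + sqrt(1-ab_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = rng.standard_normal((8, 8))            # toy "image"
xt, eps = q_sample(x0, 999, alpha_bar, rng)
# By t = 999 almost all signal is gone: alpha_bar[-1] is below 1e-4.
```

Because this jump is O(1), training can pick a random t per example without ever simulating the chain.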

Reverse Process (p_θ)

A neural network learns to reverse this process — denoising:

x_T (noise) → x_{T-1} → x_{T-2} → ... → x_0 (generated image)

The network is trained to predict the noise ε_θ(x_t, t) that was added at timestep t. Given this prediction, we can compute x_{t-1}.
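
Given the schedule from the forward process, one reverse update can be sketched as follows; eps_hat is a placeholder for the network output ε_θ(x_t, t), so this is an illustrative DDPM ancestral step, not reference code:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # same schedule as the forward process
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_step(xt, eps_hat, t, rng):
    """One ancestral step x_t -> x_{t-1}, given the predicted noise eps_hat."""
    beta_t = betas[t]
    # Posterior mean: subtract the (scaled) predicted noise, then rescale.
    mean = (xt - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean                         # final step adds no fresh noise
    return mean + np.sqrt(beta_t) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4))             # stand-in for a noisy sample
x_prev = ddpm_step(x, np.zeros_like(x), t=500, rng=rng)
```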

Training Objective

The network is trained to minimize:

L = E_{t, x_0, ε} [||ε - ε_θ(x_t, t)||²]

This is essentially mean squared error between true noise and predicted noise. Surprisingly simple — works extremely well.
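
A minimal sketch of that objective, with a dummy predictor standing in for ε_θ (any real model would be a U-Net or DiT):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffusion_loss(predict_eps, x0_batch):
    """L = E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2 over one minibatch."""
    t = rng.integers(0, T, size=len(x0_batch))             # random timestep per sample
    eps = rng.standard_normal(x0_batch.shape)
    ab = alpha_bar[t][:, None]
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps  # closed-form forward jump
    return np.mean((eps - predict_eps(xt, t)) ** 2)

x0 = rng.standard_normal((16, 32))                         # 16 flattened toy "images"
mse = diffusion_loss(lambda xt, t: np.zeros_like(xt), x0)  # predictor that outputs 0
# Predicting zero leaves ||eps||^2, so the loss sits near 1.0 per dimension.
```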

Key Architectural Choices

U-Net backbone (original DDPM)

The denoising network is a U-Net with:

  • Downsampling / upsampling path with skip connections
  • Self-attention layers at low resolutions
  • Time embedding (sinusoidal or learned) that conditions the network on the timestep t
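
The sinusoidal variant can be sketched as follows (same construction as transformer positional encodings; the dimension is illustrative):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Map an integer timestep to a dim-dimensional sinusoidal vector."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # geometric frequencies
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(250)
# Nearby timesteps get similar but distinguishable vectors.
```

In the U-Net this vector is typically pushed through a small MLP and injected into each residual block.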

DiT: Diffusion Transformer (2023)

The shift from U-Net to transformer backbone:

  • Patchify the image (as in ViT) into a sequence of tokens
  • Apply standard transformer blocks over those tokens
  • Inject the timestep and conditioning via adaptive layer norm (AdaLN); the DiT paper found AdaLN-Zero worked better than cross-attention conditioning

DiT scales better with compute than U-Net, and larger DiT models consistently outperform smaller ones. This parallels the transformer scaling story in NLP.
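
The patchify step is just a reshape; a minimal sketch (patch size and channel count are illustrative):

```python
import numpy as np

def patchify(img, p=8):
    """Split an HxWxC array into (H/p * W/p) flattened p*p*C patch tokens."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)          # gather each patch's pixels together
    return x.reshape((H // p) * (W // p), p * p * C)

tokens = patchify(np.zeros((64, 64, 4)))    # e.g. a 64x64x4 latent
# -> 64 tokens of dimension 256; DiT then linearly projects each token.
```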

Paper: Scalable Diffusion Models with Transformers

Latent Diffusion (Stable Diffusion)

Running diffusion in pixel space is expensive at high resolution (512×512 ≈ 262K spatial positions). Latent diffusion first compresses images into a smaller latent space:

Image → Encoder → latent z (e.g., 64×64×4 for SD) → diffuse in latent → Decoder → image

The VAE encoder compresses 8× spatially. Diffusion operates on 64×64 latents instead of 512×512 pixels, so each step touches ~64× fewer spatial positions: a massive speedup.
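
The arithmetic behind that speedup:

```python
pixel_positions = 512 * 512      # denoising targets in pixel space
latent_positions = 64 * 64       # after the VAE's 8x spatial compression
ratio = pixel_positions // latent_positions
# ratio == 64: each denoising step touches 64x fewer spatial positions.
```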

Stable Diffusion 1.x: VAE (8× compression) + CLIP text encoder + U-Net denoiser + 50 steps.

Text Conditioning

CLIP Guidance

Text prompts are encoded with a frozen CLIP text encoder. The CLIP embedding guides the denoising process via cross-attention:

Text embedding → cross-attention in U-Net/DiT → "denoise toward this text"

CLIP was trained to align image and text embeddings, so its latent space naturally connects prompts to generated images.

Classifier-Free Guidance (CFG)

During training, the text conditioning is randomly dropped (replaced with a null embedding), so a single network learns both conditional and unconditional predictions. Their difference acts as an implicit classifier direction:

ε_θ(x_t | text) - ε_θ(x_t | ∅)

The difference between conditioned and unconditioned predictions indicates how strongly the text influences the generation. At inference, this difference is scaled and added:

ε_guided = ε_θ(x_t | ∅) + w * (ε_θ(x_t | text) - ε_θ(x_t | ∅))

Higher guidance weight w = more prompt adherence, less diversity. Typical values: w=7-12 for SD.
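
The guidance combination itself is one line; here as a function over the two network outputs (the arrays are stand-ins for real ε predictions):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)              # prediction with the null embedding
eps_c = np.ones(4)               # prediction with the text embedding
guided = cfg(eps_u, eps_c, 7.5)
# w = 1 recovers the plain conditional prediction; w > 1 amplifies the text direction.
```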

Sampling Strategies

DDPM (original)

1000 steps, stochastic. High quality but slow.

DDIM (Denoising Diffusion Implicit Models)

Deterministic sampling with fewer steps (50-100). Same trained network; what changes is the sampling procedure (a non-Markovian, deterministic update rule), not the architecture.
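
A deterministic (η = 0) DDIM update, sketched with the same schedule notation as the forward process; eps_hat stands in for ε_θ(x_t, t):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_hat, t, t_prev):
    """Deterministic jump x_t -> x_{t_prev}; t_prev may skip many timesteps."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (xt - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)  # estimate x_0
    # Re-noise the x_0 estimate to the (lower) noise level of t_prev.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat

x = np.random.default_rng(2).standard_normal((4, 4))
a = ddim_step(x, np.zeros_like(x), t=999, t_prev=949)   # a 50-timestep jump
```

Because no fresh noise is injected, the same starting x_T always maps to the same image, which is what makes short 50-step schedules viable.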

EDM (Elucidating the Design Space of Diffusion Models)

Re-parameterization of the noise schedule and sampler design for better quality at any step count.

DPM-Solver++

Fast high-order solver. 10-20 steps for good quality, 25-50 for near-DDPM quality.

Summary: Step count vs quality

Steps | DDPM      | DDIM      | DPM-Solver++
10    | poor      | decent    | good
20    | decent    | good      | excellent
50    | good      | very good | excellent
100   | very good | excellent | excellent

Modern Diffusion Models

Stable Diffusion 3 (2024)

  • Uses MMDiT (Multimodal DiT): separate transformer weight streams for text and image tokens, joined by shared attention
  • Rectified-flow (flow matching) training objective instead of noise prediction
  • Significantly improved text rendering and prompt following

FLUX.1 (2024)

  • 12B-parameter model from Black Forest Labs, founded by former Stability AI researchers
  • T5 text encoder alongside CLIP, giving much larger text capacity than CLIP alone
  • State-of-the-art quality, prompt adherence, and compositionality

Imagen 2/3 (Google)

  • Cascaded diffusion: a base 64×64 model followed by super-resolution upsampling stages
  • Better photorealism than earlier Imagen versions
  • Not publicly available

DALL-E 3 (OpenAI)

  • User prompts are rewritten and expanded by GPT-4 before generation so they are faithfully followed
  • Trained on synthetic image-caption pairs where the caption describes the actual image
  • Closed, API-only access

Video Diffusion

Extending the framework to video (time + 2D space):

W.A.L.T (Transformer-based)

  • Causal encoder maps images and videos into a shared compressed latent space (future frames don't affect past ones)
  • Window-restricted attention in the transformer keeps compute manageable
  • Cascaded pipeline: a base model followed by super-resolution stages for high-resolution text-to-video

Sora (OpenAI, 2024)

  • Video compression network (spacetime VAE)
  • Diffusion on compressed spacetime latent
  • Variable duration, resolution, aspect ratio

Lumiere (Google)

  • Space-Time U-Net (STUNet) that downsamples in both space and time
  • Text-to-video and image-to-video
  • Generates the full clip duration in a single pass, rather than sparse keyframes followed by temporal super-resolution

Evaluation Metrics

Metric | What it measures | Notes
FID (Fréchet Inception Distance) | Distribution similarity to real images | Lower is better; standard for unconditional generation
IS (Inception Score) | Quality + diversity | Higher is better; largely superseded by FID
CLIP Score | Prompt-image alignment | How well the generated image matches the text prompt
Human preference | Subjective quality | Still the ground truth
DrawBench, HPSv2 | Prompt adherence | Specialized text-to-image benchmarks
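
At its core, CLIP score is a cosine similarity between the prompt's and the image's CLIP embeddings (the vectors below are toy stand-ins; real implementations often scale the score by 100 and clip negatives to 0):

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """Cosine similarity between CLIP image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

s = clip_score(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
# s == cos(45°) ≈ 0.707
```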

Current State (2025)

Diffusion models are the dominant image generation paradigm:

  • GANs are still used for style transfer and real-time applications (single-pass, hence faster)
  • Autoregressive models (e.g., LlamaGen) are competitive but typically require more compute
  • Diffusion transformers (DiT) are the standard backbone going forward
  • Scaling laws are established: compute, model size, and dataset size all improve quality predictably

Key Papers

  • Denoising Diffusion Probabilistic Models (Ho et al., 2020, NeurIPS) — DDPM, the foundational paper · arXiv:2006.11239
  • Denoising Diffusion Implicit Models (Song et al., 2021, ICLR) — DDIM, faster sampling · arXiv:2010.02502
  • High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022, CVPR) — Stable Diffusion · arXiv:2112.10752
  • Scalable Diffusion Models with Transformers (Peebles & Xie, 2023, ICCV) — DiT · arXiv:2212.09748
  • Comprehensive exploration of diffusion models in image generation: a survey (Zhang, 2025, AI Review) — survey covering 2020-2025 · doi:10.1007/s10462-025-11110-3
  • Conditional Image Synthesis with Diffusion Models: A Survey (Zhan et al., 2024) — conditioning mechanisms taxonomy · arXiv:2409.19365