Image Generation

What

Creating new images from scratch or from text descriptions. One of the fastest-moving areas in deep learning.

Approaches

GANs (Generative Adversarial Networks)

Two networks compete: a generator creates fake images while a discriminator tries to tell real from fake; each improves by training against the other. Historically dominant (StyleGAN, BigGAN) but tricky to train (mode collapse, training instability).
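
The adversarial objective can be sketched in a few lines of numpy. This is a toy illustration of the two loss functions only, not a training loop; the `d_real`/`d_fake` arrays stand in for a real discriminator's probability outputs:

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Standard GAN losses from discriminator outputs (probabilities in (0, 1)).

    d_real: D's scores on real images; d_fake: D's scores on generated images.
    """
    eps = 1e-12  # numerical floor so log never sees exactly 0
    # Discriminator wants d_real -> 1 and d_fake -> 0.
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # Non-saturating generator loss: G wants D to call its samples real.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# At the equilibrium where D is maximally confused (outputs 0.5 everywhere),
# d_loss is 2*ln(2): the classic value from the original GAN analysis.
d_loss, g_loss = gan_losses(np.full(8, 0.5), np.full(8, 0.5))
```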

Diffusion Models (current state of the art)

Start with noise, gradually remove it to create an image. Guided by text prompts or other conditions. More stable training than GANs, better diversity.

Examples: Stable Diffusion, DALL-E, Midjourney, Flux
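
The noise-removal structure reduces to a closed-form equation. A minimal numpy sketch, using an illustrative linear noise schedule and the true noise as an oracle in place of a trained denoiser, to show the parametrization a diffusion model learns to invert:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

x0 = rng.standard_normal(64)             # stand-in for a clean image
eps = rng.standard_normal(64)            # the noise to be added

t = 500
# Forward process: jump straight to noise level t in closed form.
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A trained model predicts eps from (x_t, t); here we use the true eps
# as an oracle, so the reconstruction is exact.
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
```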

VAEs (Variational Autoencoders)

Learn a compressed representation, sample from it to generate. See Autoencoders.

Flow-based models (Normalizing Flows)

Learn an invertible mapping between data and a simple distribution (e.g., Gaussian). The key constraint: every transformation must be invertible with a tractable Jacobian. Exact likelihood computation, but architecturally limited. Glow was the landmark model.
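
The invertibility-plus-tractable-Jacobian constraint is concrete in an affine coupling layer, the building block of RealNVP and Glow. A self-contained numpy sketch, with fixed scale/shift arrays standing in for the learned conditioning network:

```python
import numpy as np

def coupling_forward(x, scale, shift):
    """One affine coupling layer on a 1-D vector: the first half passes
    through unchanged; the second half gets an elementwise affine map."""
    x1, x2 = np.split(x, 2)
    y2 = x2 * np.exp(scale) + shift
    log_det = np.sum(scale)              # Jacobian is triangular: log|det| is cheap
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y, scale, shift):
    y1, y2 = np.split(y, 2)
    x2 = (y2 - shift) * np.exp(-scale)   # exact inverse, no iteration needed
    return np.concatenate([y1, x2])

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
scale, shift = rng.standard_normal(4), rng.standard_normal(4)
y, log_det = coupling_forward(x, scale, shift)
x_back = coupling_inverse(y, scale, shift)
```

In a real flow, `scale` and `shift` are produced by a neural network that sees only the untouched half, which is what keeps the whole stack invertible.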

Key concepts

  • Latent space: compressed representation where images are manipulated
  • Conditioning: guiding generation with text, class labels, or other images
  • Classifier-free guidance: control the tradeoff between quality and prompt adherence
  • NeRF (Neural Radiance Fields): represent 3D scenes as neural networks. Given 2D photos from different angles, synthesize novel views. Not pixel generation per se, but a generative approach to 3D
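
Classifier-free guidance in particular is a one-line formula: extrapolate from the unconditional noise prediction toward the conditional one by a guidance scale. A numpy sketch, where the `eps` arrays stand in for model outputs:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: blend the unconditional and conditional
    noise predictions; w > 1 pushes harder toward the prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(2)
eps_u, eps_c = rng.standard_normal(4), rng.standard_normal(4)
```

At w = 0 the prompt is ignored, at w = 1 you get the plain conditional prediction, and typical samplers run with w well above 1 to trade diversity for prompt adherence.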

Controllable generation

Beyond text prompts, you can control generation with structural inputs:

  • ControlNet: add spatial conditioning (edges, depth maps, poses) to diffusion models
  • IP-Adapter: condition on reference images for style/content transfer
  • Inpainting: regenerate parts of an image while keeping the rest
  • img2img: start from an existing image instead of pure noise
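
img2img can be sketched as picking a starting step via a strength parameter: noise the source image up to that step, then denoise from there instead of from pure noise. A numpy sketch of the noising half, with an illustrative schedule and a hypothetical helper name (`img2img_start` is not a real library API):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # illustrative schedule

def img2img_start(image, noise, strength):
    """Noise an existing image to an intermediate diffusion step.

    strength in [0, 1]: near 0 keeps the image almost intact, near 1
    starts from (almost) pure noise. A sampler then denoises from step t.
    """
    t = min(int(strength * (T - 1)), T - 1)
    return np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise

img = rng.standard_normal(10_000)        # stand-in for an image or latent
noise = rng.standard_normal(10_000)
gentle = img2img_start(img, noise, 0.2)  # stays close to the source image
heavy = img2img_start(img, noise, 0.95)  # mostly noise, loose resemblance
```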

Evaluation metrics

  • FID (Fréchet Inception Distance): distance between the feature distributions of real and generated images. Lower is better; the standard metric.
  • IS (Inception Score): quality and diversity via an Inception classifier. Higher is better; less reliable than FID.
  • CLIP Score: text-image alignment; measures how well the image matches the prompt.

FID is the most widely used, but it has flaws — it uses InceptionV3 features which may not capture everything humans care about.
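
FID itself is the Fréchet distance between two Gaussians fitted to feature sets: d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). A numpy sketch, using random arrays in place of real InceptionV3 features:

```python
import numpy as np

def psd_sqrt(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^1/2) computed via the symmetric form S_a^1/2 S_b S_a^1/2,
    # which keeps everything in real symmetric PSD territory.
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross))

rng = np.random.default_rng(4)
real = rng.standard_normal((500, 8))     # stand-in for Inception features
shifted = real + 3.0                     # same covariance, mean shifted by 3
```

Identical feature sets give a FID of (numerically) zero; shifting every dimension by 3 with the covariance unchanged leaves only the mean term, 8 x 3² = 72.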

Ethical considerations

This technology enables deepfakes, non-consensual imagery, and misinformation. Models now include safety filters, watermarking, and content provenance (C2PA). The technical capability outpaces regulation.

Current landscape

Diffusion models dominate text-to-image. Video generation is the active frontier (Sora, Runway, Kling). Image generation is increasingly a commodity — the differentiator is controllability, consistency, and speed.