Image Generation
What
Creating new images from scratch or from text descriptions. One of the fastest-moving areas in deep learning.
Approaches
GANs (Generative Adversarial Networks)
Two networks competing: generator creates fake images, discriminator tries to tell real from fake. They improve each other. Historically dominant (StyleGAN, BigGAN) but tricky to train (mode collapse, training instability).
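The adversarial objective fits in a few lines. A minimal numpy sketch of the two losses computed from raw discriminator logits (function names are mine, not from any library); it uses the non-saturating generator loss most implementations prefer over the original minimax form:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_losses(d_real_logits, d_fake_logits):
    """Non-saturating GAN losses from raw discriminator logits.

    d_real_logits: discriminator outputs on real images
    d_fake_logits: discriminator outputs on generator samples
    """
    eps = 1e-12  # numerical safety inside the logs
    # Discriminator: push real toward 1, fake toward 0.
    d_loss = -np.mean(np.log(sigmoid(d_real_logits) + eps)
                      + np.log(1.0 - sigmoid(d_fake_logits) + eps))
    # Generator (non-saturating form): push fake toward 1.
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits) + eps))
    return d_loss, g_loss
```

The tug-of-war is visible in the signs: a fake the discriminator confidently rejects gives it a low loss but the generator a high one, and vice versa.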
Diffusion Models (current state of the art)
Start with noise, gradually remove it to create an image. Guided by text prompts or other conditions. More stable training than GANs, better diversity.
Examples: Stable Diffusion, DALL-E, Midjourney, Flux
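The forward (noising) direction has a closed form, which is what makes training tractable: you can jump straight to any timestep without simulating the chain. A sketch of the standard DDPM forward process in numpy (the schedule values are the common linear defaults; the function name is mine):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x0: clean image (any-shape array, values roughly in [-1, 1])
    t:  integer timestep, 0-indexed
    betas: per-step noise schedule, e.g. linear from 1e-4 to 0.02
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]      # cumulative signal fraction up to t
    noise = rng.standard_normal(x0.shape)
    # Scale the image down, mix in Gaussian noise; at large t this is ~pure noise.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal((8, 8))
xt = forward_diffuse(x0, 999, betas, rng)  # near-pure noise at the last step
```

The model then learns the reverse direction: predict the noise at each step so it can be subtracted out, starting from pure Gaussian noise at generation time.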
VAEs (Variational Autoencoders)
Learn a compressed representation, sample from it to generate. See Autoencoders.
Flow-based models (Normalizing Flows)
Learn an invertible mapping between data and a simple distribution (e.g., Gaussian). The key constraint: every transformation must be invertible with a tractable Jacobian. Exact likelihood computation, but architecturally limited. Glow was the landmark model.
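The "invertible with a tractable Jacobian" constraint is concrete in the 1-D affine case, where change of variables gives an exact log-likelihood. A toy sketch (names are mine; real flows stack many such layers with learned parameters):

```python
import numpy as np

def affine_flow_logprob(x, shift, log_scale):
    """Exact log-density under a 1-D affine flow into a standard Gaussian.

    Forward map: z = (x - shift) * exp(-log_scale)   (invertible by construction)
    Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|
    where log |dz/dx| = -log_scale is the tractable Jacobian term.
    """
    z = (x - shift) * np.exp(-log_scale)
    log_base = -0.5 * (z**2 + np.log(2.0 * np.pi))  # standard-normal log-density
    return log_base - log_scale
```

This is the property GANs and (exactly) VAEs lack: the density of any data point is computable in closed form, at the cost of restricting every layer to invertible transforms.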
Key concepts
- Latent space: compressed representation where images are manipulated
- Conditioning: guiding generation with text, class labels, or other images
- Classifier-free guidance: extrapolate between unconditional and conditional predictions to trade sample diversity for prompt adherence
- NeRF (Neural Radiance Fields): represent 3D scenes as neural networks. Given 2D photos from different angles, synthesize novel views. Not pixel generation per se, but a generative approach to 3D
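Classifier-free guidance from the list above is a one-line combination rule applied at every denoising step. A sketch (the function name is mine; `eps_*` stand for the model's noise predictions with and without the prompt):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    guidance_scale = 1.0 -> plain conditional prediction
    guidance_scale > 1.0 -> stronger prompt adherence, less diversity
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Typical scales are roughly 5-10 for text-to-image; pushing higher tends to oversaturate and flatten outputs.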
Controllable generation
Beyond text prompts, you can control generation with structural inputs:
- ControlNet: add spatial conditioning (edges, depth maps, poses) to diffusion models
- IP-Adapter: condition on reference images for style/content transfer
- Inpainting: regenerate parts of an image while keeping the rest
- img2img: start from an existing image instead of pure noise
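The img2img idea reduces to choosing where in the noise schedule to start: noise the input image to an intermediate timestep and denoise from there. A numpy sketch under the same DDPM-style schedule assumption (the function name and the `strength` convention follow common diffusion toolkits, but this is an illustration, not any library's API):

```python
import numpy as np

def img2img_start(x_init, strength, betas, rng):
    """Pick the starting latent for img2img: noise the input image to
    timestep t = strength * T instead of starting from pure noise.

    strength in [0, 1]: 0 keeps the input untouched, 1 is pure text-to-image.
    """
    T = len(betas)
    t = min(int(strength * T), T - 1)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x_init.shape)
    # Same closed-form noising as the diffusion forward process.
    x_t = np.sqrt(alpha_bar) * x_init + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, t  # the sampler then denoises from step t down to 0
```

Low strength preserves the input's composition; high strength keeps only a faint trace of it.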
Evaluation metrics
| Metric | What it measures | Notes |
|---|---|---|
| FID (Fréchet Inception Distance) | Distance between real and generated feature distributions | Lower is better. The standard metric |
| IS (Inception Score) | Quality + diversity via Inception classifier | Higher is better. Less reliable than FID |
| CLIP Score | Text-image alignment | How well the image matches the prompt |
FID is the most widely used, but it has flaws — it uses InceptionV3 features which may not capture everything humans care about.
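FID itself is the Fréchet distance between two Gaussians fit to InceptionV3 features. A simplified numpy sketch assuming diagonal covariances (the real metric uses full covariance matrices and a matrix square root; the function name is mine):

```python
import numpy as np

def fid_diagonal(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets,
    simplified to diagonal covariances. Real FID uses full covariances:
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    var_r, var_g = feats_real.var(0), feats_gen.var(0)
    # Diagonal case: the trace term collapses to sum_i (sigma_r,i - sigma_g,i)^2.
    return np.sum((mu_r - mu_g) ** 2) + np.sum((np.sqrt(var_r) - np.sqrt(var_g)) ** 2)
```

Identical feature sets score exactly zero; any mean or variance mismatch pushes the score up, which is why FID needs large sample sizes to be stable.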
Ethical considerations
This technology enables deepfakes, non-consensual imagery, and misinformation. Models now include safety filters, watermarking, and content provenance (C2PA). The technical capability outpaces regulation.
Current landscape
Diffusion models dominate text-to-image. Video generation is the active frontier (Sora, Runway, Kling). Image generation is increasingly a commodity — the differentiator is controllability, consistency, and speed.
Links
- Autoencoders — VAE foundations
- Deep Learning Roadmap — broader context
- Computer Vision Roadmap — where generation fits in CV