Image Generation
What
Creating new images from scratch or from text descriptions. One of the fastest-moving areas in deep learning.
Approaches
GANs (Generative Adversarial Networks)
Two networks competing: generator creates fake images, discriminator tries to tell real from fake. They improve each other. Historically dominant (StyleGAN, BigGAN) but tricky to train (mode collapse, training instability).
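The adversarial objective fits in a few lines. A minimal numpy sketch of the two losses computed from raw discriminator logits (function names are mine, not from any library); it uses the non-saturating generator loss most implementations prefer over the original minimax form:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_losses(d_real_logits, d_fake_logits):
    """Non-saturating GAN losses from raw discriminator logits.

    d_real_logits: discriminator outputs on real images
    d_fake_logits: discriminator outputs on generator samples
    """
    eps = 1e-12  # numerical safety inside the logs
    # Discriminator: push real toward 1, fake toward 0.
    d_loss = -np.mean(np.log(sigmoid(d_real_logits) + eps)
                      + np.log(1.0 - sigmoid(d_fake_logits) + eps))
    # Generator (non-saturating form): push fake toward 1.
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits) + eps))
    return d_loss, g_loss
```

The tug-of-war is visible in the signs: a fake the discriminator confidently rejects gives it a low loss but the generator a high one, and vice versa.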
Diffusion Models (current state of the art)
Start with noise, gradually remove it to create an image. Guided by text prompts or other conditions. More stable training than GANs, better diversity.
Examples: Stable Diffusion, DALL-E, Midjourney, Flux
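The forward (noising) direction has a closed form, which is what makes training tractable: you can jump straight to any timestep without simulating the chain. A sketch of the standard DDPM forward process in numpy (the schedule values are the common linear defaults; the function name is mine):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x0: clean image (any-shape array, values roughly in [-1, 1])
    t:  integer timestep, 0-indexed
    betas: per-step noise schedule, e.g. linear from 1e-4 to 0.02
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]      # cumulative signal fraction up to t
    noise = rng.standard_normal(x0.shape)
    # Scale the image down, mix in Gaussian noise; at large t this is ~pure noise.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal((8, 8))
xt = forward_diffuse(x0, 999, betas, rng)  # near-pure noise at the last step
```

The model then learns the reverse direction: predict the noise at each step so it can be subtracted out, starting from pure Gaussian noise at generation time.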
VAEs (Variational Autoencoders)
Learn a compressed representation, sample from it to generate. See Autoencoders.
Flow-based models (Normalizing Flows)
Learn an invertible mapping between data and a simple distribution (e.g., Gaussian). The key constraint: every transformation must be invertible with a tractable Jacobian. Exact likelihood computation, but architecturally limited. Glow was the landmark model.
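The "invertible with a tractable Jacobian" constraint is concrete in the 1-D affine case, where change of variables gives an exact log-likelihood. A toy sketch (names are mine; real flows stack many such layers with learned parameters):

```python
import numpy as np

def affine_flow_logprob(x, shift, log_scale):
    """Exact log-density under a 1-D affine flow into a standard Gaussian.

    Forward map: z = (x - shift) * exp(-log_scale)   (invertible by construction)
    Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|
    where log |dz/dx| = -log_scale is the tractable Jacobian term.
    """
    z = (x - shift) * np.exp(-log_scale)
    log_base = -0.5 * (z**2 + np.log(2.0 * np.pi))  # standard-normal log-density
    return log_base - log_scale
```

This is the property GANs and (exactly) VAEs lack: the density of any data point is computable in closed form, at the cost of restricting every layer to invertible transforms.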
Key concepts
- Latent space: compressed representation where images are manipulated
- Conditioning: guiding generation with text, class labels, or other images
- Classifier-free guidance: extrapolate between unconditional and conditional predictions to trade sample diversity for prompt adherence
- NeRF (Neural Radiance Fields): represent 3D scenes as neural networks. Given 2D photos from different angles, synthesize novel views. Not pixel generation per se, but a generative approach to 3D
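Classifier-free guidance from the list above is a one-line combination rule applied at every denoising step. A sketch (the function name is mine; `eps_*` stand for the model's noise predictions with and without the prompt):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    guidance_scale = 1.0 -> plain conditional prediction
    guidance_scale > 1.0 -> stronger prompt adherence, less diversity
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Typical scales are roughly 5-10 for text-to-image; pushing higher tends to oversaturate and flatten outputs.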
Controllable generation
Beyond text prompts, you can control generation with structural inputs:
- ControlNet: add spatial conditioning (edges, depth maps, poses) to diffusion models
- IP-Adapter: condition on reference images for style/content transfer
- Inpainting: regenerate parts of an image while keeping the rest
- img2img: start from an existing image instead of pure noise
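The img2img idea reduces to choosing where in the noise schedule to start: noise the input image to an intermediate timestep and denoise from there. A numpy sketch under the same DDPM-style schedule assumption (the function name and the `strength` convention follow common diffusion toolkits, but this is an illustration, not any library's API):

```python
import numpy as np

def img2img_start(x_init, strength, betas, rng):
    """Pick the starting latent for img2img: noise the input image to
    timestep t = strength * T instead of starting from pure noise.

    strength in [0, 1]: 0 keeps the input untouched, 1 is pure text-to-image.
    """
    T = len(betas)
    t = min(int(strength * T), T - 1)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x_init.shape)
    # Same closed-form noising as the diffusion forward process.
    x_t = np.sqrt(alpha_bar) * x_init + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, t  # the sampler then denoises from step t down to 0
```

Low strength preserves the input's composition; high strength keeps only a faint trace of it.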
Evaluation metrics
| Metric | What it measures | Notes |
|---|---|---|
| FID (Fréchet Inception Distance) | Distance between real and generated feature distributions | Lower is better. The standard metric |
| IS (Inception Score) | Quality + diversity via Inception classifier | Higher is better. Less reliable than FID |
| CLIP Score | Text-image alignment | How well the image matches the prompt |
FID is the most widely used, but it has flaws — it uses InceptionV3 features which may not capture everything humans care about.
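FID itself is the Fréchet distance between two Gaussians fit to InceptionV3 features. A simplified numpy sketch assuming diagonal covariances (the real metric uses full covariance matrices and a matrix square root; the function name is mine):

```python
import numpy as np

def fid_diagonal(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets,
    simplified to diagonal covariances. Real FID uses full covariances:
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    var_r, var_g = feats_real.var(0), feats_gen.var(0)
    # Diagonal case: the trace term collapses to sum_i (sigma_r,i - sigma_g,i)^2.
    return np.sum((mu_r - mu_g) ** 2) + np.sum((np.sqrt(var_r) - np.sqrt(var_g)) ** 2)
```

Identical feature sets score exactly zero; any mean or variance mismatch pushes the score up, which is why FID needs large sample sizes to be stable.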
Ethical considerations
This technology enables deepfakes, non-consensual imagery, and misinformation. Models now include safety filters, watermarking, and content provenance (C2PA). The technical capability outpaces regulation.
Current landscape
Diffusion models dominate text-to-image. Video generation is the active frontier (Sora, Runway, Kling). Image generation is increasingly a commodity — the differentiator is controllability, consistency, and speed.
Links
- Autoencoders — VAE foundations
- Deep Learning Roadmap — broader context
- Computer Vision Roadmap — where generation fits in CV