Generative Adversarial Networks
What
GANs frame generative modeling as a two-player game between a generator G (creates fake data) and a discriminator D (distinguishes real from fake). The generator learns to produce outputs so realistic that D can’t distinguish them from real data.
Generator G(z): random noise z → fake data G(z)
Discriminator D(x): real or fake? → probability D(x) is real
D is trained to maximize: log(D(x_real)) + log(1 - D(G(z)))
G is trained to minimize: log(1 - D(G(z))) — in practice G usually maximizes log(D(G(z))) instead (the non-saturating loss from the original paper), which gives stronger gradients early in training
At equilibrium, G produces perfect fakes, and D outputs 0.5 for everything (can’t tell real from fake).
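The two objectives above can be sketched numerically. A minimal numpy toy (scalar discriminator outputs standing in for a real network):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """D maximizes log D(x_real) + log(1 - D(G(z))); return the
    negated value so it can be minimized like an ordinary loss."""
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss(d_fake):
    """G minimizes log(1 - D(G(z)))."""
    return np.log(1.0 - d_fake).mean()

# At equilibrium D outputs 0.5 everywhere, so the D loss settles at
# -(log 0.5 + log 0.5) = 2 log 2:
d_eq = d_loss(np.array([0.5]), np.array([0.5]))
```

Note how g_loss flattens out as d_fake approaches 0 (a confident D): that is exactly the vanishing-gradient problem discussed below.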
Training Dynamics
The minimax game has a unique equilibrium when:
- D is optimal for the current G
- G minimizes the Jensen-Shannon divergence between p_data and p_model
In practice, balancing G and D is tricky:
- If D too weak: G finds degenerate solutions (mode collapse)
- If D too strong: D(G(z)) saturates near 0, so G's gradient vanishes (G stops learning)
- If G too weak: D wins trivially, producing the same saturation — G gets no useful gradient
Practical training tips
- Use spectral normalization on D (controls Lipschitz constant)
- Alternate: 1 D step per G step (D needs to stay close to optimal)
- Use one-sided label smoothing (0.9 instead of 1.0 for real) to keep D from becoming overconfident
- Monitor D loss: if it goes to 0 too fast, D is too strong
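The label-smoothing tip can be illustrated with a plain binary cross-entropy in numpy (toy probabilities, not a real D):

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy over a batch of probabilities."""
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

d_real = np.array([0.99, 0.98, 0.97])   # D is very confident on real images

hard = bce(d_real, np.ones(3))          # targets = 1.0
soft = bce(d_real, np.full(3, 0.9))     # targets = 0.9 (one-sided smoothing)

# With hard labels the loss is nearly zero once D is confident; with soft
# labels, overconfidence is penalized, so D keeps producing useful gradients.
```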
Mode Collapse
The generator finds a small subset of the data distribution that fools D well, then only produces that subset. D can’t distinguish these fakes, so G has no incentive to diversify.
Solutions:
- Unrolled GANs: D’s optimization is simulated for several steps before computing G’s gradient
- Wasserstein GAN (WGAN): earth mover's (Wasserstein-1) distance instead of Jensen-Shannon divergence
- Mixed strategies: aggregate outputs from multiple generators (or generator checkpoints) so no single mode dominates
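The WGAN objective is simple to state in code. A minimal numpy sketch (critic outputs are unbounded scores, not probabilities; the required Lipschitz constraint — weight clipping or gradient penalty — is not shown):

```python
import numpy as np

def critic_loss(c_real, c_fake):
    """WGAN critic maximizes E[C(real)] - E[C(fake)];
    return the negation so it can be minimized."""
    return -(c_real.mean() - c_fake.mean())

def wgen_loss(c_fake):
    """WGAN generator maximizes E[C(fake)]."""
    return -c_fake.mean()
```

Because these losses are linear in the critic scores, they do not saturate the way log(1 - D(G(z))) does, which is why WGAN training is more forgiving of an over-trained critic.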
Architecture: DCGAN
The Deep Convolutional GAN (2016) established stable architecture patterns:
- Strided convolutions instead of pooling (learns its own spatial downsampling)
- Batch normalization in both G and D
- LeakyReLU in D (nonzero gradient for negative activations, so gradients keep flowing back through D to G)
- No FC layers in G (transposed convolutions handle spatial structure)
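The "strided transposed convolutions instead of pooling/FC" pattern fixes the generator's spatial plan. A small sketch of the output-size arithmetic (the standard transposed-conv formula, with the typical DCGAN kernel=4, stride=2, pad=1 setting assumed):

```python
def tconv_out(h, kernel=4, stride=2, pad=1):
    """Spatial output size of a transposed convolution:
    (h - 1) * stride - 2 * pad + kernel."""
    return (h - 1) * stride - 2 * pad + kernel

# DCGAN-style generator: project z to a 4x4 feature map, then double the
# resolution with each strided transposed convolution (no pooling, no FC
# layers after the initial projection):
sizes = [4]
for _ in range(4):
    sizes.append(tconv_out(sizes[-1]))
# sizes == [4, 8, 16, 32, 64]
```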
StyleGAN: Disentangling Style and Content
StyleGAN (2018, 2019) introduced the mapping network and style injection:
z → mapping network (8 FC layers) → w (style code)
w → AdaIN (Adaptive Instance Normalization) → each layer of synthesis network
This separates high-level style (from w) from stochastic variation (from independent noise inputs). Mixing styles at different layers controls coarse vs fine attributes.
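AdaIN itself is a small operation: normalize each channel, then re-scale and re-shift with parameters predicted from the style code w. A numpy sketch for a single (C, H, W) feature map (gamma/beta stand in for the learned affine transform of w):

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: per-channel normalize x to zero
    mean / unit variance, then apply style-derived scale and shift."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (x - mu) / (sigma + eps) + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
y = adain(x, gamma=np.full(3, 2.0), beta=np.full(3, 1.0))
# Each output channel now has mean ~1 and std ~2, regardless of x's
# original statistics — the style code fully dictates channel statistics.
```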
StyleGAN2 (2020)
- Weight demodulation instead of AdaIN (more stable, better gradient flow)
- Path length regularization (encourages smooth interpolation)
- No progressive growing (was used in StyleGAN1)
StyleGAN3 (2021)
- Alias-free signal processing (prevents "texture sticking", where fine details attach to pixel coordinates when latents are animated)
- Translation (and optionally rotation) equivariance: details move coherently with the image content rather than the pixel grid
Conditional GANs
Condition both G and D on a class label or other input:
D(x, c): "is this real image of class c?"
G(z, c): "generate fake image of class c"
This enables:
- Class-conditional generation (class-specific outputs)
- Image-to-image translation (Pix2Pix, CycleGAN)
- Text-to-image (CLIP-guided generation)
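How the conditioning is wired in depends on the model; one common, simple scheme (a sketch, not the only option) concatenates a one-hot label to G's noise vector and tiles it into extra feature planes for D:

```python
import numpy as np

def one_hot(c, num_classes):
    v = np.zeros(num_classes)
    v[c] = 1.0
    return v

# Conditioning G(z, c): append the class label to the noise vector.
z = np.random.default_rng(0).normal(size=100)
g_input = np.concatenate([z, one_hot(3, num_classes=10)])    # shape (110,)

# Conditioning D(x, c): tile the one-hot label into constant feature
# planes and concatenate them to the image channels.
img = np.zeros((3, 32, 32))
label_planes = np.broadcast_to(one_hot(3, 10)[:, None, None], (10, 32, 32))
d_input = np.concatenate([img, label_planes], axis=0)        # shape (13, 32, 32)
```

Other schemes (label embeddings, projection discriminators) exist; the point is simply that both networks see c.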
Image-to-Image Translation
Pix2Pix (2017)
Paired translation: (input image, output image) pairs required.
G: input domain → output domain
D: (input, output) → is this a real pair?
Example: edges → photo, satellite → map, day → night.
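The "is this a real pair?" discriminator is implemented by stacking the conditioning image and the candidate output before D sees them. A shape-level sketch (edge-map and photo shapes are assumptions for illustration):

```python
import numpy as np

# D judges (input, output) *pairs*: concatenate the conditioning image
# and the candidate output along the channel axis.
edges = np.zeros((1, 256, 256))     # input domain: 1-channel edge map
photo = np.zeros((3, 256, 256))     # output domain: 3-channel photo
d_input = np.concatenate([edges, photo], axis=0)   # shape (4, 256, 256)
```

The full Pix2Pix generator objective also adds an L1 reconstruction term against the paired ground truth (weighted λ = 100 in the paper), which the pairing above makes possible.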
CycleGAN (2017)
Unpaired translation: no paired examples needed.
G: X → Y, F: Y → X
D_X: is this from domain X?
D_Y: is this from domain Y?
+ Cycle consistency: F(G(x)) ≈ x, G(F(y)) ≈ y
This enables translation between domains where paired data doesn’t exist.
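The cycle-consistency constraint is just a reconstruction penalty applied in both directions. A numpy sketch with toy translators that happen to be exact inverses:

```python
import numpy as np

def cycle_loss(G, F, x, y):
    """Cycle consistency: F(G(x)) should recover x, G(F(y)) should recover y."""
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

# Toy "translators" that are exact inverses (scale by 2 / by 0.5):
G = lambda x: 2.0 * x        # X -> Y
F = lambda y: 0.5 * y        # Y -> X

x = np.ones(4)
y = np.full(4, 2.0)
loss = cycle_loss(G, F, x, y)   # 0.0, since the cycle reconstructs exactly
```

Without this term, G could map every x to any convincing member of Y; the cycle penalty forces the translation to preserve the content of the input.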
Evaluation Metrics
| Metric | What it measures | Limitation |
|---|---|---|
| FID (Fréchet Inception Distance) | Distribution similarity to real images | Needs large sample |
| IS (Inception Score) | Quality × diversity | Doesn’t compare to real data |
| Precision/Recall | Quality vs coverage of distribution | Computationally expensive |
| LPIPS | Perceptual similarity | Needs reference network |
| Human evaluation | Subjective quality | Expensive, inconsistent |
FID is the de facto standard: lower is better. On common benchmarks, single-digit FID roughly corresponds to photorealistic samples, but scores are only comparable under identical evaluation setups (sample count, feature extractor).
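Concretely, FID is the Fréchet distance between Gaussians fit to Inception-v3 features of real and generated images: ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)). A numpy sketch for the diagonal-covariance case (real implementations use full covariance matrices and a matrix square root):

```python
import numpy as np

def fid_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))."""
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))

mu = np.zeros(4)
var = np.ones(4)
identical = fid_diag(mu, var, mu, var)        # 0.0: matching distributions
shifted = fid_diag(mu, var, mu + 1.0, var)    # 4.0: mean shifted by 1 in 4 dims
```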
GANs vs Diffusion Models
| Aspect | GANs | Diffusion Models |
|---|---|---|
| Training stability | Unstable, mode collapse | Stable, clear objective |
| Output quality | High quality when working | Excellent, less mode collapse |
| Diversity | Can collapse | Full distribution coverage |
| Inference speed | 1 forward pass | 20-100 steps (slow) |
| Mode collapse | Problematic | Rare |
| Controllability | Style mixing, latent arithmetic | Guidance-based |
GANs produce excellent samples but are notoriously hard to train. Diffusion models sacrifice inference speed for training stability and mode coverage.
Modern Status
Largely superseded by diffusion models for image generation (Stable Diffusion, DALL-E 3). Still widely used for:
- Real-time applications (game assets, video frames)
- Domain-specific translation where paired data exists
- Style transfer with perceptual quality requirements
- Research on representation learning and disentanglement
Key Papers
- Generative Adversarial Nets (Goodfellow et al., 2014, NeurIPS) — the original GAN · arXiv:1406.2661
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (Radford et al., 2016) — DCGAN · arXiv:1511.06434
- Image-to-Image Translation with Conditional Adversarial Networks (Isola et al., 2017, CVPR) — Pix2Pix · arXiv:1611.07004
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (Zhu et al., 2017, ICCV) — CycleGAN · arXiv:1703.10593
- A Style-Based Generator Architecture for Generative Adversarial Networks (Karras et al., 2019, CVPR) — StyleGAN · arXiv:1812.04948
- Analyzing and Improving the Image Quality of StyleGAN (Karras et al., 2020, CVPR) — StyleGAN2 · arXiv:1912.04958
Links
- Image Generation — the task GANs are used for
- Diffusion Models — the current state-of-the-art for image generation
- Variational Autoencoders — another generative approach
- Key Papers