Generative Adversarial Networks

What

GANs frame generative modeling as a two-player game between a generator G (creates fake data) and a discriminator D (distinguishes real from fake). The generator learns to produce outputs so realistic that D can’t distinguish them from real data.

Generator G(z): random noise z → fake data G(z)
Discriminator D(x): real or fake? → probability D(x) is real

D is trained to maximize:    E_x[log D(x_real)] + E_z[log(1 - D(G(z)))]
G is trained to minimize:    E_z[log(1 - D(G(z)))]

At equilibrium, G produces perfect fakes, and D outputs 0.5 for everything (can’t tell real from fake).
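The two objectives can be sketched numerically. A minimal NumPy sketch, not a training loop; the non-saturating G loss is the practical variant suggested in the original paper for the saturation problem discussed below:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # D maximizes log D(x_real) + log(1 - D(G(z)));
    # equivalently, it minimizes the negation.
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss_minimax(d_fake):
    # Original minimax objective for G: minimize log(1 - D(G(z))).
    return np.log(1.0 - d_fake).mean()

def g_loss_nonsaturating(d_fake):
    # Non-saturating variant: maximize log D(G(z)), which keeps
    # gradients alive when D confidently rejects fakes.
    return -np.log(d_fake).mean()

# At the theoretical equilibrium D outputs 0.5 everywhere:
d_eq = np.full(4, 0.5)
print(round(d_loss(d_eq, d_eq), 4))  # 2 * log 2 ≈ 1.3863
```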

Training Dynamics

The minimax game has a unique equilibrium when:

  • D is optimal for the current G
  • G minimizes the Jensen-Shannon divergence between p_data and p_model

In practice, balancing G and D is tricky:

  • If D too weak: it accepts poor fakes, so G can win with a narrow set of degenerate outputs (mode collapse)
  • If D too strong: D(G(z)) ≈ 0 everywhere, log(1 - D(G(z))) saturates, and G's gradient vanishes (it stops learning)
  • If G too weak: D separates real from fake almost perfectly, and again G receives near-zero gradient

Practical training tips

  • Use spectral normalization on D (controls Lipschitz constant)
  • Alternate: 1 D step per G step (D needs to stay close to optimal)
  • Use one-sided label smoothing (0.9 instead of 1.0 for real) to keep D from becoming overconfident, which would starve G of gradient
  • Monitor D loss: if it goes to 0 too fast, D is too strong
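The soft-label tip can be illustrated with plain binary cross-entropy; the discriminator outputs below are made up:

```python
import numpy as np

def bce(preds, targets):
    # Mean binary cross-entropy between predictions and targets.
    return -(targets * np.log(preds) + (1 - targets) * np.log(1 - preds)).mean()

d_real = np.array([0.95, 0.99, 0.90])  # hypothetical D outputs on real images

hard = bce(d_real, np.ones_like(d_real))       # targets = 1.0
soft = bce(d_real, np.full_like(d_real, 0.9))  # one-sided label smoothing

# With target 0.9 the loss is minimized at D(x) = 0.9, not 1.0,
# so D is penalized for pushing its outputs toward certainty.
```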

Mode Collapse

The generator finds a small subset of the data distribution that fools D well, then only produces that subset. D can’t distinguish these fakes, so G has no incentive to diversify.

Solutions:

  • Unrolled GANs: D’s optimization is simulated for several steps before computing G’s gradient
  • Wasserstein GAN (WGAN): minimizes the earth mover's (Wasserstein-1) distance instead of the Jensen-Shannon divergence
  • Mixed strategies: aggregate multiple G outputs
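The WGAN change amounts to a loss swap: the critic outputs unbounded scores rather than probabilities. A minimal sketch with illustrative function names (a real implementation also needs the Lipschitz constraint enforced during training):

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    # The critic maximizes E[f(x)] - E[f(G(z))]; we minimize the negation.
    # Scores are raw reals -- no sigmoid, no log.
    return -(scores_real.mean() - scores_fake.mean())

def wgan_generator_loss(scores_fake):
    # G tries to raise the critic's score on its samples.
    return -scores_fake.mean()

def clip_weights(w, c=0.01):
    # Original WGAN enforces the Lipschitz constraint by clipping
    # critic weights (WGAN-GP later replaced this with a gradient penalty).
    return np.clip(w, -c, c)
```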

Architecture: DCGAN

The Deep Convolutional GAN (2016) established stable architecture patterns:

  • Strided convolutions instead of pooling (learns its own spatial downsampling)
  • Batch normalization in both G and D
  • LeakyReLU in D (nonzero slope for negative inputs keeps gradients flowing back to G)
  • No fully connected hidden layers (G builds spatial structure with transposed convolutions)
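The strided-convolution pattern comes down to simple size arithmetic. A sketch using the kernel/stride/padding values common in DCGAN-style generators:

```python
def conv_transpose_out(size, kernel, stride, padding):
    # Spatial output size of a transposed convolution
    # (standard formula; no output_padding or dilation).
    return (size - 1) * stride - 2 * padding + kernel

# DCGAN-style generator: project z to a 4x4 feature map, then double
# the resolution with each 4x4-kernel, stride-2, padding-1 layer.
size = 4
for _ in range(4):
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)
print(size)  # prints 64 (4 -> 8 -> 16 -> 32 -> 64)
```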

StyleGAN: Disentangling Style and Content

StyleGAN (2018, 2019) introduced the mapping network and style injection:

z → mapping network (8 FC layers) → w (style code)
w → AdaIN (Adaptive Instance Normalization) → each layer of synthesis network

This separates high-level style (from w) from stochastic variation (from independent noise inputs). Mixing styles at different layers controls coarse vs fine attributes.
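AdaIN itself is a few lines: normalize each channel, then apply a per-channel scale and bias derived from w. A NumPy sketch; in the real model the scale and bias come from a learned affine map of w, here they are random stand-ins:

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-5):
    """Adaptive Instance Normalization on a (C, H, W) feature map.

    Normalizes each channel to zero mean / unit variance, then applies
    a per-channel scale and bias predicted from the style code w.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return style_scale[:, None, None] * x_norm + style_bias[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))  # 8 channels, 4x4 spatial
scale = rng.normal(size=8)
bias = rng.normal(size=8)
out = adain(feat, scale, bias)
# Each channel now has mean ≈ bias and std ≈ |scale|: the style code
# fully determines the channel statistics.
```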

StyleGAN2 (2020)

  • Weight demodulation instead of AdaIN (more stable, better gradient flow)
  • Path length regularization (encourages smooth interpolation)
  • No progressive growing (was used in StyleGAN1)
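Demodulation replaces AdaIN's data-dependent normalization with a rescaling of the convolution weights themselves. A sketch of the step; the shapes and the per-input-channel style vector are illustrative:

```python
import numpy as np

def demodulate(w, style, eps=1e-8):
    # StyleGAN2-style modulation + demodulation for a conv weight
    # w of shape (out_ch, in_ch, kh, kw); `style` holds one scale
    # per input channel (predicted from w-space in the real model).
    w = w * style[None, :, None, None]                       # modulate
    norm = np.sqrt((w ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return w / norm                                          # demodulate

rng = np.random.default_rng(1)
weights = rng.normal(size=(16, 8, 3, 3))
s = rng.normal(size=8)
w_dm = demodulate(weights, s)
# Each output channel's effective weight now has unit L2 norm, which
# keeps activation magnitudes stable without normalizing activations.
```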

StyleGAN3 (2021)

  • Alias-free generator (careful signal processing removes aliasing in the synthesis network)
  • Translation- and rotation-equivariant synthesis (fixes "texture sticking", where fine details cling to pixel coordinates in animated images)

Conditional GANs

Condition both G and D on a class label or other input:

D(x, c): "is this real image of class c?"
G(z, c): "generate fake image of class c"

This enables:

  • Class-conditional generation (class-specific outputs)
  • Image-to-image translation (Pix2Pix, CycleGAN)
  • Text-to-image (CLIP-guided generation)
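The simplest conditioning mechanism, from the original conditional GAN formulation, concatenates a label encoding onto the inputs; the dimensions below are illustrative:

```python
import numpy as np

def one_hot(label, num_classes):
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Append the label encoding to G's noise input; D gets the same
# treatment (label concatenated with, or broadcast over, the image).
rng = np.random.default_rng(0)
z = rng.normal(size=100)          # noise vector
c = one_hot(3, 10)                # class label 3 of 10
g_input = np.concatenate([z, c])  # G(z, c) sees a 110-dim input
```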

Image-to-Image Translation

Pix2Pix (2017)

Paired translation: (input image, output image) pairs required.

G: input domain → output domain
D: (input, output) → is this a real pair?

Example: edges → photo, satellite → map, day → night.

CycleGAN (2017)

Unpaired translation: no paired examples needed.

G: X → Y, F: Y → X
D_X: is this from domain X?
D_Y: is this from domain Y?
+ Cycle consistency: F(G(x)) ≈ x, G(F(y)) ≈ y
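The cycle-consistency term is an L1 round-trip penalty (the paper weights it with λ = 10). A sketch with toy invertible "generators"; the lambdas stand in for networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F, lam=10.0):
    # L1 error after a round trip through both generators:
    # F(G(x)) should recover x, and G(F(y)) should recover y.
    loss_x = np.abs(F(G(x)) - x).mean()
    loss_y = np.abs(G(F(y)) - y).mean()
    return lam * (loss_x + loss_y)

# Toy "generators" that are exact inverses, so the round trip is
# lossless and the cycle loss is zero.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0
x = np.array([0.5, -1.0])
y = np.array([3.0, 0.0])
print(cycle_consistency_loss(x, y, G, F))  # 0.0
```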

This enables translation between domains where paired data doesn’t exist.

Evaluation Metrics

| Metric | What it measures | Limitation |
| --- | --- | --- |
| FID (Fréchet Inception Distance) | Distribution similarity to real images | Needs large sample |
| IS (Inception Score) | Quality × diversity | Doesn't compare to real data |
| Precision/Recall | Quality vs coverage of distribution | Computationally expensive |
| LPIPS | Perceptual similarity | Needs reference network |
| Human evaluation | Subjective quality | Expensive, inconsistent |

FID is the de facto standard: lower is better, and single-digit FID on common benchmarks is typically in the near-photorealistic range. Scores are only comparable at matched sample sizes.
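FID is the Fréchet (2-Wasserstein) distance between two Gaussians fitted to Inception-v3 features of real and generated samples. A sketch of the distance itself, assuming the feature means and covariances have already been estimated:

```python
import numpy as np

def sqrtm_psd(a):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def frechet_distance(mu1, cov1, mu2, cov2):
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}),
    # with (C1 C2)^{1/2} evaluated in its symmetric form.
    s1 = sqrtm_psd(cov1)
    covmean = sqrtm_psd(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Identical Gaussians -> 0; shifting the mean by (1, 1) -> 2.
I = np.eye(2)
print(frechet_distance(np.zeros(2), I, np.zeros(2), I))  # ≈ 0.0
print(frechet_distance(np.zeros(2), I, np.ones(2), I))   # ≈ 2.0
```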

GANs vs Diffusion Models

| Aspect | GANs | Diffusion Models |
| --- | --- | --- |
| Training stability | Unstable, mode collapse | Stable, clear objective |
| Output quality | High quality when working | Excellent, less mode collapse |
| Diversity | Can collapse | Full distribution coverage |
| Inference speed | 1 forward pass | 20-100 steps (slow) |
| Mode collapse | Problematic | Rare |
| Controllability | Style mixing, latent arithmetic | Guidance-based |

GANs produce excellent samples but are notoriously hard to train. Diffusion models sacrifice inference speed for training stability and mode coverage.

Modern Status

Largely superseded by diffusion models for image generation (Stable Diffusion, DALL-E 3). Still widely used for:

  • Real-time applications (game assets, video frames)
  • Domain-specific translation where paired data exists
  • Style transfer with perceptual quality requirements
  • Research on representation learning and disentanglement

Key Papers

  • Generative Adversarial Nets (Goodfellow et al., 2014, NeurIPS) — the original GAN · arXiv:1406.2661
  • Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (Radford et al., 2016, ICLR) — DCGAN · arXiv:1511.06434
  • Image-to-Image Translation with Conditional Adversarial Networks (Isola et al., 2017, CVPR) — Pix2Pix · arXiv:1611.07004
  • Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (Zhu et al., 2017, ICCV) — CycleGAN · arXiv:1703.10593
  • A Style-Based Generator Architecture for Generative Adversarial Networks (Karras et al., 2019, CVPR) — StyleGAN · arXiv:1812.04948
  • Analyzing and Improving the Image Quality of StyleGAN (Karras et al., 2020, CVPR) — StyleGAN2 · arXiv:1912.04958