Background
This page provides deeper technical background on GANs and distributed training for students who want to understand the underlying concepts.
How GANs work
Generative Adversarial Networks (GANs) are a class of deep learning models that learn to generate new data by pitting two neural networks against each other in a game-theoretic framework.
The adversarial game
A GAN consists of two competing networks:
Random Noise ──> [Generator] ──> Generated Image ──┐
                                                   ├──> [Discriminator] ──> Real or Fake?
                     Real Image (CelebA) ──────────┘
Generator (G): Takes random noise as input and produces synthetic images. Its goal is to create images realistic enough to fool the discriminator.
Discriminator (D): Receives both real images (from the training dataset) and fake images (from the generator). Its goal is to correctly classify which images are real and which are fake.
The training objective
GANs are trained using a minimax game where:
The discriminator tries to maximize its ability to distinguish real from fake
The generator tries to minimize the discriminator’s success
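Written out, the two goals above correspond to the standard minimax value function from the original GAN formulation. The symbols are the conventional ones and are not taken from this project's code: x is a real image, z is a noise vector drawn from a prior p_z, and D(·) is the discriminator's estimated probability that its input is real.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator increases V by pushing D(x) toward 1 and D(G(z)) toward 0; the generator decreases V by pushing D(G(z)) toward 1.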
We use Binary Cross-Entropy (BCE) loss to train both networks. BCE is the standard loss function for binary classification problems: it measures how well a model’s predicted probabilities match the true labels (0 or 1).
Training the Discriminator: The discriminator is trained to output values close to 1 for real images and close to 0 for fake images. We compute BCE loss twice per step:
Feed real images with label=1, penalizing when D outputs low values
Feed generated (fake) images with label=0, penalizing when D outputs high values
The discriminator’s gradients push it to better distinguish real from fake.
Training the Generator: The generator is trained to fool the discriminator. We feed generated images through the discriminator, but use label=1 (pretending they’re real). The BCE loss penalizes the generator when the discriminator correctly identifies fakes as fake.
The key insight: gradients flow backward through the frozen discriminator into the generator, teaching it what features make images look “more real” to the discriminator.
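To make the labelling trick concrete, here is a minimal sketch of the two losses in plain Python. It works on single probabilities rather than image batches, and the function names are illustrative, not the project's actual API.

```python
import math

def bce(prediction: float, label: float) -> float:
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    eps = 1e-12  # clamp so that log(0) never occurs
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Real images are scored against label 1, generated images against label 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake: float) -> float:
    """Generated images are scored against label 1, as if they were real."""
    return bce(d_fake, 1.0)

# A confident, correct discriminator has a low loss ...
print(discriminator_loss(d_real=0.9, d_fake=0.1))
# ... and the same discriminator output means a high loss for the generator.
print(generator_loss(d_fake=0.1))
```

With label 1 the BCE reduces to -log(D(G(z))), and with label 0 it reduces to -log(1 - D(fake)), which are the same terms the training description uses.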
Training dynamics
Each training step involves:
Train Discriminator:
Show it real images (label = 1) and compute loss
Show it generated images (label = 0) and compute loss
Backpropagate and update discriminator weights
Train Generator:
Generate fake images
Ask discriminator to classify them (but freeze D’s weights)
Backpropagate through D into G, update generator weights
┌─────────────────────────────────────────────────────────┐
│                    GAN Training Loop                    │
├─────────────────────────────────────────────────────────┤
│  Step 1: Train Discriminator                            │
│  ┌─────────┐      ┌─────────┐                           │
│  │  Real   │──────│ D(real) │──> Loss: -log(D(real))    │
│  │ Images  │      └─────────┘                           │
│  └─────────┘                                            │
│  ┌─────────┐      ┌─────────┐      ┌─────────┐          │
│  │  Noise  │──────│    G    │──────│ D(fake) │──> Loss  │
│  └─────────┘      └─────────┘      └─────────┘          │
│                                     -log(1-D(fake))     │
│                                                         │
│  Backprop: ∂L/∂θ_D computed, D weights updated          │
├─────────────────────────────────────────────────────────┤
│  Step 2: Train Generator                                │
│  ┌─────────┐      ┌─────────┐      ┌─────────┐          │
│  │  Noise  │──────│    G    │──────│    D    │──> Loss  │
│  └─────────┘      └─────────┘      └─────────┘          │
│                                     -log(D(G(z)))       │
│                                                         │
│  Backprop: gradients flow through frozen D into G       │
│  Only G weights updated (D frozen)                      │
└─────────────────────────────────────────────────────────┘
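The two-step loop can be sketched end to end on a toy problem. The example below is an illustration under strong simplifying assumptions, not the project's DCGAN: data are single numbers drawn from a normal distribution, the generator is a linear map G(z) = a·z + b, the discriminator is a logistic unit D(x) = sigmoid(w·x + c), and the gradients are derived by hand instead of by an autograd library. All names are hypothetical.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def discriminator_step(p, x_real, z, lr):
    """One SGD step on D's parameters (w, c); G's parameters (a, b) are only read."""
    x_fake = p["a"] * z + p["b"]
    d_real = sigmoid(p["w"] * x_real + p["c"])
    d_fake = sigmoid(p["w"] * x_fake + p["c"])
    # Loss: -log(D(real)) - log(1 - D(fake)). Its gradient with respect to the
    # logit is (D(real) - 1) for the real sample and D(fake) for the fake one.
    grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
    grad_c = (d_real - 1.0) + d_fake
    return {**p, "w": p["w"] - lr * grad_w, "c": p["c"] - lr * grad_c}

def generator_step(p, z, lr):
    """One SGD step on G's parameters (a, b); D is 'frozen' because w and c are not updated."""
    x_fake = p["a"] * z + p["b"]
    d_fake = sigmoid(p["w"] * x_fake + p["c"])
    # Loss: -log(D(G(z))). Chain rule through D: dL/dlogit = D(fake) - 1,
    # dlogit/dx = w, then dx/da = z and dx/db = 1.
    grad_x = (d_fake - 1.0) * p["w"]
    return {**p, "a": p["a"] - lr * grad_x * z, "b": p["b"] - lr * grad_x}

def train(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    p = {"a": 1.0, "b": 0.0, "w": 0.1, "c": 0.0}
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)                          # toy "real data"
        p = discriminator_step(p, x_real, rng.gauss(0.0, 1.0), lr)
        p = generator_step(p, rng.gauss(0.0, 1.0), lr)
    return p

print(train())
```

The generator never sees real data directly. It only receives the gradient that passes through the discriminator's weight w, which is the "gradients flow through frozen D into G" step in the diagram.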
GAN applications beyond image generation
While this project focuses on generating faces, GANs have many other applications:
Computer Vision
Super-resolution: Enhance low-resolution images (SRGAN, ESRGAN)
Image inpainting: Fill in missing or damaged regions
Style transfer: Apply artistic styles to photos (CycleGAN)
Image-to-image translation: Convert sketches to photos, day to night, etc. (pix2pix)
Audio & Music
Voice synthesis: Generate realistic speech (WaveGAN)
Music generation: Create novel musical compositions
Voice conversion: Transform one voice to sound like another
Science & Medicine
Drug discovery: Generate novel molecular structures
Medical imaging: Synthesize training data, augment datasets
Protein structure: Generate plausible protein conformations
Other Domains
Text generation: Though transformers dominate, GANs have been explored (SeqGAN)
Video synthesis: Generate realistic video sequences
3D object generation: Create 3D models from 2D images
Data augmentation: Generate synthetic training data to improve classifiers
Anomaly detection: Train on normal data only, then flag inputs that the discriminator scores as unlikely or that the generator cannot reproduce well
Why GANs are challenging
GANs can be difficult to train due to several factors:
Mode collapse: Generator produces limited variety of outputs
Training instability: Loss oscillates instead of converging
Vanishing gradients: Discriminator becomes too good, giving generator no useful signal
Hyperparameter sensitivity: Learning rate, architecture, and batch size all matter significantly
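The vanishing-gradient problem can be seen with a few lines of arithmetic. The sketch below assumes the discriminator ends in a sigmoid, so D = sigmoid(logit), and compares the gradient that reaches the generator under two loss choices: the original minimax term log(1 - D) and the -log(D) form that results from training the generator with label=1. The function names are illustrative.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def saturating_grad(logit):
    """d/dlogit of log(1 - D): the generator term of the minimax objective."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/dlogit of -log(D): the generator loss obtained with label=1."""
    return sigmoid(logit) - 1.0

# A very negative logit means D is confident the sample is fake.
for logit in (0.0, -5.0, -10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

As the discriminator becomes confident that a sample is fake, the first gradient shrinks toward zero while the second stays close to -1. This is one reason the label=1 formulation is commonly preferred, although it does not remove training instability on its own.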
The DCGAN architecture we use incorporates best practices that help stabilize training:
Batch normalization in generator and discriminator
LeakyReLU activations in discriminator
Strided convolutions instead of pooling
Adam optimizer with a reduced momentum term (the DCGAN paper recommends β₁ = 0.5 instead of the default 0.9)
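To see how strided convolutions replace pooling, it helps to trace the spatial size of a feature map through the layers. The sketch below uses the standard output-size formulas for convolutions and transposed convolutions. The layer settings (kernel 4, stride 2, padding 1) and the 64×64 image size are assumptions borrowed from common DCGAN implementations and may differ from this project's configuration.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Assumed (kernel, stride, padding) per layer; not read from the project.
DISCRIMINATOR_LAYERS = [(4, 2, 1)] * 4 + [(4, 1, 0)]
GENERATOR_LAYERS = [(4, 1, 0)] + [(4, 2, 1)] * 4

def trace(size, layers, formula):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = formula(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

print(trace(64, DISCRIMINATOR_LAYERS, conv_out))        # image -> single score
print(trace(1, GENERATOR_LAYERS, conv_transpose_out))   # noise -> image
```

Each stride-2 layer halves (discriminator) or doubles (generator) the resolution, so the network learns its own down- and upsampling instead of relying on fixed pooling.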
Distributed training for machine learning
Training deep learning models can require enormous amounts of computation. Distributing this work across multiple machines can substantially reduce training time, provided communication overhead stays small relative to compute.
Why distribute training?
Faster iteration: Train models in hours instead of days
Larger models: Fit models that don’t fit on a single GPU
Better utilization: Use idle compute resources efficiently
Collaboration: Multiple participants contribute to a shared goal
Types of distributed training
Data parallelism (what we use)
Each worker processes different data with the same model:
┌─────────────────────────────────────────────────────────┐
│                    Data Parallelism                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│ Training Data: [Batch 1] [Batch 2] [Batch 3] [Batch 4]  │
│                    │         │         │         │      │
│                    ▼         ▼         ▼         ▼      │
│                ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐  │
│                │Worker1│ │Worker2│ │Worker3│ │Worker4│  │
│                │(GPU 1)│ │(GPU 2)│ │(GPU 3)│ │(GPU 4)│  │
│                └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘  │
│                    │         │         │         │      │
│                    ▼         ▼         ▼         ▼      │
│                [Grad 1]  [Grad 2]  [Grad 3]  [Grad 4]   │
│                    │         │         │         │      │
│                    └────────┬┴─────────┴─────────┘      │
│                             ▼                           │
│                    ┌─────────────────┐                  │
│                    │ Average/Reduce  │                  │
│                    └────────┬────────┘                  │
│                             ▼                           │
│                   [Aggregated Gradient]                 │
│                             │                           │
│                             ▼                           │
│                    ┌─────────────────┐                  │
│                    │ Update Weights  │                  │
│                    └─────────────────┘                  │
└─────────────────────────────────────────────────────────┘
Advantages:
Simple to implement
Scales well with more workers
Each worker sees different data, so the averaged gradient covers a larger effective batch and is a less noisy estimate
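A small numeric check shows why averaging per-worker gradients is a valid substitute for one large batch. The sketch assumes a one-parameter model y = w·x with mean-squared-error loss and equally sized shards; the data and function names are made up for illustration.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, shards):
    """Each 'worker' computes a gradient on its own shard; the results are averaged."""
    grads = [mse_gradient(w, shard) for shard in shards]
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
shards = [data[0:2], data[2:4]]  # two workers, two samples each

print(mse_gradient(0.5, data))
print(data_parallel_gradient(0.5, shards))
```

Both calls give the same value up to floating-point rounding. When shards differ in size, a plain average no longer matches the full-batch gradient, which is why a weighted average is needed in that case.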
Model parallelism
Different workers hold different parts of the model:
Input ──> [Worker 1: Layers 1-4] ──> [Worker 2: Layers 5-8] ──> Output
Used when models are too large for a single GPU (e.g., large language models with billions of parameters).
Challenges with model parallelism:
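Pipeline bubbles: Workers sit idle waiting for activations from previous stages. With 4 pipeline stages and no micro-batching, each stage is busy only about a quarter of the time, so up to 75% of compute is wasted during the filling/draining phases; splitting each batch into micro-batches shrinks this bubble.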
Communication overhead: Activations between layers must be sent between workers on every forward pass, and gradients on every backward pass.
Memory imbalance: Different layers have different sizes; balancing memory across workers is non-trivial.
Complex implementation: Requires careful partitioning of the model and coordination of forward/backward passes.
Debugging difficulty: Errors can be hard to trace across distributed model segments.
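The size of the pipeline bubble can be estimated with a simple formula. The sketch assumes a GPipe-style schedule in which every stage takes the same time per micro-batch; real systems deviate from this, so treat the numbers as a rough guide.

```python
def bubble_fraction(stages: int, micro_batches: int = 1) -> float:
    """Fraction of stage-time spent idle while the pipeline fills and drains."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(stages=4, micro_batches=1))  # no micro-batching
print(bubble_fraction(stages=4, micro_batches=8))  # batch split into 8 micro-batches
```

With 4 stages and a single batch the idle fraction is 3/4, which is where the 75% figure comes from. Splitting the batch into 8 micro-batches lowers it to 3/11, or about 27%.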
Modern large language model training (LLaMA-scale models, for example) typically combines multiple strategies: tensor parallelism within nodes, pipeline parallelism across nodes, and data parallelism across groups of nodes.
Gradient aggregation strategies
Synchronous (what we use)
All workers must complete before updating:
Time ──────────────────────────────────────────→
Worker 1: ████████░░░░░░░░░░████████░░░░░░░░░░
Worker 2: ██████████████░░░░████████████░░░░░░
Worker 3: ████████████░░░░░░██████████████░░░░
                           ↑                 ↑
                      Sync barrier      Sync barrier
                      (aggregate)       (aggregate)
Advantages: Consistent, predictable convergence
Disadvantages: Slowest worker determines pace
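The cost of waiting for the slowest worker can be quantified. The sketch below takes per-worker compute times for one round (the values 8, 14, and 12 are read off the first round of the timeline above, in arbitrary time units) and ignores the time spent aggregating.

```python
def round_time(durations):
    """A synchronous round ends only when the slowest worker has finished."""
    return max(durations)

def idle_fraction(durations):
    """Share of total worker-time spent waiting at the barrier."""
    busy = sum(durations)
    available = round_time(durations) * len(durations)
    return 1.0 - busy / available

durations = [8, 14, 12]
print(round_time(durations))
print(idle_fraction(durations))
```

Here the round takes 14 units and roughly 19% of the available worker-time is spent idle. The more the worker speeds differ, the larger this share becomes.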
Asynchronous (alternative approach)
Workers update independently without waiting:
Worker 1: ████↑████↑████↑████↑
Worker 2: ██████↑██████↑██████↑
Worker 3: ████████↑████████↑████
↑ = update weights
Advantages: No idle time, faster wall-clock
Disadvantages: Stale gradients, less stable convergence
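"Stale gradients" can be made measurable with a small event simulation. In this sketch each worker reads the current model version, computes for a fixed duration, and then applies its update; staleness is the number of updates other workers applied in the meantime. The durations (4, 6, 8) mirror the timeline above, and all names are illustrative.

```python
import heapq

def simulate_async(durations, horizon):
    """Return, per worker, the staleness of every update applied up to `horizon`."""
    version = 0                              # number of updates applied so far
    read_version = [0] * len(durations)      # version each worker last read
    staleness = [[] for _ in durations]
    events = [(d, w) for w, d in enumerate(durations)]  # (finish time, worker)
    heapq.heapify(events)
    while events[0][0] <= horizon:
        t, w = heapq.heappop(events)
        staleness[w].append(version - read_version[w])
        version += 1                         # worker w applies its update
        read_version[w] = version            # ... and reads the fresh weights
        heapq.heappush(events, (t + durations[w], w))
    return staleness

for worker, values in enumerate(simulate_async([4, 6, 8], horizon=24)):
    print(worker, sum(values) / len(values))
```

The slowest worker's gradients are computed against weights that are, on average, several updates old, while a worker running alone would always see a staleness of zero.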
Our approach: Database-coordinated synchronous training
This project uses a hybrid approach:
Partial synchronization: Wait for N workers (not all) before updating
Database as message queue: No direct worker-to-worker communication
Weighted averaging: Workers contribute proportionally to samples processed
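The first and third points can be combined in a few lines. This is a minimal sketch of the idea, not the project's coordinator code: gradients are plain lists of floats, each submission carries the number of samples it was computed on, and `min_workers` plays the role of N.

```python
def aggregate(submissions, min_workers):
    """Weighted average of gradients, or None if too few workers have reported.

    submissions: list of (gradient_vector, num_samples) tuples.
    """
    if len(submissions) < min_workers:
        return None  # partial synchronization: keep waiting
    total_samples = sum(n for _, n in submissions)
    dim = len(submissions[0][0])
    return [
        sum(grad[i] * n for grad, n in submissions) / total_samples
        for i in range(dim)
    ]

print(aggregate([([1.0, 0.0], 10)], min_workers=2))
print(aggregate([([1.0, 0.0], 30), ([0.0, 1.0], 10)], min_workers=2))
```

The first call returns None because only one worker has reported. In the second, the worker that processed 30 of the 40 samples contributes three quarters of the result.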
┌─────────────────────────────────────────────────────────┐
│              Database-Coordinated Training              │
├─────────────────────────────────────────────────────────┤
│                                                         │
│      ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐     │
│      │Worker 1│  │Worker 2│  │Worker 3│  │Worker 4│     │
│      └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘     │
│          │           │           │           │          │
│          ▼           ▼           ▼           ▼          │
│    ┌──────────────────────────────────────────────┐     │
│    │             PostgreSQL Database              │     │
│    │   ┌────────┐  ┌──────────┐  ┌────────────┐   │     │
│    │   │  Work  │  │ Gradients│  │   Model    │   │     │
│    │   │ Units  │  │          │  │  Weights   │   │     │
│    │   └────────┘  └──────────┘  └────────────┘   │     │
│    └──────────────────────────────────────────────┘     │
│                           ▲                             │
│                           │                             │
│                    ┌──────┴──────┐                      │
│                    │ Coordinator │                      │
│                    │  (Aggregates│                      │
│                    │   gradients)│                      │
│                    └─────────────┘                      │
└─────────────────────────────────────────────────────────┘
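The database-as-message-queue pattern can be sketched with Python's built-in sqlite3 module standing in for PostgreSQL. Everything here is an assumption made for illustration: the table names, column names, and function names are hypothetical and are not the project's actual schema.

```python
import json
import sqlite3

def create_schema(conn, num_units):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE work_units (id INTEGER PRIMARY KEY, "
            "status TEXT NOT NULL DEFAULT 'pending', worker TEXT)"
        )
        conn.execute(
            "CREATE TABLE gradients (work_unit_id INTEGER PRIMARY KEY, "
            "payload TEXT NOT NULL, num_samples INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO work_units (id) VALUES (?)",
            [(i,) for i in range(1, num_units + 1)],
        )

def claim_work_unit(conn, worker):
    """Mark the next pending unit as claimed; return its id, or None if none is left."""
    with conn:
        row = conn.execute(
            "SELECT id FROM work_units WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        claimed = conn.execute(
            "UPDATE work_units SET status = 'claimed', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        return row[0] if claimed.rowcount == 1 else None

def submit_gradient(conn, unit_id, gradient, num_samples):
    """Store a worker's gradient (as JSON text) and mark its work unit done."""
    with conn:
        conn.execute(
            "INSERT INTO gradients VALUES (?, ?, ?)",
            (unit_id, json.dumps(gradient), num_samples),
        )
        conn.execute("UPDATE work_units SET status = 'done' WHERE id = ?", (unit_id,))

def collect_gradients(conn):
    """What a coordinator would read before aggregating."""
    rows = conn.execute(
        "SELECT payload, num_samples FROM gradients ORDER BY work_unit_id"
    ).fetchall()
    return [(json.loads(payload), n) for payload, n in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn, num_units=2)
unit = claim_work_unit(conn, "worker-1")
submit_gradient(conn, unit, [0.5, -0.5], num_samples=32)
print(collect_gradients(conn))
```

Workers and the coordinator only ever talk to the database, never to each other. With many concurrent workers on PostgreSQL, the claim step would typically use row locking such as `SELECT ... FOR UPDATE SKIP LOCKED` so that two workers cannot claim the same unit.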
This design is ideal for educational settings:
No complex networking: Students don’t need port forwarding or VPNs
Fault tolerant: Workers can join/leave without disruption
Transparent: All state visible in database for debugging
Next steps
Installation - Set up your environment
Quick start - Get running in 5 minutes