Local training setup

Train the GAN on a single GPU without the distributed setup.

When to use local training

Perfect for:

  • Experimentation and development

  • Testing hyperparameters

  • Baseline performance comparison

  • Learning GANs independently

  • Working without database access

Trade-offs

Advantages:

  • Simpler setup (no database)

  • Faster iteration

  • Complete control

  • Immediate feedback

Disadvantages:

  • Single GPU only

  • Misses the distributed systems lessons

  • No collaboration with the class

Prerequisites

  • Any installation path completed (Colab, dev container, native, or conda)

  • CelebA dataset downloaded

  • No database required
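
If you want to sanity-check the dataset before training, the sketch below shows how CelebA is typically loaded for DCGAN training (ImageFolder layout, 64x64 crops, pixels scaled to [-1, 1]); the exact transforms in train_local.py may differ.

import torch
from torchvision import datasets, transforms

# Standard DCGAN preprocessing: crop to 64x64 and scale pixels to [-1, 1]
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

dataset = datasets.ImageFolder('data/celeba', transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                     shuffle=True, num_workers=4)
print(len(dataset), 'images')  # expect ~202,599 for the full CelebA set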

Quick start

1. Skip database configuration

You don’t need config.yaml for local training.

2. Start training

python src/train_local.py --epochs 50 --batch-size 128

Training begins immediately:

Starting local DCGAN training...
Device: cuda
Dataset: 202,599 images
Generator parameters: 4,683,971
Discriminator parameters: 2,765,249

Epoch 1/50
[1/1,582] Loss_D: 1.234 Loss_G: 2.345
[2/1,582] Loss_D: 1.123 Loss_G: 2.234
...

Command-line options

Basic options

python src/train_local.py \
  --epochs 50 \
  --batch-size 128 \
  --dataset-path data/celeba \
  --output-dir data/outputs_local

# --epochs        number of training epochs
# --batch-size    batch size per step
# --dataset-path  dataset location
# --output-dir    output directory

Advanced options

python src/train_local.py \
  --epochs 100 \
  --batch-size 64 \
  --lr-g 0.0002 \
  --lr-d 0.0002 \
  --latent-dim 100 \
  --image-size 64 \
  --sample-interval 1 \
  --checkpoint-interval 5 \
  --num-workers 4

# --lr-g                 generator learning rate
# --lr-d                 discriminator learning rate
# --latent-dim           latent space dimension
# --image-size           image resolution
# --sample-interval      generate samples every N epochs
# --checkpoint-interval  save a checkpoint every N epochs
# --num-workers          DataLoader workers

Resume training

Continue from a checkpoint:

python src/train_local.py \
  --resume data/outputs_local/checkpoints/checkpoint_epoch_0025.pth \
  --epochs 100
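
Under the hood, resuming amounts to restoring model and optimizer state before the epoch loop. Here is a minimal sketch of the idea; apart from 'generator_state_dict' (used later on this page), the checkpoint key names are assumptions, not the script's confirmed layout:

import torch

def load_checkpoint(path, generator, discriminator, opt_g, opt_d, device='cpu'):
    """Restore models and optimizers; returns the epoch to resume from."""
    ckpt = torch.load(path, map_location=device)
    generator.load_state_dict(ckpt['generator_state_dict'])
    discriminator.load_state_dict(ckpt['discriminator_state_dict'])  # assumed key
    opt_g.load_state_dict(ckpt['optimizer_g_state_dict'])            # assumed key
    opt_d.load_state_dict(ckpt['optimizer_d_state_dict'])            # assumed key
    return ckpt['epoch'] + 1                                         # assumed key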

Monitoring progress

Generated samples

View generated images during training:

ls data/outputs_local/samples/
# epoch_001.png
# epoch_002.png
# ...

Open these images to see the generator improving.
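
To watch the progression in a single file, you can stitch the per-epoch grids into an animated GIF with Pillow; a small sketch assuming the filenames shown above:

from pathlib import Path
from PIL import Image

# Collect the per-epoch sample grids in training order
frames = [Image.open(p) for p in
          sorted(Path('data/outputs_local/samples').glob('epoch_*.png'))]

# Write an animated GIF: 200 ms per frame, looping forever
frames[0].save('training_progress.gif', save_all=True,
               append_images=frames[1:], duration=200, loop=0)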

Checkpoints

Model checkpoints are saved periodically:

ls data/outputs_local/checkpoints/
# checkpoint_epoch_0005.pth
# checkpoint_epoch_0010.pth
# checkpoint_latest.pth
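
Before loading a checkpoint into a model, it can help to peek at what it contains; only 'generator_state_dict' is shown elsewhere on this page, so treat the other keys as unknowns:

import torch

# Inspect the checkpoint dict without touching any model
ckpt = torch.load('data/outputs_local/checkpoints/checkpoint_latest.pth',
                  map_location='cpu')
print(list(ckpt.keys()))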

Console output

Training prints loss values:

Epoch 5/50
Avg Generator Loss: 2.145
Avg Discriminator Loss: 0.823
Time: 45.2s

Viewing results

Use the demo notebook

After training (or during):

jupyter notebook notebooks/demo_trained_model.ipynb

The notebook can load local checkpoints:

# In notebook, point to local checkpoint
checkpoint_path = '../data/outputs_local/checkpoints/checkpoint_latest.pth'

Manual inspection

import torch
from src.models.dcgan import Generator

# Load checkpoint (map_location makes this work on CPU-only machines too)
checkpoint = torch.load('data/outputs_local/checkpoints/checkpoint_latest.pth',
                        map_location='cpu')

# Create generator
generator = Generator(latent_dim=100)
generator.load_state_dict(checkpoint['generator_state_dict'])
generator.eval()

# Generate images
noise = torch.randn(16, 100, 1, 1)
with torch.no_grad():
    images = generator(noise)
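
To write those tensors to disk as an image grid, torchvision's save_image handles the layout; assuming the generator ends in tanh (standard DCGAN), its outputs lie in [-1, 1] and need rescaling:

from torchvision.utils import save_image

# Rescale [-1, 1] outputs to [0, 1] and save a 4x4 grid
save_image(images, 'generated_grid.png', nrow=4,
           normalize=True, value_range=(-1, 1))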

Comparing with distributed training

Performance comparison

Local training (single GPU):

  • Full batch every step

  • Immediate gradient updates

  • Faster per-iteration

Distributed training:

  • Combined gradients from N workers

  • More diverse batches per update

  • Better generalization (often)
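
The core difference is easy to state in code: a distributed step combines gradients from N different batches before a single update. A toy illustration of that averaging (not this project's actual aggregation code):

import torch

def averaged_step(model, optimizer, loss_fn, batches):
    """One update from gradients averaged over N 'workers', each giving a batch."""
    optimizer.zero_grad()
    for inputs, targets in batches:
        # Dividing by N makes the accumulated gradients an average, not a sum
        loss = loss_fn(model(inputs), targets) / len(batches)
        loss.backward()
    optimizer.step()  # a single update from the combined gradient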

Try both

  1. Train locally first to understand GANs

  2. Then participate in distributed training

  3. Compare results and learning experience

Hyperparameter tuning

Experiment with different settings:

Learning rates

# Default
python src/train_local.py --lr-g 0.0002 --lr-d 0.0002

# Lower (more stable)
python src/train_local.py --lr-g 0.0001 --lr-d 0.0001

# Higher (faster but risky)
python src/train_local.py --lr-g 0.0004 --lr-d 0.0004
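
Since --lr-g and --lr-d are set independently, unbalanced rates are also worth a try; giving the discriminator a higher rate is a common GAN stabilization trick, though whether it helps here is something to test:

# Unbalanced (discriminator learns faster than the generator)
python src/train_local.py --lr-g 0.0001 --lr-d 0.0004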

Batch sizes

# Smaller (more updates)
python src/train_local.py --batch-size 64

# Default
python src/train_local.py --batch-size 128

# Larger (more stable)
python src/train_local.py --batch-size 256

Training duration

# Quick test (5 epochs)
python src/train_local.py --epochs 5

# Standard
python src/train_local.py --epochs 50

# Extended
python src/train_local.py --epochs 200

Troubleshooting

Out of memory:

  • Reduce --batch-size

  • Reduce --num-workers

  • Use --image-size 32 for smaller images

Training unstable:

  • Lower learning rates

  • Try different batch sizes

  • Check loss values (they should oscillate, not collapse to zero or explode)

Poor quality images:

  • Train longer (more epochs)

  • Adjust learning rates

  • Verify dataset loaded correctly

Slow training:

  • Increase --num-workers

  • Use GPU (verify with nvidia-smi or the Python check after this list)

  • Increase batch size if memory allows
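
To confirm from Python that PyTorch actually sees the GPU:

import torch

print(torch.cuda.is_available())          # True if a CUDA GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # which GPU PyTorch will use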

Next steps