Local training setup
Train the GAN on a single GPU without the distributed setup.
When to use local training
Perfect for:
Experimentation and development
Testing hyperparameters
Baseline performance comparison
Learning GANs independently
No database access needed
Trade-offs
Advantages:
Simpler setup (no database)
Faster iteration
Complete control
Immediate feedback
Disadvantages:
Single GPU only
Miss distributed systems lessons
Can’t collaborate with class
Prerequisites
Any installation path completed (Colab, dev container, native, or conda)
CelebA dataset downloaded
No database required
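Before starting a long run, a quick sanity check that the GPU and dataset are visible can save time. The snippet below is a minimal sketch assuming the default dataset location data/celeba used in the commands later on this page, and JPEG files as shipped with CelebA.

from pathlib import Path
import torch

# The training log below prints "Device: cuda" when a GPU is detected
print('CUDA available:', torch.cuda.is_available())

# Count images under the default --dataset-path (CelebA images are JPEGs)
dataset = Path('data/celeba')
n_images = sum(1 for _ in dataset.rglob('*.jpg'))
print(f'Found {n_images} images in {dataset}')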
Quick start
1. Skip database configuration
You don’t need config.yaml for local training.
2. Start training
python src/train_local.py --epochs 50 --batch-size 128
Training begins immediately:
Starting local DCGAN training...
Device: cuda
Dataset: 202,599 images
Generator parameters: 4,683,971
Discriminator parameters: 2,765,249
Epoch 1/50
[1/1,582] Loss_D: 1.234 Loss_G: 2.345
[2/1,582] Loss_D: 1.123 Loss_G: 2.234
...
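The 1,582 steps per epoch follow from the numbers above: 202,599 images / 128 per batch ≈ 1,583, or 1,582 full batches if the DataLoader drops the incomplete final batch.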
Command-line options
Basic options
python src/train_local.py \
    --epochs 50 \
    --batch-size 128 \
    --dataset-path data/celeba \
    --output-dir data/outputs_local

--epochs: number of training epochs
--batch-size: batch size
--dataset-path: dataset location
--output-dir: output directory
Advanced options
python src/train_local.py \
    --epochs 100 \
    --batch-size 64 \
    --lr-g 0.0002 \
    --lr-d 0.0002 \
    --latent-dim 100 \
    --image-size 64 \
    --sample-interval 1 \
    --checkpoint-interval 5 \
    --num-workers 4

--lr-g, --lr-d: generator and discriminator learning rates
--latent-dim: latent space dimension
--image-size: image resolution
--sample-interval: generate samples every N epochs
--checkpoint-interval: save a checkpoint every N epochs
--num-workers: DataLoader worker processes
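If you are curious how these flags could map onto a parser, the sketch below simply mirrors the options documented above; the actual argument handling in train_local.py may be organized differently.

import argparse

# Hypothetical parser covering the documented flags; defaults mirror the
# example values above and may not match train_local.py exactly
parser = argparse.ArgumentParser(description='Local DCGAN training')
parser.add_argument('--epochs', type=int, default=50)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--dataset-path', default='data/celeba')
parser.add_argument('--output-dir', default='data/outputs_local')
parser.add_argument('--lr-g', type=float, default=0.0002)
parser.add_argument('--lr-d', type=float, default=0.0002)
parser.add_argument('--latent-dim', type=int, default=100)
parser.add_argument('--image-size', type=int, default=64)
parser.add_argument('--sample-interval', type=int, default=1)
parser.add_argument('--checkpoint-interval', type=int, default=5)
parser.add_argument('--num-workers', type=int, default=4)
parser.add_argument('--resume', default=None, help='checkpoint path to resume from')
args = parser.parse_args()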
Resume training
Continue from a checkpoint:
python src/train_local.py \
--resume data/outputs_local/checkpoints/checkpoint_epoch_0025.pth \
--epochs 100
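Under the hood, resuming amounts to restoring the saved weights (and, typically, optimizer state and the epoch counter) before continuing the training loop. The sketch below shows the general pattern; train_local.py handles this for you when you pass --resume, and every key except generator_state_dict (see Manual inspection below) is an assumed name that may differ in this project.

import torch
from src.models.dcgan import Generator

# Rebuild the generator with the same configuration as the original run
generator = Generator(latent_dim=100)
opt_g = torch.optim.Adam(generator.parameters(), lr=0.0002)

ckpt = torch.load('data/outputs_local/checkpoints/checkpoint_epoch_0025.pth',
                  map_location='cpu')

generator.load_state_dict(ckpt['generator_state_dict'])
# Optimizer state and epoch counter, if the checkpoint stores them
if 'optimizer_g_state_dict' in ckpt:
    opt_g.load_state_dict(ckpt['optimizer_g_state_dict'])
start_epoch = ckpt.get('epoch', 0) + 1  # training would continue from here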
Monitoring progress
Generated samples
View generated images during training:
ls data/outputs_local/samples/
# epoch_001.png
# epoch_002.png
# ...
Open these images to see the generator improving.
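To see the whole progression at a glance, the per-epoch samples can be stitched into an animated GIF. This is a small sketch assuming Pillow is installed and the epoch_NNN.png naming shown above.

from pathlib import Path
from PIL import Image

# Collect the per-epoch sample images in order
frames = sorted(Path('data/outputs_local/samples').glob('epoch_*.png'))
images = [Image.open(p) for p in frames]

# Write an animated GIF, 300 ms per frame, looping forever
if images:
    images[0].save('data/outputs_local/progress.gif', save_all=True,
                   append_images=images[1:], duration=300, loop=0)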
Checkpoints
Model checkpoints saved periodically:
ls data/outputs_local/checkpoints/
# checkpoint_epoch_0005.pth
# checkpoint_epoch_0010.pth
# checkpoint_latest.pth
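If you want to see exactly what a checkpoint contains, load it and list its keys. Apart from generator_state_dict (used in Manual inspection below), the stored keys depend on how train_local.py saves them.

import torch

# map_location='cpu' lets you inspect the file without a GPU
ckpt = torch.load('data/outputs_local/checkpoints/checkpoint_latest.pth',
                  map_location='cpu')
print(sorted(ckpt.keys()))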
Console output
Training prints loss values:
Epoch 5/50
Avg Generator Loss: 2.145
Avg Discriminator Loss: 0.823
Time: 45.2s
Viewing results
Use the demo notebook
After training (or during):
jupyter notebook notebooks/demo_trained_model.ipynb
The notebook can load local checkpoints:
# In notebook, point to local checkpoint
checkpoint_path = '../data/outputs_local/checkpoints/checkpoint_latest.pth'
Manual inspection
import torch
from src.models.dcgan import Generator
# Load checkpoint
checkpoint = torch.load('data/outputs_local/checkpoints/checkpoint_latest.pth')
# Create generator
generator = Generator(latent_dim=100)
generator.load_state_dict(checkpoint['generator_state_dict'])
generator.eval()
# Generate images
noise = torch.randn(16, 100, 1, 1)
with torch.no_grad():
    images = generator(noise)
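To write the generated batch to disk, torchvision's grid helper works well. This assumes the generator's final activation is tanh (the standard DCGAN choice), so normalize=True rescales the outputs from [-1, 1] to [0, 1] for saving.

from torchvision.utils import save_image

# Arrange the 16 samples into a 4x4 grid and save as a PNG
save_image(images, 'generated_grid.png', nrow=4, normalize=True)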
Comparing with distributed training
Performance comparison
Local training (single GPU):
Full batch every step
Immediate gradient updates
Faster per iteration
Distributed training:
Combined gradients from N workers
More diverse batches per update
Better generalization (often)
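For example, four workers each training with a batch size of 128 contribute gradients computed over 4 × 128 = 512 images per synchronized update, whereas local training updates on 128 images per step.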
Try both
Train locally first to understand GANs
Then participate in distributed training
Compare results and learning experience
Hyperparameter tuning
Experiment with different settings:
Learning rates
# Default
python src/train_local.py --lr-g 0.0002 --lr-d 0.0002
# Lower (more stable)
python src/train_local.py --lr-g 0.0001 --lr-d 0.0001
# Higher (faster but risky)
python src/train_local.py --lr-g 0.0004 --lr-d 0.0004
Batch sizes
# Smaller (more updates)
python src/train_local.py --batch-size 64
# Default
python src/train_local.py --batch-size 128
# Larger (more stable)
python src/train_local.py --batch-size 256
Training duration
# Quick test (5 epochs)
python src/train_local.py --epochs 5
# Standard
python src/train_local.py --epochs 50
# Extended
python src/train_local.py --epochs 200
Troubleshooting
Out of memory:
Reduce --batch-size
Reduce --num-workers
Use --image-size 32 for smaller images
Training unstable:
Lower learning rates
Try different batch sizes
Watch the loss values: if the discriminator loss collapses toward zero while the generator loss keeps climbing, training has destabilized
Poor quality images:
Train longer (more epochs)
Adjust learning rates
Verify dataset loaded correctly
Slow training:
Increase --num-workers
Use GPU (verify with nvidia-smi)
Increase batch size if memory allows
Next steps
Architecture overview - Understand the models
Distributed training - Try collaborative training
Performance tips - Optimize training