FAQ

Frequently asked questions about distributed GAN training.

Getting started

What is this project about?

An educational distributed deep learning system where students collectively train a GAN to generate celebrity faces. It teaches distributed systems, GANs, and collaborative computing.

Do I need a GPU?

No. The system works with CPU-only workers. GPU is faster but optional.

Which installation path should I choose?

  • Google Colab: Easiest, no local setup, free GPU

  • Dev container: Best for development, requires Docker

  • Native Python: Direct control, no Docker needed

  • Conda: If you already use conda

See installation guide for details.

Do I need to download the dataset?

No, the CelebA dataset (~1.4 GB) is downloaded automatically from Hugging Face the first time you run the worker or coordinator. Just make sure you have enough disk space and internet access.

Where do I get database credentials?

Your instructor provides these. You need:

  • Host address

  • Database name

  • Username

  • Password
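
These values typically go into your config.yaml. A sketch of what that might look like (the key names here are illustrative assumptions — follow the template your instructor provides):

```yaml
database:
  host: db.example.edu   # host address from your instructor
  name: gan_training     # database name
  user: student42        # your username
  password: "changeme"   # your password
```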

Can I use my own database?

Yes, but you’d be the coordinator, not a worker. See instructor guide.

Can I join training late?

Yes! Workers can join anytime. Just start your worker and it will begin contributing.

What if I disconnect?

No problem. Your work is saved. Restart your worker and continue. Training state is in the database.

Training and results

How long does training take?

Depends on:

  • Number of active workers

  • Hardware (GPU vs CPU)

  • Target quality

Typical: 2-6 hours with 5-10 GPU workers for decent results.

How do I know it’s working?

Your worker prints progress:

Processing work unit 42...
Completed work unit 42 in 15.3s

Check with instructor or view results on Hugging Face.

What do the loss values mean?

  • Generator loss: How well the generator fools the discriminator (lower = better fooling)

  • Discriminator loss: How well the discriminator distinguishes real from fake images (lower = better discrimination)

Both should generally trend downward but fluctuate.
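
For intuition, the two losses in a DCGAN-style setup are typically computed like this (a sketch of the standard formulation, not necessarily the project's exact loss code):

```python
import torch
import torch.nn.functional as F

# Stand-ins for discriminator output probabilities (clamped away from 0/1
# so the log terms stay finite in this toy example)
d_real = torch.rand(8, 1).clamp(0.01, 0.99)  # scores on real images
d_fake = torch.rand(8, 1).clamp(0.01, 0.99)  # scores on generated images

# Discriminator loss: real scores should approach 1, fake scores 0.
d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
          + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

# Generator loss: the discriminator's fake scores should approach 1 (fooled).
g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```

Because both networks are trained against each other, neither loss falls to zero; they oscillate as the two models improve in tandem.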

Why are my loss values different from others?

Different work units have different batches. Loss varies across batches. Look at trends, not individual values.

When will I see results?

  • Samples generated every N iterations (instructor sets this)

  • Results improve over time

  • Check data/outputs/samples/ or Hugging Face

  • Realistic faces emerge after many iterations

How do I view generated faces?

Options:

  1. Check data/outputs/samples/ (if coordinator)

  2. Open demo notebook: notebooks/demo_trained_model.ipynb

  3. View on Hugging Face (if enabled)

Why do images look blurry?

Early in training, images are noisy/blurry. Quality improves with more iterations. Check latest samples, not early ones.

Can I generate my own faces?

Yes! After training:

from src.models.dcgan import Generator
import torch

gen = Generator()

# Load trained weights (map_location lets this run on CPU-only machines)
checkpoint = torch.load('checkpoint.pth', map_location='cpu')
gen.load_state_dict(checkpoint['generator_state_dict'])
gen.eval()

# Generate 16 faces from random 100-dim latent vectors
noise = torch.randn(16, 100, 1, 1)
with torch.no_grad():
    faces = gen(noise)

How do I save my favorite generated faces?

import torchvision.transforms as T

# GAN output is usually in [-1, 1]; rescale to [0, 1] before converting
face_tensor = (face_tensor + 1) / 2
face_tensor = face_tensor.clamp(0, 1)

# Convert tensor to image and save
to_pil = T.ToPILImage()
image = to_pil(face_tensor)
image.save('my_face.png')

Can I train my own model?

Yes! Use local training mode:

python src/train_local.py --epochs 50

No database needed. See local training guide.

System and performance

How does coordination work?

Workers poll the PostgreSQL database for work units, compute gradients, upload results. Coordinator aggregates gradients and updates weights. See architecture overview.
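
In code, one worker's lifecycle looks roughly like this (a sketch only; claim_work_unit, compute_gradients, and upload_gradients are hypothetical names, not the project's actual API):

```python
import time

def worker_loop(db, model, max_idle=10, poll_interval=1.0):
    """Poll for work, compute gradients, upload results — repeat (sketch)."""
    idle = 0
    while idle < max_idle:
        unit = db.claim_work_unit()        # atomically claim a pending unit
        if unit is None:
            idle += 1
            time.sleep(poll_interval)      # back off while no work is queued
            continue
        idle = 0
        grads = model.compute_gradients(unit)  # forward + backward pass
        db.upload_gradients(unit, grads)       # coordinator aggregates these
```

Because all coordination goes through the database, workers never need to talk to each other directly.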

What’s the database storing?

  • Model weights (current and historical)

  • Computed gradients from workers

  • Work unit assignments and status

  • Worker registration and statistics

See database schema.

What happens if my worker crashes?

The work unit times out and is automatically reclaimed. Another worker (or you after restarting) will process it. No data loss.
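
The reclaim can be a single timeout query. Here is a runnable sketch using sqlite3 as a stand-in for PostgreSQL (the table and column names are assumptions, not the project's actual schema):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE work_units (
    id INTEGER PRIMARY KEY, status TEXT, claimed_at REAL)""")

# Unit 1: claimed 10 minutes ago by a worker that crashed.
conn.execute("INSERT INTO work_units VALUES (1, 'in_progress', ?)",
             (time.time() - 600,))
# Unit 2: recently claimed by a healthy worker.
conn.execute("INSERT INTO work_units VALUES (2, 'in_progress', ?)",
             (time.time(),))

TIMEOUT = 300  # seconds before a claimed unit is considered abandoned
conn.execute("""UPDATE work_units
                SET status = 'pending', claimed_at = NULL
                WHERE status = 'in_progress' AND claimed_at < ?""",
             (time.time() - TIMEOUT,))
# Unit 1 is now pending again and claimable; unit 2 is untouched.
```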

Can I run multiple workers?

Yes, if you have multiple GPUs. Create separate config files or run in different directories.

How is this different from PyTorch DDP?

PyTorch DDP requires direct network communication. This uses database coordination, making it easier for distributed educational setups across networks/firewalls.

Why is training slow?

  • Need more workers

  • Workers may have slow hardware

  • Network latency to database

  • Check batch sizes and configuration

How can I speed it up?

  • Recruit more workers

  • Use a GPU instead of CPU

  • Optimize database location (closer geographically)

  • Increase batch sizes (if memory allows)

What’s the optimal number of workers?

No hard limit. More workers mean faster training, with diminishing returns. Typical: 5-15 workers for a class project.

Does CPU training help?

Yes! CPU workers contribute, though slower than GPU. Every worker helps.

Troubleshooting questions

Worker says “no work units available”

  • Training may not have started

  • Current iteration completed, next not yet created

  • Check with instructor

See troubleshooting guide.

Getting connection errors

  • Verify credentials in config.yaml

  • Check network connection

  • Database may be down (ask instructor)

Out of memory errors

  • Reduce batch size in config.yaml

  • Close other GPU applications

  • Try CPU-only mode

Loss is NaN

  • Lower learning rates

  • Restart from last checkpoint

  • May indicate training instability

Learning and contributing

What will I learn?

  • Distributed system architecture

  • Database-coordinated computing

  • GAN training and theory

  • Collaborative problem solving

  • Python and PyTorch

Do I need to understand all the code?

No. Workers can participate with basic understanding. Deeper learning comes from exploration.

Can I modify the code?

Yes! This is open source and educational. Try:

  • Different model architectures

  • Alternative optimizers

  • Custom loss functions

  • Enhanced monitoring

Where can I learn more about GANs?

Good starting points:

  • The original GAN paper (Goodfellow et al., 2014)

  • The DCGAN paper (Radford et al., 2015)

  • The official PyTorch DCGAN tutorial

Can I contribute improvements?

Yes! This is an educational project. Contributions welcome:

  • Bug fixes

  • Documentation improvements

  • New features

  • Performance optimizations

See contributing guide.

I found a bug, what do I do?

  • Create GitHub issue with details

  • Include error messages and steps to reproduce

  • Or submit a pull request with fix

Can I use this for my research?

Yes, with attribution. See LICENSE file. Consider citing if you publish results.

Advanced topics

How do I add Hugging Face integration?

Instructor sets this up. See Hugging Face integration guide.

Can I use different datasets?

Yes, but requires code modifications:

  • Implement custom dataset class

  • Update data loader

  • Adjust image size if needed
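
A minimal custom dataset class might look like this (the class and parameter names are illustrative; adapt them to your data and the project's loader):

```python
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    """Serve images from a list of file paths (a sketch)."""
    def __init__(self, image_paths, transform=None):
        self.image_paths = image_paths
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Load one image and apply the same transforms the GAN expects
        image = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image
```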

How do I modify the GAN architecture?

Edit src/models/dcgan.py:

  • Change layer sizes

  • Add/remove layers

  • Try different architectures (StyleGAN, etc.)
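
As a concrete example of a tweak, each DCGAN generator stage is roughly a ConvTranspose2d + BatchNorm + ReLU block (illustrative, not the project's exact layer definitions):

```python
import torch
import torch.nn as nn

def gen_block(in_ch, out_ch):
    # Each block doubles spatial resolution (kernel 4, stride 2, padding 1)
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Widening channels (out_ch) adds capacity; appending a block raises the
# output resolution, e.g. from 64x64 to 128x128.
```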

Can this scale to 100+ workers?

Yes, but may need:

  • Database optimization

  • Connection pooling

  • Higher-capacity database server

  • Gradient compression

Still have questions?