GANNs with friends
Getting started
Overview
What makes this project unique?
Database-coordinated distributed training
Multiple participation paths
Educational focus
Key concepts
Distributed data parallel training
GAN architecture
Database as message queue
Project outcomes
Next steps
Background
How GANs work
The adversarial game
The training objective
Training dynamics
GAN applications beyond image generation
Why GANs are challenging
Distributed training for machine learning
Why distribute training?
Types of distributed training
Data parallelism (what we use)
Model parallelism
Gradient aggregation strategies
Synchronous (what we use)
Asynchronous (alternative approach)
Our approach: Database-coordinated synchronous training
Next steps
Installation
Comparison of installation paths
Prerequisites by path
All paths require:
Path-specific requirements:
CPU vs GPU training
Verification
Next steps
Quick start
For students (workers)
Step 1: Choose your path
Step 2: Get database credentials
Step 3: Configure and run
For instructors (coordinator)
Step 1: Set up database
Step 2: Initialize the training system
Step 3: Start coordinator
Step 4: Monitor progress
Optional: Hugging Face integration
Viewing results
Troubleshooting
Next steps
Setup paths
Google Colab setup
Advantages
Limitations
Setup steps
1. Open the Colab notebook
2. Enable GPU runtime
3. Run the setup cells
4. Configure database connection
5. Start worker
Keeping the worker running
Monitoring your contribution
Stopping the worker
Tips for Colab users
Troubleshooting
Next steps
Dev container setup
GPU configuration
Advantages
Prerequisites
CPU configuration
Advantages
Prerequisites
Installation steps
1. Install prerequisites
2. Clone repository
3. Open in container
4. Configure database
5. Start worker
Container features
Hardware detection
Jupyter notebooks
Development tools
Working with the container
Restart container
Access terminal
Install additional packages
File access
Troubleshooting
Next steps
Native Python setup
Advantages
Disadvantages
Prerequisites
Installation steps
1. Verify Python version
2. Clone repository
3. Create virtual environment
4. Install dependencies
5. Verify installation
6. Configure database
7. Start worker
Managing the environment
Activate environment
Deactivate environment
Update dependencies
Freeze current environment
CPU vs GPU
Troubleshooting
Next steps
Conda environment setup
Advantages
Prerequisites
Installation steps
1. Install Conda
2. Clone repository
3. Create conda environment
4. Activate environment
5. Verify installation
6. Configure database
7. Start worker
Managing conda environments
List environments
Activate/deactivate
Update environment
Export environment
Remove environment
Environment files explained
environment.yml (GPU)
.devcontainer/cpu/environment.yml (CPU-only)
Troubleshooting
Next steps
Local training setup
When to use local training
Trade-offs
Prerequisites
Quick start
1. Skip database configuration
2. Start training
Command-line options
Basic options
Advanced options
Resume training
Monitoring progress
Generated samples
Checkpoints
Console output
Viewing results
Use the demo notebook
Manual inspection
Comparing with distributed training
Performance comparison
Try both
Hyperparameter tuning
Learning rates
Batch sizes
Training duration
Troubleshooting
Next steps
User guides
Student guide
Your role
Getting started
1. Choose your setup path
2. Get credentials
3. Configure and run
4. Set your name (optional)
What the worker does
Automatic workflow
What you’ll see
Understanding the output
Initialization
During training
Monitoring your contribution
In the console
Database queries
Leaderboard (if available)
Best practices
Maximize contribution
Resource management
When to stop
Troubleshooting
No work units available
Connection errors
Out of memory
Worker crashes
Slow performance
FAQ
Learning outcomes
Next steps
Instructor guide
Your role
Pre-training setup
1. Deploy database
2. Initialize database
3. Create student accounts
4. Configure coordinator
Running training
Start coordinator
Monitor progress
Managing the training session
Pause training
Resume training
Adjust parameters mid-training
Handle stalled workers
Monitoring tools
Real-time visualization
Generated samples
Hugging Face integration
Best practices
Before class
During class
After class
Troubleshooting
No workers connecting
Training very slow
Unstable training
Database full
Grading and assessment
Track individual contributions
Metrics to consider
Export results
Advanced topics
Multiple coordinator instances
Custom work unit creation
Gradient verification
Learning objectives
Next steps
Configuration reference
Configuration file structure
Database configuration
Options
Training configuration
Options
Worker configuration
Options
Model configuration
Options
Data configuration
Options
Hugging Face configuration
Default behavior
Running your own training
Options
Understanding distributed training trade-offs
Work unit size vs. database overhead
Worker batch size
Aggregation threshold vs. gradient quality
Update frequency vs. convergence speed
Worker coordination patterns
Monitoring and adjustment
Example configurations
Small class (2-5 students)
Large class (10+ students)
CPU-only mode
Quick testing
High quality (long training)
Environment variables
Security best practices
Validation
Next steps
Monitoring
Overview
Monitoring methods
1. Console output
2. Database queries
3. Generated samples
4. Hugging Face
Database monitoring
Active workers
Training progress
Work unit status
Worker contribution leaderboard
Stalled work units
Python monitoring script
Visual monitoring
Watch generated samples
Create GIF of progress
Plot loss curves
Hugging Face monitoring
Performance metrics
Worker efficiency
Database performance
Alerts and notifications
Email on completion
Slack notifications
Troubleshooting with monitoring
Issue: No workers active
Issue: Work units stuck
Issue: Slow progress
Dashboard ideas
Simple web dashboard
Real-time updates with WebSockets
Best practices
Next steps
Architecture
Architecture overview
Distributed deep learning fundamentals
Data parallelism
Synchronous vs asynchronous training
System components
High-level architecture
Data flow
1. Initialization (Coordinator)
2. Worker claims and processes
3. Coordinator aggregates
4. Iteration continues
Database schema
training_state table
work_units table
gradients table
workers table
Coordination mechanism
Atomic work claiming
Timeout and reclamation
Heartbeat monitoring
Communication patterns
Pull-based architecture
Stateless workers
Centralized coordination
Fault tolerance
Worker failures
Coordinator failures
Database failures
Scalability
Horizontal scaling
Vertical scaling
Database optimization
Security considerations
Database access
Data validation
Next steps
Development approach
Development approach and AI assistance
Transparency about AI usage
Development timeline
Collaborative workflow
Why disclose this
Example: Fixing the stale work unit bug
Example: Catching AI hallucinations
Additional resources
Troubleshooting
Installation issues
Python version too old
PyTorch CUDA mismatch
Database connection fails
Runtime issues
Out of memory
Worker can’t find dataset
No work units available
Work units timeout
Training issues
Loss values are NaN
Poor image quality
Training very slow
Discriminator dominates
Database issues
Database full
Too many connections
Slow queries
Network issues
Timeouts
Slow uploads
Colab-specific issues
Session disconnects
GPU quota exceeded
Files disappear
Development issues
Import errors
Git issues
Debugging techniques
Enable debug logging
Check GPU utilization
Verify data loading
Test database connection
Check model initialization
Getting help
Check logs
Create minimal example
Report issue
Preventive measures
Next steps
Performance tips
Configuration-based optimizations
Batch size tuning
DataLoader workers
Work unit configuration
Worker polling
Monitoring performance
Check GPU utilization
Worker throughput
Database performance
Best practices
Performance targets
Example configurations
Small class (3-5 workers)
Medium class (10-20 workers)
Large class (30+ workers)
Next steps
Contributing
Ways to contribute
Getting started
Fork and clone
Make your changes
Pull request guidelines
Reporting bugs
Development philosophy
Advanced performance optimization ideas
Database optimizations
Add indexes for faster queries
Connection pooling
Regular maintenance
Training optimizations
Mixed precision training
Gradient accumulation
Network optimizations
Gradient compression
Batched uploads
Local weight caching
Monitoring optimizations
Async heartbeats
Reduced logging overhead
Profiling tools
Measure performance
Code profiling
Resource management
CPU affinity
Multiple workers per GPU
Implementation notes
License
Questions?
FAQ
Getting started
What is this project about?
Do I need a GPU?
Which installation path should I choose?
Do I need to download the dataset?
Where do I get database credentials?
Can I use my own database?
Can I join training late?
What if I disconnect?
Training and results
How long does training take?
How do I know it’s working?
What do the loss values mean?
Why are my loss values different from others?
When will I see results?
How do I view generated faces?
Why do images look blurry?
Can I generate my own faces?
How do I save my favorite generated faces?
Can I train my own model?
System and performance
How does coordination work?
What’s the database storing?
What happens if my worker crashes?
Can I run multiple workers?
How is this different from PyTorch DDP?
Why is training slow?
How can I speed it up?
What’s the optimal number of workers?
Does CPU training help?
Troubleshooting questions
Worker says “no work units available”
Getting connection errors
Out of memory errors
Loss is NaN
Learning and contributing
What will I learn?
Do I need to understand all the code?
Can I modify the code?
Where can I learn more about GANs?
Can I contribute improvements?
I found a bug, what do I do?
Can I use this for my research?
Advanced topics
How do I add Hugging Face integration?
Can I use different datasets?
How do I modify the GAN architecture?
Can this scale to 100+ workers?
Still have questions?
Index