Instructor guide
Guide for instructors coordinating distributed GAN training.
Your role
As the instructor/coordinator, you:
Set up and manage the PostgreSQL database
Run the main coordinator process
Monitor worker participation
Track training progress
Share results with students
Pre-training setup
1. Deploy database
Choose a cloud database provider:
Recommended options:
ElephantSQL - Free tier available, easy setup
AWS RDS - Reliable, scalable
Google Cloud SQL - Good integration with Colab
Azure Database for PostgreSQL - Enterprise features
Self-hosted - Full control, requires server
Database requirements:
PostgreSQL 12 or later
Publicly accessible
At least 1GB storage
Support for binary (BYTEA) storage
2. Initialize database
# Create database
createdb distributed_gan
# Initialize schema
python src/database/init_db.py --config config.yaml
This creates tables:
training_state - Current iteration, epoch, weights
work_units - Individual batch assignments
workers - Worker registration and stats
gradients - Uploaded gradient arrays
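For reference, here is a schema sketch consistent with the queries used later in this guide. It is only an illustration - the authoritative schema is whatever src/database/init_db.py creates, and the column types below are assumptions:

```sql
-- Illustrative only; see src/database/init_db.py for the real schema.
CREATE TABLE training_state (
    current_iteration  INTEGER,
    current_epoch      INTEGER,
    generator_loss     REAL,
    discriminator_loss REAL
);

CREATE TABLE workers (
    worker_id        TEXT PRIMARY KEY,
    gpu_name         TEXT,
    total_work_units INTEGER,
    last_heartbeat   TIMESTAMPTZ
);

CREATE TABLE work_units (
    id              SERIAL PRIMARY KEY,
    iteration       INTEGER,
    status          TEXT,          -- 'pending' | 'in_progress' | 'completed'
    claimed_by      TEXT,
    claimed_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    processing_time REAL,
    worker_id       TEXT REFERENCES workers(worker_id)
);

CREATE TABLE gradients (
    work_unit_id INTEGER REFERENCES work_units(id),
    data         BYTEA          -- serialized gradient arrays
);
```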
3. Create student accounts
For security, create individual accounts:
-- Create user
CREATE USER student1 WITH PASSWORD 'secure_password';
-- Grant permissions
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO student1;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO student1;
Distribute credentials securely (email, LMS, etc.).
4. Configure coordinator
Edit config.yaml:
database:
host: your-database.provider.com
port: 5432
database: distributed_gan
user: coordinator # Your admin account
password: your_secure_password
training:
images_per_work_unit: 320
num_workunits_per_update: 3 # Wait for N work units before updating
huggingface: # Optional but recommended for your own training runs
enabled: true
repo_id: your-username/your-repo # Create at huggingface.co/new
token: your_hf_write_token # From huggingface.co/settings/tokens
push_interval: 5
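To estimate how many work units one epoch produces, divide the dataset size by images_per_work_unit. A quick sketch (the dataset size here is a made-up example, not your actual dataset):

```python
import math

images_per_work_unit = 320   # from config.yaml above
dataset_size = 100_000       # hypothetical; depends on your dataset

# One work unit covers images_per_work_unit images, so one epoch needs:
work_units_per_epoch = math.ceil(dataset_size / images_per_work_unit)
print(work_units_per_epoch)  # 313 with these numbers
```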
Running training
Start coordinator
python src/main.py --epochs 50 --sample-interval 1
The coordinator will:
Initialize model weights
Create work units for epoch 1
Wait for workers to claim and complete units
Aggregate gradients when enough workers finish
Update models and create next batch of work
Generate sample images periodically
Push to Hugging Face (if enabled)
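The aggregation step above amounts to an element-wise mean over the per-worker gradients before a single optimizer update. A minimal sketch (the real coordinator works on serialized tensor arrays, not plain lists):

```python
# Sketch of gradient aggregation: average gradients from N completed
# work units before applying one optimizer update.
def aggregate_gradients(gradient_lists):
    """Element-wise mean over per-worker gradients (lists of floats)."""
    n = len(gradient_lists)
    return [sum(vals) / n for vals in zip(*gradient_lists)]

# e.g. two workers' gradients for a 3-parameter model:
# aggregate_gradients([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]) -> [2.0, 3.0, 4.0]
```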
Monitor progress
Console output:
Initializing coordinator...
Database initialized
Created 1,582 work units for epoch 1
Waiting for workers... (0/3 completed)
Worker abc123 completed work unit 1
Worker def456 completed work unit 2
Worker ghi789 completed work unit 3
Aggregating gradients from 3 workers...
Applied gradient update
Generator loss: 2.345
Discriminator loss: 1.234
Generated samples saved to data/outputs/samples/iteration_0001.png
Pushed checkpoint to Hugging Face
Creating work units for iteration 2...
Database queries:
-- Check active workers
SELECT worker_id, gpu_name, total_work_units, last_heartbeat
FROM workers
WHERE last_heartbeat > NOW() - INTERVAL '2 minutes'
ORDER BY total_work_units DESC;
-- Training progress
SELECT current_iteration, current_epoch,
generator_loss, discriminator_loss
FROM training_state;
-- Work unit status
SELECT status, COUNT(*)
FROM work_units
WHERE iteration = (SELECT current_iteration FROM training_state)
GROUP BY status;
Managing the training session
Pause training
Press Ctrl+C; the coordinator stops gracefully after the current iteration.
Resume training
Just restart:
python src/main.py --resume
Loads latest state from database and continues.
Adjust parameters mid-training
Edit config.yaml and restart coordinator. Changes take effect:
batch_size - affects new work units
num_workunits_per_update - how many work units to wait for
sample_interval - frequency of sample generation
Handle stalled workers
Workers that crash leave work units stuck in the 'in_progress' state. The coordinator automatically reclaims such units after a timeout (default: 5 minutes).
Manual reclaim:
UPDATE work_units
SET status = 'pending', claimed_by = NULL
WHERE status = 'in_progress'
AND claimed_at < NOW() - INTERVAL '10 minutes';
Monitoring tools
Real-time visualization
Create a simple dashboard:
# monitor.py - poll the coordinator database and print progress
import time
import psycopg2

conn = psycopg2.connect(
    host="your-database.provider.com", dbname="distributed_gan",
    user="coordinator", password="your_secure_password")

while True:
    with conn.cursor() as cur:
        cur.execute("SELECT current_iteration FROM training_state")
        iteration = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FROM workers "
                    "WHERE last_heartbeat > NOW() - INTERVAL '2 minutes'")
        active_workers = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FILTER (WHERE status = 'completed'), COUNT(*) "
                    "FROM work_units WHERE iteration = %s", (iteration,))
        completed, total = cur.fetchone()
    print(f"Iteration: {iteration}")
    print(f"Active workers: {active_workers}")
    print(f"Work units completed: {completed}/{total}")
    time.sleep(10)
Generated samples
Check data/outputs/samples/ for periodic image samples. Share with class to show progress.
Hugging Face integration
If enabled, students can:
View latest model on Hugging Face
Download and generate faces
See training progress in real-time
Best practices
Before class
Test full workflow yourself
Verify database is accessible from various networks
Prepare student credentials in advance
Set up Hugging Face repo (optional)
Create demo notebook showing results
During class
Start coordinator before students join
Monitor database for first few students
Be available for setup troubleshooting
Share generated samples periodically
Track who’s participating
After class
Save final checkpoint
Export worker statistics
Create visualizations of results
Share model on Hugging Face
Get student feedback
Troubleshooting
No workers connecting
Verify database is publicly accessible
Check firewall rules
Test with your own worker
Verify student credentials
Training very slow
Need more workers (encourage participation)
Increase num_workunits_per_update to wait for more work unit gradients
Check database performance
Verify the network isn't a bottleneck
Unstable training
GAN training is inherently unstable
Try lowering learning rates
Check for bad gradients from workers
May need to restart and adjust hyperparameters
Database full
Gradients table grows large
Add cleanup: delete old gradients after aggregation
Increase database storage
Archive old iterations
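A cleanup along these lines could run after each aggregation. The column names work_unit_id and id are assumptions about the schema, so check them against what src/database/init_db.py actually creates:

```sql
-- Hypothetical cleanup: discard gradients from iterations that have
-- already been aggregated into the model.
DELETE FROM gradients
WHERE work_unit_id IN (
    SELECT id FROM work_units
    WHERE iteration < (SELECT current_iteration FROM training_state)
);
```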
Grading and assessment
Track individual contributions
SELECT worker_id,
COUNT(*) as work_units,
SUM(processing_time) as total_time,
MIN(completed_at) as first_contribution,
MAX(completed_at) as last_contribution
FROM work_units
WHERE status = 'completed'
GROUP BY worker_id
ORDER BY work_units DESC;
Metrics to consider
Number of work units completed
Total processing time contributed
Consistency (spread over time vs burst)
Quality (check for errors)
Export results
# Save statistics
psql -h HOST -U USER -d DATABASE -c \
"COPY (SELECT * FROM workers) TO STDOUT CSV HEADER" \
> worker_stats.csv
Advanced topics
Multiple coordinator instances
For very large classes, run multiple coordinators:
Each handles different epoch ranges
Coordinate via database flags
Requires careful synchronization
Custom work unit creation
Modify work unit generation:
Stratified sampling
Hard example mining
Progressive difficulty
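As an example of the first idea, a stratified sampler draws each work unit evenly from every class or attribute bucket of the dataset. This is a hypothetical sketch (the function name and bucket layout are illustrative, not part of the codebase):

```python
import random

def stratified_work_unit(buckets, unit_size, rng=random):
    """Draw a work unit evenly across buckets.

    buckets: dict mapping label -> list of sample indices.
    Returns unit_size // len(buckets) indices per bucket.
    """
    per_bucket = unit_size // len(buckets)
    unit = []
    for indices in buckets.values():
        unit.extend(rng.sample(indices, per_bucket))
    return unit
```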
Gradient verification
Add checks to detect malicious/broken workers:
Gradient magnitude limits
Statistical outlier detection
Comparison across workers
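These checks can be as simple as the following sketch (pure Python; the thresholds and function names are illustrative, not the coordinator's actual API):

```python
import math

def gradient_ok(grad, max_norm=1e3):
    """Reject gradients containing NaN/Inf or with an implausibly large L2 norm."""
    sq = 0.0
    for g in grad:
        if not math.isfinite(g):
            return False
        sq += g * g
    return math.sqrt(sq) <= max_norm

def outlier_workers(norms, factor=5.0):
    """Flag workers whose gradient norm exceeds factor x the batch median."""
    med = sorted(norms)[len(norms) // 2]
    return [i for i, n in enumerate(norms) if n > factor * med]
```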
Learning objectives
This project teaches students:
Technical skills:
Distributed system architecture
Database-coordinated computing
GAN training dynamics
Python development
Soft skills:
Collaboration at scale
Troubleshooting
Reading documentation
Contributing to shared goals
Next steps
Configuration reference - All config options
Monitoring guide - Advanced monitoring
Architecture overview - How coordinator works