Instructor guide

Guide for instructors coordinating distributed GAN training.

Your role

As the instructor/coordinator, you:

  • Set up and manage the PostgreSQL database

  • Run the main coordinator process

  • Monitor worker participation

  • Track training progress

  • Share results with students

Pre-training setup

1. Deploy database

Choose a cloud database provider:

Recommended options:

  • Neon or Supabase - Free tier available, easy setup

  • AWS RDS - Reliable, scalable

  • Google Cloud SQL - Good integration with Colab

  • Azure Database for PostgreSQL - Enterprise features

  • Self-hosted - Full control, requires server

Database requirements:

  • PostgreSQL 12 or later

  • Publicly accessible

  • At least 1GB storage

  • Support for binary data (BYTEA) for storing gradient arrays

2. Initialize database

# Create database
createdb distributed_gan

# Initialize schema
python src/database/init_db.py --config config.yaml

This creates the following tables (a rough schema sketch follows the list):

  • training_state - Current iteration, epoch, weights

  • work_units - Individual batch assignments

  • workers - Worker registration and stats

  • gradients - Uploaded gradient arrays
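
The exact DDL lives in src/database/init_db.py. As a rough reference, the two most-queried tables might look like the sketch below. The column names are inferred from the monitoring queries later in this guide, so treat this as illustrative, not authoritative:

# schema_sketch.py - illustrative only; the real DDL is in src/database/init_db.py.
# Column names are inferred from the monitoring queries later in this guide.
SKETCH = """
CREATE TABLE training_state (
    current_iteration  INTEGER,
    current_epoch      INTEGER,
    generator_loss     REAL,
    discriminator_loss REAL
);
CREATE TABLE work_units (
    id              SERIAL PRIMARY KEY,
    iteration       INTEGER,
    status          TEXT,          -- 'pending', 'in_progress', 'completed'
    claimed_by      TEXT,          -- may be the same column as worker_id in practice
    claimed_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    processing_time REAL,
    worker_id       TEXT
);
"""
print(SKETCH)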

3. Create student accounts

For security, create individual accounts:

-- Create user
CREATE USER student1 WITH PASSWORD 'secure_password';

-- Grant permissions
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO student1;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO student1;

Distribute credentials securely (email, LMS, etc.).
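
For a full class it is quicker to generate those statements from a roster. A sketch (the roster size, password scheme, and output filename are all illustrative):

# gen_accounts.py - emit CREATE USER / GRANT statements for a class roster.
import secrets

students = [f"student{i}" for i in range(1, 31)]  # illustrative roster

with open("create_accounts.sql", "w") as f:
    for name in students:
        password = secrets.token_urlsafe(12)
        f.write(f"CREATE USER {name} WITH PASSWORD '{password}';\n")
        f.write(f"GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO {name};\n")
        f.write(f"GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO {name};\n")
        print(name, password)  # keep this output somewhere safe for distribution

Run the generated file against your database with psql -h HOST -U coordinator -d distributed_gan -f create_accounts.sql.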

4. Configure coordinator

Edit config.yaml:

database:
  host: your-database.provider.com
  port: 5432
  database: distributed_gan
  user: coordinator  # Your admin account
  password: your_secure_password

training:
  images_per_work_unit: 320 
  num_workunits_per_update: 3  # Wait for N work units before updating

huggingface:  # Optional but recommended for your own training runs
  enabled: true
  repo_id: your-username/your-repo  # Create at huggingface.co/new
  token: your_hf_write_token        # From huggingface.co/settings/tokens
  push_interval: 5
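
Before class, it is worth checking that the file parses and the database accepts connections from outside. A minimal sanity check, assuming the coordinator reads config.yaml with PyYAML:

# check_config.py - verify config.yaml parses and the database is reachable.
import psycopg2
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

db = cfg["database"]
conn = psycopg2.connect(host=db["host"], port=db["port"], dbname=db["database"],
                        user=db["user"], password=db["password"], connect_timeout=5)
print("Connected to", db["host"])
conn.close()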

Running training

Start coordinator

python src/main.py --epochs 50 --sample-interval 1

The coordinator will:

  • Initialize model weights

  • Create work units for epoch 1

  • Wait for workers to claim and complete units

  • Aggregate gradients when enough workers finish

  • Update models and create next batch of work

  • Generate sample images periodically

  • Push to Hugging Face (if enabled)
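
In outline, that loop might look like the sketch below. The function names are placeholders for illustration, not the actual API of src/main.py:

import time

def run_coordinator(db, models, config, poll_interval=5):
    # Conceptual sketch; every `db` and `models` method here is a placeholder.
    state = db.load_or_init_training_state()              # initialize model weights
    while state.epoch <= config.epochs:
        db.create_work_units(state.iteration)             # one unit per image batch
        while db.count_completed(state.iteration) < config.num_workunits_per_update:
            db.reclaim_stalled_units(timeout_minutes=5)   # see "Handle stalled workers"
            time.sleep(poll_interval)
        gradients = db.fetch_gradients(state.iteration)
        models.apply_averaged_gradients(gradients)        # aggregate, then step G and D
        if state.iteration % config.sample_interval == 0:
            models.save_samples(state.iteration)          # data/outputs/samples/
        state = db.advance(state)                         # next iteration / epoch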

Monitor progress

Console output:

Initializing coordinator...
Database initialized
Created 1,582 work units for epoch 1

Waiting for workers... (0/3 completed)
Worker abc123 completed work unit 1
Worker def456 completed work unit 2
Worker ghi789 completed work unit 3

Aggregating gradients from 3 workers...
Applied gradient update
Generator loss: 2.345
Discriminator loss: 1.234

Generated samples saved to data/outputs/samples/iteration_0001.png
Pushed checkpoint to Hugging Face

Creating work units for iteration 2...

Database queries:

-- Check active workers
SELECT worker_id, gpu_name, total_work_units, last_heartbeat
FROM workers
WHERE last_heartbeat > NOW() - INTERVAL '2 minutes'
ORDER BY total_work_units DESC;

-- Training progress
SELECT current_iteration, current_epoch, 
       generator_loss, discriminator_loss
FROM training_state;

-- Work unit status
SELECT status, COUNT(*)
FROM work_units
WHERE iteration = (SELECT current_iteration FROM training_state)
GROUP BY status;

Managing the training session

Pause training

Press Ctrl+C; the coordinator stops gracefully after the current iteration.

Resume training

Just restart:

python src/main.py --resume

The coordinator loads the latest state from the database and continues where it left off.

Adjust parameters mid-training

Edit config.yaml and restart the coordinator. Changes take effect as follows:

  • batch_size - applies to newly created work units

  • num_workunits_per_update - how many work unit gradients to wait for before each update

  • sample_interval - how often sample images are generated

Handle stalled workers

Workers that crash leave work units stuck as 'in_progress'. The coordinator automatically reclaims such units after a timeout (default 5 minutes).

Manual reclaim:

UPDATE work_units
SET status = 'pending', claimed_by = NULL
WHERE status = 'in_progress' 
  AND claimed_at < NOW() - INTERVAL '10 minutes';

Monitoring tools

Real-time visualization

Create a simple dashboard:

# monitor.py - minimal console dashboard (fill in your real connection details)
import time
import psycopg2

conn = psycopg2.connect(host="your-database.provider.com", port=5432, dbname="distributed_gan",
                        user="coordinator", password="your_secure_password")
conn.autocommit = True  # read-only polling; avoid holding a transaction open

def get_training_stats():
    with conn.cursor() as cur:
        cur.execute("SELECT current_iteration FROM training_state")
        iteration = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FROM workers WHERE last_heartbeat > NOW() - INTERVAL '2 minutes'")
        active = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FILTER (WHERE status = 'completed'), COUNT(*) "
                    "FROM work_units WHERE iteration = %s", (iteration,))
        completed, total = cur.fetchone()
    return {"iteration": iteration, "active_workers": active, "completed": completed, "total": total}

while True:
    stats = get_training_stats()
    print(f"Iteration: {stats['iteration']}")
    print(f"Active workers: {stats['active_workers']}")
    print(f"Work units completed: {stats['completed']}/{stats['total']}")
    time.sleep(10)

Generated samples

Check data/outputs/samples/ for periodic image samples. Share with class to show progress.
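
One easy way to share progress is to stitch those periodic samples into an animation. A sketch using Pillow (the filename pattern matches the sample output shown earlier):

# make_progress_gif.py - combine sample grids into one animated GIF.
from pathlib import Path
from PIL import Image

paths = sorted(Path("data/outputs/samples").glob("iteration_*.png"))
frames = [Image.open(p) for p in paths]
frames[0].save("training_progress.gif", save_all=True,
               append_images=frames[1:], duration=500, loop=0)
print(f"Wrote training_progress.gif ({len(frames)} frames)")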

Hugging Face integration

If enabled, students can:

  • View latest model on Hugging Face

  • Download and generate faces

  • See training progress in real-time
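
For example, a student can pull the latest checkpoint with huggingface_hub. The filename here is a guess; check the repo for the actual artifact names the coordinator pushes:

# The filename "generator.pt" is an assumption; list the repo files to confirm.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="your-username/your-repo", filename="generator.pt")
print("Checkpoint downloaded to", path)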

Best practices

Before class

  • Test full workflow yourself

  • Verify database is accessible from various networks

  • Prepare student credentials in advance

  • Set up Hugging Face repo (optional)

  • Create demo notebook showing results

During class

  • Start coordinator before students join

  • Monitor database for first few students

  • Be available for setup troubleshooting

  • Share generated samples periodically

  • Track who’s participating

After class

  • Save final checkpoint

  • Export worker statistics

  • Create visualizations of results

  • Share model on Hugging Face

  • Get student feedback

Troubleshooting

No workers connecting

  • Verify database is publicly accessible

  • Check firewall rules

  • Test with your own worker

  • Verify student credentials

Training very slow

  • Need more workers (encourage participation)

  • Increase num_workunits_per_update to wait for more work unit gradients

  • Check database performance

  • Verify network isn’t bottleneck

Unstable training

  • GAN training is inherently unstable

  • Try lowering learning rates

  • Check for bad gradients from workers

  • May need to restart and adjust hyperparameters

Database full

  • Gradients table grows large

  • Add cleanup: delete old gradients after aggregation (sketched below)

  • Increase database storage

  • Archive old iterations
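
A sketch of the cleanup option above, assuming the gradients table keeps an iteration column (verify against the real schema first):

# cleanup_gradients.py - drop gradient rows from iterations that are already
# aggregated. Assumes gradients has an `iteration` column; adjust if it differs.
import psycopg2

with psycopg2.connect("dbname=distributed_gan") as conn:
    with conn.cursor() as cur:
        cur.execute("""DELETE FROM gradients
                       WHERE iteration < (SELECT current_iteration FROM training_state)""")
        print(f"Deleted {cur.rowcount} old gradient rows")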

Grading and assessment

Track individual contributions

SELECT worker_id, 
       COUNT(*) as work_units,
       SUM(processing_time) as total_time,
       MIN(completed_at) as first_contribution,
       MAX(completed_at) as last_contribution
FROM work_units
WHERE status = 'completed'
GROUP BY worker_id
ORDER BY work_units DESC;

Metrics to consider

  • Number of work units completed

  • Total processing time contributed

  • Consistency (steady contributions over time vs. a single burst)

  • Quality (check for errors)
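
These can be folded into a single participation score. A sketch with arbitrary weights and caps (adjust them to your grading policy):

# participation_score.py - illustrative scoring; weights and caps are arbitrary.
def participation_score(work_units: int, total_time_s: float, hours_active: float) -> float:
    volume = min(work_units / 50, 1.0)        # full credit at 50 work units
    effort = min(total_time_s / 3600, 1.0)    # full credit at one GPU-hour
    consistency = min(hours_active / 2, 1.0)  # full credit if spread over 2+ hours
    return round(100 * (0.5 * volume + 0.3 * effort + 0.2 * consistency), 1)

# hours_active can be derived from first_contribution/last_contribution above
print(participation_score(work_units=42, total_time_s=2500, hours_active=1.5))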

Export results

# Save statistics
psql -h HOST -U USER -d DATABASE -c \
  "COPY (SELECT * FROM workers) TO STDOUT CSV HEADER" \
  > worker_stats.csv

Advanced topics

Multiple coordinator instances

For very large classes, run multiple coordinators:

  • Each handles different epoch ranges

  • Coordinate via database flags (one approach is sketched below)

  • Requires careful synchronization
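
One way to implement those database flags is with PostgreSQL advisory locks, so two coordinators can never claim the same epoch range. A sketch (the lock-key convention is an arbitrary choice):

# Each coordinator tries to lock the integer key for its epoch range.
import psycopg2

conn = psycopg2.connect("dbname=distributed_gan")
cur = conn.cursor()
EPOCH_RANGE_KEY = 1  # e.g. key 1 = epochs 1-25, key 2 = epochs 26-50

cur.execute("SELECT pg_try_advisory_lock(%s)", (EPOCH_RANGE_KEY,))
if cur.fetchone()[0]:
    print("Acquired epoch range", EPOCH_RANGE_KEY)
    # ... run this coordinator's range, then release:
    cur.execute("SELECT pg_advisory_unlock(%s)", (EPOCH_RANGE_KEY,))
else:
    print("Another coordinator already owns this range")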

Custom work unit creation

Modify work unit generation:

  • Stratified sampling

  • Hard example mining

  • Progressive difficulty

Gradient verification

Add checks to detect malicious or broken workers (the first two are sketched after this list):

  • Gradient magnitude limits

  • Statistical outlier detection

  • Comparison across workers
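
The first two checks listed above can be as simple as the sketch below; the thresholds are illustrative and would need tuning against real gradient statistics:

# gradient_checks.py - basic sanity checks before accepting a worker's gradient.
import numpy as np

def is_suspicious(grad, peer_norms, max_norm=1e3, z_cutoff=3.0):
    """Flag non-finite values, huge magnitudes, or outliers vs. other workers."""
    if not np.isfinite(grad).all():                 # NaN/Inf from a broken worker
        return True
    norm = float(np.linalg.norm(grad))
    if norm > max_norm:                             # gradient magnitude limit
        return True
    if len(peer_norms) >= 3:                        # statistical outlier detection
        mean, std = np.mean(peer_norms), np.std(peer_norms)
        if std > 0 and abs(norm - mean) / std > z_cutoff:
            return True
    return False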

Learning objectives

This project teaches students:

Technical skills:

  • Distributed system architecture

  • Database-coordinated computing

  • GAN training dynamics

  • Python development

Soft skills:

  • Collaboration at scale

  • Troubleshooting

  • Reading documentation

  • Contributing to shared goals

Next steps