Configuration reference
Complete reference for all configuration options in config.yaml.
Configuration file structure
database:
# Database connection settings
training:
# Training hyperparameters
worker:
# Worker behavior settings
model:
# Model architecture settings
data:
# Dataset configuration
huggingface:
# Hugging Face integration (optional)
Database configuration
Connection settings for PostgreSQL database.
database:
host: localhost
port: 54321
database: distributed_gan
user: username
password: password
Options
host (string, required)
Database server address
Can be IP address or hostname
Examples:
localhost, db.example.com, 192.168.1.100
port (integer, default: 54321)
PostgreSQL port number
Standard PostgreSQL port is 5432, but this project uses 54321 by default
database (string, required)
Database name
Must exist before running
user (string, required)
Database username
Needs SELECT, INSERT, UPDATE permissions
password (string, required)
Database password
Keep secure, don’t commit to version control
Training configuration
Hyperparameters for the training process (coordinator only).
training:
latent_dim: 100
image_size: 64
images_per_work_unit: 320
num_workunits_per_update: 3
learning_rate: 0.0002
beta1: 0.5
beta2: 0.999
Options
latent_dim (integer, default: 100)
Dimension of generator input noise vector
Standard DCGAN uses 100
image_size (integer, default: 64)
Output image size (64x64 pixels)
Must match model architecture
images_per_work_unit (integer, default: 320)
Number of images assigned to each work unit
Workers process this many images before uploading gradients
Larger = less database overhead, more work per claim
num_workunits_per_update (integer, default: 3)
Wait for N work unit gradients before aggregating and updating models
Higher = more gradient samples, better convergence, slower updates
Lower = faster iteration, but potentially noisier gradients
Should be set based on your total number of workers
learning_rate (float, default: 0.0002)
Adam optimizer learning rate for both generator and discriminator
Lower = more stable, slower learning
Typical range: 0.0001-0.0004
beta1 (float, default: 0.5)
Adam optimizer beta1 parameter
Controls momentum
0.5 is standard for GAN training
beta2 (float, default: 0.999)
Adam optimizer beta2 parameter
Controls variance
Usually keep at 0.999
Worker configuration
Settings for worker behavior.
worker:
name: YourName
batch_size: 32
poll_interval: 5
heartbeat_interval: 30
work_unit_timeout: 300
Options
name (string, optional)
Your name or identifier shown on the dashboard leaderboard
If not set, uses hostname
Great for classroom competitions!
batch_size (integer, default: 32)
Number of images per training batch
Adjust based on your GPU memory
Colab T4 can typically handle 64
Reduce if you get out-of-memory errors
poll_interval (integer, default: 5)
Seconds between database polls when no work is available
Lower = more responsive, more database load
Typical range: 1-10
heartbeat_interval (integer, default: 30)
Seconds between heartbeat updates
Shows worker is still alive
Typical range: 15-60
work_unit_timeout (integer, default: 300)
Seconds before uncompleted work is reclaimed
Should exceed normal processing time
Typical range: 120-600
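The timeout check can be sketched as a small helper. This is an illustrative stand-in, not the project's actual reclaim logic; the field name `claimed_at` is an assumption about how claim times are stored.

```python
from datetime import datetime, timedelta, timezone

def is_stale(claimed_at, work_unit_timeout, now=None):
    """Return True if a claimed work unit has exceeded its timeout
    and is eligible to be reclaimed by another worker."""
    now = now or datetime.now(timezone.utc)
    return now - claimed_at > timedelta(seconds=work_unit_timeout)

# A unit claimed 400 seconds ago exceeds the default 300-second timeout
claimed = datetime.now(timezone.utc) - timedelta(seconds=400)
print(is_stale(claimed, 300))  # True
```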
Model configuration
Model architecture parameters (set in training section).
training:
latent_dim: 100
image_size: 64
Options
latent_dim (integer, default: 100)
Dimension of random noise vector
Standard DCGAN uses 100
Higher = more capacity, slower
generator_features (integer, default: 64)
Base number of generator feature maps
Determines model size
Higher = more capacity, more memory
discriminator_features (integer, default: 64)
Base number of discriminator feature maps
Usually same as generator
Higher = better discrimination, more memory
Data configuration
Dataset settings.
data:
dataset_path: data/celeba_torchvision/data/img_align_celeba
num_workers_dataloader: 4
Options
dataset_path (string, default: data/celeba_torchvision/data/img_align_celeba)
Path to CelebA dataset images
If the dataset is not found at this path, it will be automatically downloaded from Hugging Face
Can be relative or absolute
num_workers_dataloader (integer, default: 4)
Number of dataloader workers for parallel data loading
Set to 0 to disable multiprocessing
Hugging Face configuration
Integration for dataset downloads and model sharing.
huggingface:
enabled: false
repo_id: gperdrizet/GANNs-with-friends
token: ''
push_interval: 5
Default behavior
The default repo_id points to the project’s public repository. This allows:
Workers/students: Download the CelebA dataset automatically (no token needed)
Demo mode: Download pre-trained models for testing
Running your own training
To run as coordinator with your own team, you need your own Hugging Face repository:
Create a new repo at huggingface.co/new
Get a write token from huggingface.co/settings/tokens
Update your config:
huggingface:
enabled: true
repo_id: YOUR_USERNAME/YOUR_REPO_NAME
token: YOUR_WRITE_TOKEN
push_interval: 5
Options
enabled (boolean, default: false)
Enable pushing checkpoints to Hugging Face
Only needed for coordinators running their own training
Workers don’t need this enabled
repo_id (string, default: gperdrizet/GANNs-with-friends)
Hugging Face repository ID
Format: username/repo-name
Default repo works for downloading data/models (read-only)
token (string, default: '')
Hugging Face access token
Only needed when enabled: true
Get from huggingface.co/settings/tokens
Needs write permissions for your repo
push_interval (integer, default: 5)
Push checkpoint every N iterations
Lower = more frequent updates, more uploads
Typical range: 1-10
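The push cadence amounts to a simple modulo check. A minimal sketch, assuming the coordinator tracks an iteration counter; the function name is illustrative, not the project's API.

```python
def should_push(iteration, push_interval, enabled):
    """Decide whether to push a checkpoint to Hugging Face this iteration."""
    return enabled and iteration % push_interval == 0

# With push_interval=5, pushes happen on iterations 5, 10, 15, ...
pushes = [i for i in range(1, 16) if should_push(i, 5, enabled=True)]
print(pushes)  # [5, 10, 15]
```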
Understanding distributed training tradeoffs
The distributed training system coordinates multiple workers through work units and gradient aggregation. Understanding the tradeoffs helps you configure the system effectively.
Work unit size vs. database overhead
images_per_work_unit controls how many images each work unit contains:
training:
images_per_work_unit: 320 # Each work unit = 320 images
Larger work units (500-1000 images):
Less database overhead (fewer queries per epoch)
Fewer work units to manage
Better for stable, persistent workers
Longer processing time per work unit
Slower feedback if workers disconnect
Smaller work units (100-200 images):
Faster completion times
Better for unstable workers (less wasted work if disconnected)
More granular progress tracking
More database operations
Higher coordination overhead with many workers
Recommendation: Start with 320, increase if you have stable workers or high database latency.
Worker batch size
Workers can tune their own batch_size based on GPU memory:
worker:
batch_size: 64 # Larger for GPUs with more VRAM
This is independent of images_per_work_unit. A work unit with 320 images:
Worker with batch_size=32: processes 10 batches
Worker with batch_size=64: processes 5 batches
Both contribute equally (gradients are averaged).
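The batch-count arithmetic above can be sketched directly; the ceiling handles work unit sizes that do not divide evenly by the batch size (an assumption about how the last partial batch is treated).

```python
import math

def batches_per_work_unit(images_per_work_unit, batch_size):
    """Number of batches a worker runs to cover one work unit;
    the last batch may be partial if the sizes do not divide evenly."""
    return math.ceil(images_per_work_unit / batch_size)

print(batches_per_work_unit(320, 32))  # 10
print(batches_per_work_unit(320, 64))  # 5
```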
Aggregation threshold vs. gradient quality
num_workunits_per_update controls how many work unit gradients are collected before updating the model:
training:
num_workunits_per_update: 5 # Wait for 5 work unit gradients
This is one of the most important parameters for distributed training quality.
Higher values (8-20+ work units):
More gradient samples = better quality, less noisy updates
More robust training (similar to larger batch sizes)
Less risk of mode collapse
Slower iterations (wait for more workers to finish)
More wasted work units if not all are used
Can accumulate stale work units
Lower values (1-3 work units):
Faster iterations (update as soon as possible)
Less wasted computation
Quick feedback during development/testing
Noisier gradients (more variance in updates)
Higher risk of training instability
May not benefit from parallel workers
The stale work unit problem:
When num_workunits_per_update is less than the total workers, some work units will be “left behind” when the coordinator aggregates and moves to the next iteration:
Iteration 1: Create 100 work units
- Wait for 5 to complete (num_workunits_per_update=5)
- Aggregate and move to iteration 2
- 95 work units are now "stale" (cancelled automatically)
The system automatically cancels pending work units when advancing iterations to prevent workers from processing stale data. Workers who claim cancelled work units will skip them and move to the next one.
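The skip-cancelled behavior can be illustrated with an in-memory stand-in for the work unit queue. The real system claims units from PostgreSQL; the status names and dictionary shape here are assumptions for illustration only.

```python
# Hypothetical in-memory queue; the real queue lives in PostgreSQL.
work_units = [
    {"id": 1, "status": "cancelled"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "pending"},
]

def claim_next(units):
    """Skip cancelled units and claim the first one still pending, or return None."""
    for unit in units:
        if unit["status"] == "cancelled":
            continue  # stale: coordinator already advanced to the next iteration
        if unit["status"] == "pending":
            unit["status"] = "claimed"
            return unit
    return None

claimed = claim_next(work_units)
print(claimed["id"])  # 3
```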
Guidelines for setting num_workunits_per_update:
| Class Size | Workers Expected | Recommended Value | Rationale |
|---|---|---|---|
| Small (2-5) | 2-5 | 2-3 | Get updates quickly, most workers contribute |
| Medium (10-20) | 10-15 | 5-8 | Balance quality and speed |
| Large (30+) | 20-30 | 10-20 | Higher quality gradients, can afford to wait |
Testing/Development: Set to 1 for fastest feedback, but expect noisy training.
Production training: Set based on expected worker count and desired gradient quality. A good rule: 30-50% of your typical concurrent workers.
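The 30-50% rule of thumb above can be expressed as a one-line helper; the function name and the 0.4 default are illustrative choices, not part of the project.

```python
def recommended_threshold(concurrent_workers, fraction=0.4):
    """Rule of thumb: num_workunits_per_update at 30-50% of typical
    concurrent workers, never below 1."""
    return max(1, round(concurrent_workers * fraction))

print(recommended_threshold(15))  # 6
print(recommended_threshold(3))   # 1
```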
Update frequency vs. convergence speed
The actual update frequency depends on both parameters:
Images per update = images_per_work_unit × num_workunits_per_update
Example with defaults:
320 × 3 = 960 images per model update
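The arithmetic above extends naturally to updates per epoch. A sketch, assuming CelebA's roughly 202,599 aligned face images as the epoch size:

```python
images_per_work_unit = 320
num_workunits_per_update = 3
epoch_size = 202_599  # approximate CelebA aligned-image count

images_per_update = images_per_work_unit * num_workunits_per_update
updates_per_epoch = epoch_size // images_per_update

print(images_per_update)  # 960
print(updates_per_epoch)  # 211
```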
More frequent updates (fewer images):
Faster iterations through the epoch
More opportunities to correct course
Higher overhead from weight synchronization
Can be noisier
Less frequent updates (more images):
More stable gradient estimates
Less synchronization overhead
Slower to respond to training issues
Similar to traditional large-batch training
Finding the balance:
Start with defaults for your class size
Monitor training metrics (loss, sample quality)
If training is unstable: increase num_workunits_per_update
If training is too slow: decrease num_workunits_per_update or images_per_work_unit
Adjust based on worker reliability and network conditions
Worker coordination patterns
Different parameter combinations create different worker coordination patterns:
Pattern 1: Fast iteration (testing)
training:
images_per_work_unit: 160
num_workunits_per_update: 1
Single worker can drive training
Fast feedback, noisy gradients
Good for debugging, not production
Pattern 2: Balanced (small/medium class)
training:
images_per_work_unit: 320
num_workunits_per_update: 3
3 workers contribute per update
Good balance of quality and speed
Default configuration
Pattern 3: High quality (large class)
training:
images_per_work_unit: 480
num_workunits_per_update: 10
Wait for many gradient samples
Best gradient quality
Slower but more stable training
Pattern 4: Efficient (stable workers)
training:
images_per_work_unit: 640
num_workunits_per_update: 10
Maximize work per database operation
Assumes workers can handle larger units
Good for low-latency networks
Monitoring and adjustment
Watch these metrics to tune your configuration:
Work unit completion rate: if workers finish faster than the coordinator aggregates, you may want a higher num_workunits_per_update
Cancelled work units: a high cancellation rate means too many work units are being created, or num_workunits_per_update is too low
Worker idle time: if workers often wait for new work units, reduce images_per_work_unit or num_workunits_per_update
Training stability: if the loss oscillates wildly, increase num_workunits_per_update for better gradient quality
Sample quality: if samples don't improve, try different aggregation thresholds
Example configurations
Small class (2-5 students)
training:
images_per_work_unit: 320
num_workunits_per_update: 2
worker:
batch_size: 64
Large class (10+ students)
training:
images_per_work_unit: 480
num_workunits_per_update: 5
worker:
batch_size: 32
CPU-only mode
worker:
batch_size: 16
Quick testing
training:
images_per_work_unit: 160
num_workunits_per_update: 1
worker:
batch_size: 16
High quality (long training)
training:
images_per_work_unit: 640
num_workunits_per_update: 10
learning_rate: 0.0001
worker:
batch_size: 64
Environment variables
Some settings can be overridden with environment variables:
# Database password (more secure than config file)
export DB_PASSWORD=secret
# Hugging Face token
export HF_TOKEN=hf_...
# Override config file
export CONFIG_PATH=/path/to/custom/config.yaml
Security best practices
Don’t commit secrets:
# Add to .gitignore
config.yaml
.env
Use environment variables:
database:
password: ${DB_PASSWORD}
huggingface:
token: ${HF_TOKEN}
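Note that YAML itself does not expand ${VAR} placeholders; the config loader has to substitute them. A minimal sketch of one way to do that, assuming the project performs a substitution pass like this:

```python
import os
import re

def expand_env(value):
    """Replace ${VAR} placeholders in a config string with environment
    values; unset variables expand to an empty string here (a design choice
    for this sketch; a real loader might raise instead)."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["DB_PASSWORD"] = "secret"
print(expand_env("${DB_PASSWORD}"))  # secret
```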
Create template:
# Provide template without secrets
cp config.yaml config.yaml.template
# Remove sensitive data from template
sed -i 's/password: .*/password: YOUR_PASSWORD/' config.yaml.template
Validation
Validate your config:
python -c "from src.utils import load_config; load_config('config.yaml'); print('Config OK')"
Next steps
Quick start - Use your configuration
Troubleshooting - Fix config issues
Performance tuning - Optimize settings