Configuration reference

Complete reference for all configuration options in config.yaml.

Configuration file structure

database:
  # Database connection settings
  
training:
  # Training hyperparameters
  
worker:
  # Worker behavior settings
  
model:
  # Model architecture settings
  
data:
  # Dataset configuration
  
huggingface:
  # Hugging Face integration (optional)

Database configuration

Connection settings for PostgreSQL database.

database:
  host: localhost
  port: 54321
  database: distributed_gan
  user: username
  password: password

Options

host (string, required)

  • Database server address

  • Can be IP address or hostname

  • Examples: localhost, db.example.com, 192.168.1.100

port (integer, default: 54321)

  • PostgreSQL port number

  • Standard PostgreSQL port is 5432, but this project uses 54321 by default

database (string, required)

  • Database name

  • Must exist before running

user (string, required)

  • Database username

  • Needs SELECT, INSERT, UPDATE permissions

password (string, required)

  • Database password

  • Keep secure, don’t commit to version control
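For reference, these settings map directly onto a standard libpq-style connection string (note that libpq calls the database name `dbname`). `make_dsn` is a hypothetical helper for illustration, not part of the project's code:

```python
def make_dsn(cfg: dict) -> str:
    # Map the config.yaml keys onto libpq keyword/value pairs;
    # libpq calls the database name 'dbname', not 'database'.
    parts = [
        f"host={cfg['host']}",
        f"port={cfg['port']}",
        f"dbname={cfg['database']}",
        f"user={cfg['user']}",
        f"password={cfg['password']}",
    ]
    return " ".join(parts)

dsn = make_dsn({
    "host": "localhost",
    "port": 54321,
    "database": "distributed_gan",
    "user": "username",
    "password": "password",
})
```

Most PostgreSQL drivers (psycopg2 included) accept this keyword/value form directly.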

Training configuration

Hyperparameters for the training process (coordinator only).

training:
  latent_dim: 100
  image_size: 64
  images_per_work_unit: 320
  num_workunits_per_update: 3
  learning_rate: 0.0002
  beta1: 0.5
  beta2: 0.999

Options

latent_dim (integer, default: 100)

  • Dimension of generator input noise vector

  • Standard DCGAN uses 100

image_size (integer, default: 64)

  • Output image size (64x64 pixels)

  • Must match model architecture

images_per_work_unit (integer, default: 320)

  • Number of images assigned to each work unit

  • Workers process this many images before uploading gradients

  • Larger = less database overhead, more work per claim

num_workunits_per_update (integer, default: 3)

  • Wait for N work unit gradients before aggregating and updating models

  • Higher = more gradient samples, better convergence, slower updates

  • Lower = faster iteration, but potentially noisier gradients

  • Should be set based on your total number of workers

learning_rate (float, default: 0.0002)

  • Adam optimizer learning rate for both generator and discriminator

  • Lower = more stable, slower learning

  • Typical range: 0.0001-0.0004

beta1 (float, default: 0.5)

  • Adam optimizer beta1 parameter

  • Controls momentum

  • 0.5 is standard for GAN training

beta2 (float, default: 0.999)

  • Adam optimizer beta2 parameter

  • Controls variance

  • Usually keep at 0.999
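The roles of beta1 and beta2 are easiest to see in a single Adam update. This is a pure-Python illustration of the textbook Adam rule with the defaults above, not the project's optimizer (training code would normally use torch.optim.Adam):

```python
import math

def adam_step(grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (momentum, beta1)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (beta2)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v

# On the first step, bias correction makes the update magnitude ~= lr
update, m, v = adam_step(grad=1.0, m=0.0, v=0.0, t=1)
```

Lowering beta1 to 0.5 (from the usual 0.9) shortens the momentum memory, which empirically stabilizes GAN training.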

Worker configuration

Settings for worker behavior.

worker:
  name: YourName
  batch_size: 32
  poll_interval: 5
  heartbeat_interval: 30
  work_unit_timeout: 300

Options

name (string, optional)

  • Your name or identifier shown on the dashboard leaderboard

  • If not set, uses hostname

  • Great for classroom competitions!

batch_size (integer, default: 32)

  • Number of images per training batch

  • Adjust based on your GPU memory

  • Colab T4 can typically handle 64

  • Reduce if you get out-of-memory errors

poll_interval (integer, default: 5)

  • Seconds between database polls when no work is available

  • Lower = more responsive, more database load

  • Typical range: 1-10

heartbeat_interval (integer, default: 30)

  • Seconds between heartbeat updates

  • Shows worker is still alive

  • Typical range: 15-60

work_unit_timeout (integer, default: 300)

  • Seconds before uncompleted work is reclaimed

  • Should exceed normal processing time

  • Typical range: 120-600
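The timeout semantics can be summarized in one predicate. `should_reclaim` is a hypothetical helper illustrating the behavior described above, not the project's actual reclaim code:

```python
def should_reclaim(claimed_at: float, now: float,
                   work_unit_timeout: int = 300) -> bool:
    """A claimed-but-incomplete work unit becomes eligible for reclaim
    once work_unit_timeout seconds have elapsed since it was claimed."""
    return (now - claimed_at) > work_unit_timeout
```

If workers routinely take longer than the timeout, their work gets reclaimed and duplicated, so set it comfortably above normal processing time.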

Model configuration

Model architecture parameters (set in training section).

training:
  latent_dim: 100
  image_size: 64
  generator_features: 64
  discriminator_features: 64

Options

latent_dim (integer, default: 100)

  • Dimension of random noise vector

  • Standard DCGAN uses 100

  • Higher = more capacity, slower

generator_features (integer, default: 64)

  • Base number of generator feature maps

  • Determines model size

  • Higher = more capacity, more memory

discriminator_features (integer, default: 64)

  • Base number of discriminator feature maps

  • Usually same as generator

  • Higher = better discrimination, more memory
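As a rough illustration of how the base feature count sets layer widths, here is the standard DCGAN generator channel progression: the deepest layer is widest, halving at each upsampling stage from a 4x4 seed up to the RGB output. This assumes the standard DCGAN layout; the project's actual architecture may differ:

```python
import math

def generator_channels(generator_features: int = 64,
                       image_size: int = 64) -> list:
    """Channel widths per generator stage for a standard DCGAN."""
    # Number of upsampling stages from a 4x4 seed to the final image size
    n_up = int(math.log2(image_size // 4))  # image_size=64 -> 4 stages
    # Width halves at each upsampling stage
    widths = [generator_features * 2 ** (n_up - 1 - i) for i in range(n_up)]
    return widths + [3]  # final 3-channel RGB output

channels = generator_channels()
```

With the defaults this gives widths of 512, 256, 128, and 64 before the RGB layer, so doubling generator_features roughly quadruples the convolutional parameter count.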

Data configuration

Dataset settings.

data:
  dataset_path: data/celeba_torchvision/data/img_align_celeba
  num_workers_dataloader: 4

Options

dataset_path (string, default: data/celeba_torchvision/data/img_align_celeba)

  • Path to CelebA dataset images

  • If the dataset is not found at this path, it will be automatically downloaded from Hugging Face

  • Can be relative or absolute

num_workers_dataloader (integer, default: 4)

  • Number of dataloader workers for parallel data loading

  • Set to 0 to disable multiprocessing
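The fallback-download behavior can be modeled as a simple directory check. `needs_download` is a hypothetical helper sketching the decision, not the project's API:

```python
import os

def needs_download(dataset_path: str) -> bool:
    """True when the dataset directory is missing or empty, i.e. the
    automatic Hugging Face download would be triggered."""
    return not os.path.isdir(dataset_path) or not os.listdir(dataset_path)
```

An empty directory is treated the same as a missing one, so a failed or interrupted download is retried rather than silently accepted.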

Hugging Face configuration

Integration for dataset downloads and model sharing.

huggingface:
  enabled: false
  repo_id: gperdrizet/GANNs-with-friends
  token: ''
  push_interval: 5

Default behavior

The default repo_id points to the project’s public repository. This allows:

  • Workers/students: Download the CelebA dataset automatically (no token needed)

  • Demo mode: Download pre-trained models for testing

Running your own training

To run as coordinator with your own team, you need your own Hugging Face repository:

  1. Create a new repo at huggingface.co/new

  2. Get a write token from huggingface.co/settings/tokens

  3. Update your config:

huggingface:
  enabled: true
  repo_id: YOUR_USERNAME/YOUR_REPO_NAME
  token: YOUR_WRITE_TOKEN
  push_interval: 5

Options

enabled (boolean, default: false)

  • Enable pushing checkpoints to Hugging Face

  • Only needed for coordinators running their own training

  • Workers don’t need this enabled

repo_id (string, default: gperdrizet/GANNs-with-friends)

  • Hugging Face repository ID

  • Format: username/repo-name

  • Default repo works for downloading data/models (read-only)

token (string, default: '')

  • Hugging Face access token

  • Only needed when enabled: true

  • Get from huggingface.co/settings/tokens

  • Needs write permissions for your repo

push_interval (integer, default: 5)

  • Push checkpoint every N iterations

  • Lower = more frequent updates, more uploads

  • Typical range: 1-10
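The push_interval gating can be sketched as follows (illustrative; the coordinator's actual trigger logic may differ):

```python
def should_push(iteration: int, push_interval: int = 5,
                enabled: bool = True) -> bool:
    """Push a checkpoint every push_interval iterations when the
    Hugging Face integration is enabled."""
    return enabled and iteration % push_interval == 0

# With the default interval of 5, iterations 5, 10, 15, ... trigger a push
pushes = [i for i in range(1, 21) if should_push(i)]
```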

Understanding distributed training tradeoffs

The distributed training system coordinates multiple workers through work units and gradient aggregation. Understanding the tradeoffs helps you configure the system effectively.

Work unit size vs. database overhead

images_per_work_unit controls how many images each work unit contains:

training:
  images_per_work_unit: 320  # Each work unit = 320 images

Larger work units (500-1000 images):

  • Less database overhead (fewer queries per epoch)

  • Fewer work units to manage

  • Better for stable, persistent workers

  • Longer processing time per work unit

  • Slower feedback if workers disconnect

Smaller work units (100-200 images):

  • Faster completion times

  • Better for unstable workers (less wasted work if disconnected)

  • More granular progress tracking

  • More database operations

  • Higher coordination overhead with many workers

Recommendation: Start with 320, increase if you have stable workers or high database latency.

Worker batch size

Workers can tune their own batch_size based on GPU memory:

worker:
  batch_size: 64  # Larger for GPUs with more VRAM

This is independent of images_per_work_unit. A work unit with 320 images:

  • Worker with batch_size=32: processes 10 batches

  • Worker with batch_size=64: processes 5 batches

Both contribute equally (gradients are averaged).
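The batch count per work unit is a ceiling division, sketched here for reference:

```python
import math

def batches_per_work_unit(images_per_work_unit: int, batch_size: int) -> int:
    """How many batches a worker runs to cover one work unit, rounding
    up when the unit size is not an exact multiple of the batch size."""
    return math.ceil(images_per_work_unit / batch_size)
```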

Aggregation threshold vs. gradient quality

num_workunits_per_update controls how many work unit gradients are collected before updating the model:

training:
  num_workunits_per_update: 5  # Wait for 5 work unit gradients

This is one of the most important parameters for distributed training quality.

Higher values (8-20+ work units):

  • More gradient samples = better quality, less noisy updates

  • More robust training (similar to larger batch sizes)

  • Less risk of mode collapse

  • Slower iterations (wait for more workers to finish)

  • More wasted work units if not all are used

  • Can accumulate stale work units

Lower values (1-3 work units):

  • Faster iterations (update as soon as possible)

  • Less wasted computation

  • Quick feedback during development/testing

  • Noisier gradients (more variance in updates)

  • Higher risk of training instability

  • May not benefit from parallel workers

The stale work unit problem:

When num_workunits_per_update is smaller than the number of work units created per iteration, the remainder are “left behind” when the coordinator aggregates and moves to the next iteration:

Iteration 1: Create 100 work units
- Wait for 5 to complete (num_workunits_per_update=5)
- Aggregate and move to iteration 2
- 95 work units are now "stale" (cancelled automatically)

The system automatically cancels pending work units when advancing iterations to prevent workers from processing stale data. Workers who claim cancelled work units will skip them and move to the next one.
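The cancellation behavior described above can be modeled in a few lines. This is an illustrative sketch of the state transition, not the project's database logic:

```python
def advance_iteration(work_units: list, new_iteration: int) -> list:
    """When the coordinator advances, pending work units from earlier
    iterations are marked cancelled so workers skip them."""
    for unit in work_units:
        if unit["status"] == "pending" and unit["iteration"] < new_iteration:
            unit["status"] = "cancelled"
    return work_units
```

Completed and in-progress units keep their status; only stale pending units are cancelled.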

Guidelines for setting num_workunits_per_update:

| Class Size     | Workers Expected | Recommended Value | Rationale                                    |
|----------------|------------------|-------------------|----------------------------------------------|
| Small (2-5)    | 2-5              | 2-3               | Get updates quickly, most workers contribute |
| Medium (10-20) | 10-15            | 5-8               | Balance quality and speed                    |
| Large (30+)    | 20-30            | 10-20             | Higher quality gradients, can afford to wait |

Testing/Development: Set to 1 for fastest feedback, but expect noisy training.

Production training: Set based on expected worker count and desired gradient quality. A good rule: 30-50% of your typical concurrent workers.
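The 30-50% rule of thumb can be expressed as a small helper. This is hypothetical; the midpoint of 40% is used here as the assumption:

```python
def recommended_threshold(concurrent_workers: int) -> int:
    """Suggest num_workunits_per_update as ~40% of typical concurrent
    workers (midpoint of the 30-50% rule), with a floor of 1."""
    return max(1, round(concurrent_workers * 0.4))
```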

Update frequency vs. convergence speed

The actual update frequency depends on both parameters:

Images per update = images_per_work_unit × num_workunits_per_update

Example with defaults:
320 × 3 = 960 images per model update
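The formula above, as a one-liner with the default values:

```python
def images_per_update(images_per_work_unit: int = 320,
                      num_workunits_per_update: int = 3) -> int:
    """Total images contributing to each model update."""
    return images_per_work_unit * num_workunits_per_update

default_total = images_per_update()  # 320 x 3
```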

More frequent updates (fewer images):

  • Faster iterations through the epoch

  • More opportunities to correct course

  • Higher overhead from weight synchronization

  • Can be noisier

Less frequent updates (more images):

  • More stable gradient estimates

  • Less synchronization overhead

  • Slower to respond to training issues

  • Similar to traditional large-batch training

Finding the balance:

  1. Start with defaults for your class size

  2. Monitor training metrics (loss, sample quality)

  3. If training is unstable: increase num_workunits_per_update

  4. If training is too slow: decrease num_workunits_per_update or images_per_work_unit

  5. Adjust based on worker reliability and network conditions

Worker coordination patterns

Different parameter combinations create different worker coordination patterns:

Pattern 1: Fast iteration (testing)

training:
  images_per_work_unit: 160
  num_workunits_per_update: 1

  • Single worker can drive training

  • Fast feedback, noisy gradients

  • Good for debugging, not production

Pattern 2: Balanced (small/medium class)

training:
  images_per_work_unit: 320
  num_workunits_per_update: 3

  • 3 workers contribute per update

  • Good balance of quality and speed

  • Default configuration

Pattern 3: High quality (large class)

training:
  images_per_work_unit: 480
  num_workunits_per_update: 10

  • Wait for many gradient samples

  • Best gradient quality

  • Slower but more stable training

Pattern 4: Efficient (stable workers)

training:
  images_per_work_unit: 640
  num_workunits_per_update: 10

  • Maximize work per database operation

  • Assumes workers can handle larger units

  • Good for low-latency networks

Monitoring and adjustment

Watch these metrics to tune your configuration:

  1. Work unit completion rate: If workers finish faster than the coordinator aggregates, consider raising num_workunits_per_update

  2. Cancelled work units: A high cancellation rate means too many work units are being created or num_workunits_per_update is too low

  3. Worker idle time: If workers often wait for new work units, reduce images_per_work_unit or num_workunits_per_update

  4. Training stability: If loss oscillates wildly, increase num_workunits_per_update for better gradients

  5. Sample quality: If samples don’t improve, try different aggregation thresholds

Example configurations

Small class (2-5 students)

training:
  images_per_work_unit: 320
  num_workunits_per_update: 2

worker:
  batch_size: 64

Large class (10+ students)

training:
  images_per_work_unit: 480
  num_workunits_per_update: 5

worker:
  batch_size: 32

CPU-only mode

worker:
  batch_size: 16

Quick testing

training:
  images_per_work_unit: 160
  num_workunits_per_update: 1

worker:
  batch_size: 16

High quality (long training)

training:
  images_per_work_unit: 640
  num_workunits_per_update: 10
  learning_rate: 0.0001

worker:
  batch_size: 64

Environment variables

Some settings can be overridden with environment variables:

# Database password (more secure than config file)
export DB_PASSWORD=secret

# Hugging Face token
export HF_TOKEN=hf_...

# Override config file
export CONFIG_PATH=/path/to/custom/config.yaml

Security best practices

Don’t commit secrets:

# Add to .gitignore
config.yaml
.env

Use environment variables:

database:
  password: ${DB_PASSWORD}

huggingface:
  token: ${HF_TOKEN}
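Assuming the config loader substitutes ${VAR} placeholders from the environment (an assumption about the loader's behavior), Python's os.path.expandvars performs exactly this substitution:

```python
import os

def expand_env(value: str) -> str:
    """Expand ${VAR} placeholders from the environment; placeholders
    for unset variables are left as-is."""
    return os.path.expandvars(value)

os.environ["DB_PASSWORD"] = "secret"
password = expand_env("${DB_PASSWORD}")
```

Leaving unset placeholders intact makes missing secrets easy to spot in logs and error messages.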

Create template:

# Provide template without secrets
cp config.yaml config.yaml.template

# Remove sensitive data from template
sed -i 's/password: .*/password: YOUR_PASSWORD/' config.yaml.template

Validation

Validate your config:

python -c "from src.utils import load_config; load_config('config.yaml'); print('Config OK')"

Next steps