Troubleshooting

Common issues and solutions for distributed GAN training.

Installation issues

Python version too old

Problem: SyntaxError or ModuleNotFoundError

Solution:

# Check version
python --version

# Need Python 3.10+
# Install newer version or use pyenv
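
If you want to guard against this in code, a minimal startup check (a sketch, not part of the project) looks like:

import sys

# Fail fast on interpreters older than the required 3.10
if sys.version_info < (3, 10):
    raise RuntimeError(f'Python 3.10+ required, found {sys.version.split()[0]}')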

PyTorch CUDA mismatch

Problem: RuntimeError: CUDA not available despite having GPU

Solution:

# Check CUDA version
nvidia-smi

# Install matching PyTorch
# For CUDA 12.6:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# For CPU-only:
pip install -r .devcontainer/cpu/requirements.txt
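
A quick sanity check from Python confirms the installed wheel actually sees the GPU:

import torch

print(torch.__version__)          # e.g. 2.x.x+cu126 for a CUDA 12.6 wheel
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # should print True on a working GPU setup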

Database connection fails

Problem: psycopg2.OperationalError: could not connect

Solution:

# Verify credentials in config.yaml
# Test connection manually
psql -h HOST -p 54321 -U USER -d DATABASE

# Check firewall allows port 54321
# Verify database is publicly accessible

Runtime issues

Out of memory

Problem: RuntimeError: CUDA out of memory

Solution:

# Reduce batch size in config.yaml
worker:
  batch_size: 16  # or 8

# Close other GPU applications
# Check GPU memory: nvidia-smi
# Try CPU-only mode
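
To see how much memory is actually free before lowering the batch size, a short check (assumes a CUDA device is present):

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f'Free:  {free / 1e9:.1f} GB')
print(f'Total: {total / 1e9:.1f} GB')
print(f'Allocated by this process: {torch.cuda.memory_allocated() / 1e9:.1f} GB')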

Worker can’t find dataset

Problem: RuntimeError: Dataset not found or similar

Solution:

  • The dataset should download automatically from Hugging Face on first run

  • If download fails, check your internet connection

  • Verify huggingface.repo_id is set in config.yaml

  • Check you have ~1.5 GB free disk space (see the disk-space check below)

# Verify dataset zip exists after download
ls data/celeba_torchvision/data/img_align_celeba.zip

# Check config.yaml path matches
data:
  dataset_path: data/celeba_torchvision/data/img_align_celeba
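
A quick Python check for the ~1.5 GB of free space, assuming the data directory from the config above:

import shutil
from pathlib import Path

data_dir = Path('data/celeba_torchvision')  # path from the config above
data_dir.mkdir(parents=True, exist_ok=True)
free_gb = shutil.disk_usage(data_dir).free / 1e9
print(f'Free disk space: {free_gb:.1f} GB')  # want roughly 1.5 GB or more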

No work units available

Problem: Worker polls but finds no work

Solution:

  • Wait for coordinator to start

  • Check coordinator is creating work units

  • Verify database connection

  • Check the worker's current iteration matches the coordinator's

-- Check work units exist
SELECT COUNT(*), status FROM work_units GROUP BY status;

Work units timeout

Problem: Work units marked as stalled and reclaimed

Solution:

# Increase timeout in config.yaml
worker:
  work_unit_timeout: 600  # 10 minutes

# Check worker performance
# May need to reduce batch size
# Check network speed
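
Before raising the timeout blindly, it helps to measure how long a work unit actually takes. A rough sketch, where process_batch is a hypothetical stand-in for the worker's per-batch step (forward + backward + upload):

import time

def process_batch():
    time.sleep(0.1)  # placeholder workload; time the real step instead

start = time.monotonic()
runs = 5
for _ in range(runs):
    process_batch()
per_batch = (time.monotonic() - start) / runs
print(f'~{per_batch:.1f}s per batch; set work_unit_timeout comfortably '
      'above batches-per-unit times this figure')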

Training issues

Loss values are NaN

Problem: Generator or discriminator loss shows NaN

Solution:

# Reduce learning rate
training:
  learning_rate: 0.0001

# Check for bad gradients
# Restart training from checkpoint
# Verify dataset loaded correctly
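
To locate bad gradients, one approach is to scan for non-finite values after the backward pass. A sketch using the Generator import from the debugging section below; the same check works on the discriminator:

import torch
from src.models.dcgan import Generator

# Run one backward pass, then look for NaN/Inf gradients before stepping
gen = Generator(latent_dim=100)
loss = gen(torch.randn(2, 100, 1, 1)).mean()
loss.backward()

bad = [name for name, p in gen.named_parameters()
       if p.grad is not None and not torch.isfinite(p.grad).all()]
print(f'Non-finite gradients in: {bad}' if bad else 'All gradients finite')

# Clipping often prevents the blow-up in the first place
torch.nn.utils.clip_grad_norm_(gen.parameters(), max_norm=1.0)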

Poor image quality

Problem: Generated images look like noise

Solution:

  • Train longer (more epochs)

  • Check loss values are decreasing

  • Verify dataset images look correct

  • Try different hyperparameters

# Check sample images
ls data/outputs/samples/
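
If no samples have been written yet, a grid can be generated directly. This sketch assumes the Generator from the debugging section below; weights are untrained unless you load a checkpoint first:

import os
import torch
from torchvision.utils import save_image
from src.models.dcgan import Generator

gen = Generator(latent_dim=100)
gen.eval()  # load a checkpoint with gen.load_state_dict(...) for real progress
with torch.no_grad():
    fakes = gen(torch.randn(16, 100, 1, 1))

os.makedirs('data/outputs/samples', exist_ok=True)
save_image(fakes, 'data/outputs/samples/grid.png', nrow=4, normalize=True)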

Training very slow

Problem: Iterations take a very long time

Solution:

  • Need more active workers

  • Check database performance

  • Verify network speed

  • Increase num_workunits_per_update so each update aggregates gradients from more work units

Discriminator dominates

Problem: Discriminator loss goes to 0, generator loss increases

Solution:

# Lower learning rate
training:
  learning_rate: 0.0001

# This affects both generator and discriminator
# Common in GAN training - discriminator can overpower
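
Beyond lowering the learning rate, a common GAN-level mitigation (a general technique, not necessarily what this project does) is one-sided label smoothing, which keeps the discriminator from becoming overconfident on real images:

import torch
import torch.nn.functional as F

# d_real / d_fake stand in for the discriminator's sigmoid outputs
d_real = torch.rand(16, 1)
d_fake = torch.rand(16, 1)

real_labels = torch.full_like(d_real, 0.9)  # smoothed, instead of 1.0
fake_labels = torch.zeros_like(d_fake)
d_loss = (F.binary_cross_entropy(d_real, real_labels)
          + F.binary_cross_entropy(d_fake, fake_labels))
print(d_loss.item())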

Database issues

Database full

Problem: ERROR: disk full

Solution:

-- Check table sizes
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public';

-- Clean old gradients
DELETE FROM gradients
WHERE uploaded_at < NOW() - INTERVAL '1 day';

-- Vacuum to reclaim space (VACUUM FULL takes an exclusive lock while it runs)
VACUUM FULL;

Too many connections

Problem: FATAL: too many connections

Solution:

-- Check current connections
SELECT COUNT(*) FROM pg_stat_activity;

-- Increase max_connections in postgresql.conf
max_connections = 100

-- Or use connection pooling
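
With psycopg2, pooling can be done in-process. A minimal sketch, with a placeholder DSN in the same HOST/USER style as above:

from psycopg2.pool import SimpleConnectionPool

# Placeholder DSN; substitute your real credentials
pool = SimpleConnectionPool(
    minconn=1, maxconn=5,
    dsn='postgresql://USER:PASSWORD@HOST:54321/DATABASE')

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute('SELECT COUNT(*) FROM pg_stat_activity;')
        print(cur.fetchone())
finally:
    pool.putconn(conn)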

Slow queries

Problem: Database queries take a long time

Solution:

-- Add indexes
CREATE INDEX idx_work_units_status ON work_units(status, iteration);
CREATE INDEX idx_workers_heartbeat ON workers(last_heartbeat);

-- Analyze tables
ANALYZE work_units;
ANALYZE workers;

Network issues

Timeouts

Problem: Frequent connection timeouts

Solution:

# Increase poll interval
worker:
  poll_interval: 10  # seconds

# Check network stability
# ping DATABASE_HOST

# Use closer database region
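
For transient timeouts, retrying with exponential backoff is often enough. A sketch (the helper name is ours, not the project's):

import time
import psycopg2

# Hypothetical helper: retry transient failures with exponential backoff
def connect_with_retry(db_url, attempts=5):
    for attempt in range(attempts):
        try:
            return psycopg2.connect(db_url)
        except psycopg2.OperationalError as exc:
            if attempt == attempts - 1:
                raise
            wait = 2 ** attempt
            print(f'Connection failed ({exc}); retrying in {wait}s')
            time.sleep(wait)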

Slow uploads

Problem: Gradient uploads take a very long time

Solution:

  • Check network speed

  • Database may be far away geographically

  • Consider compressing gradients (see the sketch after this list)

  • Use closer database provider
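
If you experiment with compression, here is a sketch of gzipping a serialized gradient payload. The gradient dict is a zero-filled placeholder; benchmark with real gradients, since noisy float32 data compresses poorly:

import gzip
import io
import torch
from src.models.dcgan import Generator

gen = Generator(latent_dim=100)
# Zero-filled placeholder gradients; measure with real ones before adopting
grads = {name: torch.zeros_like(p) for name, p in gen.named_parameters()}

buf = io.BytesIO()
torch.save(grads, buf)
raw = buf.getvalue()
print(f'{len(raw)} bytes -> {len(gzip.compress(raw))} bytes compressed')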

Colab-specific issues

Session disconnects

Problem: Colab session times out

Solution:

  • Keep browser tab active

  • Don’t idle too long

  • Use Colab Pro for longer sessions

  • Re-run cells to resume

GPU quota exceeded

Problem: Can’t get GPU runtime

Solution:

  • Wait a few hours for quota reset

  • Use CPU runtime temporarily

  • Consider Colab Pro

  • Try at different time of day

Files disappear

Problem: Dataset or config lost after disconnect

Solution:

# Save important files to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy config
!cp config.yaml /content/drive/MyDrive/

Development issues

Import errors

Problem: ModuleNotFoundError for project modules

Solution:

# Ensure in correct directory
import os
os.chdir('/path/to/GANNs-with-friends')

# Or add to path
import sys
sys.path.insert(0, '/path/to/GANNs-with-friends/src')

Git issues

Problem: Can’t push/pull changes

Solution:

# Stash local changes
git stash

# Pull updates
git pull

# Reapply changes
git stash pop

# Or create branch
git checkout -b my-changes

Debugging techniques

Enable debug logging

import logging
logging.basicConfig(level=logging.DEBUG)
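
DEBUG output can be flooded by third-party libraries; raising just their levels keeps the project's messages readable (urllib3 here is an example, not a confirmed dependency):

import logging

logging.basicConfig(level=logging.DEBUG)
# Quiet chatty third-party loggers so project messages stand out
logging.getLogger('urllib3').setLevel(logging.WARNING)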

Check GPU utilization

# Monitor GPU usage
watch -n 1 nvidia-smi

# Should show:
# - GPU utilization > 0%
# - Memory usage increasing during training

Verify data loading

from src.data.dataset import CelebADataset

dataset = CelebADataset('data/celeba', image_size=64)
print(f'Dataset size: {len(dataset)}')

# Load sample
image, _ = dataset[0]
print(f'Image shape: {image.shape}')  # Should be [3, 64, 64]

Test database connection

from src.utils import load_config, build_db_url
import psycopg2

config = load_config('config.yaml')
db_url = build_db_url(config['database'])

try:
    conn = psycopg2.connect(db_url)
    print('Database connection successful!')
    conn.close()
except Exception as e:
    print(f'Connection failed: {e}')

Check model initialization

from src.models.dcgan import Generator, Discriminator
import torch

gen = Generator(latent_dim=100)
disc = Discriminator()

# Test forward pass
noise = torch.randn(1, 100, 1, 1)
fake_images = gen(noise)
print(f'Generated image shape: {fake_images.shape}')  # [1, 3, 64, 64]

output = disc(fake_images)
print(f'Discriminator output shape: {output.shape}')  # [1, 1]

Getting help

Check logs

# Worker logs
python src/worker.py 2>&1 | tee worker.log

# Coordinator logs
python src/main.py 2>&1 | tee coordinator.log

Create minimal example

Isolate the issue:

# Minimal reproduction
import torch
from src.models.dcgan import Generator

gen = Generator()
noise = torch.randn(1, 100, 1, 1)
output = gen(noise)

Report issue

Include:

  • Error message

  • Steps to reproduce

  • System information (OS, Python version, GPU)

  • Relevant config settings

Preventive measures

  • Start with small test run (few epochs)

  • Monitor initially before leaving overnight

  • Save checkpoints frequently

  • Back up database regularly

  • Test with one worker before scaling

  • Verify dataset integrity

  • Use version control

Next steps