Troubleshooting

Common issues and solutions for distributed GAN training.

Installation issues

Python version too old

Problem: SyntaxError or ModuleNotFoundError

Solution:

# Check version
python --version

# Need Python 3.10+
# Install newer version or use pyenv
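
If you want to guard against this in code, a minimal startup check (a sketch, not part of the project) looks like:

import sys

# Fail fast on interpreters older than the required 3.10
if sys.version_info < (3, 10):
    raise RuntimeError(f'Python 3.10+ required, found {sys.version.split()[0]}')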

PyTorch CUDA mismatch

Problem: RuntimeError: CUDA not available despite having GPU

Solution:

# Check CUDA version
nvidia-smi

# Install matching PyTorch
# For CUDA 12.6:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# For CPU-only:
pip install -r .devcontainer/cpu/requirements.txt
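
A quick sanity check from Python confirms the installed wheel actually sees the GPU:

import torch

print(torch.__version__)          # e.g. 2.x.x+cu126 for a CUDA 12.6 wheel
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # should print True on a working GPU setup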

Database connection fails

Problem: psycopg2.OperationalError: could not connect

Solution:

# Verify credentials in config.yaml
# Test connection manually
psql -h HOST -p 54321 -U USER -d DATABASE

# Check firewall allows port 54321
# Verify database is publicly accessible

Runtime issues

Out of memory

Problem: RuntimeError: CUDA out of memory

Solution:

# Reduce batch size in config.yaml
worker:
  batch_size: 16  # or 8

# Close other GPU applications
# Check GPU memory: nvidia-smi
# Try CPU-only mode
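
To see how much memory is actually free before lowering the batch size, a short check (assumes a CUDA device is present):

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f'Free:  {free / 1e9:.1f} GB')
print(f'Total: {total / 1e9:.1f} GB')
print(f'Allocated by this process: {torch.cuda.memory_allocated() / 1e9:.1f} GB')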

Worker can’t find dataset

Problem: RuntimeError: Dataset not found or similar

Solution:

  • The dataset should download automatically from Hugging Face on first run

  • If download fails, check your internet connection

  • Verify huggingface.repo_id is set in config.yaml

  • Check you have ~1.5 GB free disk space (see the disk-space check below)

# Verify dataset zip exists after download
ls data/celeba_torchvision/data/img_align_celeba.zip

# Check config.yaml path matches
data:
  dataset_path: data/celeba_torchvision/data/img_align_celeba
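
A quick Python check for the ~1.5 GB of free space, assuming the data directory from the config above:

import shutil
from pathlib import Path

data_dir = Path('data/celeba_torchvision')  # path from the config above
data_dir.mkdir(parents=True, exist_ok=True)
free_gb = shutil.disk_usage(data_dir).free / 1e9
print(f'Free disk space: {free_gb:.1f} GB')  # want roughly 1.5 GB or more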

No work units available

Problem: Worker polls but finds no work

Solution:

  • Wait for coordinator to start

  • Check coordinator is creating work units

  • Verify database connection

  • Check the worker's current iteration matches the coordinator's

-- Check work units exist
SELECT COUNT(*), status FROM work_units GROUP BY status;

Work units timeout

Problem: Work units marked as stalled and reclaimed

Solution:

# Increase timeout in config.yaml
worker:
  work_unit_timeout: 600  # 10 minutes

# Check worker performance
# May need to reduce batch size
# Check network speed
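
Before raising the timeout blindly, it helps to measure how long a work unit actually takes. A rough sketch, where process_batch is a hypothetical stand-in for the worker's per-batch step (forward + backward + upload):

import time

def process_batch():
    time.sleep(0.1)  # placeholder workload; time the real step instead

start = time.monotonic()
runs = 5
for _ in range(runs):
    process_batch()
per_batch = (time.monotonic() - start) / runs
print(f'~{per_batch:.1f}s per batch; set work_unit_timeout comfortably '
      'above batches-per-unit times this figure')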

Training issues

Loss values are NaN

Problem: Generator or discriminator loss shows NaN

Solution:

# Reduce learning rate
training:
  learning_rate: 0.0001

# Check for bad gradients
# Restart training from checkpoint
# Verify dataset loaded correctly
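
To locate bad gradients, one approach is to scan for non-finite values after the backward pass. A sketch using the Generator import from the debugging section below; the same check works on the discriminator:

import torch
from src.models.dcgan import Generator

# Run one backward pass, then look for NaN/Inf gradients before stepping
gen = Generator(latent_dim=100)
loss = gen(torch.randn(2, 100, 1, 1)).mean()
loss.backward()

bad = [name for name, p in gen.named_parameters()
       if p.grad is not None and not torch.isfinite(p.grad).all()]
print(f'Non-finite gradients in: {bad}' if bad else 'All gradients finite')

# Clipping often prevents the blow-up in the first place
torch.nn.utils.clip_grad_norm_(gen.parameters(), max_norm=1.0)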

Poor image quality

Problem: Generated images look like noise

Solution:

  • Train longer (more epochs)

  • Check loss values are decreasing

  • Verify dataset images look correct

  • Try different hyperparameters

# Check sample images
ls data/outputs/samples/
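
If no samples have been written yet, a grid can be generated directly. This sketch assumes the Generator from the debugging section below; weights are untrained unless you load a checkpoint first:

import os
import torch
from torchvision.utils import save_image
from src.models.dcgan import Generator

gen = Generator(latent_dim=100)
gen.eval()  # load a checkpoint with gen.load_state_dict(...) for real progress
with torch.no_grad():
    fakes = gen(torch.randn(16, 100, 1, 1))

os.makedirs('data/outputs/samples', exist_ok=True)
save_image(fakes, 'data/outputs/samples/grid.png', nrow=4, normalize=True)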

Training very slow

Problem: Iterations take a very long time

Solution:

  • Need more active workers

  • Check database performance

  • Verify network speed

  • Increase num_workunits_per_update so each update aggregates gradients from more work units

Discriminator dominates

Problem: Discriminator loss goes to 0, generator loss increases

Solution:

# Lower learning rate
training:
  learning_rate: 0.0001

# This affects both generator and discriminator
# Common in GAN training - discriminator can overpower
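
Beyond lowering the learning rate, a common GAN-level mitigation (a general technique, not necessarily what this project does) is one-sided label smoothing, which keeps the discriminator from becoming overconfident on real images:

import torch
import torch.nn.functional as F

# d_real / d_fake stand in for the discriminator's sigmoid outputs
d_real = torch.rand(16, 1)
d_fake = torch.rand(16, 1)

real_labels = torch.full_like(d_real, 0.9)  # smoothed, instead of 1.0
fake_labels = torch.zeros_like(d_fake)
d_loss = (F.binary_cross_entropy(d_real, real_labels)
          + F.binary_cross_entropy(d_fake, fake_labels))
print(d_loss.item())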

Database issues

Database full

Problem: ERROR: disk full

Solution:

-- Check table sizes
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public';

-- Clean old gradients
DELETE FROM gradients
WHERE uploaded_at < NOW() - INTERVAL '1 day';

-- Vacuum to reclaim space (VACUUM FULL takes an exclusive lock while it runs)
VACUUM FULL;

Too many connections

Problem: FATAL: too many connections

Solution:

-- Check current connections
SELECT COUNT(*) FROM pg_stat_activity;

-- Increase max_connections in postgresql.conf
max_connections = 100

-- Or use connection pooling
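
With psycopg2, pooling can be done in-process. A minimal sketch, with a placeholder DSN in the same HOST/USER style as above:

from psycopg2.pool import SimpleConnectionPool

# Placeholder DSN; substitute your real credentials
pool = SimpleConnectionPool(
    minconn=1, maxconn=5,
    dsn='postgresql://USER:PASSWORD@HOST:54321/DATABASE')

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute('SELECT COUNT(*) FROM pg_stat_activity;')
        print(cur.fetchone())
finally:
    pool.putconn(conn)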

Slow queries

Problem: Database queries take a long time

Solution:

-- Add indexes
CREATE INDEX idx_work_units_status ON work_units(status, iteration);
CREATE INDEX idx_workers_heartbeat ON workers(last_heartbeat);

-- Analyze tables
ANALYZE work_units;
ANALYZE workers;

Network issues

Timeouts

Problem: Frequent connection timeouts

Solution:

# Increase poll interval
worker:
  poll_interval: 10  # seconds

# Check network stability
# ping DATABASE_HOST

# Use closer database region
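
For transient timeouts, retrying with exponential backoff is often enough. A sketch (the helper name is ours, not the project's):

import time
import psycopg2

# Hypothetical helper: retry transient failures with exponential backoff
def connect_with_retry(db_url, attempts=5):
    for attempt in range(attempts):
        try:
            return psycopg2.connect(db_url)
        except psycopg2.OperationalError as exc:
            if attempt == attempts - 1:
                raise
            wait = 2 ** attempt
            print(f'Connection failed ({exc}); retrying in {wait}s')
            time.sleep(wait)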

Slow uploads

Problem: Gradient uploads take a very long time

Solution:

  • Check network speed

  • Database may be far away geographically

  • Consider compressing gradients (see the sketch after this list)

  • Use closer database provider
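
If you experiment with compression, here is a sketch of gzipping a serialized gradient payload. The gradient dict is a zero-filled placeholder; benchmark with real gradients, since noisy float32 data compresses poorly:

import gzip
import io
import torch
from src.models.dcgan import Generator

gen = Generator(latent_dim=100)
# Zero-filled placeholder gradients; measure with real ones before adopting
grads = {name: torch.zeros_like(p) for name, p in gen.named_parameters()}

buf = io.BytesIO()
torch.save(grads, buf)
raw = buf.getvalue()
print(f'{len(raw)} bytes -> {len(gzip.compress(raw))} bytes compressed')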

Colab-specific issues

Session disconnects

Problem: Colab session times out

Solution:

  • Keep browser tab active

  • Don’t idle too long

  • Use Colab Pro for longer sessions

  • Re-run cells to resume

GPU quota exceeded

Problem: Can’t get GPU runtime

Solution:

  • Wait a few hours for quota reset

  • Use CPU runtime temporarily

  • Consider Colab Pro

  • Try at different time of day

Files disappear

Problem: Dataset or config lost after disconnect

Solution:

# Save important files to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy config
!cp config.yaml /content/drive/MyDrive/

Development issues

Import errors

Problem: ModuleNotFoundError for project modules

Solution:

# Ensure in correct directory
import os
os.chdir('/path/to/GANNs-with-friends')

# Or add to path
import sys
sys.path.insert(0, '/path/to/GANNs-with-friends/src')

Git issues

Problem: Can’t push/pull changes

Solution:

# Stash local changes
git stash

# Pull updates
git pull

# Reapply changes
git stash pop

# Or create branch
git checkout -b my-changes

Debugging techniques

Enable debug logging

import logging
logging.basicConfig(level=logging.DEBUG)
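
DEBUG output can be flooded by third-party libraries; raising just their levels keeps the project's messages readable (urllib3 here is an example, not a confirmed dependency):

import logging

logging.basicConfig(level=logging.DEBUG)
# Quiet chatty third-party loggers so project messages stand out
logging.getLogger('urllib3').setLevel(logging.WARNING)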

Check GPU utilization

# Monitor GPU usage
watch -n 1 nvidia-smi

# Should show:
# - GPU utilization > 0%
# - Memory usage increasing during training

Verify data loading

from src.data.dataset import CelebADataset

dataset = CelebADataset('data/celeba', image_size=64)
print(f'Dataset size: {len(dataset)}')

# Load sample
image, _ = dataset[0]
print(f'Image shape: {image.shape}')  # Should be [3, 64, 64]

Test database connection

from src.utils import load_config, build_db_url
import psycopg2

config = load_config('config.yaml')
db_url = build_db_url(config['database'])

try:
    conn = psycopg2.connect(db_url)
    print('Database connection successful!')
    conn.close()
except Exception as e:
    print(f'Connection failed: {e}')

Check model initialization

from src.models.dcgan import Generator, Discriminator
import torch

gen = Generator(latent_dim=100)
disc = Discriminator()

# Test forward pass
noise = torch.randn(1, 100, 1, 1)
fake_images = gen(noise)
print(f'Generated image shape: {fake_images.shape}')  # [1, 3, 64, 64]

output = disc(fake_images)
print(f'Discriminator output shape: {output.shape}')  # [1, 1]

Getting help

Check logs

# Worker logs
python src/worker.py 2>&1 | tee worker.log

# Coordinator logs
python src/main.py 2>&1 | tee coordinator.log

Create minimal example

Isolate the issue:

# Minimal reproduction
import torch
from src.models.dcgan import Generator

gen = Generator()
noise = torch.randn(1, 100, 1, 1)
output = gen(noise)

Report issue

Include:

  • Error message

  • Steps to reproduce

  • System information (OS, Python version, GPU)

  • Relevant config settings

Preventive measures

  • Start with small test run (few epochs)

  • Monitor initially before leaving overnight

  • Save checkpoints frequently

  • Back up database regularly

  • Test with one worker before scaling

  • Verify dataset integrity

  • Use version control

Next steps