GANNs with friends

Getting started

  • Overview
    • What makes this project unique?
      • Database-coordinated distributed training
      • Multiple participation paths
      • Educational focus
    • Key concepts
      • Distributed data parallel training
      • GAN architecture
      • Database as message queue
    • Project outcomes
    • Next steps
  • Background
    • How GANs work
      • The adversarial game
      • The training objective
      • Training dynamics
      • GAN applications beyond image generation
      • Why GANs are challenging
    • Distributed training for machine learning
      • Why distribute training?
      • Types of distributed training
        • Data parallelism (what we use)
        • Model parallelism
      • Gradient aggregation strategies
        • Synchronous (what we use)
        • Asynchronous (alternative approach)
      • Our approach: Database-coordinated synchronous training
    • Next steps
  • Installation
    • Comparison of installation paths
    • Prerequisites by path
      • All paths require
      • Path-specific requirements
    • CPU vs GPU training
    • Verification
    • Next steps
  • Quick start
    • For students (workers)
      • Step 1: Choose your path
      • Step 2: Get database credentials
      • Step 3: Configure and run
    • For instructors (coordinator)
      • Step 1: Set up database
      • Step 2: Initialize the training system
      • Step 3: Start coordinator
      • Step 4: Monitor progress
    • Optional: Hugging Face integration
    • Viewing results
    • Troubleshooting
    • Next steps

Setup paths

  • Google Colab setup
    • Advantages
    • Limitations
    • Setup steps
      • 1. Open the Colab notebook
      • 2. Enable GPU runtime
      • 3. Run the setup cells
      • 4. Configure database connection
      • 5. Start worker
    • Keeping the worker running
    • Monitoring your contribution
    • Stopping the worker
    • Tips for Colab users
    • Troubleshooting
    • Next steps
  • Dev container setup
    • GPU configuration
      • Advantages
      • Prerequisites
    • CPU configuration
      • Advantages
      • Prerequisites
    • Installation steps
      • 1. Install prerequisites
      • 2. Clone repository
      • 3. Open in container
      • 4. Configure database
      • 5. Start worker
    • Container features
      • Hardware detection
      • Jupyter notebooks
      • Development tools
    • Working with the container
      • Restart container
      • Access terminal
      • Install additional packages
      • File access
    • Troubleshooting
    • Next steps
  • Native Python setup
    • Advantages
    • Disadvantages
    • Prerequisites
    • Installation steps
      • 1. Verify Python version
      • 2. Clone repository
      • 3. Create virtual environment
      • 4. Install dependencies
      • 5. Verify installation
      • 6. Configure database
      • 7. Start worker
    • Managing the environment
      • Activate environment
      • Deactivate environment
      • Update dependencies
      • Freeze current environment
    • CPU vs GPU
    • Troubleshooting
    • Next steps
  • Conda environment setup
    • Advantages
    • Prerequisites
    • Installation steps
      • 1. Install Conda
      • 2. Clone repository
      • 3. Create conda environment
      • 4. Activate environment
      • 5. Verify installation
      • 6. Configure database
      • 7. Start worker
    • Managing conda environments
      • List environments
      • Activate/deactivate
      • Update environment
      • Export environment
      • Remove environment
    • Environment files explained
      • environment.yml (GPU)
      • .devcontainer/cpu/environment.yml (CPU-only)
    • Troubleshooting
    • Next steps
  • Local training setup
    • When to use local training
    • Trade-offs
    • Prerequisites
    • Quick start
      • 1. Skip database configuration
      • 2. Start training
    • Command-line options
      • Basic options
      • Advanced options
      • Resume training
    • Monitoring progress
      • Generated samples
      • Checkpoints
      • Console output
    • Viewing results
      • Use the demo notebook
      • Manual inspection
    • Comparing with distributed training
      • Performance comparison
      • Try both
    • Hyperparameter tuning
      • Learning rates
      • Batch sizes
      • Training duration
    • Troubleshooting
    • Next steps

User guides

  • Student guide
    • Your role
    • Getting started
      • 1. Choose your setup path
      • 2. Get credentials
      • 3. Configure and run
      • 4. Set your name (optional)
    • What the worker does
      • Automatic workflow
      • What you’ll see
    • Understanding the output
      • Initialization
      • During training
    • Monitoring your contribution
      • In the console
      • Database queries
      • Leaderboard (if available)
    • Best practices
      • Maximize contribution
      • Resource management
      • When to stop
    • Troubleshooting
      • No work units available
      • Connection errors
      • Out of memory
      • Worker crashes
      • Slow performance
    • FAQ
    • Learning outcomes
    • Next steps
  • Instructor guide
    • Your role
    • Pre-training setup
      • 1. Deploy database
      • 2. Initialize database
      • 3. Create student accounts
      • 4. Configure coordinator
    • Running training
      • Start coordinator
      • Monitor progress
    • Managing the training session
      • Pause training
      • Resume training
      • Adjust parameters mid-training
      • Handle stalled workers
    • Monitoring tools
      • Real-time visualization
      • Generated samples
      • Hugging Face integration
    • Best practices
      • Before class
      • During class
      • After class
    • Troubleshooting
      • No workers connecting
      • Training very slow
      • Unstable training
      • Database full
    • Grading and assessment
      • Track individual contributions
      • Metrics to consider
      • Export results
    • Advanced topics
      • Multiple coordinator instances
      • Custom work unit creation
      • Gradient verification
    • Learning objectives
    • Next steps
  • Configuration reference
    • Configuration file structure
    • Database configuration
      • Options
    • Training configuration
      • Options
    • Worker configuration
      • Options
    • Model configuration
      • Options
    • Data configuration
      • Options
    • Hugging Face configuration
      • Default behavior
      • Running your own training
      • Options
    • Understanding distributed training tradeoffs
      • Work unit size vs. database overhead
      • Worker batch size
      • Aggregation threshold vs. gradient quality
      • Update frequency vs. convergence speed
      • Worker coordination patterns
      • Monitoring and adjustment
    • Example configurations
      • Small class (2-5 students)
      • Large class (10+ students)
      • CPU-only mode
      • Quick testing
      • High quality (long training)
    • Environment variables
    • Security best practices
    • Validation
    • Next steps
  • Monitoring
    • Overview
    • Monitoring methods
      • 1. Console output
      • 2. Database queries
      • 3. Generated samples
      • 4. Hugging Face
    • Database monitoring
      • Active workers
      • Training progress
      • Work unit status
      • Worker contribution leaderboard
      • Stalled work units
    • Python monitoring script
    • Visual monitoring
      • Watch generated samples
      • Create GIF of progress
      • Plot loss curves
    • Hugging Face monitoring
    • Performance metrics
      • Worker efficiency
      • Database performance
    • Alerts and notifications
      • Email on completion
      • Slack notifications
    • Troubleshooting with monitoring
      • Issue: No workers active
      • Issue: Work units stuck
      • Issue: Slow progress
    • Dashboard ideas
      • Simple web dashboard
      • Real-time updates with WebSockets
    • Best practices
    • Next steps

Architecture

  • Architecture overview
    • Distributed deep learning fundamentals
      • Data parallelism
      • Synchronous vs asynchronous training
    • System components
    • High-level architecture
    • Data flow
      • 1. Initialization (Coordinator)
      • 2. Worker claims and processes
      • 3. Coordinator aggregates
      • 4. Iteration continues
    • Database schema
      • training_state table
      • work_units table
      • gradients table
      • workers table
    • Coordination mechanism
      • Atomic work claiming
      • Timeout and reclamation
      • Heartbeat monitoring
    • Communication patterns
      • Pull-based architecture
      • Stateless workers
      • Centralized coordination
    • Fault tolerance
      • Worker failures
      • Coordinator failures
      • Database failures
    • Scalability
      • Horizontal scaling
      • Vertical scaling
      • Database optimization
    • Security considerations
      • Database access
      • Data validation
    • Next steps

Development approach

  • Development approach and AI assistance
    • Transparency about AI usage
    • Development timeline
    • Collaborative workflow
    • Why disclose this
    • Example: Fixing the stale work unit bug
    • Example: Catching AI hallucinations

Additional resources

  • Troubleshooting
    • Installation issues
      • Python version too old
      • PyTorch CUDA mismatch
      • Database connection fails
    • Runtime issues
      • Out of memory
      • Worker can’t find dataset
      • No work units available
      • Work units timeout
    • Training issues
      • Loss values are NaN
      • Poor image quality
      • Training very slow
      • Discriminator dominates
    • Database issues
      • Database full
      • Too many connections
      • Slow queries
    • Network issues
      • Timeouts
      • Slow uploads
    • Colab-specific issues
      • Session disconnects
      • GPU quota exceeded
      • Files disappear
    • Development issues
      • Import errors
      • Git issues
    • Debugging techniques
      • Enable debug logging
      • Check GPU utilization
      • Verify data loading
      • Test database connection
      • Check model initialization
    • Getting help
      • Check logs
      • Create minimal example
      • Report issue
    • Preventive measures
    • Next steps
  • Performance tips
    • Configuration-based optimizations
      • Batch size tuning
      • DataLoader workers
      • Work unit configuration
      • Worker polling
    • Monitoring performance
      • Check GPU utilization
      • Worker throughput
      • Database performance
    • Best practices
    • Performance targets
    • Example configurations
      • Small class (3-5 workers)
      • Medium class (10-20 workers)
      • Large class (30+ workers)
    • Next steps
  • Contributing
    • Ways to contribute
    • Getting started
      • Fork and clone
      • Make your changes
    • Pull request guidelines
    • Reporting bugs
    • Development philosophy
    • Advanced performance optimization ideas
      • Database optimizations
        • Add indexes for faster queries
        • Connection pooling
        • Regular maintenance
      • Training optimizations
        • Mixed precision training
        • Gradient accumulation
      • Network optimizations
        • Gradient compression
        • Batched uploads
        • Local weight caching
      • Monitoring optimizations
        • Async heartbeats
        • Reduced logging overhead
      • Profiling tools
        • Measure performance
        • Code profiling
      • Resource management
        • CPU affinity
        • Multiple workers per GPU
      • Implementation notes
    • License
    • Questions?
  • FAQ
    • Getting started
      • What is this project about?
      • Do I need a GPU?
      • Which installation path should I choose?
      • Do I need to download the dataset?
      • Where do I get database credentials?
      • Can I use my own database?
      • Can I join training late?
      • What if I disconnect?
    • Training and results
      • How long does training take?
      • How do I know it’s working?
      • What do the loss values mean?
      • Why are my loss values different from others?
      • When will I see results?
      • How do I view generated faces?
      • Why do images look blurry?
      • Can I generate my own faces?
      • How do I save my favorite generated faces?
      • Can I train my own model?
    • System and performance
      • How does coordination work?
      • What’s the database storing?
      • What happens if my worker crashes?
      • Can I run multiple workers?
      • How is this different from PyTorch DDP?
      • Why is training slow?
      • How can I speed it up?
      • What’s the optimal number of workers?
      • Does CPU training help?
    • Troubleshooting questions
      • Worker says “no work units available”
      • Getting connection errors
      • Out of memory errors
      • Loss is NaN
    • Learning and contributing
      • What will I learn?
      • Do I need to understand all the code?
      • Can I modify the code?
      • Where can I learn more about GANs?
      • Can I contribute improvements?
      • I found a bug, what do I do?
      • Can I use this for my research?
    • Advanced topics
      • How do I add Hugging Face integration?
      • Can I use different datasets?
      • How do I modify the GAN architecture?
      • Can this scale to 100+ workers?
    • Still have questions?

© Copyright 2026, GANNs with friends contributors.