Image generation benchmark

This benchmark compares generation latency, peak GPU VRAM, and peak system RAM across nine text-to-image diffusion models, three execution modes, and two GPU tiers. Each (model, mode) combination is run five times on each GPU, preceded by one untimed warmup run.

Prompt: "a turtle and a bird together in a forest"

Output resolution: 512 × 512  |  Replicates: 5 timed runs per combination
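The warmup-plus-replicates protocol above can be sketched as a small timing harness. This is an illustrative sketch, not the benchmark's actual code; the `generate` callable stands in for a diffusers pipeline invocation:

```python
import statistics
import time
from typing import Callable


def benchmark(generate: Callable[[], None],
              warmup: int = 1, replicates: int = 5) -> dict:
    """Run `warmup` untimed calls, then `replicates` timed calls of `generate`.

    Mirrors the protocol used here: one untimed warmup run (absorbs model
    compilation and caching), then five timed runs per combination.
    """
    for _ in range(warmup):
        generate()  # untimed warmup
    latencies = []
    for _ in range(replicates):
        t0 = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - t0)
    return {
        "latencies_s": latencies,
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }
```

Reporting the median of five runs keeps a single slow outlier (e.g. a background I/O stall) from skewing the result.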

Hardware

| Label | GPU | VRAM | Architecture |
|---|---|---|---|
| gtx1070 | GeForce GTX 1070 | 8 GB | Pascal (sm_61) |
| p100 | Tesla P100 | 16 GB | Pascal (sm_60) |

Models

| Model | Generation | Architecture | Gated | Steps | Paper |
|---|---|---|---|---|---|
| CompVis/stable-diffusion-v1-4 | SD 1.x | UNet + CLIP | No | 30 | arXiv:2112.10752 |
| sd2-community/stable-diffusion-2-1-base | SD 2.x | UNet + OpenCLIP | No | 30 | arXiv:2112.10752 |
| stabilityai/stable-diffusion-xl-base-1.0 | SDXL | UNet (2×) + dual encoders | No | 30 | arXiv:2307.01952 |
| stabilityai/sdxl-turbo | SDXL (distilled) | UNet (2×) + dual encoders | No | 4 | arXiv:2311.17042 |
| stabilityai/stable-diffusion-3.5-medium | SD 3.5 | DiT (MMDiT) + T5 | Yes | 28 | arXiv:2403.03206 |
| stabilityai/stable-diffusion-3.5-large-turbo | SD 3.5 (distilled) | DiT (MMDiT) + T5 | Yes | 4 | arXiv:2403.03206 |
| black-forest-labs/FLUX.1-schnell | FLUX | Flow matching + T5 | Yes | 4 | GitHub |
| kandinsky-community/kandinsky-2-2-decoder | Kandinsky 2.2 | Two-stage prior + UNet | No | 30 | arXiv:2310.03502 |
| PixArt-alpha/PixArt-XL-2-512x512 | PixArt | DiT + T5 | No | 20 | arXiv:2310.00426 |

Execution modes

| Mode | Description |
|---|---|
| gpu_only | Entire pipeline resident in GPU VRAM, loaded in fp16 |
| model_offload | `enable_model_cpu_offload()`: whole submodels (text encoder, UNet/DiT, VAE) paged to the GPU one at a time |
| sequential_offload | `enable_sequential_cpu_offload()`: individual layers moved to the GPU on demand; lowest VRAM, highest latency |
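The three modes map directly onto standard diffusers pipeline calls. A minimal dispatcher might look like the following; `apply_mode` is an illustrative helper, not part of the benchmark's actual code, though the three pipeline methods it calls are the real diffusers API:

```python
def apply_mode(pipe, mode: str, device: str = "cuda"):
    """Configure a diffusers pipeline for one of the three execution modes.

    `pipe` is expected to expose the standard DiffusionPipeline methods
    `to()`, `enable_model_cpu_offload()`, and
    `enable_sequential_cpu_offload()`.
    """
    if mode == "gpu_only":
        # Entire pipeline resident in VRAM (load with torch_dtype=float16).
        pipe.to(device)
    elif mode == "model_offload":
        # Whole submodels paged to the GPU one at a time.
        pipe.enable_model_cpu_offload()
    elif mode == "sequential_offload":
        # Individual layers moved to the GPU on demand.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown mode: {mode}")
    return pipe
```

Note that the two offload modes manage device placement themselves, so the pipeline must not also be moved with `.to("cuda")` beforehand.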

Raw data: data/gtx1070/benchmark_results.json  |  data/p100/benchmark_results.json
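Assuming the raw files are plain JSON (their exact schema is not documented here), a minimal loader for exploring them might look like:

```python
import json
from pathlib import Path


def load_results(path: str):
    """Parse one of the benchmark_results.json files.

    Returns the raw parsed structure; adapt field access once you have
    inspected an actual file, since the schema is not specified in this
    README.
    """
    return json.loads(Path(path).read_text())
```

For example, `load_results("data/gtx1070/benchmark_results.json")` would return the parsed records for the GTX 1070 runs.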