This benchmark compares generation latency, peak GPU VRAM, and peak system RAM across nine text-to-image diffusion models, three execution modes, and two GPU tiers. Each (model, mode) combination gets one untimed warmup run followed by five timed runs.
Prompt: "a turtle and a bird together in a forest"
Output resolution: 512 × 512 | Replicates: 5 timed runs per combination
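
The warmup-then-measure protocol can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the names `time_runs`, `summarize`, and the `sync` hook are hypothetical, and `generate` stands in for whatever callable produces one image.

```python
import statistics
import time
from typing import Callable, Dict, List, Optional


def time_runs(
    generate: Callable[[], object],
    warmup: int = 1,
    repeats: int = 5,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Return wall-clock latencies in seconds for `repeats` timed calls.

    `sync` is an optional hook (an assumption of this sketch) that blocks
    until queued GPU work has finished, so the clock covers the whole run.
    """
    for _ in range(warmup):
        generate()  # untimed: absorbs one-off costs such as lazy initialization

    latencies: List[float] = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain pending work before starting the clock
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for completion before stopping the clock
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Collapse the timed runs into a few summary statistics."""
    return {
        "mean_s": statistics.mean(latencies),
        "stdev_s": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
        "min_s": min(latencies),
        "max_s": max(latencies),
    }
```

With PyTorch, `sync` could plausibly be `torch.cuda.synchronize`, and peak VRAM could be read from `torch.cuda.max_memory_allocated()` after a call to `torch.cuda.reset_peak_memory_stats()`. Whether the benchmark measures memory this way is an assumption.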
| Label | GPU | VRAM | Architecture | Results |
|---|---|---|---|---|
| gtx1070 | GeForce GTX 1070 | 8 GB | Pascal (sm_61) | View results → |
| p100 | Tesla P100 | 16 GB | Pascal (sm_60) | View results → |
| Model | Generation | Architecture | Gated | Steps | Paper |
|---|---|---|---|---|---|
| CompVis/stable-diffusion-v1-4 | SD 1.x | UNet + CLIP | No | 30 | arXiv:2112.10752 |
| sd2-community/stable-diffusion-2-1-base | SD 2.x | UNet + OpenCLIP | No | 30 | arXiv:2112.10752 |
| stabilityai/stable-diffusion-xl-base-1.0 | SDXL | UNet (2×) + dual encoders | No | 30 | arXiv:2307.01952 |
| stabilityai/sdxl-turbo | SDXL (distilled) | UNet (2×) + dual encoders | No | 4 | arXiv:2311.17042 |
| stabilityai/stable-diffusion-3.5-medium | SD 3.5 | DiT (MMDiT) + T5 | Yes | 28 | arXiv:2403.03206 |
| stabilityai/stable-diffusion-3.5-large-turbo | SD 3.5 (distilled) | DiT (MMDiT) + T5 | Yes | 4 | arXiv:2403.03206 |
| black-forest-labs/FLUX.1-schnell | FLUX | Flow matching + T5 | Yes | 4 | GitHub |
| kandinsky-community/kandinsky-2-2-decoder | Kandinsky 2.2 | Two-stage prior + UNet | No | 30 | arXiv:2310.03502 |
| PixArt-alpha/PixArt-XL-2-512x512 | PixArt | DiT + T5 | No | 20 | arXiv:2310.00426 |
| Mode | Description |
|---|---|
| gpu_only | Full model loaded to GPU VRAM in fp16 |
| model_offload | enable_model_cpu_offload(), submodels paged to GPU one at a time |
| sequential_offload | enable_sequential_cpu_offload(), individual layers moved layer-by-layer |
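
As a rough sketch, the three modes could map onto a Hugging Face `diffusers` pipeline as shown below. The helper names `apply_mode` and `load_pipeline` are hypothetical. Loading in fp16 for the two offload modes is an assumption (the table states fp16 only for gpu_only), and some models may need a different pipeline class than the generic `DiffusionPipeline` used here.

```python
def apply_mode(pipe, mode: str):
    """Configure an already-loaded pipeline for one execution mode."""
    if mode == "gpu_only":
        # Keep every submodel resident in GPU VRAM.
        pipe.to("cuda")
    elif mode == "model_offload":
        # Move whole submodels to the GPU one at a time.
        pipe.enable_model_cpu_offload()
    elif mode == "sequential_offload":
        # Move weights to the GPU layer by layer.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return pipe


def load_pipeline(model_id: str, mode: str):
    """Load a model in fp16 and apply the requested mode (illustrative only)."""
    # Imported lazily so apply_mode can be used without torch or diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    return apply_mode(pipe, mode)
```

A call such as `load_pipeline("CompVis/stable-diffusion-v1-4", "model_offload")` would then return a pipeline ready to generate. Note that the offload methods manage device placement themselves, which is why this sketch calls `pipe.to("cuda")` only in the gpu_only branch.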
Raw data: data/gtx1070/benchmark_results.json | data/p100/benchmark_results.json