Image generation benchmark

This benchmark compares generation latency, peak GPU VRAM, and peak system RAM across nine text-to-image diffusion models, three execution modes, and two GPU tiers. Each (model, mode) combination gets one untimed warmup run followed by five timed runs.

Prompt: "a turtle and a bird together in a forest"

Output resolution: 512 × 512 | Replicates: 5 timed runs per combination
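The warmup-then-replicate protocol can be sketched as below. This is a minimal illustration, not the benchmark's actual harness: `time_runs` and its `generate` argument are hypothetical names, and `generate` stands in for a single pipeline call with the prompt above.

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(generate: Callable[[], object], warmup: int = 1, repeats: int = 5) -> Dict[str, object]:
    """Run `generate` untimed `warmup` times, then time `repeats` runs.

    `generate` is a hypothetical zero-argument callable that wraps one
    image generation (one model, one execution mode, fixed prompt).
    """
    # Warmup runs absorb one-off costs (kernel compilation, cache fills)
    # and are deliberately excluded from the reported latencies.
    for _ in range(warmup):
        generate()

    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate()
        latencies.append(time.perf_counter() - start)

    return {
        "latencies_s": latencies,
        "mean_s": statistics.mean(latencies),
        # stdev needs at least two samples
        "stdev_s": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
    }
```

On a real GPU run, the timed region would likely also need `torch.cuda.synchronize()` before each clock read, and peak VRAM could be read with `torch.cuda.max_memory_allocated()` after a `torch.cuda.reset_peak_memory_stats()` call. Whether this benchmark measures it that way is an assumption.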

Hardware

Label	GPU	VRAM	Architecture	Results
gtx1070	GeForce GTX 1070	8 GB	Pascal (sm_61)	View results →
p100	Tesla P100	16 GB	Pascal (sm_60)	View results →

Models

Model	Generation	Architecture	Gated	Steps	Paper
CompVis/stable-diffusion-v1-4	SD 1.x	UNet + CLIP	No	30	arXiv:2112.10752
sd2-community/stable-diffusion-2-1-base	SD 2.x	UNet + OpenCLIP	No	30	arXiv:2112.10752
stabilityai/stable-diffusion-xl-base-1.0	SDXL	UNet (2×) + dual encoders	No	30	arXiv:2307.01952
stabilityai/sdxl-turbo	SDXL (distilled)	UNet (2×) + dual encoders	No	4	arXiv:2311.17042
stabilityai/stable-diffusion-3.5-medium	SD 3.5	DiT (MMDiT) + T5	Yes	28	arXiv:2403.03206
stabilityai/stable-diffusion-3.5-large-turbo	SD 3.5 (distilled)	DiT (MMDiT) + T5	Yes	4	arXiv:2403.03206
black-forest-labs/FLUX.1-schnell	FLUX	Flow matching + T5	Yes	4	GitHub
kandinsky-community/kandinsky-2-2-decoder	Kandinsky 2.2	Two-stage prior + UNet	No	30	arXiv:2310.03502
PixArt-alpha/PixArt-XL-2-512x512	PixArt	DiT + T5	No	20	arXiv:2310.00426

Execution modes

Mode	Description
`gpu_only`	Full model loaded to GPU VRAM in fp16
`model_offload`	`enable_model_cpu_offload()`, submodels paged to GPU one at a time
`sequential_offload`	`enable_sequential_cpu_offload()`, individual layers moved to the GPU one at a time
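As a rough sketch of how the three mode labels could map onto a loaded diffusers pipeline: `apply_mode` is a hypothetical helper, not part of diffusers or necessarily of this benchmark's code. It assumes `pipe` was already loaded in fp16 (for example with `torch_dtype=torch.float16` in `from_pretrained`).

```python
def apply_mode(pipe, mode: str):
    """Configure device placement on `pipe` for one execution mode.

    `pipe` is assumed to be a diffusers pipeline object; only the three
    methods called below are required of it.
    """
    if mode == "gpu_only":
        # Whole pipeline resident in VRAM.
        pipe.to("cuda")
    elif mode == "model_offload":
        # Submodels are moved to the GPU one at a time as they are needed.
        pipe.enable_model_cpu_offload()
    elif mode == "sequential_offload":
        # Finer-grained offload: lowest VRAM use, typically the slowest.
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown execution mode: {mode!r}")
    return pipe
```

The offload modes are not combined with `pipe.to("cuda")` here, since the offload hooks handle device placement themselves.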

Raw data: data/gtx1070/benchmark_results.json | data/p100/benchmark_results.json