Data loading
Functions
- image_classification_tools.pytorch.data.load_dataset(data_source, transform, train=True, download=False, **dataset_kwargs)[source]
Load a single dataset from a directory or PyTorch dataset class.
This function provides a flexible interface for loading image classification datasets. It supports both PyTorch built-in datasets (CIFAR-10, CIFAR-100, MNIST, etc.) and custom datasets stored in directories following the ImageFolder structure.
- Parameters:
data_source (Path | type) – Either a Path to a directory containing a train/ or test/ subdirectory, or a PyTorch dataset class (e.g., datasets.CIFAR10)
transform (Compose) – Transforms to apply to the data
train (bool) – If True, load training data. If False, load test data (default: True)
download (bool) – Whether to download the dataset if using a PyTorch dataset class. Ignored for directory-based datasets.
**dataset_kwargs – Additional keyword arguments passed to the dataset class (e.g., root='data/pytorch/cifar10')
- Return type:
Dataset
- Returns:
Dataset object
Examples
# Load CIFAR-10 training data
train_dataset = load_dataset(
    data_source=datasets.CIFAR10,
    transform=transform,
    train=True,
    root='data/cifar10'
)

# Load from an ImageFolder directory
train_dataset = load_dataset(
    data_source=Path('data/my_dataset'),
    transform=transform,
    train=True
)
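A companion call for the test split of a directory-based dataset, shown as a sketch; it assumes the directory also contains a test/ subdirectory laid out like train/ (ImageFolder-style class subfolders):

# Expected layout (illustrative): data/my_dataset/{train,test}/<class folder>/<image files>
test_dataset = load_dataset(
    data_source=Path('data/my_dataset'),
    transform=transform,
    train=False
)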
- image_classification_tools.pytorch.data.prepare_splits(train_dataset, test_dataset=None, val_size=10000, test_size=None)[source]
Split training dataset into train/val(/test) splits.
The splitting behavior depends on whether a separate test dataset is provided:
- If test_dataset is provided: split train_dataset into train/val only (2-way split)
- If test_dataset is None: split train_dataset into train/val/test (3-way split)
- Parameters:
train_dataset (Dataset) – Training dataset to split
test_dataset (Dataset | None) – Test dataset. If None, the test set will be split from train_dataset.
val_size (int) – Number of images to use for validation
test_size (int | None) – Number of images to reserve for testing; only used when test_dataset is None. If test_dataset is None and test_size is also None, a ValueError is raised.
- Return type:
tuple[Dataset, Dataset, Dataset]
- Returns:
Tuple of (train_dataset, val_dataset, test_dataset)
Examples
# 2-way split: pass a separate test set
train_ds, val_ds, test_ds = prepare_splits(
    train_dataset=my_train_data,
    test_dataset=my_test_data,  # Use this for testing
    val_size=10000              # 10,000 images for validation
)

# 3-way split: no separate test set
train_ds, val_ds, test_ds = prepare_splits(
    train_dataset=my_full_data,
    test_dataset=None,   # Will split test from train_dataset
    val_size=10000,      # 10,000 for validation
    test_size=5000       # 5,000 for testing
)
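Because val_size and test_size are absolute image counts, the training split keeps whatever remains. A quick sanity check of the resulting sizes, assuming the returned splits support len() (as torch.utils.data.Subset objects do):

train_ds, val_ds, test_ds = prepare_splits(
    train_dataset=my_full_data,  # e.g. 50,000 images
    test_dataset=None,
    val_size=10000,
    test_size=5000
)
print(len(train_ds), len(val_ds), len(test_ds))  # e.g. 35000 10000 5000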
- image_classification_tools.pytorch.data.create_dataloaders(train_dataset, val_dataset, test_dataset, batch_size, shuffle_train=True, num_workers=0, preload_to_memory=True, device=None, **kwargs)[source]
Create DataLoaders from prepared datasets with optional memory preloading.
This function provides three memory management strategies:
1. Lazy loading (preload_to_memory=False): data stays on disk and is loaded per batch
2. CPU preloading (preload_to_memory=True, device=cpu): entire dataset in RAM
3. GPU preloading (preload_to_memory=True, device=cuda): entire dataset in VRAM
- Parameters:
train_dataset (Dataset) – Prepared training dataset
val_dataset (Dataset) – Prepared validation dataset
test_dataset (Dataset) – Prepared test dataset
batch_size (int) – Batch size for all DataLoaders
shuffle_train (bool) – Whether to shuffle training data (default: True)
num_workers (int) – Number of subprocesses for data loading (default: 0 for single-process loading). Note: num_workers is ignored when preload_to_memory=True.
preload_to_memory (bool) – If True, convert datasets to tensors and load them into memory. If False, keep them as lazy-loading Dataset objects (default: True).
device (device | None) – Device to preload tensors onto. Only used if preload_to_memory=True; if None with preload_to_memory=True, defaults to CPU. Common values: torch.device('cpu'), torch.device('cuda')
**kwargs – Additional keyword arguments passed to DataLoader (e.g., pin_memory=True, persistent_workers=True)
- Return type:
tuple[DataLoader, DataLoader, DataLoader]
- Returns:
Tuple of (train_loader, val_loader, test_loader)
Examples
# Strategy 1: Lazy loading (large datasets)
train_loader, val_loader, test_loader = create_dataloaders(
    train_ds, val_ds, test_ds,
    batch_size=128,
    preload_to_memory=False,
    num_workers=4,
    pin_memory=True
)

# Strategy 2: CPU preloading (medium datasets)
train_loader, val_loader, test_loader = create_dataloaders(
    train_ds, val_ds, test_ds,
    batch_size=128,
    preload_to_memory=True,
    device=torch.device('cpu')
)

# Strategy 3: GPU preloading (small datasets, fastest training)
train_loader, val_loader, test_loader = create_dataloaders(
    train_ds, val_ds, test_ds,
    batch_size=128,
    preload_to_memory=True,
    device=torch.device('cuda')
)
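One way to choose between the three strategies is by estimated dataset size; the helper and thresholds below are illustrative assumptions, not part of this module:

import torch

def pick_loader_kwargs(dataset_bytes):
    # Small datasets: preload to GPU for the fastest epochs
    if torch.cuda.is_available() and dataset_bytes < 2 * 1024**3:
        return {'preload_to_memory': True, 'device': torch.device('cuda')}
    # Medium datasets: preload to CPU RAM
    if dataset_bytes < 16 * 1024**3:
        return {'preload_to_memory': True, 'device': torch.device('cpu')}
    # Large datasets: lazy loading with worker processes
    return {'preload_to_memory': False, 'num_workers': 4, 'pin_memory': True}

train_loader, val_loader, test_loader = create_dataloaders(
    train_ds, val_ds, test_ds,
    batch_size=128,
    **pick_loader_kwargs(180 * 1024**2)  # ~180 MiB, roughly CIFAR-10 sized
)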
- image_classification_tools.pytorch.data.generate_augmented_data(train_dataset, augmentation_transforms, augmentations_per_image, save_dir, class_names=None, chunk_size=5000, force_reaugment=False)[source]
Generate augmented training data and save as ImageFolder-compatible directory structure.
This function applies augmentation transforms to create multiple augmented versions of each training image and saves them to disk in ImageFolder format. Images are processed in chunks to avoid memory issues with large datasets.
- Parameters:
train_dataset (Dataset) – PyTorch Dataset containing training images
augmentation_transforms (Sequential) – nn.Sequential containing augmentation transforms to apply
augmentations_per_image (int) – Number of augmented versions to create per image
save_dir (str | Path) – Directory path to save augmented images in ImageFolder format (will create class_0/, class_1/, etc. subdirectories)
class_names (list[str] | None) – Optional list of class names. If None, uses numeric class indices.
chunk_size (int) – Number of images to process per chunk (default: 5000)
force_reaugment (bool) – If True, regenerate even if saved data exists
- Return type:
None
- Returns:
None (saves images to disk)
Example
>>> generate_augmented_data(
...     train_dataset=train_dataset,
...     augmentation_transforms=augmentation_transforms,
...     augmentations_per_image=3,
...     save_dir='data/cifar10_augmented',
...     class_names=['airplane', 'automobile', ...],
...     chunk_size=5000
... )
>>> # Then load with the existing pipeline:
>>> aug_dataset = load_dataset(
...     data_source=Path('data/cifar10_augmented'),
...     transform=eval_transform
... )
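The augmentation_transforms argument is an nn.Sequential of tensor-compatible transforms; a minimal sketch of such a pipeline, assuming torchvision transforms that operate on image tensors (the specific transforms and parameters are illustrative, not prescribed by this module):

import torch.nn as nn
from torchvision import transforms

# Tensor-compatible torchvision transforms can be chained in nn.Sequential
augmentation_transforms = nn.Sequential(
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2)
)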
Overview
The data module provides a flexible three-step data loading workflow:
Load datasets: Load individual train or test datasets from PyTorch dataset classes or directories
Prepare splits: Split data into train/val(/test) with configurable sizes
Create dataloaders: Create DataLoaders with optional memory preloading strategies
Key features:
Support for torchvision datasets (CIFAR-10, MNIST, etc.) and custom ImageFolder datasets
One transform argument per dataset load; training and test data can share the same transform or use different ones (as in the augmentation example below)
Flexible splitting: 2-way (train/val) or 3-way (train/val/test) with integer sizes
Three memory strategies: lazy loading, CPU preloading, or GPU preloading
Data augmentation with chunking for large datasets
Configurable batch sizes and workers
Example usage
Basic workflow (CIFAR-10 with GPU preloading):
from pathlib import Path
import torch
from torchvision import datasets, transforms
from image_classification_tools.pytorch.data import (
load_dataset, prepare_splits, create_dataloaders
)
# Define transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Step 1: Load datasets
train_dataset = load_dataset(
data_source=datasets.CIFAR10,
transform=transform,
train=True,
download=True,
root=Path('./data/cifar10')
)
test_dataset = load_dataset(
data_source=datasets.CIFAR10,
transform=transform,
train=False,
download=True,
root=Path('./data/cifar10')
)
# Step 2: Prepare splits (2-way: train/val from train_dataset)
train_dataset, val_dataset, test_dataset = prepare_splits(
train_dataset=train_dataset,
test_dataset=test_dataset,
val_size=10000 # 10,000 images for validation
)
# Step 3: Create dataloaders with GPU preloading
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_loader, val_loader, test_loader = create_dataloaders(
train_dataset, val_dataset, test_dataset,
batch_size=128,
preload_to_memory=True,
device=device
)
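To confirm the preloading behaved as expected, you can inspect a single batch; this assumes the loaders yield standard (images, labels) batches:

images, labels = next(iter(train_loader))
print(images.shape, labels.shape, images.device)  # device should be 'cuda' when preloaded to GPU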
With data augmentation (lazy loading):
# Define transform with augmentation for training
train_transform = transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Define transform without augmentation for evaluation
eval_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Load training data with augmentation
train_dataset = load_dataset(
data_source=datasets.CIFAR10,
transform=train_transform,
train=True,
root=Path('./data/cifar10')
)
# Load test data without augmentation
test_dataset = load_dataset(
data_source=datasets.CIFAR10,
transform=eval_transform,
train=False,
root=Path('./data/cifar10')
)
# Prepare splits
train_dataset, val_dataset, test_dataset = prepare_splits(
train_dataset=train_dataset,
test_dataset=test_dataset,
val_size=10000
)
# Create dataloaders with lazy loading (no preloading)
train_loader, val_loader, test_loader = create_dataloaders(
train_dataset, val_dataset, test_dataset,
batch_size=128,
preload_to_memory=False, # Lazy loading for augmentation
num_workers=4,
pin_memory=True
)
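Any extra keyword arguments are forwarded to DataLoader, so standard options such as persistent_workers (available in PyTorch 1.7+) can be added; the call below is a sketch of that pass-through:

train_loader, val_loader, test_loader = create_dataloaders(
    train_dataset, val_dataset, test_dataset,
    batch_size=128,
    preload_to_memory=False,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True  # forwarded via **kwargs to DataLoader
)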
3-way split (no separate test set):
# Define transform
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
# Load only training data (no test set available)
train_dataset = load_dataset(
data_source=Path('./my_dataset'),
transform=transform,
train=True
)
# 3-way split: train/val/test all from train_dataset
train_dataset, val_dataset, test_dataset = prepare_splits(
train_dataset=train_dataset,
test_dataset=None, # Will split test from train_dataset
val_size=5000, # 5,000 images for validation
test_size=5000 # 5,000 images for testing
)
# Remaining images will be used for training
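To finish this workflow, the three splits feed into create_dataloaders exactly as before; CPU preloading below is an arbitrary middle-ground choice, not a requirement:

train_loader, val_loader, test_loader = create_dataloaders(
    train_dataset, val_dataset, test_dataset,
    batch_size=128,
    preload_to_memory=True,
    device=torch.device('cpu')
)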