Quick Start Guide

Basic Usage

Here’s a simple example to get started with EnsembleSet:

import pandas as pd
import ensembleset.dataset as ds

# Create or load your training data
train_df = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature2': [10.0, 20.0, 30.0, 40.0, 50.0],
    'feature3': [100.0, 200.0, 300.0, 400.0, 500.0],
    'label': [0, 1, 0, 1, 0]
})

# Initialize the DataSet
data_ensemble = ds.DataSet(
    label='label',
    train_data=train_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,         # Generate 10 different datasets
    frac_features=0.1,     # Use 10% of features per step
    n_steps=5              # Apply 5 engineering steps per dataset
)

print(f"Datasets saved to: {output_file}")

With Test Data

Include test data to generate aligned training and testing datasets:

import ensembleset.dataset as ds

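# train_df and test_df are assumed to be pandas DataFrames that share the
# same columns, including the label column named below.
# Initialize with both training and testing data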
data_ensemble = ds.DataSet(
    label='label_column_name',
    train_data=train_df,
    test_data=test_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.1,
    n_steps=5
)

The feature engineering pipeline applied to the training data is also applied to the testing data. All transformations are fitted on the training data only, which prevents data leakage.
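To make the "fit on training data only" idea concrete, here is a minimal sketch using plain NumPy. Standardization is used purely as an illustration and is not necessarily one of EnsembleSet's methods; the helper names `fit_standardizer` and `apply_standardizer` are hypothetical and not part of the library.

```python
import numpy as np


def fit_standardizer(train_column):
    """Compute the mean and standard deviation from training values only."""
    mean = float(np.mean(train_column))
    std = float(np.std(train_column))
    if std == 0.0:
        # Guard against constant columns so the later division is safe.
        std = 1.0
    return mean, std


def apply_standardizer(column, mean, std):
    """Scale any column with statistics that were fitted elsewhere."""
    return (np.asarray(column, dtype=float) - mean) / std


train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_values = np.array([6.0, 7.0])

# Fit once, on the training data only.
mean, std = fit_standardizer(train_values)

# Reuse the same fitted statistics for both splits.
train_scaled = apply_standardizer(train_values, mean, std)
test_scaled = apply_standardizer(test_values, mean, std)
```

Because the test values never contribute to `mean` or `std`, nothing about the test set influences how the training features are transformed.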

With String Features

Specify categorical string features that need encoding:

import ensembleset.dataset as ds

# Initialize with string features specified; the names listed in
# string_features are expected to be columns of train_df and test_df
data_ensemble = ds.DataSet(
    label='target',
    train_data=train_df,
    test_data=test_df,
    string_features=['category_col', 'group_col'],
    data_directory='./ensemble_output',
    ensembleset_base_name='my_ensemble'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.2,
    n_steps=3
)

String features will be encoded using either one-hot or ordinal encoding before numerical feature engineering methods are applied.
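For readers unfamiliar with the two encodings, the following sketch shows what each one produces for a small string column, using pandas directly. It does not show how EnsembleSet performs the encoding internally; the sorted-category ordering used for the ordinal codes is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'category_col': ['red', 'green', 'red', 'blue']})

# Ordinal encoding: each distinct string maps to one integer code.
# Assumption for this sketch: codes follow alphabetical order.
categories = sorted(df['category_col'].unique())
code_for = {name: code for code, name in enumerate(categories)}
ordinal = df['category_col'].map(code_for)

# One-hot encoding: one 0/1 indicator column per distinct string.
one_hot = pd.get_dummies(df['category_col'], prefix='category_col', dtype=int)

print(ordinal.tolist())
print(one_hot)
```

Ordinal encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that ordering at the cost of one extra column per category.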

Understanding Parameters

n_datasets

Number of dataset variations to generate. Each will have a unique random sequence of feature engineering methods applied.

frac_features

Fraction of features (0.0 to 1.0) to randomly select for each feature engineering step. For example, 0.1 means 10% of available features. The selection is re-randomized for each step.

n_steps

Number of feature engineering steps to apply in sequence for each dataset. Each step randomly selects a method from the available techniques.
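The three parameters interact as nested random choices, which the sketch below tries to illustrate. It is not EnsembleSet's implementation: the function `sketch_plan`, the method names, and the rule for turning `frac_features` into a feature count (round, but keep at least one feature) are all assumptions made for illustration.

```python
import random


def sketch_plan(feature_names, n_datasets, frac_features, n_steps, methods, seed=0):
    """Build a random plan: for each dataset, a list of (method, features) steps."""
    rng = random.Random(seed)
    # Assumption: round the fraction to a count and never select zero features.
    n_selected = max(1, round(frac_features * len(feature_names)))
    plan = []
    for _ in range(n_datasets):
        steps = []
        for _ in range(n_steps):
            method = rng.choice(methods)                      # new method each step
            features = rng.sample(feature_names, n_selected)  # new subset each step
            steps.append((method, features))
        plan.append(steps)
    return plan


# Hypothetical method names, standing in for the library's real techniques.
HYPOTHETICAL_METHODS = ['method_a', 'method_b', 'method_c']

plan = sketch_plan(
    ['feature1', 'feature2', 'feature3'],
    n_datasets=2,
    frac_features=0.1,
    n_steps=5,
    methods=HYPOTHETICAL_METHODS,
)

for index, steps in enumerate(plan):
    print(f"dataset_{index}: {steps}")
```

With only three features, a fraction of 0.1 rounds to zero, so this sketch falls back to one feature per step; how the library handles that case should be checked against its own documentation.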

Output Format

Generated datasets are saved in HDF5 format with the following structure:

ensembleset.h5
├── train
│   ├── labels          # Training labels array
│   ├── dataset_0       # First training dataset
│   ├── dataset_1       # Second training dataset
│   └── ...
└── test
    ├── labels          # Testing labels array
    ├── dataset_0       # First testing dataset
    ├── dataset_1       # Second testing dataset
    └── ...

Reading Generated Datasets

Load datasets from the HDF5 file using h5py:

import h5py
import numpy as np

with h5py.File('ensemble_output/ensembleset.h5', 'r') as f:
    # Load training data for first dataset
    train_labels = np.array(f['train/labels'])
    train_features = np.array(f['train/dataset_0'])

    # Load testing data for first dataset
    test_labels = np.array(f['test/labels'])
    test_features = np.array(f['test/dataset_0'])
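When working with the whole ensemble rather than a single dataset, it can help to load every `dataset_N` entry in one pass. The helper below is a sketch that assumes the file layout shown under Output Format; `load_split` is a hypothetical name, not a function provided by EnsembleSet.

```python
import h5py
import numpy as np


def load_split(path, split):
    """Return (labels, {name: array}) for the 'train' or 'test' group."""
    with h5py.File(path, 'r') as f:
        group = f[split]
        # Read the labels into memory before the file is closed.
        labels = np.asarray(group['labels'][:])
        # Iterating over a group yields the names of its members.
        datasets = {
            name: np.asarray(group[name][:])
            for name in group
            if name.startswith('dataset_')
        }
    return labels, datasets
```

For example, `load_split('ensemble_output/ensembleset.h5', 'train')` would return the training labels together with a dictionary keyed by `dataset_0`, `dataset_1`, and so on, each holding one feature array.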

Next Steps