Quick Start Guide

Basic Usage

Here’s a simple example to get started with EnsembleSet:

import pandas as pd
import ensembleset.dataset as ds

# Create or load your training data
train_df = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature2': [10.0, 20.0, 30.0, 40.0, 50.0],
    'feature3': [100.0, 200.0, 300.0, 400.0, 500.0],
    'label': [0, 1, 0, 1, 0]
})

# Initialize the DataSet
data_ensemble = ds.DataSet(
    label='label',
    train_data=train_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,         # Generate 10 different datasets
    frac_features=0.1,     # Use 10% of features per step
    n_steps=5              # Apply 5 engineering steps per dataset
)

print(f"Datasets saved to: {output_file}")

With Test Data

Include test data to generate aligned training and testing datasets:

import ensembleset.dataset as ds

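# train_df and test_df are pandas DataFrames that both contain the label column;
# replace 'label_column_name' with the name of that column
# Initialize with both training and testing data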
data_ensemble = ds.DataSet(
    label='label_column_name',
    train_data=train_df,
    test_data=test_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.1,
    n_steps=5
)

The feature engineering pipeline applied to the training data is also applied to the testing data. All transformations are fitted on the training data only, which prevents data leakage.
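To make the "fitted on training data only" idea concrete, here is a minimal sketch using plain NumPy standardization. It is not EnsembleSet's internal code: the helper names fit_standardizer and apply_standardizer are hypothetical, and standardization is only a stand-in for whichever transformation a given step applies.

```python
import numpy as np


def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training data only."""
    values = np.asarray(train_values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    # Guard against division by zero for constant columns
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def apply_standardizer(values, mean, std):
    """Apply statistics learned from the training data to any split."""
    return (np.asarray(values, dtype=float) - mean) / std


train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

mean, std = fit_standardizer(train)                  # fitted on train only
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)    # reuses the train statistics
```

Because the test split never contributes to the mean or standard deviation, nothing about the test data can influence the transformed training features.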

With String Features

Specify categorical string features that need encoding:

import ensembleset.dataset as ds

# Assumes train_df and test_df contain the string columns listed below
# Initialize with string features specified
data_ensemble = ds.DataSet(
    label='target',
    train_data=train_df,
    test_data=test_df,
    string_features=['category_col', 'group_col'],
    data_directory='./ensemble_output',
    ensembleset_base_name='my_ensemble'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.2,
    n_steps=3
)

String features will be encoded using either one-hot or ordinal encoding before numerical feature engineering methods are applied.
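For readers unfamiliar with the two encodings, the following sketch shows what each produces for a small column, using pandas directly. It only illustrates the encodings themselves; how EnsembleSet chooses between them, and how it orders or names the resulting columns, is not stated in this guide.

```python
import pandas as pd

df = pd.DataFrame({'category_col': ['red', 'green', 'red', 'blue']})

# Ordinal encoding: each distinct string maps to one integer code
categories = sorted(df['category_col'].unique())
ordinal = pd.Categorical(df['category_col'], categories=categories).codes

# One-hot encoding: one 0/1 indicator column per distinct string
one_hot = pd.get_dummies(df['category_col'], prefix='category_col', dtype=int)

print(ordinal.tolist())
print(one_hot)
```

Ordinal encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of one extra column per category.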

Understanding Parameters

n_datasets: Number of dataset variations to generate. Each will have a unique random sequence of feature engineering methods applied.
frac_features: Fraction of features (0.0 to 1.0) to randomly select for each feature engineering step. For example, 0.1 means 10% of available features. The selection is re-randomized for each step.
n_steps: Number of feature engineering steps to apply in sequence for each dataset. Each step randomly selects a method from the available techniques.
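The interaction of the three parameters can be sketched as a pair of nested loops. This is a conceptual illustration rather than the library's implementation: the feature and method names are placeholders, and the rounding rule (at least one feature per step) is an assumption, since this guide does not say how fractional feature counts are handled.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

features = [f'feature{i}' for i in range(1, 21)]   # 20 hypothetical features
methods = ['method_a', 'method_b', 'method_c']      # placeholder method names

n_datasets, frac_features, n_steps = 2, 0.1, 3

# Assumption: round to the nearest count and select at least one feature
n_selected = max(1, round(frac_features * len(features)))

plans = []
for _ in range(n_datasets):
    steps = []
    for _ in range(n_steps):
        # Each step draws a random method and a fresh random feature subset
        method = str(rng.choice(methods))
        chosen = rng.choice(features, size=n_selected, replace=False)
        steps.append((method, sorted(chosen.tolist())))
    plans.append(steps)

for i, steps in enumerate(plans):
    print(f'dataset_{i}: {steps}')
```

With 20 features and frac_features=0.1, each step in this sketch works on 2 features, and each of the 2 datasets receives its own sequence of 3 steps.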

Output Format

Generated datasets are saved to an HDF5 file with the following structure:

ensembleset.h5
├── train
│   ├── labels          # Training labels array
│   ├── dataset_0       # First training dataset
│   ├── dataset_1       # Second training dataset
│   └── ...
└── test
    ├── labels          # Testing labels array
    ├── dataset_0       # First testing dataset
    ├── dataset_1       # Second testing dataset
    └── ...

Reading Generated Datasets

Load datasets from the HDF5 file using h5py, substituting the path returned by make_datasets() if it differs from the one below:

import h5py
import numpy as np

with h5py.File('ensemble_output/ensembleset.h5', 'r') as f:
    # Load training data for first dataset
    train_labels = np.array(f['train/labels'])
    train_features = np.array(f['train/dataset_0'])

    # Load testing data for first dataset
    test_labels = np.array(f['test/labels'])
    test_features = np.array(f['test/dataset_0'])
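To load every dataset in a split rather than just the first, a small helper along these lines may be useful. The function name load_split is hypothetical, and the sketch assumes the dataset_<index> naming shown under Output Format. It accepts any mapping from names to array-like values, so an in-memory dictionary stands in for the HDF5 group here.

```python
import numpy as np


def load_split(group):
    """Return (labels, datasets) from a group laid out as in Output Format.

    `group` can be any mapping from names to array-like values, such as
    an h5py group obtained with f['train'].
    """
    labels = np.asarray(group['labels'])
    # Sort dataset_0, dataset_1, ... by numeric suffix rather than as strings
    names = sorted(
        (name for name in group.keys() if name.startswith('dataset_')),
        key=lambda name: int(name.split('_')[1]),
    )
    datasets = [np.asarray(group[name]) for name in names]
    return labels, datasets


# In-memory stand-in with the same layout, so the sketch runs without a file
demo_group = {
    'labels': [0, 1, 0],
    'dataset_0': [[1.0], [2.0], [3.0]],
    'dataset_10': [[7.0], [8.0], [9.0]],
    'dataset_2': [[4.0], [5.0], [6.0]],
}
labels, datasets = load_split(demo_group)
```

With a real file, the same call would be load_split(f['train']) or load_split(f['test']) inside the with h5py.File(...) block, so the arrays are read before the file is closed.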

Next Steps

See API Reference for detailed API documentation
See Feature Engineering Catalog for descriptions of all feature engineering methods
See Configuration for configuration options