Quick Start Guide
Basic Usage
Here’s a simple example to get started with EnsembleSet:
import pandas as pd
import ensembleset.dataset as ds

# Create or load your training data
train_df = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature2': [10.0, 20.0, 30.0, 40.0, 50.0],
    'feature3': [100.0, 200.0, 300.0, 400.0, 500.0],
    'label': [0, 1, 0, 1, 0]
})

# Initialize the DataSet
data_ensemble = ds.DataSet(
    label='label',
    train_data=train_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,       # Generate 10 different datasets
    frac_features=0.1,   # Use 10% of features per step
    n_steps=5            # Apply 5 engineering steps per dataset
)

print(f"Datasets saved to: {output_file}")
With Test Data
Include test data to generate aligned training and testing datasets:
import ensembleset.dataset as ds
# Initialize with both training and testing data.
# train_df and test_df are existing pandas DataFrames that both contain the label column.
data_ensemble = ds.DataSet(
    label='label_column_name',
    train_data=train_df,
    test_data=test_df,
    data_directory='./ensemble_output'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.1,
    n_steps=5
)
The same feature engineering pipeline applied to training data will be applied to testing data, with all transformations fitted on training data only to prevent data leakage.
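The fit-on-train, apply-to-both pattern can be illustrated with a small standalone sketch. It does not use EnsembleSet's internals: the helpers `fit_standardizer` and `apply_standardizer` are hypothetical names chosen for illustration, and standardization merely stands in for whatever transformation a given engineering step applies.

```python
import numpy as np

def fit_standardizer(train):
    """Learn per-column mean and standard deviation from training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(data, params):
    """Apply previously fitted parameters to any split (train or test)."""
    mean, std = params
    return (data - mean) / std

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Parameters come from the training split alone, so no test statistics leak in.
params = fit_standardizer(train)
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)
```

Because the test rows never contribute to `params`, the test split is transformed exactly as unseen data would be at prediction time.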
With String Features
Specify categorical string features that need encoding:
import ensembleset.dataset as ds
# Initialize with string features specified
data_ensemble = ds.DataSet(
    label='target',
    train_data=train_df,
    test_data=test_df,
    string_features=['category_col', 'group_col'],
    data_directory='./ensemble_output',
    ensembleset_base_name='my_ensemble'
)

# Generate ensemble datasets
output_file = data_ensemble.make_datasets(
    n_datasets=10,
    frac_features=0.2,
    n_steps=3
)
String features will be encoded using either one-hot or ordinal encoding before numerical feature engineering methods are applied.
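To make the difference between the two encodings concrete, here is a minimal pandas sketch. It is an illustration only, not EnsembleSet's own encoder: the toy data, the variable names, and the handling of unseen categories (`-1` for ordinal, an all-zero row for one-hot) are assumptions made for this example.

```python
import pandas as pd

train = pd.DataFrame({"category_col": ["a", "b", "a", "c"]})
test = pd.DataFrame({"category_col": ["b", "c", "d"]})  # "d" never appears in training

# Learn the category vocabulary from the training data only.
categories = sorted(train["category_col"].unique())
position = {category: i for i, category in enumerate(categories)}

def ordinal_encode(series):
    """Map each category to its integer position; unseen values become -1."""
    return series.map(position).fillna(-1).astype(int)

def one_hot_encode(series):
    """Build one 0/1 indicator column per training category."""
    return pd.DataFrame({c: (series == c).astype(int) for c in categories})

train_ordinal = ordinal_encode(train["category_col"])
test_ordinal = ordinal_encode(test["category_col"])
train_onehot = one_hot_encode(train["category_col"])
test_onehot = one_hot_encode(test["category_col"])
```

Ordinal encoding keeps a single column but imposes an artificial order on the categories, while one-hot encoding avoids that order at the cost of one column per category.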
Understanding Parameters
n_datasets: Number of dataset variations to generate. Each will have a unique random sequence of feature engineering methods applied.
frac_features: Fraction of features (0.0 to 1.0) to randomly select for each feature engineering step. For example, 0.1 means 10% of available features. The selection is re-randomized for each step.
n_steps: Number of feature engineering steps to apply in sequence for each dataset. Each step randomly selects a method from the available techniques.
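How the three parameters interact can be sketched with the standard library alone. This is a conceptual model, not the library's code: `plan_dataset` is a hypothetical helper, the method names are placeholders, and the rounding rule (with a minimum of one feature per step) is an assumption, since the text above does not say how a fractional feature count is resolved.

```python
import random

def plan_dataset(features, methods, frac_features, n_steps, rng):
    """Return a list of (method, selected_features) pairs, one per step."""
    # Assumption: round to the nearest whole feature, but never select fewer than one.
    n_selected = max(1, round(frac_features * len(features)))
    plan = []
    for _ in range(n_steps):
        method = rng.choice(methods)                 # a random method for each step
        selected = rng.sample(features, n_selected)  # features are re-drawn for each step
        plan.append((method, selected))
    return plan

features = [f"feature{i}" for i in range(1, 21)]  # 20 toy features
methods = ["method_a", "method_b", "method_c"]    # placeholder method names
rng = random.Random(0)

# n_datasets=10, frac_features=0.1, n_steps=5
plans = [plan_dataset(features, methods, 0.1, 5, rng) for _ in range(10)]
```

With 20 features and `frac_features=0.1`, each of the 5 steps in each of the 10 plans operates on 2 randomly chosen features.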
Output Format
Generated datasets are saved in a single HDF5 file with the following structure:
ensembleset.h5
├── train
│   ├── labels       # Training labels array
│   ├── dataset_0    # First training dataset
│   ├── dataset_1    # Second training dataset
│   └── ...
└── test
    ├── labels       # Testing labels array
    ├── dataset_0    # First testing dataset
    ├── dataset_1    # Second testing dataset
    └── ...
Reading Generated Datasets
Load datasets from the HDF5 file using h5py:
import h5py
import numpy as np
with h5py.File('ensemble_output/ensembleset.h5', 'r') as f:
    # Load training data for first dataset
    train_labels = np.array(f['train/labels'])
    train_features = np.array(f['train/dataset_0'])

    # Load testing data for first dataset
    test_labels = np.array(f['test/labels'])
    test_features = np.array(f['test/dataset_0'])
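To load every generated dataset rather than only the first, one option is to enumerate the `dataset_<i>` keys in a group. The sketch below assumes the file layout shown under Output Format; `dataset_names` and `load_all_datasets` are hypothetical helper names, not part of EnsembleSet.

```python
def dataset_names(group):
    """Return the 'dataset_<i>' keys of a group, sorted by their numeric index."""
    names = [name for name in group.keys() if name.startswith("dataset_")]
    # Sort numerically so that dataset_10 comes after dataset_2.
    return sorted(names, key=lambda name: int(name.split("_")[1]))

def load_all_datasets(path, split="train"):
    """Load the labels and every dataset of one split into memory."""
    import h5py  # imported here so dataset_names works without h5py installed
    import numpy as np

    with h5py.File(path, "r") as f:
        group = f[split]
        labels = np.array(group["labels"])
        features = {name: np.array(group[name]) for name in dataset_names(group)}
    return labels, features
```

For example, `labels, features = load_all_datasets('ensemble_output/ensembleset.h5')` would return the training labels and a dictionary mapping each dataset name to its feature array.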
Next Steps
See API Reference for detailed API documentation
See Feature Engineering Catalog for descriptions of all feature engineering methods
See Configuration for configuration options