API Reference

This page provides detailed API documentation for EnsembleSet.

DataSet Class

class ensembleset.dataset.DataSet(label, train_data, test_data=None, string_features=None, data_directory='ensembleset_data', ensembleset_base_name='ensembleset')[source]

Bases: object

Dataset ensemble generator using randomized feature engineering.

This class generates multiple dataset variations by applying randomized sequences of feature engineering methods to randomized subsets of input features. Each dataset in the ensemble undergoes a unique transformation pipeline, making it suitable for training diverse ensemble models.

The class handles both training and testing data with minimal data leakage by fitting transformations on training data and applying them to both training and testing sets. Generated datasets are saved to HDF5 format for efficient storage and retrieval.

Parameters:
  • label (str) – Name of the label column in the training and testing DataFrames. This column will be extracted and stored separately from the features.

  • train_data (pd.DataFrame) – Training dataset containing both features and the label column. Must include the column specified by the label parameter.

  • test_data (pd.DataFrame, optional) – Testing dataset containing both features and the label column. If provided, transformations fitted on training data will be applied to this data. Default is None.

  • string_features (list of str, optional) – List of column names containing string/categorical features that require encoding (one-hot or ordinal) before numerical feature engineering. Default is None.

  • data_directory (str, optional) – Path to directory where generated ensemble datasets will be saved in HDF5 format. Directory will be created if it doesn’t exist. Default is ‘ensembleset_data’.

  • ensembleset_base_name (str, optional) – Base name for the output HDF5 file. The file will be named ‘{ensembleset_base_name}.h5’. Default is ‘ensembleset’.

label

Name of the label column.

Type:

str

train_data

Training features (without label column).

Type:

pd.DataFrame

test_data

Testing features (without label column).

Type:

pd.DataFrame or None

train_labels

Training labels extracted from train_data.

Type:

np.ndarray

test_labels

Testing labels extracted from test_data.

Type:

np.ndarray or None

string_features

List of string feature column names.

Type:

list of str or None

data_directory

Path to output directory.

Type:

str

ensembleset_base_name

Base name for output files.

Type:

str

string_encodings

Dictionary of available string encoding methods.

Type:

dict

numerical_methods

Dictionary of available numerical feature engineering methods.

Type:

dict

Examples

>>> import pandas as pd
>>> import ensembleset.dataset as ds
>>>
>>> # Create sample training and testing data
>>> train_df = pd.DataFrame({
...     'num_feature1': [1.0, 2.0, 3.0, 4.0],
...     'num_feature2': [10.0, 20.0, 30.0, 40.0],
...     'cat_feature': ['A', 'B', 'A', 'B'],
...     'target': [0, 1, 0, 1]
... })
>>> test_df = pd.DataFrame({
...     'num_feature1': [5.0, 6.0],
...     'num_feature2': [50.0, 60.0],
...     'cat_feature': ['A', 'B'],
...     'target': [1, 0]
... })
>>>
>>> # Initialize DataSet with string feature specification
>>> ensemble = ds.DataSet(
...     label='target',
...     train_data=train_df,
...     test_data=test_df,
...     string_features=['cat_feature'],
...     data_directory='./ensemble_output',
...     ensembleset_base_name='my_ensemble'
... )
>>>
>>> # Generate 5 datasets with 3 feature engineering steps each,
>>> # using 20% of features per step
>>> output_file = ensemble.make_datasets(
...     n_datasets=5,
...     frac_features=0.2,
...     n_steps=3
... )

Notes

  • Label columns are automatically removed from feature DataFrames and stored separately as numpy arrays.

  • String features are encoded before numerical transformations are applied.

  • All transformations are fitted on training data only to prevent data leakage, even when test data is provided.

  • Generated datasets are saved in HDF5 format with the structure: dataset.h5/train/labels, dataset.h5/train/dataset_N, dataset.h5/test/labels, dataset.h5/test/dataset_N

  • Uses multiprocessing for parallel dataset generation.
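The HDF5 layout listed in the notes above can be read back with a general-purpose HDF5 reader. The following is a minimal sketch using h5py. It assumes the groups hold plain array datasets, which this page does not confirm, and it first writes a small stand-in file with the same layout so the snippet runs on its own. The file name demo_ensemble.h5 and the array contents are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file that mimics the documented layout.
# (Assumption: the real output stores plain HDF5 array datasets.)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_ensemble.h5')

with h5py.File(demo_path, 'w') as hdf:
    hdf.create_dataset('train/labels', data=np.array([0, 1, 0, 1]))
    hdf.create_dataset('test/labels', data=np.array([1, 0]))
    for n in range(2):
        hdf.create_dataset(f'train/dataset_{n}', data=np.random.rand(4, 3))
        hdf.create_dataset(f'test/dataset_{n}', data=np.random.rand(2, 3))

# Read the training labels and every training dataset, keyed by name.
with h5py.File(demo_path, 'r') as hdf:
    train_labels = hdf['train/labels'][:]
    train_sets = {
        name: hdf[f'train/{name}'][:]
        for name in hdf['train']
        if name.startswith('dataset_')
    }
```

Each entry in train_sets could then be paired with train_labels to fit one member of an ensemble.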

See also

make_datasets

Generate ensemble datasets with specified parameters

ensembleset.feature_engineerings

Configuration of available methods

make_datasets(n_datasets, frac_features, n_steps)[source]

Generate ensemble datasets using randomized feature engineering pipelines.

Creates multiple dataset variations by applying randomized sequences of feature engineering methods to randomly selected subsets of features. Each dataset undergoes a unique transformation pipeline, with feature selection re-randomized at each engineering step.

The method uses multiprocessing to generate datasets in parallel, with the number of worker processes equal to half the available CPU cores. All generated datasets are saved to an HDF5 file in the specified data directory.

Parameters:
  • n_datasets (int) – Number of dataset variations to generate. Each dataset will have a unique sequence of feature engineering transformations applied. Must be a positive integer.

  • frac_features (float) – Fraction of features to randomly select for each feature engineering step. Must be between 0 and 1. For example, 0.1 means 10% of available features are selected at each step. The selection is re-randomized for each step in the pipeline.

  • n_steps (int) – Number of feature engineering steps to apply in sequence for each dataset. Each step randomly selects a method from the available feature engineering techniques. Must be a positive integer.

Returns:

Path to the generated HDF5 file containing all ensemble datasets. The file structure is:

  • train/labels: Training labels (np.ndarray)

  • train/dataset_0, train/dataset_1, …: Training feature sets

  • test/labels: Testing labels (np.ndarray, if test_data provided)

  • test/dataset_0, test/dataset_1, …: Testing feature sets

Return type:

str

Examples

>>> import pandas as pd
>>> import ensembleset.dataset as ds
>>>
>>> # Create sample data
>>> train_df = pd.DataFrame({
...     'feature1': [1, 2, 3, 4, 5],
...     'feature2': [10, 20, 30, 40, 50],
...     'feature3': [100, 200, 300, 400, 500],
...     'label': [0, 1, 0, 1, 0]
... })
>>>
>>> # Initialize ensemble
>>> ensemble = ds.DataSet(
...     label='label',
...     train_data=train_df,
...     data_directory='./output'
... )
>>>
>>> # Generate 10 datasets, using 20% of features per step,
>>> # with 5 engineering steps each
>>> output_file = ensemble.make_datasets(
...     n_datasets=10,
...     frac_features=0.2,
...     n_steps=5
... )
>>> print(f"Datasets saved to: {output_file}")
>>> # With test data included
>>> test_df = pd.DataFrame({
...     'feature1': [6, 7],
...     'feature2': [60, 70],
...     'feature3': [600, 700],
...     'label': [1, 0]
... })
>>>
>>> ensemble_with_test = ds.DataSet(
...     label='label',
...     train_data=train_df,
...     test_data=test_df
... )
>>>
>>> output_file = ensemble_with_test.make_datasets(
...     n_datasets=5,
...     frac_features=0.3,
...     n_steps=3
... )

Notes

  • Each dataset has a unique random sequence of feature engineering methods.

  • Feature selection is re-randomized after each engineering step.

  • At least 1 feature is always selected, even if frac_features * n_features < 1.

  • String features are encoded first (if specified), then numerical methods are applied.

  • All transformations are fitted on training data only to prevent leakage.

  • When test_data is provided, fitted transformations are applied to both training and testing data.

  • Uses multiprocessing with worker count = cpu_count() // 2.

  • Progress and debugging information is logged during execution.
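The selection and worker-count rules in the notes above can be worked through with a few lines of arithmetic. This is a sketch only: the helper names features_per_step and worker_count are hypothetical, truncating with int() is an assumption (the notes state only the lower bound of 1), and the floor of one worker is added so the sketch also runs on a single-core machine.

```python
import os
import random


def features_per_step(n_features, frac_features):
    """Number of features drawn at one step, never fewer than 1.

    Truncation with int() is an assumption; the notes only guarantee
    that at least one feature is selected.
    """
    return max(1, int(frac_features * n_features))


def worker_count():
    """Half the available cores, mirroring cpu_count() // 2.

    The minimum of 1 is an assumption added for single-core machines.
    """
    return max(1, (os.cpu_count() or 1) // 2)


feature_names = ['feature1', 'feature2', 'feature3']
rng = random.Random(315)

# 0.2 * 3 = 0.6, which truncates to 0, so the minimum of 1 applies.
n_selected = features_per_step(len(feature_names), 0.2)

# A fresh random subset is drawn for every step of the pipeline.
for step in range(3):
    chosen = rng.sample(feature_names, n_selected)
    print(f'step {step}: {chosen}')
```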

See also

ensembleset.feature_methods

Individual feature engineering functions

ensembleset.preprocessing_methods

Data preprocessing utilities

Feature Engineering Methods

Collection of functions to run feature engineering operations.

This module provides feature engineering functions for generating ensemble datasets. Functions include polynomial features, splines, logarithmic and exponential transformations, ratio and arithmetic operations, Gaussian KDE smoothing, and binning. String features can be encoded using one-hot or ordinal encoding.

All functions follow a consistent interface accepting training and testing DataFrames, feature lists, and keyword arguments, returning the transformed DataFrames with minimal data leakage.
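To make the shared interface concrete, here is a sketch of a function with the same shape: it takes training and testing DataFrames, a feature list, and a dictionary of keyword arguments, and returns both DataFrames. The function square_features and its 'suffix' keyword are hypothetical and are not part of ensembleset.

```python
import pandas as pd


def square_features(train_df, test_df, features, kwargs):
    """Hypothetical method: adds squared copies of the selected features."""
    suffix = kwargs.get('suffix', '_squared')  # hypothetical keyword

    # Work on copies so the caller's DataFrames are left untouched.
    train_df = train_df.copy()
    test_df = None if test_df is None else test_df.copy()

    for feature in features:
        train_df[feature + suffix] = train_df[feature] ** 2
        if test_df is not None:
            test_df[feature + suffix] = test_df[feature] ** 2

    return train_df, test_df


train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
new_train, new_test = square_features(train_df, None, ['a'], {})
```

Passing None for test_df yields None for the second return value, matching the behavior documented for the encoding functions.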

See also

ensembleset.preprocessing_methods

Data preprocessing utilities

ensembleset.feature_engineerings

Configuration dictionaries for methods

ensembleset.feature_methods.onehot_encoding(train_df, test_df, features, kwargs)[source]

Apply one-hot encoding to categorical string features.

Parameters:
  • train_df (pd.DataFrame) – Training data containing features to encode.

  • test_df (pd.DataFrame or None) – Testing data to apply fitted encoder to.

  • features (list of str or None) – Names of string feature columns to encode.

  • kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.OneHotEncoder.

Returns:

  • train_df (pd.DataFrame) – Training data with one-hot encoded features replacing originals.

  • test_df (pd.DataFrame or None) – Testing data with one-hot encoded features, or None if input was None.

Return type:

Tuple[DataFrame, DataFrame]

See also

ordinal_encoding

Alternative ordinal encoding for string features

ensembleset.feature_methods.ordinal_encoding(train_df, test_df, features, kwargs)[source]

Apply ordinal encoding to categorical string features.

Parameters:
  • train_df (pd.DataFrame) – Training data containing features to encode.

  • test_df (pd.DataFrame or None) – Testing data to apply fitted encoder to.

  • features (list of str or None) – Names of string feature columns to encode.

  • kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.OrdinalEncoder.

Returns:

  • train_df (pd.DataFrame) – Training data with ordinal encoded features in place.

  • test_df (pd.DataFrame or None) – Testing data with ordinal encoded features, or None if input was None.

Return type:

Tuple[DataFrame, DataFrame]

See also

onehot_encoding

Alternative one-hot encoding for string features
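As an illustration of in-place ordinal encoding, the sketch below uses sklearn.preprocessing.OrdinalEncoder directly rather than calling the library function. The encoder is fitted on training data and reused for the test data. The sample data and kwargs values are assumptions chosen so that an unseen test category has a defined code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_df = pd.DataFrame({'cat_feature': ['A', 'B', 'A', 'B']})
test_df = pd.DataFrame({'cat_feature': ['A', 'C']})
features = ['cat_feature']

# Illustrative options: map categories unseen during fitting to -1.
kwargs = {'handle_unknown': 'use_encoded_value', 'unknown_value': -1}

encoder = OrdinalEncoder(**kwargs)
train_codes = encoder.fit_transform(train_df[features])
test_codes = encoder.transform(test_df[features])

# Overwrite each string column with its numeric codes.
for i, feature in enumerate(features):
    train_df[feature] = train_codes[:, i]
    test_df[feature] = test_codes[:, i]
```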

ensembleset.feature_methods.poly_features(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]

Generate polynomial features from selected feature combinations.

Parameters:
  • train_df (pd.DataFrame) – Training data containing features to transform.

  • test_df (pd.DataFrame or None) – Testing data to apply fitted transformer to.

  • features (list of str) – Names of feature columns to generate polynomial features from.

  • kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.PolynomialFeatures. Common keys: ‘degree’ (2 or 3).

  • shortcircuit_preprocessing (bool, default=False) – If True, skip preprocessing steps.

Returns:

  • train_df (pd.DataFrame) – Training data with polynomial features added.

  • test_df (pd.DataFrame or None) – Testing data with polynomial features added, or None if input was None.

Return type:

Tuple[DataFrame, DataFrame]

See also

spline_features

Alternative spline-based feature transformation

ensembleset.feature_methods.spline_features(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]

Runs sklearn’s spline feature transformer.

Return type:

Tuple[DataFrame, DataFrame]

ensembleset.feature_methods.log_features(train_df, test_df, features, kwargs)[source]

Takes the log of each selected feature, applying sklearn’s min-max scaler first where needed to avoid undefined logarithms.

Return type:

Tuple[DataFrame, DataFrame]
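The scale-then-log idea can be sketched as follows. The target range (1, 2), the condition that triggers scaling, and the '_log' column suffix are all assumptions. The description above says only that a min-max scaler is used when needed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_df = pd.DataFrame({'x': [-2.0, 0.0, 2.0, 6.0]})
test_df = pd.DataFrame({'x': [2.0, 4.0]})
feature = 'x'

if train_df[feature].min() <= 0:
    # Assumed range: (1, 2) keeps every training value strictly positive.
    scaler = MinMaxScaler(feature_range=(1.0, 2.0))
    train_vals = scaler.fit_transform(train_df[[feature]]).ravel()
    test_vals = scaler.transform(test_df[[feature]]).ravel()
else:
    train_vals = train_df[feature].to_numpy()
    test_vals = test_df[feature].to_numpy()

# The log is now defined for every training value.
train_df[f'{feature}_log'] = np.log(train_vals)
test_df[f'{feature}_log'] = np.log(test_vals)
```

Because the scaler is fitted on training data, a test value far below the training minimum could still map to a non-positive number, so the test output may need a further check.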

ensembleset.feature_methods.ratio_features(train_df, test_df, features, kwargs)[source]

Adds every possible ratio of the selected features, replacing the results of division by zero with np.nan.

Return type:

Tuple[DataFrame, DataFrame]
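A minimal sketch of ratio features with division by zero mapped to np.nan is shown below. Reading "every possible ratio" as every ordered pair of distinct features is an assumption, and the helper add_ratios and its column naming are hypothetical.

```python
from itertools import permutations

import numpy as np
import pandas as pd


def add_ratios(df, features):
    """Hypothetical helper: add one ratio column per ordered feature pair."""
    df = df.copy()
    for numerator, denominator in permutations(features, 2):
        # Mask zero denominators so the division yields NaN instead of inf.
        safe_denominator = df[denominator].where(df[denominator] != 0, np.nan)
        df[f'{numerator}_over_{denominator}'] = df[numerator] / safe_denominator
    return df


train_df = pd.DataFrame({'a': [1.0, 2.0, 0.0], 'b': [2.0, 0.0, 4.0]})
train_ratios = add_ratios(train_df, ['a', 'b'])
```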

ensembleset.feature_methods.exponential_features(train_df, test_df, features, kwargs)[source]

Adds exponential features with base 2 or base e.

Return type:

Tuple[DataFrame, DataFrame]

ensembleset.feature_methods.sum_features(train_df, test_df, features, kwargs)[source]

Adds sum features for variable number of addends.

Return type:

Tuple[DataFrame, DataFrame]

ensembleset.feature_methods.difference_features(train_df, test_df, features, kwargs)[source]

Adds difference features for variable number of subtrahends.

Return type:

Tuple[DataFrame, DataFrame]

ensembleset.feature_methods.kde_smoothing(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]

Uses kernel density estimation to smooth features.

Return type:

Tuple[DataFrame, DataFrame]

ensembleset.feature_methods.kbins_quantization(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]

Discretizes features using k-bins quantization.

Return type:

Tuple[DataFrame, DataFrame]

Preprocessing Methods

Data preprocessing utilities for feature engineering pipelines.

This module provides preprocessing functions used to clean and prepare data before and after feature engineering operations. Functions include handling missing values, removing constant features, scaling, type conversions, and managing extreme values.

See also

ensembleset.feature_methods

Feature engineering operations

ensembleset.preprocessing_methods.preprocess_features(features, train_df, test_df, preprocessing_steps)[source]

Runs feature preprocessing steps.

Return type:

Tuple[list, DataFrame, DataFrame]

ensembleset.preprocessing_methods.exclude_string_features(features, train_df, test_df)[source]

Removes string features from features list.

Return type:

Tuple[list, DataFrame, DataFrame]

ensembleset.preprocessing_methods.enforce_floats(features, train_df, test_df)[source]

Changes features to float dtype.

Return type:

Tuple[list, DataFrame, DataFrame]

ensembleset.preprocessing_methods.remove_inf(features, train_df, test_df)[source]

Replaces any np.inf values with np.nan.

Return type:

Tuple[list, DataFrame, DataFrame]
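In pandas, replacing infinities with NaN can be done in a single call. The sketch below assumes that negative infinity is handled the same way as positive infinity, which the description does not state.

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame({'x': [1.0, np.inf, -np.inf, 4.0]})

# Assumption: -inf is treated the same way as +inf.
cleaned = train_df.replace([np.inf, -np.inf], np.nan)
```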

ensembleset.preprocessing_methods.remove_large_nums(features, train_df, test_df)[source]

Replaces numbers larger than the cube root of the float64 limit with np.nan.

Return type:

Tuple[list, DataFrame, DataFrame]
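The threshold can be computed from NumPy's float64 limits. A plausible reading, not confirmed here, is that values under this bound can still be cubed without overflowing. Comparing absolute values is also an assumption.

```python
import numpy as np
import pandas as pd

# Cube root of the largest finite float64, on the order of 1e102.
threshold = np.finfo(np.float64).max ** (1 / 3)

df = pd.DataFrame({'x': [1.0, 1e200, -3.0]})

# Assumption: magnitudes are compared, so large negative values
# would be replaced as well.
df['x'] = df['x'].where(df['x'].abs() <= threshold, np.nan)
```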

ensembleset.preprocessing_methods.remove_small_nums(features, train_df, test_df)[source]

Replaces values smaller than the float64 limit with zero.

Return type:

Tuple[list, DataFrame, DataFrame]

ensembleset.preprocessing_methods.knn_impute(features, train_df, test_df)[source]

Uses scikit-learn’s KNN imputer to fill np.nan values.

Return type:

Tuple[list, DataFrame, DataFrame]
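The sketch below shows KNN imputation with sklearn.impute.KNNImputer, fitted on training data and applied to both sets. The choice of n_neighbors=2 and the sample data are illustrative. The settings used by the library are not documented on this page.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train_df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                         'y': [1.0, 2.0, 3.0, 4.0]})
test_df = pd.DataFrame({'x': [np.nan], 'y': [2.2]})
features = ['x', 'y']

# Neighbors are always drawn from the training rows.
imputer = KNNImputer(n_neighbors=2)
train_df[features] = imputer.fit_transform(train_df[features])
test_df[features] = imputer.transform(test_df[features])
```

The missing training value is filled with the mean of x from the two rows whose y values are closest, here the rows with y = 2 and y = 4.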

ensembleset.preprocessing_methods.remove_constants(features, train_df, test_df)[source]

Removes constant valued features.

Return type:

Tuple[list, DataFrame, DataFrame]
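A sketch of constant-feature removal that matches the documented return shape (feature list, training DataFrame, testing DataFrame) follows. The helper drop_constants is hypothetical. Judging constancy on training data only is an assumption, made for consistency with the fit-on-train convention described for the DataSet class.

```python
import pandas as pd


def drop_constants(features, train_df, test_df):
    """Hypothetical helper: drop features with a single training value."""
    constant = [f for f in features
                if train_df[f].nunique(dropna=False) <= 1]
    kept = [f for f in features if f not in constant]

    train_df = train_df.drop(columns=constant)
    if test_df is not None:
        test_df = test_df.drop(columns=constant)

    return kept, train_df, test_df


train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [7.0, 7.0, 7.0]})
test_df = pd.DataFrame({'a': [4.0], 'b': [8.0]})
features, train_df, test_df = drop_constants(['a', 'b'], train_df, test_df)
```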

ensembleset.preprocessing_methods.scale_to_range(features, train_df, test_df, min_val=0.0, max_val=1.0)[source]

Scales features into the range [min_val, max_val].

Return type:

Tuple[list, DataFrame, DataFrame]
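Scaling into a fixed range can be sketched with sklearn.preprocessing.MinMaxScaler, passing min_val and max_val as feature_range. Whether the library uses this scaler internally is an assumption, and the sample data is made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_df = pd.DataFrame({'x': [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({'x': [15.0, 40.0]})
features = ['x']
min_val, max_val = 0.0, 1.0

# The minimum and maximum are learned from the training data only.
scaler = MinMaxScaler(feature_range=(min_val, max_val))
train_df[features] = scaler.fit_transform(train_df[features])
test_df[features] = scaler.transform(test_df[features])
```

Training values land exactly in [min_val, max_val]. Test values outside the training range, such as 40.0 here, fall outside that interval.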

ensembleset.preprocessing_methods.add_new_features(new_train_features, new_test_features, train_df, test_df)[source]

Adds new features to the training and testing DataFrames.

Return type:

Tuple[DataFrame, DataFrame]