API Reference
This page provides detailed API documentation for EnsembleSet.
DataSet Class
- class ensembleset.dataset.DataSet(label, train_data, test_data=None, string_features=None, data_directory='ensembleset_data', ensembleset_base_name='ensembleset')[source]
Bases:
object
Dataset ensemble generator using randomized feature engineering.
This class generates multiple dataset variations by applying randomized sequences of feature engineering methods to randomized subsets of input features. Each dataset in the ensemble undergoes a unique transformation pipeline, making it suitable for training diverse ensemble models.
The class handles both training and testing data with minimal data leakage by fitting transformations on training data and applying them to both training and testing sets. Generated datasets are saved to HDF5 format for efficient storage and retrieval.
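The fit-on-train, apply-to-both pattern described above can be sketched in plain Python. This is an illustration only, not EnsembleSet's implementation: `fit_min_max` and `apply_min_max` are hypothetical helper names, and min-max scaling stands in for any fitted transformation.

```python
def fit_min_max(train_values):
    """Learn scaling parameters from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, params):
    """Scale values with previously fitted parameters."""
    low, high = params
    span = high - low
    if span == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - low) / span for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [5.0, 6.0]

params = fit_min_max(train)                 # fitted on training data only
train_scaled = apply_min_max(train, params)
test_scaled = apply_min_max(test, params)   # test values may fall outside [0, 1]
```

Because the parameters never see the test values, nothing about the test distribution leaks into the transformation.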
- Parameters:
label (str) – Name of the label column in the training and testing DataFrames. This column will be extracted and stored separately from the features.
train_data (pd.DataFrame) – Training dataset containing both features and the label column. Must include the column specified by label parameter.
test_data (pd.DataFrame, optional) – Testing dataset containing both features and the label column. If provided, transformations fitted on training data will be applied to this data. Default is None.
string_features (list of str, optional) – List of column names containing string/categorical features that require encoding (one-hot or ordinal) before numerical feature engineering. Default is None.
data_directory (str, optional) – Path to directory where generated ensemble datasets will be saved in HDF5 format. Directory will be created if it doesn’t exist. Default is ‘ensembleset_data’.
ensembleset_base_name (str, optional) – Base name for the output HDF5 file. The file will be named ‘{ensembleset_base_name}.h5’. Default is ‘ensembleset’.
- train_data
Training features (without label column).
- Type:
pd.DataFrame
- test_data
Testing features (without label column).
- Type:
pd.DataFrame or None
- train_labels
Training labels extracted from train_data.
- Type:
np.ndarray
- test_labels
Testing labels extracted from test_data.
- Type:
np.ndarray or None
Examples
>>> import pandas as pd
>>> import ensembleset.dataset as ds
>>>
>>> # Create sample training and testing data
>>> train_df = pd.DataFrame({
...     'num_feature1': [1.0, 2.0, 3.0, 4.0],
...     'num_feature2': [10.0, 20.0, 30.0, 40.0],
...     'cat_feature': ['A', 'B', 'A', 'B'],
...     'target': [0, 1, 0, 1]
... })
>>> test_df = pd.DataFrame({
...     'num_feature1': [5.0, 6.0],
...     'num_feature2': [50.0, 60.0],
...     'cat_feature': ['A', 'B'],
...     'target': [1, 0]
... })
>>>
>>> # Initialize DataSet with string feature specification
>>> ensemble = ds.DataSet(
...     label='target',
...     train_data=train_df,
...     test_data=test_df,
...     string_features=['cat_feature'],
...     data_directory='./ensemble_output',
...     ensembleset_base_name='my_ensemble'
... )
>>>
>>> # Generate 5 datasets with 3 feature engineering steps each,
>>> # using 20% of features per step
>>> output_file = ensemble.make_datasets(
...     n_datasets=5,
...     frac_features=0.2,
...     n_steps=3
... )
Notes
Label columns are automatically removed from feature DataFrames and stored separately as numpy arrays.
String features are encoded before numerical transformations are applied.
All transformations are fitted on training data only to prevent data leakage, even when test data is provided.
Generated datasets are saved in HDF5 format with the structure: dataset.h5/train/labels, dataset.h5/train/dataset_N, dataset.h5/test/labels, dataset.h5/test/dataset_N
Uses multiprocessing for parallel dataset generation.
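As a rough sketch of the layout described in these notes, the helper below lists the keys one would expect for a given number of datasets. Both function names are hypothetical, and `read_array` assumes the entries are plain HDF5 datasets readable with the third-party h5py package, which this page does not confirm.

```python
def expected_keys(n_datasets, has_test=True):
    """List the HDF5 keys described in the notes above."""
    splits = ["train", "test"] if has_test else ["train"]
    keys = []
    for split in splits:
        keys.append(f"{split}/labels")
        keys.extend(f"{split}/dataset_{i}" for i in range(n_datasets))
    return keys


def read_array(path, key):
    """Read one entry; assumes plain HDF5 datasets readable by h5py."""
    import h5py  # third-party; imported lazily so expected_keys works without it

    with h5py.File(path, "r") as handle:
        return handle[key][:]


keys = expected_keys(n_datasets=2)
```

If the file was written through a different HDF5 interface, the keys may carry extra nesting, so inspecting the file before relying on these paths is advisable.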
See also
make_datasets: Generate ensemble datasets with specified parameters
ensembleset.feature_engineerings: Configuration of available methods
- __init__(label, train_data, test_data=None, string_features=None, data_directory='ensembleset_data', ensembleset_base_name='ensembleset')[source]
- make_datasets(n_datasets, frac_features, n_steps)[source]
Generate ensemble datasets using randomized feature engineering pipelines.
Creates multiple dataset variations by applying randomized sequences of feature engineering methods to randomly selected subsets of features. Each dataset undergoes a unique transformation pipeline, with feature selection re-randomized at each engineering step.
The method uses multiprocessing to generate datasets in parallel, with the number of worker processes equal to half the available CPU cores. All generated datasets are saved to an HDF5 file in the specified data directory.
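The worker-count rule can be sketched with the standard library as follows. The `max(1, ...)` guard is an assumption added here so that a single-core machine still gets one worker, and `run_in_parallel` is a hypothetical helper rather than part of EnsembleSet.

```python
import multiprocessing


def worker_count():
    """Half of the available CPU cores, but never fewer than one."""
    return max(1, multiprocessing.cpu_count() // 2)


def run_in_parallel(func, items):
    """Map func over items with a pool sized by worker_count()."""
    with multiprocessing.Pool(processes=worker_count()) as pool:
        return pool.map(func, items)


n_workers = worker_count()
```

When calling `run_in_parallel`, `func` must be defined at module level so that it can be pickled and sent to the worker processes.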
- Parameters:
n_datasets (int) – Number of dataset variations to generate. Each dataset will have a unique sequence of feature engineering transformations applied. Must be a positive integer.
frac_features (float) – Fraction of features to randomly select for each feature engineering step. Must be between 0 and 1. For example, 0.1 means 10% of available features are selected at each step. The selection is re-randomized for each step in the pipeline.
n_steps (int) – Number of feature engineering steps to apply in sequence for each dataset. Each step randomly selects a method from the available feature engineering techniques. Must be a positive integer.
- Returns:
Path to the generated HDF5 file containing all ensemble datasets. The file structure is:
- train/labels: Training labels (np.ndarray)
- train/dataset_0, train/dataset_1, …: Training feature sets
- test/labels: Testing labels (np.ndarray, if test_data provided)
- test/dataset_0, test/dataset_1, …: Testing feature sets
- Return type:
Examples
>>> import pandas as pd
>>> import ensembleset.dataset as ds
>>>
>>> # Create sample data
>>> train_df = pd.DataFrame({
...     'feature1': [1, 2, 3, 4, 5],
...     'feature2': [10, 20, 30, 40, 50],
...     'feature3': [100, 200, 300, 400, 500],
...     'label': [0, 1, 0, 1, 0]
... })
>>>
>>> # Initialize ensemble
>>> ensemble = ds.DataSet(
...     label='label',
...     train_data=train_df,
...     data_directory='./output'
... )
>>>
>>> # Generate 10 datasets, using 20% of features per step,
>>> # with 5 engineering steps each
>>> output_file = ensemble.make_datasets(
...     n_datasets=10,
...     frac_features=0.2,
...     n_steps=5
... )
>>> print(f"Datasets saved to: {output_file}")

>>> # With test data included
>>> test_df = pd.DataFrame({
...     'feature1': [6, 7],
...     'feature2': [60, 70],
...     'feature3': [600, 700],
...     'label': [1, 0]
... })
>>>
>>> ensemble_with_test = ds.DataSet(
...     label='label',
...     train_data=train_df,
...     test_data=test_df
... )
>>>
>>> output_file = ensemble_with_test.make_datasets(
...     n_datasets=5,
...     frac_features=0.3,
...     n_steps=3
... )
Notes
Each dataset has a unique random sequence of feature engineering methods
Feature selection is re-randomized after each engineering step
At least 1 feature is always selected, even if frac_features * n_features < 1
String features are encoded first (if specified), then numerical methods are applied
All transformations are fitted on training data only to prevent leakage
When test_data is provided, fitted transformations are applied to both training and testing data
Uses multiprocessing with worker count = cpu_count() // 2
Progress and debugging information is logged during execution
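The selection rules in these notes can be illustrated with a small planning function. This is a sketch under stated assumptions, not the library's code: `plan_pipeline` is a hypothetical name, truncation via `int()` is an assumed rounding rule, and the feature list is kept fixed between steps, whereas a real pipeline would also see the features created by earlier steps.

```python
import random


def plan_pipeline(features, methods, frac_features, n_steps, seed=None):
    """Return a list of (method_name, selected_features), one entry per step."""
    rng = random.Random(seed)
    # Assumed rounding rule: truncate, then enforce the one-feature minimum.
    n_selected = max(1, int(frac_features * len(features)))
    plan = []
    for _ in range(n_steps):
        method = rng.choice(methods)                 # random method per step
        selected = rng.sample(features, n_selected)  # re-randomized per step
        plan.append((method, selected))
    return plan


features = ["feature1", "feature2", "feature3"]
methods = ["poly_features", "log_features", "ratio_features"]
plan = plan_pipeline(features, methods, frac_features=0.2, n_steps=5, seed=0)
```

With three features and `frac_features=0.2`, the product is 0.6, so the one-feature minimum applies and each step receives exactly one feature.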
See also
ensembleset.feature_methods: Individual feature engineering functions
ensembleset.preprocessing_methods: Data preprocessing utilities
Feature Engineering Methods
Collection of functions to run feature engineering operations.
This module provides feature engineering functions for generating ensemble datasets. Functions include polynomial features, splines, logarithmic and exponential transformations, ratio and arithmetic operations, Gaussian KDE smoothing, and binning. String features can be encoded using one-hot or ordinal encoding.
All functions follow a consistent interface accepting training and testing DataFrames, feature lists, and keyword arguments, returning the transformed DataFrames with minimal data leakage.
See also
ensembleset.preprocessing_methods: Data preprocessing utilities
ensembleset.feature_engineerings: Configuration dictionaries for methods
- ensembleset.feature_methods.onehot_encoding(train_df, test_df, features, kwargs)[source]
Apply one-hot encoding to categorical string features.
- Parameters:
train_df (pd.DataFrame) – Training data containing features to encode.
test_df (pd.DataFrame or None) – Testing data to apply fitted encoder to.
features (list of str or None) – Names of string feature columns to encode.
kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.OneHotEncoder.
- Returns:
train_df (pd.DataFrame) – Training data with one-hot encoded features replacing originals.
test_df (pd.DataFrame or None) – Testing data with one-hot encoded features, or None if input was None.
- Return type:
See also
ordinal_encoding: Alternative ordinal encoding for string features
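To make the idea concrete, here is a dependency-free sketch of one-hot encoding with categories learned from the training column only. It does not call `onehot_encoding` or scikit-learn. The helper names are hypothetical, and mapping unseen test values to an all-zero row is an assumption that mirrors `OneHotEncoder(handle_unknown='ignore')`, not a documented default of this library.

```python
def fit_categories(train_column):
    """Collect the sorted unique categories seen in the training column."""
    return sorted(set(train_column))


def one_hot(column, categories):
    """Encode each value as a 0/1 row; unseen values become all zeros."""
    return [[1 if value == cat else 0 for cat in categories] for value in column]


train_col = ["A", "B", "A", "B"]
test_col = ["A", "C"]  # "C" never appears in the training data

categories = fit_categories(train_col)
train_encoded = one_hot(train_col, categories)
test_encoded = one_hot(test_col, categories)
```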
- ensembleset.feature_methods.ordinal_encoding(train_df, test_df, features, kwargs)[source]
Apply ordinal encoding to categorical string features.
- Parameters:
train_df (pd.DataFrame) – Training data containing features to encode.
test_df (pd.DataFrame or None) – Testing data to apply fitted encoder to.
features (list of str or None) – Names of string feature columns to encode.
kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.OrdinalEncoder.
- Returns:
train_df (pd.DataFrame) – Training data with ordinal encoded features in place.
test_df (pd.DataFrame or None) – Testing data with ordinal encoded features, or None if input was None.
- Return type:
See also
onehot_encoding: Alternative one-hot encoding for string features
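Ordinal encoding replaces each category with an integer code in place. The plain-Python sketch below shows the idea. The helper names are hypothetical, and the `-1` code for unseen values is an assumption modelled on scikit-learn's `handle_unknown='use_encoded_value'` option rather than on anything this page documents.

```python
def fit_ordinal(train_column):
    """Map each training category to an integer code in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(train_column)))}


def apply_ordinal(column, mapping, unknown=-1):
    """Replace values with their codes; unseen values get the `unknown` code."""
    return [mapping.get(value, unknown) for value in column]


mapping = fit_ordinal(["B", "A", "B"])
train_codes = apply_ordinal(["B", "A", "B"], mapping)
test_codes = apply_ordinal(["A", "C"], mapping)
```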
- ensembleset.feature_methods.poly_features(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]
Generate polynomial features from selected feature combinations.
- Parameters:
train_df (pd.DataFrame) – Training data containing features to transform.
test_df (pd.DataFrame or None) – Testing data to apply fitted transformer to.
features (list of str) – Names of feature columns to generate polynomial features from.
kwargs (dict) – Keyword arguments passed to sklearn.preprocessing.PolynomialFeatures. Common keys: ‘degree’ (2 or 3).
shortcircuit_preprocessing (bool, default=False) – If True, skip preprocessing steps.
- Returns:
train_df (pd.DataFrame) – Training data with polynomial features added.
test_df (pd.DataFrame or None) – Testing data with polynomial features added, or None if input was None.
- Return type:
See also
spline_features: Alternative spline-based feature transformation
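For intuition about what a polynomial expansion produces, the sketch below computes every product of the inputs up to a given degree for a single row. `poly_terms` is a hypothetical helper. The term ordering is chosen to match what `PolynomialFeatures(degree=2, include_bias=False)` yields for two inputs, but whether `poly_features` passes those particular options is not stated on this page.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_terms(row, degree=2):
    """Return all products of the inputs up to `degree`, without a bias term."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            terms.append(prod(combo))
    return terms


# For inputs (a, b) the terms are a, b, a*a, a*b, b*b.
expanded = poly_terms([2.0, 3.0], degree=2)
```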
- ensembleset.feature_methods.spline_features(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]
Runs sklearn’s spline feature transformer.
- ensembleset.feature_methods.log_features(train_df, test_df, features, kwargs)[source]
Takes the log of each feature, applying sklearn’s min-max scaler first if needed to avoid undefined logarithms.
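One way to picture this behaviour is the standard-library sketch below: when the training values include zero or negative numbers, the feature is rescaled with parameters taken from the training data before the log is applied. The function name, the target range of [1, 2], and the choice to return NaN for test values that remain non-positive after scaling are all assumptions made for illustration.

```python
import math


def log_feature(train_values, test_values):
    """Log-transform a feature, rescaling first if the training data is non-positive."""
    low, high = min(train_values), max(train_values)

    def shift(value):
        if low > 0:
            return value  # already safe for log
        span = (high - low) or 1.0
        return 1.0 + (value - low) / span  # assumed target range: [1, 2]

    def safe_log(value):
        shifted = shift(value)
        return math.log(shifted) if shifted > 0 else math.nan

    return [safe_log(v) for v in train_values], [safe_log(v) for v in test_values]


train_log, test_log = log_feature([-1.0, 0.0, 1.0], [3.0, -5.0])
```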
- ensembleset.feature_methods.ratio_features(train_df, test_df, features, kwargs)[source]
Adds every possible ratio feature, replacing divide-by-zero results with np.nan.
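A minimal sketch of the ratio idea, using a dict of lists in place of a DataFrame: every ordered pair of features yields one ratio column, and a zero denominator produces NaN. The function name and the `_over_` column naming are hypothetical and not taken from the library.

```python
import math
from itertools import permutations


def ratio_columns(table):
    """Add a ratio column for every ordered pair of features."""
    ratios = {}
    for numerator, denominator in permutations(table, 2):
        ratios[f"{numerator}_over_{denominator}"] = [
            n / d if d != 0 else math.nan
            for n, d in zip(table[numerator], table[denominator])
        ]
    return ratios


table = {"a": [1.0, 2.0], "b": [4.0, 0.0]}
ratios = ratio_columns(table)
```

With k selected features this produces k * (k - 1) new columns, so the output grows quickly as more features are selected.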
- ensembleset.feature_methods.exponential_features(train_df, test_df, features, kwargs)[source]
Adds exponential features with base 2 or base e.
- ensembleset.feature_methods.sum_features(train_df, test_df, features, kwargs)[source]
Adds sum features for variable number of addends.
- ensembleset.feature_methods.difference_features(train_df, test_df, features, kwargs)[source]
Adds difference features for variable number of subtrahends.
- ensembleset.feature_methods.kde_smoothing(train_df, test_df, features, kwargs, shortcircuit_preprocessing=False)[source]
Uses kernel density estimation to smooth features.
Preprocessing Methods
Data preprocessing utilities for feature engineering pipelines.
This module provides preprocessing functions used to clean and prepare data before and after feature engineering operations. Functions include handling missing values, removing constant features, scaling, type conversions, and managing extreme values.
See also
ensembleset.feature_methods: Feature engineering operations
- ensembleset.preprocessing_methods.preprocess_features(features, train_df, test_df, preprocessing_steps)[source]
Runs feature preprocessing steps.
- ensembleset.preprocessing_methods.exclude_string_features(features, train_df, test_df)[source]
Removes string features from features list.
- ensembleset.preprocessing_methods.enforce_floats(features, train_df, test_df)[source]
Changes features to float dtype.
- ensembleset.preprocessing_methods.remove_inf(features, train_df, test_df)[source]
Replaces any np.inf values with np.nan.
- ensembleset.preprocessing_methods.remove_large_nums(features, train_df, test_df)[source]
Replaces numbers larger than the cube root of the float64 limit with np.nan.
- ensembleset.preprocessing_methods.remove_small_nums(features, train_df, test_df)[source]
Replaces values smaller than the float64 limit with zero.
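The three value-cleaning helpers above (remove_inf, remove_large_nums, remove_small_nums) can be pictured together in one standard-library sketch. The thresholds are assumptions: the page does not say which float64 limits are meant, so this uses `sys.float_info.max` for the large bound and `sys.float_info.min` for the small one, and compares absolute values.

```python
import math
import sys

LARGE_LIMIT = sys.float_info.max ** (1 / 3)  # cube root of the largest float64
SMALL_LIMIT = sys.float_info.min             # smallest positive normal float64 (assumed cutoff)


def clean_value(value):
    """Apply the three replacements to a single number."""
    if math.isinf(value):
        return math.nan   # remove_inf
    if abs(value) > LARGE_LIMIT:
        return math.nan   # remove_large_nums
    if 0 < abs(value) < SMALL_LIMIT:
        return 0.0        # remove_small_nums
    return value


cleaned = [clean_value(v) for v in [1.0, math.inf, 1e200, 5e-324]]
```

A plausible reason for capping at the cube root is that a value below it can be cubed without overflowing, though the page does not state the motivation.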
- ensembleset.preprocessing_methods.knn_impute(features, train_df, test_df)[source]
Uses scikit-learn’s KNN imputer to fill np.nan.
- ensembleset.preprocessing_methods.remove_constants(features, train_df, test_df)[source]
Removes constant-valued features.
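Removing constant features might look like the sketch below, again with dicts of lists standing in for DataFrames. The helper names are hypothetical, and deciding which columns to drop from the training data alone is an assumption consistent with the fit-on-train approach described for the DataSet class.

```python
def constant_columns(train_table):
    """Return names of columns with a single unique training value."""
    return [name for name, values in train_table.items() if len(set(values)) <= 1]


def drop_columns(table, names):
    """Return a copy of the table without the named columns."""
    return {name: values for name, values in table.items() if name not in names}


train = {"x": [1.0, 2.0, 3.0], "flag": [1.0, 1.0, 1.0]}
test = {"x": [4.0], "flag": [0.0]}

to_drop = constant_columns(train)           # decided from training data only
train_clean = drop_columns(train, to_drop)
test_clean = drop_columns(test, to_drop)    # same columns removed from test
```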