Configuration

EnsembleSet uses configuration dictionaries to define available feature engineering methods and their parameter options.

String Encoding Methods

String features are encoded before numerical feature engineering methods are applied.

ensembleset.feature_engineerings.STRING_ENCODINGS

Dictionary of available string encoding methods and their default parameters.

Available Methods:

onehot_encoding - One-hot encoding using sklearn’s OneHotEncoder
  - sparse_output: False - Return dense arrays
ordinal_encoding - Ordinal encoding using sklearn’s OrdinalEncoder
  - handle_unknown: 'use_encoded_value' - Handle unknown categories
  - unknown_value: np.nan - Value to use for unknown categories

Numerical Feature Engineering Methods

ensembleset.feature_engineerings.NUMERICAL_METHODS

Dictionary of available numerical feature engineering methods and their parameter options.

Available Methods:

poly_features - Polynomial feature generation
  - degree: [2, 3] - Polynomial degree options
  - interaction_only: [True, False] - Include only interaction features
  - include_bias: [True, False] - Include bias column
spline_features - Spline transformation
  - n_knots: [5] - Number of knots
  - degree: [2, 3, 4] - Spline degree options
  - knots: ['uniform', 'quantile'] - Knot placement strategy
  - extrapolation: ['constant', 'linear', 'continue', 'periodic'] - Extrapolation method
  - include_bias: [True, False] - Include bias column
log_features - Logarithmic transformation
  - base: ['2', 'e', '10'] - Logarithm base options
ratio_features - Ratio/division features
  - div_zero_value: [np.nan] - Value to use for division by zero
exponential_features - Exponential transformation
  - base: ['2', 'e'] - Exponential base options
sum_features - Sum of feature combinations
  - n_addends: [2, 3, 4] - Number of features to sum
difference_features - Difference of feature combinations
  - n_subtrahends: [2, 3, 4] - Number of features in subtraction
kde_smoothing - Gaussian kernel density estimation smoothing
  - bandwidth: ['scott', 'silverman'] - Bandwidth selection method
  - sample_size: [1000] - Number of samples for KDE
kbins_quantization - Quantization into bins
  - n_bins: [4, 8, 16] - Number of bins options
  - encode: ['ordinal'] - Encoding method
  - strategy: ['uniform', 'quantile', 'kmeans'] - Binning strategy
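
To get a feel for how large each method's option space is, the sketch below counts the parameter combinations implied by the lists above. It assumes each entry maps parameter names to a list of candidate values. The name documented_options is hypothetical: it is a hand-copied stand-in for three of the entries, not the library's actual object.

```python
import math

# Hand-copied stand-in for a few NUMERICAL_METHODS entries as documented above.
# Assumption: each method maps parameter names to a list of candidate values.
documented_options = {
    "poly_features": {
        "degree": [2, 3],
        "interaction_only": [True, False],
        "include_bias": [True, False],
    },
    "spline_features": {
        "n_knots": [5],
        "degree": [2, 3, 4],
        "knots": ["uniform", "quantile"],
        "extrapolation": ["constant", "linear", "continue", "periodic"],
        "include_bias": [True, False],
    },
    "kbins_quantization": {
        "n_bins": [4, 8, 16],
        "encode": ["ordinal"],
        "strategy": ["uniform", "quantile", "kmeans"],
    },
}

# The number of distinct parameter combinations for a method is the
# product of the lengths of its option lists.
combination_counts = {
    method: math.prod(len(options) for options in params.values())
    for method, params in documented_options.items()
}

for method, count in combination_counts.items():
    print(f"{method}: {count} combinations")
```

Under these assumptions, poly_features has 8 possible configurations, spline_features has 48, and kbins_quantization has 9.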

Parameter Selection

During dataset generation, a value for each parameter is selected at random from its list of available options. For example:

Polynomial features may use degree 2 or 3
Spline features may use degree 2, 3, or 4
Log features may use base 2, e, or 10

This randomization ensures diversity across the generated ensemble datasets.
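
The selection step described above can be sketched as follows. This is an illustration only: the helper sample_parameters and the stand-in poly_options are hypothetical names, and uniform selection via random.choice is an assumption, since the text only states that parameters are chosen randomly.

```python
import random

# Stand-in copied from the documented poly_features options
# (not imported from the library).
poly_options = {
    "degree": [2, 3],
    "interaction_only": [True, False],
    "include_bias": [True, False],
}


def sample_parameters(options, rng):
    """Pick one value per parameter, uniformly at random (assumed)."""
    return {name: rng.choice(values) for name, values in options.items()}


# Seeded only so that repeated runs of this sketch give the same draws.
rng = random.Random(315)

# Each draw is one possible configuration for a generated dataset.
samples = [sample_parameters(poly_options, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Each printed dictionary holds exactly one value per parameter, which is the form a transformer constructor would accept as keyword arguments.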

Customization

The configuration dictionaries are predefined, but users who need custom parameter ranges can modify them through the module attributes. This is rarely necessary for standard use cases.
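
A minimal sketch of such a modification is shown below. It works on a local stand-in dictionary so that it runs without the package installed. The nested layout (method name, then parameter name, then list of options) is an assumption based on the listings above.

```python
import copy

# Local stand-in for ensembleset.feature_engineerings.NUMERICAL_METHODS.
# Assumption: the real dictionary has this nested layout.
NUMERICAL_METHODS = {
    "poly_features": {
        "degree": [2, 3],
        "interaction_only": [True, False],
        "include_bias": [True, False],
    },
    "log_features": {"base": ["2", "e", "10"]},
}

# Keep a deep copy so the defaults can be restored later.
defaults = copy.deepcopy(NUMERICAL_METHODS)

# Narrow the option lists: only degree-2 polynomials, only natural logs.
NUMERICAL_METHODS["poly_features"]["degree"] = [2]
NUMERICAL_METHODS["log_features"]["base"] = ["e"]
print(NUMERICAL_METHODS["poly_features"]["degree"])

# Restore the defaults in place, so existing references see the change.
NUMERICAL_METHODS.clear()
NUMERICAL_METHODS.update(copy.deepcopy(defaults))
print(NUMERICAL_METHODS["poly_features"]["degree"])
```

With the real package, the same assignments would presumably be applied to the NUMERICAL_METHODS attribute of ensembleset.feature_engineerings before any datasets are generated.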

Example

import ensembleset.feature_engineerings as engineerings

# View available string encoding methods
print(engineerings.STRING_ENCODINGS)

# View available numerical methods
print(engineerings.NUMERICAL_METHODS)

# Check polynomial feature options
print(engineerings.NUMERICAL_METHODS['poly_features'])