Configuration
EnsembleSet uses configuration dictionaries to define available feature engineering methods and their parameter options.
String Encoding Methods
String features are encoded before numerical feature engineering methods are applied.
- ensembleset.feature_engineerings.STRING_ENCODINGS
Dictionary of available string encoding methods and their default parameters.
Available Methods:
- onehot_encoding - One-hot encoding using sklearn’s OneHotEncoder
  - sparse_output: False - Return dense arrays
- ordinal_encoding - Ordinal encoding using sklearn’s OrdinalEncoder
  - handle_unknown: 'use_encoded_value' - Handle unknown categories
  - unknown_value: np.nan - Value to use for unknown categories
Numerical Feature Engineering Methods
- ensembleset.feature_engineerings.NUMERICAL_METHODS
Dictionary of available numerical feature engineering methods and their parameter options.
Available Methods:
- poly_features - Polynomial feature generation
  - degree: [2, 3] - Polynomial degree options
  - interaction_only: [True, False] - Include only interaction features
  - include_bias: [True, False] - Include bias column
- spline_features - Spline transformation
  - n_knots: [5] - Number of knots
  - degree: [2, 3, 4] - Spline degree options
  - knots: ['uniform', 'quantile'] - Knot placement strategy
  - extrapolation: ['constant', 'linear', 'continue', 'periodic'] - Extrapolation method
  - include_bias: [True, False] - Include bias column
- log_features - Logarithmic transformation
  - base: ['2', 'e', '10'] - Logarithm base options
- ratio_features - Ratio/division features
  - div_zero_value: [np.nan] - Value to use for division by zero
- exponential_features - Exponential transformation
  - base: ['2', 'e'] - Exponential base options
- sum_features - Sum of feature combinations
  - n_addends: [2, 3, 4] - Number of features to sum
- difference_features - Difference of feature combinations
  - n_subtrahends: [2, 3, 4] - Number of features in subtraction
- kde_smoothing - Gaussian kernel density estimation smoothing
  - bandwidth: ['scott', 'silverman'] - Bandwidth selection method
  - sample_size: [1000] - Number of samples for KDE
- kbins_quantization - Quantization into bins
  - n_bins: [4, 8, 16] - Number of bins options
  - encode: ['ordinal'] - Encoding method
  - strategy: ['uniform', 'quantile', 'kmeans'] - Binning strategy
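Each parameter maps to a list of options, so one method can yield many distinct configurations. The sketch below enumerates them for spline_features. The `spline_options` dictionary is a hand-copied stand-in for the documented entry, not an import from the package, and it assumes every option can be combined freely with every other.

```python
from itertools import product

# Stand-in copy of the documented spline_features options
# (hypothetical local variable, not imported from ensembleset)
spline_options = {
    "n_knots": [5],
    "degree": [2, 3, 4],
    "knots": ["uniform", "quantile"],
    "extrapolation": ["constant", "linear", "continue", "periodic"],
    "include_bias": [True, False],
}

# Cartesian product of all option lists, one dict per combination
names = list(spline_options)
combinations = [
    dict(zip(names, values)) for values in product(*spline_options.values())
]

print(len(combinations))  # 1 * 3 * 2 * 4 * 2 = 48
print(combinations[0])
```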
Parameter Selection
During dataset generation, parameters are randomly selected from the available options. For example:
- Polynomial features may use degree 2 or 3
- Spline features may use degree 2, 3, or 4
- Log features may use base 2, e, or 10
This randomization ensures diversity across the generated ensemble datasets.
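The selection step can be pictured with the following minimal sketch. It is an assumption about the general idea, not the package's actual sampling code: the `sample_parameters` helper is hypothetical, and the dictionary is a stand-in that mirrors two of the documented entries.

```python
import random

# Stand-in mirroring two documented entries (not imported from ensembleset)
NUMERICAL_METHODS = {
    "poly_features": {
        "degree": [2, 3],
        "interaction_only": [True, False],
        "include_bias": [True, False],
    },
    "log_features": {"base": ["2", "e", "10"]},
}


def sample_parameters(options, rng):
    """Pick one value per parameter from its list of options (hypothetical helper)."""
    return {name: rng.choice(values) for name, values in options.items()}


rng = random.Random(0)  # seeded only so repeated runs give the same draw
method = rng.choice(list(NUMERICAL_METHODS))
params = sample_parameters(NUMERICAL_METHODS[method], rng)
print(method, params)
```

Repeating the draw for each generated dataset gives each one a potentially different method and parameter set.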
Customization
The configuration dictionaries are predefined, but they are exposed as module attributes, so users who need custom parameter ranges can modify them directly. Standard use cases rarely require this.
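A customization might look like the sketch below. To keep it runnable without the package installed, it edits a stand-in dictionary; with the real package the same assignments would target `ensembleset.feature_engineerings.NUMERICAL_METHODS`. Two assumptions apply: option values should stay as lists, and the package reads the dictionary at dataset generation time, so changes must be made before generation starts.

```python
import copy

# Stand-in for ensembleset.feature_engineerings.NUMERICAL_METHODS
NUMERICAL_METHODS = {
    "poly_features": {
        "degree": [2, 3],
        "interaction_only": [True, False],
        "include_bias": [True, False],
    },
}

# Keep a copy so the defaults can be restored later
defaults = copy.deepcopy(NUMERICAL_METHODS)

# Narrow the options: only degree 2, never add a bias column.
# Values remain lists (assumption: the package expects lists of options).
NUMERICAL_METHODS["poly_features"]["degree"] = [2]
NUMERICAL_METHODS["poly_features"]["include_bias"] = [False]

print(NUMERICAL_METHODS["poly_features"])

# Restore the original options when finished
NUMERICAL_METHODS.update(copy.deepcopy(defaults))
```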
Example
import ensembleset.feature_engineerings as engineerings
# View available string encoding methods
print(engineerings.STRING_ENCODINGS)
# View available numerical methods
print(engineerings.NUMERICAL_METHODS)
# Check polynomial feature options
print(engineerings.NUMERICAL_METHODS['poly_features'])