Feature Engineering Catalog
============================

This page provides detailed descriptions of all 11 feature engineering methods available in EnsembleSet.

String Feature Encoding
------------------------

1. One-Hot Encoding
^^^^^^^^^^^^^^^^^^^

Converts categorical string features into binary indicator columns.

**Mathematical Description:**

For a categorical feature with :math:`k` unique categories, one-hot encoding creates :math:`k` binary columns where each column represents one category. For a given sample, exactly one column has value 1 (the category present) and all others are 0.

**Use Cases:**

* Nominal categorical features without inherent ordering
* Features with low to moderate cardinality
* When treating each category as independent is appropriate

**Example:**

.. code-block:: python

   # Input: ['A', 'B', 'A', 'C']
   # Output:
   #   A  B  C
   #   1  0  0
   #   0  1  0
   #   1  0  0
   #   0  0  1

2. Ordinal Encoding
^^^^^^^^^^^^^^^^^^^^

Converts categorical string features into integer codes.

**Mathematical Description:**

Each unique category is mapped to an integer. For :math:`k` unique categories, integers from 0 to :math:`k-1` are assigned.

**Use Cases:**

* Ordinal categorical features with inherent ordering
* High-cardinality categorical features where one-hot encoding would create too many columns
* Tree-based models that can handle encoded categories

**Example:**

.. code-block:: python

   # Input: ['low', 'medium', 'high', 'low']
   # Output: [0, 1, 2, 0]

Numerical Feature Engineering
------------------------------

3. Polynomial Features
^^^^^^^^^^^^^^^^^^^^^^

Generates polynomial and interaction features from existing features.

**Mathematical Description:**

For degree :math:`d`, polynomial features include all monomials of degree :math:`\leq d`. For two features :math:`x_1` and :math:`x_2` with degree 2:

.. math::

   [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]

**Use Cases:**

* Capturing non-linear relationships
* Modeling feature interactions
* Polynomial regression models

**Parameters:**

* Degree: 2 or 3
* Interaction only: Include only cross-products
* Include bias: Add constant term

4. Spline Features
^^^^^^^^^^^^^^^^^^

Applies spline basis transformations to features.

**Mathematical Description:**

Spline transformations create piecewise polynomial functions. B-splines of degree :math:`d` with :math:`k` knots create smooth curves defined by control points.

**Use Cases:**

* Flexible non-linear transformations
* Smoother than polynomial features
* Capturing complex non-linear patterns

**Parameters:**

* Degree: 2, 3, or 4
* Knots: Number and placement (uniform or quantile)
* Extrapolation: Behavior outside knot range

5. Logarithmic Features
^^^^^^^^^^^^^^^^^^^^^^^

Applies logarithmic transformations to compress large value ranges.

**Mathematical Description:**

.. math::

   y = \log_b(x)

where :math:`b \in \{2, e, 10\}`

**Use Cases:**

* Features with exponential distributions or heavy right tails
* Reducing the impact of outliers
* Making multiplicative relationships additive

**Parameters:**

* Base: 2, e (natural log), or 10

**Note:** Handles zero and negative values by preprocessing.

6. Ratio Features
^^^^^^^^^^^^^^^^^

Creates ratio features from all pairwise divisions of selected features.

**Mathematical Description:**

For features :math:`x_1, x_2, \ldots, x_n`, creates:

.. math::

   r_{ij} = \frac{x_i}{x_j} \quad \forall i \neq j

**Use Cases:**

* Capturing relative relationships between features
* Normalizing features by reference values
* Financial ratios (e.g., price/earnings)

**Parameters:**

* Division by zero value: Replacement value (default: NaN)

7. Exponential Features
^^^^^^^^^^^^^^^^^^^^^^^^

Applies exponential transformations to features.

**Mathematical Description:**

.. math::

   y = b^x

where :math:`b \in \{2, e\}`

**Use Cases:**

* Inverse of logarithmic transformation
* Amplifying small differences
* Modeling exponential growth

**Parameters:**

* Base: 2 or e (natural exponential)

**Note:** Handles overflow by preprocessing.

8. Sum Features
^^^^^^^^^^^^^^^

Creates features by summing combinations of selected features.

**Mathematical Description:**

For :math:`n` addends, creates sums of all combinations:

.. math::

   s = x_{i_1} + x_{i_2} + \ldots + x_{i_n}

where :math:`n \in \{2, 3, 4\}`

**Use Cases:**

* Capturing aggregate effects
* Total or cumulative values
* Additive relationships

**Parameters:**

* Number of addends: 2, 3, or 4

9. Difference Features
^^^^^^^^^^^^^^^^^^^^^^

Creates features by computing differences of feature combinations.

**Mathematical Description:**

For :math:`n` subtrahends, creates:

.. math::

   d = x_{i_1} - x_{i_2} - \ldots - x_{i_n}

where :math:`n \in \{2, 3, 4\}`

**Use Cases:**

* Change or delta features
* Comparing related measurements
* Removing baseline effects

**Parameters:**

* Number of subtrahends: 2, 3, or 4

10. Gaussian KDE Smoothing
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Applies Gaussian kernel density estimation to smooth features.

**Mathematical Description:**

For each feature value :math:`x`, estimates the probability density:

.. math::

   \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)

where :math:`K` is the Gaussian kernel and :math:`h` is the bandwidth.

**Use Cases:**

* Noise reduction
* Identifying underlying distributions
* Smoothing irregular patterns

**Parameters:**

* Bandwidth: 'scott' or 'silverman' method
* Sample size: Number of samples for KDE calculation

**Note:** Fitted on training data only, then applied to both train and test.

11. K-Bins Quantization
^^^^^^^^^^^^^^^^^^^^^^^^

Discretizes continuous features into bins.

**Mathematical Description:**

Divides the feature range into :math:`k` bins and assigns each value to a bin:

.. math::

   y = \text{bin}(x) \in \{0, 1, \ldots, k-1\}

**Use Cases:**

* Converting continuous to categorical features
* Reducing sensitivity to small variations
* Handling non-linear relationships with linear models

**Parameters:**

* Number of bins: 4, 8, or 16
* Strategy: uniform, quantile, or k-means
* Encoding: ordinal

Feature Engineering Pipeline
-----------------------------

During ensemble generation, these methods are:

1. **Randomly selected** - Each dataset gets a unique sequence
2. **Applied in sequence** - Methods build on previous transformations
3. **Applied to random subsets** - Only a fraction of features are transformed at each step
4. **Fitted on training data** - All transformations use training data statistics to prevent leakage
5. **Applied to test data** - The same fitted transformations are applied to test data

This randomization strategy creates diverse datasets suitable for training ensemble models while maintaining consistent transformations between training and testing data.
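
The five pipeline steps above can be illustrated with a minimal NumPy sketch. This is not EnsembleSet's implementation: the names ``LogStep``, ``KBinsStep``, ``fit_random_pipeline``, and ``apply_pipeline`` are hypothetical, only two of the eleven methods are included, and the shifted-log handling of non-positive values is an assumption made for the example.

.. code-block:: python

   import numpy as np


   class LogStep:
       """Shifted natural log (hypothetical); the shift is learned on train."""

       def fit(self, X):
           # Shift so the smallest training value maps to log(1) = 0.
           self.shift_ = 1.0 - X.min(axis=0)
           return self

       def transform(self, X):
           # Clip so test values below the training minimum stay finite.
           return np.log(np.clip(X + self.shift_, 1e-12, None))


   class KBinsStep:
       """Quantile binning (hypothetical); bin edges are learned on train."""

       def __init__(self, n_bins=4):
           self.n_bins = n_bins

       def fit(self, X):
           # Interior quantiles only, so codes fall in {0, ..., n_bins - 1}.
           qs = np.linspace(0.0, 1.0, self.n_bins + 1)[1:-1]
           self.edges_ = np.quantile(X, qs, axis=0)
           return self

       def transform(self, X):
           cols = [np.searchsorted(self.edges_[:, j], X[:, j], side="right")
                   for j in range(X.shape[1])]
           return np.column_stack(cols).astype(float)


   def fit_random_pipeline(X_train, rng, n_steps=3, fraction=0.5):
       """Fit a random sequence of steps, each on a random column subset."""
       X = X_train.astype(float).copy()
       n_cols = max(1, int(round(fraction * X.shape[1])))
       fitted = []
       for _ in range(n_steps):
           # Step 1: pick a method at random.
           step = LogStep() if rng.random() < 0.5 else KBinsStep(n_bins=4)
           # Step 3: pick a random subset of columns.
           cols = rng.choice(X.shape[1], size=n_cols, replace=False)
           # Steps 2 and 4: fit on the already-transformed training data.
           step.fit(X[:, cols])
           X[:, cols] = step.transform(X[:, cols])
           fitted.append((step, cols))
       return fitted, X


   def apply_pipeline(fitted, X):
       """Step 5: replay the fitted steps, in order, on new data."""
       X = X.astype(float).copy()
       for step, cols in fitted:
           X[:, cols] = step.transform(X[:, cols])
       return X


   rng = np.random.default_rng(0)
   X_train = rng.normal(size=(100, 4))
   X_test = rng.normal(size=(20, 4))
   fitted, X_train_t = fit_random_pipeline(X_train, rng)
   X_test_t = apply_pipeline(fitted, X_test)

Storing each fitted step together with its column indices means the test data goes through exactly the same sequence and column selection as the training data, using only statistics learned from the training set.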