Decomposition
Polynomial expansion, PCA variance plots, and cross-validated component selection.
featurely.decomposition
Polynomial expansion and PCA component selection.
Degree-2 polynomial expansion generates every squared term and pairwise product, which quickly produces hundreds of correlated columns. PCA rotates that expanded set into orthogonal components ordered by variance, and cross-validation over component counts finds how many are worth keeping. Note that PCA here is fit on the full dataset before cross-validation; it is unsupervised (never sees the target), so any optimism this introduces is mild, but a strict production pipeline would fit PCA inside each fold.
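The stricter variant mentioned above, where scaling and PCA are refit inside every fold, can be sketched with a plain scikit-learn pipeline. This is an illustration rather than what `featurely` does internally: the synthetic data, the choice of `LinearRegression`, and the fixed `n_components=10` are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data (assumption: stands in for a real feature frame).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Degree-2 expansion of p columns yields p + p * (p + 1) / 2 terms without bias.
n_expanded = (
    PolynomialFeatures(degree=2, include_bias=False).fit_transform(X).shape[1]
)

# Every step lives in the pipeline, so each CV fold refits the scaler and PCA
# on its own training split and the held-out split never influences them.
pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    PCA(n_components=10),
    LinearRegression(),
)
fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
```

Six input columns already expand to 27 terms here; the count grows quadratically with the number of inputs, which is why the expanded set becomes unwieldy so quickly.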
make_polynomial_features(df, feature_cols, degree=2, include_bias=False)
Return the polynomial expansion of the selected columns as a frame.
Column names come from scikit-learn but are sanitized for CSV round trips: spaces (products) become `_x_` and carets (powers) become `_pow`.
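The renaming rule can be illustrated in a few lines. The helper name `sanitize_term_name` is hypothetical and not part of `featurely`; the raw names are written in the form scikit-learn's `PolynomialFeatures.get_feature_names_out` emits for two input columns `a` and `b` at degree 2 without a bias term.

```python
def sanitize_term_name(name: str) -> str:
    """Hypothetical helper: apply the two replacements described above."""
    # A space separates the factors of a product term; a caret marks a power.
    return name.replace(" ", "_x_").replace("^", "_pow")


# Names in scikit-learn's format for columns "a" and "b", degree 2, no bias.
raw_names = ["a", "b", "a^2", "a b", "b^2"]
clean_names = [sanitize_term_name(name) for name in raw_names]
```

Under this rule `a b` becomes `a_x_b` and `a^2` becomes `a_pow2`. Input columns whose own names contain spaces or carets would be rewritten as well, so renaming them beforehand avoids ambiguous output.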
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input frame; not modified. | required |
| `feature_cols` | `list[str]` | Columns to expand. | required |
| `degree` | `int` | Polynomial degree. | `2` |
| `include_bias` | `bool` | When True, include the constant bias column. | `False` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A frame of expanded polynomial terms. |
Source code in src/featurely/decomposition.py
plot_pca_variance(x_df, title='PCA cumulative explained variance')
Plot cumulative explained variance and return the fitted PCA.
Features are standard-scaled first; without scaling, the leading components simply track whichever columns have the largest raw variance. Reference lines mark the 90, 95, and 99 percent variance thresholds.
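A minimal sketch of the computation behind such a plot, with the plotting itself left out. It is not the function's actual source: the synthetic data and the names `cumulative` and `components_needed` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Five independent columns on very different scales (illustrative data).
X = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# Without scaling, the first component is dominated by the largest column.
unscaled_first_ratio = PCA().fit(X).explained_variance_ratio_[0]

# With standard scaling, variance is spread across components.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest component count whose cumulative variance reaches each threshold.
components_needed = {
    threshold: int(np.searchsorted(cumulative, threshold) + 1)
    for threshold in (0.90, 0.95, 0.99)
}
```

The `components_needed` mapping gives the positions where the 90, 95, and 99 percent reference lines would cross the cumulative curve.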
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x_df` | `DataFrame` | Feature frame to decompose. | required |
| `title` | `str` | Plot title. | `'PCA cumulative explained variance'` |
Returns:
| Type | Description |
|---|---|
| `PCA` | The PCA instance fitted on the scaled features. |
Source code in src/featurely/decomposition.py
scan_pca_components(x_df, y, component_grid, cv=10)
Cross-validate a linear model on the first n principal components.
PCA is fit once at the largest grid value; truncating to the first n columns of the projection is equivalent to fitting PCA with n_components=n, so the scan avoids refitting for every grid point.
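The truncation shortcut can be checked directly. The sketch below is written under assumptions rather than taken from the library: it uses synthetic data, scales the features before PCA, takes `LinearRegression` as the linear model, and forces `svd_solver="full"` so both fits use the exact decomposition. The comparison takes absolute values so that a sign flip in a component would not count as a difference.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # assumption: scan scales first

component_grid = [2, 4, 6]

# Fit PCA once at the largest grid value and keep the full projection.
projected = PCA(
    n_components=max(component_grid), svd_solver="full"
).fit_transform(X_scaled)

# Refit with n_components=2 and compare against the first two columns.
direct = PCA(n_components=2, svd_solver="full").fit_transform(X_scaled)
same_projection = np.allclose(np.abs(projected[:, :2]), np.abs(direct))

# Scan the grid by slicing columns instead of refitting PCA each time.
mean_r2 = {
    n: cross_val_score(
        LinearRegression(), projected[:, :n], y, cv=5, scoring="r2"
    ).mean()
    for n in component_grid
}
```

Slicing works because principal components are ordered by variance and each one is defined independently of how many later components are kept.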
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x_df` | `DataFrame` | Feature frame to decompose. | required |
| `y` | `Series` | Target series. | required |
| `component_grid` | `list[int]` | Component counts to evaluate; entries larger than the feature count are skipped. | required |
| `cv` | `int` | Number of cross-validation folds. | `10` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A frame with one row per component count. |
Source code in src/featurely/decomposition.py
plot_pca_component_scan(results_df, title='CV R2 by number of PCA components')
Plot the component scan curve and return the best component count.
The best count maximizes mean CV R2; the shaded band shows one standard deviation across folds, a visual check on whether nearby counts are practically equivalent.
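The selection rule and the equivalence check can also be expressed numerically. In this sketch the column names `n_components`, `mean_r2`, and `std_r2` are hypothetical (the actual result columns are not listed on this page), and the scores are made-up illustrative values, not real results.

```python
import pandas as pd

# Illustrative scan results; column names and values are assumptions.
results_df = pd.DataFrame(
    {
        "n_components": [2, 4, 6, 8],
        "mean_r2": [0.61, 0.74, 0.75, 0.73],
        "std_r2": [0.05, 0.04, 0.04, 0.06],
    }
)

# Best count: the row with the highest mean cross-validated R2.
best_row = results_df.loc[results_df["mean_r2"].idxmax()]
best_n = int(best_row["n_components"])

# Counts whose mean falls within one standard deviation of the best mean;
# this mirrors reading the shaded band by eye.
lower_bound = best_row["mean_r2"] - best_row["std_r2"]
near_best = results_df.loc[
    results_df["mean_r2"] >= lower_bound, "n_components"
].tolist()
```

When several counts land in `near_best`, the smaller ones give a simpler model at practically the same score.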
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `results_df` | `DataFrame` | Scan results from | required |
| `title` | `str` | Plot title. | `'CV R2 by number of PCA components'` |
Returns:
| Type | Description |
|---|---|
| `int` | The component count with the highest mean CV R2. |
Source code in src/featurely/decomposition.py