Decomposition

Polynomial expansion, PCA variance plots, and cross-validated component selection.

featurely.decomposition

Polynomial expansion and PCA component selection.

Degree-2 polynomial expansion generates every squared term and pairwise product, so d input columns grow to d + d(d+1)/2 and a few dozen inputs quickly become hundreds of correlated columns. PCA rotates that expanded set into orthogonal components ordered by variance, and cross-validation over component counts finds how many are worth keeping. Note that the standard scaler and PCA are fit on the full dataset before cross-validation; both are unsupervised (they never see the target), so any optimism this introduces is mild, but a strict production pipeline would fit them inside each fold.
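The stricter per-fold variant mentioned above can be sketched with a scikit-learn pipeline, so that scaling and PCA are refit on the training portion of every fold. This is a minimal sketch on synthetic data and is not part of featurely: the helper name `cv_r2_with_pca_in_fold`, the data, and the component count are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def cv_r2_with_pca_in_fold(x, y, n_components, cv=5):
    """Hypothetical helper: refit the scaler and PCA inside every CV fold."""
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LinearRegression(),
    )
    # cross_val_score clones and fits the whole pipeline per fold, so the
    # held-out rows never influence the scaling or the rotation.
    return cross_val_score(model, x, y, cv=cv, scoring="r2")


# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 4))
target = raw[:, 0] * raw[:, 1] + raw[:, 2] ** 2 + rng.normal(scale=0.1, size=200)

# 4 inputs at degree 2 without bias give 4 + 4 * 5 / 2 = 14 columns.
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(raw)
fold_scores = cv_r2_with_pca_in_fold(expanded, target, n_components=10)
print(fold_scores.mean())
```

The trade-off is cost: the decomposition is recomputed once per fold and per component count, which is the refitting that the single-fit scan on this page avoids.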

make_polynomial_features(df, feature_cols, degree=2, include_bias=False)

Return the polynomial expansion of the selected columns as a frame.

Column names come from scikit-learn but are sanitized for CSV round trips: spaces (products) become _x_ and carets (powers) become _pow.

Parameters:

Name | Type | Description | Default
df | DataFrame | Input frame; not modified. | required
feature_cols | list[str] | Columns to expand. | required
degree | int | Polynomial degree. | 2
include_bias | bool | When True, include the constant bias column. | False

Returns:

Type | Description
DataFrame | A frame of expanded polynomial terms.

Source code in src/featurely/decomposition.py
def make_polynomial_features(
    df: pd.DataFrame,
    feature_cols: list[str],
    degree: int = 2,
    include_bias: bool = False,
) -> pd.DataFrame:
    """Return the polynomial expansion of the selected columns as a frame.

    Column names come from scikit-learn but are sanitized for CSV round
    trips: spaces (products) become ``_x_`` and carets (powers) become
    ``_pow``.

    Args:
        df: Input frame; not modified.
        feature_cols: Columns to expand.
        degree: Polynomial degree.
        include_bias: When True, include the constant bias column.

    Returns:
        A frame of expanded polynomial terms.
    """

    poly = PolynomialFeatures(degree=degree, include_bias=include_bias)
    expanded = poly.fit_transform(df[list(feature_cols)])
    names = [name.replace(" ", "_x_").replace("^", "_pow") for name in poly.get_feature_names_out(list(feature_cols))]

    return pd.DataFrame(expanded, columns=names, index=df.index)

plot_pca_variance(x_df, title='PCA cumulative explained variance')

Plot cumulative explained variance and return the fitted PCA.

Features are standard-scaled first; PCA directions are meaningless when columns live on wildly different scales. Reference lines mark the 90, 95, and 99 percent variance thresholds.

Parameters:

Name | Type | Description | Default
x_df | DataFrame | Feature frame to decompose. | required
title | str | Plot title. | 'PCA cumulative explained variance'

Returns:

Type | Description
PCA | The PCA instance fitted on the scaled features.
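The threshold annotations reduce to one lookup on the cumulative curve: the first position where the running total reaches the threshold, plus one to turn an index into a count. The sketch below mirrors the `np.searchsorted` call in the function's source; the ratios are made up rather than taken from a fitted PCA, and `components_for` is a hypothetical helper.

```python
import numpy as np

# Made-up explained-variance ratios for five components.
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
cumulative = np.cumsum(ratios)  # roughly 0.55, 0.80, 0.92, 0.97, 1.00


def components_for(threshold, cumulative):
    """Hypothetical helper: smallest component count reaching `threshold`."""
    # searchsorted returns the first index whose cumulative value is >= threshold;
    # adding 1 converts the zero-based index into a component count.
    return int(np.searchsorted(cumulative, threshold) + 1)


for threshold in (0.90, 0.95, 0.99):
    print(threshold, components_for(threshold, cumulative))
# 0.9 3
# 0.95 4
# 0.99 5
```

With the returned PCA instance, the same lookup can be run on `np.cumsum(pca.explained_variance_ratio_)` to get the counts as numbers instead of reading them off the plot.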

Source code in src/featurely/decomposition.py
def plot_pca_variance(
    x_df: pd.DataFrame,
    title: str = "PCA cumulative explained variance",
) -> PCA:
    """Plot cumulative explained variance and return the fitted PCA.

    Features are standard-scaled first; PCA directions are meaningless when
    columns live on wildly different scales. Reference lines mark the 90,
    95, and 99 percent variance thresholds.

    Args:
        x_df: Feature frame to decompose.
        title: Plot title.

    Returns:
        The PCA instance fitted on the scaled features.
    """

    x = StandardScaler().fit_transform(x_df)
    pca = PCA().fit(x)
    cumulative = np.cumsum(pca.explained_variance_ratio_)

    _, ax = plt.subplots(figsize=(8, 4))
    ax.plot(np.arange(1, len(cumulative) + 1), cumulative, linewidth=1.5)

    for threshold in (0.90, 0.95, 0.99):
        n_at = int(np.searchsorted(cumulative, threshold) + 1)
        ax.axhline(threshold, color="gray", linewidth=0.6, linestyle="--")

        ax.annotate(
            f"{threshold:.0%} at n = {n_at}",
            xy=(n_at, threshold),
            xytext=(n_at + len(cumulative) * 0.03, threshold - 0.05),
            fontsize=8,
            arrowprops={"arrowstyle": "->", "lw": 0.6},
        )

    ax.set_xlabel("Number of components")
    ax.set_ylabel("Cumulative explained variance")
    ax.set_title(title)
    plt.tight_layout()
    show_figure()

    return pca

scan_pca_components(x_df, y, component_grid, cv=10)

Cross-validate a linear model on the first n principal components.

PCA is fit once at the largest grid value; truncating to the first n columns of the projection is equivalent to fitting PCA with n_components=n, so the scan avoids refitting for every grid point.

Parameters:

Name | Type | Description | Default
x_df | DataFrame | Feature frame to decompose. | required
y | Series | Target series. | required
component_grid | list[int] | Component counts to evaluate; entries larger than the feature count are skipped. | required
cv | int | Number of cross-validation folds. | 10

Returns:

Type | Description
DataFrame | A frame with one row per component count: n_components, mean_r2, std_r2, and the per-fold scores.
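The truncation claim can be checked directly: project once with the larger component count, slice off the leading columns, and compare against a separate fit with the smaller count. This is a sketch on synthetic data; passing `svd_solver="full"` is an assumption made here so that both fits use the same exact solver, whereas `scan_pca_components` leaves the solver at its default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features, made up for illustration.
rng = np.random.default_rng(0)
x = StandardScaler().fit_transform(rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6)))

# One fit at the largest count, then a separate fit at a smaller count.
wide = PCA(n_components=5, svd_solver="full").fit_transform(x)
narrow = PCA(n_components=2, svd_solver="full").fit_transform(x)

# Components are ordered by explained variance, so the leading columns agree.
same = np.allclose(wide[:, :2], narrow)
print(same)  # True
```

With a solver that approximates the decomposition, the two projections may agree only approximately, so the equivalence is best read as exact for exact solvers.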

Source code in src/featurely/decomposition.py
def scan_pca_components(
    x_df: pd.DataFrame,
    y: pd.Series,
    component_grid: list[int],
    cv: int = 10,
) -> pd.DataFrame:
    """Cross-validate a linear model on the first n principal components.

    PCA is fit once at the largest grid value; truncating to the first n
    columns of the projection is equivalent to fitting PCA with
    n_components=n, so the scan avoids refitting for every grid point.

    Args:
        x_df: Feature frame to decompose.
        y: Target series.
        component_grid: Component counts to evaluate; entries larger than
            the feature count are skipped.
        cv: Number of cross-validation folds.

    Returns:
        A frame with one row per component count: ``n_components``,
        ``mean_r2``, ``std_r2``, and the per-fold ``scores``.
    """

    x = StandardScaler().fit_transform(x_df)
    # PCA can return at most min(n_samples, n_features) components.
    max_n = min(max(component_grid), *x.shape)
    projected = PCA(n_components=max_n).fit_transform(x)

    rows = []

    for n in component_grid:
        if n > max_n:
            continue

        scores = cross_val_score(LinearRegression(), projected[:, :n], y, cv=cv, scoring="r2")

        rows.append(
            {
                "n_components": n,
                "mean_r2": scores.mean(),
                "std_r2": scores.std(),
                "scores": scores,
            }
        )

        print(f"n = {n:>4}: mean R2 = {scores.mean():.4f} ± {scores.std():.4f}")

    return pd.DataFrame(rows)

plot_pca_component_scan(results_df, title='CV R2 by number of PCA components')

Plot the component scan curve and return the best component count.

The best count maximizes mean CV R2; the shaded band shows one standard deviation across folds, a visual check on whether nearby counts are practically equivalent.

Parameters:

Name | Type | Description | Default
results_df | DataFrame | Scan results from scan_pca_components. | required
title | str | Plot title. | 'CV R2 by number of PCA components'

Returns:

Type | Description
int | The component count with the highest mean CV R2.
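The selection itself is an argmax over the mean scores, and the one-standard-deviation band can also be read numerically. The sketch below uses made-up scan results with the scan's column names (the per-fold scores column is omitted); the "smallest count within one standard deviation of the best" rule is an illustrative assumption, not something `plot_pca_component_scan` computes.

```python
import numpy as np
import pandas as pd

# Made-up scan results, for illustration only.
results_df = pd.DataFrame(
    {
        "n_components": [5, 10, 20, 40],
        "mean_r2": [0.61, 0.72, 0.74, 0.73],
        "std_r2": [0.05, 0.04, 0.03, 0.04],
    }
)

n_vals = results_df["n_components"].to_numpy()
means = results_df["mean_r2"].to_numpy()
stds = results_df["std_r2"].to_numpy()

# What the function returns: the count with the highest mean CV R2.
best_idx = int(np.argmax(means))
best_n = int(n_vals[best_idx])

# Hypothetical follow-up: the smallest count whose mean lies inside the
# best count's one-standard-deviation band.
floor = means[best_idx] - stds[best_idx]
smallest_close_n = int(n_vals[means >= floor].min())

print(best_n, smallest_close_n)  # 20 10
```

Here 10 components score within one standard deviation of the best mean, so a reader might reasonably prefer the smaller model even though 20 maximizes the mean.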

Source code in src/featurely/decomposition.py
def plot_pca_component_scan(
    results_df: pd.DataFrame,
    title: str = "CV R2 by number of PCA components",
) -> int:
    """Plot the component scan curve and return the best component count.

    The best count maximizes mean CV R2; the shaded band shows one standard
    deviation across folds, a visual check on whether nearby counts are
    practically equivalent.

    Args:
        results_df: Scan results from ``scan_pca_components``.
        title: Plot title.

    Returns:
        The component count with the highest mean CV R2.
    """

    n_vals = results_df["n_components"].values
    means = results_df["mean_r2"].values
    stds = results_df["std_r2"].values

    best_idx = int(np.argmax(means))
    best_n = int(n_vals[best_idx])

    _, ax = plt.subplots(figsize=(8, 4))
    ax.plot(n_vals, means, marker="o", linewidth=1.5)
    ax.fill_between(n_vals, means - stds, means + stds, alpha=0.2)
    ax.axvline(best_n, linewidth=1, linestyle="--")

    ax.annotate(
        f"best n = {best_n}\nR2 = {means[best_idx]:.4f}",
        xy=(best_n, means[best_idx]),
        xytext=(best_n + max(n_vals) * 0.05, means[best_idx] - stds[best_idx]),
        fontsize=8,
        arrowprops={"arrowstyle": "->", "lw": 0.6},
    )

    ax.set_xlabel("Number of components")
    # The fold count is chosen in scan_pca_components, so the label does not hardcode it.
    ax.set_ylabel("Mean R2 score (cross-validated)")
    ax.set_title(title)
    plt.tight_layout()
    show_figure()

    return best_n