Getting started

Installation

pip install featurely

featurely requires Python 3.10 or newer and depends on numpy, pandas, matplotlib, scipy, statsmodels, and scikit-learn.

Core workflow

featurely is built around a screen-then-commit loop: build candidate features, test whether they explain variance your current model misses, and keep only the winners.

1. Establish a baseline

import pandas as pd
import featurely as fl

df = pd.read_csv("my_data.csv")
target = "price"
features = [c for c in df.columns if c != target]

# Pass None as the first argument: there are no earlier results to extend yet
results = fl.add_pipeline_step(
    None, "raw", df[features], df[target],
    results_path="pipeline-results.pkl",  # persisted; reruns upsert by stage
)

2. Build candidate features

Every builder returns a new DataFrame of candidates and leaves the input untouched:

# Distance to anchor points (any dict of name -> (lat, lon))
distances = fl.compute_city_distances(df, cities={"downtown": (40.71, -74.01)})

# Per-bin summary statistics of other features
aggregates = fl.compute_bin_aggregates(df, "latitude", ["income"], n_bins=10)

# Cluster membership and centroid distance
clusters = fl.compute_kmeans_features(df, ["latitude", "longitude"], k=6, prefix="geo")

# Kernel-weighted neighborhood averages
smoothed = fl.compute_spatial_smoothed(df, ["income"], lat_col="latitude", lon_col="longitude")
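To make the distance and smoothing builders more concrete, here is a minimal standalone sketch of what a distance-to-anchor feature and a kernel-weighted neighborhood average can look like. It is not featurely's implementation: the haversine formula, the Gaussian kernel, the `bandwidth_km` parameter, and both helper names are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def gaussian_smoothed(points, values, bandwidth_km=5.0):
    """Gaussian-kernel weighted average of `values` around each point.

    Hypothetical helper. Each row's own value is included with weight 1;
    whether featurely includes or excludes it is an assumption here.
    """
    smoothed = []
    for lat_i, lon_i in points:
        num = den = 0.0
        for (lat_j, lon_j), v in zip(points, values):
            d = haversine_km(lat_i, lon_i, lat_j, lon_j)
            w = math.exp(-0.5 * (d / bandwidth_km) ** 2)  # nearby rows weigh more
            num += w * v
            den += w
        smoothed.append(num / den)
    return smoothed


# Toy rows: two close together, one far away
points = [(40.71, -74.01), (40.72, -74.00), (41.50, -73.00)]
income = [50.0, 70.0, 20.0]
downtown = (40.71, -74.01)

dist_downtown = [haversine_km(lat, lon, *downtown) for lat, lon in points]
income_smoothed = gaussian_smoothed(points, income)
```

In this toy data the two nearby rows pull each other's smoothed income toward their shared average, while the distant row keeps essentially its own value. The double loop is quadratic in the number of rows, so a real implementation would likely use a spatial index or vectorized distances.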

3. Screen candidates statistically

The candidate scan correlates each candidate against the residuals of a baseline linear model, then applies Benjamini-Hochberg false discovery rate correction:

candidates = pd.concat([distances, aggregates], axis=1)

scan = fl.run_candidate_scan(df, candidates, target=target)
significant = fl.plot_candidate_scan(scan, title="Candidate scan")

# Keep only the candidates flagged significant after FDR correction
keep = [name for name, is_sig in significant.items() if is_sig]
df = pd.concat([df, candidates[keep]], axis=1)

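For grouped feature sets that act jointly (one-hot encodings, cluster memberships), per-column correlations can understate their value because no single column carries the whole signal. Instead, compare the model with and without the whole set using cross-validation and a paired t-test on the fold scores.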
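As a rough illustration of the grouped comparison, the sketch below computes a paired t statistic from per-fold scores of two models evaluated on the same folds. The fold scores are made-up numbers for demonstration, the helper name is hypothetical, and nothing here reflects how featurely runs the comparison.

```python
import math
import statistics


def paired_t_statistic(scores_with, scores_without):
    """t statistic and degrees of freedom for paired per-fold score differences."""
    diffs = [a - b for a, b in zip(scores_with, scores_without)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Illustrative R² per fold; both models must be scored on identical folds
r2_without = [0.60, 0.62, 0.58, 0.61, 0.59]
r2_with = [0.63, 0.66, 0.60, 0.65, 0.61]

t_stat, dof = paired_t_statistic(r2_with, r2_without)

# Two-sided 5% critical value of the t distribution with 4 degrees of freedom
T_CRIT_4DOF = 2.776
set_helps = t_stat > T_CRIT_4DOF
```

Pairing matters because fold difficulty varies: differencing scores fold by fold removes that shared variation. If SciPy is available, `scipy.stats.ttest_rel(r2_with, r2_without)` returns the same statistic together with a p-value.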

4. Track progress

results = fl.add_pipeline_step(
    results, "+ location", df.drop(columns=target), df[target],
    results_path="pipeline-results.pkl",
)
fl.plot_pipeline_steps(results, results_path="pipeline-results.pkl")

Each stage prints mean cross-validated R² with its standard deviation and percent improvement over the raw baseline.
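The per-stage summary comes down to simple arithmetic on fold scores. The sketch below assumes "percent improvement" means relative change in mean R² against the raw stage and that the spread is a population standard deviation; both are assumptions about the report, the scores are illustrative, and `summarize_stage` is a hypothetical helper.

```python
import statistics


def summarize_stage(fold_scores, baseline_mean=None):
    """Mean, standard deviation, and percent improvement over a baseline mean."""
    mean = statistics.fmean(fold_scores)
    std = statistics.pstdev(fold_scores)  # population std (ddof=0), an assumption
    if baseline_mean is None:
        improvement = None  # the baseline stage has nothing to compare against
    else:
        improvement = 100.0 * (mean - baseline_mean) / baseline_mean
    return mean, std, improvement


# Illustrative per-fold R² values for two stages
raw_mean, raw_std, _ = summarize_stage([0.60, 0.62, 0.58, 0.61, 0.59])
loc_mean, loc_std, loc_gain = summarize_stage(
    [0.63, 0.66, 0.60, 0.65, 0.61], baseline_mean=raw_mean
)
```

Here the mean rises from 0.60 to 0.63, a relative improvement of about 5 percent. Reading the gain alongside the standard deviation helps separate real progress from fold-to-fold noise.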

Complete worked example

The fsa-feature-engineering-challenge notebooks demonstrate the full loop across ten stages: outlier cleaning, monotonic transforms, interaction features, a censoring probability feature, location encodings, bin aggregates, clustering, spatial smoothing, and polynomial expansion with PCA component selection.