Smoothing

Gaussian kernel spatial smoothing of features.

featurely.smoothing

Spatial kernel smoothing of features.

Smoothing replaces each row's feature value with a weighted average over its spatial neighborhood, suppressing row-level noise while preserving regional structure. This is Nadaraya-Watson kernel regression with the kernel truncated to each row's n_neighbors nearest neighbors for tractability. The function reads only the coordinate columns and the columns listed in features, so as long as the target is not among them the candidates introduce no target leakage.
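
Written out, the truncated estimator described above takes the following form. The notation is ours, not the library's: $x_i$ is the feature value in row $i$, $d_{ij}$ is the distance between rows $i$ and $j$ in coordinate space, $h$ is the bandwidth, and $N_k(i)$ is the set of the $k$ = n_neighbors rows nearest to row $i$, which includes $i$ itself.

```latex
\tilde{x}_i
  = \frac{\sum_{j \in N_k(i)} K_h(d_{ij})\, x_j}
         {\sum_{j \in N_k(i)} K_h(d_{ij})},
\qquad
K_h(d) = \exp\!\left(-\frac{1}{2}\left(\frac{d}{h}\right)^{2}\right)
```

Since $K_h(0) = 1$ is the maximum of the kernel, a row's own value never receives a smaller weight than any of its neighbors' values.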

compute_spatial_smoothed(df, features, lat_col='Latitude', lon_col='Longitude', n_neighbors=50, bandwidth=None, prefix='smooth')

Return Gaussian-kernel smoothed feature candidates.

For each row, the smoothed value is a Gaussian-weighted average of the feature over its n_neighbors nearest points in latitude-longitude space (each row is its own nearest neighbor, so the original value gets the largest single weight).
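
As a minimal sketch of the weighting for a single row, the snippet below computes normalized Gaussian weights by hand using only the standard library. The helper name gaussian_weights and the toy distances and values are illustrative assumptions, not part of featurely.

```python
import math


def gaussian_weights(dists, bandwidth):
    """Normalized Gaussian kernel weights for one row's neighbor distances."""
    raw = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in dists]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical neighborhood: the row itself (distance 0) plus two neighbors.
dists = [0.0, 1.0, 2.0]      # distances in coordinate units
values = [10.0, 20.0, 30.0]  # feature values, in the same neighbor order

weights = gaussian_weights(dists, bandwidth=1.0)
smoothed = sum(w * v for w, v in zip(weights, values))

print([round(w, 3) for w in weights])  # the first (self) weight is the largest
print(round(smoothed, 2))              # lies between the row's value and its neighbors'
```

The row's own value dominates, and the result is pulled toward the neighbors in proportion to how close they are.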

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input frame; not modified. | required |
| features | list[str] | Columns to smooth. | required |
| lat_col | str | Name of the latitude column. | 'Latitude' |
| lon_col | str | Name of the longitude column. | 'Longitude' |
| n_neighbors | int | Neighborhood size for the truncated kernel; the row itself counts as one of the neighbors. | 50 |
| bandwidth | float \| None | Gaussian kernel width in coordinate units. When None it defaults to the median, over all rows, of the distance to the farthest retained neighbor, so a single width is scaled to the data's overall point density. | None |
| prefix | str | Prefix for output column names, e.g. smooth_{col}. | 'smooth' |
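
The default bandwidth rule can be sketched with the standard library alone. The function default_bandwidth and the sample coordinates are hypothetical; the library itself obtains the distances from a nearest-neighbor search rather than the brute-force loop used here.

```python
import math
import statistics


def default_bandwidth(coords, n_neighbors):
    """Median, over all rows, of the distance to the farthest retained neighbor."""
    kth_dists = []
    for p in coords:
        # Distances from p to every point, including itself (distance 0).
        d = sorted(math.dist(p, q) for q in coords)
        kth_dists.append(d[n_neighbors - 1])
    return statistics.median(kth_dists)


# Hypothetical (lat, lon) points: a tight cluster plus one isolated point.
coords = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (5.0, 5.0)]
h = default_bandwidth(coords, n_neighbors=3)
print(h)
```

The result is one number shared by every row. In this toy data it is set by the cluster spacing, so the isolated point's neighbors, roughly 7 units away, receive weights that are effectively zero and its smoothed value stays close to its own.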

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A frame of smoothed candidate columns, one {prefix}_{col} column per entry in features, indexed like df. |

Source code in src/featurely/smoothing.py
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def compute_spatial_smoothed(
    df: pd.DataFrame,
    features: list[str],
    lat_col: str = "Latitude",
    lon_col: str = "Longitude",
    n_neighbors: int = 50,
    bandwidth: float | None = None,
    prefix: str = "smooth",
) -> pd.DataFrame:
    """Return Gaussian-kernel smoothed feature candidates.

    For each row, the smoothed value is a Gaussian-weighted average of the
    feature over its ``n_neighbors`` nearest points in latitude-longitude
    space (each row is its own nearest neighbor, so the original value gets
    the largest single weight).

    Args:
        df: Input frame; not modified.
        features: Columns to smooth.
        lat_col: Name of the latitude column.
        lon_col: Name of the longitude column.
        n_neighbors: Neighborhood size for the truncated kernel.
        bandwidth: Gaussian kernel width in coordinate units. When None it
            defaults to the median, over all rows, of the distance to the
            farthest retained neighbor, so a single width is scaled to the
            data's overall point density.
        prefix: Prefix for output column names, e.g. ``smooth_{col}``.

    Returns:
        A frame of smoothed candidate columns.
    """

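    # Distances are plain Euclidean distances on the raw (lat, lon) values,
    # which is the default metric of NearestNeighbors. Every row is part of
    # the fitted set, so its neighborhood starts at distance 0.
    coords = df[[lat_col, lon_col]].to_numpy()
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(coords)
    dists, idx = nn.kneighbors(coords)

    if bandwidth is None:
        # One global width: the median over rows of the k-th neighbor distance.
        bandwidth = float(np.median(dists[:, -1]))
        print(f"Using adaptive bandwidth: {bandwidth:.4f} degrees")
    if not bandwidth > 0:
        # A zero width (e.g. many duplicated coordinates) would yield NaN weights.
        raise ValueError("bandwidth must be positive")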

    # Gaussian kernel weights, normalized per row so each smoothed value is
    # a proper weighted average of its neighborhood.
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)

    out = {}

    for col in features:
        vals = df[col].to_numpy()
        # vals[idx] gathers each row's neighbor values into an (n_rows, k) array.
        out[f"{prefix}_{col}"] = (weights * vals[idx]).sum(axis=1)

    return pd.DataFrame(out, index=df.index)
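
The gather-and-average step at the end of the listing can be checked in isolation. The sketch below uses hand-written neighbor indices and distances (hypothetical toy values standing in for the output of the neighbor search) and shows that per-row normalization leaves a constant column unchanged.

```python
import numpy as np

# Hypothetical toy data: 4 rows, each with k = 2 retained neighbors.
vals = np.array([1.0, 2.0, 4.0, 8.0])             # one feature column
idx = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])  # neighbor indices, self first
dists = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 2.0], [0.0, 2.0]])
bandwidth = 1.0

# Same weight computation as in the listing.
weights = np.exp(-0.5 * (dists / bandwidth) ** 2)
weights /= weights.sum(axis=1, keepdims=True)     # each row now sums to 1

# Integer-array indexing: row i of neighbor_vals holds the values of row i's neighbors.
neighbor_vals = vals[idx]                         # shape (4, 2)
smoothed = (weights * neighbor_vals).sum(axis=1)

# With normalized weights, a constant column comes back unchanged.
const_smoothed = (weights * np.full(4, 3.0)[idx]).sum(axis=1)

print(smoothed)
print(const_smoothed)
```

Each smoothed value lies between the row's own value and its neighbor's, and rows whose neighbor is farther away (the last two here) move less.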