Outliers

IQR-based outlier handling: clipping to the IQR fences, log1p transformation, and KNN imputation.

featurely.outliers

impute_outliers_with_knn(df, features, n_neighbors=7, threshold=1.5)

Replace IQR outliers with NaN, then impute them with KNN.

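Values outside [Q1 - threshold * IQR, Q3 + threshold * IQR] are set to NaN and then reconstructed from the n_neighbors most similar rows, where similarity is measured over the columns in features only. This preserves multivariate structure better than clipping when outliers are recording errors rather than real extremes. Missing values already present in features are imputed in the same pass.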
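To make the fence arithmetic concrete, here is a minimal sketch on a made-up series. The sample values are hypothetical and chosen only so the quartiles are easy to check by hand; the quantile and mask expressions mirror the ones in the source listing.

```python
import pandas as pd

# Hypothetical sample: nine plausible readings and one suspicious spike.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 95], dtype=float)

threshold = 1.5
q1 = values.quantile(0.25)        # 11.0
q3 = values.quantile(0.75)        # 12.75
iqr = q3 - q1                     # 1.75
lower = q1 - threshold * iqr      # 8.375
upper = q3 + threshold * iqr      # 15.375

# Anything strictly outside the fences is flagged as an outlier.
outlier_mask = (values < lower) | (values > upper)
print(f"fences: [{lower:.3f}, {upper:.3f}]")
print(values[outlier_mask].tolist())  # only the spike at 95.0 is flagged
```

A larger threshold widens the fences and flags fewer values; 1.5 is the conventional default for this rule.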

Parameters:

Name         Type       Description                                       Default
df           DataFrame  Input frame; not modified.                        required
features     list[str]  Columns to screen for outliers and impute.        required
n_neighbors  int        Number of neighbor rows used by the KNN imputer.  7
threshold    float      IQR multiplier that defines the outlier fences.   1.5

Returns:

Type       Description
DataFrame  A copy of df with outlier values imputed.
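The mask-then-impute idea can be sketched without the package itself. The frame below is hypothetical, and the code calls scikit-learn's KNNImputer directly rather than featurely, so treat it as an illustration of the technique under those assumptions, not as the library's exact behavior.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame: weight tracks height, except the last row, where the
# weight looks like a data-entry error.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 171.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0, 700.0],
})
features = ["height", "weight"]

# Step 1: replace values outside the IQR fences with NaN, column by column.
masked = df.copy()
for col in features:
    q1 = masked[col].quantile(0.25)
    q3 = masked[col].quantile(0.75)
    iqr = q3 - q1
    is_outlier = (masked[col] < q1 - 1.5 * iqr) | (masked[col] > q3 + 1.5 * iqr)
    masked.loc[is_outlier, col] = np.nan

# Step 2: fill each NaN from the rows that are closest on the observed columns.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(masked[features]),
    columns=features,
    index=df.index,
)
print(imputed)
```

Here the masked weight is rebuilt from the two rows whose height is closest to 171, so the replacement stays consistent with the height column instead of landing on a fence value.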

Source code in src/featurely/outliers.py
def impute_outliers_with_knn(
    df: pd.DataFrame,
    features: list[str],
    n_neighbors: int = 7,
    threshold: float = 1.5,
) -> pd.DataFrame:
    """Replace IQR outliers with NaN, then impute them with KNN.

    Values outside ``[Q1 - threshold * IQR, Q3 + threshold * IQR]`` are
    treated as missing and reconstructed from the ``n_neighbors`` most
    similar rows, which preserves multivariate structure better than
    clipping when outliers are recording errors rather than real extremes.

    Args:
        df: Input frame; not modified.
        features: Columns to screen for outliers and impute.
        n_neighbors: Number of neighbor rows used by the KNN imputer.
        threshold: IQR multiplier that defines the outlier fences.

    Returns:
        A copy of ``df`` with outlier values imputed.
    """

    result = df.copy()

    for col in features:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        outlier_mask = (result[col] < lower) | (result[col] > upper)

        result.loc[outlier_mask, col] = np.nan
        print(f"{col}: {outlier_mask.sum():>4} outliers replaced with NaN")

    # This count also includes NaNs that were already present in ``features``.
    print(f"\nTotal NaN values before imputation: {result[features].isna().sum().sum()}")

    imputer = KNNImputer(n_neighbors=n_neighbors)
    result[features] = imputer.fit_transform(result[features])

    print(f"NaN values remaining after imputation: {result.isna().sum().sum()}")

    return result

clip_outliers(df, features, threshold=1.5)

Clip feature values to their IQR fences.

Winsorizes each column to [Q1 - threshold * IQR, Q3 + threshold * IQR]. Clipping keeps every row and caps the influence of extreme values, at the cost of piling clipped observations onto the fence values.

Parameters:

Name       Type       Description                                   Default
df         DataFrame  Input frame; not modified.                    required
features   list[str]  Columns to clip.                              required
threshold  float      IQR multiplier that defines the clip bounds.  1.5

Returns:

Type       Description
DataFrame  A copy of df with the selected columns clipped.
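The trade-off described above (every row is kept, but clipped values pile up on the fence) is easy to see on a small series. This is a sketch on hypothetical data using pandas' Series.clip, the same call as in the source listing.

```python
import pandas as pd

# Hypothetical sample with two extreme values at the top end.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50, 80], dtype=float)

threshold = 1.5
q1 = values.quantile(0.25)        # 3.5
q3 = values.quantile(0.75)        # 8.5
iqr = q3 - q1                     # 5.0
lower = q1 - threshold * iqr      # -4.0
upper = q3 + threshold * iqr      # 16.0

clipped = values.clip(lower=lower, upper=upper)

print(len(clipped) == len(values))     # no rows are dropped
print(int((clipped == upper).sum()))   # both extremes now sit on the upper fence
```

Both 50 and 80 become 16.0, so the information that one was larger than the other is lost. That is the cost of clipping compared with imputing or transforming.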

Source code in src/featurely/outliers.py
def clip_outliers(df: pd.DataFrame, features: list[str], threshold: float = 1.5) -> pd.DataFrame:
    """Clip feature values to their IQR fences.

    Winsorizes each column to ``[Q1 - threshold * IQR, Q3 + threshold * IQR]``.
    Clipping keeps every row and caps the influence of extreme values, at the
    cost of piling clipped observations onto the fence values.

    Args:
        df: Input frame; not modified.
        features: Columns to clip.
        threshold: IQR multiplier that defines the clip bounds.

    Returns:
        A copy of ``df`` with the selected columns clipped.
    """

    result = df.copy()

    for col in features:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr

        result[col] = result[col].clip(lower=lower, upper=upper)
        print(f"{col}: Outliers clipped to [{lower:.2f}, {upper:.2f}]")

    return result

transform_outliers(df, features, threshold=1.5)

Log-transform features that contain IQR outliers and are non-negative.

Applies log1p only to columns where outliers are present and all values are non-negative, compressing long right tails instead of discarding or capping them.

Parameters:

Name       Type       Description                                      Default
df         DataFrame  Input frame; not modified.                       required
features   list[str]  Columns to screen and potentially transform.     required
threshold  float      IQR multiplier that defines the outlier fences.  1.5

Returns:

Type       Description
DataFrame  A copy of df with qualifying columns log-transformed.
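The two conditions (outliers present, no negative values) can be illustrated with one column for each case. The frame and the helper has_iqr_outliers are hypothetical names introduced for this sketch; they are not part of featurely.

```python
import numpy as np
import pandas as pd

# Hypothetical columns covering the three possible outcomes.
df = pd.DataFrame({
    "income": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 1000.0],   # outlier, all values >= 0
    "balance": [-500.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0],  # outlier, but has a negative value
    "age": [30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0],        # no outliers
})


def has_iqr_outliers(s: pd.Series, threshold: float = 1.5) -> bool:
    """Return True if any value lies outside the IQR fences (illustrative helper)."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return bool(((s < q1 - threshold * iqr) | (s > q3 + threshold * iqr)).any())


result = df.copy()
for col in df.columns:
    # Transform only when both conditions hold; otherwise leave the column as is.
    if has_iqr_outliers(result[col]) and result[col].min() >= 0:
        result[col] = np.log1p(result[col])

print(result.round(3))
```

Only income is transformed. balance is left untouched because log1p is undefined for values below -1 and the function skips any column with negative values, and age is left untouched because it has nothing outside its fences.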

Source code in src/featurely/outliers.py
def transform_outliers(df: pd.DataFrame, features: list[str], threshold: float = 1.5) -> pd.DataFrame:
    """Log-transform features that contain IQR outliers and are non-negative.

    Applies ``log1p`` only to columns where outliers are present and all
    values are non-negative, compressing long right tails instead of
    discarding or capping them.

    Args:
        df: Input frame; not modified.
        features: Columns to screen and potentially transform.
        threshold: IQR multiplier that defines the outlier fences.

    Returns:
        A copy of ``df`` with qualifying columns log-transformed.
    """

    result = df.copy()

    for col in features:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        n_outliers = ((result[col] < lower) | (result[col] > upper)).sum()

        if n_outliers > 0:
            if result[col].min() >= 0:
                result[col] = np.log1p(result[col])
                print(f"{col}: {n_outliers:>4} outliers -> log-transformed")

            else:
                print(f"{col}: {n_outliers:>4} outliers -> skipped (contains negative values)")

        else:
            print(f"{col}: no outliers")

    return result