
Geo features

Haversine distances to anchor points, hand-rolled geohash encoding, and rotated coordinates.

featurely.geo

Location-based feature encodings for latitude and longitude.

These helpers turn raw coordinates into representations a linear model can use: distances to fixed anchor points, discrete spatial cells, and rotated axes. Raw latitude and longitude only let a linear model fit a single plane over the map; these encodings expose distance decay and neighborhood structure that the plane cannot capture.

haversine_distance(lat1, lon1, lat2, lon2)

Great-circle distance in kilometers between coordinate pairs.

The haversine formula treats Earth as a sphere, so distances can be off by up to roughly 0.5 percent relative to an ellipsoidal model, which is plenty accurate for feature-engineering distances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `lat1` | `float \| ndarray \| Series` | Latitude of the first point; scalar or array-like, degrees. | required |
| `lon1` | `float \| ndarray \| Series` | Longitude of the first point; scalar or array-like, degrees. | required |
| `lat2` | `float \| ndarray \| Series` | Latitude of the second point; scalar or array-like, degrees. | required |
| `lon2` | `float \| ndarray \| Series` | Longitude of the second point; scalar or array-like, degrees. | required |

Returns:

| Type | Description |
| --- | --- |
| `float \| ndarray` | Distance in kilometers, matching the broadcast shape of the inputs. |
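
For reference, the quantity being computed can be written out as follows. The symbols are notation chosen here rather than taken from the library: φ is latitude and λ is longitude, both in radians, and R is the Earth radius in kilometers.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\arcsin\!\left(\sqrt{a}\right)
```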

Source code in src/featurely/geo.py
def haversine_distance(
    lat1: float | np.ndarray | pd.Series,
    lon1: float | np.ndarray | pd.Series,
    lat2: float | np.ndarray | pd.Series,
    lon2: float | np.ndarray | pd.Series,
) -> float | np.ndarray:
    """Great-circle distance in kilometers between coordinate pairs.

    The haversine formula treats Earth as a sphere, so distances can be off
    by up to roughly 0.5 percent relative to an ellipsoidal model, which is
    plenty accurate for feature-engineering distances.

    Args:
        lat1: Latitude of the first point; scalar or array-like, degrees.
        lon1: Longitude of the first point; scalar or array-like, degrees.
        lat2: Latitude of the second point; scalar or array-like, degrees.
        lon2: Longitude of the second point; scalar or array-like, degrees.

    Returns:
        Distance in kilometers, matching the broadcast shape of the inputs.
    """

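    # Convert every input to a float array in radians; scalars become 0-d arrays.
    lat1 = np.radians(np.asarray(lat1, dtype=float))
    lon1 = np.radians(np.asarray(lon1, dtype=float))
    lat2 = np.radians(np.asarray(lat2, dtype=float))
    lon2 = np.radians(np.asarray(lon2, dtype=float))

    # Haversine term: the squared half-chord length between the two points.
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2

    # Clip guards against rounding that pushes a just above 1 for
    # near-antipodal points, which would make arcsin return NaN.
    return 2 * _EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))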
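
As a quick sanity check, the same formula can be evaluated for scalars with only the standard library. This is a minimal sketch rather than the library's implementation: `haversine_km` is a hypothetical name, and `EARTH_RADIUS_KM = 6371.0088` (the mean Earth radius) is an assumption, since the value of `_EARTH_RADIUS_KM` is not shown on this page.

```python
import math

# Assumed mean Earth radius; the module's _EARTH_RADIUS_KM may differ.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Scalar great-circle distance in kilometers (illustrative helper)."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2
    )
    # Clamp against rounding so asin never sees a value just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# One degree of latitude along a meridian is R * pi / 180, about 111.2 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)

# Paris to London, using approximate city-center coordinates.
paris_london = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)

print(f"{one_degree:.1f} km, {paris_london:.1f} km")
```

The one-degree case is a convenient unit test because its expected value follows directly from the radius, independent of any city coordinates.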

compute_city_distances(df, cities, lat_col='Latitude', lon_col='Longitude')

Return distance-to-anchor candidate features in kilometers.

Returns one column per anchor point plus dist_nearest_city, the row-wise minimum across anchors, which collapses the set into a single proximity measure. Many spatial outcomes decay with distance from activity centers, a pattern that a linear model cannot express with raw coordinates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Input frame with coordinate columns. | required |
| `cities` | `dict[str, tuple[float, float]]` | Mapping of anchor name to (latitude, longitude). Column names follow the pattern `dist_{name}`. | required |
| `lat_col` | `str` | Name of the latitude column. | `'Latitude'` |
| `lon_col` | `str` | Name of the longitude column. | `'Longitude'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A frame of distance columns plus `dist_nearest_city`. |

Source code in src/featurely/geo.py
def compute_city_distances(
    df: pd.DataFrame,
    cities: dict[str, tuple[float, float]],
    lat_col: str = "Latitude",
    lon_col: str = "Longitude",
) -> pd.DataFrame:
    """Return distance-to-anchor candidate features in kilometers.

    Returns one column per anchor point plus dist_nearest_city, the
    row-wise minimum across anchors, which collapses the set into a single
    proximity measure. Many spatial outcomes decay with distance from
    activity centers, a pattern that a linear model cannot express with raw
    coordinates.

    Args:
        df: Input frame with coordinate columns.
        cities: Mapping of anchor name to (latitude, longitude). Column
            names follow the pattern ``dist_{name}``.
        lat_col: Name of the latitude column.
        lon_col: Name of the longitude column.

    Returns:
        A frame of distance columns plus ``dist_nearest_city``.
    """

    out = {}

    for name, (city_lat, city_lon) in cities.items():
        out[f"dist_{name}"] = haversine_distance(df[lat_col], df[lon_col], city_lat, city_lon)

    frame = pd.DataFrame(out, index=df.index)
    frame["dist_nearest_city"] = frame.min(axis=1)

    return frame
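
The column layout can be illustrated without pandas. The sketch below is a dependency-free approximation under stated assumptions: the anchors, the rows, the helper `haversine_km`, and the radius constant are all hypothetical stand-ins, and plain dictionaries play the role of DataFrame rows.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed value; the module constant is not shown


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Scalar great-circle distance in kilometers (illustrative helper)."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((phi2 - phi1) / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical anchors and rows; names and coordinates are illustrative only.
cities = {"sf": (37.7749, -122.4194), "la": (34.0522, -118.2437)}
rows = [
    {"Latitude": 37.8044, "Longitude": -122.2712},  # Oakland, close to "sf"
    {"Latitude": 33.7701, "Longitude": -118.1937},  # Long Beach, close to "la"
]

features = []
for row in rows:
    # One dist_{name} entry per anchor, mirroring the column naming pattern.
    dists = {
        f"dist_{name}": haversine_km(row["Latitude"], row["Longitude"], lat, lon)
        for name, (lat, lon) in cities.items()
    }
    # Row-wise minimum over the anchor distances, like dist_nearest_city.
    dists["dist_nearest_city"] = min(dists.values())
    features.append(dists)

for feats in features:
    print({key: round(value, 1) for key, value in feats.items()})
```

Because dist_nearest_city is a minimum over the other columns, it only makes sense when every anchor distance is in the same unit, which holds here since all of them are in kilometers.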

encode_geohash(lat, lon, precision=4)

Encode one coordinate pair as a geohash string.

Geohashing interleaves bits from successive binary subdivisions of the longitude and latitude ranges, starting with longitude, then packs each group of 5 bits into a base32 character. Nearby points usually share a prefix (points on opposite sides of a cell boundary may not), so shorter hashes give coarser spatial cells: precision 4 cells are roughly 39 km by 19.5 km at the equator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `lat` | `float` | Latitude in degrees. | required |
| `lon` | `float` | Longitude in degrees. | required |
| `precision` | `int` | Number of base32 characters in the hash. | `4` |

Returns:

| Type | Description |
| --- | --- |
| `str` | The geohash string. |
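
The quoted cell size follows from the bit budget: each character contributes 5 bits, and the bits alternate between longitude and latitude with longitude first. The sketch below works through that arithmetic. `geohash_cell_size_km` is a hypothetical helper, and the kilometers-per-degree constant assumes a spherical Earth with a mean radius of 6371.0088 km.

```python
import math

# Length of one degree of arc on a sphere with an assumed mean radius.
KM_PER_DEGREE = 6371.0088 * math.pi / 180


def geohash_cell_size_km(precision: int) -> tuple[float, float]:
    """Approximate (east-west, north-south) cell size at the equator."""
    total_bits = 5 * precision
    lon_bits = math.ceil(total_bits / 2)  # longitude takes the first bit
    lat_bits = total_bits // 2
    width_deg = 360.0 / 2**lon_bits
    height_deg = 180.0 / 2**lat_bits
    return width_deg * KM_PER_DEGREE, height_deg * KM_PER_DEGREE


for p in (3, 4, 5):
    width, height = geohash_cell_size_km(p)
    print(f"precision {p}: {width:.1f} km x {height:.1f} km")
```

The east-west width shrinks by the cosine of the latitude away from the equator, so cells at high latitudes are narrower than these figures suggest.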

Source code in src/featurely/geo.py
def encode_geohash(lat: float, lon: float, precision: int = 4) -> str:
    """Encode one coordinate pair as a geohash string.

    Geohashing interleaves bits from successive binary subdivisions of the
    longitude and latitude ranges, starting with longitude, then packs each
    group of 5 bits into a base32 character. Nearby points usually share a
    prefix (points on opposite sides of a cell boundary may not), so
    shorter hashes give coarser spatial cells: precision 4 cells are
    roughly 39 km by 19.5 km at the equator.

    Args:
        lat: Latitude in degrees.
        lon: Longitude in degrees.
        precision: Number of base32 characters in the hash.

    Returns:
        The geohash string.
    """

    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits: list[int] = []
    use_lon = True

    while len(bits) < precision * 5:
        rng = lon_range if use_lon else lat_range
        value = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2

        if value >= mid:
            bits.append(1)
            rng[0] = mid

        else:
            bits.append(0)
            rng[1] = mid

        use_lon = not use_lon

    chars = []

    for i in range(0, len(bits), 5):
        idx = 0

        for bit in bits[i : i + 5]:
            idx = (idx << 1) | bit

        chars.append(_GEOHASH_BASE32[idx])

    return "".join(chars)

compute_geohash_cells(df, precision=4, min_cell_count=100, lat_col='Latitude', lon_col='Longitude')

Return one-hot geohash cell membership candidates.

Cells with fewer than min_cell_count rows are pooled into a shared "other" bucket so the linear model does not fit dummy coefficients to nearly empty cells. Membership indicators never use the target, so this encoding carries no target leakage. Cell counts, and therefore the output columns, depend on the rows in df.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Input frame with coordinate columns. | required |
| `precision` | `int` | Geohash length; higher values give smaller cells. | `4` |
| `min_cell_count` | `int` | Minimum rows per cell before pooling into "other". | `100` |
| `lat_col` | `str` | Name of the latitude column. | `'Latitude'` |
| `lon_col` | `str` | Name of the longitude column. | `'Longitude'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A frame of one-hot indicator columns named `gh{precision}_{cell}`. |

Source code in src/featurely/geo.py
def compute_geohash_cells(
    df: pd.DataFrame,
    precision: int = 4,
    min_cell_count: int = 100,
    lat_col: str = "Latitude",
    lon_col: str = "Longitude",
) -> pd.DataFrame:
    """Return one-hot geohash cell membership candidates.

    Cells with fewer than min_cell_count rows are pooled into a shared
    "other" bucket so the linear model does not fit dummy coefficients to
    nearly empty cells. Membership indicators never use the target, so this
    encoding carries no target leakage. Cell counts, and therefore the
    output columns, depend on the rows in df.

    Args:
        df: Input frame with coordinate columns.
        precision: Geohash length; higher values give smaller cells.
        min_cell_count: Minimum rows per cell before pooling into "other".
        lat_col: Name of the latitude column.
        lon_col: Name of the longitude column.

    Returns:
        A frame of one-hot indicator columns named ``gh{precision}_{cell}``.
    """

    hashes = [encode_geohash(lat, lon, precision) for lat, lon in zip(df[lat_col], df[lon_col], strict=False)]
    cells = pd.Series(hashes, index=df.index)

    counts = cells.value_counts()
    keep = counts[counts >= min_cell_count].index
    pooled = cells.where(cells.isin(keep), "other")

    return pd.get_dummies(pooled, prefix=f"gh{precision}", dtype=float)
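
The pooling step can be traced on a handful of labels. This is a minimal standard-library sketch of the count, filter, and one-hot sequence; the cell labels and the threshold of 3 are made up for illustration, and lists of dictionaries stand in for the DataFrame that the function returns.

```python
from collections import Counter

# Hypothetical precision-4 cell labels for ten rows.
cells = ["9q8y"] * 5 + ["9q9p"] * 3 + ["9q5c"] + ["dr5r"]
min_cell_count = 3

# Count rows per cell and keep only the cells that meet the threshold.
counts = Counter(cells)
keep = {cell for cell, n in counts.items() if n >= min_cell_count}
pooled = [cell if cell in keep else "other" for cell in cells]

# One indicator column per surviving label, named like gh{precision}_{cell}.
columns = sorted(set(pooled))
one_hot = [
    {f"gh4_{col}": float(label == col) for col in columns}
    for label in pooled
]

print(Counter(pooled))
print(one_hot[-1])
```

Since the surviving labels are derived from the rows passed in, encoding a second frame separately can produce a different set of columns. Aligning the second frame to the columns seen at fit time is one common way to handle that.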

compute_rotated_coordinates(df, angle_deg, lat_col='Latitude', lon_col='Longitude')

Return coordinate axes rotated by an arbitrary angle.

A linear model on raw latitude and longitude can already fit a plane tilted in any direction, but any step that treats each column separately (binning, a single-feature transform, a sparsity penalty) only sees the north-south and east-west axes. Rotating the frame gives those steps axes that run diagonally across the map, for example along a coastline or a mountain range. A 45 degree rotation reproduces the classic sum-and-difference encoding up to scale.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | Input frame with coordinate columns. | required |
| `angle_deg` | `float` | Rotation angle in degrees, counterclockwise. Column names embed the angle, e.g. `rot45_x` and `rot45_y`. | required |
| `lat_col` | `str` | Name of the latitude column. | `'Latitude'` |
| `lon_col` | `str` | Name of the longitude column. | `'Longitude'` |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A frame with the two rotated coordinate columns. |

Source code in src/featurely/geo.py
def compute_rotated_coordinates(
    df: pd.DataFrame,
    angle_deg: float,
    lat_col: str = "Latitude",
    lon_col: str = "Longitude",
) -> pd.DataFrame:
    """Return coordinate axes rotated by an arbitrary angle.

    A linear model on raw latitude and longitude can already fit a plane
    tilted in any direction, but any step that treats each column
    separately (binning, a single-feature transform, a sparsity penalty)
    only sees the north-south and east-west axes. Rotating the frame gives
    those steps axes that run diagonally across the map, for example along
    a coastline or a mountain range. A 45 degree rotation reproduces the
    classic sum-and-difference encoding up to scale.

    Args:
        df: Input frame with coordinate columns.
        angle_deg: Rotation angle in degrees, counterclockwise. Column names
            embed the angle, e.g. ``rot45_x`` and ``rot45_y``.
        lat_col: Name of the latitude column.
        lon_col: Name of the longitude column.

    Returns:
        A frame with the two rotated coordinate columns.
    """

    theta = np.radians(angle_deg)
    x = df[lon_col]
    y = df[lat_col]

    # Sanitize the angle for CSV-safe column names: rot-30.5 -> rotm30p5.
    label = f"{angle_deg:g}".replace("-", "m").replace(".", "p")

    return pd.DataFrame(
        {
            f"rot{label}_x": x * np.cos(theta) + y * np.sin(theta),
            f"rot{label}_y": -x * np.sin(theta) + y * np.cos(theta),
        },
        index=df.index,
    )
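
The 45 degree claim and the column-label sanitizing can both be checked on a single point. The sketch below uses only the standard library; `rotate` and `angle_label` are hypothetical helpers that repeat the expressions from the source above for scalar inputs, and the sample point is arbitrary.

```python
import math


def rotate(lon: float, lat: float, angle_deg: float) -> tuple[float, float]:
    """Apply the same rotation as the source, for one scalar point."""
    theta = math.radians(angle_deg)
    x_rot = lon * math.cos(theta) + lat * math.sin(theta)
    y_rot = -lon * math.sin(theta) + lat * math.cos(theta)
    return x_rot, y_rot


def angle_label(angle_deg: float) -> str:
    """Build the CSV-safe angle label used in the column names."""
    return f"{angle_deg:g}".replace("-", "m").replace(".", "p")


lon, lat = -122.4, 37.8  # illustrative point
x45, y45 = rotate(lon, lat, 45)

# At 45 degrees the rotated axes are the scaled sum and difference.
assert math.isclose(x45, (lon + lat) / math.sqrt(2))
assert math.isclose(y45, (lat - lon) / math.sqrt(2))

print(angle_label(45), angle_label(-30.5))  # 45 m30p5
```

The `g` format keeps whole-number angles free of a trailing `.0`, so an angle of 45 yields columns named rot45_x and rot45_y whether it is passed as an int or a float.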