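I have a DataFrame of GPS coordinates and I am trying to build an offline reverse-geocoding city lookup. Essentially, I want to resolve a city name from a pair of GPS coordinates. I can't use a third-party service.
My DataFrame looks like this: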
import pandas as pd

data = [
    ["LATITUDE", "LONGITUDE"],
    [41.9021454, -87.624176],
    [38.8898163, -76.9598312],
    [39.304615, -76.6136703],
    [38.9550285, -76.7441483],
    [41.8815498, -87.6620789],
    [33.9141922, -84.3123169],
]
# First row holds the column names, the rest are the records
df = pd.DataFrame(data[1:], columns=data[0])
LATITUDE    LONGITUDE
41.9021454  -87.624176
38.8898163  -76.9598312
39.304615   -76.6136703
38.9550285  -76.7441483
41.8815498  -87.6620789
33.9141922  -84.3123169
I built a city lookup DataFrame:
city_data = [
    ["CITY", "LAT", "LON"],
    ["PHOENIX", 33.0, -112.0],
    ["ATLANTA", 33.0, -84.0],
    ["MIAMI", 25.0, -80.0],
    ["WASHINGTON_DC", 39.0, -77.0],
    ["CHICAGO", 41.0, -87.0],
]
df_geo = pd.DataFrame(city_data[1:], columns=city_data[0])
            CITY   LAT    LON
0        PHOENIX  33.0 -112.0
1        ATLANTA  33.0  -84.0
2          MIAMI  25.0  -80.0
3  WASHINGTON_DC  39.0  -77.0
4        CHICAGO  41.0  -87.0
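I want to compare the lat and lon values of the two DataFrames and check whether they are within roughly +/- 1 of each other. If they are, I want to create a new column with the city name, like this: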
LATITUDE    LONGITUDE    CITY
41.9021454  -87.624176   CHICAGO
38.8898163  -76.9598312  WASHINGTON_DC
39.304615   -76.6136703  WASHINGTON_DC
38.9550285  -76.7441483  WASHINGTON_DC
41.8815498  -87.6620789  CHICAGO
33.9141922  -84.3123169  ATLANTA
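The DataFrames have different lengths. The city lookup might be 10 rows, but the data might be thousands of rows. I am fairly sure the comparison can be done in one line with np.where or df.isin, but I can't work out how to express it. I have these attempts, but I am stuck: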
df['city'] = np.where(abs(df['LATITUDE'] - df_geo.loc[df["LAT"]]) <= 1 and
abs(df['LONGITUDE'] - df_geo.loc[df["LON"]]) <= 1, df_geo['CITY'], 'TBD')
df['city'] = np.where(df['LATITUDE'].round(0) in df_geo['LAT'] and
df['LONGITUDE'] in df_geo['LON'] , df_geo['CITY'], 'TBD')
Answer 0 (score: 2)
This is a crude solution in terms of performance, but it should provide a framework:
# Map each city name to its (lat, lon) reference point
df_geo['GPS'] = list(zip(df_geo.LAT, df_geo.LON))
geo_map = df_geo.set_index('CITY')['GPS'].to_dict()
# {'ATLANTA': (33.0, -84.0),
#  'CHICAGO': (41.0, -87.0),
#  'MIAMI': (25.0, -80.0),
#  'PHOENIX': (33.0, -112.0),
#  'WASHINGTON_DC': (39.0, -77.0)}

def calculator(row, mapper, error):
    # Return the first city whose lat and lon are both within `error` of the row
    for k, v in mapper.items():
        if abs(row['LATITUDE'] - v[0]) <= error and \
           abs(row['LONGITUDE'] - v[1]) <= error:
            return k
    else:
        # The loop finished without a match
        return None

df['CITY'] = df.apply(calculator, mapper=geo_map, error=1, axis=1)
#     LATITUDE  LONGITUDE           CITY
# 0  41.902145 -87.624176        CHICAGO
# 1  38.889816 -76.959831  WASHINGTON_DC
# 2  39.304615 -76.613670  WASHINGTON_DC
# 3  38.955028 -76.744148  WASHINGTON_DC
# 4  41.881550 -87.662079        CHICAGO
# 5  33.914192 -84.312317        ATLANTA
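Since the row-wise apply above is described as crude performance-wise, here is a rough sketch of one way to keep the same "first city within tolerance" logic while staying in pandas. It is an alternative, not the approach above: it uses a cross join, which assumes pandas 1.2 or later for how="cross". The names pairs, close and first_hit, and the third sample point (10.0, 10.0), are made up for illustration. Note that the cross join itself builds len(df) * len(df_geo) rows, so it trades Python-level looping for memory.

```python
import pandas as pd

df = pd.DataFrame({
    "LATITUDE": [41.9021454, 38.8898163, 10.0],    # last point is a made-up non-match
    "LONGITUDE": [-87.624176, -76.9598312, 10.0],
})
df_geo = pd.DataFrame({
    "CITY": ["PHOENIX", "ATLANTA", "MIAMI", "WASHINGTON_DC", "CHICAGO"],
    "LAT": [33.0, 33.0, 25.0, 39.0, 41.0],
    "LON": [-112.0, -84.0, -80.0, -77.0, -87.0],
})

# Pair every point with every city (assumes pandas >= 1.2 for how="cross").
# reset_index() keeps the original row label in a column named "index".
pairs = df.reset_index().merge(df_geo, how="cross")

# Keep only the pairs where both coordinates are within 1 degree
close = ((pairs["LATITUDE"] - pairs["LAT"]).abs() <= 1) & \
        ((pairs["LONGITUDE"] - pairs["LON"]).abs() <= 1)

# First matching city per original row; rows with no match become NaN
first_hit = pairs[close].drop_duplicates("index").set_index("index")["CITY"]
df["CITY"] = first_hit.reindex(df.index)
print(df)
```

Unmatched rows end up as NaN here, which can be replaced with a placeholder such as 'TBD' via fillna if preferred.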
Answer 1 (score: 2)
You can do some neat things with numpy. Here is one solution using a broadcasted comparison.
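import numpy as np

i = df.values[:, None]                             # shape (len(df), 1, 2)
j = df_geo.values[None, :, 1:].astype(float)       # shape (1, len(df_geo), 2)

# For each row, take the index of the first city within 1 degree on both axes
df['CITY'] = df_geo.CITY.iloc[
    (np.abs(j - i) <= 1).all(2).argmax(1)
].values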
df
    LATITUDE  LONGITUDE           CITY
0  41.902145 -87.624176        CHICAGO
1  38.889816 -76.959831  WASHINGTON_DC
2  39.304615 -76.613670  WASHINGTON_DC
3  38.955028 -76.744148  WASHINGTON_DC
4  41.881550 -87.662079        CHICAGO
5  33.914192 -84.312317        ATLANTA
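This is fast. Note, however, that it is memory-hungry, especially for larger datasets, because the comparison materializes an intermediate array of shape (len(df), len(df_geo), 2).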
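The memory cost can be bounded by running the same broadcasted comparison over slices of the data. The sketch below is only one way to do that: match_cities, tol, chunk and default are hypothetical names, and the third sample point (10.0, 10.0) is made up. It also handles one edge case: argmax returns 0 for a row with no match at all, which would otherwise label that row with the first city in the lookup (PHOENIX here), so a separate any() mask is used to fall back to a placeholder.

```python
import numpy as np
import pandas as pd

def match_cities(df, df_geo, tol=1.0, chunk=1000, default="TBD"):
    """Hypothetical helper: first city within `tol` degrees, else `default`."""
    cities = df_geo["CITY"].to_numpy()
    ref = df_geo[["LAT", "LON"]].to_numpy(dtype=float)[None, :, :]   # (1, m, 2)
    pts = df[["LATITUDE", "LONGITUDE"]].to_numpy(dtype=float)        # (n, 2)
    out = np.full(len(df), default, dtype=object)

    # Process `chunk` rows at a time so the temporary array is at most (chunk, m, 2)
    for start in range(0, len(df), chunk):
        block = pts[start:start + chunk, None, :]                    # (c, 1, 2)
        hit = (np.abs(block - ref) <= tol).all(axis=2)               # (c, m) booleans
        first = hit.argmax(axis=1)      # index of first True, or 0 if none
        matched = hit.any(axis=1)       # distinguishes a real match from "none"
        out[start:start + chunk] = np.where(matched, cities[first], default)
    return out

df = pd.DataFrame({
    "LATITUDE": [41.9021454, 33.9141922, 10.0],    # last point is a made-up non-match
    "LONGITUDE": [-87.624176, -84.3123169, 10.0],
})
df_geo = pd.DataFrame({
    "CITY": ["PHOENIX", "ATLANTA", "MIAMI", "WASHINGTON_DC", "CHICAGO"],
    "LAT": [33.0, 33.0, 25.0, 39.0, 41.0],
    "LON": [-112.0, -84.0, -80.0, -77.0, -87.0],
})

result = match_cities(df, df_geo, chunk=2)
df["CITY"] = result
print(df)
```

A smaller chunk lowers peak memory at the cost of more loop iterations. With only a handful of lookup cities, a chunk of a few thousand rows keeps the temporary arrays small.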