pandas: compare multiple column values from different DataFrames of different lengths

Date: 2018-02-17 00:02:55

Tags: python pandas numpy dataframe

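I have a DataFrame of GPS coordinates and I'm trying to build an offline reverse-geocoding city lookup service. Basically, I'm trying to resolve a city name from a pair of GPS coordinates. I can't use a third-party service.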

My DataFrame looks like this:

import numpy as np
import pandas as pd

data = [
    ["LATITUDE","LONGITUDE"],
    [41.9021454,-87.624176],
    [38.8898163,-76.9598312],
    [39.304615,-76.6136703],
    [38.9550285,-76.7441483],
    [41.8815498,-87.6620789],
    [33.9141922,-84.3123169]
]
df = pd.DataFrame(data[1:],columns=data[0])

LATITUDE    LONGITUDE
41.9021454  -87.624176
38.8898163  -76.9598312
39.304615   -76.6136703
38.9550285  -76.7441483
41.8815498  -87.6620789
33.9141922  -84.3123169

I built a city lookup DataFrame:

city_data = [
    ['CITY',"LAT","LON"],
    ['PHOENIX',33.0,-112.0],
    ['ATLANTA',33.0,-84.0],
    ['MIAMI',25.0,-80.0],
    ['WASHINGTON_DC',39.0,-77.0],
    ['CHICAGO',41.0,-87.0],
]
df_geo = pd.DataFrame(city_data[1:],columns=city_data[0])

            CITY   LAT    LON
0        PHOENIX  33.0 -112.0
1        ATLANTA  33.0  -84.0
2          MIAMI  25.0  -80.0
3  WASHINGTON_DC  39.0  -77.0
4        CHICAGO  41.0  -87.0

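I want to compare the lat and lon of the two DataFrames, check whether the values are within roughly +/- 1 of each other and, if so, create a new column with the city name, like this: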

LATITUDE    LONGITUDE   CITY
41.9021454  -87.624176  CHICAGO
38.8898163  -76.9598312 WASHINGTON_DC
39.304615   -76.6136703 WASHINGTON_DC
38.9550285  -76.7441483 WASHINGTON_DC
41.8815498  -87.6620789 CHICAGO
33.9141922  -84.3123169 ATLANTA

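The DataFrames have different lengths. The city lookup might be 10 rows, but the data could be thousands. I'm fairly sure the comparison can be done in one line with np.where or df.isin, but I can't work out how to write it. This is what I have so far, but I'm stuck: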

df['city'] = np.where(abs(df['LATITUDE'] - df_geo.loc[df["LAT"]]) <= 1  and
                      abs(df['LONGITUDE'] - df_geo.loc[df["LON"]]) <= 1, df_geo['CITY'], 'TBD')


df['city'] = np.where(df['LATITUDE'].round(0) in df_geo['LAT'] and
                      df['LONGITUDE'] in df_geo['LON'] , df_geo['CITY'], 'TBD')

2 answers:

Answer 0 (score: 2)

This is a crude solution performance-wise, but it should provide a framework:

df_geo['GPS'] = list(zip(df_geo.LAT, df_geo.LON))
geo_map = df_geo.set_index('CITY')['GPS'].to_dict()

# {'ATLANTA': (33.0, -84.0),
#  'CHICAGO': (41.0, -87.0),
#  'MIAMI': (25.0, -80.0),
#  'PHOENIX': (33.0, -112.0),
#  'WASHINGTON_DC': (39.0, -77.0)}

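def calculator(row, mapper, error):
    # Return the first city whose latitude and longitude are both within `error`.
    for k, v in mapper.items():
        if abs(row['LATITUDE'] - v[0]) <= error and \
           abs(row['LONGITUDE'] - v[1]) <= error:
            return k
    # No city in the lookup is close enough.
    return None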

df['CITY'] = df.apply(calculator, mapper=geo_map, error=1, axis=1)

#     LATITUDE  LONGITUDE           CITY
# 0  41.902145 -87.624176        CHICAGO
# 1  38.889816 -76.959831  WASHINGTON_DC
# 2  39.304615 -76.613670  WASHINGTON_DC
# 3  38.955028 -76.744148  WASHINGTON_DC
# 4  41.881550 -87.662079        CHICAGO
# 5  33.914192 -84.312317        ATLANTA
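
The example data always finds a match, so the `None` branch is never exercised. A minimal sketch of that case follows; the second point is hypothetical (not part of the question's data), and filling with 'TBD' is taken from the placeholder in the question's own attempt. `geo_map` is built here with `dict(zip(...))` only to keep the sketch self-contained.

```python
import pandas as pd

df_geo = pd.DataFrame(
    [["PHOENIX", 33.0, -112.0],
     ["ATLANTA", 33.0, -84.0],
     ["MIAMI", 25.0, -80.0],
     ["WASHINGTON_DC", 39.0, -77.0],
     ["CHICAGO", 41.0, -87.0]],
    columns=["CITY", "LAT", "LON"],
)
# Same mapping as above: city name -> (lat, lon)
geo_map = dict(zip(df_geo["CITY"], zip(df_geo["LAT"], df_geo["LON"])))

def calculator(row, mapper, error):
    # First city whose lat and lon are both within `error`, else None
    for city, (lat, lon) in mapper.items():
        if abs(row["LATITUDE"] - lat) <= error and abs(row["LONGITUDE"] - lon) <= error:
            return city
    return None

df = pd.DataFrame(
    [[41.9021454, -87.624176],
     [47.6062, -122.3321]],   # hypothetical point with no city within +/- 1
    columns=["LATITUDE", "LONGITUDE"],
)

# Rows without a match come back as missing values; replace them with 'TBD'
df["CITY"] = df.apply(calculator, mapper=geo_map, error=1, axis=1).fillna("TBD")
```

Note that when several cities fall within the tolerance, this returns whichever comes first in the mapping, not necessarily the closest one.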

Answer 1 (score: 2)

You can do some cool things with numpy. Here is one solution using a broadcasted comparison.

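import numpy as np

# Assumes df holds only the LATITUDE and LONGITUDE columns at this point.
i = df.values[:, None]                          # shape (len(df), 1, 2)
j = df_geo.values[None, :, 1:].astype(float)    # shape (1, len(df_geo), 2)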

df['CITY'] = df_geo.CITY.iloc[
                   (np.abs(j - i) <= 1).all(2).argmax(1)
             ].values

df

    LATITUDE  LONGITUDE           CITY
0  41.902145 -87.624176        CHICAGO
1  38.889816 -76.959831  WASHINGTON_DC
2  39.304615 -76.613670  WASHINGTON_DC
3  38.955028 -76.744148  WASHINGTON_DC
4  41.881550 -87.662079        CHICAGO
5  33.914192 -84.312317        ATLANTA

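This is fast. Note, however, that it uses a lot of memory, especially for larger datasets: the intermediate array has shape (len(df), len(df_geo), 2).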
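
One caveat with the broadcasting approach: `argmax` returns 0 for a row where nothing matches, so such a row would silently be labelled with the first city in the lookup. A hedged sketch of one way to guard against that is below. The last point is hypothetical (not in the question's data), the 'TBD' placeholder comes from the question's attempt, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[41.9021454, -87.624176],
     [38.8898163, -76.9598312],
     [33.9141922, -84.3123169],
     [47.6062, -122.3321]],   # hypothetical point with no city within +/- 1
    columns=["LATITUDE", "LONGITUDE"],
)
df_geo = pd.DataFrame(
    [["PHOENIX", 33.0, -112.0],
     ["ATLANTA", 33.0, -84.0],
     ["MIAMI", 25.0, -80.0],
     ["WASHINGTON_DC", 39.0, -77.0],
     ["CHICAGO", 41.0, -87.0]],
    columns=["CITY", "LAT", "LON"],
)

points = df[["LATITUDE", "LONGITUDE"]].to_numpy()[:, None, :]   # (n, 1, 2)
cities = df_geo[["LAT", "LON"]].to_numpy()[None, :, :]          # (1, m, 2)

# within[r, c] is True when both lat and lon of row r are within 1 of city c
within = (np.abs(points - cities) <= 1).all(axis=2)             # (n, m)
first_match = within.argmax(axis=1)   # index of first True, or 0 if none
has_match = within.any(axis=1)        # tells the two cases apart

df["CITY"] = np.where(has_match, df_geo["CITY"].to_numpy()[first_match], "TBD")
```

Selecting the coordinate columns by name also keeps the comparison working if `df` has picked up extra columns in the meantime.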