Get all rows for which certain columns maximize the overlap/intersection between multiple DataFrames

Asked: 2018-03-09 13:34:19

Tags: python pandas dataframe

I have N DataFrames, each with two columns holding longitude and latitude data that track the movement of a car. The general track of the car is the same in all DataFrames, but because the tracking sometimes starts a little later or ends a little earlier, the DataFrames differ in length.

I want the DataFrames to be "aligned", i.e. to trim away the rows that correspond to non-overlapping positional data. The result should be N DataFrames of equal length, with identical positional data in all of them.

Example

Three arbitrary DataFrames look like this:

time     speed     longitude      latitude
 t00       v00         19.70         48.67
 t01       v01         19.71         48.65
 t02       v02         19.72         48.64
 t03       v03         19.73         48.64
 t04       v04         19.74         48.63
 t05       v05         19.74         48.63
 t06       v06         19.75         48.64
 t07       v07         19.75         48.64
 t08       v08         19.75         48.64
 t09       v09         19.75         48.64


time     speed     longitude      latitude
 t10       v10         19.72         48.64
 t11       v11         19.73         48.64
 t12       v12         19.74         48.63
 t13       v13         19.74         48.63
 t14       v14         19.75         48.64
 t15       v15         19.75         48.64
 t16       v16         19.75         48.64



time     speed     longitude      latitude
 t20       v20         19.72         48.64
 t21       v21         19.73         48.64
 t22       v22         19.74         48.63
 t23       v23         19.74         48.63
 t24       v24         19.75         48.64
 t25       v25         19.75         48.63
 t26       v26         19.75         48.64
 t27       v27         19.75         48.64
 t28       v28         19.75         48.64

The result should be three new DataFrames:

time     speed     longitude      latitude
 t02       v02         19.72         48.64
 t03       v03         19.73         48.64
 t04       v04         19.74         48.63
 t05       v05         19.74         48.63
 t06       v06         19.75         48.64


time     speed     longitude      latitude
 t10       v10         19.72         48.64
 t11       v11         19.73         48.64
 t12       v12         19.74         48.63
 t13       v13         19.74         48.63
 t14       v14         19.75         48.64


time     speed     longitude      latitude
 t20       v20         19.72         48.64
 t21       v21         19.73         48.64
 t22       v22         19.74         48.63
 t23       v23         19.74         48.63
 t24       v24         19.75         48.64

In practice the number of overlapping coordinates is much higher, but I hope this gets the point across.

I found this post, which retrieves the intersection of two lists. I tried extracting the positional data from the DataFrames and then keeping only the rows with matching coordinates from all DataFrames, but this fails because the number of rows differs between the DataFrames.

My current code looks like this:

import collections
import numpy as np

first_route = True  # routes is the list of route DataFrames

for route in routes:  # extract all route's coordinates                                                                             
    lon = route["longitude"].values.tolist()                                                                                                           
    lat = route["latitude"].values.tolist()                                                                                                            
    if first_route:  # add first route regardless                                                                                          
        cropped_lon = lon                                                                                                                              
        cropped_lat = lat
        first_route = False                                                                                                                              
        continue                                                                                                                                       
    old_lon = collections.Counter(cropped_lon)                                                                                                         
    old_lat = collections.Counter(cropped_lat)                                                                                                         
    new_lon = collections.Counter(lon)                                                                                                                 
    new_lat = collections.Counter(lat)                                                                                                                 
    cropped_lon = list((old_lon & new_lon).elements())                                                                                                 
    cropped_lat = list((old_lat & new_lat).elements())                                                                                                 

cropped_lon = np.asarray(cropped_lon)                                                                                                                  
cropped_lat = np.asarray(cropped_lat)                                                                                                                  

# THIS fails due to length difference
# Here I want to extract all rows which satisfy the positional restrictions                                                                                                                                                                                                  
for route in routes:                                                                                                                
    print(route[route.longitude == cropped_lon and route.latitude == cropped_lat])                                                               
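
For reference, a minimal sketch of the pair-based filtering I am after (assuming routes is the list of route DataFrames used above): intersect the (longitude, latitude) tuples across all routes instead of intersecting the two columns separately, then keep only the rows whose coordinate pair survives.

pairs = [set(zip(r["longitude"], r["latitude"])) for r in routes]
common = set.intersection(*pairs)

for route in routes:
    # keep rows whose coordinate pair appears in every route
    mask = route.apply(
        lambda row: (row["longitude"], row["latitude"]) in common, axis=1)
    print(route[mask])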

If anyone has a better idea, I am completely open to throwing away my whole approach.

UPDATE

The accepted answer solves the problem as stated in the title, but I am looking for an extended solution. I would like it to work in a similar fashion, which is why I am leaving this as an update.

My actual coordinate data has higher resolution (6 decimal places), but the measurements are not accurate enough. As a result, the code from the accepted answer produces empty DataFrames. I could take the shortest DataFrame and then "slide" all the others along it to find a least-squares fit, but I am hoping for a solution more along the lines of the one below.
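
One workaround I have considered: round the coordinates to a coarser precision before intersecting or merging, so that near-identical positions still match. A minimal sketch, assuming routes is the list of route DataFrames and a hypothetical decimals tolerance:

decimals = 4  # hypothetical precision; tune to the measurement accuracy
rounded = [
    r.assign(longitude=r["longitude"].round(decimals),
             latitude=r["latitude"].round(decimals))
    for r in routes
]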

1 Answer:

Answer 0 (score: 1)

You can merge all the DataFrames to keep only the overlapping part. Let's start from your example data:

cols = ['time','speed']
group_cols = ['longitude','latitude']

input_list = [[['t00','v00',19.70,48.67],
    ['t01','v01',19.71,48.65],
    ['t02','v02',19.72,48.64],
    ['t03','v03',19.73,48.64],
    ['t04','v04',19.74,48.63],
    ['t05','v05',19.74,48.63],
    ['t06','v06',19.75,48.64],
    ['t07','v07',19.75,48.64],
    ['t08','v08',19.75,48.64],
    ['t09','v09',19.75,48.64]],

    [['t10','v10',19.72,48.64],
    ['t11','v11',19.73,48.64],
    ['t12','v12',19.74,48.63],
    ['t13','v13',19.74,48.63],
    ['t14','v14',19.75,48.64],
    ['t15','v15',19.75,48.64],
    ['t16','v16',19.75,48.64]],

    [['t20','v20',19.72,48.64],
    ['t21','v21',19.73,48.64],
    ['t22','v22',19.74,48.63],
    ['t23','v23',19.74,48.63],
    ['t24','v24',19.75,48.64],
    ['t25','v25',19.75,48.63],
    ['t26','v26',19.75,48.64],
    ['t27','v27',19.75,48.64],
    ['t28','v28',19.75,48.64]]]

import pandas as pd

df_list = [
    pd.DataFrame(l, columns=[c + str(i) for c in cols] + group_cols)
    for i, l in enumerate(input_list)
]

Now merge them:

from functools import reduce
df = reduce(
    lambda x, y: pd.merge(x, y, on=group_cols, how='inner'), 
    df_list)
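
Here reduce simply chains pairwise inner merges on the coordinate columns; with three DataFrames it is equivalent to:

df = pd.merge(
    pd.merge(df_list[0], df_list[1], on=group_cols, how='inner'),
    df_list[2], on=group_cols, how='inner')

The merged frame looks like this: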

    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+
    |     | time0  | speed0  | longitude  | latitude  | time1  | speed1  | time2  | speed2 |
    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+
    |  0  | t02    | v02     | 19.72      | 48.64     | t10    | v10     | t20    | v20    |
    |  1  | t03    | v03     | 19.73      | 48.64     | t11    | v11     | t21    | v21    |
    |  2  | t04    | v04     | 19.74      | 48.63     | t12    | v12     | t22    | v22    |
    |  3  | t04    | v04     | 19.74      | 48.63     | t12    | v12     | t23    | v23    |
    |  4  | t04    | v04     | 19.74      | 48.63     | t13    | v13     | t22    | v22    |
    |  5  | t04    | v04     | 19.74      | 48.63     | t13    | v13     | t23    | v23    |
    |  6  | t05    | v05     | 19.74      | 48.63     | t12    | v12     | t22    | v22    |
    |  7  | t05    | v05     | 19.74      | 48.63     | t12    | v12     | t23    | v23    |
    |  8  | t05    | v05     | 19.74      | 48.63     | t13    | v13     | t22    | v22    |
    |  9  | t05    | v05     | 19.74      | 48.63     | t13    | v13     | t23    | v23    |
    | 10  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 11  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 12  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 13  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 14  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 15  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 16  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 17  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 18  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 19  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 20  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 21  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 22  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 23  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 24  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 25  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 26  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 27  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 28  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 29  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 30  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 31  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 32  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 33  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 34  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 35  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 36  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 37  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 38  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 39  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 40  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 41  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 42  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 43  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 44  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 45  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 46  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 47  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 48  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 49  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 50  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 51  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 52  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 53  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 54  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 55  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 56  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 57  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+

Finally, note that the merge is many-to-many wherever a coordinate pair occurs several times, which is why the merged frame above has repeated rows. Selecting each frame's own columns and dropping duplicates recovers the trimmed routes:

df_list_out = [
    df[[c + str(i) for c in cols] + group_cols].drop_duplicates()
    for i in range(len(input_list))
]
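
If you want each output DataFrame to have its original column names again, you could strip the numeric suffixes afterwards, for example:

df_list_out = [
    d.rename(columns={c + str(i): c for c in cols})
    for i, d in enumerate(df_list_out)
]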