How to efficiently merge two DataFrames conditionally

Time: 2019-01-04 05:34:12

Tags: python pandas merge

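I am trying to assign each GPS packet its schedule number and route number based on the packet's GPS timestamp. I have nearly a million GPS packets from various devices, so how can I do this efficiently?

I have not found an optimal approach. Right now I loop over all rows, compare each timestamp with every interval in the schedule table, and append the matching schedule number to each GPS packet.

GPS DataFrame: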

import pandas as pd
gps_df = pd.DataFrame({'Device':[1,1,2,2,3,3,3],'time-stamp': ['6:00:00','7:00:30','12:12:12','13:13:13','20:15:10','22:16:10','22:18:23']})

Schedule DataFrame:

schedule_df = pd.DataFrame({'Device'    :[1,    1,  1,  1,  2,  2,  2,  3,3,    3],
'schedule'  :['A1','A1','A2','A2','B1','B2','B2','C1','C2','C3'],
'route no'  :[1,    2,  1,  2,  1,  5,  6,  1,  1,  2],
'start time' :  ['6:00:00','7:00:01','8:30:00','10:00:00','12:00:00','14:00:00','16:00:00','20:00:00','21:00:00','22:00:00'],
'end time'  :['7:00:00','8:30:00','9:30:00','12:00:00','13:00:00','16:00:00','20:00:00','21:00:00','22:00:00','23:00:00']})

I want output like this:

gps_df = pd.DataFrame({'Device':[1,1,2,2,3,3,3],
                   'time-stamp':['6:00:00','7:00:30','12:12:12','13:13:13','20:15:10','22:16:10','22:18:23'],
                    'schedule': ['A1','A1','B1','Na','C1','C3','C3'],
                    'route':    [1, 2,  1,  'Na',1, 2,  2]})

3 answers:

Answer 0: (score: 0)

Try this:

import pandas as pd
gps_df = pd.DataFrame({'Device':[1,1,2,2,3,3,3],'time-stamp': ['6:00:00','7:00:30','12:12:12','13:13:13','20:15:10','22:16:10','22:18:23']})
schedule_df = pd.DataFrame({'Device'    :[1,    1,  1,  1,  2,  2,  2,  3,3,    3],
'schedule'  :['A1','A1','A2','A2','B1','B2','B2','C1','C2','C3'],
'route no'  :[1,    2,  1,  2,  1,  5,  6,  1,  1,  2],
'start time' :  ['6:00:00','7:00:01','8:30:00','10:00:00','12:00:00','14:00:00','16:00:00','20:00:00','21:00:00','22:00:00'],
'end time'  :['7:00:00','8:30:00','9:30:00','12:00:00','13:00:00','16:00:00','20:00:00','21:00:00','22:00:00','23:00:00']})
print(gps_df)
print(schedule_df)
gps_df = pd.concat([gps_df, schedule_df],sort=True)
gps_df = gps_df.drop('end time', axis=1)
print(gps_df)

Output

   Device time-stamp
0       1    6:00:00
1       1    7:00:30
2       2   12:12:12
3       2   13:13:13
4       3   20:15:10
5       3   22:16:10
6       3   22:18:23


   Device schedule  route no start time  end time
0       1       A1         1    6:00:00   7:00:00
1       1       A1         2    7:00:01   8:30:00
2       1       A2         1    8:30:00   9:30:00
3       1       A2         2   10:00:00  12:00:00
4       2       B1         1   12:00:00  13:00:00
5       2       B2         5   14:00:00  16:00:00
6       2       B2         6   16:00:00  20:00:00
7       3       C1         1   20:00:00  21:00:00
8       3       C2         1   21:00:00  22:00:00
9       3       C3         2   22:00:00  23:00:00


   Device time-stamp schedule route
0       1    6:00:00       A1     1
1       1    7:00:30       A1     2
2       2   12:12:12       B1     1
3       2   13:13:13       Na    Na
4       3   20:15:10       C1     1
5       3   22:16:10       C3     2
6       3   22:18:23       C3     2

Hope this helps.
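Note that pd.concat only stacks the two frames row-wise; it does not match timestamps against intervals. As a minimal sketch of one way to arrive at the third table above, pd.merge_asof can pick, for each packet, the latest interval of the same device that starts at or before the timestamp. This assumes that the time strings are parsed with pd.to_timedelta and that a device's intervals do not overlap. The helper columns ts, start, end and row are names introduced here for illustration.

```python
import pandas as pd

gps_df = pd.DataFrame({
    'Device': [1, 1, 2, 2, 3, 3, 3],
    'time-stamp': ['6:00:00', '7:00:30', '12:12:12', '13:13:13',
                   '20:15:10', '22:16:10', '22:18:23']})
schedule_df = pd.DataFrame({
    'Device': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'schedule': ['A1', 'A1', 'A2', 'A2', 'B1', 'B2', 'B2', 'C1', 'C2', 'C3'],
    'route no': [1, 2, 1, 2, 1, 5, 6, 1, 1, 2],
    'start time': ['6:00:00', '7:00:01', '8:30:00', '10:00:00', '12:00:00',
                   '14:00:00', '16:00:00', '20:00:00', '21:00:00', '22:00:00'],
    'end time': ['7:00:00', '8:30:00', '9:30:00', '12:00:00', '13:00:00',
                 '16:00:00', '20:00:00', '21:00:00', '22:00:00', '23:00:00']})

# Parse the time strings so that they compare chronologically, not as text.
# 'row' remembers the original packet order (hypothetical helper column).
gps = gps_df.assign(ts=pd.to_timedelta(gps_df['time-stamp']),
                    row=range(len(gps_df)))
sched = schedule_df.assign(start=pd.to_timedelta(schedule_df['start time']),
                           end=pd.to_timedelta(schedule_df['end time']))

# merge_asof needs both frames sorted by their key column. For each packet it
# takes the latest interval of the same device starting at or before 'ts'.
matched = pd.merge_asof(
    gps.sort_values('ts'),
    sched.sort_values('start')[['Device', 'schedule', 'route no', 'start', 'end']],
    left_on='ts', right_on='start', by='Device', direction='backward')

# Blank out the match when the packet falls after that interval's end.
inside = matched['ts'] <= matched['end']
matched['schedule'] = matched['schedule'].where(inside)
matched['route no'] = matched['route no'].where(inside)

result = (matched.sort_values('row')
                 [['Device', 'time-stamp', 'schedule', 'route no']]
                 .reset_index(drop=True))
print(result)
```

Because merge_asof works on sorted keys instead of comparing every packet with every interval, it may scale better to a million packets than a row-by-row loop, though that is worth measuring on the real data.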

Answer 1: (score: 0)

Use merge:

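# Assumes df1 is the GPS frame and df2 the schedule frame, with columns renamed to
# timestamp, start_time, end_time and route, and with times parsed (for example
# via pd.to_timedelta) so that they compare chronologically, not as strings.
cols = ['Device', 'schedule', 'route', 'timestamp']
# Pair each schedule interval with every packet from the same device.
df = df2.merge(df1, on='Device')
# Keep pairs strictly inside the interval; reindex so unmatched packets become NaN.
df = (df.loc[df.timestamp.gt(df.start_time) & df.timestamp.lt(df.end_time), cols]
        .set_index(['timestamp', 'Device'])
        .reindex(index=df1.set_index(['timestamp', 'Device']).index)
        .reset_index())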

print(df)
  timestamp  Device schedule  route
0  06:00:01       1       A1    1.0
1  07:00:30       1       A1    2.0
2  12:12:12       2       B1    1.0
3  13:13:13       2      NaN    NaN
4  20:15:10       3       C1    1.0
5  22:16:10       3       C3    2.0
6  22:18:23       3       C3    2.0
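For reference, here is a self-contained sketch of the same merge-then-filter idea applied to the question's frames with their original column names. It rests on a few assumptions that are not in the answer: the times are parsed with pd.to_timedelta, the interval bounds are inclusive (so that the 6:00:00 packet matches the interval starting at 6:00:00), and the matches are attached with a join on the original row label instead of a reindex. The helper columns ts, start and end are illustrative names.

```python
import pandas as pd

gps_df = pd.DataFrame({
    'Device': [1, 1, 2, 2, 3, 3, 3],
    'time-stamp': ['6:00:00', '7:00:30', '12:12:12', '13:13:13',
                   '20:15:10', '22:16:10', '22:18:23']})
schedule_df = pd.DataFrame({
    'Device': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'schedule': ['A1', 'A1', 'A2', 'A2', 'B1', 'B2', 'B2', 'C1', 'C2', 'C3'],
    'route no': [1, 2, 1, 2, 1, 5, 6, 1, 1, 2],
    'start time': ['6:00:00', '7:00:01', '8:30:00', '10:00:00', '12:00:00',
                   '14:00:00', '16:00:00', '20:00:00', '21:00:00', '22:00:00'],
    'end time': ['7:00:00', '8:30:00', '9:30:00', '12:00:00', '13:00:00',
                 '16:00:00', '20:00:00', '21:00:00', '22:00:00', '23:00:00']})

# Parse the time strings so that comparisons are chronological.
gps = gps_df.assign(ts=pd.to_timedelta(gps_df['time-stamp']))
sched = schedule_df.assign(start=pd.to_timedelta(schedule_df['start time']),
                           end=pd.to_timedelta(schedule_df['end time']))

# Pair every packet with every interval of the same device. reset_index()
# keeps the packet's original row label in a column named 'index'.
pairs = gps.reset_index().merge(sched, on='Device')

# Keep the pairs whose timestamp lies inside the interval (inclusive bounds,
# an assumption; a packet exactly on a shared boundary would match twice).
hits = pairs[pairs['ts'].between(pairs['start'], pairs['end'])]

# Attach the matches to the GPS rows; packets without a match get NaN.
result = gps_df.join(hits.set_index('index')[['schedule', 'route no']])
print(result)
```

The intermediate frame holds one row per packet and interval of the same device, so with a million packets and many intervals per device it can grow large; processing one device or one chunk at a time is one way to bound memory.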

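Answer 2: (score: 0)

You can try using numpy arrays. I have omitted the code that initializes the extra output columns to be added to the GPS DataFrame. The idea is to build 2-D boolean arrays whose logical AND gives a truth table of matches by device ID and by time range. Here "i" is the matching row index in the GPS df and "j" is the corresponding row index in the Schedule df.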

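import numpy as np

# Assumes the time columns are already parsed so that >= and <= compare chronologically.
gpsd = GPS_df.Device.values
schedd = Sched_df.Device.values

gpst = GPS_df.timestamp.values
tl = Sched_df.start_time.values
th = Sched_df.end_time.values

# Broadcast to an (n_gps, n_sched) truth table; i and j are the matching row positions.
i, j = np.where((gpsd[:, None] == schedd) &
                (gpst[:, None] >= tl) &
                (gpst[:, None] <= th))
# Assign raw values: a Series on the right-hand side would be aligned on its own index labels.
GPS_df.loc[i, 'schedule'] = Sched_df.loc[j, 'schedule'].values
GPS_df.loc[i, 'route'] = Sched_df.loc[j, 'route'].values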
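A self-contained sketch of this broadcasting approach on the question's data follows. It assumes the times are converted to seconds since midnight via pd.to_timedelta, and it fills in the omitted initialization with plain NumPy arrays (None for a missing schedule, NaN for a missing route). Those choices are illustrative and are not taken from the answer.

```python
import numpy as np
import pandas as pd

gps_df = pd.DataFrame({
    'Device': [1, 1, 2, 2, 3, 3, 3],
    'time-stamp': ['6:00:00', '7:00:30', '12:12:12', '13:13:13',
                   '20:15:10', '22:16:10', '22:18:23']})
schedule_df = pd.DataFrame({
    'Device': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'schedule': ['A1', 'A1', 'A2', 'A2', 'B1', 'B2', 'B2', 'C1', 'C2', 'C3'],
    'route no': [1, 2, 1, 2, 1, 5, 6, 1, 1, 2],
    'start time': ['6:00:00', '7:00:01', '8:30:00', '10:00:00', '12:00:00',
                   '14:00:00', '16:00:00', '20:00:00', '21:00:00', '22:00:00'],
    'end time': ['7:00:00', '8:30:00', '9:30:00', '12:00:00', '13:00:00',
                 '16:00:00', '20:00:00', '21:00:00', '22:00:00', '23:00:00']})

def to_seconds(col):
    """Convert a column of 'H:MM:SS' strings to seconds since midnight."""
    return pd.to_timedelta(col).dt.total_seconds().to_numpy()

gps_t = to_seconds(gps_df['time-stamp'])
start = to_seconds(schedule_df['start time'])
end = to_seconds(schedule_df['end time'])
gps_dev = gps_df['Device'].to_numpy()
sched_dev = schedule_df['Device'].to_numpy()

# Column vector against row vector broadcasts to an (n_gps, n_sched) table
# that is True where the device matches and the timestamp is in the interval.
match = ((gps_dev[:, None] == sched_dev) &
         (gps_t[:, None] >= start) &
         (gps_t[:, None] <= end))
i, j = np.nonzero(match)  # i: GPS row positions, j: schedule row positions

# Initialize the output columns as "no match", then fill in the matches.
schedule = np.full(len(gps_df), None, dtype=object)
route = np.full(len(gps_df), np.nan)
schedule[i] = schedule_df['schedule'].to_numpy()[j]
route[i] = schedule_df['route no'].to_numpy()[j]

result = gps_df.assign(schedule=schedule, route=route)
print(result)
```

The truth table has one boolean per packet and interval pair, so for a million packets it is probably best built per device or per chunk of packets, not all at once.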