How to join two DataFrames with multiple overlapping timestamps using an additional shared variable

Asked: 2018-10-31 13:57:07

Tags: python-3.x pandas

This question builds on the following question: How to join two dataframes for which column values are within a certain range?, which was answered by @coldspeed. Below are the DataFrames, modified for my problem:

print(df_1)

  timestamp              A          B       User
0 2016-05-14 10:00    0.020228   0.026572    1
1 2016-05-14 10:00    0.057780   0.175499    2
2 2016-05-14 10:00    0.098808   0.620986    3
3 2016-05-14 10:15    0.158789   1.014819    1
4 2016-05-14 10:15    0.038129   2.384590    2
5 2016-05-14 10:15    0.038129   2.384590    3

print(df_2)

  start                end                  event   User  
0 2016-05-14 10:00     2016-05-14 10:54:33  E1       1        
1 2016-05-14 10:00     2016-05-14 10:54:37  E2       2
2 2016-05-14 10:00     2016-05-14 10:54:42  E3       3

desired output:

  timestamp              A          B       User  event
0 2016-05-14 10:00    0.020228   0.026572    1     E1
1 2016-05-14 10:00    0.057780   0.175499    2     E2
2 2016-05-14 10:00    0.098808   0.620986    3     E3
3 2016-05-14 10:15    0.158789   1.014819    1     E1
4 2016-05-14 10:15    0.038129   2.384590    2     E2
5 2016-05-14 10:15    0.038129   2.384590    3     E3

So I believe I can use the following as a starting point:

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2.loc[idx.get_indexer(df_1.timestamp), 'event']
df_1['event'] = event.values

But I need a way to take the User ID into account, so that overlapping sessions from different users are not mixed up.
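A quick check of why a lookup that ignores User is ambiguous here: the three sessions all cover the same time span, so any timestamp in that span falls into more than one interval. This is only a minimal sketch; it assumes pandas 0.24 or later, where IntervalIndex.is_overlapping is available, and rebuilds the start and end columns by hand from the table above.

```python
import pandas as pd

# Start and end values copied from df_2 above.
start = pd.to_datetime(["2016-05-14 10:00"] * 3)
end = pd.to_datetime([
    "2016-05-14 10:54:33",
    "2016-05-14 10:54:37",
    "2016-05-14 10:54:42",
])

# Same construction as in the snippet above, without the DataFrame.
idx = pd.IntervalIndex.from_arrays(start, end, closed="both")

# True when at least two intervals share a point, i.e. a timestamp
# alone cannot identify a single session.
overlapping = idx.is_overlapping
print(overlapping)
```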

1 answer:

Answer 0 (score: 0)

In this case, you can use merge_asof:

pd.merge_asof(df_1, df_2, left_on='timestamp', right_on='end', by='User', direction='forward')
Out[148]: 
            timestamp         A  ...                   end  event
0 2016-05-14 10:00:00  0.020228  ...   2016-05-14 10:54:33     E1
1 2016-05-14 10:00:00  0.057780  ...   2016-05-14 10:54:37     E2
2 2016-05-14 10:00:00  0.098808  ...   2016-05-14 10:54:42     E3
3 2016-05-14 10:15:00  0.158789  ...   2016-05-14 10:54:33     E1
4 2016-05-14 10:15:00  0.038129  ...   2016-05-14 10:54:37     E2
5 2016-05-14 10:15:00  0.038129  ...   2016-05-14 10:54:42     E3
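
Note that a forward merge_asof on 'end' only finds, per User, the nearest session end at or after each timestamp; it does not check that the timestamp is also at or after 'start'. The following is a minimal, self-contained sketch of the same call with that extra check added. The sample data is retyped from the question, and the names left, right, merged, inside and result are illustrative only. It assumes the time columns are datetime64 and that at most one session per user should match.

```python
import pandas as pd

# Sample data retyped from the question.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:00"] * 3 + ["2016-05-14 10:15"] * 3),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590, 2.384590],
    "User": [1, 2, 3, 1, 2, 3],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:00"] * 3),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:37",
        "2016-05-14 10:54:42",
    ]),
    "event": ["E1", "E2", "E3"],
    "User": [1, 2, 3],
})

# merge_asof requires both frames to be sorted by their time key.
# mergesort is stable, so rows with equal timestamps keep their order.
left = df_1.sort_values("timestamp", kind="mergesort")
right = df_2.sort_values("end", kind="mergesort")

# For each row, take the same user's session with the nearest end >= timestamp.
merged = pd.merge_asof(
    left, right,
    left_on="timestamp", right_on="end",
    by="User", direction="forward",
)

# Keep the event only if the session had already started at that timestamp.
inside = merged["timestamp"] >= merged["start"]
merged["event"] = merged["event"].where(inside)

result = merged.drop(columns=["start", "end"])
print(result)
```

Rows whose timestamp falls before the matched session's start, or after every session end for that user, end up with a missing event instead of a wrong one.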