How to outer-merge two dataframes where a column value falls within a range?

Time: 2019-04-11 09:55:23

Tags: python python-3.x pandas

This is a follow-up to this question.

I have two dataframes:

print(df_1)

  timestamp      A          B
0 2016-05-15     0.020228   0.026572
1 2016-05-15     0.057780   0.175499
2 2016-05-15     0.098808   0.620986
3 2016-05-17     0.158789   1.014819
4 2016-05-17     0.038129   2.384590
5 2018-05-17     0.011111   9.999999


print(df_2)

  start                end  event    
0 2016-05-14   2016-05-16   E1
1 2016-05-14   2016-05-16   E2
2 2016-05-17   2016-05-18   E3

I want to merge the event column into df_1 whenever the timestamp in df_1 falls between start and end in df_2.

Problems, and differences from the earlier question:

1) Events E1 and E2 have the same start and end.

2) Also, the sixth row of df_1 does not fall within any interval.

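In the end, I want to keep both events for rows that match both, and to have NA for rows that match no event.

So I would like my resulting dataframe to look like this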

1 answer:

Answer 0: (score: 0)

import pandas as pd

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})
df_1.timestamp = pd.to_datetime(df_1.timestamp, format='%Y-%m-%d')
df_2.start = pd.to_datetime(df_2.start, format='%Y-%m-%d')
df_2.end = pd.to_datetime(df_2.end, format='%Y-%m-%d')

# convert df_2 to long format with one row per day between start and end, then do a left merge on timestamp
df_2_2 = pd.melt(df_2, id_vars='event', value_name='timestamp')
df_2_2.timestamp = pd.to_datetime(df_2_2.timestamp)
df_2_2.set_index('timestamp', inplace=True)
df_2_2.drop('variable', axis=1, inplace=True)

df_2_3 = df_2_2.groupby('event').resample('D').ffill().reset_index(level=0, drop=True).reset_index()

df_2 = pd.merge(df_2, df_2_3)
df_2 = df_2.drop(columns=['start', 'end'])

df_1 = df_1.merge(df_2,on='timestamp',  how='left')

print(df_1)
   timestamp  A event
0 2016-05-15  1    E1
1 2016-05-15  1    E2
2 2016-05-15  1    E1
3 2016-05-15  1    E2
4 2016-05-15  1    E1
5 2016-05-15  1    E2
6 2016-05-17  1    E3
7 2016-05-17  1    E3
8 2018-05-17  1   NaN

Credit to this answer.
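The day-by-day expansion used above can also be written more explicitly. The following is only a sketch, assuming the timestamps have daily granularity as in the example data; the names `records`, `expanded`, and `merged` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-15', '2016-05-15', '2016-05-15',
                                 '2016-05-17', '2016-05-17', '2018-05-17']),
    'A': [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14', '2016-05-14', '2016-05-17']),
    'end': pd.to_datetime(['2016-05-16', '2016-05-16', '2016-05-18']),
    'event': ['E1', 'E2', 'E3'],
})

# One record per (day, event) for every day from start to end, inclusive.
records = [
    {'timestamp': day, 'event': event}
    for start, end, event in zip(df_2['start'], df_2['end'], df_2['event'])
    for day in pd.date_range(start, end, freq='D')
]
expanded = pd.DataFrame(records)

# Left merge: rows of df_1 whose timestamp matches no expanded day get NaN.
merged = df_1.merge(expanded, on='timestamp', how='left')
print(merged)
```

Because every interval is expanded into one row per day, this approach grows with the length of the intervals and would miss timestamps that carry a time-of-day component.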

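There is also the following solution, but it does not give NA for the last row; the row that matches no interval is dropped instead.

import pandas as pd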

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})   

df_try2 = pd.merge(df_1.assign(key=1), df_2.assign(key=1), on='key').query('timestamp >= start and timestamp <= end')    

print(df_try2)

   timestamp  A  key      start        end event
0  2016-05-15  1    1 2016-05-14 2016-05-16    E1
1  2016-05-15  1    1 2016-05-14 2016-05-16    E2
3  2016-05-15  1    1 2016-05-14 2016-05-16    E1
4  2016-05-15  1    1 2016-05-14 2016-05-16    E2
6  2016-05-15  1    1 2016-05-14 2016-05-16    E1
7  2016-05-15  1    1 2016-05-14 2016-05-16    E2
11 2016-05-17  1    1 2016-05-17 2016-05-18    E3
14 2016-05-17  1    1 2016-05-17 2016-05-18    E3
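One way to keep the unmatched row is to merge the filtered pairs back onto df_1 with a left join. This is a minimal sketch rather than part of the original answer; the helper column `row` and the names `left`, `pairs`, `matched`, and `result` are assumptions made for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-15', '2016-05-15', '2016-05-15',
                                 '2016-05-17', '2016-05-17', '2018-05-17']),
    'A': [1, 1, 1, 1, 1, 1],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14', '2016-05-14', '2016-05-17']),
    'end': pd.to_datetime(['2016-05-16', '2016-05-16', '2016-05-18']),
    'event': ['E1', 'E2', 'E3'],
})

# Keep the original row number so unmatched rows can be restored later.
left = df_1.reset_index().rename(columns={'index': 'row'})

# Cross join via a constant key, then keep only pairs inside the interval.
pairs = pd.merge(left.assign(key=1), df_2.assign(key=1), on='key')
matched = pairs.query('timestamp >= start and timestamp <= end')[['row', 'event']]

# Left merge back: rows of df_1 without any matching interval get NaN.
result = left.merge(matched, on='row', how='left').drop(columns='row')
print(result)
```

The cross join builds len(df_1) * len(df_2) intermediate rows, so this is practical only when both frames are reasonably small.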