Filtering and labelling data within a given date range in pandas

Time: 2020-01-11 11:52:35

Tags: python pandas dataframe

I have two DataFrames.

DF1 = pd.DataFrame({'date': {10: pd.Timestamp('2019-01-01 10:00:00'), 52: pd.Timestamp('2019-01-03 04:00:00'), 54: pd.Timestamp('2019-01-03 06:00:00'), 72: pd.Timestamp('2019-01-04 00:00:00'), 74: pd.Timestamp('2019-01-04 02:00:00')}, 'value_1': {10: 4380.0, 52: 4440.0, 54: 4630.0, 72: 4540.0, 74: 4460.0}, 'value_2': {10: 5, 52: 5, 54: 1, 72: 5, 74: 1}})

DF1

                  date  value_1  value_2
10 2019-01-01 10:00:00   4380.0        5
52 2019-01-03 04:00:00   4440.0        5
54 2019-01-03 06:00:00   4630.0        1
72 2019-01-04 00:00:00   4540.0        5
74 2019-01-04 02:00:00   4460.0        1

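DF2 has the same date column as DF1, starting at 2019-01-01 00:00:00 and ending at 2019-12-31 00:00:00, plus other columns that the two DataFrames do not share.

Where the dates in DF1 and DF2 match, I copy the value_1 values from DF1 into DF2 like this: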

DF2['value_1'] = DF2['date'].map(DF1.set_index('date')['value_1'])
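
To make the exact-match behaviour concrete, here is a minimal sketch. The two-row DF1 and the small DF2 (09:00 to 11:00 in 30-minute steps) are assumptions chosen for illustration, not the real data. Only the row whose timestamp equals a DF1 date should receive a value; the rest stay NaN.

```python
import pandas as pd

# Reduced DF1 with two of the rows from the question
DF1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 10:00:00', '2019-01-03 04:00:00']),
    'value_1': [4380.0, 4440.0],
})

# Hypothetical small DF2, for illustration only
DF2 = pd.DataFrame({'date': pd.date_range('2019-01-01 09:00:00',
                                          '2019-01-01 11:00:00', freq='30min')})

# Series.map with a Series argument looks each date up in that Series' index,
# so only dates that appear in DF1 get a value
DF2['value_1'] = DF2['date'].map(DF1.set_index('date')['value_1'])
print(DF2)
```

This is why the rows at 09:30 and earlier are left empty by the mapping alone.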

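Now I am trying to put the same value into DF2 for the 30 minutes leading up to each matched date as well. In other words, if the matched date and time is 2019-01-01 10:00:00 and value_1 is 4380.0, then value_1 should be 4380.0 for the DF2 dates between 2019-01-01 09:30:00 and 2019-01-01 10:00:00.

How can I do this?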

1 answer:

Answer 0 (score: 1)

I think you need merge_asof, first with its default direction='backward' and then with direction='forward', and then combine the two resulting DataFrames with DataFrame.combine_first:

import pandas as pd

DF1 = pd.DataFrame({'date': {10: pd.Timestamp('2019-01-01 10:00:00'), 52: pd.Timestamp('2019-01-03 04:00:00'), 54: pd.Timestamp('2019-01-03 06:00:00'), 72: pd.Timestamp('2019-01-04 00:00:00'), 74: pd.Timestamp('2019-01-04 02:00:00')}, 'value_1': {10: 4380.0, 52: 4440.0, 54: 4630.0, 72: 4540.0, 74: 4460.0}, 'value_2': {10: 5, 52: 5, 54: 1, 72: 5, 74: 1}})

# small sample data for testing
DF2 = pd.DataFrame({'date':pd.date_range('2019-01-01 08:00:00', 
                                         '2019-01-01 12:00:00', freq='20Min')})
print (DF2)
                  date
0  2019-01-01 08:00:00
1  2019-01-01 08:20:00
2  2019-01-01 08:40:00
3  2019-01-01 09:00:00
4  2019-01-01 09:20:00
5  2019-01-01 09:40:00
6  2019-01-01 10:00:00
7  2019-01-01 10:20:00
8  2019-01-01 10:40:00
9  2019-01-01 11:00:00
10 2019-01-01 11:20:00
11 2019-01-01 11:40:00
12 2019-01-01 12:00:00

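# backward (default): match each DF2 row to the last DF1 date at or before it, within 30 minutes
df1 = pd.merge_asof(DF2, DF1, on='date', tolerance=pd.Timedelta('30Min'))
# forward: match each DF2 row to the first DF1 date at or after it, within 30 minutes
df2 = pd.merge_asof(DF2, DF1, on='date', tolerance=pd.Timedelta('30Min'), direction='forward')

# fill the NaNs left by the backward merge with values from the forward merge
df = df1.combine_first(df2)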
print (df)
                  date  value_1  value_2
0  2019-01-01 08:00:00      NaN      NaN
1  2019-01-01 08:20:00      NaN      NaN
2  2019-01-01 08:40:00      NaN      NaN
3  2019-01-01 09:00:00      NaN      NaN
4  2019-01-01 09:20:00      NaN      NaN
5  2019-01-01 09:40:00   4380.0      5.0
6  2019-01-01 10:00:00   4380.0      5.0
7  2019-01-01 10:20:00   4380.0      5.0
8  2019-01-01 10:40:00      NaN      NaN
9  2019-01-01 11:00:00      NaN      NaN
10 2019-01-01 11:20:00      NaN      NaN
11 2019-01-01 11:40:00      NaN      NaN
12 2019-01-01 12:00:00      NaN      NaN
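
Combining both directions fills the rows within 30 minutes on either side of a match, which is why 10:20 is also populated above. If only the 30 minutes before a match are wanted, as the question describes, a single merge with direction='forward' may be enough. The following is a sketch under that assumption, using a reduced two-row DF1 and the same small test DF2; the name before_only is just illustrative.

```python
import pandas as pd

# Reduced DF1 with two of the rows from the question
DF1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 10:00:00', '2019-01-03 04:00:00']),
    'value_1': [4380.0, 4440.0],
})

# Same small test range as above, in 20-minute steps
DF2 = pd.DataFrame({'date': pd.date_range('2019-01-01 08:00:00',
                                          '2019-01-01 12:00:00', freq='20min')})

# direction='forward' matches each DF2 row to the next DF1 date at or after it,
# so only rows up to 30 minutes before a DF1 date (plus the exact match) are filled
before_only = pd.merge_asof(DF2, DF1, on='date',
                            tolerance=pd.Timedelta('30min'),
                            direction='forward')
print(before_only)
```

With this test data, only the 09:40 and 10:00 rows should end up with value_1 set. Both frames need to be sorted by date for merge_asof to work.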