Suppose I have two DataFrames, df1 and df2.
df1:
date value
0 2018-01-23 10:00:00 10
1 2018-01-23 10:05:00 20
2 2018-01-23 10:10:00 30
3 2018-01-23 10:15:00 40
4 2018-01-23 10:20:00 50
df2:
date value
0 2018-01-23 10:02:00 10
1 2018-01-23 10:03:00 20
2 2018-01-23 10:04:00 30
3 2018-01-23 10:05:00 40
4 2018-01-23 10:16:00 50
5 2018-01-23 10:17:00 60
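For anyone who wants to reproduce this, the two frames printed above can be rebuilt roughly as follows. This is only a sketch: it assumes date is a datetime64 column (the post does not show the dtypes), and the helper at_minutes is made up here for brevity.

```python
import pandas as pd

def at_minutes(minutes):
    # Hypothetical helper (not in the original post):
    # timestamps on 2018-01-23 at 10:MM:00.
    return pd.to_datetime([f"2018-01-23 10:{m:02d}:00" for m in minutes])

# Same rows as the printed df1 and df2.
df1 = pd.DataFrame({"date": at_minutes([0, 5, 10, 15, 20]),
                    "value": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"date": at_minutes([2, 3, 4, 5, 16, 17]),
                    "value": [10, 20, 30, 40, 50, 60]})

print(df1)
print(df2)
```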
First I build an IntervalIndex (closed on the left, open on the right) from df1.date. For each interval, I need to compute the sum of df2.value and map that sum back to df1.
Edit: the code I used:
shift_date = df1.date.shift(-1)
shift_date[-1] = df1.date.iloc[-2] + timedelta(minutes=5) #avoid NaT
idx = pd.IntervalIndex.from_arrays(df1.date, shift_date, closed = "left")
df2_sum = df2.loc[idx.get_indexer(df1.date), 'value']
df2_sum = df2_sum.groupby(df2_sum.index).sum()
But this only maps the values of df1 onto df2.index.
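A likely reason for that behaviour is that get_indexer is being asked where the df1 dates fall, not where the df2 dates fall. The sketch below compares the two calls. It assumes datetime64 date columns and an assumed 5-minute width for the last interval; at_minutes is a helper invented for this example.

```python
import pandas as pd

def at_minutes(minutes):
    # Hypothetical helper (not in the original post).
    return pd.to_datetime([f"2018-01-23 10:{m:02d}:00" for m in minutes])

df1 = pd.DataFrame({"date": at_minutes([0, 5, 10, 15, 20]),
                    "value": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"date": at_minutes([2, 3, 4, 5, 16, 17]),
                    "value": [10, 20, 30, 40, 50, 60]})

# Right edges: the next df1.date; the last row gets an assumed 5-minute width.
right = df1["date"].shift(-1).fillna(df1["date"].iloc[-1] + pd.Timedelta(minutes=5))
idx = pd.IntervalIndex.from_arrays(df1["date"], right, closed="left")

# Each df1 date is the left edge of its own interval, so this is just 0..4.
own_positions = idx.get_indexer(df1["date"]).tolist()
# The df2 dates are what actually need to be located in the intervals.
df2_positions = idx.get_indexer(df2["date"]).tolist()

print(own_positions)  # [0, 1, 2, 3, 4]
print(df2_positions)  # [0, 0, 0, 1, 3, 3]
```

Passing positions 0..4 to df2.loc therefore just selects the first five rows of df2 by label, which matches the symptom described.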
What I am looking for is:
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
Answer 0 (score: 1)
First create an IntervalIndex, replacing the trailing NaT with a date far in the future, such as 2100-01-01:
# the last row has no successor, so its NaT right edge is filled with a far-future date
df1.index = pd.IntervalIndex.from_arrays(df1.date,
                                         df1.date.shift(-1).fillna(pd.Timestamp(2100, 1, 1)),
                                         closed="left")
print (df1)
date value
[2018-01-23 10:00:00, 2018-01-23 10:05:00) 2018-01-23 10:00:00 10
[2018-01-23 10:05:00, 2018-01-23 10:10:00) 2018-01-23 10:05:00 20
[2018-01-23 10:10:00, 2018-01-23 10:15:00) 2018-01-23 10:10:00 30
[2018-01-23 10:15:00, 2018-01-23 10:20:00) 2018-01-23 10:15:00 40
[2018-01-23 10:20:00, 2100-01-01) 2018-01-23 10:20:00 50
Then use cut with groupby and aggregate with sum:
# observed=False keeps intervals that contain no df2 rows, so their sum is 0
df3 = df2.groupby(pd.cut(df2.date, bins=df1.index), observed=False)['value'].sum().rename('df2_value')
print (df3)
date
[2018-01-23 10:00:00, 2018-01-23 10:05:00) 60
[2018-01-23 10:05:00, 2018-01-23 10:10:00) 40
[2018-01-23 10:10:00, 2018-01-23 10:15:00) 0
[2018-01-23 10:15:00, 2018-01-23 10:20:00) 110
[2018-01-23 10:20:00, 2100-01-01) 0
Name: df2_value, dtype: int64
Both indexes are the same, so you can drop them and concat:
df = pd.concat([df1.reset_index(drop=True), df3.reset_index(drop=True)], axis=1)
print (df)
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
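Put together, the steps of this answer might look like the self-contained sketch below. A few things here are assumptions rather than part of the answer: the date columns are taken to be datetime64, at_minutes is a made-up helper, the intervals are kept in a separate variable instead of overwriting df1.index, and observed=False is passed explicitly so that empty intervals are kept.

```python
import pandas as pd

def at_minutes(minutes):
    # Hypothetical helper (not in the original post).
    return pd.to_datetime([f"2018-01-23 10:{m:02d}:00" for m in minutes])

df1 = pd.DataFrame({"date": at_minutes([0, 5, 10, 15, 20]),
                    "value": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"date": at_minutes([2, 3, 4, 5, 16, 17]),
                    "value": [10, 20, 30, 40, 50, 60]})

# Left-closed intervals [date_i, date_{i+1}); the last one runs to a far-future date.
right = df1["date"].shift(-1).fillna(pd.Timestamp("2100-01-01"))
bins = pd.IntervalIndex.from_arrays(df1["date"], right, closed="left")

# Assign each df2 row to an interval, then sum per interval.
# observed=False keeps intervals with no df2 rows (their sum is 0).
sums = (
    df2.groupby(pd.cut(df2["date"], bins=bins), observed=False)["value"]
    .sum()
    .rename("df2_value")
)

# The sums are in interval order, which is also df1's row order.
out = pd.concat([df1, sums.reset_index(drop=True)], axis=1)
print(out)
```

Keeping the intervals in bins leaves df1 with its default integer index, which makes the final concat a plain positional join.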
Answer 1 (score: 0)
Slightly simpler:
ii = pd.IntervalIndex.from_breaks(df1['date'], closed='left')
res = df2.groupby(ii.get_indexer(df2['date']))['value'].sum()
df1['df2_value'] = res.reindex(df1.index, fill_value=0)
Resulting output of df1:
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
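One thing to watch with from_breaks: five breaks give only four intervals, so a df2 date at or after the last df1.date gets position -1 from get_indexer and is silently dropped by the reindex. The sketch below illustrates this with one extra df2 row at 10:25 that is purely hypothetical (it is not in the question), and shows one possible workaround: appending a far-future break so the last df1 row gets an open-ended interval. at_minutes is again a made-up helper.

```python
import pandas as pd

def at_minutes(minutes):
    # Hypothetical helper (not in the original post).
    return pd.to_datetime([f"2018-01-23 10:{m:02d}:00" for m in minutes])

df1 = pd.DataFrame({"date": at_minutes([0, 5, 10, 15, 20]),
                    "value": [10, 20, 30, 40, 50]})
# The last row (10:25, value 70) is a hypothetical addition for illustration.
df2 = pd.DataFrame({"date": at_minutes([2, 3, 4, 5, 16, 17, 25]),
                    "value": [10, 20, 30, 40, 50, 60, 70]})

# As in the answer: 5 breaks -> 4 intervals, so 10:25 falls in none of them.
ii = pd.IntervalIndex.from_breaks(df1["date"], closed="left")
pos_plain = ii.get_indexer(df2["date"]).tolist()
print(pos_plain)  # [0, 0, 0, 1, 3, 3, -1]

# Possible workaround: add a far-future break so there are 5 intervals.
breaks = pd.DatetimeIndex(df1["date"]).append(pd.DatetimeIndex(["2100-01-01"]))
ii_open = pd.IntervalIndex.from_breaks(breaks, closed="left")
pos_open = ii_open.get_indexer(df2["date"])

res = df2.groupby(pos_open)["value"].sum()
df1["df2_value"] = res.reindex(df1.index, fill_value=0)
print(df1)
```

With the question's original six df2 rows, both variants give the same result, since none of those dates is at or after 10:20.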