Consider the following example DataFrame (the code used to build it is given below):
              t    p
o
2007-01-01  0.0  1.0
2007-01-02  0.0  1.0
2007-01-03  0.0  1.0
2007-01-10  0.0  1.0
2007-01-11  0.0  1.0
2007-01-20  1.0  0.0
2007-01-21  1.0  0.0
2007-01-22  1.0  0.0
2007-01-23  1.0  0.0
2007-01-27  1.0  0.0
For each "group" in 't', I want the sum over a 2-day forward-looking window. To do this, I tried:
df.iloc[::-1].groupby('t').rolling(window='2D').sum()
However, this returns:
                  t    p
t   o
0.0 2007-01-11  0.0  1.0
    2007-01-10  0.0  2.0
    2007-01-03  0.0  3.0
    2007-01-02  0.0  4.0
    2007-01-01  0.0  5.0
1.0 2007-01-27  1.0  0.0
    2007-01-23  2.0  0.0
    2007-01-22  3.0  0.0
    2007-01-21  4.0  0.0
    2007-01-20  5.0  0.0
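This is not a 2-day rolling window sum. I think the problem is that when I group by 't', I lose the time information ('o'), because it is set as the DataFrame index.
Because of the size of the DataFrame, resampling each group to fixed 1-day intervals is not feasible. I also tried grouping by 't' and then by 'o', but that did not work either.
The output I want is: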
              t    p
o
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0
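Read against the table above, the intended window appears to cover the rows of the same group dated strictly after the current row and at most 2 days later (the current row itself is not counted). The following is only a slow reference sketch of that reading, useful for checking faster solutions on small inputs; the helper name forward_window_sum is made up for illustration.

```python
import pandas as pd

def forward_window_sum(s, days=2):
    # Hypothetical reference helper: O(n^2), meant for checking results only.
    out = []
    for d in s.index:
        # rows dated in (d, d + days]; the current row is excluded
        in_window = (s.index > d) & (s.index <= d + pd.Timedelta(days=days))
        out.append(s[in_window].sum())
    return pd.Series(out, index=s.index)

# 'p' values of the t == 0.0 group from the example
p0 = pd.Series(1.0, index=pd.to_datetime(
    ['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-10', '2007-01-11']))
expected_p0 = forward_window_sum(p0)
print(expected_p0)
```

For this group the sketch gives 2, 1, 0, 1, 0, which agrees with the 'p' column of the desired output.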
Supplementary code:
# code to construct the df used in this example
import numpy as np
import pandas as pd

o = ['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-10', '2007-01-11',
     '2007-01-20', '2007-01-21', '2007-01-22', '2007-01-23', '2007-01-27']
t = np.zeros(10)
p = np.ones(10)
p[5:] = 0  # the last five rows have p == 0
t[5:] = 1  # the last five rows belong to group t == 1
df = pd.DataFrame({'o': o, 't': t, 'p': p})
df['o'] = pd.to_datetime(df['o'], format='%Y-%m-%d')
df = df.set_index('o')  # use the dates as the index
Answer 0 (score: 1)
As an alternative (for a two-day window):
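def day_shift(x, days=2):
    # accumulate the values found 1..days calendar days after each date
    ret = pd.DataFrame(0, index=x.index, columns=x.columns)
    for day in range(-days, 0):
        # move the index labels back by |day| days, then align on dates
        ret = ret.add(x.shift(day, freq='D'), fill_value=0)
    return ret.reindex(x.index)

# group_keys=False keeps the original 'o' index on newer pandas versions
df.groupby('t', group_keys=False).apply(day_shift, days=2)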
Output:
              t    p
o
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0
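The key step in day_shift is shift with a freq argument, which moves the index labels rather than the values by row position. A minimal sketch of that mechanism on a single toy Series (the dates are those of the t == 1.0 group; the variable names are only illustrative):

```python
import pandas as pd

# 't' values of the t == 1.0 group; note the gap before 2007-01-27.
idx = pd.to_datetime(['2007-01-20', '2007-01-21', '2007-01-22',
                      '2007-01-23', '2007-01-27'])
s = pd.Series(1.0, index=idx)

# With freq='D' the labels move: the value observed on day d+1 is now labelled d.
next_day = s.shift(-1, freq='D')
two_days = s.shift(-2, freq='D')

# Align on labels, treat missing days as 0, and keep only the original dates.
forward = next_day.add(two_days, fill_value=0).reindex(s.index, fill_value=0)
print(forward)
```

Because the alignment is done on calendar dates, the gap between 2007-01-23 and 2007-01-27 contributes nothing, and the result is 2, 2, 1, 0, 0, matching the 't' column above.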
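Edit: another approach, which makes use of date-based rolling, is to reverse the date index. A backward rolling window over the reversed dates is then effectively a forward rolling window over the original dates: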
future_date = pd.to_datetime('2100-01-01')
ancient_date = pd.to_datetime('2000-01-01')
# instead of setting 'o' as the index (skip df.set_index('o') from the question),
# set ['o', 't'] as the index
df = df.set_index(['o', 't'])
# here comes the crazy code
(df
 .assign(r_dates=(future_date - df.index.get_level_values('o')) + ancient_date)  # reverse the dates
 .sort_values('r_dates')
 .groupby('t')
 .rolling('2D', on='r_dates').sum()  # change 2 to the actual number of days
 .reset_index(level=0, drop=True)  # remove the index level added by groupby
 # shift the dates back by one day, since rolling includes the current date
 .assign(r_dates=lambda x: x.index.get_level_values('o') - pd.to_timedelta('1D'))
 .reset_index()
 .drop('o', axis=1)
 .set_index(['r_dates', 't'])
 .reindex(df.index, fill_value=0)
)
Output:
                  p
o          t
2007-01-01 0.0  2.0
2007-01-02 0.0  1.0
2007-01-03 0.0  0.0
2007-01-10 0.0  1.0
2007-01-11 0.0  0.0
2007-01-20 1.0  0.0
2007-01-21 1.0  0.0
2007-01-22 1.0  0.0
2007-01-23 1.0  0.0
2007-01-27 1.0  0.0
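A possible variation on the reversed-date idea, sketched under some assumptions: rolling accepts a closed argument for offset windows, and closed='left' leaves the current row out of the window, so no one-day shift is needed afterwards and the result does not rely on the next calendar day being present in the data. The helper forward_sum and the constants FUTURE and ANCIENT are illustrative names, and the sketch assumes dates are unique within each group.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame, indexed by the date column 'o'.
o = pd.to_datetime(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-10', '2007-01-11',
                    '2007-01-20', '2007-01-21', '2007-01-22', '2007-01-23', '2007-01-27'])
df = pd.DataFrame({'t': np.repeat([0.0, 1.0], 5),
                   'p': np.repeat([1.0, 0.0], 5)},
                  index=pd.Index(o, name='o'))

FUTURE = pd.Timestamp('2100-01-01')   # arbitrary anchor dates, as in the answer
ANCIENT = pd.Timestamp('2000-01-01')

def forward_sum(g, window='2D'):
    """Hypothetical helper: for each date d, sum the rows dated in (d, d + window]."""
    g = g.sort_index()
    rev = g.iloc[::-1].copy()
    rev.index = (FUTURE - rev.index) + ANCIENT  # later dates map to earlier ones
    # closed='left' uses the window [r - window, r), so the current row is excluded;
    # an empty window yields NaN, which is replaced by 0
    rolled = rev.rolling(window, closed='left').sum().fillna(0)
    # undo the reversal and restore the original dates
    return pd.DataFrame(rolled.to_numpy()[::-1], index=g.index, columns=g.columns)

# Iterating over the groupby yields each sub-frame with all of its columns.
result = pd.concat([forward_sum(g) for _, g in df.groupby('t')]).sort_index()
print(result)
```

On the example data this reproduces the desired 't' and 'p' columns from the question. Looping over the groups and concatenating is a deliberate choice here to sidestep differences in how groupby(...).apply treats the grouping column across pandas versions.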