Rolling time window by group on a pandas DataFrame

Time: 2019-11-01 11:36:59

Tags: python pandas pandas-groupby

Consider the following example DataFrame (the code used to construct it is given below):

             t    p
o                   
2007-01-01  0.0  1.0
2007-01-02  0.0  1.0
2007-01-03  0.0  1.0
2007-01-10  0.0  1.0
2007-01-11  0.0  1.0
2007-01-20  1.0  0.0
2007-01-21  1.0  0.0
2007-01-22  1.0  0.0
2007-01-23  1.0  0.0
2007-01-27  1.0  0.0

For each "group" in t, I want the sum over a 2-day forward-looking window. To do this, I implemented:

df.iloc[::-1].groupby('t').rolling(window='2D').sum()

However, this returns:

                 t    p
 t      o                   
0.0 2007-01-11  0.0  1.0
    2007-01-10  0.0  2.0
    2007-01-03  0.0  3.0
    2007-01-02  0.0  4.0
    2007-01-01  0.0  5.0
1.0 2007-01-27  1.0  0.0
    2007-01-23  2.0  0.0
    2007-01-22  3.0  0.0
    2007-01-21  4.0  0.0
    2007-01-20  5.0  0.0

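This is not a 2-day rolling window sum. I think the problem is that when I group by t, I lose the time information ('o'), because it is set as the DataFrame index.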

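Because of the size of the DataFrame, resampling the rows of each group to a fixed 1-day interval is not an option. I tried grouping by 't' and then by 'o', but that did not work.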

The result I want is:

             t    p
    o                   
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0

Supplementary code:

# code to construct df used in this example
import numpy as np
import pandas as pd

o = ['2007-01-01','2007-01-02','2007-01-03','2007-01-10','2007-01-11',
     '2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-27']
t = np.zeros(10)
p = np.ones(10)
p[5:] = 0
t[5:] = 1
df = pd.DataFrame({'o':o, 't':t, 'p':p})
df['o'] = pd.to_datetime(df['o'], format='%Y-%m-%d')
df = df.set_index('o')

1 Answer:

Answer 0 (score: 1)

As an alternative (for a two-day window):

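def day_shift(x, days=2):
    # accumulate, for each date, the values found 1..days calendar days later
    ret = pd.DataFrame(0, index=x.index, columns=x.columns)
    for day in range(-days, 0):
        # moving the index back by |day| days aligns a future row with the current date
        ret = ret.add(x.shift(day, freq='D'), fill_value=0)

    # keep only the dates that exist in the original group
    return ret.reindex(x.index)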

df.groupby('t', as_index=False).apply(day_shift, days=2)

Output:

              t    p
o                   
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0
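
Since the question mentions that the real DataFrame is large, a loop over one shift per day may become slow for longer windows. The following is only a sketch of a vectorized variant, not part of the original answer: it assumes numeric columns without missing values and a DatetimeIndex, and the helper name forward_window_sum is hypothetical. For every row at time t it sums the rows whose timestamps fall in (t, t + days], using prefix sums and numpy.searchsorted, so no resampling to a daily grid is needed.

```python
import numpy as np
import pandas as pd


def forward_window_sum(group, days=2):
    """For each row at time t, sum every column over the rows in (t, t + days].

    Hypothetical helper; assumes a DatetimeIndex and numeric columns without NaN.
    """
    group = group.sort_index()
    times = group.index.values  # sorted datetime64 values
    # Position of the first row strictly after t, and of the first row after t + days.
    start = np.searchsorted(times, times, side="right")
    stop = np.searchsorted(times, times + np.timedelta64(days, "D"), side="right")
    # Prefix sums with a leading zero row: the sum of rows [start, stop)
    # equals csum[stop] - csum[start].
    values = group.to_numpy(dtype=float)
    csum = np.vstack([np.zeros((1, values.shape[1])), values.cumsum(axis=0)])
    return pd.DataFrame(csum[stop] - csum[start],
                        index=group.index, columns=group.columns)


# Sample frame from the question, with 'o' as the DatetimeIndex.
o = pd.to_datetime(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-10', '2007-01-11',
                    '2007-01-20', '2007-01-21', '2007-01-22', '2007-01-23', '2007-01-27'])
df = pd.DataFrame({'t': [0.0] * 5 + [1.0] * 5, 'p': [1.0] * 5 + [0.0] * 5},
                  index=pd.Index(o, name='o'))

# Apply the helper group by group; each group keeps its 't' column, as in the output above.
result = pd.concat([forward_window_sum(g) for _, g in df.groupby('t')])
print(result)
```

The window bounds are located by binary search on the sorted timestamps, so the cost per group does not grow with the number of days in the window.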

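Edit: another approach that makes use of rolling on dates is to reverse the date index. A backward rolling window over the reversed dates is then effectively a forward rolling window over the original dates: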

future_date = pd.to_datetime('2100-01-01')
ancient_date = pd.to_datetime('2000-01-01')

# starting from the frame before `set_index('o')`: set ['o','t'] as the index instead of 'o' alone
df = df.set_index(['o','t'])

# here comes the crazy code
(df
    .assign(r_dates = (future_date - df.index.get_level_values('o')) + ancient_date)  # reverse date
    .sort_values('r_dates')
    .groupby('t')
    .rolling('2D', on='r_dates').sum()    # change 2 to the actual number of days
    .reset_index(level=0, drop=True)      # remove the index caused by groupby
    .assign(r_dates = lambda x: (x.index.get_level_values('o') - pd.to_timedelta('1D')), # shifted the date by one, since rolling includes the current date
           )
    .reset_index()
    .drop('o', axis=1)
    .set_index(['r_dates','t'])
    .reindex(df.index, fill_value=0)
)

Output:

                  p
o          t       
2007-01-01 0.0  2.0
2007-01-02 0.0  1.0
2007-01-03 0.0  0.0
2007-01-10 0.0  1.0
2007-01-11 0.0  0.0
2007-01-01 1.0  0.0
2007-01-02 1.0  0.0
2007-01-03 1.0  0.0
2007-01-10 1.0  0.0
2007-01-11 1.0  0.0
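
As background for why the dates are reversed at all: an offset window such as '2D' is trailing, so for each row it covers the interval (date - 2 days, date] and includes the current row. The minimal sketch below, which rebuilds the sample frame from the question and uses an illustrative variable name (trailing), suggests that grouping by 't' does keep the dates in 'o' available to rolling; the sums simply look backward rather than forward.

```python
import pandas as pd

# Sample frame from the question, with 'o' as the DatetimeIndex.
o = pd.to_datetime(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-10', '2007-01-11',
                    '2007-01-20', '2007-01-21', '2007-01-22', '2007-01-23', '2007-01-27'])
df = pd.DataFrame({'t': [0.0] * 5 + [1.0] * 5, 'p': [1.0] * 5 + [0.0] * 5},
                  index=pd.Index(o, name='o'))

# Each '2D' window covers (date - 2 days, date] and is evaluated within each group of t.
trailing = df.groupby('t')['p'].rolling('2D').sum()
print(trailing)
```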