Question

Suppose we have a df like this (a user may have multiple rows on the same day):

import pandas as pd

df = pd.DataFrame({"user_id": ["A"] * 5 + ["B"] * 5,
                   "hour": [10] * 10,
                   "date": ["2018-01-16", "2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16"] * 2,
                   "amount": [1] * 10})
df['date'] = pd.to_datetime(df['date'])

Output:

amount  date    hour    user_id
0   1   2018-01-16  10  A
1   1   2018-01-16  10  A
2   1   2018-01-18  10  A
3   1   2018-01-19  10  A
4   1   2018-02-16  10  A
5   1   2018-01-16  10  B
6   1   2018-01-16  10  B
7   1   2018-01-18  10  B
8   1   2018-01-19  10  B
9   1   2018-02-16  10  B

I want to compute rolling stats aggregated over amount for each user_id and hour. Currently I do it like this:

def get_rolling_stats(df, rolling_interval=7):
    index_cols = ['user_id', 'hour', 'date']
    # Time-based rolling window over 'date' within each (user_id, hour) group
    grp = df.groupby(by=['user_id', 'hour'], as_index=True, group_keys=False).rolling(window='%sD' % rolling_interval, on='date')

    def agg_grp(grp, func):
        res = grp.agg({'amount': func})

        res = res.reset_index()
        # Several rows can share a date; keep the last one, whose window covers the whole day
        res.drop_duplicates(index_cols, inplace=True, keep='last')
        res.rename(columns={'amount': "amount_" + func}, inplace=True)
        return res

    grp1 = agg_grp(grp, "mean")
    grp2 = agg_grp(grp, "count")

    grp = grp1.merge(grp2, on=index_cols)
    return grp

So the output is:

user_id hour    date    amount_mean amount_count
0   A   10  2018-01-16  1.0 1.0
1   A   10  2018-01-18  1.0 3.0
2   A   10  2018-01-19  1.0 4.0
3   A   10  2018-02-16  1.0 1.0
4   B   10  2018-01-16  1.0 1.0
5   B   10  2018-01-18  1.0 3.0
6   B   10  2018-01-19  1.0 4.0
7   B   10  2018-02-16  1.0 1.0

But I want to exclude the current date from the rolling window. So I want output like this:

user_id hour    date    amount_mean amount_count
0   A   10  2018-01-16  nan 0.0
1   A   10  2018-01-18  1.0 2.0
2   A   10  2018-01-19  1.0 3.0
3   A   10  2018-02-16  nan 0.0
4   B   10  2018-01-16  nan 0.0
5   B   10  2018-01-18  1.0 2.0
6   B   10  2018-01-19  1.0 3.0
7   B   10  2018-02-16  nan 0.0

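I have read that the rolling method has a closed argument. However, if I use it, it raises an error: ValueError: closed only implemented for datetimelike and offset based windows. I haven't found any example of how to use it. Can someone explain how to implement the get_rolling_stats function correctly?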

Answer 1

It looks like I found an example: https://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-window-endpoints. All I had to do was replace:

grp = df.groupby(by = ['user_id', 'hour'], as_index = True, group_keys = False).rolling(window='%sD'%rolling_interval, on = 'date')

with

grp = df.set_index('date').groupby(by = ['user_id', 'hour'], as_index = True, group_keys = False).\
                   rolling(window='%sD'%rolling_interval, closed = 'neither')

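That is, roll over an open window: with date as the index, closed = 'neither' excludes both endpoints of each window, so rows on the current date are left out.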
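For reference, here is a minimal sketch of how the whole function might look with that change applied. It is not the exact code from the question: it assumes a reasonably recent pandas, selects the amount column before rolling, sorts by date so each group's index is monotonic, and computes mean and count separately instead of merging two agg results. The names rolled and result are only illustrative. Because the value reported by count for an empty window may differ between pandas versions (NaN or 0), the sketch normalizes it with fillna(0).

```python
import pandas as pd


def get_rolling_stats(df, rolling_interval=7):
    index_cols = ["user_id", "hour", "date"]

    rolled = (
        df.sort_values("date", kind="mergesort")  # offset windows need a monotonic index
          .set_index("date")                      # closed= is applied to the DatetimeIndex
          .groupby(["user_id", "hour"])["amount"]
          .rolling(window=f"{rolling_interval}D", closed="neither")
    )

    mean = rolled.mean()
    # An empty window may be reported as NaN or 0 depending on the pandas
    # version, so normalise it to 0 here.
    count = rolled.count().fillna(0)

    # Both results come from the same rolling object, so their rows line up.
    res = mean.rename("amount_mean").reset_index()
    res["amount_count"] = count.to_numpy()

    # Keep one row per (user_id, hour, date), as in the question.
    return res.drop_duplicates(index_cols, keep="last").reset_index(drop=True)


df = pd.DataFrame({
    "user_id": ["A"] * 5 + ["B"] * 5,
    "hour": [10] * 10,
    "date": ["2018-01-16", "2018-01-16", "2018-01-18",
             "2018-01-19", "2018-02-16"] * 2,
    "amount": [1] * 10,
})
df["date"] = pd.to_datetime(df["date"])

result = get_rolling_stats(df)
print(result)
```

Compare the printed frame with the desired table in the question. If you want a row dated exactly rolling_interval days earlier to be included in the window, closed="left" is the variant to try instead of closed="neither".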
