Suppose we have a df like this (a user may have several rows on the same day):
import pandas as pd

df = pd.DataFrame({"user_id" : ["A"] * 5 + ["B"] * 5,
                   "hour" : [10] * 10,
                   "date" : ["2018-01-16", "2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16",
                             "2018-01-16", "2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16"],
                   "amount" : [1] * 10})
df['date'] = pd.to_datetime(df['date'])
Output:
amount date hour user_id
0 1 2018-01-16 10 A
1 1 2018-01-16 10 A
2 1 2018-01-18 10 A
3 1 2018-01-19 10 A
4 1 2018-02-16 10 A
5 1 2018-01-16 10 B
6 1 2018-01-16 10 B
7 1 2018-01-18 10 B
8 1 2018-01-19 10 B
9 1 2018-02-16 10 B
I would like to compute aggregated rolling statistics of amount for each combination of user_id and hour. Currently I do it like this:
def get_rolling_stats(df, rolling_interval = 7):
    index_cols = ['user_id', 'hour', 'date']
    grp = df.groupby(by = ['user_id', 'hour'], as_index = True, group_keys = False).rolling(window='%sD'%rolling_interval, on = 'date')

    def agg_grp(grp, func):
        res = grp.agg({'amount' : func})
        res = res.reset_index()
        # Keep one row per (user_id, hour, date): the last one of each day.
        res.drop_duplicates(index_cols, inplace = True, keep = 'last')
        res.rename(columns = {'amount' : "amount_" + func}, inplace = True)
        return res

    grp1 = agg_grp(grp, "mean")
    grp2 = agg_grp(grp, "count")
    grp = grp1.merge(grp2, on = index_cols)
    return grp
So the output is:
user_id hour date amount_mean amount_count
0 A 10 2018-01-16 1.0 1.0
1 A 10 2018-01-18 1.0 3.0
2 A 10 2018-01-19 1.0 4.0
3 A 10 2018-02-16 1.0 1.0
4 B 10 2018-01-16 1.0 1.0
5 B 10 2018-01-18 1.0 3.0
6 B 10 2018-01-19 1.0 4.0
7 B 10 2018-02-16 1.0 1.0
But I want to exclude the current date from the rolling window, so I want output like this:
user_id hour date amount_mean amount_count
0 A 10 2018-01-16 nan 0.0
1 A 10 2018-01-18 1.0 2.0
2 A 10 2018-01-19 1.0 3.0
3 A 10 2018-02-16 nan 0.0
4 B 10 2018-01-16 nan 0.0
5 B 10 2018-01-18 1.0 2.0
6 B 10 2018-01-19 1.0 3.0
7 B 10 2018-02-16 nan 0.0
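I have read that the rolling method has a closed argument. But if I use it, it raises an error: ValueError: closed only implemented for datetimelike and offset based windows. I have not found any example of how to use it. Can someone explain how to implement the get_rolling_stats function correctly?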
Answer 0 (score: 1)
It looks like I found an example: https://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-window-endpoints. All I had to do was replace:
grp = df.groupby(by = ['user_id', 'hour'], as_index = True, group_keys = False).rolling(window='%sD'%rolling_interval, on = 'date')
with
grp = df.set_index('date').groupby(by = ['user_id', 'hour'], as_index = True, group_keys = False).\
rolling(window='%sD'%rolling_interval, closed = 'neither')
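To see what closed changes, here is a minimal sketch on a small series with a DatetimeIndex. The series s and its values are made up for illustration, and the dates are unique to keep the example simple. It compares the default window, which includes the current row, with closed='neither', which excludes both endpoints.

```python
import pandas as pd

# Hypothetical toy data: one value per date, indexed by date.
s = pd.Series(
    [1.0, 2.0, 4.0, 8.0],
    index=pd.to_datetime(["2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16"]),
)

# Default for offset windows is closed="right": the current row is included.
with_current = s.rolling("7D").sum()

# closed="neither": the window is (t - 7 days, t), so the current row is excluded.
without_current = s.rolling("7D", closed="neither").sum()

print(pd.DataFrame({"with_current": with_current,
                    "without_current": without_current}))
```

Rows with nothing in the preceding 7 days (here 2018-01-16 and 2018-02-16) come out as NaN in without_current, because the window is empty once the current row is excluded.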