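I'm having trouble computing rolling 7-day unique users by group in a group-user-date dataset. This is a classic metric, and I figured someone could help me do it in pandas.
Sample data: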
from io import StringIO
import pandas as pd
data = StringIO("""grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
""")
df = pd.read_csv(data)
For this simple dataset, I would like to return:
a, 2016-10-10, 3 <- 3 users were in group a in the 7 days ending 10/10
a, 2016-10-09, 3 <- 3 users were in group a in the 7 days ending 10/09
a, 2016-10-07, 2 <- 2 users were in group a in the 7 days ending 10/07
a, 2016-10-06, 1 <- 1 user was in group a in the 7 days ending 10/06
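I don't mind whether the result is a transform on the original dataset or an aggregate.
I've tried 1) a lot of searching and 2) many variations of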
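To make the window definition concrete, here is a minimal brute-force sketch in plain Python that reproduces the numbers above. It assumes the window is the 7 calendar days ending on each date, inclusive (the date itself plus the 6 days before it). `rows` and `rolling_unique_users` are illustrative names, not part of any library.

```python
from datetime import date, timedelta

# Same rows as the sample data: (group, user, date)
rows = [
    ("a", 1, date(2016, 10, 10)),
    ("a", 1, date(2016, 10, 9)),
    ("a", 1, date(2016, 10, 7)),
    ("a", 2, date(2016, 10, 9)),
    ("a", 2, date(2016, 10, 6)),
    ("a", 3, date(2016, 10, 10)),
    ("a", 3, date(2016, 10, 9)),
]


def rolling_unique_users(rows, days=7):
    """Map (group, end_date) to the number of distinct users seen in
    the `days`-day window ending on end_date (assumed inclusive)."""
    result = {}
    # One output row per (group, date) pair that occurs in the data
    for grp, end in sorted({(g, d) for g, _, d in rows}):
        start = end - timedelta(days=days - 1)
        users = {u for g, u, d in rows if g == grp and start <= d <= end}
        result[(grp, end)] = len(users)
    return result


counts = rolling_unique_users(rows)
for (grp, end), n in sorted(counts.items(), reverse=True):
    print(grp, end, n)
```

This is quadratic in the number of rows, so it is only meant to pin down the expected values, not to scale.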
from datetime import datetime, timedelta
rolling_uniques = lambda x: x['user'].unique().size if x['date'] + timedelta(days=6) <= x['date'].max() else 0
df.apply(rolling_uniques, axis=1)
OR
df.groupby(['grp1', 'user', 'date']).transform(rolling_uniques)
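But nothing worked. In my data I have multiple group columns, and of course more categories in grp1 than just 'a'.
Answer 0 (score: 1)
I'm not sure whether this is the expected result, but I think it can help you. Let me know.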
import io
import pandas as pd

# Test data
data = io.StringIO("""grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
b,1,2016-10-09
b,2,2016-10-10
""")
df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'])
# Setting and sorting the index
df.set_index('date', inplace=True)
df.sort_index(inplace=True)
# Resampling data by preserving the group
df = df.groupby([df.index.to_period('D'), df['grp1']]).sum()
df = df.unstack('grp1')
df = df.resample('D').sum().fillna(0)
# Computing the rolling sum
df = df.rolling(7, min_periods=0).sum()
# Formatting
df = df.stack()
df = df.swaplevel(0,1)
print(df)
#                  user
# grp1 date
# a    2016-10-06   2.0
# b    2016-10-06   0.0
# a    2016-10-07   3.0
# b    2016-10-07   0.0
# a    2016-10-08   3.0
# b    2016-10-08   0.0
# a    2016-10-09   9.0
# b    2016-10-09   1.0
# a    2016-10-10  13.0
# b    2016-10-10   3.0
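Note that the snippet above sums the values in the `user` column, so the output is a rolling sum of user IDs rather than a count of distinct users. A minimal sketch of a distinct count follows, assuming an inclusive 7-day window ending on each date that appears in a group. The names `window`, `records`, and `result` are illustrative, and the explicit loop is one simple option rather than the only or fastest way to do this in pandas.

```python
import io
import pandas as pd

data = io.StringIO("""grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
b,1,2016-10-09
b,2,2016-10-10
""")
df = pd.read_csv(data, parse_dates=["date"])

# Assumption: "7 days ending on D" means D and the 6 days before it
window = pd.Timedelta(days=6)

records = []
for grp, group in df.groupby("grp1"):
    # Evaluate the window once per distinct date in this group
    for end in group["date"].drop_duplicates().sort_values():
        in_window = group["date"].between(end - window, end)
        records.append({
            "grp1": grp,
            "date": end,
            "users": group.loc[in_window, "user"].nunique(),
        })

result = pd.DataFrame(records)
print(result)
```

For several group columns, pass a list of column names to `groupby`; the loop key is then a tuple and would need to be unpacked into the matching record fields. The loop rescans each group once per date, which is fine for small data but may be slow on large groups.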