Rolling unique count by group and time in pandas

Time: 2016-10-29 04:26:15

Tags: python pandas

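I'm having trouble computing rolling 7-day unique users by group in a group-user-date dataset. This is a classic metric, and I figured someone could help me do it in pandas.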

Sample data:

from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
import pandas as pd

data = StringIO("""grp1,user,date
    a,1,2016-10-10
    a,1,2016-10-09
    a,1,2016-10-07
    a,2,2016-10-09
    a,2,2016-10-06
    a,3,2016-10-10
    a,3,2016-10-09
    """)

df = pd.read_csv(data)

For this simple dataset, I would like to return:

    a, 2016-10-10, 3  <- 3 users were in group a in the 7 days ending 10/10
    a, 2016-10-09, 3  <- 3 users were in group a in the 7 days ending 10/09
    a, 2016-10-07, 2  <- 2 users were in group a in the 7 days ending 10/07
    a, 2016-10-06, 1  <- 1 user was in group a in the 7 days ending 10/06

I don't mind whether the result is a transform of the original dataset or an aggregation.
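
To pin down the metric, here is a minimal brute-force sketch of what "rolling 7-day unique users" means for the table above. The function name `rolling_unique_users_bruteforce` and the `unique_users` column are illustrative only, and the Python-level loop makes this a reference definition rather than something that would scale to large data.

```python
from io import StringIO

import pandas as pd

csv_text = """grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
"""
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])


def rolling_unique_users_bruteforce(frame, group_cols, window_days=7):
    """For every (group, date) present in the data, count the distinct users
    seen in the `window_days` days ending on that date (inclusive)."""
    rows = []
    for key, grp in frame.groupby(group_cols):
        # Depending on the pandas version, a single grouping column may yield a scalar key.
        key = key if isinstance(key, tuple) else (key,)
        for end in grp["date"].drop_duplicates().sort_values():
            start = end - pd.Timedelta(days=window_days - 1)
            in_window = grp[(grp["date"] >= start) & (grp["date"] <= end)]
            rows.append((*key, end, in_window["user"].nunique()))
    return pd.DataFrame(rows, columns=[*group_cols, "date", "unique_users"])


result = rolling_unique_users_bruteforce(df, ["grp1"])
print(result)
```

Passing several column names in `group_cols` would cover the case of more than one group column.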

I've tried 1) a lot of searching and 2) many variations of:

from datetime import datetime, timedelta

rolling_uniques = lambda x: x['user'].unique().size if x['date'] + timedelta(days=6) <= x['date'].max() else 0

df.apply(rolling_uniques, axis=1)

OR

df.groupby(['grp1', 'user', 'date']).transform(rolling_uniques)

But nothing has worked. In my real data I have multiple group columns and, of course, more categories in grp1 than just 'a'.

1 Answer:

Answer 0: (score: 1)

I don't know whether this is the expected result, but I think it may help you. Let me know.

import io
import pandas as pd

# Test data
data = io.StringIO("""grp1,user,date
    a,1,2016-10-10
    a,1,2016-10-09
    a,1,2016-10-07
    a,2,2016-10-09
    a,2,2016-10-06
    a,3,2016-10-10
    a,3,2016-10-09
    b,1,2016-10-09
    b,2,2016-10-10
    """)


df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'])
# Setting and sorting the index
df.set_index('date', inplace=True)
df.sort_index(inplace=True)

# Resampling data by preserving the group
df = df.groupby([df.index.to_period('D'), df['grp1']]).sum()
df = df.unstack('grp1')
df = df.resample('D').sum().fillna(0)
# Computing the rolling sum
df = df.rolling(7, min_periods=0).sum()

# Formatting
df = df.stack()
df = df.swaplevel(0,1)

print(df)
#                   user
# grp1  date            
#     a 2016-10-06   2.0
#     b 2016-10-06   0.0
#     a 2016-10-07   3.0
#     b 2016-10-07   0.0
#     a 2016-10-08   3.0
#     b 2016-10-08   0.0
#     a 2016-10-09   9.0
#     b 2016-10-09   1.0
#     a 2016-10-10  13.0
#     b 2016-10-10   3.0
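
Note that the numbers printed above are rolling sums of the `user` IDs (for example 13.0 for group a on 2016-10-10), not counts of distinct users. One possible way to get a distinct count is sketched below. It is a different technique from the rolling sum: each activity row is expanded to every window end date it falls into, and users are then counted per group and end date. The names `rolling_unique_users`, `group_cols`, `window_end` and `unique_users` are assumptions made for illustration, and `merge(how="cross")` assumes pandas 1.2 or later.

```python
from io import StringIO

import pandas as pd

csv_text = """grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
b,1,2016-10-09
b,2,2016-10-10
"""
df = pd.read_csv(StringIO(csv_text), parse_dates=["date"])


def rolling_unique_users(frame, group_cols, window_days=7):
    # One row per (group, user, date); repeated rows would not change a distinct count.
    events = frame[[*group_cols, "user", "date"]].drop_duplicates()
    # Activity on day d belongs to every window ending on d, d+1, ..., d+window_days-1.
    offsets = pd.DataFrame(
        {"offset": pd.to_timedelta(list(range(window_days)), unit="D")}
    )
    expanded = events.merge(offsets, how="cross")
    expanded["window_end"] = expanded["date"] + expanded["offset"]
    counts = (
        expanded.groupby([*group_cols, "window_end"])["user"]
        .nunique()
        .rename("unique_users")
        .reset_index()
        .rename(columns={"window_end": "date"})
    )
    # Keep only the end dates that actually occur in the data for each group.
    observed = frame[[*group_cols, "date"]].drop_duplicates()
    return counts.merge(observed, on=[*group_cols, "date"], how="inner")


result = rolling_unique_users(df, ["grp1"])
print(result.sort_values(["grp1", "date"]))
```

The expansion multiplies the row count by `window_days`, so it trades memory for avoiding a Python loop. With several group columns, pass them all in `group_cols`. Dropping the final merge would keep every calendar day covered by some window instead of only the observed dates.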