Pandas: aggregating user logs across time windows

Asked: 2019-07-12 16:04:37

Tags: pandas pandas-groupby

I'm trying to take a dataframe of logs and aggregate counts across time windows, specifically the window leading up to each purchase. The goal is to create features that can be used to predict future purchases.

Here is my original df:

user_id activity_date activity_type
0       2013-07-11    EmailOpen
0       2013-07-11    FormSubmit
0       2013-07-15    EmailOpen
0       2013-07-17    Purchase
0       2013-07-18    EmailOpen                     

I'd like my result to look like this:

user_id EmailOpen_count FormSubmit_count Days_since_start Purchase
0       2               1                6                1
0       1               0                1                0

The idea above is that I aggregate everything up to the purchase; since this user made only one purchase, the next row aggregates everything after that last purchase.

I tried extracting the 'Purchase' dates first and then iterating over them, but it ran all night without finishing. This is how I was extracting the dates, but even this took far too long, and I'm sure building the new dataframe from it would take ages:

purchase_dict = {}
for user in list_of_users:
    # Stores list of days when purchase was made for each user.
    days_bought = list(df[(df['user_id'] == user) & (df['activity_type'] == 'Purchase')]['activity_date'])
    purchase_dict[user] = days_bought
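
For what it's worth, the same per-user purchase dates can be collected without a Python loop; a minimal sketch, assuming the same df as above:

# Hypothetical vectorized equivalent: keep only Purchase rows, then group once by user
purchase_dict = (df[df['activity_type'] == 'Purchase']
                 .groupby('user_id')['activity_date']
                 .apply(list)
                 .to_dict())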

I'm wondering whether there is a reasonably efficient way to do this with groupby, agg, time differences between rows, and so on. Thanks!

1 Answer:

Answer 0 (score: 0):

Perhaps a bit clunky, and some columns need renaming at the end, but this seems to work for me (using new test data):

user_id activity_date activity_type
0       2013-07-11    EmailOpen
0       2013-07-11    FormSubmit
0       2013-07-15    EmailOpen
0       2013-07-17    Purchase
0       2013-07-18    EmailOpen   
1       2013-07-12    Purchase
1       2013-07-12    FormSubmit
1       2013-07-15    EmailOpen
1       2013-07-18    Purchase
1       2013-07-18    EmailOpen   
2       2013-07-09    EmailOpen
2       2013-07-10    Purchase
2       2013-07-15    EmailOpen
2       2013-07-22    Purchase
2       2013-07-23    EmailOpen   
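
For reference, a minimal sketch of building that test frame (the answer does not show this step, so the construction below is an assumption):

import pandas as pd

# Hypothetical reconstruction of the test data shown above
df = pd.DataFrame({
    'user_id': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'activity_date': ['2013-07-11', '2013-07-11', '2013-07-15', '2013-07-17', '2013-07-18',
                      '2013-07-12', '2013-07-12', '2013-07-15', '2013-07-18', '2013-07-18',
                      '2013-07-09', '2013-07-10', '2013-07-15', '2013-07-22', '2013-07-23'],
    'activity_type': ['EmailOpen', 'FormSubmit', 'EmailOpen', 'Purchase', 'EmailOpen',
                      'Purchase', 'FormSubmit', 'EmailOpen', 'Purchase', 'EmailOpen',
                      'EmailOpen', 'Purchase', 'EmailOpen', 'Purchase', 'EmailOpen'],
})
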
# Convert to datetime
df['activity_date'] = pd.to_datetime(df['activity_date'])
# Create shifted flag to identify purchase
df['x'] = (df['activity_type'] == 'Purchase').astype(int).shift().fillna(method='bfill')
# Calculate time window as cumsum of this shifted flag
df['time_window'] = df.groupby('user_id')['x'].cumsum()
# Pivot to count activities by user ID and time window
df2 = df.pivot_table(values='activity_date', index=['user_id', 'time_window'], 
                     columns='activity_type', aggfunc=len, fill_value=0)

# Create separate table of days elapsed by user ID & time window
time_elapsed = ( df.groupby(['user_id', 'time_window'])['activity_date'].max() 
                 - df.groupby(['user_id', 'time_window'])['activity_date'].min() )

# Merge dataframes
df3 = df2.join(time_elapsed)

This yields:

                     EmailOpen  FormSubmit  Purchase activity_date
user_id time_window                                               
0       0.0                  2           1         1        6 days
        1.0                  1           0         0        0 days
1       0.0                  0           0         1        0 days
        1.0                  1           1         1        6 days
        2.0                  1           0         0        0 days
2       0.0                  1           0         1        1 days
        1.0                  1           0         1        7 days
        2.0                  1           0         0        0 days
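
The renaming mentioned above could then be a final step along these lines; a sketch, with column names chosen to match the desired output and the timedelta converted to whole days (both are my assumptions about the format wanted):

# Hypothetical cleanup: rename columns and turn the timedelta into an integer day count
result = df3.rename(columns={'EmailOpen': 'EmailOpen_count',
                             'FormSubmit': 'FormSubmit_count',
                             'activity_date': 'Days_since_start'})
result['Days_since_start'] = result['Days_since_start'].dt.days
result = result.reset_index()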

Edit, per the comments:

To add the time elapsed since each activity type:

time_since_activity = ( df.groupby(['user_id', 'time_window'])['activity_date'].max() 
                      - df.groupby(['user_id', 'time_window', 'activity_type'])['activity_date'].max() )

df4 = df3.join(time_since_activity.unstack('activity_type'), rsuffix='_time')

This yields:

                     EmailOpen  FormSubmit  ...  FormSubmit_time  Purchase_time
user_id time_window                         ...                             
0       0.0                  2           1  ...          6 days       0 days
        1.0                  1           0  ...             NaT          NaT
1       0.0                  0           0  ...             NaT       0 days
        1.0                  1           1  ...          6 days       0 days
        2.0                  1           0  ...             NaT          NaT
2       0.0                  1           0  ...             NaT       0 days
        1.0                  1           0  ...             NaT       0 days
        2.0                  1           0  ...             NaT          NaT
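
Since the end goal is predictive features, the new timedelta columns can likewise be converted to numeric days (NaT becomes NaN); a minimal sketch, assuming the `_time` suffix produced by the join above:

# Hypothetical: convert each "time since last activity" column to a float day count
for col in ['EmailOpen_time', 'FormSubmit_time', 'Purchase_time']:
    df4[col] = df4[col].dt.days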