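I'm trying to take a dataframe of logs and aggregate counts across time windows, specifically the window leading up to each purchase. The goal is to create features that can be used to predict a future purchase.
Here is my original df: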
user_id activity_date activity_type
0 2013-07-11 EmailOpen
0 2013-07-11 FormSubmit
0 2013-07-15 EmailOpen
0 2013-07-17 Purchase
0 2013-07-18 EmailOpen
I would like my result to look like this:
user_id EmailOpen_count FormSubmit_count Days_since_start Purchase
0 2 1 6 1
0 1 0 1 0
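The idea above is that I aggregate everything up to a purchase; since this user has only one purchase, the next row aggregates everything after that last purchase.
I tried extracting the Purchase dates first and then iterating, but it ran all night without finishing. This is how I was extracting the dates, but even this takes far too long, and I'm sure building the new dataframe would take millennia: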
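purchase_dict = {}
for user in list_of_users:
    # Store the list of days on which each user made a purchase.
    # Both conditions are combined into one boolean mask; chaining two
    # separate masks relies on index realignment and triggers a warning.
    mask = (df['user_id'] == user) & (df['activity_type'] == 'Purchase')
    days_bought = list(df.loc[mask, 'activity_date'])
    purchase_dict[user] = days_bought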
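I'm wondering whether there is a semi-efficient way to do this with groupby, agg, time_between, and so on. Thanks!
Answer 0 (score: 0)
This may be a bit clunky, and some columns need renaming at the end, but it seems to work for me (using new test data):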
user_id activity_date activity_type
0 2013-07-11 EmailOpen
0 2013-07-11 FormSubmit
0 2013-07-15 EmailOpen
0 2013-07-17 Purchase
0 2013-07-18 EmailOpen
1 2013-07-12 Purchase
1 2013-07-12 FormSubmit
1 2013-07-15 EmailOpen
1 2013-07-18 Purchase
1 2013-07-18 EmailOpen
2 2013-07-09 EmailOpen
2 2013-07-10 Purchase
2 2013-07-15 EmailOpen
2 2013-07-22 Purchase
2 2013-07-23 EmailOpen
import pandas as pd

# Convert to datetime (rows are assumed to be sorted by user_id and activity_date)
df['activity_date'] = pd.to_datetime(df['activity_date'])
# Flag purchases, then shift the flag down one row within each user, so that a
# purchase closes its own window and the following row opens the next one.
# Shifting per user keeps a trailing purchase from leaking into the next user.
df['x'] = (df['activity_type'] == 'Purchase').astype(int)
df['x'] = df.groupby('user_id')['x'].shift().fillna(0)
# Calculate time window as cumsum of this shifted flag
df['time_window'] = df.groupby('user_id')['x'].cumsum()
# Pivot to count activities by user ID and time window
df2 = df.pivot_table(values='activity_date', index=['user_id', 'time_window'],
                     columns='activity_type', aggfunc=len, fill_value=0)
# Create separate table of days elapsed by user ID & time window
time_elapsed = (df.groupby(['user_id', 'time_window'])['activity_date'].max()
                - df.groupby(['user_id', 'time_window'])['activity_date'].min())
# Merge dataframes
df3 = df2.join(time_elapsed)
This yields:
                     EmailOpen  FormSubmit  Purchase activity_date
user_id time_window
0       0.0                  2           1         1        6 days
        1.0                  1           0         0        0 days
1       0.0                  0           0         1        0 days
        1.0                  1           1         1        6 days
        2.0                  1           0         0        0 days
2       0.0                  1           0         1        1 days
        1.0                  1           0         1        7 days
        2.0                  1           0         0        0 days
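The renaming step mentioned at the start of this answer could look roughly like the sketch below. It rebuilds the windows on a small copy of the test data and reshapes the result into the column layout requested in the question. The names `features`, `counts` and `days` are illustrative only. Note that `Days_since_start` here is the span between the first and last event inside each window, as in the code above, so the row after a purchase shows 0 rather than the 1 in the question's desired output.

```python
import pandas as pd

# Small copy of the test data (users 0 and 1 only), assumed sorted by user and date
rows = [
    (0, '2013-07-11', 'EmailOpen'), (0, '2013-07-11', 'FormSubmit'),
    (0, '2013-07-15', 'EmailOpen'), (0, '2013-07-17', 'Purchase'),
    (0, '2013-07-18', 'EmailOpen'),
    (1, '2013-07-12', 'Purchase'), (1, '2013-07-12', 'FormSubmit'),
    (1, '2013-07-15', 'EmailOpen'), (1, '2013-07-18', 'Purchase'),
    (1, '2013-07-18', 'EmailOpen'),
]
df = pd.DataFrame(rows, columns=['user_id', 'activity_date', 'activity_type'])
df['activity_date'] = pd.to_datetime(df['activity_date'])

# Window id = number of purchases on earlier rows of the same user
is_purchase = (df['activity_type'] == 'Purchase').astype(int)
shifted = is_purchase.groupby(df['user_id']).shift(fill_value=0)
df['time_window'] = shifted.groupby(df['user_id']).cumsum()

# Activity counts per user and window
counts = df.pivot_table(values='activity_date', index=['user_id', 'time_window'],
                        columns='activity_type', aggfunc='count', fill_value=0)

# Days between the first and last event of each window, as plain integers
span = df.groupby(['user_id', 'time_window'])['activity_date'].agg(['min', 'max'])
days = (span['max'] - span['min']).dt.days.rename('Days_since_start')

# Rename to the layout asked for in the question and flatten the index
features = (counts.rename(columns={'EmailOpen': 'EmailOpen_count',
                                   'FormSubmit': 'FormSubmit_count'})
                  .join(days)
                  .reset_index())
features = features[['user_id', 'EmailOpen_count', 'FormSubmit_count',
                     'Days_since_start', 'Purchase']]
print(features)
```

If the question's definition is preferred (days since the previous purchase), the window start would have to be taken from the preceding purchase date instead of the window's own minimum.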
Edit per comment:
To add the time elapsed since each activity type:
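# Last event in each window minus the last occurrence of each activity type in that window
time_since_activity = (df.groupby(['user_id', 'time_window'])['activity_date'].max()
                       - df.groupby(['user_id', 'time_window', 'activity_type'])['activity_date'].max())
# One column per activity type; overlapping names get the '_time' suffix
df4 = df3.join(time_since_activity.unstack('activity_type'), rsuffix='_time')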
This yields:
                     EmailOpen  FormSubmit  ... FormSubmit_time Purchase_time
user_id time_window                         ...
0       0.0                  2           1  ...          6 days        0 days
        1.0                  1           0  ...             NaT           NaT
1       0.0                  0           0  ...             NaT        0 days
        1.0                  1           1  ...          6 days        0 days
        2.0                  1           0  ...             NaT           NaT
2       0.0                  1           0  ...             NaT        0 days
        1.0                  1           0  ...             NaT        0 days
        2.0                  1           0  ...             NaT           NaT
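Since the stated goal is model features, the timedelta columns will probably need to become plain numbers at some point. A minimal sketch of one way to do that follows, using a hand-built frame shaped like the `_time` columns rather than the real `df4`. The names `times` and `numeric` and the `FormSubmit_seen` indicator are assumptions for illustration, not part of the answer's code.

```python
import pandas as pd

# Hypothetical stand-in for the `_time` columns of df4 (NaT = activity never seen in the window)
times = pd.DataFrame({
    'FormSubmit_time': pd.to_timedelta(['6 days', None, None]),
    'Purchase_time': pd.to_timedelta(['0 days', None, '0 days']),
})

# Whole days as numbers; NaT becomes NaN, so the columns end up as floats
numeric = times.apply(lambda col: col.dt.days)

# Optional indicator so a model can tell "never happened" apart from "0 days ago"
numeric['FormSubmit_seen'] = times['FormSubmit_time'].notna().astype(int)
print(numeric)
```

Whether to leave the missing values as NaN, fill them with a sentinel, or rely on the indicator column depends on the model being used.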