The following dataset represents purchase behavior:
user_id, product_code, bought_date, time_spent, store_id, product_type, refurbished, unqiue_visit_id
001, e.12, 20120102, 104, 101, computer, yes, 1010
002, e.24, 20120201, 100, 101, infant-dress, no, 2001
003, s.32, 20130302, 230, 101, shoes, no, 2121
004, y.23, 20130404, 212, 103, computer, yes, 2422
005, s.43, 20130803, 104, 101, laptop, yes, 2342
001, a.12, 20120102, 104, 101, computer, yes, 1011
002, b.24, 20120201, 100, 101, infant-dress, no, 2001
003, c.32, 20130302, 230, 101, shoes, no, 2122
004, e.23, 20130404, 212, 103, computer, yes, 2424
005, f.43, 20130803, 104, 101, laptop, yes, 2340
001, g.12, 20120102, 104, 101, computer, yes, 1013
002, h.24, 20120201, 100, 101, infant-dress, no, 2031
003, l.32, 20130302, 230, 101, shoes, no, 2000
004, m.23, 20130404, 212, 103, computer, yes, 1422
005, d.43, 20130803, 104, 101, laptop, yes, 1142
001, d.12, 20120102, 104, 101, desk, yes, 1110
002, f.24, 20120201, 100, 101, glass, no, 1111
003, n.32, 20130302, 230, 101, liquid, no, 2021
004, t.23, 20130404, 212, 103, liquid, yes, 22
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, d.12, 20120102, 104, 101, desk, yes, 1910
002, f.24, 20120201, 100, 101, glass, no, 2901
003, n.32, 20130302, 230, 101, liquid, no, 2921
004, t.23, 20130404, 212, 103, liquid, yes, 2922
005, u.43, 20130803, 104, 101, dress, yes, 2942
001, kk.12, 20120103, 105, 101, desk, yes, 410
003, n.32, 20130303, 230, 101, liquid, no, 2621
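To experiment with these rows, one option is to read them with pandas. The sketch below is an assumption about how the data might be loaded (the question does not show this step); it uses only a few of the rows above and keeps the misspelled column name 'unqiue_visit_id' as posted.

```python
import io
import pandas as pd

# A few rows copied from the data above; the header keeps the original
# (misspelled) column name 'unqiue_visit_id'.
raw = """user_id, product_code, bought_date, time_spent, store_id, product_type, refurbished, unqiue_visit_id
001, e.12, 20120102, 104, 101, computer, yes, 1010
001, a.12, 20120102, 104, 101, computer, yes, 1011
001, d.12, 20120102, 104, 101, desk, yes, 1110
001, kk.12, 20120103, 105, 101, desk, yes, 410
002, e.24, 20120201, 100, 101, infant-dress, no, 2001
002, f.24, 20120201, 100, 101, glass, no, 1111
"""

merged_visits_df = pd.read_csv(
    io.StringIO(raw),
    skipinitialspace=True,    # drop the space that follows each comma
    dtype={"user_id": str},   # keep leading zeros such as '001'
)
print(merged_visits_df.dtypes)
```

Here bought_date is left as an integer in YYYYMMDD form, which already sorts chronologically; converting it with pd.to_datetime would also work.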
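The end goal is to assign a product type to each user, using the following steps.
First, I group by user_id and product_type and get the number of visits (count) for each product_type the user visited.
If the counts are equal across a user's (user_id, product_type) groups, choose the product type the user visited most recently and assign it to the user. If the visit dates are also equal, break the tie on the refurbished value (yes > no).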
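One way to read this rule is as a three-key descending sort: visit count first, then most recent date, then refurbished. The sketch below illustrates that on a hypothetical summary table; the users "A" and "B", the values, and the column names visits, last_date and refurb_rank are made up for illustration and are not taken from the data above.

```python
import pandas as pd

# Hypothetical per-(user_id, product_type) summary, made up for illustration.
summary = pd.DataFrame({
    "user_id":      ["A", "A", "A", "B", "B"],
    "product_type": ["computer", "desk", "shoes", "glass", "dress"],
    "visits":       [2, 2, 1, 1, 1],
    "last_date":    [20120102, 20120103, 20120105, 20120201, 20120201],
    "refurbished":  ["yes", "yes", "no", "no", "yes"],
})

# Map yes/no to 1/0 so that "yes > no" is explicit rather than alphabetical.
summary["refurb_rank"] = summary["refurbished"].map({"yes": 1, "no": 0})

assigned = (
    summary
    .sort_values(["visits", "last_date", "refurb_rank"], ascending=False)
    .drop_duplicates("user_id")            # keep the top-ranked row per user
    .set_index("user_id")["product_type"]
)
print(assigned)
```

User A has two product types tied on visits, so the later date decides (desk); user B is tied on visits and date, so refurbished decides (dress).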
visit_counts = merged_visits_df.groupby(['user_id','product_type'], as_index=False).agg({'unique_visits_id': 'nunique'})
The code above gives the visit counts; I am trying to work out the rest of the process.
Answer 0 (score: 1)
I think the following does what you want (the column name is misspelled in the data you posted; I kept it that way, i.e. 'unqiue_visit_id').
counts = (
    # sort by bought date, latest first
    merged_visits_df.sort_values('bought_date', ascending=False)
    # group by the desired columns
    .groupby(['user_id', 'product_type'], as_index=False)
    # apply the desired aggregation functions ('first' picks the latest row after the sort)
    .agg({'unqiue_visit_id': 'nunique', 'bought_date': 'first', 'refurbished': 'first'})
)
Then we can get the maximum visit count for each user_id:
max_by_user = counts.groupby('user_id')['unqiue_visit_id'].max()
Finally, we can filter to the items whose visit count equals the user's maximum, sort by the desired columns, and take the first row.
result = (
    # filter to products with max visits by user
    counts[counts['user_id'].apply(max_by_user.get) == counts['unqiue_visit_id']]
    # sort bought_date descending (max on top), refurbished descending (yes above no)
    .sort_values(['bought_date', 'refurbished'], ascending=False)
    # group by user id and select the first row
    .groupby('user_id').nth(0)
)
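As a sanity check, the three steps can be run on the rows for users 001 and 004 from the question. This is a sketch rather than the answer's exact code: it swaps in transform('max') for apply(max_by_user.get) and groupby(...).head(1) for nth(0), and it builds merged_visits_df inline with only the columns these steps need. As far as I know, in pandas 2.0 and later nth(0) keeps the original row index instead of indexing by user_id, so the result is indexed explicitly here.

```python
import pandas as pd

# Rows for users 001 and 004 copied from the question (only the columns used here).
merged_visits_df = pd.DataFrame({
    "user_id": ["001"] * 6 + ["004"] * 5,
    "product_type": ["computer"] * 3 + ["desk"] * 3 + ["computer"] * 3 + ["liquid"] * 2,
    "bought_date": [20120102] * 5 + [20120103] + [20130404] * 5,
    "refurbished": ["yes"] * 11,
    "unqiue_visit_id": [1010, 1011, 1013, 1110, 1910, 410, 2422, 2424, 1422, 22, 2922],
})

# Step 1: unique visits, latest date and its refurbished flag per (user, product type).
counts = (
    merged_visits_df.sort_values("bought_date", ascending=False)
    .groupby(["user_id", "product_type"], as_index=False)
    .agg({"unqiue_visit_id": "nunique", "bought_date": "first", "refurbished": "first"})
)

# Step 2: each user's maximum visit count, aligned row by row with counts.
max_visits = counts.groupby("user_id")["unqiue_visit_id"].transform("max")

# Step 3: keep the top product types, break ties, take one row per user.
result = (
    counts[counts["unqiue_visit_id"] == max_visits]
    .sort_values(["bought_date", "refurbished"], ascending=False)
    .groupby("user_id").head(1)
    .set_index("user_id")["product_type"]
)
print(result)
```

User 001 has three unique visits for both computer and desk, so the later desk purchase (20120103) decides; user 004 has three computer visits against two for liquid.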
It may be more intuitive to think of it this way:
Step 1: Add the columns you want to sort by:
# initial question
visits_df = merged_visits_df.groupby(['user_id','product_type']).agg({'unqiue_visit_id': 'nunique'}).add_suffix('_count')
df_to_sort = merged_visits_df.merge(visits_df.reset_index())
# follow up question
df_to_sort['last_num'] = df_to_sort['store_id'] % 10
Then sort, group, and take the first row:
(
    df_to_sort
    # column names must be quoted strings
    .sort_values(['unqiue_visit_id_count', 'bought_date', 'last_num'], ascending=[False, False, True])
    .groupby(['user_id']).nth(0)
)
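The same "add a column, sort, take the first" idea can also be sketched without the merge, using transform('nunique') to attach the count to every row. This is an alternative formulation, not the answer's code: it uses refurbished as the third sort key (the tie-break from the original question) instead of last_num, and it builds a small frame inline from the rows for users 003 and 005.

```python
import pandas as pd

# Rows for users 003 and 005 copied from the question (only the columns used here).
df = pd.DataFrame({
    "user_id": ["003"] * 6 + ["005"] * 5,
    "product_type": ["shoes"] * 3 + ["liquid"] * 3 + ["laptop"] * 3 + ["dress"] * 2,
    "bought_date": [20130302] * 5 + [20130303] + [20130803] * 5,
    "refurbished": ["no"] * 6 + ["yes"] * 5,
    "unqiue_visit_id": [2121, 2122, 2000, 2021, 2921, 2621, 2342, 2340, 1142, 2942, 2942],
})

# Attach the per-(user, product type) unique visit count to every row.
keys = ["user_id", "product_type"]
df["unqiue_visit_id_count"] = df.groupby(keys)["unqiue_visit_id"].transform("nunique")

assigned = (
    df.sort_values(
        ["unqiue_visit_id_count", "bought_date", "refurbished"],
        ascending=[False, False, False],  # most visits, latest date, yes before no
    )
    .drop_duplicates("user_id")           # first row per user after sorting
    .set_index("user_id")["product_type"]
    .sort_index()
)
print(assigned)
```

For user 003, shoes and liquid both have three unique visits, and the later liquid purchase (20130303) breaks the tie; for user 005, laptop has three unique visits while dress has one, because its two rows share the same visit id.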