The following dataset represents users' browsing behavior.
user_id session_id keyword real_time_stamp presented clicked
0 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 101 101
1 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 102 None
2 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 103 None
3 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 104 None
4 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 105 None
5 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 106 None
6 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 107 None
7 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 108 None
8 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 109 None
9 10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 110 None
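The presented column shows which items were shown to the user, and the clicked column shows which of the presented items were clicked. In the example above, the first item was clicked. The goal is to collapse this information into a single row while keeping the timestamp.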
The groupby below gives the right structure, but without real_time_stamp and clicked. Once real_time_stamp is added to the groupby keys, I no longer get the "aggregated" version.
df_collapse = df.groupby(['user_id', 'session_id', 'keyword'])['presented'].apply(lambda x: '|'.join(x)).reset_index()
What I am trying to get is the following structure:
user_id session_id keyword real_time_stamp presented clicked
10010 s2342009n camera 2020-03-01 05:00:19.195000+00:00 101|102|103|104|105|106|107|108|109|110 101
Answer 0 (score: 2)
Use transform:
s=df.groupby(['user_id', 'session_id', 'keyword'])['presented'].transform(lambda x: '|'.join(x.astype(str)))
df['New']=s
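To make the effect of transform concrete, here is a minimal sketch on a hypothetical three-row cut of the sample data (the shortened frame and the names keys and clicked_rows are assumptions for illustration, and presented is assumed to hold integers). transform keeps one row per original row, so one way to end up with a single row per session is to keep only the rows where clicked is set:

```python
import pandas as pd

# Hypothetical, shortened version of the question's data.
df = pd.DataFrame({
    "user_id": [10010] * 3,
    "session_id": ["s2342009n"] * 3,
    "keyword": ["camera"] * 3,
    "presented": [101, 102, 103],
    "clicked": [101, None, None],
})

keys = ["user_id", "session_id", "keyword"]

# transform broadcasts the joined string back to every row of its group.
df["New"] = df.groupby(keys)["presented"].transform(
    lambda x: "|".join(x.astype(str))
)

# Keep only the rows that recorded a click.
clicked_rows = df[df["clicked"].notna()]
print(clicked_rows)
```

Note that this filter drops sessions with no click at all, which may or may not be what you want.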
If you want to aggregate instead:
df.groupby(['user_id', 'session_id', 'keyword']).\
agg({'presented':'|'.join,
'real_time_stamp':'first',
'clicked':'first'})
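For reference, a self-contained sketch of the aggregation that rebuilds the sample data and produces the single-row layout asked for. The reconstructed frame is an assumption: presented is taken to be integers (hence the astype(str) before joining) and clicked to be missing for every row except the first. It uses named aggregation, which should be available in pandas 0.25 and later.

```python
import pandas as pd

# Hypothetical reconstruction of the sample data from the question.
ts = pd.Timestamp("2020-03-01 05:00:19.195000+00:00")
df = pd.DataFrame({
    "user_id": [10010] * 10,
    "session_id": ["s2342009n"] * 10,
    "keyword": ["camera"] * 10,
    "real_time_stamp": [ts] * 10,
    "presented": list(range(101, 111)),  # assumed to be integers
    "clicked": [101] + [None] * 9,       # only the first item was clicked
})

collapsed = (
    df.groupby(["user_id", "session_id", "keyword"], as_index=False)
      .agg(
          real_time_stamp=("real_time_stamp", "first"),
          # Cast to str so the join also works for numeric columns.
          presented=("presented", lambda x: "|".join(x.astype(str))),
          # 'first' returns the first non-null value in each group.
          clicked=("clicked", "first"),
      )
)
print(collapsed)
```

Because 'first' skips missing values, clicked picks up the clicked item even when the click is not recorded on the first row of the group. If a session can contain more than one click, 'first' keeps only one of them.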