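I'm doing query analysis for a search engine. Within one session, a user may run several different Google searches one after another at different times.
My data has several fields: session_id, log_time, query, feature_i, and so on. I want to group by session_id and concatenate the rows of each group into a single row, ordered by log_time, so that the output represents the user's behaviour as a time series.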
Code:
import pandas as pd

toy_data = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                         'log_time': [4, 5, 6, 1, 2, 3],
                         'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                         'cate_feat_0': ['apple', 'banana'] * 3,
                         'num_feat_0': [1, 2, 3, 4, 5, 6]})
print(toy_data)
Output:
   session_id  log_time    query cate_feat_0  num_feat_0
0           1         4       hi       apple           1
1           2         5     dude      banana           2
2           1         6   pandas       apple           3
3           2         1  groupby      banana           4
4           3         2     sort       apple           5
5           3         3      agg      banana           6
What I want:
## note that all lists are sorted by log_time within each session_id group
session_id  query_list       log_time_list  cate_feat_0_list  num_feat_0_list
1           [hi, pandas]     [4, 6]         [apple, apple]    [1, 3]
2           [groupby, dude]  [1, 5]         [banana, banana]  [4, 2]
3           [sort, agg]      [2, 3]         [apple, banana]   [5, 6]
First, we group and aggregate with:
toy_data_res = toy_data.groupby('session_id').agg({'query': list, 'log_time': list, 'cate_feat_0': list, 'num_feat_0': list})
toy_data_res
This gives:
                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [dude, groupby]   [5, 1]  [banana, banana]     [2, 4]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]
Then we sort within each session with:
import numpy as np

for i in toy_data_res.index:
    sort_index = np.argsort(toy_data_res.loc[i, 'log_time'])  ## get time order within group
    for col in toy_data_res.columns.values:
        toy_data_res.loc[i, col] = [toy_data_res.loc[i, col][j] for j in sort_index]  ## sort values in cols
toy_data_res
This gives:
                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [groupby, dude]   [1, 5]  [banana, banana]     [4, 2]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]
My method is rather slow. Is there a better way to do within-group sorting?
Hint:
In SQL, STRING_AGG or GROUP_CONCAT can be used to do within-group sorting.
Answer 0 (score: 3)
Use DataFrame.sort_values before groupby. If the same function should be applied to every column, you can select them with a list of column names:
df = (toy_data.sort_values(['session_id', 'log_time'])
              .groupby('session_id')[['query', 'log_time', 'cate_feat_0', 'num_feat_0']]
              .agg(list))
print(df)

                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [groupby, dude]   [1, 5]  [banana, banana]     [4, 2]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]
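The expected output in the question uses column names with a _list suffix and a plain session_id column rather than an index. The following is a minimal sketch of one way to get that shape, assuming the toy_data frame from the question. The names cols and res are only illustrative. It relies on groupby keeping the rows of each group in the order they had after sorting.

```python
import pandas as pd

# Toy data copied from the question.
toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})

# Columns to collect into per-session lists (illustrative name).
cols = ['query', 'log_time', 'cate_feat_0', 'num_feat_0']

res = (toy_data.sort_values(['session_id', 'log_time'])  # time order inside each session
       .groupby('session_id')[cols]
       .agg(list)               # one list per column per session
       .add_suffix('_list')     # query -> query_list, and so on
       .reset_index())          # session_id becomes a regular column
print(res)
```

Because the sort happens once on the whole frame, no per-row Python loop is needed afterwards.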
Answer 1 (score: 0)
Try sorting by session_id and log_time before the groupby:
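import pandas as pd

df = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                   'log_time': [4, 5, 6, 1, 2, 3],
                   'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                   'cate_feat_0': ['apple', 'banana'] * 3,
                   'num_feat_0': [1, 2, 3, 4, 5, 6]})
# sort first so that each group's rows are already in time order
df = df.sort_values(by=['session_id', 'log_time'])
# select several columns with a list (double brackets), then collect each into a list
grouped = df.groupby('session_id')[['log_time', 'query', 'cate_feat_0', 'num_feat_0']].agg(list)
print(grouped)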
Output:
           log_time            query       cate_feat_0 num_feat_0
session_id
1            [4, 6]     [hi, pandas]    [apple, apple]     [1, 3]
2            [1, 5]  [groupby, dude]  [banana, banana]     [4, 2]
3            [2, 3]      [sort, agg]   [apple, banana]     [5, 6]
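To gain some confidence that sorting before the groupby really yields time-ordered lists, here is a small sanity-check sketch on synthetic data. The data, the sizes, and the names rng, lookup, is_sorted and aligned are all made up for illustration. It assumes unique log_time values so that each timestamp maps back to exactly one query.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 1000 rows spread over up to 50 sessions.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    'session_id': rng.integers(0, 50, size=n),
    'log_time': rng.permutation(n),            # unique timestamps
    'query': [f'q{i}' for i in range(n)],
})

grouped = (df.sort_values(by=['session_id', 'log_time'])
             .groupby('session_id')[['log_time', 'query']]
             .agg(list))

# Check 1: every per-session log_time list is in ascending order.
is_sorted = grouped['log_time'].apply(lambda ts: ts == sorted(ts))

# Check 2: each query still sits at the position of its own timestamp.
lookup = dict(zip(df['log_time'], df['query']))
aligned = all([lookup[t] for t in ts] == qs
              for ts, qs in zip(grouped['log_time'], grouped['query']))

print(is_sorted.all(), aligned)
```

Both checks passing suggests the lists are ordered by time and the columns stay aligned with each other, which is what the per-row loop in the question was doing by hand.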