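I have a dataset (below) that I want to group by user_id, getting the count of each cluster_label for each user_id. The goal is to find out how many times each user visited each of the clusters they visited.
Essentially, I am looking for a result that returns this information (as a list, a dictionary, or comma-separated values):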
user_id, cluster 54, cluster 109, cluster 191, cluster 204, cluster 260, cluster 263, cluster 264, cluster 278, cluster 290
819000000000000000, 1 1 2 1 3 1 1 1 1
I tried the following code:
data['user_id'] = data.index
result = data.groupby(['user_id','cluster_label']).count()
and
groupby = data.groupby('user_id').filter(lambda x: len(x['user_id'])>=2)
# sort user locations by time
groupsort = groupby.sort_values(by='timestamp')
f = lambda x: [list(x)]
trajs = groupsort.groupby('user_id')['cluster_label'].apply(f).reset_index()
The second code block gets me closer to what I am looking for, but I cannot figure out the counting part:
790068 [[485, 256, 304, 311, 311, 311, 311, 417, 417]]
Data:
Answer 0 (score: 1)
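I think you can use GroupBy.size instead of count and reshape the result with Series.unstack. Pass fill_value=0 if you want missing user/cluster combinations replaced by 0, or omit it to keep them as NaN: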
result = data.groupby(['user_id','cluster_label']).size().unstack(fill_value=0)
print (result)
cluster_label        35   54   77   90   98  109  143  191  204  207  ...  \
user_id                                                               ...
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...

cluster_label       278  290  327  413  432  438  485  521  565  634
user_id
819000000000000000    1    1    0    0    0    0    0    0    0    0
820000000000000000    0    0    0    0    1    0    0    0    0    0
821000000000000000    0    0    1    0    0    1    1    1    1    0
822000000000000000    0    0    0   15    0    0    0    0    0    2

[4 rows x 23 columns]
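
To see why size fits here better than count, a small sketch may help. The DataFrame below is made-up toy data (not the asker's dataset), and the missing timestamp is added on purpose: count reports non-null values per remaining column, while size reports the number of rows in each group.

```python
import pandas as pd

# Hypothetical toy data, not the dataset from the question
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "cluster_label": [54, 54, 109, 54, 191],
    "timestamp": [10, None, 30, 40, 50],  # one missing value on purpose
})

# count(): non-null values per remaining column, one column per input column
counted = data.groupby(["user_id", "cluster_label"]).count()

# size(): number of rows per (user_id, cluster_label) pair, as one Series
sized = data.groupby(["user_id", "cluster_label"]).size()

# Move cluster_label into the columns; absent pairs become 0
wide = sized.unstack(fill_value=0)

print(counted)
print(wide)
```

With count, the pair (1, 54) would show 1 in the timestamp column because one timestamp is missing, whereas size counts both rows.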
result = data.groupby(['user_id','cluster_label']).size().unstack()
print (result)
cluster_label         35    54    77    90    98   109   143   191   204   207  ...  \
user_id                                                                         ...
819000000000000000   NaN   1.0   NaN   NaN   NaN   1.0   NaN   2.0   1.0   NaN  ...
820000000000000000   NaN   NaN   1.0   NaN   2.0   NaN   1.0   NaN   NaN   NaN  ...
821000000000000000   1.0   NaN   NaN   1.0   NaN   NaN   NaN   NaN   NaN   1.0  ...
822000000000000000   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...

cluster_label        278   290   327   413   432   438   485   521   565   634
user_id
819000000000000000   1.0   1.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
820000000000000000   NaN   NaN   NaN   NaN   1.0   NaN   NaN   NaN   NaN   NaN
821000000000000000   NaN   NaN   1.0   NaN   NaN   1.0   1.0   1.0   1.0   NaN
822000000000000000   NaN   NaN   NaN  15.0   NaN   NaN   NaN   NaN   NaN   2.0

[4 rows x 23 columns]
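
The counts print as floats (1.0, 15.0) because NaN cannot be stored in a plain integer column. If the NaN version has already been computed, one way back to integer counts is sketched below, again on made-up toy data rather than the asker's dataset.

```python
import pandas as pd

# Hypothetical toy data: user 2 never visits cluster 109
data = pd.DataFrame({"user_id": [1, 1, 2], "cluster_label": [54, 109, 54]})

# Without fill_value, the missing (2, 109) pair becomes NaN and the
# affected column is upcast to float64
sparse = data.groupby(["user_id", "cluster_label"]).size().unstack()

# Replace NaN with 0 and cast back to integers
dense = sparse.fillna(0).astype(int)

print(sparse)
print(dense)
```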
Or use crosstab:
result = pd.crosstab(data['user_id'],data['cluster_label'])
print (result)
cluster_label        35   54   77   90   98  109  143  191  204  207  ...  \
user_id                                                               ...
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...

cluster_label       278  290  327  413  432  438  485  521  565  634
user_id
819000000000000000    1    1    0    0    0    0    0    0    0    0
820000000000000000    0    0    0    0    1    0    0    0    0    0
821000000000000000    0    0    1    0    0    1    1    1    1    0
822000000000000000    0    0    0   15    0    0    0    0    0    2

[4 rows x 23 columns]
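
Since the question asks for a list, dictionary, or comma-separated output, here is a rough sketch of how the wide table could be converted. The toy data and the names by_size, by_crosstab, visits, and csv_text are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Hypothetical toy data, not the dataset from the question
data = pd.DataFrame({
    "user_id": [819, 819, 819, 820],
    "cluster_label": [54, 191, 191, 77],
})

by_size = data.groupby(["user_id", "cluster_label"]).size().unstack(fill_value=0)
by_crosstab = pd.crosstab(data["user_id"], data["cluster_label"])

# Both approaches should give the same table of counts
same = bool((by_size.to_numpy() == by_crosstab.to_numpy()).all())

# Nested dict that keeps only the clusters each user actually visited
visits = {
    user: {cluster: int(n) for cluster, n in row.items() if n > 0}
    for user, row in by_crosstab.iterrows()
}

# Comma-separated text with one row per user
csv_text = by_crosstab.to_csv()

print(same)
print(visits)
print(csv_text)
```

The dictionary form drops the zero entries, which matches the stated goal of listing only the clusters a user has visited.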