Python: Group by one column and get counts from another column

Time: 2019-12-05 06:55:02

Tags: python pandas pandas-groupby

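I have a dataset (below) that I would like to group by user_id, and for each user_id get a count of each cluster_label. The goal is to find out how many times each user visited each of the clusters they visited.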

Essentially, I'm looking for a result like this (it could be a list, a dictionary, or comma-separated):

user_id,          cluster 54, cluster 109, cluster 191, cluster 204, cluster 260, cluster 263, cluster 264, cluster 278, cluster 290
819000000000000000, 1        1             2             1           3             1           1           1              1

I tried the following code:

data['user_id'] = data.index
result = data.groupby(['user_id','cluster_label']).count()

This second code block gets me closer to what I'm looking for, but I can't figure out the counting part:

groupby = data.groupby('user_id').filter(lambda x: len(x['user_id'])>=2)

#sort user locations by time
groupsort = groupby.sort_values(by='timestamp')
f = lambda x: [list(x)]
trajs = groupsort.groupby('user_id')['cluster_label'].apply(f).reset_index()

Data:

790068    [[485, 256, 304, 311, 311, 311, 311, 417, 417]]

1 Answer:

Answer 0 (score: 1)

I think you can use GroupBy.size as an alternative to count, then reshape with Series.unstack, optionally replacing the missing values:

result = data.groupby(['user_id','cluster_label']).size().unstack(fill_value=0)
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]

Without fill_value, combinations that never occur stay as NaN, so the counts are shown as floats:

result = data.groupby(['user_id','cluster_label']).size().unstack()
print (result)

cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000  NaN  1.0  NaN  NaN  NaN  1.0  NaN  2.0  1.0  NaN  ...   
820000000000000000  NaN  NaN  1.0  NaN  2.0  NaN  1.0  NaN  NaN  NaN  ...   
821000000000000000  1.0  NaN  NaN  1.0  NaN  NaN  NaN  NaN  NaN  1.0  ...   
822000000000000000  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   

cluster_label       278  290  327   413  432  438  485  521  565  634  
user_id                                                                
819000000000000000  1.0  1.0  NaN   NaN  NaN  NaN  NaN  NaN  NaN  NaN  
820000000000000000  NaN  NaN  NaN   NaN  1.0  NaN  NaN  NaN  NaN  NaN  
821000000000000000  NaN  NaN  1.0   NaN  NaN  1.0  1.0  1.0  1.0  NaN  
822000000000000000  NaN  NaN  NaN  15.0  NaN  NaN  NaN  NaN  NaN  2.0  

[4 rows x 23 columns]
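
The two variants above can be tried on a small frame. This is only a minimal sketch: the sample values are made up for illustration, and only the column names user_id and cluster_label come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "cluster_label": [54, 191, 191, 77, 98, 54],
})

# size() counts the rows in each (user_id, cluster_label) group.
sizes = data.groupby(["user_id", "cluster_label"]).size()

# unstack() moves cluster_label into the columns.
# With fill_value=0, pairs that never occur become 0.
filled = sizes.unstack(fill_value=0)

# Without fill_value, pairs that never occur become NaN.
sparse = sizes.unstack()

print(filled)
print(sparse)
```

Here user 1 visited cluster 191 twice, so that cell holds 2 in both results; the two tables differ only in how the unseen pairs are represented.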

Or use crosstab:

result = pd.crosstab(data['user_id'],data['cluster_label'])
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]
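
Since the question mentions that a dictionary or comma-separated output would also be acceptable, here is one possible way to convert the count table. It is a sketch under assumptions: the sample data is invented, and the names by_size, by_crosstab, per_user and csv_text are hypothetical. It also checks that the groupby and crosstab approaches give the same counts on this sample.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "cluster_label": [54, 191, 191, 77, 98, 54],
})

by_size = data.groupby(["user_id", "cluster_label"]).size().unstack(fill_value=0)
by_crosstab = pd.crosstab(data["user_id"], data["cluster_label"])

# Both approaches should produce the same counts for this sample.
same = bool((by_size.to_numpy() == by_crosstab.to_numpy()).all())

# Dictionary form: for each user, keep only the clusters actually visited.
per_user = {user: row[row > 0].to_dict() for user, row in by_size.iterrows()}

# Comma-separated form: to_csv() with no path returns the text as a string.
csv_text = by_size.to_csv()

print(same)
print(per_user)
print(csv_text)
```

Filtering with row > 0 drops the clusters a user never visited, which matches the stated goal of counting visits only for the clusters each user actually went to.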