pandas: difference between groupby().size() and unique()

Asked: 2015-12-07 13:14:09

Tags: python pandas unique

The goal here is to find out how many unique appid values my data contains. This is the code I wrote:

import pandas as pd

# Read only the appid column from the tab-separated file
apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])

apps['appid'] = apps['appid'].astype(int)
apps_list = apps['appid'].unique()

b = apps.groupby('appid').size()
blist = b.unique()

print(len(apps_list), len(blist), len(set(b)))
>>>7672 2164 2164

Why do these two approaches give different results?

As requested, here is a sample of the data:

Unnamed: 0  StudID          No  appid work work2
0   0   76561193665298433   0   10  nan 0
1   1   76561193665298433   1   20  nan 0
2   2   76561193665298433   2   30  nan 0
3   3   76561193665298433   3   40  nan 0
4   4   76561193665298433   4   50  nan 0
5   5   76561193665298433   5   60  nan 0
6   6   76561193665298433   6   70  nan 0
7   7   76561193665298433   7   80  nan 0
8   8   76561193665298433   8   100 nan 0
9   9   76561193665298433   9   130 nan 0
10  10  76561193665298433   10  220 nan 0
11  11  76561193665298433   11  240 nan 0
12  12  76561193665298433   12  280 nan 0
13  13  76561193665298433   13  300 nan 0
14  14  76561193665298433   14  320 nan 0
15  15  76561193665298433   15  340 nan 0
16  16  76561193665298433   16  360 nan 0
17  17  76561193665298433   17  380 nan 0
18  18  76561193665298433   18  400 nan 0
19  19  76561193665298433   19  420 nan 0
20  20  76561193665298433   20  500 nan 0
21  21  76561193665298433   21  550 nan 0
22  22  76561193665298433   22  620 6.0 3064
33  33  76561193665298434   0   10  nan 837
34  34  76561193665298434   1   20  nan 27
35  35  76561193665298434   2   30  nan 9
36  36  76561193665298434   3   40  nan 5
37  37  76561193665298434   4   50  nan 2
38  38  76561193665298434   5   60  nan 0
39  39  76561193665298434   6   70  nan 403
40  40  76561193665298434   7   130 nan 0
41  41  76561193665298434   8   80  nan 6
42  42  76561193665298434   9   100 nan 10
43  43  76561193665298434   10  220 nan 14

1 Answer:

Answer 0 (score: 1)

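IIUC, based on the dataframe you posted, it looks like you should be analyzing b.index rather than the values of b. groupby('appid').size() returns a Series whose index holds the distinct appid values and whose values are the number of rows in each group, so b.unique() and set(b) count the distinct group sizes, not the distinct appids. Take a look: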

b = apps.groupby('appid').size()

In [24]: b  
Out[24]:    
appid       
10     2    
20     2    
30     2    
40     2    
50     2    
60     2    
70     2    
80     2    
100    2    
130    2    
220    2    
240    1    
280    1    
300    1    
320    1    
340    1    
360    1    
380    1    
400    1    
420    1    
500    1    
550    1    
620    1    
dtype: int64

In [25]: set(b)
Out[25]: {1, 2}

But if you do the same for b.index, all three methods give the same value:

blist = b.index.unique()

In [30]: len(apps_list), len(blist), len(set(b.index))
Out[30]: (23, 23, 23)
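
The same point can be reproduced without the CSV file. The sketch below uses a small made-up frame (the appid values are hypothetical, not taken from the question's data) and relies on nunique() as a shortcut for len(x.unique()):

```python
import pandas as pd

# Hypothetical miniature of the question's data: six rows, four distinct appids.
apps = pd.DataFrame({"appid": [10, 20, 30, 10, 20, 40]})

# Index = distinct appid values, values = number of rows per appid.
b = apps.groupby("appid").size()

n_distinct_ids = apps["appid"].nunique()  # distinct appids in the column
n_groups = len(b.index)                   # one group per distinct appid
n_distinct_counts = b.nunique()           # distinct group sizes (here: 1 and 2)

print(n_distinct_ids, n_groups, n_distinct_counts)
# prints: 4 4 2
```

The first two numbers agree because both count appids. The third is smaller because many appids share the same group size, which is most likely why the question reports 2164 instead of 7672.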