Group by and count the number of values in each subcategory

Time: 2017-12-07 11:25:08

Tags: python pandas

Suppose there is a DataFrame like the following:

    V290    V311
0   GOOD    TOP QUARTER
1   NK-UNASCERTAIN  MIDDLE HALF
2   AVERAGE TOP QUARTER
3   POOR    NK-UNASCERTAIN
4   POOR    MIDDLE HALF
5   GOOD    MIDDLE HALF
6   POOR    TOP QUARTER
7   AVERAGE MIDDLE HALF
8   POOR    MIDDLE HALF
9   AVERAGE MIDDLE HALF
10  POOR    MIDDLE HALF
11  POOR    MIDDLE HALF
12  AVERAGE MIDDLE HALF
13  AVERAGE TOP QUARTER
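
To make the rest of the thread reproducible, the table above can be rebuilt as a DataFrame. This is only a sketch: the variable name `df` is an assumption (it matches what the answers use), and the values are typed in by hand from the 14 rows shown.

```python
import pandas as pd

# The 14 rows from the table above, in their original order.
df = pd.DataFrame({
    "V290": ["GOOD", "NK-UNASCERTAIN", "AVERAGE", "POOR", "POOR", "GOOD", "POOR",
             "AVERAGE", "POOR", "AVERAGE", "POOR", "POOR", "AVERAGE", "AVERAGE"],
    "V311": ["TOP QUARTER", "MIDDLE HALF", "TOP QUARTER", "NK-UNASCERTAIN",
             "MIDDLE HALF", "MIDDLE HALF", "TOP QUARTER", "MIDDLE HALF",
             "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF",
             "MIDDLE HALF", "TOP QUARTER"],
})

print(df.shape)  # (14, 2)
```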

I want to group this data by ['V311'] and see how many GOOD or POOR values fall into each ['V311'] subcategory. I would like something like this:

  Top Quarter:GOOD:12
              POOR:30
              Average:15
  Middle half:GOOD:5
              POOR:19
              Average:3

and so on...

3 answers:

Answer 0 (score: 3)

You can pivot with pivot_table and then unstack, i.e.

df.pivot_table(index='V290',columns='V311',aggfunc='size',fill_value=0).unstack()

V311            V290          
MIDDLE HALF     AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
                POOR              4
NK-UNASCERTAIN  AVERAGE           0
                GOOD              0
                NK-UNASCERTAIN    0
                POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                NK-UNASCERTAIN    0
                POOR              1
dtype: int64

Alternatively:

df.groupby(['V290','V311']).size().unstack(fill_value=0).unstack()

If you want percentages, divide each column by its sum, i.e.

ndf = df.pivot_table(index='V290',columns='V311',aggfunc='size',fill_value=0)
percents = (ndf/ndf.sum()*100).unstack()

V311            V290          
MIDDLE HALF     AVERAGE            33.333333
                GOOD               11.111111
                NK-UNASCERTAIN     11.111111
                POOR               44.444444
NK-UNASCERTAIN  AVERAGE             0.000000
                GOOD                0.000000
                NK-UNASCERTAIN      0.000000
                POOR              100.000000
TOP QUARTER     AVERAGE            50.000000
                GOOD               25.000000
                NK-UNASCERTAIN      0.000000
                POOR               25.000000
dtype: float64
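
As a cross-check on the counts and column percentages, `pd.crosstab` offers another route to the same table. This is a swapped-in alternative, not the method this answer uses. The sketch below assumes the question's data is held in a variable named `df` and rebuilds it by hand so the block runs on its own.

```python
import pandas as pd

# Hand-typed copy of the question's 14 rows (assumed variable name: df).
df = pd.DataFrame({
    "V290": ["GOOD", "NK-UNASCERTAIN", "AVERAGE", "POOR", "POOR", "GOOD", "POOR",
             "AVERAGE", "POOR", "AVERAGE", "POOR", "POOR", "AVERAGE", "AVERAGE"],
    "V311": ["TOP QUARTER", "MIDDLE HALF", "TOP QUARTER", "NK-UNASCERTAIN",
             "MIDDLE HALF", "MIDDLE HALF", "TOP QUARTER", "MIDDLE HALF",
             "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF",
             "MIDDLE HALF", "TOP QUARTER"],
})

# Raw counts: rows are V290 values, columns are V311 values, missing pairs are 0.
counts = pd.crosstab(df["V290"], df["V311"])

# normalize="columns" divides each column by its own total, so every
# V311 column sums to 1; multiply by 100 for percentages.
percents = pd.crosstab(df["V290"], df["V311"], normalize="columns") * 100

print(counts)
print(percents.round(2))
```

Calling `.unstack()` on either result should give the same long, two-level layout shown above.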

Answer 1 (score: 2)

Use a dict comprehension together with groupby and value_counts, converting each result to a dict:

d = {k:v.value_counts().to_dict() for k,v in df.groupby('V311')['V290']}
print (d)
{'NK-UNASCERTAIN': {'POOR': 1}, 
'MIDDLE HALF': {'POOR': 4, 'NK-UNASCERTAIN': 1, 'AVERAGE': 3, 'GOOD': 1}, 
'TOP QUARTER': {'POOR': 1, 'AVERAGE': 2, 'GOOD': 1}}
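
The nested dict can also be printed in roughly the layout the question asks for. The following is a minimal sketch, assuming the question's data is in `df` (rebuilt here by hand); the `lines` list and the exact spacing are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hand-typed copy of the question's 14 rows (assumed variable name: df).
df = pd.DataFrame({
    "V290": ["GOOD", "NK-UNASCERTAIN", "AVERAGE", "POOR", "POOR", "GOOD", "POOR",
             "AVERAGE", "POOR", "AVERAGE", "POOR", "POOR", "AVERAGE", "AVERAGE"],
    "V311": ["TOP QUARTER", "MIDDLE HALF", "TOP QUARTER", "NK-UNASCERTAIN",
             "MIDDLE HALF", "MIDDLE HALF", "TOP QUARTER", "MIDDLE HALF",
             "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF", "MIDDLE HALF",
             "MIDDLE HALF", "TOP QUARTER"],
})

# Same comprehension as in the answer: one inner dict of counts per V311 group.
d = {k: v.value_counts().to_dict() for k, v in df.groupby("V311")["V290"]}

# Build "GROUP:LABEL:COUNT" lines; only the first line of each group
# shows the group name, the rest are indented to line up under it.
lines = []
for group, counts in d.items():
    for i, (label, n) in enumerate(counts.items()):
        prefix = f"{group}:" if i == 0 else " " * (len(group) + 1)
        lines.append(f"{prefix}{label}:{n}")

print("\n".join(lines))
```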

For output as a Series:

s = df.groupby('V311')['V290'].value_counts()
print (s)
V311            V290          
MIDDLE HALF     POOR              4
                AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
NK-UNASCERTAIN  POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                POOR              1
Name: V290, dtype: int64

EDIT: If you need relative frequencies:

s = df.groupby('V311')['V290'].value_counts(normalize=True)
print (s)
V311            V290          
MIDDLE HALF     POOR              0.444444
                AVERAGE           0.333333
                GOOD              0.111111
                NK-UNASCERTAIN    0.111111
NK-UNASCERTAIN  POOR              1.000000
TOP QUARTER     AVERAGE           0.500000
                GOOD              0.250000
                POOR              0.250000
Name: V290, dtype: float64

EDIT1:

If you also want all the missing categories:

s = df.groupby('V311')['V290'].value_counts()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
MIDDLE HALF     AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
                POOR              4
NK-UNASCERTAIN  AVERAGE           0
                GOOD              0
                NK-UNASCERTAIN    0
                POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                NK-UNASCERTAIN    0
                POOR              1
Name: V290, dtype: int64

Answer 2 (score: 2)

Using only pandas:

import pandas as pd
dataframe = pd.DataFrame()
dataframe['V311'] = ['MIDDLE','TOP','MIDDLE','TOP','MIDDLE','TOP','TOP']
print(dataframe['V311'].value_counts())

Output:

TOP       4
MIDDLE    3
Name: V311, dtype: int64
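
The example above counts a single column only. To break the counts down by a second column, as the question asks, one possible extension is `DataFrame.value_counts` with both column names. This is a sketch under two assumptions: pandas 1.1 or later is installed (where `DataFrame.value_counts` is available), and the `V290` values below are made up for illustration because this answer's toy data does not include that column.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "V311": ["MIDDLE", "TOP", "MIDDLE", "TOP", "MIDDLE", "TOP", "TOP"],
    # Hypothetical second column, invented for this illustration only.
    "V290": ["GOOD", "POOR", "POOR", "GOOD", "POOR", "POOR", "GOOD"],
})

# Count each (V311, V290) pair; sort_index groups the rows by V311
# instead of ordering them by descending count.
pair_counts = dataframe.value_counts(["V311", "V290"]).sort_index()
print(pair_counts)
```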