Calculating the max row after applying groupby()

Time: 2018-05-25 13:05:56

Tags: python python-3.x pandas dataframe

I have a DataFrame as follows:

FileName      PageNo    LineNo  Name     Class_par_ratio
17973375 - 1  TM000010  82      POWDERS  MILK
17973375 - 1  TM000015  49      milk     MILK
17973375 - 1  TM000015  49      Dairy    OTHER FOODS
17973375 - 1  TM000016  11      Fat      ANIMAL AND VEGETABLE OIL
17973375 - 1  TM000006  79      POWDER   MILK
17973375 - 1  TM000016  9       Milk     MILK

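I want to group the output by FileName and Class_par_ratio. I also want to find the frequency of each Class_par_ratio and put it in a column named Frequency, and then put the maximum frequency in another column named 'Max Freq.'.

The output should look something like this: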

FileName      Class_par_ratio           Frequency    Max_Class     Max Freq.
17743633 - 1  OTHER FOODS               2            OTHER GOODS    4
              OTHER GOODS               4                  
17743634 - 1  MEAT                      12           MEAT           12
17743634 - 2  MEAT                       1           MEAT            1
17743635 - 1  MEAT                      83           MEAT           83
              OTHER GOODS               2      
17743642 - 1  MEAT                      43           MEAT           43
              OTHER GOODS               2                  
17743739 - 1  OTHER GOODS               3            OTHER GOODS     3

The code snippets I have tried so far are:

1) df.groupby(['FileName'])['Class_par_ratio'].value_counts()

The output I get here is:

FileName      Class_par_ratio
17743633 - 1  OTHER GOODS         8
17743634 - 1  MEAT AND LIVESTOCK 15
17743634 - 2  PETROLEUM           1
17743635 - 1  MEAT AND LIVESTOCK 87

The other one is:

2) coll_g = coll.groupby(['FileName', 'Class_par_ratio']).size().groupby(              
['FileName', 'Class_par_ratio']).agg({'count': max})
coll_g = coll_g['count'].groupby(level=0, group_keys=False)
coll_g = coll_g.nlargest(1)
coll_g

Here I find the class that occurs most often, but I don't get the maximum frequency number. The output I get is:

17743754 - 1  MEAT & LIVESTOCK            1
17743759 - 1  ANIMAL AND VEGETABLE OIL    1
17743970 - 1  IRON ORE                    1
17743996 - 1  OTHER GOODS                 1

I am using pandas 0.20 and Python 3.6.3.

Could you tell me where I am going wrong and what my code should be?

2 Answers:

Answer 0: (score: 1)

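Use idxmax after agg. The set_index beforehand makes idxmax return the class with the maximum count; idxmax and max go into a new DataFrame, which is then joined back to the original DataFrame (a variant using two transform calls follows further down):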

df = df.groupby(['FileName'])['Class_par_ratio'].value_counts().reset_index(name='Freq')

df1 = df.set_index('Class_par_ratio').groupby(['FileName'])['Freq'].agg(['idxmax','max'])

d = {'idxmax':'Max_Class','max':'Max Freq.'}
df = df.join(df1, on='FileName').rename(columns=d)
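The agg and join steps above can be tried end to end on the sample rows from the question. This is only a minimal sketch: the names counts, top and result are introduced here for illustration, and only the two columns the solution needs are rebuilt.

```python
import pandas as pd

# Sample rows from the question (only the columns the solution uses).
df = pd.DataFrame({
    "FileName": ["17973375 - 1"] * 6,
    "Class_par_ratio": ["MILK", "MILK", "OTHER FOODS",
                        "ANIMAL AND VEGETABLE OIL", "MILK", "MILK"],
})

# Step 1: count each class within each file.
counts = (
    df.groupby("FileName")["Class_par_ratio"]
      .value_counts()
      .reset_index(name="Freq")
)

# Step 2: per file, take the class label with the largest count (idxmax)
# and the count itself (max). The class must be the index for idxmax
# to return a label instead of a row number.
top = (
    counts.set_index("Class_par_ratio")
          .groupby("FileName")["Freq"]
          .agg(["idxmax", "max"])
          .rename(columns={"idxmax": "Max_Class", "max": "Max Freq."})
)

# Step 3: attach the per-file result to every row of that file.
result = counts.join(top, on="FileName")
print(result)
```

On these six rows every line should carry Max_Class MILK and Max Freq. 4, since MILK occurs four times and the other two classes once each.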

Or use a double transform (output shown for the sample data from the question):

df = df.groupby(['FileName'])['Class_par_ratio'].value_counts().reset_index(name='Freq')

g = df.set_index('Class_par_ratio').groupby(['FileName'])['Freq']
df['Max_Class'] = g.transform('idxmax').values
df['Max Freq.'] = g.transform('max').values
print (df)
       FileName           Class_par_ratio  Freq Max_Class  Max Freq.
0  17973375 - 1                      MILK     4      MILK          4
1  17973375 - 1  ANIMAL AND VEGETABLE OIL     1      MILK          4
2  17973375 - 1               OTHER FOODS     1      MILK          4

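If the repeated values need to be removed, mask them using duplicated. Verifying the agg and join solution on the second sample data (the expected output from the question):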

df1 = df.set_index('Class_par_ratio').groupby(['FileName'])['Frequency'].agg(['idxmax','max'])
d = {'idxmax':'Max_Class','max':'Max Freq.'}
df = df.join(df1, on='FileName').rename(columns=d)
print (df)
       FileName Class_par_ratio  Frequency    Max_Class  Max Freq.
0  17743633 - 1      OTHE FOODS          2  OTHER GOODS          4
1  17743633 - 1     OTHER GOODS          4  OTHER GOODS          4
2  17743634 - 1            MEAT         12         MEAT         12
3  17743634 - 2            MEAT          1         MEAT          1
4  17743635 - 1            MEAT         83         MEAT         83
5  17743635 - 1     OTHER GOODS          2         MEAT         83
6  17743642 - 1            MEAT         43         MEAT         43
7  17743642 - 1     OTHER GOODS          2         MEAT         43
8  17743739 - 1     OTHER GOODS          3  OTHER GOODS          3
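The masking step with duplicated is mentioned in this answer but its code is not shown. A minimal sketch of one way to do it follows, assuming the goal is to blank the per-file values on every row after the first, as in the question's expected output. The three-row frame and the names repeated and cols are hypothetical.

```python
import pandas as pd

# Hypothetical joined result, shaped like the output printed above.
df = pd.DataFrame({
    "FileName": ["17743633 - 1", "17743633 - 1", "17743634 - 1"],
    "Class_par_ratio": ["OTHER FOODS", "OTHER GOODS", "MEAT"],
    "Frequency": [2, 4, 12],
    "Max_Class": ["OTHER GOODS", "OTHER GOODS", "MEAT"],
    "Max Freq.": [4, 4, 12],
})

# True for every row after the first one of each FileName.
repeated = df["FileName"].duplicated()

# Blank the repeated per-file values so each shows once per file.
# Casting to object first lets numeric columns hold the empty string.
cols = ["FileName", "Max_Class", "Max Freq."]
df[cols] = df[cols].astype(object).mask(repeated, "")
print(df)
```

This is for display only: after masking, the affected columns hold mixed strings and numbers, so any further arithmetic should happen before this step.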

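Answer 1: (score: 0)

Unfortunately, the requirements changed, and the columns I now want are FileName, Class_par_ratio and Confidence. Confidence here is simply ((freq / max_freq) * 100). If the value is greater than 80, output HIGH. If the value is between 60 and 80, output MEDIUM; otherwise output LOW. The output looks like: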

FileName        COMM_CODE           Confidence
17743633 - 1    OTHER GOODS         MEDIUM
17743634 - 1    MEAT & LIVESTOCK    HIGH
17743634 - 2    MEAT & LIVESTOCK    HIGH
17743635 - 1    MEAT & LIVESTOCK    HIGH

Below is the code I wrote to produce this output:

mf = (df.groupby(['FileName'])['COMM_CODE'].value_counts().reset_index(name='Freq'))
we = mf.groupby(['FileName'])['Freq'].apply(lambda grp: grp.nlargest(6).sum()).reset_index(name='Tot')
mf = mf.groupby(['FileName']).first().reset_index()

mf['Confidence_%'] = (mf['Freq']/we['Tot'])*100

mf['Confidence'] = ['HIGH' if x >= 80.0 else 'MEDIUM' if x>=60.0 else 'LOW' for x in mf['Confidence_%']]
mf.drop(['Freq','Confidence_%'],axis=1,inplace=True)
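The same labelling can also be sketched without relying on two separately grouped frames lining up by index. This is only an illustration under stated assumptions: the counts are made up, the denominator is the sum of all counts per file (the code above sums only the six largest), and the names tot, top and pct are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-file class counts, shaped like mf after value_counts.
mf = pd.DataFrame({
    "FileName": ["A - 1", "A - 1", "B - 1", "B - 1", "C - 1", "C - 1"],
    "COMM_CODE": ["MEAT", "OTHER GOODS", "MILK", "MEAT", "IRON ORE", "MILK"],
    "Freq": [9, 1, 7, 3, 5, 5],
})

# Total count per file, broadcast back onto each row of that file.
tot = mf.groupby("FileName")["Freq"].transform("sum")
mf["Confidence_%"] = mf["Freq"] / tot * 100

# Keep the most frequent class per file (first one wins on ties).
top = mf.loc[mf.groupby("FileName")["Freq"].idxmax()].reset_index(drop=True)

# Same thresholds as the code above: >= 80 HIGH, >= 60 MEDIUM, else LOW.
pct = top["Confidence_%"]
top["Confidence"] = np.select(
    [pct >= 80, pct >= 60], ["HIGH", "MEDIUM"], default="LOW"
)
print(top[["FileName", "COMM_CODE", "Confidence"]])
```

Here the three files come out as HIGH (9 of 10), MEDIUM (7 of 10) and LOW (5 of 10). np.select checks the conditions in order, so the 80 threshold takes precedence over the 60 threshold.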