I have a DataFrame as follows:
FileName PageNo LineNo Name Class_par_ratio
17973375 - 1 TM000010 82 POWDERS MILK
17973375 - 1 TM000015 49 milk MILK
17973375 - 1 TM000015 49 Dairy OTHER FOODS
17973375 - 1 TM000016 11 Fat ANIMAL AND VEGETABLE OIL
17973375 - 1 TM000006 79 POWDER MILK
17973375 - 1 TM000016 9 Milk MILK
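I want to group the output by FileName and Class_par_ratio. I also want to find the frequency of each Class_par_ratio and put it in a column named Frequency, and then put the maximum frequency in another column named 'Max Freq.'.
The output should look something like this: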
FileName Class_par_ratio Frequency Max_Class Max Freq.
17743633 - 1 OTHER FOODS 2 OTHER GOODS 4
OTHER GOODS 4
17743634 - 1 MEAT 12 MEAT 12
17743634 - 2 MEAT 1 MEAT 1
17743635 - 1 MEAT 83 MEAT 83
OTHER GOODS 2
17743642 - 1 MEAT 43 MEAT 43
OTHER GOODS 2
17743739 - 1 OTHER GOODS 3 OTHER GOODS 3
The code snippets I have tried so far are:
1) df.groupby(['FileName'])['Class_par_ratio'].value_counts()
The output I get here is:
FileName Class_par_ratio
17743633 - 1 OTHER GOODS 8
17743634 - 1 MEAT AND LIVESTOCK 15
17743634 - 2 PETROLEUM 1
17743635 - 1 MEAT AND LIVESTOCK 87
The other one is:
2) coll_g = coll.groupby(['FileName', 'Class_par_ratio']).size().groupby(
['FileName', 'Class_par_ratio']).agg({'count': max})
coll_g = coll_g['count'].groupby(level=0, group_keys=False)
coll_g = coll_g.nlargest(1)
coll_g
Here I get the class with the most occurrences, but not the maximum frequency itself. The output I get is:
17743754 - 1 MEAT & LIVESTOCK 1
17743759 - 1 ANIMAL AND VEGETABLE OIL 1
17743970 - 1 IRON ORE 1
17743996 - 1 OTHER GOODS 1
I am using pandas 0.20 and Python 3.6.3.
Could you tell me where I am going wrong and what my code should be?
Answer 0 (score: 1)
Use agg with idxmax and max on a new DataFrame. Calling set_index('Class_par_ratio') first makes idxmax return the class that has the maximum frequency. Then join the result back to the original DataFrame:
df = df.groupby(['FileName'])['Class_par_ratio'].value_counts().reset_index(name='Freq')
df1 = df.set_index('Class_par_ratio').groupby(['FileName'])['Freq'].agg(['idxmax','max'])
d = {'idxmax':'Max_Class','max':'Max Freq.'}
df = df.join(df1, on='FileName').rename(columns=d)
Or use transform twice:
df = df.groupby(['FileName'])['Class_par_ratio'].value_counts().reset_index(name='Freq')
g = df.set_index('Class_par_ratio').groupby(['FileName'])['Freq']
df['Max_Class'] = g.transform('idxmax').values
df['Max Freq.'] = g.transform('max').values
print (df)
FileName Class_par_ratio Freq Max_Class Max Freq.
0 17973375 - 1 MILK 4 MILK 4
1 17973375 - 1 ANIMAL AND VEGETABLE OIL 1 MILK 4
2 17973375 - 1 OTHER FOODS 1 MILK 4
If the duplicated values need to be removed, add mask with duplicated.
Verifying the solution on the second sample data:
df1 = df.set_index('Class_par_ratio').groupby(['FileName'])['Frequency'].agg(['idxmax','max'])
d = {'idxmax':'Max_Class','max':'Max Freq.'}
df = df.join(df1, on='FileName').rename(columns=d)
print (df)
FileName Class_par_ratio Frequency Max_Class Max Freq.
0 17743633 - 1 OTHE FOODS 2 OTHER GOODS 4
1 17743633 - 1 OTHER GOODS 4 OTHER GOODS 4
2 17743634 - 1 MEAT 12 MEAT 12
3 17743634 - 2 MEAT 1 MEAT 1
4 17743635 - 1 MEAT 83 MEAT 83
5 17743635 - 1 OTHER GOODS 2 MEAT 83
6 17743642 - 1 MEAT 43 MEAT 43
7 17743642 - 1 OTHER GOODS 2 MEAT 43
8 17743739 - 1 OTHER GOODS 3 OTHER GOODS 3
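The answer mentions removing duplicated values with mask and duplicated, but the scraped text carries no code for that step. The following is a minimal sketch of one way it could be done, not the answerer's own code. It assumes the goal is the layout in the question's expected output, where FileName, Max_Class and Max Freq. are left blank on every row after the first of each file. The shortened sample frame and the names out, repeated and cols are illustrative.

```python
import pandas as pd

# A few rows of the second sample, with counts already in a Frequency column.
df = pd.DataFrame({
    'FileName': ['17743633 - 1', '17743633 - 1', '17743634 - 1',
                 '17743635 - 1', '17743635 - 1'],
    'Class_par_ratio': ['OTHER FOODS', 'OTHER GOODS', 'MEAT', 'MEAT', 'OTHER GOODS'],
    'Frequency': [2, 4, 12, 83, 2],
})

# Per file: the class with the highest count (idxmax) and that count (max).
df1 = (df.set_index('Class_par_ratio')
         .groupby('FileName')['Frequency']
         .agg(['idxmax', 'max']))
out = df.join(df1, on='FileName').rename(
    columns={'idxmax': 'Max_Class', 'max': 'Max Freq.'})

# True for every row of a file except the first one.
repeated = out.duplicated(subset='FileName')

# Blank the repeated per-file values; cast to object so '' can sit next to ints.
cols = ['FileName', 'Max_Class', 'Max Freq.']  # assumed set of columns to blank
for col in cols:
    out[col] = out[col].astype(object).mask(repeated, '')

print(out)
```

Blanking is only sensible as a last display step, since the blanked columns can no longer be used for grouping or arithmetic.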
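Answer 1 (score: 0)
Unfortunately the requirements changed, and the columns I now want are FileName, Class_par_ratio and Confidence. Confidence here is simply ((freq / max_freq) * 100). If the value is greater than 80, output HIGH. If the value is between 60 and 80, output MEDIUM; otherwise output LOW. The output looks like this: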
FileName COMM_CODE Confidence
17743633 - 1 OTHER GOODS MEDIUM
17743634 - 1 MEAT & LIVESTOCK HIGH
17743634 - 2 MEAT & LIVESTOCK HIGH
17743635 - 1 MEAT & LIVESTOCK HIGH
Here is the code I wrote to get this output:
mf = (df.groupby(['FileName'])['COMM_CODE'].value_counts().reset_index(name='Freq'))
we = mf.groupby(['FileName'])['Freq'].apply(lambda grp: grp.nlargest(6).sum()).reset_index(name='Tot')
mf = mf.groupby(['FileName']).first().reset_index()
mf['Confidence_%'] = (mf['Freq']/we['Tot'])*100
mf['Confidence'] = ['HIGH' if x >= 80.0 else 'MEDIUM' if x>=60.0 else 'LOW' for x in mf['Confidence_%']]
mf.drop(['Freq','Confidence_%'],axis=1,inplace=True)
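The same calculation can be sketched on a small, made-up frame to check the thresholds. This is an illustration under assumptions, not a replacement for the code above: the file names A, B and C and their rows are invented, np.select stands in for the list comprehension, and the denominator is the sum of up to six largest counts per file, as in the posted code rather than the "max_freq" named in the prose. The cut-offs use >= as the posted code does.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per extracted line.
df = pd.DataFrame({
    'FileName': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'COMM_CODE': ['MEAT'] * 9 + ['OIL'] * 1                    # A: top class 9 of 10
               + ['MEAT'] * 7 + ['OIL'] * 3                    # B: top class 7 of 10
               + ['MEAT'] * 4 + ['OIL'] * 3 + ['MILK'] * 3,    # C: top class 4 of 10
})

freq = (df.groupby('FileName')['COMM_CODE']
          .value_counts()
          .reset_index(name='Freq'))

# value_counts orders each file's classes by descending count,
# so the first row per file is its most frequent class.
top = freq.groupby('FileName', as_index=False).first()

# Denominator as in the posted code: sum of the (up to) six largest counts per file.
tot = freq.groupby('FileName')['Freq'].apply(lambda s: s.nlargest(6).sum())

pct = top['Freq'].to_numpy() / tot.reindex(top['FileName']).to_numpy() * 100
top['Confidence'] = np.select([pct >= 80, pct >= 60], ['HIGH', 'MEDIUM'], default='LOW')

result = top[['FileName', 'COMM_CODE', 'Confidence']]
print(result)
```

Aligning the totals with reindex on FileName avoids relying on the two frames happening to share the same row order.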