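I want to remove outliers from a pandas DataFrame using the standard deviation of the column variables, after applying a groupby.
Here is my DataFrame: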
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
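I want to first group the DataFrame by 'Speaker' and then remove the 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as lying more than 3 standard deviations out for that specific feature.
Please let me know if this is possible. Thanks!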
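Answer 0 (score: 1)
The only dependency this approach needs is pandas.
Assume we have replaced the NaN values in the 'Speaker' column with a representative placeholder such as 'CommitteeOrganization':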
speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker
So our data looks like this:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275
2 19.570911 27.362067 17.253580 JanetYellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group the rows with pandas' groupby and take the mean of each group:
datasetGrouped = dataset.groupby(by='Speaker').mean()
So our grouped data looks like this:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
JanetYellen 19.570911 27.362067 17.253580 -0.604757
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798
LaelBrainard 18.798471 26.657992 16.858818 -0.001191
StanleyFischer 17.766196 33.889406 16.157089 0.582183
WilliamDudley 8.276460 64.478573 9.034156 1.670275
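One detail behind the placeholder step: by default, groupby drops rows whose key is NaN, so without the fillna the committee rows would disappear from the grouped result. A minimal sketch using a few rows copied from the question's data (the variable names are illustrative and not from the original answer; `dropna=False` assumes pandas 1.1 or later):

```python
import pandas as pd

# A few rows copied from the question's data; None becomes a missing Speaker.
dataset = pd.DataFrame({
    "ARI": [-2.09, 8.276460, 20.643483, -2.09, 16.953460],
    "Flesch": [121.22, 64.478573, 17.069411, 121.22, 36.246573],
    "Speaker": [None, "William Dudley", "Lael Brainard", None, "Lael Brainard"],
})

# Default behaviour: rows with a missing key are left out of the groups.
default_groups = dataset.groupby("Speaker").mean()

# Keep the missing key as its own group (pandas >= 1.1).
with_nan_group = dataset.groupby("Speaker", dropna=False).mean()

# Fill the key first, as the answer does, so the group gets a readable label.
filled = dataset.assign(Speaker=dataset["Speaker"].fillna("CommitteeOrganization"))
filled_groups = filled.groupby("Speaker").mean()

print(len(default_groups), len(with_nan_group), len(filled_groups))
```

Either route keeps the committee rows; filling the key has the advantage that the group can be selected by name afterwards.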
Compute the standard deviation of each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
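Note that pandas' std() returns the sample standard deviation (ddof=1) by default, which is slightly larger than the population value. A quick sketch using the grouped ARI means from the table above (the variable names are illustrative):

```python
import pandas as pd

# Grouped ARI means copied from the table above.
ari = pd.Series([-2.090000, 19.570911, 12.624198, 18.798471, 17.766196, 8.276460])

sample_std = ari.std()            # ddof=1, the pandas default
population_std = ari.std(ddof=0)  # ddof=0, the same convention as numpy.std's default

print(sample_std, population_std)
```

With only six groups the two differ noticeably, so it is worth deciding which one the 3-standard-deviation rule should use.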
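Replace the values with NaN on the rows that meet the condition:
# Assign a real missing value rather than the string 'NaN', so the columns stay numeric.
# Note: this compares each value's absolute size with 3 * std,
# not its distance from the column mean.
datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3,'ARI'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3,'Flesch'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3,'Kincaid'] = float('nan')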
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
JanetYellen 19.570911 27.3621 17.253580 -0.604757
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798
LaelBrainard 18.798471 26.658 16.858818 -0.001191
StanleyFischer 17.766196 33.8894 16.157089 0.582183
WilliamDudley 8.276460 64.4786 9.034156 1.670275
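The condition used above compares each value's absolute size with three standard deviations. If the intent is the more usual rule, a value lying more than three standard deviations from its column mean, a sketch along these lines may be closer. This is a variation, not the original answer's method, and `cols` and `threshold` are names chosen for illustration:

```python
import pandas as pd

# Grouped means copied from the table above.
datasetGrouped = pd.DataFrame(
    {
        "ARI": [-2.090000, 19.570911, 12.624198, 18.798471, 17.766196, 8.276460],
        "Flesch": [121.220000, 27.362067, 52.220468, 26.657992, 33.889406, 64.478573],
        "Kincaid": [-3.400000, 17.253580, 11.403157, 16.858818, 16.157089, 9.034156],
    },
    index=pd.Index(
        ["CommitteeOrganization", "JanetYellen", "JeromeH.Powell",
         "LaelBrainard", "StanleyFischer", "WilliamDudley"],
        name="Speaker",
    ),
)

cols = ["ARI", "Flesch", "Kincaid"]
threshold = 3

# Distance of every value from its column mean, in units of the column's std.
z = (datasetGrouped[cols] - datasetGrouped[cols].mean()) / datasetGrouped[cols].std()

# mask() replaces entries where the condition is True with NaN.
cleaned = datasetGrouped[cols].mask(z.abs() > threshold)

print(cleaned)
```

With only six groups, no value can be as far as three sample standard deviations from the mean (the largest possible distance for n points is (n-1)/sqrt(n), about 2.04 here), so nothing is masked under this rule. A lower threshold, or applying the rule to the ungrouped rows, would be needed for it to have any effect.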
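Full code: GitHub
Note: this could be done with less code than shown here, but the answer is written step by step to make it easier to follow.
Note 2: Because the question is somewhat ambiguous, if I have misunderstood something and not given the right answer, do not hesitate to tell me and I will update the answer if possible.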
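Since the question talks about grouping by 'Speaker' first, another reading is that outliers should be judged within each speaker's own rows rather than across the grouped means. A minimal sketch under that assumption, using made-up numbers; the `remove_group_outliers` helper and its parameters are hypothetical and are not part of pandas or of the answer above:

```python
import pandas as pd


def remove_group_outliers(df, key, cols, threshold=3.0):
    """Return a copy of df in which values in cols that lie more than
    `threshold` group standard deviations from their group mean are NaN."""
    grouped = df.groupby(key)[cols]
    # transform() returns per-group statistics aligned to the original rows.
    group_mean = grouped.transform("mean")
    group_std = grouped.transform("std")
    z = (df[cols] - group_mean) / group_std
    out = df.copy()
    # Single-row groups (std is NaN) and zero-spread groups (0/0) give a NaN z,
    # so the comparison is False and those values are kept.
    out[cols] = df[cols].mask(z.abs() > threshold)
    return out


# Made-up data: speaker "A" has eleven typical values and one extreme value.
df = pd.DataFrame({
    "Speaker": ["A"] * 12 + ["B"] * 3,
    "ARI": [10.0] * 11 + [100.0] + [5.0, 6.0, 7.0],
})

cleaned = remove_group_outliers(df, "Speaker", ["ARI"])
print(cleaned["ARI"].isna().sum())
```

To drop the affected rows instead of keeping them with NaN, `cleaned.dropna(subset=["ARI"])` can be applied afterwards. Note that a group needs at least eleven rows before any single value can exceed three sample standard deviations, so with small groups like those in the question this rule will rarely flag anything.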