Automatically removing outliers from a pandas DataFrame

Time: 2016-06-09 06:41:49

Tags: python-2.7 pandas outliers

I have a pandas DataFrame:

data = pd.read_csv(path)

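Before I run some prediction algorithms, I am looking for a good way to remove outlier rows, meaning rows that have an extreme value in any feature (the DataFrame has 400 features).

I have tried a couple of approaches, but neither seems to solve the problem: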

  • data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

  • Using StandardScaler

1 answer:

Answer 0 (score: 0):

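I think your solution works well. You can verify the output by comparing the index of the original DataFrame with the index of the filtered one using Index.difference: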

import pandas as pd
import numpy as np

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print (df)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
21 -0.208499  1.033801 -2.400454
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
27 -3.563517  1.321106  0.152631
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print (df1)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
30  0.079842 -0.399965 -1.027851
31 -0.584718  0.816594 -0.081947
idx = df.index.difference(df1.index)
print (idx)
Int64Index([21, 27], dtype='int64')

print (df.loc[idx])
           A         B         C
21 -0.208499  1.033801 -2.400454
27 -3.563517  1.321106  0.152631
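
The two steps above (filter by z-score, then compare indexes) can be wrapped into one function. The following is only a minimal sketch: `split_outlier_rows`, its `threshold` parameter, and the `demo` frame are hypothetical names introduced here for illustration, and the sketch assumes every column is numeric and has a non-zero standard deviation.

```python
import pandas as pd


def split_outlier_rows(df, threshold=3.0):
    """Return (kept_rows, removed_index) using a per-column z-score rule.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Column-wise z-scores: df.mean() and df.std() are per-column Series,
    # and DataFrame arithmetic aligns them with the column labels.
    z = (df - df.mean()) / df.std()
    # Keep a row only if every column is strictly within the threshold.
    keep = (z.abs() < threshold).all(axis=1)
    kept = df[keep]
    # Same check as in the answer: which index labels were dropped?
    removed = df.index.difference(kept.index)
    return kept, removed


# Small deterministic frame: the last row holds an extreme value in column A.
demo = pd.DataFrame({
    "A": [float(i % 5) for i in range(19)] + [100.0],
    "B": [float(i % 3) for i in range(20)],
})

kept, removed = split_outlier_rows(demo)
print(removed)            # index labels of the rows that were dropped
print(demo.loc[removed])  # the dropped rows themselves
```

Because a row is dropped as soon as any single column exceeds the threshold, the share of rows removed grows with the number of features. As a rough guide, if 400 columns were independent and normally distributed, only about a third of the rows (0.9973^400 ≈ 0.34) would be expected to pass a threshold of 3 in every column, so a larger threshold may be worth considering for wide data.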