I have a pandas DataFrame:
data = pd.read_csv(path)
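Before running some prediction algorithms, I am looking for a good way to remove outlier rows, i.e. rows that have an extreme value in any feature (the DataFrame has 400 features).
I tried a few approaches, but they don't seem to solve the problem: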
data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
Using StandardScaler
Answer 0 (score: 0)
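I think your solution works fine. You can check its output by comparing the index of the original DataFrame with the index of the filtered one using Index.difference: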
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print (df)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
21 -0.208499  1.033801 -2.400454
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
27 -3.563517  1.321106  0.152631
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print (df1)
           A         B         C
0   0.471435 -1.190976  1.432707
1  -0.312652 -0.720589  0.887163
2   0.859588 -0.636524  0.015696
3  -2.242685  1.150036  0.991946
4   0.953324 -2.021255 -0.334077
5   0.002118  0.405453  0.289092
6   1.321158 -1.546906 -0.202646
7  -0.655969  0.193421  0.553439
8   1.318152 -0.469305  0.675554
9  -1.817027 -0.183109  1.058969
10 -0.397840  0.337438  1.047579
11  1.045938  0.863717 -0.122092
12  0.124713 -0.322795  0.841675
13  2.390961  0.076200 -0.566446
14  0.036142 -2.074978  0.247792
15 -0.897157 -0.136795  0.018289
16  0.755414  0.215269  0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620  0.354020
19 -0.035513  0.565738  1.545659
20 -0.974236 -0.070345  0.307969
22  2.030604 -1.142631  0.211883
23  0.704721 -0.785435  0.462060
24  0.704228  0.523508 -0.926254
25  2.007843  0.226963 -1.152659
26  0.631979  0.039513  0.464392
28  0.164530 -0.430096  0.767369
29  0.984920  0.270836  1.391986
30  0.079842 -0.399965 -1.027851
31 -0.584718  0.816594 -0.081947
idx = df.index.difference(df1.index)
print (idx)
Int64Index([21, 27], dtype='int64')
print (df.loc[idx])
           A         B         C
21 -0.208499  1.033801 -2.400454
27 -3.563517  1.321106  0.152631
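With 400 features, two things can make the one-liner from the question drop more rows than expected. A constant column has a standard deviation of 0, so its z-score is NaN, and `NaN < 3` is False for every row. A missing value likewise fails the `< 3` test. I can't tell whether that is what happens with your data, so treat the following as a minimal sketch. The helper name `split_outliers`, the `threshold` parameter, and the `demo` frame are made up for illustration and are not part of pandas. The sketch skips non-numeric and constant columns, treats missing values as "not an outlier", and returns the removed rows directly so that no index comparison is needed afterwards.

```python
import numpy as np
import pandas as pd


def split_outliers(df, threshold=3.0):
    """Return (kept, removed) rows using a per-column z-score rule.

    Hypothetical helper for illustration; not a pandas function.
    """
    numeric = df.select_dtypes(include="number")
    # Keep only columns with a positive sample standard deviation;
    # constant (or all-NaN) columns would give NaN z-scores.
    numeric = numeric.loc[:, numeric.std() > 0]
    # Vectorized z-scores for all remaining columns at once.
    z = (numeric - numeric.mean()) / numeric.std()
    # A NaN z-score compares as False, so missing values never flag a row.
    is_outlier = (z.abs() >= threshold).any(axis=1)
    return df[~is_outlier], df[is_outlier]


# Small made-up frame: one extreme value, one constant column, one NaN.
demo = pd.DataFrame({
    "A": [0.0] * 19 + [100.0],          # row 19 is extreme
    "B": [1.0] * 20,                    # constant column
    "C": [np.nan] + list(range(19)),    # row 0 has a missing value
})

kept, removed = split_outliers(demo)
print(removed.index.tolist())
print(len(kept))
```

On this made-up frame the expression from the question returns an empty DataFrame because of the constant column `B`, while the sketch flags only row 19. Whether missing values should be kept, dropped, or imputed before filtering is a separate decision that depends on the data.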