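Inspired by "Can scipy.stats identify and mask obvious outliers?", I would like to understand the output of statsmodels' OLS.
I adapted the code so that it runs on current versions, and I would like to know how to interpret the result of *.outlier_test(), specifically whether bonf(p) < 0.5 is really the right way to identify an outlier.
Code: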
from random import random
import statsmodels.api as smapi
from statsmodels.formula.api import ols
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))
y = [y*(10+random())+200 for y in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
x.insert(6,16)
y.insert(6,295)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers #
test = regression.outlier_test()
print("test.columns:", test.columns)
print(test)
outliers = ((x[i],y[i]) for i,t in enumerate(test.iloc[:,2]) if t < 0.5)
print('Outliers: ', list(outliers))
# Figure #
figure = smgraphics.regressionplots.plot_fit(regression, 1)
# Add line #
smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])
figure.show()
Output:
test.columns: Index(['student_resid', 'unadj_p', 'bonf(p)'], dtype='object')
student_resid unadj_p bonf(p)
0 0.256226 7.995850e-01 1.000000e+00
1 0.235436 8.155247e-01 1.000000e+00
2 0.266506 7.917355e-01 1.000000e+00
3 0.195602 8.462860e-01 1.000000e+00
4 0.206646 8.377301e-01 1.000000e+00
5 0.235760 8.152759e-01 1.000000e+00
6 -2.670250 1.229206e-02 3.933460e-01
7 -9.404263 2.609308e-10 8.349786e-09
8 0.160577 8.735400e-01 1.000000e+00
9 0.317017 7.535015e-01 1.000000e+00
10 0.120925 9.045843e-01 1.000000e+00
11 0.249872 8.044476e-01 1.000000e+00
12 0.250744 8.037804e-01 1.000000e+00
13 0.399508 6.924460e-01 1.000000e+00
14 0.313912 7.558347e-01 1.000000e+00
15 0.187027 8.529415e-01 1.000000e+00
16 0.019263 9.847634e-01 1.000000e+00
17 0.038839 9.692847e-01 1.000000e+00
18 0.015481 9.877546e-01 1.000000e+00
19 0.417676 6.792601e-01 1.000000e+00
20 0.153612 8.789799e-01 1.000000e+00
21 0.201890 8.414121e-01 1.000000e+00
22 0.540464 5.930042e-01 1.000000e+00
23 0.216489 8.301220e-01 1.000000e+00
24 -0.156133 8.770102e-01 1.000000e+00
25 0.477092 6.368722e-01 1.000000e+00
26 0.246855 8.067600e-01 1.000000e+00
27 0.494958 6.243592e-01 1.000000e+00
28 0.413796 6.820681e-01 1.000000e+00
29 0.067460 9.466782e-01 1.000000e+00
30 0.165854 8.694224e-01 1.000000e+00
31 0.511132 6.131286e-01 1.000000e+00
and:
Outliers: [(16, 295), (15, 220)]
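For context, the bonf(p) column in this output looks like unadj_p multiplied by the number of observations (32 rows here) and capped at 1, which is how I understand a Bonferroni correction. Below is a minimal sketch that checks this against three rows copied from the output above. The helper name `bonferroni` is my own, not a statsmodels function, and the 0.05 level is only a conventional value I am assuming for comparison with my 0.5 cutoff.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# unadj_p values for rows 6, 7 and 0, copied from the output above
rows = [6, 7, 0]
unadj_p = [1.229206e-02, 2.609308e-10, 7.995850e-01]
n_obs = 32  # the output has rows 0..31

bonf_p = bonferroni(unadj_p, n_obs)
for row, p, b in zip(rows, unadj_p, bonf_p):
    print(f"row {row}: unadj_p={p:.6e} -> bonf(p)={b:.6e}")

# Compare my 0.5 cutoff with an assumed conventional level of 0.05
flagged_at_050 = [row for row, b in zip(rows, bonf_p) if b < 0.5]
flagged_at_005 = [row for row, b in zip(rows, bonf_p) if b < 0.05]
print("flagged at 0.5: ", flagged_at_050)
print("flagged at 0.05:", flagged_at_005)
```

With these numbers, row 6 (the point (15, 220)) is flagged only with the 0.5 cutoff, while row 7 (the point (16, 295)) is flagged with either, which is exactly why I am unsure about the threshold.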
So my question is:
When using the statsmodels ordinary least squares model, is bonf(p) < 0.5 really the right way to identify outliers?