sklearn and statsmodels give very different logistic regression results

Date: 2021-01-06 08:57:14

Tags: scikit-learn logistic-regression statsmodels

from sklearn.linear_model import LogisticRegression
from io import StringIO
import pandas as pd
import statsmodels.api as sm
    

TESTDATA = StringIO(""",age,age2,gender,average,hypertension
0,61,3721,0,0.068025807,FALSE
1,52,2704,0,0.066346102,FALSE
2,59,3481,0,0.068163704,FALSE
3,47,2209,0,0.062870186,FALSE
4,57,3249,0,0.065415069,TRUE
5,50,2500,1,0.06260146,FALSE
6,44,1936,0,0.067612307,FALSE
7,60,3600,0,0.062675767,FALSE
8,60,3600,0,0.063555558,TRUE
9,65,4225,0,0.066346102,FALSE
10,61,3721,0,0.068163704,FALSE
11,52,2704,0,0.062870186,FALSE
12,59,3481,0,0.065415069,FALSE
13,47,2209,0,0.06260146,FALSE
14,57,2209,0,0.067612307,TRUE
15,50,3249,1,0.067612307,FALSE
16,44,2500,0,0.067612307,FALSE
17,50,1936,0,0.062675767,FALSE
18,44,3600,0,0.063555558,FALSE
19,60,3600,0,0.066346102,TRUE
20,60,4225,0,0.068163704,TRUE
21,65,3721,0,0.062870186,TRUE
22,61,3600,0,0.065415069,FALSE
23,52,3600,0,0.06260146,FALSE
24,57,4225,0,0.067612307,FALSE
25,50,2209,1,0.066346102,TRUE
26,44,3249,0,0.068163704,FALSE
27,60,2500,0,0.062870186,FALSE
28,60,1936,0,0.065415069,FALSE
29,60,3600,0,0.06260146,FALSE
30,65,3600,0,0.067612307,FALSE
31,61,4225,0,0.066346102,FALSE
32,52,3721,0,0.068163704,TRUE
33,59,2704,0,0.062870186,FALSE
34,47,3249,0,0.065415069,FALSE
35,57,2500,1,0.06260146,TRUE
36,50,1936,0,0.067612307,FALSE
37,60,3600,0,0.062675767,FALSE
38,57,3600,0,0.063555558,FALSE
39,50,4225,0,0.067508574,FALSE
40,44,3721,0,0.068163704,TRUE
41,50,3600,0,0.066346102,FALSE
42,44,3600,0,0.068163704,FALSE
43,60,4225,0,0.062870186,TRUE
44,60,3600,0,0.065415069,TRUE
45,33,4225,1,0.06260146,TRUE
46,44,3721,0,0.067612307,FALSE
47,60,2704,0,0.067508574,FALSE
48,60,3600,0,0.068025807,FALSE
49,65,4225,0,0.066346102,FALSE
50,61,3721,0,0.068163704,FALSE
51,52,3600,0,0.062870186,TRUE
52,60,3600,0,0.065415069,FALSE
53,65,4225,0,0.066346102,FALSE
54,61,2209,0,0.062870186,TRUE
55,52,3600,1,0.065415069,FALSE
56,59,4225,0,0.068163704,FALSE
57,47,3721,0,0.062870186,FALSE
58,57,3600,0,0.065415069,TRUE
59,50,3600,0,0.06260146,FALSE
60,44,4225,0,0.067612307,FALSE
61,60,3721,0,0.066346102,FALSE
62,34,1936,0,0.068163704,FALSE
63,59,3600,0,0.062870186,FALSE
64,47,3600,0,0.065415069,TRUE
65,57,4225,1,0.06260146,FALSE
66,56,1936,0,0.067612307,FALSE
67,56,2209,0,0.062675767,FALSE
68,60,3249,0,0.063555558,FALSE
69,65,2500,0,0.067508574,FALSE""")

    
df = pd.read_csv(TESTDATA, sep=",")
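# statsmodels' Logit fits by maximum likelihood (Newton's method by default);
# scikit-learn's LogisticRegression uses the lbfgs solver with an L2 penalty,
# so C=1e9 is passed to make the penalty effectively negligible.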
    
print(sm.Logit(endog=df["hypertension"], exog=df[[ "age", "age2", "gender","average"]]).fit( disp=False).params)
print(LogisticRegression(fit_intercept = False, C = 1e9).fit( df[[ "age", "age2", "gender","average"]],df["hypertension"]).coef_)

The results are completely different:

age         0.011864
age2        0.000294
gender      1.015793
average   -44.285129
[[-2.69997534e-02  8.27509854e-05  7.92208243e-01 -2.28174015e-02]]

Meanwhile, linear regression gives identical results.

from sklearn.linear_model import LinearRegression

print(sm.OLS(endog=df["hypertension"], exog=df[["age", "age2", "gender", "average"]]).fit().params)
print(LinearRegression(fit_intercept=False).fit(df[["age", "age2", "gender", "average"]], df["hypertension"]).coef_)

Results:

age        0.002484
age2       0.000050
gender     0.223877
average   -1.235937
[ 2.48380428e-03  4.98449037e-05  2.23877433e-01 -1.23593682e+00]
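(Side note: the OLS agreement is expected, since linear least squares has a closed-form solution, so the choice of optimizer cannot matter. A minimal sketch checking this with numpy — assuming the same four features, no intercept, and hypertension coerced to 0/1:)

import numpy as np

# Closed-form least-squares solution; any correct linear-regression routine must match it.
X = df[["age", "age2", "gender", "average"]].to_numpy(dtype=float)
y = df["hypertension"].to_numpy(dtype=float)  # TRUE/FALSE parsed as bool, cast to 0/1
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # should match the OLS and LinearRegression coefficients above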

Why is this? It is really puzzling…

1 Answer:

Answer 0 (score: 2)

scikit-learn does not find the best objective value here. statsmodels does a better job on this particular example. The only difference appears to be the choice of optimizer: if statsmodels is forced to use the same optimizer as scikit-learn, the estimated parameter values are identical.

from sklearn.linear_model import LogisticRegression
from io import StringIO
import numpy as np
import pandas as pd
import statsmodels.api as sm
    

TESTDATA = StringIO(""",age,age2,gender,average,hypertension
0,61,3721,0,0.068025807,FALSE
1,52,2704,0,0.066346102,FALSE
2,59,3481,0,0.068163704,FALSE
3,47,2209,0,0.062870186,FALSE
4,57,3249,0,0.065415069,TRUE
5,50,2500,1,0.06260146,FALSE
6,44,1936,0,0.067612307,FALSE
7,60,3600,0,0.062675767,FALSE
8,60,3600,0,0.063555558,TRUE
9,65,4225,0,0.066346102,FALSE
10,61,3721,0,0.068163704,FALSE
11,52,2704,0,0.062870186,FALSE
12,59,3481,0,0.065415069,FALSE
13,47,2209,0,0.06260146,FALSE
14,57,2209,0,0.067612307,TRUE
15,50,3249,1,0.067612307,FALSE
16,44,2500,0,0.067612307,FALSE
17,50,1936,0,0.062675767,FALSE
18,44,3600,0,0.063555558,FALSE
19,60,3600,0,0.066346102,TRUE
20,60,4225,0,0.068163704,TRUE
21,65,3721,0,0.062870186,TRUE
22,61,3600,0,0.065415069,FALSE
23,52,3600,0,0.06260146,FALSE
24,57,4225,0,0.067612307,FALSE
25,50,2209,1,0.066346102,TRUE
26,44,3249,0,0.068163704,FALSE
27,60,2500,0,0.062870186,FALSE
28,60,1936,0,0.065415069,FALSE
29,60,3600,0,0.06260146,FALSE
30,65,3600,0,0.067612307,FALSE
31,61,4225,0,0.066346102,FALSE
32,52,3721,0,0.068163704,TRUE
33,59,2704,0,0.062870186,FALSE
34,47,3249,0,0.065415069,FALSE
35,57,2500,1,0.06260146,TRUE
36,50,1936,0,0.067612307,FALSE
37,60,3600,0,0.062675767,FALSE
38,57,3600,0,0.063555558,FALSE
39,50,4225,0,0.067508574,FALSE
40,44,3721,0,0.068163704,TRUE
41,50,3600,0,0.066346102,FALSE
42,44,3600,0,0.068163704,FALSE
43,60,4225,0,0.062870186,TRUE
44,60,3600,0,0.065415069,TRUE
45,33,4225,1,0.06260146,TRUE
46,44,3721,0,0.067612307,FALSE
47,60,2704,0,0.067508574,FALSE
48,60,3600,0,0.068025807,FALSE
49,65,4225,0,0.066346102,FALSE
50,61,3721,0,0.068163704,FALSE
51,52,3600,0,0.062870186,TRUE
52,60,3600,0,0.065415069,FALSE
53,65,4225,0,0.066346102,FALSE
54,61,2209,0,0.062870186,TRUE
55,52,3600,1,0.065415069,FALSE
56,59,4225,0,0.068163704,FALSE
57,47,3721,0,0.062870186,FALSE
58,57,3600,0,0.065415069,TRUE
59,50,3600,0,0.06260146,FALSE
60,44,4225,0,0.067612307,FALSE
61,60,3721,0,0.066346102,FALSE
62,34,1936,0,0.068163704,FALSE
63,59,3600,0,0.062870186,FALSE
64,47,3600,0,0.065415069,TRUE
65,57,4225,1,0.06260146,FALSE
66,56,1936,0,0.067612307,FALSE
67,56,2209,0,0.062675767,FALSE
68,60,3249,0,0.063555558,FALSE
69,65,2500,0,0.067508574,FALSE""")

    
df = pd.read_csv(TESTDATA, sep=",")


mod = sm.Logit(endog=df["hypertension"], exog=df[[ "age", "age2", "gender","average"]])
sk_mod = LogisticRegression(fit_intercept = False, C = 1e9).fit( df[[ "age", "age2", "gender","average"]],df["hypertension"])

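# Start both statsmodels fits from the scikit-learn solution (the first positional
# argument to fit() is start_params), so any difference comes from the optimizer alone.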
res_default = mod.fit(np.squeeze(sk_mod.coef_), disp=False)
res_lbfgs= mod.fit(np.squeeze(sk_mod.coef_), method="lbfgs", disp=False)

print("The default optimizer produces a larger log-likelihood (the optimization target)")
print(f"Default: {res_default.llf}, LBFGS: {res_lbfgs.llf}")
print("LBFGS is identical to SK Learn")
print(f"SK Learn coef\n {np.squeeze(sk_mod.coef_)}")
print(f"LBFGS coef \n {np.asarray(res_lbfgs.params)}")
print("The default optimizer produces different estimates")
print(f"Default coef \n {np.asarray(res_default.params)}")


res_lbfgs_sv= mod.fit(res_default.params, method="lbfgs", disp=False)
print(f"LBFGS with better starting parameters matches the default\n {np.asarray(res_lbfgs_sv.params)}")

Running the code produces:

The default optimizer produces a larger log-likelihood (the optimization target)
Default: -15.853969516447952, LBFGS: -16.30414297615966
LBFGS is identical to SK Learn
SK Learn coef
 [-4.42216394e-02  2.23648541e-04  1.19470339e+00 -4.28565669e-03]
LBFGS coef
 [-4.42216394e-02  2.23648541e-04  1.19470339e+00 -4.28565669e-03]
The default optimizer produces different estimates
Default coef
 [ 1.33419520e-02  4.79332044e-04  1.69742850e+00 -6.53888649e+01]
LBFGS with better starting parameters matches the default
 [ 1.33419520e-02  4.79332044e-04  1.69742850e+00 -6.53888649e+01]
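
A follow-up sketch (an addition, not part of the original answer): if you want scikit-learn itself to reach the better optimum, the usual levers are a tighter tolerance, a larger iteration budget, or a different solver. solver, tol and max_iter are standard LogisticRegression parameters; the values below are illustrative, not tuned.

# Sketch: push scikit-learn's optimizer harder; whether it lands exactly on the
# statsmodels solution depends on the data, so compare the resulting log-likelihoods.
sk_tight = LogisticRegression(
    fit_intercept=False,
    C=1e9,                # effectively unpenalized, as in the question
    solver="newton-cg",   # second-order solver instead of the default lbfgs
    tol=1e-10,
    max_iter=100_000,
).fit(df[["age", "age2", "gender", "average"]], df["hypertension"])

print(sk_tight.coef_)
print(mod.loglike(np.squeeze(sk_tight.coef_)))  # compare with res_default.llf above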