I am trying to produce a logistic regression in Python that gives the same results as R. It comes close, but it is not identical. I put together the following example to illustrate the difference; the data are not real.
R
# RStudio 1.1.453
d <- data.frame(c(0, 0, 1, 1, 1),
                c(1, 0, 0, 0, 0),
                c(0, 1, 0, 0, 0))
colnames(d) <- c("v1", "v2", "v3")
model <- glm(v1 ~ v2,
             data = d,
             family = "binomial")
summary(model)
R output
Call:
glm(formula = v1 ~ v2, family = "binomial", data = d)
Deviance Residuals: 
       1         2         3         4         5  
-1.66511  -0.00013   0.75853   0.75853   0.75853  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
v2           -19.665   6522.639  -0.003    0.998
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.7301 on 4 degrees of freedom
Residual deviance: 4.4987 on 3 degrees of freedom
AIC: 8.4987
Number of Fisher Scoring iterations: 17
Python
# Python 3.7.1
import pandas as pd # 0.23.4
import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf # 0.9.0
d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0],
                  "v3": [0, 1, 0, 0, 0]})
model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(link=sm.genmod.families.links.logit),
                data=d
                ).fit()
model.summary()
Python output
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     v1   No. Observations:                    5
Model:                            GLM   Df Residuals:                        3
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2.2493
Date:                Wed, 07 Nov 2018   Deviance:                       4.4987
Time:                        15:17:52   Pearson chi2:                     4.00
No. Iterations:                    19   Covariance Type:             nonrobust
==============================================================================
                coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     1.0986      1.155      0.951      0.341      -1.165       3.362
v2          -21.6647   1.77e+04     -0.001      0.999   -3.48e+04    3.47e+04
==============================================================================
The number of iterations differs, and as far as I can tell the two implementations may use different convergence methods, but I don't understand the details. Is there some other setting I might be missing?
Answer 0 (score: 2)
My guess is that they make different trade-offs around numerical stability. The variance of the v2 estimate is enormous, which probably trips both of them up ... I'd say they're basically giving the same answer, at least up to the limits of what double-precision arithmetic can resolve.
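To see how close the two fits really are, here is a quick check in plain Python that plugs the coefficient estimates copied from the two summaries above into the inverse logit; the predicted probability for an observation with v2 = 1 is indistinguishable from zero under either set of coefficients:

import math

def inv_logit(x):
    # Inverse logit (logistic) function: maps a linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-x))

# Linear predictors for an observation with v2 = 1, using the coefficient
# estimates copied from the R and statsmodels summaries above.
r_eta  = 1.099  - 19.665     # R:           (Intercept) + v2
py_eta = 1.0986 - 21.6647    # statsmodels: Intercept   + v2

print(inv_logit(r_eta))   # ~8.6e-09
print(inv_logit(py_eta))  # ~1.2e-09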
The R implementation lets you pass a control argument:
> options(digits=12)
> model <- glm(v1 ~ v2, data=d, family="binomial", control=list(trace=T))
Deviance = 4.67724333758 Iterations - 1
Deviance = 4.5570420311 Iterations - 2
Deviance = 4.51971688994 Iterations - 3
Deviance = 4.50636401333 Iterations - 4
Deviance = 4.50150009179 Iterations - 5
Deviance = 4.49971718523 Iterations - 6
Deviance = 4.49906215541 Iterations - 7
Deviance = 4.49882130019 Iterations - 8
Deviance = 4.4987327103 Iterations - 9
Deviance = 4.49870012203 Iterations - 10
Deviance = 4.49868813377 Iterations - 11
Deviance = 4.49868372357 Iterations - 12
Deviance = 4.49868210116 Iterations - 13
Deviance = 4.4986815043 Iterations - 14
Deviance = 4.49868128473 Iterations - 15
Deviance = 4.49868120396 Iterations - 16
Deviance = 4.49868117424 Iterations - 17
which shows how it converges, but I couldn't find anything similar in the Python code. The output above also suggests the two may use different cutoff values to decide on convergence; R uses epsilon = 1e-8.
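For what it's worth, statsmodels' GLM .fit() does expose knobs for the IRLS loop (tol and maxiter), so those are the natural places to try matching R's control settings. A minimal sketch along those lines; whether statsmodels' stopping rule is directly comparable to R's epsilon on the deviance change is an assumption to verify, not a given:

import statsmodels.api as sm
import statsmodels.formula.api as smf
# (d is the same DataFrame defined in the question above)

model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(),
                data=d
                ).fit(method="IRLS",  # the default fitting method, shown explicitly
                      tol=1e-8,       # convergence tolerance (statsmodels' default)
                      maxiter=100)    # iteration cap (statsmodels' default)

# statsmodels keeps per-iteration bookkeeping for IRLS; the deviance trace
# below is the closest analogue I know of to R's trace=T output (check that
# fit_history is populated in your statsmodels version).
print(model.fit_history["deviance"])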