I am running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept and scores).
#Python codes.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
In[23]: iris_df.head(5)
Out[23]:
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
In[35]: iris_df.shape
Out[35]: (100, 5)
#looking at the levels of the Species dependent variable..
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
#creating dependent and independent variable datasets..
#.ix has been removed from pandas; use position-based .iloc instead
x = iris_df.iloc[:, 0:4]
y = iris_df.iloc[:, -1]
#modelling starts..
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x,y)
#getting the model coefficients..
model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[30]:
0 1
0 Sepal.Length [-0.402473917528]
1 Sepal.Width [-1.46382924771]
2 Petal.Length [2.23785647964]
3 Petal.Width [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 , 0.0162694 ],
[ 0.96407227, 0.03592773],
[ 0.97647105, 0.02352895],
[ 0.95654126, 0.04345874],
[ 0.98534488, 0.01465512],
[ 0.98086592, 0.01913408],
...])
#R codes.
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.681e-05 -2.110e-08 0.000e+00 2.110e-08 2.006e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.556 601950.324 0 1
Sepal.Length -9.879 194223.245 0 1
Sepal.Width -7.418 92924.451 0 1
Petal.Length 19.054 144515.981 0 1
Petal.Width 25.033 216058.936 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Because of the convergence problem, I increased the maximum number of iterations and relaxed epsilon (the run below uses epsilon = 0.01 and maxit = 100).
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.0102793 -0.0005659 -0.0000052 0.0001438 0.0112531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.796 704.352 0.003 0.998
Sepal.Length -3.426 215.912 -0.016 0.987
Sepal.Width -4.208 123.513 -0.034 0.973
Petal.Length 7.615 159.478 0.048 0.962
Petal.Width 11.835 285.938 0.041 0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept and coefficients from R and Python are completely different. Which one is correct? I would like to continue with Python, but now I am confused about whether its results are accurate.
Answer (score: 3)
UPDATED: The problem is that there is perfect separation along the Petal.Width variable. In other words, this variable can perfectly predict whether a sample in the given dataset is setosa or versicolor. That breaks the maximum-likelihood estimation used by logistic regression in R: the log-likelihood can be driven arbitrarily high by pushing the Petal.Width coefficient toward infinity.
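A quick way to see the separation directly (a minimal sketch; it rebuilds the 100-row setosa/versicolor subset from sklearn's bundled iris data rather than from the asker's file):
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the two-class subset: setosa (0) vs. versicolor (1).
data = load_iris()
mask = data.target < 2              # drop virginica
X, y = data.data[mask], data.target[mask]

pw = X[:, 3]                        # fourth column is petal width
print("setosa petal width:    ", pw[y == 0].min(), "-", pw[y == 0].max())
print("versicolor petal width:", pw[y == 1].min(), "-", pw[y == 1].max())
# setosa tops out at 0.6 while versicolor starts at 1.0, so a single
# threshold classifies every sample correctly and the unpenalized MLE
# does not exist: the likelihood keeps improving as the coefficient grows.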
Some background and strategies are discussed here.
There is also a good thread on CrossValidated that discusses strategies.
So why does sklearn's LogisticRegression work? Because it uses regularized logistic regression: the regularization penalizes large values of the estimated parameters.
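The effect is easy to demonstrate: C in sklearn is the inverse regularization strength, and weakening the penalty lets the coefficients inflate toward the divergent unpenalized solution. A minimal sketch, reusing the X and y built above:
from sklearn.linear_model import LogisticRegression

for C in (1.0, 100.0, 1e6):
    clf = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    print(C, np.round(clf.coef_, 2), np.round(clf.intercept_, 2))
# At the default C=1.0 the coefficients stay moderate (comparable to the
# question's Python output, up to solver defaults); as C grows they keep
# inflating, mirroring R's non-converging glm fit.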
In the example below I use Firth's bias-reduced logistic regression method, from the logistf package, to produce a model that converges.
library(logistf)
irisdf = read.table("path_to_iris.txt", sep="\t", header=TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)  # confirm Species is now a factor
model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.
model2 <- logistf(Species ~ ., data = irisdf)
# Does converge.
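If you prefer to stay in Python, you can reproduce both the failure and a penalized workaround with statsmodels (a sketch, again assuming the X and y built above; depending on the statsmodels version, the unpenalized fit raises a PerfectSeparationError or a warning, while an L1-penalized fit returns finite estimates):
import statsmodels.api as sm

Xc = sm.add_constant(X)   # add an explicit intercept column
# sm.Logit(y, Xc).fit()   # unpenalized: trips over the perfect separation
res = sm.Logit(y, Xc).fit_regularized(method='l1', alpha=1.0, disp=False)
print(res.params)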
ORIGINAL: Based on the std. errors and z values in your R output, I think your model specification is bad. z values close to 0 essentially tell you there is no relationship between the predictors and the dependent variable, so that model is nonsensical.
My first thought was that you need to convert the Species field into a categorical variable; it is of type int in your example. Try using as.factor.