Question

The regularization parameter C in logistic regression (see http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is used to keep the function well defined and to avoid overfitting or problems with step functions (see https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806).

However, regularization in logistic regression should only concern the feature weights, not the intercept (also explained here: http://aimotion.blogspot.com/2011/11/machine-learning-with-python-logistic.html).

But it seems that sklearn.linear_model.LogisticRegression actually regularizes the intercept as well, for the following reasons:

1) Look carefully at the link above (https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806): the sigmoid is shifted slightly to the left, closer to an intercept of 0.

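2) I tried fitting the data points with a logistic curve and a hand-written maximum-likelihood function. Including the intercept in the L2 norm gives the same result as the sklearn function.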
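The manual fit described in point 2 can be sketched as follows. This is not the asker's original code, only a minimal pure-Python reconstruction under stated assumptions: the objective is taken to be 0.5 * (squared L2 norm of the penalized parameters) + C * (sum of log-losses), and the helper names `softplus`, `sigmoid`, and `fit_logistic` as well as the `penalize_intercept` flag are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, C, penalize_intercept, max_iter=200):
    """Fit p(y=1|x) = sigmoid(a*x + b) with damped Newton steps.

    Minimises 0.5*a**2 (+ 0.5*b**2 if penalize_intercept)
    + C * sum of log-losses. Returns (a, b).
    """
    lam_b = 1.0 if penalize_intercept else 0.0  # assumed penalty weight on b

    def objective(a, b):
        loss = sum(softplus(a * x + b) - y * (a * x + b) for x, y in zip(xs, ys))
        return 0.5 * (a * a + lam_b * b * b) + C * loss

    a = b = 0.0
    for _ in range(max_iter):
        # Gradient and Hessian of the penalised negative log-likelihood.
        ga, gb = a, lam_b * b
        haa, hab, hbb = 1.0, 0.0, lam_b
        for x, y in zip(xs, ys):
            p = sigmoid(a * x + b)
            w = p * (1.0 - p)
            ga += C * (p - y) * x
            gb += C * (p - y)
            haa += C * w * x * x
            hab += C * w * x
            hbb += C * w
        det = haa * hbb - hab * hab
        # Newton direction d = -H^{-1} g for the 2x2 system.
        da = -(hbb * ga - hab * gb) / det
        db = -(haa * gb - hab * ga) / det
        slope = ga * da + gb * db
        if -slope < 1e-12:  # Newton decrement is tiny: converged
            break
        # Backtracking line search (Armijo condition).
        t, f0 = 1.0, objective(a, b)
        while t > 1e-10 and objective(a + t * da, b + t * db) > f0 + 1e-4 * t * slope:
            t *= 0.5
        a, b = a + t * da, b + t * db
    return a, b

xs = list(range(100, 110))
ys = [0] * 5 + [1] * 5

for flag in (True, False):
    a, b = fit_logistic(xs, ys, C=10.0, penalize_intercept=flag)
    print("penalize_intercept=%s: a=%.6f, boundary -b/a=%.3f" % (flag, a, -b / a))
```

On the data from this question, the penalized-intercept fit should land close to the values reported in the output (a of about 0.0117, boundary of about 100.48), while leaving the intercept out of the penalty should put the boundary at 104.5, the midpoint between the two classes.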

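Two questions, please:

1) Am I mistaken, is this a bug, or is there a legitimate reason to regularize the intercept?

2) Is there a way to use sklearn and specify regularization for all parameters except the intercept?

Thanks!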

import numpy as np
from sklearn.linear_model import LogisticRegression

C = 1e1
model = LogisticRegression(C=C)

x = np.arange(100, 110)
x = x[:, np.newaxis]
y = np.array([0]*5 + [1]*5)

print(x)
print(y)

model.fit(x, y)
a = model.coef_[0][0]
b = model.intercept_[0]

# Decision boundary; without regularization it should be 104.5 (as for C=1e10)
b_modified = -b / a

print("a, b:", a, b_modified)

# OUTPUT: 
# [[100]
#  [101]
#  [102]
#  [103]
#  [104]
#  [105]
#  [106]
#  [107]
#  [108]
#  [109]]
# [0 0 0 0 0 1 1 1 1 1]
# a, b: 0.0116744221756 100.478968664

Answer 1

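Logistic regression in scikit-learn is regularized by default.

Changing the intercept_scaling parameter of sklearn.linear_model.LogisticRegression affects the result in much the same way as changing only the C parameter.

When intercept_scaling is modified, it changes how strongly regularization affects the estimate of the bias (intercept) in logistic regression. The higher the value of this parameter, the smaller the effect of regularization on the bias. Per the official documentation: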

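The intercept becomes intercept_scaling * synthetic_feature_weight.

Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.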
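One way to read the quoted documentation, assuming an L2 penalty and labels y_i in {-1, +1}, is the objective below. The symbols are my own notation rather than the documentation's: w are the feature weights, v is the synthetic feature weight, s is intercept_scaling, and b is the resulting intercept.

```latex
\min_{w,\,v}\;\; \tfrac{1}{2}\left(\lVert w\rVert_2^2 + v^2\right)
  + C \sum_{i} \log\!\left(1 + e^{-y_i\,(w^\top x_i + s\,v)}\right),
\qquad b = s\,v
\quad\Longrightarrow\quad
\tfrac{1}{2}v^2 = \tfrac{1}{2}\left(\frac{b}{s}\right)^2
```

Under this reading, the penalty on the intercept itself is (b/s)^2 / 2, so it shrinks quadratically as intercept_scaling grows, which is why a larger value weakens the regularization of the intercept.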

Hope it helps!

Answer 2

Thanks @Prem, this is indeed the solution:

C = 1e1
intercept_scaling = 1e3    # very high numbers make it unstable in practice
model = LogisticRegression(C=C, intercept_scaling=intercept_scaling)
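A quick way to check the effect is to fit the question's data with two values of intercept_scaling and compare the decision boundaries. This is a sketch, not part of the original answer. It assumes a scikit-learn version in which the liblinear solver has to be requested explicitly (the documentation describes intercept_scaling as useful only with that solver), and the `boundaries` dictionary is just an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(100, 110)[:, np.newaxis]
y = np.array([0] * 5 + [1] * 5)

boundaries = {}
for scaling in (1.0, 1e3):
    # Assumption: liblinear is selected explicitly so that
    # intercept_scaling actually takes effect.
    model = LogisticRegression(C=1e1, intercept_scaling=scaling, solver="liblinear")
    model.fit(x, y)
    a = model.coef_[0][0]
    b = model.intercept_[0]
    boundaries[scaling] = -b / a
    print("intercept_scaling=%g: boundary -b/a = %.3f" % (scaling, boundaries[scaling]))
```

With the larger scaling the boundary should move from roughly 100.5 toward 104.5, the value expected when the intercept is effectively unregularized.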

Why does sklearn logistic regression regularize both the weights and the intercept?
