Question

I am trying to match the results of SAS PROC LOGISTIC with sklearn in Python 3. SAS fits an unpenalized regression, which I can reproduce in sklearn.linear_model.LogisticRegression with the option penalty='none' or C = 1e9.

That should be the end of the story, but when I use a public data set from UCLA and try to replicate their multiple regression of FEMALE and MATH on hiwrite, I still see a small difference.

This is my Python script:

# module imports
import pandas as pd
from sklearn.linear_model import LogisticRegression

# read in the data
df = pd.read_sas("~/Downloads/hsb2-4.sas7bdat")

# feature engineering: binary outcome, True when the writing score is at least 52
df["hiwrite"] = df["write"] >= 52

print("\n\n")
print("Multiple Regression of Female and Math on hiwrite:")
feature_cols = ['female', 'math']

y = df["hiwrite"]
X = df[feature_cols]

# sklearn output; a very large C makes the default L2 penalty negligible
model = LogisticRegression(fit_intercept=True, C=1e9)
mdl = model.fit(X, y)
print(mdl.intercept_)
print(mdl.coef_)

which produces:

Multiple Regression of Female and Math on hiwrite:
[-10.36619688]
[[1.63062846 0.1978864 ]]

UCLA has this result from SAS:

Analysis of Maximum Likelihood Estimates

                             Standard         Wald
Parameter   DF   Estimate       Error   Chi-Square   Pr > ChiSq
Intercept    1   -10.3651      1.5535      44.5153       <.0001
FEMALE       1     1.6304      0.4052      16.1922       <.0001
MATH         1     0.1979      0.0293      45.5559       <.0001

This is close, but as you can see, the intercept estimate differs at the 3rd decimal place and the estimate on female differs at the 4th decimal place. I tried changing some of the other parameters (tol and max_iter, as well as the solver), but the results did not change. I also tried Logit in statsmodels.api: it matches sklearn, not SAS. R matches Python on the intercept and the first coefficient, but differs slightly from both SAS and Python on the second coefficient...

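Update: I asked on the SAS community forum, and someone suggested the difference may come from how the iterative maximum likelihood algorithms converge. That sounds plausible to me, although I have already tried the options available in Python to address it.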
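One way to probe the convergence hypothesis is to fit the same unpenalized model with a hand-written Newton-Raphson loop and check that the gradient of the log-likelihood is essentially zero at the solution. When the design matrix has full rank and the classes are not perfectly separated, the log-likelihood has a single maximum, so any fully converged fit has to land on the same estimates. The following is only a minimal sketch: `fit_logit_newton` is a hypothetical helper (not part of sklearn, statsmodels, or SAS), and the data are synthetic stand-ins generated with NumPy, not the UCLA hsb2 file.

```python
import numpy as np

def fit_logit_newton(X, y, tol=1e-10, max_iter=100):
    """Unpenalized logistic regression via Newton-Raphson.

    X must already contain a leading column of ones for the intercept.
    Returns (beta, n_iter, max_abs_gradient).
    """
    beta = np.zeros(X.shape[1])
    for n_iter in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # fitted probabilities
        grad = X.T @ (y - p)                           # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])    # negative Hessian (information matrix)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:                 # stop once the update is negligible
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    max_grad = np.max(np.abs(X.T @ (y - p)))
    return beta, n_iter, max_grad

# Synthetic stand-in data (assumption: NOT the real hsb2 data set)
rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, size=n).astype(float)
math_score = rng.normal(52.0, 9.0, size=n)
X = np.column_stack([np.ones(n), female, math_score])
true_beta = np.array([-10.0, 1.5, 0.2])                # made-up coefficients for simulation
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta, n_iter, max_grad = fit_logit_newton(X, y)
print("estimates:", beta)
print("iterations:", n_iter, "max |gradient|:", max_grad)
```

Run on the real data (with `X` built from a column of ones plus `female` and `math`, and `y` from `hiwrite`), a result that agrees with sklearn and has a gradient near zero would suggest that the Python fit is fully converged and that the remaining gap comes from somewhere else, such as the data or the stopping rule on the SAS side.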

Any ideas about the source of the discrepancy, and how to make Python match SAS?

Source of the small differences between SAS PROC LOGISTIC and Python sklearn LogisticRegression (unpenalized)?

0 answers: