Question

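I'm trying out scikit-learn. I thought I'd try a weighted logistic regression, but I get nonsensical predictions from sklearn's LogisticRegression object when I fit it with the sample_weight parameter.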

Here is a toy example that demonstrates the problem. I've set up a very simple dataset with one feature and a binary target output.

feat  target  weight
A       0       1
A       0       1
A       1       1
A       1       1
B       0       1
B       0       1
B       0       1
B       1       W

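So any sensible logistic regression should predict a success probability of 0.5 when feat=A. The probability for feat=B depends on the weight W:

If W=1, there appears to be a 0.25 chance of success.

If W=3, this balances out the three 0s, so there appears to be a 0.5 chance of success.

If W=9, there are effectively nine 1s and three 0s, so a 0.75 chance of success.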
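The three cases above are weighted proportions, which is what an unregularized logistic regression with one parameter per feature level should reproduce. A minimal sketch of that arithmetic for the feat=B rows (the helper name `weighted_success_rate` is made up for illustration):

```python
def weighted_success_rate(targets, weights):
    """Weighted share of successes: sum(w * y) / sum(w)."""
    total = sum(weights)
    return sum(w * y for y, w in zip(targets, weights)) / total

# feat=B rows: three 0s with weight 1 and one 1 with weight W.
targets_b = [0, 0, 0, 1]
rates = {w: weighted_success_rate(targets_b, [1, 1, 1, w]) for w in (1, 3, 9)}

for w in sorted(rates):
    print("W=%d -> %.2f" % (w, rates[w]))
```

This only checks the expected values by hand; it does not fit a model.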

Weighted logistic regression in R gives the correct predictions:

test <- function(final_weight) {
  feat <- c('A','A','A','A','B','B','B','B')
  target <- c(0, 0, 1, 1, 0, 0, 0, 1)
  weight <- c(1, 1, 1, 1, 1, 1, 1, final_weight)
  df = data.frame(feat, target, weight)
  m = glm(target ~ feat, data=df, family='binomial', weights=weight)
  predict(m, type='response')
}

test(1)
#   1    2    3    4    5    6    7    8
#0.50 0.50 0.50 0.50 0.25 0.25 0.25 0.25
test(3)
#  1   2   3   4   5   6   7   8
#0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
test(9)
#   1    2    3    4    5    6    7    8
#0.50 0.50 0.50 0.50 0.75 0.75 0.75 0.75

Great. But in scikit-learn, using the LogisticRegression object, I keep getting nonsensical predictions when W=9. Here is my Python code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

def test(final_weight):
    d = {
        'feat' : ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'target' : [0, 0, 1, 1, 0, 0, 0, 1],
        'weight' : [1, 1, 1, 1, 1, 1, 1, final_weight],
    }
    df = pd.DataFrame(d)
    print df, '\n'

    y, X = dmatrices('target ~ feat', df, return_type="dataframe")
    features = X.columns

    C = 1e10 # high value to prevent regularization
    solver = 'sag' # so we can use sample_weight
    lr = LogisticRegression(C=C, solver=solver)
    lr.fit(X, df.target, sample_weight=df.weight)

    print 'Predictions:', '\n', lr.predict_proba(X), '\n', '===='

test(1)
test(3)
test(9)

This gives the following output (I've removed some of it to make it a bit more concise):

  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       1

Predictions:
[[ 0.50000091  0.49999909]
...
 [ 0.74997935  0.25002065]]
====
  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       3

/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/sag.py:267: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
Predictions:
[[ 0.49939191  0.50060809]
...
 [ 0.49967407  0.50032593]]
====
  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       9

Predictions:
[[ 0.00002912  0.99997088]   # Nonsense predictions for A!
...
 [ 0.00000034  0.99999966]]  # And for B too...
====

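You can see that when I set the final weight to 9 (which doesn't seem like an unreasonably high weight), the predictions are ruined! Not only are the predictions for feat=B ridiculous, but the predictions for feat=A are now ridiculous too.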

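My questions are:

Why do these predictions go wrong when the final weight is 9?

Am I doing something wrong or misunderstanding something?

More generally, I'd be very interested to hear from anyone who has successfully used weighted logistic regression in scikit-learn and obtained predictions similar to those given by R's glm(..., family='binomial') function.

Many thanks in advance for any help.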

Answer 1

The problem appears to be in the solver:

solver = 'sag'

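Using a stochastic solver is common for large datasets, where you assume your training examples are i.i.d. It doesn't work well with high sample weights.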
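To make concrete what a sample weight means in the objective, here is a small pure-Python sketch. It assumes the usual weighted log-loss (each row's loss multiplied by its weight) and a 0/1 encoding of feat; the function names are made up for illustration. It checks that giving one row weight 9 yields the same loss as repeating that row nine times, and that the full-batch gradient vanishes at the probabilities R reports (0.5 for A, 0.75 for B).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_loss(params, rows):
    """Sum of weight * negative log-likelihood; rows are (is_b, target, weight)."""
    intercept, coef_b = params
    loss = 0.0
    for is_b, y, w in rows:
        p = sigmoid(intercept + coef_b * is_b)
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss

def weighted_gradient(params, rows):
    """Full-batch gradient of weighted_log_loss with respect to (intercept, coef_b)."""
    intercept, coef_b = params
    g_intercept = g_coef = 0.0
    for is_b, y, w in rows:
        residual = w * (sigmoid(intercept + coef_b * is_b) - y)
        g_intercept += residual
        g_coef += residual * is_b
    return g_intercept, g_coef

# The first seven rows of the toy dataset, all with weight 1.
base = [(0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 1, 1),
        (1, 0, 1), (1, 0, 1), (1, 0, 1)]
weighted = base + [(1, 1, 9)]        # final row with weight W=9
replicated = base + [(1, 1, 1)] * 9  # final row repeated nine times

# logit(0.5) = 0 for feat=A; logit(0.75) = log(3) for feat=B.
optimum = (0.0, math.log(3.0))

print(weighted_log_loss(optimum, weighted), weighted_log_loss(optimum, replicated))
print(weighted_gradient(optimum, weighted))
```

Note that the weighted row contributes nine times as much to the gradient as any other single row, which is consistent with the point above that high weights sit poorly with a solver that processes one sample at a time.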

After changing the solver to lbfgs, the results match what you're seeing in R.

solver = 'lbfgs'
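For reference, a minimal Python 3 sketch of the toy example with the solver swapped. It is not the asker's exact code: it replaces the patsy design matrix with a hand-built 0/1 column for feat (an assumption made to keep the example self-contained), and `fit_and_predict` is a made-up helper name. With this setup the predicted probabilities should come out close to the values R reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_predict(final_weight):
    """Fit the toy data with lbfgs; return P(target=1) for feat=A and feat=B."""
    # Assumption: feat is encoded by hand as 0 for A and 1 for B.
    X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]], dtype=float)
    y = np.array([0, 0, 1, 1, 0, 0, 0, 1])
    weight = np.array([1, 1, 1, 1, 1, 1, 1, final_weight], dtype=float)

    # High C to make regularization negligible, as in the question.
    lr = LogisticRegression(C=1e10, solver="lbfgs")
    lr.fit(X, y, sample_weight=weight)

    proba = lr.predict_proba(np.array([[0.0], [1.0]]))[:, 1]
    return proba[0], proba[1]

for w in (1, 3, 9):
    p_a, p_b = fit_and_predict(w)
    print("W=%d  P(1|A)=%.3f  P(1|B)=%.3f" % (w, p_a, p_b))
```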

sklearn LogisticRegression predict_proba() gives wrong predictions when using the sample_weight parameter
