Question

我正在使用sklearn包中的LogisticRegression，并且有一个关于分类的快速问题。我为我的分类器建立了一条ROC曲线，结果证明我的训练数据的最佳阈值大约为0.25。我假设创建预测时的默认阈值是0.5。如何进行10倍交叉验证时，如何更改此默认设置以了解模型的准确度？基本上，我希望我的模型能够为大于0.25但不是0.5的任何人预测“1”。我一直在查看所有文档，我似乎无法到达任何地方。

提前感谢您的帮助。

Answer 1

我想给出一个实际的答案

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score

X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)
THRESHOLD = 0.25
preds = np.where(clf.predict_proba(X_test)[:,1] > THRESHOLD, 1, 0)

pd.DataFrame(data=[accuracy_score(y_test, preds), recall_score(y_test, preds),
                   precision_score(y_test, preds), roc_auc_score(y_test, preds)], 
             index=["accuracy", "recall", "precision", "roc_auc_score"])

通过将THRESHOLD更改为0.25，可以发现recall和precision的分数正在降低。但是，通过删除class_weight参数，accuracy会增加，但recall分数会下降。请参阅@接受的答案

Answer 2

您可以更改阈值，但该阈值为0.5，以使计算正确。如果您有一个不平衡的集合，则分类如下图所示。

您会看到类别1的预期非常差。 1类占人口的2％。在将结果变量平衡为50％至50％（使用过采样）后，0.5阈值到达了图表的中心。

Answer 3

特殊情况：一维逻辑回归

使用以下公式来计算将样本X标记为1和标记为0的区域的值。

from scipy.special import logit
thresh = 0.1
val = (logit(thresh)-clf.intercept_)/clf.coef_[0]

因此，可以使用以下更直接的方法来计算预测

preds = np.where(X>val, 1, 0)

Answer 4

出于完整性考虑，我想提一下另一种基于scikit概率计算using binarize优雅地生成预测的方法：

import numpy as np
from sklearn.preprocessing import binarize

THRESHOLD = 0.25

# This probabilities would come from logistic_regression.predict_proba()
y_logistic_prob =  np.random.uniform(size=10)

predictions = binarize(y_logistic_prob.reshape(-1, 1), THRESHOLD).ravel()

此外，我同意the considerations that Andreus makes，特别是2和3。请务必注意它们。

Answer 5

def find_best_threshold(threshould, fpr, tpr):
   t = threshould[np.argmax(tpr*(1-fpr))]
   # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
   print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
   return t

如果您想要找到最佳的真实正值率和nagatuve率，可以使用此功能

Answer 6

就我的算法而言还可以：

threshold = 0.1
LR_Grid_ytest_THR = ((model.predict_proba(Xtest)[:, 1])>= threshold).astype(int)

和：

print('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print('Valuation for test data only  (new_threshold):')
    print(classification_report(ytest, LR_Grid_ytest_THR))

sklearn LogisticRegression并更改分类的默认阈值

6 个答案: