Question

I'm new to pandas and scikit-learn. I've been able to build a simple model with two labels, Bad and Good:

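import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('pandas_model.csv', header=None, names=['label', 'resume'])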
X = df.resume.astype('U').values
y = df.label

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1)

vect = TfidfVectorizer()

vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# transform the test set with the vectorizer fitted on the training data
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

y_pred_class = logreg.predict(X_test_dtm)
score = metrics.accuracy_score(y_test, y_pred_class)
print('LogReg Accuracy Score: %s' % score)
log_reg_cf = metrics.confusion_matrix(y_test, y_pred_class)
print(log_reg_cf)

Confusion matrix:

[[2696  165]
 [ 742  424]]

It seems to be guessing too many of the data points as "No" (742) when they should be "Yes".

I understand that scikit-learn uses .5 as the threshold and makes its decision based on the predict_proba() scores.

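I'm trying to put something together to "test" various thresholds. For example, with .4 instead of .5, some of the data points that are currently False Negatives would instead be correctly guessed as Good.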

logreg.predict_proba(X_test_dtm)

gives me a 2-D array of scores (bad/good):

array([[0.59946085, 0.40053915], ## guessed as bad, but if the threshold was .6, it would be guessed as good. This is what I'm trying to run simulations on
       [0.89679281, 0.10320719],
       [0.328435  , 0.671565  ],
       ...,
       [0.50415322, 0.49584678],
       [0.84380259, 0.15619741],
       [0.85216752, 0.14783248]])

y_test.head() gives me the true values (by the way, what does 5369 represent? The row number?):

5369      Bad
11313     Bad
11899    Good
3856      Bad
1961      Bad

Ideally, I'm trying to run a simulation that does the following for all of the X_train_dtm data:

if X_train_dtm[0] (bad score) > .6 (instead of .5):
    then 
        result = bad
    else
        result = good

and then compare the results against y_test and recompute the accuracy score.

There doesn't seem to be any way in scikit-learn to move the .5 threshold, and it looks like I'd have to do it manually.

Basically, I'm trying to make it "harder" for a data point to be guessed as No.

Hopefully I've phrased the question in a way that makes sense.

I get an error when I try the approach from the question this was marked as a duplicate of:

from sklearn.metrics import precision_recall_curve
probs_y=logreg.predict_proba(X_test_dtm)
precision, recall, thresholds = precision_recall_curve(y_test, probs_y[:, 0])

ValueError: Data is not binary and pos_label is not specified

Answer 1

IIUC, you can do this simply with predict_proba (at least for the binary case):

probabilities = logreg.predict_proba(X_test_dtm)

threshold = 0.4
good = probabilities[:, 1]
predicted_good = good > threshold

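This gives you a boolean prediction for each row: True wherever the probability of Good is greater than the threshold (0.4 here, rather than the default 0.5).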
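To score those boolean predictions against string labels such as Bad/Good, they need to be mapped back to labels first. The sketch below is one way to do that, not the only one. The arrays are small hypothetical stand-ins: `classes` plays the role of `logreg.classes_` (the columns of `predict_proba` follow that order), `probabilities` reuses a few rows from the question, and `y_true` is made up for illustration. The helper `predict_with_threshold` is a name introduced here, not a scikit-learn function, and it assumes exactly two classes.

```python
import numpy as np
from sklearn import metrics

# Hypothetical stand-ins for logreg.classes_, logreg.predict_proba(X_test_dtm) and y_test
classes = np.array(['Bad', 'Good'])
probabilities = np.array([
    [0.59946085, 0.40053915],
    [0.89679281, 0.10320719],
    [0.328435, 0.671565],
    [0.50415322, 0.49584678],
    [0.84380259, 0.15619741],
])
y_true = np.array(['Bad', 'Bad', 'Good', 'Good', 'Bad'])  # made-up true labels


def predict_with_threshold(probabilities, classes, positive_label, threshold):
    """Return string labels, predicting positive_label when its probability exceeds threshold.

    Assumes a binary problem (two columns in probabilities).
    """
    pos_idx = list(classes).index(positive_label)  # column holding P(positive_label)
    negative_label = classes[1 - pos_idx]
    return np.where(probabilities[:, pos_idx] > threshold, positive_label, negative_label)


pred_default = predict_with_threshold(probabilities, classes, 'Good', 0.5)
pred_lower = predict_with_threshold(probabilities, classes, 'Good', 0.4)

print(pred_default)
print(pred_lower)
print(metrics.accuracy_score(y_true, pred_lower))
# Rows are true labels, columns are predicted labels, both in the order given by `labels`
print(metrics.confusion_matrix(y_true, pred_lower, labels=classes))
```

With the real model, `classes` would be `logreg.classes_`, `probabilities` would be `logreg.predict_proba(X_test_dtm)` and `y_true` would be `y_test`.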

You can easily generalize the code above to test any threshold you like with any metric that takes binary predictions.
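As a rough sketch of that generalization, the loop below tries several thresholds and reports precision and recall for the Good class at each one. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF features, so the numbers mean nothing for the real data set; the threshold values are arbitrary choices. It also shows `precision_recall_curve` called with `pos_label='Good'` and the Good column of `predict_proba`, which is likely what the ValueError in the question is asking for, since the labels are strings rather than 0/1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the resume data, with string labels like the question's
X, y_num = make_classification(n_samples=500, n_features=10,
                               weights=[0.7, 0.3], random_state=1)
y = np.where(y_num == 1, 'Good', 'Bad')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

logreg = LogisticRegression().fit(X_train, y_train)

# predict_proba columns follow logreg.classes_, so look up the Good column explicitly
good_col = list(logreg.classes_).index('Good')
p_good = logreg.predict_proba(X_test)[:, good_col]

results = []
for threshold in (0.3, 0.4, 0.5, 0.6):
    pred = np.where(p_good > threshold, 'Good', 'Bad')
    results.append((
        threshold,
        precision_score(y_test, pred, pos_label='Good', zero_division=0),
        recall_score(y_test, pred, pos_label='Good'),
    ))

for threshold, precision_at_t, recall_at_t in results:
    print('threshold=%.1f precision=%.3f recall=%.3f'
          % (threshold, precision_at_t, recall_at_t))

# The full curve over every distinct score; pos_label is needed because labels are strings
precision, recall, thresholds = precision_recall_curve(y_test, p_good, pos_label='Good')
```

Lowering the threshold can only add Good predictions, so recall for Good never drops as the threshold goes down; precision usually moves the other way, which is the trade-off to inspect when picking a value.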

scikit-learn predict_proba: moving the threshold from 0.5 to another value
