Question

I am creating a pipeline in scikit-learn:

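from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])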

and computing the accuracy with cross-validation:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         train_set,  # training data
                         label_train,  # training labels
                         cv=5,  # split data into 5 folds: 4 for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )

How can I report a confusion matrix instead of 'accuracy'?

Answer 1

You can use cross_val_predict (see the scikit-learn docs) instead of cross_val_score.

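Instead of:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, x, y, cv=10)

you can do:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_pred = cross_val_predict(clf, x, y, cv=10)
conf_mat = confusion_matrix(y, y_pred)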
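
For reference, here is a minimal sketch of the cross_val_predict approach applied to the pipeline from the question. The messages and the "spam"/"ham" labels are made-up stand-ins for train_set and label_train, and cv=4 is chosen only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train_set / label_train.
train_set = [
    "win a free prize now", "free cash offer", "claim your free prize",
    "cheap offer win cash", "meeting at noon tomorrow", "see you at lunch",
    "project notes attached", "lunch meeting moved to noon",
]
label_train = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("classifier", BernoulliNB()),
])

# Each message is predicted by a model fitted on the folds that exclude it.
y_pred = cross_val_predict(pipeline, train_set, label_train, cv=4)

# Rows are true labels, columns are predicted labels, in the order of `labels`.
labels = ["ham", "spam"]
conf_mat = confusion_matrix(label_train, y_pred, labels=labels)
print(conf_mat)
```

The result is a single matrix pooled over all folds, so its entries sum to the number of samples.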

Answer 2

The short answer is "you cannot".

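You need to understand the difference between cross_val_score and cross-validation as a model selection method. cross_val_score, as the name suggests, works only on scores. A confusion matrix is not a score; it is a summary of what happened during evaluation. A major distinction is that a score is supposed to return an orderable object, in scikit-learn specifically a float. So, based on a score, you can tell whether method b is better than method a simply by checking whether b has the higher score. You cannot do this with a confusion matrix, which, again as the name suggests, is a matrix.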

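If you want confusion matrices for multiple evaluation runs (such as cross-validation), you have to compute them by hand, which is not that bad in scikit-learn: it is only a few lines of code.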

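from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# KFold now lives in sklearn.model_selection and yields indices via split()
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    print(confusion_matrix(y_test, model.predict(X_test)))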

Answer 3

What you can do is define scorers that each return one value from the confusion matrix. See here [link]. Just citing the code:

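from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate

# Indices as given in this answer: they treat the first class label as the positive class.
# scikit-learn's documented binary layout is tn=[0, 0], fp=[0, 1], fn=[1, 0], tp=[1, 1].
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
           'fp': make_scorer(fp), 'fn': make_scorer(fn)}
# svm can be passed unfitted; cross_validate fits a clone on each split
cv_results = cross_validate(svm, X, y, scoring=scoring)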

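This performs cross-validation for each of the four scorers and returns the scoring dictionary cv_results, whose keys test_tp, test_tn, etc. hold the confusion matrix values from each cross-validation split.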

From this you could reconstruct an average confusion matrix, but Xema's cross_val_predict seems more elegant for that.
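
As a rough sketch of that reconstruction, the snippet below builds one scorer per matrix cell and averages each cell over the folds. The helper name cell, the key names c00..c11, the synthetic data from make_classification, and the SVC estimator are all assumptions made for illustration; the cells are addressed by index so no positive-class convention is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC


def cell(i, j):
    """Hypothetical helper: scorer returning entry [i, j] of the confusion matrix."""
    def score(y_true, y_pred):
        # labels=[0, 1] keeps the matrix 2x2 even if a fold misses a class
        return confusion_matrix(y_true, y_pred, labels=[0, 1])[i, j]
    return make_scorer(score)


# Synthetic binary data with labels 0 and 1 (illustrative only).
X, y = make_classification(n_samples=100, random_state=0)
svm = SVC(kernel="linear")

scoring = {f"c{i}{j}": cell(i, j) for i in range(2) for j in range(2)}
cv_results = cross_validate(svm, X, y, cv=5, scoring=scoring)

# Average each cell over the 5 folds and put it back in its matrix position.
mean_cm = np.array([[cv_results[f"test_c{i}{j}"].mean() for j in range(2)]
                    for i in range(2)])
print(mean_cm)
```

With 100 samples and 5 folds, each test fold has 20 samples, so the entries of the averaged matrix sum to 20.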

Note that this does not actually work with cross_val_score; you need cross_validate (introduced in scikit-learn v0.19).

Side note: you could use one of these scorers (i.e. one element of the matrix) for hyperparameter optimization via grid search.
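
A minimal sketch of that side note, assuming binary labels 0/1 with 1 as the positive class (scikit-learn's documented layout, where [1, 0] holds the false negatives). The function name false_negatives, the synthetic data, and the parameter grid are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def false_negatives(y_true, y_pred):
    # Row 1 = true positives class, column 0 = predicted negative class.
    return confusion_matrix(y_true, y_pred, labels=[0, 1])[1, 0]


# greater_is_better=False negates the count, so the search minimizes it.
fn_scorer = make_scorer(false_negatives, greater_is_better=False)

X, y = make_classification(n_samples=120, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, scoring=fn_scorer, cv=3)
search.fit(X, y)

# best_score_ is the negated mean false-negative count per fold.
print(search.best_params_, -search.best_score_)
```

Optimizing a single cell in isolation can favor degenerate models (for instance, predicting everything as positive drives false negatives to zero), so it is probably worth checking the other cells of the selected model as well.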

*Edit: true negatives are returned at [1, 1], not [0, 0].

Answer 4

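I think what you really want is the average of the confusion matrices obtained from each cross-validation run. @lejlot already explained the reason nicely; I will extend his answer by computing the mean of the confusion matrices: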

Compute the confusion matrix in each cross-validation run. You can use something like this:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

conf_matrix_list_of_arrays = []
# KFold now lives in sklearn.model_selection and yields indices via split()
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    conf_matrix = confusion_matrix(y_test, model.predict(X_test))
    conf_matrix_list_of_arrays.append(conf_matrix)

Finally, you can compute the mean of the list of numpy arrays (confusion matrices) with:

mean_of_conf_matrix_arrays = np.mean(conf_matrix_list_of_arrays, axis=0)
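
A self-contained sketch of the same idea follows, with two additions that are my own assumptions rather than part of the answer: labels= is passed to confusion_matrix so every fold yields a matrix of the same shape, and the per-fold matrices are also summed and compared against the pooled matrix from cross_val_predict. The synthetic data and the LogisticRegression model are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5)  # no shuffling, so the folds are reproducible
labels = np.unique(y)

fold_matrices = []
for train_index, test_index in kf.split(X):
    # Fit a fresh copy on the training folds, predict the held-out fold.
    fold_model = clone(model).fit(X[train_index], y[train_index])
    y_pred = fold_model.predict(X[test_index])
    fold_matrices.append(confusion_matrix(y[test_index], y_pred, labels=labels))

mean_cm = np.mean(fold_matrices, axis=0)   # average counts per fold
total_cm = np.sum(fold_matrices, axis=0)   # counts over all samples

# Pooled matrix from out-of-fold predictions using the same folds.
pooled_cm = confusion_matrix(y, cross_val_predict(model, X, y, cv=kf), labels=labels)
print(mean_cm)
print(total_cm)
```

Because the folds and the model are deterministic here, the summed matrix should match the pooled one, and the mean is simply the sum divided by the number of folds.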

Answer 5

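I am new to machine learning. If I understand correctly, the confusion matrix is made up of 4 values: TP, FN, FP and TN. These 4 values cannot be obtained directly from scoring, but they are implied by accuracy, precision and recall.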

So there are 4 unknowns: TP, FN, FP and TN.

Eq1: tp / (tp + fp) = P

Eq2: tp / (tp + fn) = R

Eq3: (tp + tn) / (tp + fn + fp + tn) = A


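If one of the unknowns is assumed to be 1 (here tp = 1), this leaves 3 unknowns and 3 equations, so the values relative to tp can be found by solving the system.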

P, R and A can be obtained from scoring.
cross_validate can compute all 3 of them in one call.

def calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a):
    # tp/(tp+fp)=P, tp/(tp+fn)=R, (tp+tn)/(tp+fn+fp+tn)=A
    fn = (1 / r) - 1
    fp = (1 / p) - 1
    tn = (1 - a - a * fn - a * fp) / (a - 1)
    return fn, fp, tn
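
As a quick sanity check of the function above, the sketch below starts from known counts, derives P, R and A from them, and recovers the counts again. The helper scale_to_counts and the example counts are hypothetical additions; the rescaling assumes the total number of samples is known. The formulas are undefined when A equals 1 or when P or R equals 0, and metrics averaged over folds would only give approximate counts.

```python
def calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a):
    # tp/(tp+fp)=P, tp/(tp+fn)=R, (tp+tn)/(tp+fn+fp+tn)=A, with tp fixed to 1
    fn = (1 / r) - 1
    fp = (1 / p) - 1
    tn = (1 - a - a * fn - a * fp) / (a - 1)
    return fn, fp, tn


def scale_to_counts(r, p, a, n_samples):
    """Hypothetical helper: rescale the tp = 1 solution so the cells sum to n_samples."""
    fn, fp, tn = calculate_confusion_matrix_by_assume_tp_equal_to_1(r, p, a)
    scale = n_samples / (1 + fn + fp + tn)
    # Order of the result: tp, fn, fp, tn
    return tuple(round(v * scale) for v in (1, fn, fp, tn))


# Made-up counts used to derive the three metrics.
tp, fn, fp, tn = 40, 10, 5, 45
n_samples = tp + fn + fp + tn
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / n_samples

recovered = scale_to_counts(recall, precision, accuracy, n_samples)
print(recovered)
```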

Using a confusion matrix as the scoring metric for cross-validation in scikit-learn
