How do I compute the final classification accuracy, precision, recall, F1-score and confusion matrix from one-vs-one binary classifiers?

Asked: 2017-09-16 05:47:26

Tags: python scikit-learn classification svm confusion-matrix

Consider a 3-class dataset, e.g. the Iris data.

Suppose we want to classify this multiclass data with binary SVM classifiers using Python's sklearn. We then have the following three binary classification problems: {class1, class2}, {class1, class3}, {class2, class3}.

For each of these problems we can obtain a classification accuracy, precision, recall, F1-score and a 2x2 confusion matrix.

I have the following questions:

  1. How do I combine the results of these 3 binary classifiers to obtain the equivalent of a multiclass classifier, i.e. how do I get the final classification accuracy, precision, recall and F1-score, and a 3x3 confusion matrix, from the 3 accuracies, precisions, recalls, F1-scores and 2x2 confusion matrices?

  2. Suppose we get accuracies of 80%, 90% and 70% for the 3 class combinations. Should I take the final accuracy to be accuracy.mean() +/- accuracy.std(), and do the same for the other metrics?

  3. Or should I first build the final 3x3 confusion matrix and, from that matrix, compute the accuracy, precision, recall and F1-score?

  4. How does multiclass classification work internally? Does it use the strategy from question 3? I am not interested in applying multiclass classification directly, only in using binary classifiers and obtaining results equivalent to multiclass classification.

  5. Now, suppose we also want to perform kFold cross-validation with the above 3 binary classifiers. Then for each fold we have an accuracy, precision, recall, F1-score and a 2x2 confusion matrix. In this case I can take the average accuracy as accuracy.mean() +/- accuracy.std().

    Moreover, with kFold cross-validation I can obtain an aggregated confusion matrix for each binary classification problem by summing the 2x2 confusion matrices over the folds. From this aggregated confusion matrix I can also compute the average accuracy, precision, etc. over the kFolds for each binary classifier. However, the results differ slightly from accuracy.mean() +/- accuracy.std() over the kFolds. I think the latter is more reliable.

    1. How do I use kFold cross-validation for each binary classification problem and obtain the final accuracy, precision, recall, F1-score and 3x3 confusion matrix?
    2. I would greatly appreciate it if someone could answer the questions above with an implementation.

      Below is a minimal working example. Note that part of it is pseudocode for loading the data and splitting it into train/test sets:

      import pandas as pd
      import numpy as np
      from sklearn.model_selection import KFold
      from sklearn import svm, datasets
      from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
      import time
      import os
      
      tic = time.time()    # time.clock() was removed in Python 3.8
      # Import data
      iris = datasets.load_iris()
      X = iris.data
      Y = iris.target
      
      # Now, suppose we have three separate sets {data1, target1}, {data2, target2}, {data3, target3}
      # for binary classification.
      
      #dataset = [{data1 + data2, target1 + target2}, {data1 + data3, target1 + target3}, {data2 + data3, target2 + target3}]
      
      for d in dataset:
      
          # Import any pair, say, {data1 + data2, target1 + target2}. We import the 3 pairs
          # one-by-one for the 3 different binary classification problems.
      
          #data = data1 + data2
          #label = target1 + target2
      
          K = 10    # Number of folds
          kf = KFold(n_splits=K, random_state=None, shuffle=False)    # build the splitter once, not once per fold
      
          accuracies, precisions, recalls, f1s = [], [], [], []
          aggregated_CM = np.zeros((2, 2), dtype=int)    # summed over the folds
      
          for trainIndex, testIndex in kf.split(data):    # split the current pair's data, not data1
              trainData, testData = data.iloc[trainIndex], data.iloc[testIndex]
              trainLabel, testLabel = label.iloc[trainIndex], label.iloc[testIndex]    # fixed typo: data_labe
      
              clf = svm.SVC(kernel='rbf')
              clf.fit(trainData, trainLabel)
              predictedLabel = clf.predict(testData)
      
              # Collect per-fold scores; mean/std are only meaningful across folds
              accuracies.append(accuracy_score(testLabel, predictedLabel))
              precisions.append(precision_score(testLabel, predictedLabel, average="macro"))
              recalls.append(recall_score(testLabel, predictedLabel, average="macro"))
              f1s.append(f1_score(testLabel, predictedLabel, average="macro"))
      
              aggregated_CM += confusion_matrix(testLabel, predictedLabel)
      
          # Report once per binary problem, after the fold loop
          print('Average Accuracy: %0.2f +/- (%0.1f) %%' % (np.mean(accuracies)*100, np.std(accuracies)*100))
          print('Average Precision: %0.2f +/- (%0.1f) %%' % (np.mean(precisions)*100, np.std(precisions)*100))
          print('Average Recall: %0.2f +/- (%0.1f) %%' % (np.mean(recalls)*100, np.std(recalls)*100))
          print('Average F1-Score: %0.2f +/- (%0.1f) %%' % (np.mean(f1s)*100, np.std(f1s)*100))
          print('Aggregated confusion matrix:\n', aggregated_CM)
      
          print('-------------------------------------------------------------------------------')
      
      toc = time.time()
      print("Total time to run the complete code = ", toc - tic)
      
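Regarding questions 1, 3 and 4: sklearn's `SVC` already uses a one-vs-one strategy internally, and its voting can be reproduced by hand. The sketch below is my own illustration, not from the question; it assumes a single stratified train/test split instead of the pseudocode pairs above, trains one binary SVM per class pair, lets each cast a vote, and derives the 3x3 confusion matrix and the final metrics from the voted predictions. Note that on tied votes `np.argmax` picks the lowest class index, whereas sklearn breaks ties with decision values, so individual predictions can differ on tied samples.

```python
import numpy as np
from itertools import combinations
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

classes = np.unique(y_train)                       # [0, 1, 2] for Iris
votes = np.zeros((len(X_test), len(classes)), dtype=int)

# One binary SVM per class pair: (0,1), (0,2), (1,2)
for a, b in combinations(classes, 2):
    mask = np.isin(y_train, [a, b])
    clf = svm.SVC(kernel='rbf').fit(X_train[mask], y_train[mask])
    pred = clf.predict(X_test)
    for cls in (a, b):
        votes[:, cls] += (pred == cls)             # each classifier casts one vote per sample

# Majority vote across the 3 binary classifiers gives the multiclass prediction
y_pred = classes[np.argmax(votes, axis=1)]

print(confusion_matrix(y_test, y_pred))            # the final 3x3 matrix
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average='macro'))
print(recall_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='macro'))
```

The final metrics here come from the voted predictions (equivalently, from the 3x3 matrix), not from averaging the three binary scores: averaging per-pair accuracies double-counts each class and does not correspond to any single multiclass prediction.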

0 Answers:

There are no answers yet.