Question

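I have a training set of 340 image samples. After training an SVM in scikit-learn, is it possible that I made a mistake with train_test_split()? It used only 84 samples and returned these metrics: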

Classification report for classifier SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel=rbf, probability=False, shrinking=True, tol=0.001,
  verbose=False):
             precision    recall  f1-score   support

          1       0.60      0.64      0.62        14
          2       0.92      1.00      0.96        12
          3       1.00      1.00      1.00        10
          4       0.30      0.33      0.32         9
          5       0.67      0.80      0.73         5
          6       0.78      0.78      0.78         9
          7       0.64      0.69      0.67        13
          8       1.00      0.62      0.76        13

avg / total       0.75      0.73      0.73        85


Confusion matrix:
[[ 9  1  0  0  0  1  3  0]
 [ 0 12  0  0  0  0  0  0]
 [ 0  0 10  0  0  0  0  0]
 [ 4  0  0  3  0  0  2  0]
 [ 0  0  0  1  4  0  0  0]
 [ 0  0  0  2  0  7  0  0]
 [ 0  0  0  4  0  0  9  0]
 [ 2  0  0  0  2  1  0  8]]

Using all 340 samples I got these metrics:

Classification report for classifier SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel=rbf, probability=True, shrinking=True, tol=0.001,
  verbose=False):
             precision    recall  f1-score   support

          1       0.56      0.95      0.71        37
          2       1.00      0.97      0.99        36
          3       1.00      1.00      1.00        21
          4       0.97      0.80      0.88        41
          5       0.83      0.95      0.89        21
          6       0.88      0.88      0.88        48
          7       0.98      0.81      0.89        73
          8       0.97      0.78      0.87        37

avg / total       0.91      0.87      0.88       314


Confusion matrix:
[[35  0  0  0  1  1  0  0]
 [ 1 35  0  0  0  0  0  0]
 [ 0  0 21  0  0  0  0  0]
 [ 5  0  0 33  0  1  1  1]
 [ 0  0  0  0 20  1  0  0]
 [ 6  0  0  0  0 42  0  0]
 [10  0  0  1  3  0 59  0]
 [ 5  0  0  0  0  3  0 29]]

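And in both cases I get wrong class predictions from: print(clf.predict([fv]))

For class 3, which has precision and recall values of 1.00, predict() returns the wrong class for 14 of the 21 samples! It is wrong 66% of the time!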

Here is my code:

import csv
import string 

import numpy as np
from sklearn import svm, metrics
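# sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split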
from sklearn.svm import SVC


features = list()
path = 'imgsingoleDUPLI/'

reader = csv.reader(open('features.csv', 'r'), delimiter='\t')
listatemp = list()
for row in reader:
    r = row[0]

    if (r != ','):
        numb = float(r)
        listatemp.append(numb)
    else:
        features.append(listatemp)
        listatemp = list()



print(len(features))

target = [        1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,                   
          1,1,1,1,2,2,2,
          2,2,2,2,2,2,2,
          2,2,2,2,2,2,2,
          2,2,2,2,2,2,2,
          2,2,2,2,2,2,2,                  
          2,2,2,2,2,3,3,
          3,3,3,3,3,3,3,
          3,3,3,3,3,3,3,                  
          3,3,3,3,3,4,4,
          4,4,4,4,4,4,4,
          4,4,4,4,4,4,4,
          4,4,4,4,4,4,4,
          4,4,4,4,4,4,4,
          4,4,4,4,4,4,4,                  
          4,4,4,4,5,5,5,
          5,5,5,5,5,5,5,
          5,5,5,5,5,5,5,                  
          5,5,5,5,6,6,6,
          6,6,6,6,6,6,6,
          6,6,6,6,6,6,6,
          6,6,6,6,6,6,6,
          6,6,6,6,6,6,6,
          6,6,6,6,6,6,6,
          6,6,6,6,6,6,6,                  
          6,6,6,7,7,7,7,               
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,
          7,7,7,7,7,7,7,                  
          7,7,7,7,7,7,8,
          8,8,8,8,8,8,8,
          8,8,8,8,8,8,8,
          8,8,8,8,8,8,8,
          8,8,8,8,8,8,8,
          8,8,8,8,8,8,8,                  
          8]

X = features
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.25, random_state=42)

C = 1000.0

#clf = svm.SVC(kernel='rbf', C=C).fit(X, y)
#y_predicted = clf.predict(X)
clf = svm.SVC(kernel='rbf', C=C).fit(X_train, y_train)
y_predicted = clf.predict(X_test)

print("Classification report for classifier %s:\n%s\n" % (
    clf, metrics.classification_report(y_test, y_predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_predicted))

# feature vectors taken from class 3 of the training set, for which predict() assigns a different class

fv1 = [0.16666666666634455, 8.0779356694631609e-26, 7.6757837200946069e-22, 1.0, 1.0000000000034106]

fv2 = [0.22222222221979693, 0.012345679011806714, 0.0044444444443150974, 0.13333333333333333, 2.999999999956343]

fv3 = [0.22222222221979693, 0.012345679011806714, 0.0044444444443150974, 0.13333333333333333, 2.999999999956343]

fv4 = [0.16666666666662877, 0.0017361111111079532, 1.6133253119051825e-23, 1.0, 1.6666666666660603]

fv5 = [0.24813735017910915, 0.0088802547101916908, 0.0046856535169676481, 0.4666666666666667, 2.224609846181971]

fv6 = [0.16666666666662877, 0.0017361111111079532, 9.1196662533971301e-23, 1.0, 1.6666666666660603]

print(clf.predict([fv1]))

My features file: https://docs.google.com/file/d/0ByS6Z5WRz-h2VThLMk9VYVh4ZE0/edit?usp=sharing

Answer 1

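train_test_split(X, y, test_size=0.25) randomly takes 25% of the data to make the test set (85 samples in this case) and keeps the remaining 75% (which should be 255 in your case) to make the training set.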
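To make the arithmetic concrete, here is a minimal sketch of the split sizes for 340 samples. The arrays are placeholders (all-zero features and made-up labels), not your actual data, and it assumes a scikit-learn version that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 340 samples with 5 features each, labels 1..8.
# These stand in for the real feature vectors and are NOT the asker's data.
X = np.zeros((340, 5))
y = np.arange(340) % 8 + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

n_train, n_test = len(X_train), len(X_test)
print("train:", n_train, "test:", n_test)
```

The support column of a classification report computed on X_test adds up to the size of the test set, not to the size of the full data set.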

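The classification report shows that your test set contains only 10 samples of class 3, so you cannot have observed "the wrong class returned 14 times out of 21 samples" on that test set (and if you did observe it, that means you were not using the test set for the evaluation).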
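One way to check this is to count how many samples of each class end up on each side of the split. The sketch below uses hypothetical class sizes that sum to 340 (with 21 samples in class 3) and random placeholder features, so the exact counts it prints are illustrative only.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical class sizes for classes 1..8, summing to 340.
# Class 3 is given 21 samples to mirror the number quoted in the question.
class_sizes = [40, 40, 21, 45, 25, 50, 80, 39]
y = np.repeat(np.arange(1, 9), class_sizes)
X = np.random.RandomState(0).rand(len(y), 5)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Per-class sample counts on each side of the split.
train_support = Counter(y_train.tolist())
test_support = Counter(y_test.tolist())
print("class 3 in train:", train_support[3])
print("class 3 in test: ", test_support[3])
```

Any feature vector picked from all 21 class 3 samples is likely to come from the training portion, so predictions on those vectors are not described by a report computed on the test portion.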

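Try changing the value of random_state to generate different random splits, and check whether you always get a precision and recall of 1.0 for class 3 across those splits. To automate this process and compute the mean of the test scores, you can use cross-validation with ShuffleSplit.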
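A rough sketch of that procedure follows. It assumes a scikit-learn version with sklearn.model_selection, and it uses a synthetic 8-class data set from make_classification (labels 0..7) in place of your features, so the scores it prints say nothing about your real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real features: 340 samples, 5 features, 8 classes.
X, y = make_classification(
    n_samples=340, n_features=5, n_informative=4, n_redundant=0,
    n_classes=8, n_clusters_per_class=1, random_state=0)

# Ten independent random 75% / 25% splits.
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Mean test accuracy over the ten splits.
scores = cross_val_score(SVC(kernel="rbf", C=1000.0), X, y, cv=cv)
print("accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))

# Recall for one class (label 3 in this synthetic data) on each split.
class3_recalls = []
for train_idx, test_idx in cv.split(X):
    clf = SVC(kernel="rbf", C=1000.0).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    recall = recall_score(y[test_idx], y_pred, labels=[3], average=None)[0]
    class3_recalls.append(recall)
print("class 3 recall per split:", np.round(class3_recalls, 2))
```

If the per-split recall for the class varies a lot, a single 1.00 on one split is probably a lucky draw on a small test set rather than a property of the classifier.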

Getting wrong class predictions in SVM when precision and recall are 1.00
