I have a training set of 340 image samples. After training an SVM in scikit-learn, is it possible that I made a mistake in train_test_split()? It used only 84 samples and gave me these metrics:
Classification report for classifier SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel=rbf, probability=False, shrinking=True, tol=0.001,
verbose=False):
             precision    recall  f1-score   support
          1       0.60      0.64      0.62        14
          2       0.92      1.00      0.96        12
          3       1.00      1.00      1.00        10
          4       0.30      0.33      0.32         9
          5       0.67      0.80      0.73         5
          6       0.78      0.78      0.78         9
          7       0.64      0.69      0.67        13
          8       1.00      0.62      0.76        13
avg / total       0.75      0.73      0.73        85
Confusion matrix:
[[ 9 1 0 0 0 1 3 0]
[ 0 12 0 0 0 0 0 0]
[ 0 0 10 0 0 0 0 0]
[ 4 0 0 3 0 0 2 0]
[ 0 0 0 1 4 0 0 0]
[ 0 0 0 2 0 7 0 0]
[ 0 0 0 4 0 0 9 0]
[ 2 0 0 0 2 1 0 8]]
Using all 340 samples I get these metrics:
Classification report for classifier SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel=rbf, probability=True, shrinking=True, tol=0.001,
verbose=False):
             precision    recall  f1-score   support
          1       0.56      0.95      0.71        37
          2       1.00      0.97      0.99        36
          3       1.00      1.00      1.00        21
          4       0.97      0.80      0.88        41
          5       0.83      0.95      0.89        21
          6       0.88      0.88      0.88        48
          7       0.98      0.81      0.89        73
          8       0.97      0.78      0.87        37
avg / total       0.91      0.87      0.88       314
Confusion matrix:
[[35 0 0 0 1 1 0 0]
[ 1 35 0 0 0 0 0 0]
[ 0 0 21 0 0 0 0 0]
[ 5 0 0 33 0 1 1 1]
[ 0 0 0 0 20 1 0 0]
[ 6 0 0 0 0 42 0 0]
[10 0 0 1 3 0 59 0]
[ 5 0 0 0 0 3 0 29]]
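In both cases I get wrong class predictions from: print(clf.predict([fv]))
For class 3, which has precision and recall of 1.00, predict() returns the wrong class 14 times out of 21 samples! It is wrong 66% of the time!
This is my code: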
import csv
import string
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
from sklearn.svm import SVC
features = list()
path = 'imgsingoleDUPLI/'
reader = csv.reader(open('features.csv', 'r'), delimiter='\t')
listatemp = list()
for row in reader:
    r = row[0]
    if (r != ','):
        numb = float(r)
        listatemp.append(numb)
    else:
        features.append(listatemp)
        listatemp = list()
print(len(features))
target = [ 1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,1,1,1,
1,1,1,1,2,2,2,
2,2,2,2,2,2,2,
2,2,2,2,2,2,2,
2,2,2,2,2,2,2,
2,2,2,2,2,2,2,
2,2,2,2,2,3,3,
3,3,3,3,3,3,3,
3,3,3,3,3,3,3,
3,3,3,3,3,4,4,
4,4,4,4,4,4,4,
4,4,4,4,4,4,4,
4,4,4,4,4,4,4,
4,4,4,4,4,4,4,
4,4,4,4,4,4,4,
4,4,4,4,5,5,5,
5,5,5,5,5,5,5,
5,5,5,5,5,5,5,
5,5,5,5,6,6,6,
6,6,6,6,6,6,6,
6,6,6,6,6,6,6,
6,6,6,6,6,6,6,
6,6,6,6,6,6,6,
6,6,6,6,6,6,6,
6,6,6,6,6,6,6,
6,6,6,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,7,
7,7,7,7,7,7,8,
8,8,8,8,8,8,8,
8,8,8,8,8,8,8,
8,8,8,8,8,8,8,
8,8,8,8,8,8,8,
8,8,8,8,8,8,8,
8]
X = features
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=42)
C = 1000.0
#clf = svm.SVC(kernel='rbf', C=C).fit(X, y)
#y_predicted = clf.predict(X)
clf = svm.SVC(kernel='rbf', C=C).fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print("Classification report for classifier %s:\n%s\n" % (
    clf, metrics.classification_report(y_test, y_predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_predicted))
# feature vectors taken from class 3 of the training set where predict() assigns a different class
fv1 = [0.16666666666634455, 8.0779356694631609e-26, 7.6757837200946069e-22, 1.0, 1.0000000000034106]
fv2 = [0.22222222221979693, 0.012345679011806714, 0.0044444444443150974, 0.13333333333333333, 2.999999999956343]
fv3 = [0.22222222221979693, 0.012345679011806714, 0.0044444444443150974, 0.13333333333333333, 2.999999999956343]
fv4 = [0.16666666666662877, 0.0017361111111079532, 1.6133253119051825e-23, 1.0, 1.6666666666660603]
fv5 = [0.24813735017910915, 0.0088802547101916908, 0.0046856535169676481, 0.4666666666666667, 2.224609846181971]
fv6 = [0.16666666666662877, 0.0017361111111079532, 9.1196662533971301e-23, 1.0, 1.6666666666660603]
print(clf.predict([fv1]))
My features file: https://docs.google.com/file/d/0ByS6Z5WRz-h2VThLMk9VYVh4ZE0/edit?usp=sharing
Answer 0 (score: 0)
train_test_split(X, y, test_size=0.25)
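randomly takes 25% of the data to form the test set (85 samples in this case) and keeps the remaining 75% (which should be 255 in your case) to form the training set.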
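As a rough check of those sizes, here is a minimal sketch that splits 340 placeholder samples with the same test_size and random_state. The zero-filled features and the cyclic labels are made up for illustration and stand in for the real feature vectors. The import assumes scikit-learn 0.18 or later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 340 samples with 5 features each.
X = np.zeros((340, 5))
# Hypothetical labels cycling through the classes 1..8.
y = np.arange(340) % 8 + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Size of the training part and of the held-out test part.
print(len(X_train), len(X_test))
```

With 340 samples this should print 255 and 85, so a report whose support column sums to 85 is what a correct split produces.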
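The classification report shows that your test set contains only 10 samples of class 3, so you cannot have observed "the wrong class 14 times out of 21 samples" on it (which also means you did not use the test set for that evaluation).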
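One way to verify this is to count how many samples of each class end up in the test split. The sketch below assumes the class sizes equal the support column of the full-data report in the question. The single placeholder feature is invented for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Class sizes copied from the support column of the full-data report (classes 1..8).
y = [1] * 37 + [2] * 36 + [3] * 21 + [4] * 41 + [5] * 21 + [6] * 48 + [7] * 73 + [8] * 37
# Placeholder features (assumption): one dummy value per sample.
X = [[float(i)] for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

all_counts = Counter(y)
test_counts = Counter(y_test)

# For each class: total number of samples vs. number that landed in the test set.
for label in sorted(all_counts):
    print(label, all_counts[label], test_counts[label])
```

If a count of 21 for class 3 shows up in an evaluation, it was computed on the full data (which includes the training samples), not on the held-out split.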
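Try changing the value of random_state to generate different random splits, and check whether you always get precision and recall of 1.0 for class 3 across those splits. To automate this process and compute the mean test score, you can use cross validation with ShuffleSplit.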
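A minimal sketch of that idea, assuming scikit-learn 0.18 or later (where ShuffleSplit and cross_val_score live in sklearn.model_selection). The data comes from make_classification and is a synthetic stand-in for the real features. The choice of 10 splits is arbitrary, and cross_val_score reports accuracy by default, not per-class precision or recall.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 340 samples, 5 features, 8 classes.
X, y = make_classification(n_samples=340, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=8, n_clusters_per_class=1,
                           random_state=0)

# Ten independent random 75% / 25% splits instead of a single one.
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

clf = SVC(kernel='rbf', C=1000.0)

# One accuracy score per split, each computed on that split's held-out 25%.
scores = cross_val_score(clf, X, y, cv=cv)
print("mean accuracy: %.2f (std %.2f)" % (scores.mean(), scores.std()))
```

A large spread between the per-split scores would suggest that a single split, such as the one with random_state=42, is not a reliable estimate on its own.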