Question

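I am working with support vector machines, using scikit-learn in Python.

I have trained the model, used GridSearch with cross-validation to find the best parameters, and evaluated the best model on a 15% hold-out set.

The final confusion matrix says that I have 0 misclassifications. Later, when I gave the model a handwritten digit, it gave me incorrect predictions (I have not included that code, to keep this question reasonably short).

Because the SVM shows zero error and yet fails to predict correctly later on, I suspect I have built it incorrectly.

My question is:

Am I right to suspect that I have used cross-validation and GridSearch incorrectly? Or have I given GridSearch parameters that are somewhat ridiculous, so that it gives me false results?

Thank you for taking the time and effort to read this.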

Step 1: split the dataset 85% / 15% using the train_test_split function

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.15, random_state=0)

Step 2: apply the GridSearchCV function to the training set to tune the classifier

C_range = 10.0 ** np.arange(-2, 9)
gamma_range = 10.0 ** np.arange(-5, 4)
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedKFold(y=y, n_folds=3)

grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(X, y)

print("The best classifier is: ", grid.best_estimator_)

Here is the output:

('The best classifier is: ', SVC(C=10.0, cache_size=200,
class_weight=None, coef0=0.0, degree=3,
 gamma=0.0001, kernel='rbf', max_iter=-1, probability=False,
 random_state=None, shrinking=True, tol=0.001, verbose=False))

Step 3: finally, evaluate the tuned classifier on the remaining 15% hold-out set.

clf = svm.SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

clf.fit(X_train, y_train)

clf.score(X_test, y_test)
y_pred = clf.predict(X_test)

Here is the output:

             precision    recall  f1-score   support

       -1.0       1.00      1.00      1.00         6
        1.0       1.00      1.00      1.00        30

avg / total       1.00      1.00      1.00        36

Confusion Matrix:
[[ 6  0]
 [ 0 30]]
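
One thing visible in the steps above: the grid search in step 2 is fit on the full X and y rather than on X_train and y_train, so the hold-out samples also take part in parameter selection, and step 3 retypes the parameters by hand (gamma=0.001 where the search printed gamma=0.0001). The following is a minimal sketch of the same three steps with the hold-out kept separate. It assumes a recent scikit-learn, where these helpers live in sklearn.model_selection; the synthetic data and the narrower grid are stand-ins chosen only to keep the example quick, not the asker's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Assumption: synthetic two-class data standing in for the asker's X and y.
X, y = make_classification(n_samples=240, n_features=20, random_state=0)

# Step 1: set the hold-out aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# Step 2: run the search on the training split only.
# Assumption: a narrower grid than in the question, to keep the run short.
param_grid = {
    "C": 10.0 ** np.arange(-2, 4),
    "gamma": 10.0 ** np.arange(-4, 2),
}
cv = StratifiedKFold(n_splits=3)
grid = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=cv)
grid.fit(X_train, y_train)

# Step 3: reuse the refit best estimator instead of retyping its parameters.
best = grid.best_estimator_
y_pred = best.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print("Hold-out confusion matrix:")
print(cm)
```

Because GridSearchCV refits the best parameter combination on everything passed to fit by default, best_estimator_ here has seen only the training split, and the hold-out score is not influenced by the search.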

Answer 1

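You have too little data in your test set (only 6 samples for one of the classes) to be confident in the predictive accuracy of your model. I would recommend labeling at least 150 samples per class and keeping 50 samples per class in a held-out test set to compute the evaluation metrics.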
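
To put a rough number on how little a perfect score on so few samples says, here is a small sketch. It assumes the test samples are independent draws from the same distribution as future inputs, and computes a one-sided upper confidence bound on the true error rate after observing zero errors. The function name is made up for this illustration.

```python
def error_upper_bound(n_samples, confidence=0.95):
    """One-sided upper bound on the true error rate after 0 errors in n_samples.

    If the true error rate is p, the chance of seeing no errors in n_samples
    independent trials is (1 - p) ** n_samples. Setting that equal to
    1 - confidence and solving for p gives the bound.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_samples)


# Sample counts from the report above (6, 30, 36) and the suggested 50.
for n in (6, 30, 36, 50):
    bound = error_upper_bound(n)
    print(f"{n:3d} samples, 0 errors -> true error rate could be up to {bound:.1%}")
```

With 6 samples the bound is roughly 39%, and with all 36 test samples it is still roughly 8%, so zero misclassifications on this hold-out set is compatible with a model that makes a noticeable number of mistakes on new data.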

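Edit: also have a look at the new samples it fails to predict: are the feature values in the same range as the digits in the training and test sets (for example, are they in [0, 255] instead of [0, 1] or [-1, 1])? Do the new digits "look" like the other digits from the test set when you plot them with matplotlib?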
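
The range check can be sketched as follows. The arrays are hypothetical stand-ins (training features in [0, 1], a new digit as raw 8-bit pixel values), and the helper names are invented for this example; the point is only to compare minimum and maximum values and to rescale the new sample when they differ.

```python
import numpy as np


def describe_range(name, data):
    """Print and return the minimum and maximum feature value of an array."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(), data.max()
    print(f"{name}: min={lo:.3f}, max={hi:.3f}")
    return lo, hi


def rescale_to(sample, src_range, dst_range):
    """Linearly map values from src_range onto dst_range."""
    (s_lo, s_hi), (d_lo, d_hi) = src_range, dst_range
    sample = np.asarray(sample, dtype=float)
    return (sample - s_lo) / (s_hi - s_lo) * (d_hi - d_lo) + d_lo


# Assumption: training features were scaled to [0, 1],
# while the new handwritten digit arrives as raw pixels in [0, 255].
rng = np.random.default_rng(0)
X_train = rng.random((100, 64))
new_digit = rng.integers(0, 256, size=64)

describe_range("training set", X_train)
describe_range("new sample", new_digit)

# Bring the new sample into the range the model was trained on.
new_digit_scaled = rescale_to(new_digit, (0, 255), (0, 1))
describe_range("new sample, rescaled", new_digit_scaled)
```

If the two ranges differ, the new sample needs the same preprocessing that was applied to the training data before it is passed to predict.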

scikit-learn: SVM gives me zero error but can't predict
