Question

I am building a banana detector with an SVM classifier. I have 358 image samples for training, and I split them into training and test sets with test_size=0.2 and random_state=42.

This is what my dataset looks like:

I labelled each image with 0 or 1 as a filename suffix. However, classification_report(...) always returns:

Accuracy: 0.7352941176470589
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
              precision    recall  f1-score   support

           0       0.74      1.00      0.85        50
           1       0.00      0.00      0.00        18

    accuracy                           0.74        68
   macro avg       0.37      0.50      0.42        68
weighted avg       0.54      0.74      0.62        68

Class 1 always has 0.00 for precision, recall and f1-score in the report.

My full source code:

import os
import zipfile
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import cv2

zip_ref = zipfile.ZipFile("dataset.zip", "r")
zip_ref.extractall()
zip_ref.close()

path = "bananas_dataset"
img_files = [(os.path.join(root, name))
    for root, dirs, files in os.walk(path)
    for name in files if name.endswith((".jpg"))]

winSize = (32, 32)
blockSize = (16, 16)
blockStride = (8, 8)
cellSize = (8, 8)
nbins = 9
derivAperture = 1
winSigma = -1.
histogramNormType = 0
L2HysThreshold = 0.2
gammaCorrection = 1
nlevels = 64
useSignedGradients = True

hog = cv2.HOGDescriptor(winSize, blockSize, blockStride,
    cellSize, nbins, derivAperture, winSigma, histogramNormType,
    L2HysThreshold, gammaCorrection, nlevels, useSignedGradients)

features = np.zeros((1, 324), np.float32)
labels = np.zeros(1, np.int64)
for i in img_files:
    img = cv2.imread(i)
    resized_img = cv2.resize(img, winSize)
    descriptor = np.transpose(hog.compute(resized_img))
    features = np.vstack((features, descriptor))
    labels = np.vstack((labels, int(i[-5])))

features = np.delete(features, (0), axis=0)
labels = np.delete(labels, (0), axis=0).ravel()

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.2,
                                                    random_state=42)
print("X_train: {}, y_train: {}".format(X_train.shape, y_train.shape))
print("X_test: {}, y_test: {}".format(X_test.shape, y_test.shape))

clf = svm.SVC()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))

print("Classification report:")
print(classification_report(y_test, y_pred))
joblib.dump(clf, "banana_hog_svm_clf.pkl")

As a result, my prediction step always returns class 0. Why does this happen?

Answer 1

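This can happen because the labels are imbalanced. For example, if 10% of the labels belong to class 1 and 90% belong to class 2, the SVM can reach 90% accuracy with a model that predicts class 2 for everything.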
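
The same arithmetic can be checked against the numbers in the question. The sketch below is only an illustration: it takes the support counts from the classification report (50 samples of class 0, 18 of class 1) and assumes a model that always predicts class 0. The variable names are made up for this example.

```python
# Support counts copied from the classification report in the question.
n_class0, n_class1 = 50, 18

# True labels of the test split, and a model that always answers "class 0".
y_true = [0] * n_class0 + [1] * n_class1
y_pred = [0] * len(y_true)

# Accuracy is the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy of the always-0 model: {:.4f}".format(accuracy))
```

The result is 50/68, which rounds to the 0.74 shown in the report, so the reported accuracy is consistent with a classifier that never predicts class 1.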

It would help to check how the class labels are distributed.
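
One way to do that check is sketched below with NumPy. The helper name `label_distribution` is hypothetical, and the `labels` array here is a stand-in built from the test-split counts in the report (50 and 18); in the question's script you would pass the real `labels`, `y_train` or `y_test` array instead.

```python
import numpy as np


def label_distribution(labels):
    """Return {class: (count, fraction)} for an array of integer labels."""
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    total = counts.sum()
    return {int(c): (int(n), float(n / total)) for c, n in zip(classes, counts)}


# Stand-in data (assumption): the test-split counts from the report.
labels = np.array([0] * 50 + [1] * 18)

for cls, (count, frac) in label_distribution(labels).items():
    print("class {}: {} samples ({:.1%})".format(cls, count, frac))
```

If the split turns out to be skewed, scikit-learn offers `stratify=labels` in `train_test_split` to keep the class ratio equal across the train and test sets, and `class_weight="balanced"` in `svm.SVC` to penalise mistakes on the minority class more heavily. Whether either option fixes this particular model would need to be tested on the actual data.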
