Question

I'm trying to train and evaluate predictive models on a dataset I found on Kaggle, but my precision is 0 and I'd like to know what I'm doing wrong.

The code works for a random forest model, but not for an SVM or a neural network.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

#loading dataset
recipes = pd.read_csv('epi_r.csv')

keep_col = ['rating','calories','protein','fat','sodium']
recipes = recipes[keep_col]
recipes = recipes.dropna()

#preprocessing data
bins = (-1, 4, 5)
group_names = ['bad','good']
recipes['rating'] = pd.cut(recipes['rating'].dropna(), bins = bins, labels = group_names)
recipes['rating'].unique()

#bad=0; good=1
label_rating = LabelEncoder()

recipes['rating'] = label_rating.fit_transform(recipes['rating'].astype(str))

#separate dataset as response variable and feature variables
x = recipes.drop('rating', axis=1)
y = recipes['rating']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 42)

#converts the values & levels the playing fields
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
#don't fit again b/c want to use the same fit
x_test = sc.transform(x_test)

clf=svm.SVC()
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)

print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))



              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1465
           1       0.54      1.00      0.70      1708

   micro avg       0.54      0.54      0.54      3173
   macro avg       0.27      0.50      0.35      3173
weighted avg       0.29      0.54      0.38      3173

[[   0 1465]
 [   0 1708]]

/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)

This is the result I get; class 0 is never predicted.

Answer 1

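A recall of 1.0 for class 1, together with a recall of 0.0 for class 0, means your model always predicts "1". You can also see this in the confusion matrix: 1708 samples are correctly predicted as class 1, but the 1465 class 0 samples are also predicted as class 1.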
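The figures in the report can be checked by hand from the confusion matrix. A minimal sketch, using only the matrix printed in the question (the variable names are illustrative):

```python
# Confusion matrix from the question: rows are true classes, columns are predictions.
cm = [[0, 1465],
      [0, 1708]]

tn, fp = cm[0]  # true class 0: predicted as 0, predicted as 1
fn, tp = cm[1]  # true class 1: predicted as 0, predicted as 1

recall_1 = tp / (tp + fn)      # share of true 1s that were found
precision_1 = tp / (tp + fp)   # share of predicted 1s that were right
recall_0 = tn / (tn + fp)      # share of true 0s that were found
predicted_0 = tn + fn          # how many samples were predicted as class 0

print(round(recall_1, 2), round(precision_1, 2), recall_0, predicted_0)
# prints: 1.0 0.54 0.0 0
```

Because no sample is ever predicted as class 0, precision for class 0 would be 0/0, which is why scikit-learn emits the UndefinedMetricWarning and reports 0.00.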

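A model that always predicts a single value is a common problem: it has gotten stuck in some kind of suboptimal solution. You might get lucky by normalizing the input values (so that no single column dominates), by using a different type of model (e.g. a different kernel), or even by choosing a different random seed.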

Answer 2

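You simply haven't found the right parameters. For example, in your case you are overfitting. You should try GridSearchCV to find the best parameters for your dataset (in particular the kernel, C, and gamma).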
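A search over kernel, C, and gamma might look like the sketch below. It is only an illustration: the data is a synthetic stand-in generated with make_classification rather than the Kaggle recipes, and the grid values, the pipeline, and the f1_macro scoring are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four recipe features (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Scaling lives inside the pipeline so it is refit on every CV training fold.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__kernel": ["linear", "rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1],
}

# Macro F1 penalizes a model that ignores one class entirely.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

With the real data, x_train and y_train from the question would replace the synthetic arrays, passed in unscaled since the pipeline does the scaling.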

I played around with your dataset a bit and tried the following change:

clf=SVC(kernel='sigmoid', C=10, verbose=True)
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)
print(pred_clf)
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))

Output:

......
Warning: using -h 0 may be faster
*
optimization finished, #iter = 6651
obj = -196704.694272, rho = 33.691873
nSV = 9068, nBSV = 9068
Total nSV = 9068
[LibSVM][0 1 1 ... 0 1 0]
              precision    recall  f1-score   support

           0       0.49      0.58      0.53      1465
           1       0.58      0.49      0.53      1708

    accuracy                           0.53      3173
   macro avg       0.53      0.53      0.53      3173
weighted avg       0.54      0.53      0.53      3173

[[843 622]
 [864 844]]

The results are not great, but the model no longer predicts 1 for everything.

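To summarize, do the following:

Always use cross-validation to find a good set of parameters for your dataset.
Turn on the estimator's verbose option. It gives you valuable clues about what is going on.
Always try visualizing the data and using simpler algorithms first. For example, I would try to figure out whether the data is linearly separable and try logistic regression before moving on to methods such as SVMs or ensembles, which are always harder to tune.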
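The "simpler algorithms first" advice can be sketched as a cross-validated comparison between a trivial baseline and logistic regression. This is an illustration on synthetic data from make_classification, not the recipe dataset; the DummyClassifier baseline is an addition of this sketch and is not mentioned in the answer.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the recipe features (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

models = {
    # Always predicts the most frequent class, like the SVC in the question did.
    "always majority class": DummyClassifier(strategy="most_frequent"),
    # Simple linear model, scaled inside the pipeline on each training fold.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.2f}")
```

If a tuned SVM or ensemble cannot clearly beat both numbers on the real data, the features probably carry little signal about the rating, and more tuning is unlikely to help.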

Prediction errors when training and evaluating a predictive model