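I am trying to train and evaluate a predictive model on a dataset I found on Kaggle, but I get a precision of 0 and I would like to know what I am doing wrong.
The code works for the random forest model, but not for the SVM or the neural network.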
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline
#loading dataset
recipes = pd.read_csv('epi_r.csv')
keep_col = ['rating','calories','protein','fat','sodium']
recipes = recipes[keep_col]
recipes = recipes.dropna()
#preprocessing data
bins = (-1, 4, 5)
group_names = ['bad','good']
recipes['rating'] = pd.cut(recipes['rating'].dropna(), bins = bins, labels = group_names)
recipes['rating'].unique()
#bad=0; good=1
label_rating = LabelEncoder()
recipes['rating'] = label_rating.fit_transform(recipes['rating'].astype(str))
#separate dataset as response variable and feature variables
x = recipes.drop('rating', axis=1)
y = recipes['rating']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 42)
#converts the values & levels the playing fields
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
#don't fit again b/c want to use the same fit
x_test = sc.transform(x_test)
clf=svm.SVC()
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1465
           1       0.54      1.00      0.70      1708

   micro avg       0.54      0.54      0.54      3173
   macro avg       0.27      0.50      0.35      3173
weighted avg       0.29      0.54      0.38      3173

[[   0 1465]
 [   0 1708]]
/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
This is the result I get: nothing is predicted correctly for class 0.
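Answer 0 (score: 2)
The recall for class 1 is 1.0, which means your model always predicts "1". You can also see this in the confusion matrix: 1708 samples of class 1 were predicted correctly, but all 1465 samples of class 0 were also predicted as class 1.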
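To make that reading concrete, here is a small sketch that recomputes precision and recall straight from the confusion matrix in the question. The helper name `per_class_metrics` is my own and is not part of scikit-learn; I set precision to 0.0 when nothing is predicted for a class, which matches what the warning in the question says scikit-learn does.

```python
# Confusion matrix from the question: rows are true labels, columns are predictions.
cm = [[0, 1465],
      [0, 1708]]


def per_class_metrics(cm, cls):
    """Return (precision, recall) for class index `cls` (hypothetical helper)."""
    true_pos = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    actual = sum(cm[cls])                    # row sum: everything that truly is cls
    # With no predictions for cls, precision is undefined; report 0.0 instead.
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


for cls in (0, 1):
    p, r = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f}")
```

The first column of the matrix is all zeros, so class 0 is never predicted at all, and that is where the 0.00 precision comes from.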
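A model that always predicts a single value is a common problem: it has gotten stuck in some suboptimal solution. You might have luck normalizing your input values (so that no single column dominates), using a different type of model (e.g. a different kernel), or even just picking a different random seed.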
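One quick way to check whether a given kernel collapses to a single class is to count the predicted labels. This is only a sketch: it uses synthetic data from `make_classification` as a stand-in, because I am assuming `epi_r.csv` is not at hand, so the counts it prints say nothing about the recipe dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four recipe features (assumption, not the Kaggle data).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

prediction_counts = {}
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    # Scaling lives inside the pipeline, so it is fit on the training split only.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Number of test samples predicted as class 0 and class 1.
    prediction_counts[kernel] = np.bincount(pred, minlength=2)
    print(kernel, prediction_counts[kernel])
```

If one of the two counts is 0 for a kernel, that model has the same problem as the one in the question.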
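Answer 1 (score: 1)
You simply haven't found the right parameters. For example, in your case you are overfitting. You should try GridSearchCV to find the best parameters for your dataset (especially the kernel, C and gamma).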
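A minimal sketch of such a search is below. The grid values, the `f1_macro` scoring and the 3-fold setting are my own picks for illustration rather than tuned recommendations, and the data is a synthetic stand-in generated with `make_classification`, not the Kaggle file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the recipe features (assumption, not the Kaggle data).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Keeping the scaler in the pipeline means it is re-fit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Illustrative grid over the three parameters named above.
param_grid = {
    "svc__kernel": ["rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

# Macro F1 penalises a model that ignores one class more than plain accuracy does.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # macro F1 of the best model on the held-out split
```

On the real data you would pass your own `x_train` and `y_train` before scaling, since the pipeline takes care of the `StandardScaler` step.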
I experimented a bit with your dataset and tried the following change:
clf=SVC(kernel='sigmoid', C=10, verbose=True)
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)
print(pred_clf)
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))
Output:
......
Warning: using -h 0 may be faster
*
optimization finished, #iter = 6651
obj = -196704.694272, rho = 33.691873
nSV = 9068, nBSV = 9068
Total nSV = 9068
[LibSVM][0 1 1 ... 0 1 0]
              precision    recall  f1-score   support

           0       0.49      0.58      0.53      1465
           1       0.58      0.49      0.53      1708

    accuracy                           0.53      3173
   macro avg       0.53      0.53      0.53      3173
weighted avg       0.54      0.53      0.53      3173

[[843 622]
 [864 844]]
The results are not great, but at least the model no longer predicts the same class for everything.
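For a sense of how far "not great" is, here is a quick sketch comparing the accuracy in the confusion matrix above with what you would get by always guessing the larger class. It only uses the numbers already printed, and the variable names are mine.

```python
# Confusion matrix from the sigmoid-kernel run above (rows: true, columns: predicted).
cm = [[843, 622],
      [864, 844]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Accuracy of a trivial model that always predicts the most frequent true class.
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy          = {accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

The accuracy comes out slightly below the majority baseline, so these particular parameters are better treated as a starting point for a wider search than as a final model.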
To sum up, do the following: