I have the following DataFrame:
new_df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
I am using scikit-learn's SVC to predict the labels 'High', 'Mod', and 'Low'. This is what I do:
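import numpy as np
from sklearn import metrics, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

# Strip the dash and turn the bank number into a numeric feature
# (np.float128 is platform-dependent and not available everywhere)
new_df['BankNum'] = new_df['BankNum'].map(lambda x: x.replace('-',''))
new_df['BankNum'] = new_df.BankNum.astype(np.float128)
columns =['BankNum', 'ID']
# The same encoder is refit for each column, so afterwards it only remembers the Labels classes
le = LabelEncoder()
new_df['ID'] = le.fit_transform(new_df.ID)
new_df['Labels'] = le.fit_transform(new_df.Labels)
X_train, X_test, y_train, y_test = train_test_split(new_df[columns], new_df.Labels, test_size=0.2, random_state=42)
clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=42)
# 8-fold cross-validation on the training split
scores = cross_val_score(clf, X_train, y_train, cv=8)
print("Cross Validation Score: ")
print(scores.mean())
# Fit on the full training split and evaluate on the held-out 20%
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("Accuracy: ")
print(np.mean(predicted == y_test))
print(metrics.classification_report(y_test, predicted))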
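To read the integer classes in the report, it helps to know how the labels were encoded. The sketch below is only an illustration on the six sample rows shown at the top. It assumes one encoder per column (the names `id_encoder` and `label_encoder` are made up for this example), so that the label encoding can still be inverted afterwards. `LabelEncoder` sorts its classes, so here 0 = High, 1 = Low, 2 = Mod.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The six sample rows from the question
new_df = pd.DataFrame({
    "BankNum": ["0098-7772", "0098-7772", "0098-7772",
                "0870-7771", "0870-7771", "0098-2123"],
    "ID": ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"],
    "Labels": ["High", "High", "High", "Mod", "Mod", "Low"],
})

# Hypothetical names: one encoder per column, so neither overwrites the other
id_encoder = LabelEncoder()
label_encoder = LabelEncoder()

new_df["ID"] = id_encoder.fit_transform(new_df["ID"])
new_df["Labels"] = label_encoder.fit_transform(new_df["Labels"])

# classes_ is sorted, and its index is the encoded integer
label_names = list(label_encoder.classes_)

# Map encoded predictions back to the original label strings
decoded = label_encoder.inverse_transform([0, 1, 2])
print(label_names, list(decoded))
```

With this mapping, class 1 in the classification report corresponds to 'Low'.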
I have two questions:
1.) For the classification report, I get output like this:
             precision    recall  f1-score   support
          0       0.00      0.00      0.00      4780
          1       0.94      1.00      0.97    104719
          2       0.00      0.00      0.00      1425
avg / total       0.89      0.94      0.92    110924
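Why do labels 0 and 2 get 0.00 precision? Could this be due to class imbalance? There are roughly 80893 High labels, 11798 Mod labels, and 279608 Low labels. Or is SVM simply not a good model for this?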
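One way to check what the report is saying: the sketch below rebuilds a test set with the same supports as in the report (4780, 104719, 1425) and scores a hypothetical classifier that always predicts class 1. It does not involve the fitted model at all, it only uses the support counts. If the numbers match the report, the SVC is effectively predicting the majority class for every test sample.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Supports taken from the classification report in the question
support = {0: 4780, 1: 104719, 2: 1425}

# True labels with exactly those class counts
y_true = np.repeat(list(support.keys()), list(support.values()))

# Hypothetical classifier: always predicts the majority class (1)
y_pred = np.full_like(y_true, 1)

# zero_division=0 reports 0.0 for classes that are never predicted
precision, recall, f1, counts = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0
)

for cls in (0, 1, 2):
    print(cls, round(precision[cls], 2), round(recall[cls], 2),
          round(f1[cls], 2), counts[cls])
```

The per-class values come out as 0.00 / 0.00 / 0.00 for classes 0 and 2 and 0.94 / 1.00 / 0.97 for class 1, which is what the report shows. Class imbalance is therefore a plausible cause. Two things that may be worth trying, offered as suggestions rather than a confirmed fix, are `class_weight="balanced"` on the SVC and scaling the features (for example with `StandardScaler`), since an RBF kernel is sensitive to the scale of large raw values such as BankNum.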
2.) I want to get a confidence score for each prediction. I searched on Google and found something like this:
# AUC is assumed here to be sklearn.metrics.roc_auc_score imported under that name
p = clf.predict_proba( X_test )
auc = AUC(y_test, p[:,1] )
print("SVM AUC", auc)
But I get this error:
raise ValueError("{0} format is not supported".format(y_type
ValueError: multiclass format is not supported
How can I get a confidence score for each prediction, and how should I interpret it? Thanks a lot!
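For context, here is a minimal sketch of what per-prediction confidence could look like in the multiclass case. It runs on synthetic three-class data from `make_classification`, not on the data above, and the SVC settings are placeholders. `predict_proba` returns one column per class in the order of `clf.classes_`, so the row maximum can serve as a confidence score. The multiclass AUC uses the `multi_class="ovr"` option of `roc_auc_score`, which needs the full probability matrix rather than a single column and is only available in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data, standing in for the real features (assumption)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# probability=True enables predict_proba (fitted with internal cross-validation)
clf = SVC(gamma="scale", C=1.0, probability=True, random_state=42)
clf.fit(X_train, y_train)

# One row per sample, one column per class, columns ordered as clf.classes_
proba = clf.predict_proba(X_test)

# Confidence = probability of the most likely class for each sample
confidence = proba.max(axis=1)
most_probable = clf.classes_[proba.argmax(axis=1)]

# Multiclass AUC: pass the whole matrix and choose one-vs-rest averaging
auc = roc_auc_score(y_test, proba, multi_class="ovr")

for label, conf in list(zip(most_probable, confidence))[:5]:
    print(label, round(conf, 3))
print("SVM AUC (one-vs-rest):", auc)
```

A confidence close to 1 means the model puts almost all probability mass on one class, while a value near 1/3 means it barely separates the three. Note that for SVC the class with the highest `predict_proba` value can occasionally differ from what `predict` returns, because the probabilities come from a separate calibration step.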