Question

I have the following data frame:

new_df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

I am using scikit-learn's SVC to predict the labels 'High', 'Mod' and 'Low'. This is what I do:

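import numpy as np
from sklearn import metrics, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder

# Strip the dashes and treat the bank number as a numeric feature
new_df['BankNum'] = new_df['BankNum'].map(lambda x: x.replace('-', ''))
new_df['BankNum'] = new_df.BankNum.astype(np.float128)

# Encode the string columns as integers
columns = ['BankNum', 'ID']
le = LabelEncoder()
new_df['ID'] = le.fit_transform(new_df.ID)

new_df['Labels'] = le.fit_transform(new_df.Labels)

X_train, X_test, y_train, y_test = train_test_split(new_df[columns], new_df.Labels, test_size=0.2, random_state=42)

clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=42)

# 8-fold cross-validation on the training split
scores = cross_val_score(clf, X_train, y_train, cv=8)
print("Cross Validation Score: ")
print(scores.mean())

clf.fit(X_train, y_train)

# Evaluate on the held-out test split
predicted = clf.predict(X_test)
print("Accuracy: ")
print(np.mean(predicted == y_test))
print(metrics.classification_report(y_test, predicted))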

I have two questions:

1.) For the classification report, I get output like this:

               precision    recall  f1-score   support

          0       0.00      0.00      0.00      4780
          1       0.94      1.00      0.97    104719
          2       0.00      0.00      0.00      1425

avg / total       0.89      0.94      0.92    110924

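Why do labels 0 & 2 get 0.00 precision? Could this be because of class imbalance? There are roughly 80893 High labels, 11798 Mod labels & 279608 Low labels. Or is SVM simply not a good model for this?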
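
One way to test the imbalance idea is to check whether the report is what a model would produce if it predicted label 1 for every test row. The sketch below is plain Python and only reuses the support counts from the report above; `constant_report` is a hypothetical helper written for this illustration, not a scikit-learn function, and the "always predict the largest class" behaviour is an assumption being tested, not something the output proves.

```python
# Per-class test-set counts, copied from the "support" column of the report.
support = {0: 4780, 1: 104719, 2: 1425}
total = sum(support.values())

# Assumption under test: the classifier predicts the largest class every time.
majority = max(support, key=support.get)


def constant_report(support, predicted_class):
    """Precision, recall and F1 per class when every prediction is predicted_class."""
    n = float(sum(support.values()))
    report = {}
    for label, count in support.items():
        if label == predicted_class:
            precision = count / n  # all n rows are predicted as this class
            recall = 1.0           # so every true member is found
        else:
            precision = 0.0        # never predicted; reported as 0.00
            recall = 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report


report = constant_report(support, majority)
for label in sorted(report):
    p, r, f = report[label]
    print("%d  %.2f  %.2f  %.2f  %d" % (label, p, r, f, support[label]))

# Support-weighted averages, comparable to the "avg / total" row.
weighted = [
    sum(report[l][i] * support[l] for l in support) / float(total)
    for i in range(3)
]
print("avg  %.2f  %.2f  %.2f  %d" % (weighted[0], weighted[1], weighted[2], total))
```

If the printed rows match the report, the model is probably ignoring the two smaller classes. In that case, it may be worth trying the `class_weight='balanced'` option of `svm.SVC` and scaling the features before fitting, though whether either helps on this data is untested here.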

2.) I want to get a confidence score for each prediction. I Googled it and found something like this:

p = clf.predict_proba(X_test)
auc = AUC(y_test, p[:, 1])
print("SVM AUC", auc)

But I get this error:

raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

How can I get the confidence of each prediction, and how should I interpret it? Thanks a lot!!
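
For context on the error: `p[:, 1]` holds the probability of class 1 only, while `y_test` contains three classes, which a binary AUC call cannot handle. In scikit-learn, `clf.predict_proba(X_test)` returns one column per class, ordered as in `clf.classes_`, so a per-prediction confidence can be read as the largest value in each row. The sketch below shows that step in plain Python on made-up numbers; `prediction_confidence` is a hypothetical helper, and the probability matrix is illustrative, not output from the model above.

```python
def prediction_confidence(proba, classes):
    """Return a (predicted_label, confidence) pair for each row of proba.

    proba   -- list of rows, one probability per class, each row summing to 1
    classes -- class labels in the same order as the columns of proba
    """
    results = []
    for row in proba:
        # Index of the most probable class in this row
        best = max(range(len(row)), key=lambda j: row[j])
        results.append((classes[best], row[best]))
    return results


# LabelEncoder sorts labels alphabetically, so 0=High, 1=Low, 2=Mod.
classes = ["High", "Low", "Mod"]

# Made-up probabilities for three samples (illustrative only).
proba = [
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
]

results = prediction_confidence(proba, classes)
for label, conf in results:
    print("%s  %.2f" % (label, conf))
```

A value near 1.0 means the model strongly favours that class, while a value near 1/3 (as in the second row) means it is close to guessing among three classes. Two caveats: the probabilities from `SVC(probability=True)` are fitted through an internal calibration step, so they may occasionally disagree with `clf.predict`, and for a multiclass AUC, recent scikit-learn versions accept the full probability matrix via `roc_auc_score(y_test, p, multi_class="ovr")`.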


0 answers: