How do I interpret this triangular ROC AUC curve?

Time: 2015-10-19 07:31:27

Tags: machine-learning scikit-learn roc auc precision-recall

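I have more than 10 features and a large set of cases (100,000 in total, per the class counts below) for training a logistic regression to classify people. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows: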

//////////////////////////////////////////////////////

1= fr
0= non-fr
Class count:
0    69109
1    30891
dtype: int64
Accuracy: 0.95126
Classification report:
             precision    recall  f1-score   support

          0       0.97      0.96      0.96     34547
          1       0.92      0.93      0.92     15453

avg / total       0.95      0.95      0.95     50000

Confusion matrix:
[[33229  1318]
 [ 1119 14334]]
AUC= 0.944717975754

//////////////////////////////////////////////////////

1= en
0= non-en
Class count:
0    76125
1    23875
dtype: int64
Accuracy: 0.7675
Classification report:
             precision    recall  f1-score   support

          0       0.91      0.78      0.84     38245
          1       0.50      0.74      0.60     11755

avg / total       0.81      0.77      0.78     50000

Confusion matrix:
[[29677  8568]
 [ 3057  8698]]
AUC= 0.757955582999

//////////////////////////////////////////////////////

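However, I get some very strange-looking ROC curves: they are triangular instead of the usual jagged, rounded shape. Is there an explanation for why I get this shape? Have I made a mistake somewhere?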


Code:

    all_dict = []
    for i in range(0, len(my_dict)):
        temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
            + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
            + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
            + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
            )
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    # Separate the training and testing data sets
    half_cut = int(len(df)/2.0)*-1
    X_train = newX[:half_cut]
    X_test = newX[half_cut:]
    y_train = y[:half_cut]
    y_test = y[half_cut:]

    # Fitting X and y into model, using training data
    #$$
    lr.fit(X_train, y_train)

    # Making predictions using trained data
    #$$
    y_train_predictions = lr.predict(X_train)
    #$$
    y_test_predictions = lr.predict(X_test)

    #print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
    print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])

    print 'Classification report:'
    print classification_report(y_test, y_test_predictions)
    #print sk_confusion_matrix(y_train, y_train_predictions)
    print 'Confusion matrix:'
    print sk_confusion_matrix(y_test, y_test_predictions)

    #print y_test[1:20]
    #print y_test_predictions[1:20]

    #print y_test[1:10]
    #print np.bincount(y_test)
    #print np.bincount(y_test_predictions)

    # Find and plot AUC
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print 'AUC=',roc_auc

    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

1 answer:

Answer 0 (score: 6)

You are calling it incorrectly. According to the documentation:

y_score : array, shape = [n_samples]

    Target scores, can either be probability estimates of the positive class or confidence values.

So in this line:

roc_curve(y_test, y_test_predictions)

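you should pass to the roc_curve function the result of decision_function (or one of the two columns of the predict_proba result, namely the positive-class column) instead of the actual predictions. Hard 0/1 predictions give roc_curve only one informative threshold, so the curve has just three points, (0, 0), a single (FPR, TPR) point, and (1, 1), which is why it looks like a triangle.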
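A minimal sketch of the fix, in Python 3 syntax (the question's code is Python 2). The data here comes from make_classification as a hypothetical stand-in for the asker's features, and the variable names y_test_scores, fpr_hard and tpr_hard are illustrative, not from the question. It contrasts the curve built from lr.predict with the one built from lr.predict_proba:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Synthetic stand-in data (assumption: NOT the asker's dataset).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
half = len(y) // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# What the question does: hard 0/1 labels as the score.
y_test_predictions = lr.predict(X_test)
fpr_hard, tpr_hard, _ = roc_curve(y_test, y_test_predictions)

# What roc_curve expects: a continuous score for the positive class.
# Column 1 of predict_proba corresponds to lr.classes_[1], i.e. class 1 here.
y_test_scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_scores)

print('points on hard-label curve :', len(fpr_hard))
print('points on score-based curve:', len(fpr))
print('AUC (hard labels) =', auc(fpr_hard, tpr_hard))
print('AUC (scores)      =', auc(fpr, tpr))
```

For a binary problem, lr.decision_function(X_test) could be used in place of the predict_proba column; both rank the samples the same way for logistic regression, so the resulting ROC curve should be the same.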

See these examples: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
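As a sanity check on this diagnosis: when the ROC curve consists only of (0, 0), (FPR, TPR) and (1, 1), the trapezoidal area under it reduces to (TPR + TNR) / 2. The sketch below (the helper name three_point_auc is hypothetical) recomputes that quantity from the two confusion matrices reported in the question, assuming scikit-learn's layout of rows as true classes and columns as predicted classes:

```python
import numpy as np

def three_point_auc(cm):
    """Area under the curve (0,0) -> (FPR,TPR) -> (1,1) for a 2x2 confusion matrix.

    Layout assumed: [[TN, FP], [FN, TP]] (rows = true class, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Two trapezoids: 0.5*fpr*tpr + 0.5*(1 - fpr)*(tpr + 1) = (tpr + 1 - fpr) / 2
    return 0.5 * (tpr + (1.0 - fpr))

# Confusion matrices copied from the output in the question.
cm_fr = np.array([[33229, 1318], [1119, 14334]])
cm_en = np.array([[29677, 8568], [3057, 8698]])

print('fr:', three_point_auc(cm_fr))
print('en:', three_point_auc(cm_en))
```

Both values should agree with the AUC figures printed in the question (about 0.9447 and 0.7580), which is consistent with those numbers having been computed from hard predictions rather than from scores.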