Question

更新我随机地和如@Paul建议的那样独立地改组数据，我的分类器现在具有随机性能。

我有一个不平衡数据集，包含大约200,000个实例和50个预测变量。 不平衡与负类（即0级）的比率为4：1 。换句话说，负类产生约80％的样品，而阳性只产生20％的样品。

这是一个二元分类问题，其中我有一个0＆＃39; s和1＆＃39>的目标向量。

我一直在尝试使用逻辑回归和随机森林等几种分类器。

我使用来自Python skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)的交叉验证roc_curve和ROC sklearn v.018来评估它们

我的问题

每个验证折叠的我的ROC曲线几乎相同，我不知道为什么。 AUC是完全相同的，并且总是非常好（0.9）。虽然精确回忆曲线显示更差AUC=0.74（我认为它更准确）。

我尝试使用交叉验证的ROC示例：http://lijiancheng0614.github.io/scikit-learn/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py

ROC曲线Logistic回归

ROC曲线置信区间Logistic回归[放大]

精确召回曲线

问题：为什么模型的性能在每个折叠上看起来都相似？ AUC不应该至少略有差异吗？

以下代码

X, y = shuffle(X, y, random_state=0)
clasifier = linear_model.LogisticRegression(class_weight = "balanced")
clasifier.fit(X,y)
fig, ax1 = plt.subplots(figsize=(12, 8))
    mean_tpr = 0.0
    mean_fpr = linspace(0, 1, 100)

    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=999)

    for i, (train_index, test_index) in enumerate(skf.split(X,y)):
        # calculate the probability of each class assuming it to be positive
        probas_ = classifier.fit(X[train_index], y[train_index]).predict_proba(X[test_index])
        # Compute ROC curve and area under the curve
        fpr, tpr, thresholds = roc_curve(y[test_index], probas_[:, 1], pos_label=1)
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)

        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i+1, roc_auc))

    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Random', lw=2)


    mean_tpr /= n_folds
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)

    plt.plot(mean_fpr, mean_tpr, 'k--',
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=3)
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate (1- specificity)', fontsize=18)
    plt.ylabel('True Positive Rate (sensitivity)', fontsize=18)

交叉验证ROC曲线中每个折叠的AUC相同[Python]

0 个答案: