Question

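I am working on a classifier for a binary classification problem. The data is imbalanced: class 0 makes up 83.41% and class 1 makes up 16.59%. I am using the Matthews correlation coefficient (MCC) to evaluate the classifier. Note also that the dataset is very small, with shape (211, 800).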

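I am using logistic regression for this problem. I ran GridSearchCV for hyperparameter optimization and obtained the following best hyperparameter values:

Best parameters: {'C': 1000, 'class_weight': {1: 0.83, 0: 0.17000000000000004}, 'penalty': 'l1', 'solver': 'liblinear'}

Best MCC: 0.7045053547679334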

I plotted a validation curve over a range of C values to check whether the model is overfitting or underfitting.

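import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import validation_curve

# X, y hold the data; C is the array of candidate values for the C parameter.
# param_name and param_range are passed by keyword, as recent scikit-learn requires.
train_scores, test_scores = validation_curve(
    LogisticRegression(penalty='l1',
                       solver='liblinear',
                       class_weight={1: 0.83, 0: 0.17000000000000004}),
    X, y, param_name='C', param_range=C, cv=5,
    scoring=make_scorer(matthews_corrcoef))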
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with Logistic Regression")
plt.xlabel("C")
plt.ylabel("MCC")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(C, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
plt.fill_between(C, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=lw)
plt.semilogx(C, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
plt.fill_between(C, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color="navy", lw=lw)
plt.legend(loc="best")
plt.show()

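From my reading of this curve, the model tends to overfit: it scores well on the training set but noticeably worse on the validation set. Can anyone point me in the right direction for addressing this on such a small dataset?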

Answer 1

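There are a couple of things you can try:

Use SMOTE to oversample the minority class.
Reduce the number of iterations of GridSearchCV, or use RandomizedSearchCV instead.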
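
As a minimal sketch of the second suggestion, the search below samples C at random from a log-uniform range instead of sweeping a full grid. The synthetic data from make_classification, the C range of 1e-3 to 10, n_iter=20, and class_weight="balanced" are all illustrative assumptions rather than values from the question; substitute your own X and y.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the real data (assumption): shape (211, 800),
# with roughly an 83/17 class split.
X_demo, y_demo = make_classification(
    n_samples=211, n_features=800, n_informative=10,
    weights=[0.83, 0.17], random_state=0)

# Sample C on a log scale. The range is an assumption; it leans toward
# smaller C, i.e. stronger regularization than the reported C=1000.
param_distributions = {"C": loguniform(1e-3, 1e1)}

search = RandomizedSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear",
                       class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=20,  # number of random draws, far fewer than a dense grid
    scoring=make_scorer(matthews_corrcoef),
    # Stratified folds keep the class ratio similar in every split.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

If you add SMOTE from imbalanced-learn, put it inside a pipeline (imblearn.pipeline provides one) so that oversampling is applied to the training folds only. Resampling before cross-validation lets synthetic points derived from validation samples leak into training and inflates the score.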
