Question

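So, I need to use some estimators in scikit-learn, namely LogisticRegression and SVM, but I have a problem: my dataset is very imbalanced and I need to run K-fold cross-validation. The thing is that sometimes the fold I fit on has only one target class available. I'd like to know whether these estimators have any way of fixing the number of classes in advance, perhaps by passing them a one-hot encoded representation of the target, so that the shape would not matter even if all the examples came from one class. The target matrix would already define how many classes there are.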

Is there a way to do this with scikit-learn? Perhaps with another library? I know both algorithms use liblinear, so maybe I could use some interface to it in this case.

Anyway, thanks for your time.

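Edit: stratified K-fold cross-validation does not work for me, because sometimes a class has fewer occurrences than there are folds. For example, I may have a dataset with 50 instances and 3 classes, where 46 instances belong to one class, 2 to the second and 2 to the third. Although I could run 3-fold cross-validation, I usually need results with more folds, and even with 3 folds there are still cases where one class is the only one available in a fold.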
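To make the situation in the edit concrete, here is a minimal sketch that reproduces the 46/2/2 label distribution and lists which classes each training fold contains. It assumes plain, unshuffled KFold with 5 splits and labels sorted by class; the helper name classes_per_training_fold and the placeholder features are made up for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels from the example in the question: 46 / 2 / 2 instances of three classes.
y = np.array([0] * 46 + [1] * 2 + [2] * 2)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def classes_per_training_fold(X, y, n_splits):
    """Return the set of classes present in each training fold."""
    kf = KFold(n_splits=n_splits)  # no shuffling, so the splits are deterministic
    return [set(np.unique(y[train_idx]).tolist()) for train_idx, _ in kf.split(X)]


fold_classes = classes_per_training_fold(X, y, n_splits=5)
for i, classes in enumerate(fold_classes):
    print(f"fold {i}: training classes = {sorted(classes)}")

# Training folds that contain a single class cannot be used to fit a classifier.
single_class_folds = [i for i, c in enumerate(fold_classes) if len(c) == 1]
print("single-class training folds:", single_class_folds)
```

With sorted labels the last split holds out all the minority samples, so its training fold contains only the majority class. Shuffling changes which folds are affected, but with only two samples per minority class it cannot guarantee that every fold sees every class.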

Answer 1

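The comments saying that you need to collect more data are probably right. However, if you believe you have enough data for your model to learn something useful, you can oversample the minority classes (or undersample the majority class, but this sounds like a problem for oversampling). Having only one class in your dataset makes it nearly impossible for your model to learn anything about that class.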

Here are some links to oversampling and undersampling libraries in Python. The well-known imbalanced-learn library is great.

https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html

https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py

https://imbalanced-learn.org/en/stable/combine.html

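Your case sounds like a good candidate for SMOTE. You also mentioned wanting to change the ratio. imblearn.over_sampling.SMOTE has a parameter named ratio (renamed sampling_strategy in later imbalanced-learn releases) to which you can pass a dictionary. You can also use percentages (see the documentation).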

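SMOTE uses the K-nearest-neighbors algorithm to create data points that are "similar" to the ones being oversampled. It is a more powerful algorithm than traditional oversampling because it helps avoid the problem of your model memorizing specific examples in the training data. Instead, SMOTE creates a "similar" data point (probably in a multi-dimensional space), so your model can learn to generalize better.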
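The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration only, not imbalanced-learn's implementation: the function name smote_like_samples, the toy minority points and the choice of k=2 are assumptions made for the example, and it expects k to be smaller than the number of minority samples.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_min, n_new=5, k=2)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority points, which is why the model sees new but plausible examples instead of exact duplicates.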

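NOTE: It is vital that you do not use SMOTE on the full dataset. You must use SMOTE on the training set only (i.e. after the split), and then validate on the validation and test sets to see whether your SMOTE model outperforms your other models. If you do not do this, you will get data leakage and a model that does not resemble the one you want.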

from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X_normalized (features) and y (labels) are assumed to be defined earlier.
# Split first, so that SMOTE only ever sees the training portion (see the note above).
X_train_normalized, X_test_normalized, y_train, y_test = train_test_split(
    X_normalized, y, random_state=0)

# Target number of samples per class after resampling.
# Older imbalanced-learn releases call this parameter `ratio` and the method `fit_sample`.
sm = SMOTE(random_state=0,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80, 'class4': 60, 'class5': 90})
X_train_smote, y_train_smote = sm.fit_resample(X_train_normalized, y_train)

print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_train_smote))

# Recent XGBoost releases may expect integer labels 0..n_classes-1;
# encode string labels (e.g. with LabelEncoder) first if that applies.
smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

# Evaluate on the original, non-resampled data.
# With more than two classes, f1_score needs an explicit `average`.
print('TRAIN')
train_pred = smote_xgbc.predict(np.array(X_train_normalized))
print(accuracy_score(y_train, train_pred))
print(f1_score(y_train, train_pred, average='macro'))

print('TEST')
test_pred = smote_xgbc.predict(np.array(X_test_normalized))
print(accuracy_score(y_test, test_pred))
print(f1_score(y_test, test_pred, average='macro'))
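Since the question is about K-fold cross-validation, the same "resample the training data only" rule can be applied fold by fold. The sketch below is one possible way to do that with scikit-learn and NumPy alone. It swaps in plain random oversampling for SMOTE to stay dependency-free, and the synthetic dataset, the helper name oversample and the choice of 3 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           weights=[0.8, 0.1, 0.1], random_state=0)


def oversample(X, y, seed=0):
    """Randomly duplicate samples until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Keep every original sample and draw the missing ones with replacement.
        extra = cls_idx[rng.integers(0, count, size=target - count)]
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]


scores = []
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Resample the training fold only; the validation fold keeps its real distribution.
    X_tr, y_tr = oversample(X[train_idx], y[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx]), average='macro'))

print('macro F1 per fold:', np.round(scores, 3))
```

Stratified splitting still needs at least as many samples per class as there are folds, so this does not remove the limitation described in the question's edit; it only keeps the resampling from leaking into the validation folds.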

Scikit-learn: fitting an estimator with a predefined number of classes
