How to apply oversampling when doing Leave-One-Group-Out cross validation?

Asked: 2019-07-10 06:27:11

Tags: python pandas machine-learning scikit-learn cross-validation

I am working with imbalanced data for a classification task, so I previously tried oversampling the training data with the Synthetic Minority Oversampling Technique (SMOTE). This time, however, I think I also need to use Leave-One-Group-Out (LOGO) cross-validation, because I want to leave out one subject in each CV fold.

I am not sure I can explain it well, but as I understand it, to combine SMOTE with k-fold CV we should apply SMOTE inside the loop on each training fold, as in the code I saw on another post. Below is that example of applying SMOTE within k-fold CV.

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  
    X_test = X[test_index]
    y_test = y[test_index]  
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)  # fit_resample in newer imbalanced-learn versions
    model = ...  # classification model example
    model.fit(X_train_oversampled, y_train_oversampled)  # train on the oversampled fold, not the original one
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
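
For reference, I have also seen that the same per-fold oversampling can be written with imblearn's Pipeline, so that SMOTE is fitted only on the training part of each split; a rough sketch, where the RandomForestClassifier is just a placeholder and X, y are the same arrays as above:

from sklearn.ensemble import RandomForestClassifier  # placeholder classifier
from sklearn.model_selection import KFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# SMOTE is only applied when the pipeline is fitted, i.e. on the training
# portion of each fold, never on the held-out test fold.
pipeline = Pipeline([
    ('smote', SMOTE()),
    ('clf', RandomForestClassifier()),
])

f1_scores = cross_val_score(pipeline, X, y, cv=KFold(n_splits=5), scoring='f1')
print(f1_scores)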

Without SMOTE, I tried the code below to perform LOGO CV. But doing it this way, I suppose I am training on a highly imbalanced dataset.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = X_std  # standardized feature matrix (NumPy array)
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out the data of one cow (one ID) on each run
logo = LeaveOneGroupOut()

logo.get_n_splits(X_std, y, groups)

cv = logo.split(X_std, y, groups)

scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())  # model is an already-instantiated classifier
    scores.append(model.score(X_test, y_test.ravel()))

My question is: how should I implement SMOTE inside the Leave-One-Group-Out CV loop? I am confused about how I should define the group list for the synthetic training data.

I would be happy to provide more information. Thank you!

1 Answer:

Answer 0 (score: 0):

The approach suggested here for leave-one-out cross-validation also makes sense for leave-one-group-out CV: hold out the group of samples you will use as the test set and oversample only the remaining samples. Train the classifier on all of the oversampled data and evaluate it on the held-out test set. The group labels are only used to produce the train/test splits, so the synthetic samples created after splitting never need a group assigned to them.

In your case, the code below would be the right way to implement SMOTE inside the LOGO CV loop.

for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    sm = SMOTE()
    # oversample only the training data of this split; the held-out group stays untouched
    X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)  # fit_resample in newer imbalanced-learn versions
    model.fit(X_train_oversampled, y_train_oversampled.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
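
For completeness, here is a minimal self-contained sketch of the whole procedure, assuming a pandas DataFrame df whose columns are the features plus a 'label' column and a 'cow_id' group column; the RandomForestClassifier is only a placeholder for whatever classifier you actually use, and fit_resample is the current name of fit_sample in newer imbalanced-learn releases:

import numpy as np
from sklearn.ensemble import RandomForestClassifier  # placeholder classifier
from sklearn.model_selection import LeaveOneGroupOut
from imblearn.over_sampling import SMOTE

# Assumed layout of df: feature columns + 'label' + 'cow_id'.
X = df.drop(columns=['label', 'cow_id']).values
y = df['label'].values
groups = df['cow_id'].values

logo = LeaveOneGroupOut()
scores = []

for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Oversample only the training cows; the held-out cow is left untouched,
    # and the synthetic samples never need a group label of their own.
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)

    model = RandomForestClassifier()
    model.fit(X_train_oversampled, y_train_oversampled)
    scores.append(model.score(X_test, y_test))

print(np.mean(scores))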