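I am working with a highly imbalanced dataset of 44 samples in total for my research project. It is a binary classification problem in which the minority class has 3 of the 44 samples, and I am using leave-one-out cross-validation (LOOCV). If I apply SMOTE oversampling to the whole dataset before the LOOCV loop, the prediction accuracy and the ROC AUC are close to 90% and 0.9 respectively. However, if I oversample only the training set inside the LOOCV loop, which happens to be the more logical approach, the ROC AUC drops as low as 0.3.
I also tried precision-recall curves and stratified k-fold cross-validation, but I see a similar gap between oversampling inside and outside the loop. Please suggest where I should do the oversampling, and explain the difference if possible.
Oversampling inside the loop: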
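from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# X (features) and Y (labels) are pandas objects loaded elsewhere
loo = LeaveOneOut()
i = 0                # number of splits processed
acc_dec = 0          # running sum of per-split accuracy
y_test_dec = []      # true label of the held-out sample for every split
y_pred_dec = []      # predicted probability of the positive label for every split
for train, test in loo.split(X):  # leave-one-out cross-validation
    # Create training and test sets from the positional split indices
    X_train = X.iloc[train]
    y_train = Y.iloc[train]
    X_test = X.iloc[test]
    y_test = Y.iloc[test]
    # Oversample the minority class with SMOTE, using the training rows only
    sm = SMOTE(sampling_strategy='minority', k_neighbors=1)
    X_res, y_res = sm.fit_resample(X_train, y_train)
    # KNN
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_res, y_res)
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy().ravel()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1
# Compute the ROC curve and AUC from the pooled held-out predictions
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print("AUC:" + str(roc_auc))
print(str(acc_dec / i * 100) + "%")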
AUC: 0.25
Accuracy: 68.1%
Oversampling outside the loop:
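import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

loo = LeaveOneOut()
acc_dec = 0          # running sum of per-split accuracy (KNN classifier)
y_test_dec = []      # true label of the held-out sample for every split
y_pred_dec = []      # predicted probability of the positive label for every split
i = 0
# Oversampling before the loop: synthetic rows end up in both train and test
sm = SMOTE(k_neighbors=1)
X, Y = sm.fit_resample(X, Y)
X = pd.DataFrame(X)
Y = pd.DataFrame(Y)
for train, test in loo.split(X):  # leave-one-out cross-validation
    # Create training and test sets from the positional split indices
    X_train = X.iloc[train]
    y_train = Y.iloc[train]
    X_test = X.iloc[test]
    y_test = Y.iloc[test]
    # KNN, fitted on this split's training rows
    clf = KNeighborsClassifier(n_neighbors=5)
    clf = clf.fit(X_train, y_train.to_numpy().ravel())
    y_pred = clf.predict(X_test)
    acc_dec = acc_dec + metrics.accuracy_score(y_test, y_pred)
    y_test_dec.append(y_test.to_numpy().ravel()[0])
    y_pred_dec.append(clf.predict_proba(X_test)[:, 1][0])
    i += 1
# Compute the ROC curve and AUC from the pooled held-out predictions
fpr, tpr, threshold = metrics.roc_curve(y_test_dec, y_pred_dec, pos_label=1)
roc_auc = metrics.auc(fpr, tpr)
print("AUC:" + str(roc_auc))
print(str(acc_dec / i * 100) + "%")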
AUC: 0.99
Accuracy: 90.24%
How can these two approaches lead to such different results? Which one should I follow?
Answer 0 (score: 1)
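Upsampling (e.g. with SMOTE) before splitting the data means that information is shared between the training set and the test set. This is sometimes called "leakage". SMOTE builds synthetic minority points by interpolating between real ones, so when you oversample first, points derived from a held-out sample are still sitting in the training fold, and a nearest-neighbour classifier will readily find them. With only 3 real minority samples, that is very likely what inflates the AUC to 0.99. Unfortunately, your first setup (oversampling inside the loop) is the correct one.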
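To make the leak concrete, here is a rough pure-Python sketch that tracks where each synthetic row comes from. Note that `smote_like` and `leaked_rows` are hypothetical helpers written for this illustration, not imbalanced-learn's SMOTE: the toy version cycles through the minority rows deterministically and always interpolates toward the single nearest minority neighbour, and the data are random placeholders with the same 44/3 shape as in the question.

```python
import random


def smote_like(minority, n_new, rng):
    """Toy SMOTE stand-in (an assumption, not the imblearn algorithm).

    `minority` is a list of (row_id, point) pairs. Each synthetic point is an
    interpolation between a base row and its nearest minority neighbour, and
    is returned together with the ids of the two rows it was built from.
    """
    synthetic = []
    for i in range(n_new):
        base_id, base = minority[i % len(minority)]
        # Nearest *other* minority row, i.e. k_neighbors=1
        nb_id, nb = min(
            (m for m in minority if m[0] != base_id),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m[1])),
        )
        gap = rng.random()
        point = tuple(a + gap * (b - a) for a, b in zip(base, nb))
        synthetic.append((point, {base_id, nb_id}))
    return synthetic


def leaked_rows(test_id, oversample_first, rows, minority_ids, seed=1):
    """Count training rows that were derived from the held-out minority row."""
    n_majority = len(rows) - len(minority_ids)
    if oversample_first:
        pool = list(minority_ids)  # held-out row still takes part in SMOTE
    else:
        pool = [i for i in minority_ids if i != test_id]  # split first
    minority = [(i, rows[i]) for i in pool]
    synthetic = smote_like(minority, n_majority - len(minority), random.Random(seed))
    return sum(test_id in parents for _, parents in synthetic)


rng = random.Random(0)
rows = {i: (rng.random(), rng.random()) for i in range(44)}  # placeholder data
minority_ids = [0, 1, 2]

for t in minority_ids:
    outside = leaked_rows(t, True, rows, minority_ids)
    inside = leaked_rows(t, False, rows, minority_ids)
    print(f"held-out row {t}: {outside} derived training rows (outside), {inside} (inside)")
```

When the oversampling happens first, every held-out minority row has a sizeable number of descendants left in the training data; when it happens inside the loop, it has none by construction.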
Here's a post that addresses this issue.