My dataset is highly imbalanced, and I want to perform SMOTE to balance it and then run cross-validation to measure accuracy. However, most existing tutorials use only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct procedure for performing SMOTE with cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single iteration.
Happy to provide more details if needed.
Answer 0 (score: 3)
You need to perform SMOTE within each fold. Accordingly, you should avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # Choose a model here
    # Fit on the oversampled training fold; the test fold stays untouched
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)

    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You could also, for example, append the scores to an externally defined list.
Answer 1 (score: 1)
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_resample(X_train, y_train)
    ....