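The function imblearn.under_sampling.RandomUnderSampler only lets me specify the desired amount of undersampling as absolute numbers, passed in a dict. Absolute numbers interfere with (time-series) cross-validation, because I don't have the same number of minority-class samples in every fold. (This keeps producing the error: Originally, there is 11037 samples and 28546 samples are asked.)
Is it possible to pass relative values instead, e.g. 80% for class 0, 20% for class 1, and so on?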
Answer 0 (score: 1)
I think this is a minimal working example that solves your problem.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from imblearn.under_sampling import RandomUnderSampler
def classify(datasets, labels, *args):
    """Undersample each training fold, keeping args[k] percent of class k."""
    kf = KFold(n_splits=3)
    for train_idx, test_idx in kf.split(datasets):
        train_x, train_y = datasets[train_idx], labels[train_idx]
        test_x, test_y = datasets[test_idx], labels[test_idx]
        counts = Counter(train_y)
        print('Original dataset shape {}'.format(counts))
        # Turn the percentages into absolute counts for this fold only.
        ratio_dict = {k: int((v / 100) * counts[k]) for k, v in enumerate(args)}
        print(ratio_dict)
        # Older imbalanced-learn releases call these `ratio=` and `fit_sample`.
        rus = RandomUnderSampler(random_state=42, sampling_strategy=ratio_dict)
        X_res, y_res = rus.fit_resample(train_x, train_y)
        print('Resampled dataset shape {}'.format(Counter(y_res)))

X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
classify(X, y, 10, 20)
I'd be curious to see whether someone can achieve this with sklearn's Pipeline framework.
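One possible route toward the Pipeline version, as a rough sketch rather than a tested solution: as far as I know, recent imbalanced-learn releases accept a callable for `sampling_strategy` that receives `y` and returns a dict of target counts. If your installed version supports that, the percentage-to-count conversion can live in a small function that is re-evaluated on whatever labels each training fold provides. The helper name `make_relative_strategy` below is made up for illustration and is not part of imblearn; the snippet uses only the standard library so the arithmetic can be checked on its own.

```python
from collections import Counter


def make_relative_strategy(percentages):
    """Build a callable that maps labels to {class: number of samples to keep}.

    `percentages` maps each class label to the percent of that class to keep,
    e.g. {0: 80, 1: 20}. Hypothetical helper, not part of imbalanced-learn.
    """
    for label, pct in percentages.items():
        if not 0 < pct <= 100:
            raise ValueError(
                "percentage for class {!r} must be in (0, 100], got {}".format(label, pct)
            )

    def strategy(y):
        # Counts come from the labels passed in, so they follow each fold.
        counts = Counter(y)
        return {
            label: max(1, int(counts[label] * pct // 100))  # keep at least one sample
            for label, pct in percentages.items()
            if counts[label] > 0  # skip classes that are absent from this fold
        }

    return strategy


keep = make_relative_strategy({0: 80, 1: 20})
fold_a = [0] * 50 + [1] * 500
fold_b = [0] * 10 + [1] * 300
print(keep(fold_a))  # {0: 40, 1: 100}
print(keep(fold_b))  # {0: 8, 1: 60}
```

Assuming the callable form is available, `keep` could be passed as `RandomUnderSampler(sampling_strategy=keep)` and the sampler placed as a step in `imblearn.pipeline.Pipeline` (sklearn's own Pipeline does not handle samplers). The sampler is then fit on the training part of each split only, so no absolute count ever exceeds what a fold actually contains.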