Question

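My dataset looks like this: [image]. The first seven columns are the input features and the last five columns are the outputs. The output is an array of 5 numbers, each of which is either zero or one. I am using the Keras functional API for this. Whenever I try to resample the data for the individual columns, I run into shape problems when merging, even if I try to slice the rows.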

Answer 1

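Essentially, there is no "easy" way to do this. The only sensible approach is to apply a Label Powerset to your design matrix and then resample based on the column it creates, although in this case it is probably easier to do that transformation "by hand".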

Here is one way to do it:

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
import pandas as pd


# X0: ordinary feature matrix, y: an extra single-label target
X0, y = make_classification()
# X1: the 5 binary columns to balance (the label matrix of the multilabel generator)
_, X1 = make_multilabel_classification(n_classes=5, random_state=0)

# transform X1 by creating a powerset: 'dummy' identifies each unique combination of the 5 columns
df_x1 = pd.DataFrame(X1, columns=[f'c{x}' for x in range(X1.shape[1])])
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
print(df_x1['dummy'].value_counts())  # shows imbalance
df_x1 = df_x1.reset_index()  # so that we know which rows are resampled
df_y1 = df_x1['dummy']
df_x1 = df_x1[[x for x in df_x1.columns if x != 'dummy']]

# oversample so that every combination occurs as often as the most frequent one
ros = RandomOverSampler()
X_sample, _ = ros.fit_resample(df_x1, df_y1)  # X_sample['index'] holds the resampled row indices

X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]

The real secret sauce is this line:

df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})

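This re-indexes the rows by the 5 selected columns: every row gets an identifier ('dummy') for its combination of values. Next,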

df_x1 = df_x1.reset_index()

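adds the row number as a column, which is then passed to RandomOverSampler. Because the sampler balances on 'dummy', the 5 columns are guaranteed to be balanced jointly, that is, every combination of their values occurs equally often.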
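To make the re-indexing idea concrete, here is a minimal sketch in plain Python, without pandas. The toy label matrix `Y` and the names `combo_to_id` and `combo_ids` are made up for illustration. Unlike the merge above, which labels each combination with the row index of its first occurrence, this sketch assigns consecutive ids; either works as a class label for resampling.

```python
from collections import Counter

# Hypothetical toy label matrix: 6 rows, 5 binary output columns.
Y = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 0, 0),
    (1, 0, 0, 1, 0),
    (0, 0, 1, 0, 1),
    (0, 1, 0, 0, 0),
    (1, 0, 0, 1, 0),
]

# Give every distinct combination an integer id, in order of first appearance.
combo_to_id = {}
combo_ids = [combo_to_id.setdefault(tuple(row), len(combo_to_id)) for row in Y]

print(combo_ids)           # [0, 1, 0, 2, 1, 0]
print(Counter(combo_ids))  # Counter({0: 3, 1: 2, 2: 1}), i.e. imbalanced
```

The resulting ids turn the 5-column problem into an ordinary single-label problem, which is what a standard oversampler expects.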

Finally, we can select the sampled indices to build a dataset and labels that have been resampled consistently across X0, X1 and y:

X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
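The reason for carrying a row index through the sampler is that the same indices can then be applied to every array that shares those rows. The following is a rough sketch of that idea using only the standard library. It is not how imblearn implements RandomOverSampler, and `oversample_indices` as well as the toy data are hypothetical names chosen for this example.

```python
import random
from collections import Counter, defaultdict


def oversample_indices(class_ids, seed=0):
    """Return row indices such that every class id appears equally often."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, cid in enumerate(class_ids):
        rows_by_class[cid].append(row)

    # Every class is brought up to the size of the largest class.
    target = max(len(rows) for rows in rows_by_class.values())
    picked = []
    for rows in rows_by_class.values():
        # Keep every original row, then draw extra rows with replacement.
        picked.extend(rows + rng.choices(rows, k=target - len(rows)))
    return picked


# Hypothetical toy data: combination ids plus two sequences aligned by row.
combo_ids = [0, 1, 0, 2, 1, 0]
features = ["a", "b", "c", "d", "e", "f"]
targets = [10, 11, 12, 13, 14, 15]

idx = oversample_indices(combo_ids)
features_res = [features[i] for i in idx]
targets_res = [targets[i] for i in idx]

print(Counter(combo_ids[i] for i in idx))  # every id now appears 3 times
```

Because both `features_res` and `targets_res` are built from the same `idx`, each resampled feature row still lines up with its own target, which avoids the shape mismatches that appear when columns are resampled separately.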

How do I handle class imbalance across multiple columns?
