Question

I want to train a binary classification ML model with some data I have; it looks something like this:

df 

y   ch1_g1  ch2_g1  ch3_g1  ch1_g2  ch2_g2  ch3_g2
0   20      89      62      23      3       74
1   51      64      19      2       83      0
0   14      58      2       71      31      48
1   32      28      2       30      92      91
1   51      36      51      66      15      14
...
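For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame. It contains only the five rows shown above; the rows elided by "..." are not reproduced, so class counts on this sample will not match the full data.

```python
import pandas as pd

# Only the five rows shown in the question; the elided rows are unknown.
df = pd.DataFrame(
    {
        "y": [0, 1, 0, 1, 1],
        "ch1_g1": [20, 51, 14, 32, 51],
        "ch2_g1": [89, 64, 58, 28, 36],
        "ch3_g1": [62, 19, 2, 2, 51],
        "ch1_g2": [23, 2, 71, 30, 66],
        "ch2_g2": [3, 83, 31, 92, 15],
        "ch3_g2": [74, 0, 48, 91, 14],
    }
)
print(df)
```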

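My target (y) depends on three features from each of two groups, but my data is imbalanced: the value counts of y show that zeros outnumber ones by a ratio of about 2.68. I corrected this by looping over every row and randomly swapping the values of group 1 and group 2 (flipping y whenever a swap happens), like this: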

import numpy as np

for index, row in df.iterrows():
    # Draw a random target label for this row
    choice = np.random.choice([0, 1])
    if row['y'] != choice:
        # Flip the label and swap the group 1 and group 2 values
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]

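Doing this brings the ratio down to no more than 1.3, so I was wondering whether there is a more direct approach using pandas methods. Does anyone have an idea how to do this?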
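For reference, the zeros-to-ones ratio mentioned above can be measured with `value_counts`. This is a minimal sketch: `class_ratio` and `y_demo` are hypothetical names, the demo labels are made up to reproduce a 2.68 ratio, and the helper assumes both classes are present.

```python
import pandas as pd

def class_ratio(y):
    """Return (number of zeros) / (number of ones); assumes both classes occur."""
    counts = y.value_counts()
    return counts[0] / counts[1]

# Made-up labels: 67 zeros and 25 ones, i.e. a ratio of 67 / 25
y_demo = pd.Series([0] * 67 + [1] * 25)
print(class_ratio(y_demo))
```

Calling `class_ratio(df['y'])` before and after the swap would show how much the balance changed.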

Answer 1

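Leaving aside whether swapping the columns actually fixes the class imbalance, I would swap the entire dataset and then randomly choose, row by row, between the original and the swapped data: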

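import numpy as np
import pandas as pd

# Step 1: swap the columns
# '[^(_g1)]$' is a character class: it keeps names whose last character is
# not one of '(', '_', 'g', '1', ')', i.e. 'y' and the '_g2' columns here
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
                 df.filter(regex='_g1$')),
                axis=1)

# Step 2: rename the columns, so df1 holds the g2 values under the g1 names
# and vice versa; y keeps its original value
df1.columns = df.columns

# random choice
np.random.seed(1)
is_original = np.random.choice([True,False], size=len(df))

# concat to make new dataset
pd.concat((df[is_original],df1[~is_original]))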

Output:

   y  ch1_g1  ch2_g1  ch3_g1  ch1_g2  ch2_g2  ch3_g2
2  0      14      58       2      71      31      48
3  1      32      28       2      30      92      91
0  0      23       3      74      20      89      62
1  1       2      83       0      51      64      19
4  1      66      15      14      51      36      51

Note that the rows with index 0, 1 and 4 have g1 and g2 swapped.
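The output above keeps y unchanged for the swapped rows, whereas the loop in the question also flips y whenever it swaps. If that flip is wanted, a boolean mask can do both in one pass. This is only a sketch under that assumption: `swap_groups` and `swap_mask` are hypothetical names, the frame holds just the five sample rows, and the column pairing relies on the `_g1` / `_g2` suffixes.

```python
import numpy as np
import pandas as pd

def swap_groups(df, swap_mask):
    """Swap the _g1/_g2 values and flip y on the rows where swap_mask is True."""
    g1_cols = [c for c in df.columns if c.endswith("_g1")]
    g2_cols = [c[:-3] + "_g2" for c in g1_cols]
    out = df.copy()
    # Read from the untouched df so the two assignments do not interfere
    out.loc[swap_mask, g1_cols] = df.loc[swap_mask, g2_cols].to_numpy()
    out.loc[swap_mask, g2_cols] = df.loc[swap_mask, g1_cols].to_numpy()
    out.loc[swap_mask, "y"] = 1 - df.loc[swap_mask, "y"]
    return out

df = pd.DataFrame({
    "y": [0, 1, 0, 1, 1],
    "ch1_g1": [20, 51, 14, 32, 51], "ch2_g1": [89, 64, 58, 28, 36],
    "ch3_g1": [62, 19, 2, 2, 51], "ch1_g2": [23, 2, 71, 30, 66],
    "ch2_g2": [3, 83, 31, 92, 15], "ch3_g2": [74, 0, 48, 91, 14],
})

# Fixed mask for illustration: swap rows 0 and 3 only
swap_mask = np.array([True, False, False, True, False])
swapped = swap_groups(df, swap_mask)
print(swapped)
```

For a random swap, a mask such as `np.random.default_rng(1).random(len(df)) < 0.5` could be passed instead. Swapping each row with probability 0.5 and flipping y matches what the loop in the question does, since the random label differs from the current one half of the time.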

Pandas: randomly swap column values per row
