Question

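I am writing a simple piece of code that downsamples a dataframe when the target variable has more than two classes.

Let df be an arbitrary dataset and 'TARGET_VAR' a categorical variable with more than two classes.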

import pandas as pd
label='TARGET_VAR' #define the target variable

num_class=df[label].value_counts() #Series with the count of each class value
temp=[] #temporary list to collect the downsampled pieces

for cl in num_class.index: #loop through classes
    #iteratively downsample every class to the size of the smallest
    #class 'num_class.min()' and collect the result.
    #(DataFrame.append was removed in pandas 2.0, so the pieces are
    #gathered in a list and concatenated once after the loop.)
    temp.append(df[df[label]==cl].sample(num_class.min()))

df=pd.concat(temp) #redefine initial dataframe as the subsampled one

del temp, num_class #delete temporary objects

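Now I am wondering: is there a more elegant way to do this, for example without creating a temporary dataset? I tried to find a way to "vectorize" the operation over multiple classes, but without success. Below is my idea, which is easy to implement for 2 classes, but I do not know how to extend it to the multi-class case.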

This works well if you have 2 classes:

df=pd.concat([df[df[label]==num_class.idxmin()],
              df[df[label]!=num_class.idxmin()].sample(min(num_class))])

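The following picks the correct total number of observations from the other classes, but it does not necessarily represent those classes equally: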

df1=pd.concat([df[df[label]==num_class.idxmin()],
               df[df[label]!=num_class.idxmin()].sample(min(num_class)*(len(num_class)-1))])
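
For the multi-class case, one possible direction is to sample within each group instead of looping. This is only a minimal sketch, assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. The toy dataframe, the column `x`, and the names `n_min` and `balanced` are made up for illustration; only `'TARGET_VAR'` comes from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for `df`; only the column name
# 'TARGET_VAR' is taken from the question.
df = pd.DataFrame({
    'TARGET_VAR': ['a'] * 6 + ['b'] * 3 + ['c'] * 2,
    'x': range(11),
})
label = 'TARGET_VAR'

# Size of the smallest class (2 for the toy data above).
n_min = df[label].value_counts().min()

# Draw n_min rows from every class in a single call, with no explicit
# loop and no temporary dataframe (requires pandas >= 1.1).
balanced = df.groupby(label).sample(n=n_min, random_state=0)

# Every class should now appear exactly n_min times.
print(balanced[label].value_counts())
```

Unlike the second snippet above, this draws the same number of rows from each class, so all classes end up equally represented. `random_state` is set here only to make the draw reproducible.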

Downsampling with more than 2 classes

0 answers: