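I'm trying to balance my data the way the code below does, but I couldn't work out how to do it with groupBy and a UDF, and I found that a UDF cannot return a DataFrame.
Is there any way to handle imbalanced data with Spark, or by some other means?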
import pandas as pd

ratio = 3  # number of non-picked comments to keep per editor's pick

def balance_classes(grp):
    picked = grp.loc[grp.editorsSelection == True]
    n = round(picked.shape[0] * ratio)
    if n:
        try:
            not_picked = grp.loc[grp.editorsSelection == False].sample(n)
        except ValueError:  # fewer than n comments with `editorsSelection == False`
            not_picked = grp.loc[grp.editorsSelection == False]
        balanced_grp = pd.concat([picked, not_picked])
        return balanced_grp
    else:  # if an article has no editor's pick, discard all comments from that article
        return None

comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)
Answer 0 (score: 0)
I usually use this logic for undersampling:
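from pyspark.sql.functions import col

def resample(base_features, ratio, class_field, base_class):
    pos = base_features.filter(col(class_field) == base_class)
    neg = base_features.filter(col(class_field) != base_class)
    total_pos = pos.count()
    total_neg = neg.count()
    # Fraction of negatives to keep, capped at 1.0 because sampling
    # without replacement cannot take more rows than exist
    fraction = min(1.0, float(total_pos * ratio) / float(total_neg))
    sampled = neg.sample(False, fraction)
    return sampled.union(pos)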
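base_features is the Spark DataFrame that holds the features. ratio is the desired number of negative rows per positive row (the code keeps roughly total_pos * ratio negatives). class_field is the name of the column that holds the class, and base_class is the ID of the positive class.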
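To sanity-check the sampling fraction before running this on a cluster, the arithmetic can be tried in plain Python. This is only a rough sketch: undersample_fraction is a hypothetical helper that is not part of the code above, and the cap at 1.0 and the guard for zero negatives are my own assumptions about how the edge cases should behave.

```python
def undersample_fraction(total_pos, total_neg, ratio):
    """Fraction of negative rows to keep so that roughly
    `ratio` negatives remain for each positive row."""
    if total_neg == 0:
        return 0.0  # assumption: nothing to sample from, so keep nothing
    # Cap at 1.0: without replacement we cannot keep more rows than exist
    return min(1.0, (total_pos * ratio) / total_neg)

# 100 positives, 10,000 negatives, ratio 3: keep about 3% of the negatives
print(undersample_fraction(100, 10_000, 3))

# Only 200 negatives for 100 positives: the ratio is out of reach, keep them all
print(undersample_fraction(100, 200, 3))
```

Keep in mind that sampling by fraction in Spark is approximate, so the number of negatives you get back will usually be close to total_pos * ratio rather than exactly equal to it.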