How to handle imbalanced data in Python?

Time: 2019-11-13 04:54:33

Tags: python

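I have a dataset with 3500 rows and 113 columns. The output feature is a class that I want to predict with a random forest, and all input features are numeric. The problem is that the output feature has BAD and GOOD as its classes, and the ratio of GOOD to BAD is 30 or more. I am new to Python. How should I proceed?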

1 answer:

Answer 0: (score: 0)

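Since `train_test_split` splits `df` according to the random factor `random_state`, my favorite way to get a good split is to check different `random_state` values and pick the one that distributes the data closest to reality. So, assuming `df['target']` has target values 0 and 1:

```python
from sklearn.model_selection import train_test_split

# Share of 1s in the full data; the training split should come close to it.
target_mean = df['target'].mean()
rs_best = 0
min_difference = 1
for rs in range(50):
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop('target', axis=1), df['target'],
        test_size=0.25, random_state=rs)
    cur_difference = abs(y_train.mean() - target_mean)
    # Keep the seed whose training target mean is closest to the overall mean.
    if cur_difference < min_difference:
        rs_best, min_difference = rs, cur_difference
print('Best random state split:', rs_best, min_difference)
```

You can then use the best `random_state` value (`rs_best`) for the actual split of the model.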
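The final split with the selected seed might look like the minimal sketch below. The toy DataFrame, its feature column names, and the roughly 1-in-31 rate of 1s are assumptions made only so the snippet runs on its own; with real data, `df` would be your own table with a 0/1 `target` column (for example BAD mapped to 1 and GOOD to 0).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3500 rows, roughly one 1 (BAD) per thirty 0s (GOOD).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(3500, 5)),
                  columns=[f"f{i}" for i in range(5)])
df["target"] = (rng.random(3500) < 1 / 31).astype(int)

X, y = df.drop("target", axis=1), df["target"]
target_mean = y.mean()

# Same search as in the answer: keep the seed whose training share of 1s
# is closest to the share in the full data.
rs_best, min_difference = 0, 1.0
for rs in range(50):
    _, _, y_tr, _ = train_test_split(X, y, test_size=0.25, random_state=rs)
    cur_difference = abs(y_tr.mean() - target_mean)
    if cur_difference < min_difference:
        rs_best, min_difference = rs, cur_difference

# Final split for the model, reusing the selected seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=rs_best
)
print("train share of 1s:", y_train.mean(), "overall:", target_mean)
```

`train_test_split` also accepts a `stratify` argument (for example `stratify=y`), which keeps the class proportions in both parts without searching over seeds.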