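I have a dataset with 3500 rows and 113 columns. The output feature is a category that I want to predict with a random forest, and all input features are numeric. The problem is that the output feature has BAD and GOOD as its classes, and the ratio of GOOD to BAD is 30 or more. I am new to Python. How should I proceed?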
Answer 0 (score: 0)
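Since train_test_split splits df according to the random seed random_state, my favorite way to get a good split is to try different random_state values and pick the one whose training set has a class balance closest to that of the full dataset. So, assuming df['target'] holds the target values 0 and 1: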
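
    from sklearn.model_selection import train_test_split

    # Fraction of 1s in the full dataset; every candidate split is compared against this.
    target_mean = df['target'].mean()
    rs_best = 0
    min_difference = 1
    for rs in range(50):
        X_train, X_test, y_train, y_test = train_test_split(
            df.drop('target', axis=1), df['target'], test_size=0.25, random_state=rs)
        # How far the training-set class balance drifts from the overall balance.
        cur_difference = abs(y_train.mean() - target_mean)
        if cur_difference < min_difference:
            rs_best, min_difference = rs, cur_difference
    print('Best random state split:', rs_best, min_difference)

You can then use the best random_state value (rs_best) for the actual split of your model.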
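
For completeness, here is a rough end-to-end sketch of how the chosen seed might feed into the final split and a random forest. Several things in it are my assumptions rather than part of the answer above: the data is synthetic (generated with make_classification to mimic roughly 3% BAD), the feature names f0, f1, ... are made up, and class_weight='balanced' and the stratify= comparison are additions you may or may not want. With real BAD/GOOD strings you would first map them to 1/0.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (assumption): 3500 rows, 112 numeric
# inputs, about 3% minority class. Here 1 plays the role of BAD, 0 of GOOD.
X, y = make_classification(n_samples=3500, n_features=112, n_informative=10,
                           weights=[0.97, 0.03], random_state=0)
df = pd.DataFrame(X, columns=[f'f{i}' for i in range(X.shape[1])])  # made-up names
df['target'] = y

X_all, y_all = df.drop('target', axis=1), df['target']

# Seed search as in the answer: keep the seed whose training-set class
# balance is closest to the balance of the full dataset.
target_mean = y_all.mean()
rs_best, min_difference = 0, 1.0
for rs in range(50):
    _, _, y_tr, _ = train_test_split(X_all, y_all, test_size=0.25, random_state=rs)
    cur_difference = abs(y_tr.mean() - target_mean)
    if cur_difference < min_difference:
        rs_best, min_difference = rs, cur_difference

# Final split with the selected seed.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=rs_best)

# Alternative (not in the answer above): stratify= asks train_test_split to
# preserve the class proportions directly, without a seed search.
_, _, y_train_s, _ = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0, stratify=y_all)
print('overall:', target_mean, 'seed search:', y_train.mean(),
      'stratified:', y_train_s.mean())

# class_weight='balanced' (my addition) reweights classes inversely to their
# frequency so the rare class is not ignored during training.
model = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                               random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# With a 30:1 imbalance, per-class precision/recall says more than accuracy.
print(classification_report(y_test, pred, zero_division=0))
```

If all you need is a split that keeps the GOOD/BAD ratio, the stratify= route is probably the simpler option; the seed search is mostly useful when you want to stay close to the approach described above.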