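I am using a decision tree to classify cases. Because my data is imbalanced, I replicated the minority class until I reached a 50:50 balance. I know this is a rather unusual approach, and I have also tried the SMOTE function.
Suppose I have 5% bad cases and 95% good cases. I replicated the bad cases until I had 50% bad and 50% good. The code is below.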
# Count the frequency of each class
tab <- table(train$case)
# Number of rows to add so the minority class matches the majority class
no_of_rows <- max(tab) - min(tab)
# Indices of the rows that already belong to the minority class
existing_rows <- which(train$case %in% names(which.min(tab)))
# Append repeated copies of the minority rows
# (rep() truncates a fractional `times`, so the classes end up close to, not exactly, 50:50)
new_df <- rbind(train, train[rep(existing_rows, no_of_rows/length(existing_rows)), ])
train <- new_df
# Check the counts
table(train$case)
#   bad  good
# 15316 15855
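The counts above are not exactly 50:50 because `rep()` truncates a non-integer `times` argument, so only whole copies of the minority rows are appended. If an exact balance matters, one option is to draw the extra rows by sampling with replacement. The following is only a sketch: the simulated `train` data frame (547 bad and 15855 good rows) and the names `extra_rows` and `balanced` are assumptions made for illustration, not part of the original code.

```r
# Simulated training data (assumption: 547 bad and 15855 good rows)
train <- data.frame(id = 1:16402,
                    case = c(rep("bad", 547), rep("good", 15855)))

tab <- table(train$case)
minority <- names(which.min(tab))          # label of the smaller class
n_extra <- max(tab) - min(tab)             # rows needed to reach the majority count
minority_rows <- which(train$case == minority)

# Draw exactly n_extra minority row indices, with replacement
set.seed(1234)
extra_rows <- sample(minority_rows, size = n_extra, replace = TRUE)

balanced <- rbind(train, train[extra_rows, ])
table(balanced$case)
```

Unlike whole-copy replication, this repeats some minority rows more often than others, so which approach is preferable depends on the use case.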
Now I want a 60:40 split instead, meaning 60% bad cases and 40% good cases, but I do not know how to do this.
Can anyone help? Thanks.
Answer 0 (score: 1)
You can work through this example.
train <- data.frame(id = 1:31171, case = c(rep("bad", 15316),
rep("good", 15855))) # a simulation of your train data frame
table(train$case)
# bad good
#15316 15855
prop.table(table(train$case))
# bad good
#0.4913541 0.5086459
# Now you want a new data (say train2) with 60% bad cases and 40% good cases
# First, you have to decide the size of train2 and this task strongly depends on your research question
# But here, let's assume we want to keep all the bad cases
# So all the 15316 bad cases should represent 60% of train2 sample size (instead of 49% in train)
# Therefore, the sample size of train2 should be 25527
15316/0.6
# [1] 25526.67
# This means that you have to add to the 15316 bad cases 10211 good cases
25527-15316
#[1] 10211
# You can now sample those 10211 good cases from the 15855 good cases in train
library(dplyr)
set.seed(1234)
good.cases <- train %>% filter(case == "good") %>% sample_n(10211)
# Now merge the bad cases with the good one you sampled
train2 <- train %>% filter(case == "bad") %>% bind_rows(good.cases)
# Check the distribution of cases
table(train2$case)
# bad good
#15316 10211
prop.table(table(train2$case))
# bad good
#0.5999922 0.4000078
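The same steps can be wrapped in a small function so that other target ratios can be tried without redoing the arithmetic by hand. This is a minimal base R sketch, not part of the original answer: `resample_to_ratio` and its arguments are hypothetical names. Like the example above, it assumes all rows of the chosen class are kept and the remaining rows are sampled without replacement.

```r
# Hypothetical helper: keep every row of class `positive` and sample the
# other rows (without replacement) so that `positive` makes up a share `p`.
resample_to_ratio <- function(data, class_col, positive, p, seed = 1234) {
  stopifnot(p > 0, p < 1)
  is_pos <- data[[class_col]] == positive
  n_pos <- sum(is_pos)
  # Number of other rows needed so that n_pos / (n_pos + n_other) is about p
  n_other <- round(n_pos * (1 - p) / p)
  stopifnot(n_other <= sum(!is_pos))   # not enough rows to sample from otherwise
  set.seed(seed)                       # note: this resets the global RNG state
  keep <- c(which(is_pos), sample(which(!is_pos), n_other))
  data[keep, , drop = FALSE]
}

# Same simulated data as in the example above
train <- data.frame(id = 1:31171,
                    case = c(rep("bad", 15316), rep("good", 15855)))
train2 <- resample_to_ratio(train, "case", "bad", 0.6)
prop.table(table(train2$case))
```

The rows drawn will differ from the `dplyr` version because the sampling call is different, but the class counts are the same (15316 bad, 10211 good).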