Splitting a data frame with stratified sampling for decision tree learning

Time: 2020-03-20 12:52:59

Tags: r dplyr decision-tree

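I want to create training and test sets using stratified sampling. I looked around, but every package I found returns a data frame rather than an expression. The tree package I use to build the tree requires the subset to be given as an expression.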

Example code:

library(tree)
library(ISLR)
library(dplyr)

Carseats <- Carseats %>% mutate(High = factor(ifelse(Sales <= 8, "No", "Yes")))

set.seed(2)
train_sample <- sample(nrow(Carseats), nrow(Carseats) * 0.7)
carseats_test <- Carseats[-train_sample,]

tree.carseats <- tree(High~ . -Sales, Carseats, subset = train_sample)

Is it possible to modify the code above so that the sampling is stratified?

1 answer:

Answer 0 (score: 0)

You can do it like this:

library(tree)
library(ISLR)
library(dplyr)

Carseats <- Carseats %>% mutate(High = factor(ifelse(Sales <= 8, "No", "Yes")))

mean(Carseats$High == "Yes")
#> [1] 0.41

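set.seed(2)  # make the split reproducible

# Sample 70% of the rows within each level of High and keep their row numbers
train_sample <- Carseats %>%
  tibble::rownames_to_column() %>%
  group_by(High) %>%
  sample_n(floor(0.7 * n())) %>%  # per-group size, rounded down
  mutate(rowname = as.integer(rowname)) %>%
  pull(rowname)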

carseats_test <- Carseats[-train_sample,]
mean(carseats_test$High == "Yes")
#> [1] 0.4132231

tree.carseats <- tree(High~ . -Sales, Carseats, subset = train_sample)
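
The same idea can be sketched in base R without dplyr: group the row positions by class, sample within each group, and return a plain index vector. `stratified_indices` is a hypothetical helper written for this illustration, not a function from tree or dplyr, and the toy `high` vector only imitates the class balance of `Carseats$High`.

```r
# Hypothetical helper (not part of tree or dplyr): returns row indices that
# hold roughly `prop` of the rows from every level of `strata`.
stratified_indices <- function(strata, prop = 0.7) {
  # Group the row positions by stratum
  rows_by_group <- split(seq_along(strata), strata)
  picked <- lapply(rows_by_group, function(rows) {
    # Index into `rows` so that a single-row group is handled correctly
    rows[sample.int(length(rows), floor(prop * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(2)
# Toy label vector with the same 236 "No" / 164 "Yes" balance as Carseats$High
high <- factor(rep(c("No", "Yes"), times = c(236, 164)))
train_idx <- stratified_indices(high, prop = 0.7)

table(high[train_idx])   # class counts in the training part
table(high[-train_idx])  # class counts in the held-out part
```

With the real data, `stratified_indices(Carseats$High)` should give an integer vector that can be passed as `subset = train_idx` in the `tree()` call, the same way `train_sample` is used above.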