I built a function that takes different input data frames and runs them through code with the same structure as this example:
library(caret)
predictors_name <- c("weight", "Time", "Chick")
target <- "Diet"
dat <- as.data.frame(ChickWeight)
dat <- droplevels(dat[1:300,])
# set predictors and response by column name
predictors <- dat[, predictors_name]
response <- dat[[target]]
# specify trainControl
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10, search = "grid")
# tune hyperparameter mtry
tunegrid <- expand.grid(mtry = seq_along(predictors_name))
set.seed(42)
rf_gridsearch <- train(x = predictors,
                       y = response,
                       method = "rf",
                       ntree = 2500,
                       metric = "Accuracy",
                       tuneGrid = tunegrid,
                       trControl = control)
# store the tuned mtry as a data frame (so the tuneGrid argument of train() will take it)
params <- data.frame(mtry = rf_gridsearch$finalModel$mtry)
# specify trainControl again, but without search
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE)
# fit models with fixed hyperparameters and a different set seed
set.seed(43)
model <- train(x = predictors,
               y = response,
               method = "rf",
               ntree = 2500,
               metric = "Accuracy",
               tuneGrid = params,
               trControl = control,
               importance = TRUE)
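To get a feel for how much work the two train() calls above request, the number of random forest fits can be counted by hand. This is only a rough sketch in base R: it assumes caret fits one forest per resample and per mtry value, plus one final forest on the full data, and the variable names are purely illustrative.

```r
# Rough count of the forests requested by the example above.
# Assumption: one fit per (resample, mtry value), plus one final refit per train() call.
n_folds   <- 10    # number = 10
n_repeats <- 10    # repeats = 10
n_mtry    <- 3     # mtry = 1, 2, 3 (one value per predictor)
ntree     <- 2500  # trees per forest

n_resamples <- n_folds * n_repeats

tuning_fits <- n_resamples * n_mtry + 1  # grid search plus final model
refit_fits  <- n_resamples * 1 + 1       # second train() with fixed mtry

total_forests <- tuning_fits + refit_fits
total_trees   <- total_forests * ntree

cat("forests:", total_forests, "\n")
cat("trees:", total_trees, "\n")
```

Under these assumptions the example asks for about 400 forests, or roughly a million trees, so any per-tree slowdown in classification is multiplied accordingly.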
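When I predict a metric (continuous) variable (e.g. weight) instead of a categorical variable (Diet), the code runs much faster. On my real data I ran the code in a loop, several times, for different metric response variables. These random forest regressions were much faster than the classification of just a single response on my real data: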
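Predicting the metric response variables (random forest regression) took only 2.5 hours, whereas predicting a categorical response variable (24 levels) with exactly the same predictors had already been running for 24 hours and still had not finished. I have used the same code for classification on a smaller data set of about 35 rows, and that took only about 1-3 minutes. The larger data set has 375 rows, and fitting the model takes forever: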
Time difference of 1.409491 days
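A timing like the one above can be reproduced on the ChickWeight example with a plain Sys.time() difference. The sketch below is not my real code: it assumes the randomForest package is installed, calls randomForest() directly without any caret resampling, uses a small ntree of 200 to keep it quick, and the helper time_fit() is made up for illustration. It only shows one way to compare a classification fit and a regression fit on the same rows.

```r
library(randomForest)

dat <- droplevels(as.data.frame(ChickWeight)[1:300, ])

# Hypothetical helper: fit one forest and return the elapsed time.
time_fit <- function(x, y, ntree = 200) {
  start <- Sys.time()
  randomForest(x = x, y = y, ntree = ntree)
  Sys.time() - start  # a difftime, printed as "Time difference of ..."
}

set.seed(42)
# Classification: factor response Diet
t_class <- time_fit(dat[, c("weight", "Time", "Chick")], dat$Diet)
# Regression: numeric response weight
t_reg <- time_fit(dat[, c("Time", "Chick", "Diet")], dat$weight)

print(t_class)
print(t_reg)
```

The absolute numbers depend on the machine, so only the ratio between the two timings is meaningful here.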
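Why does the random forest classification model take so much longer to compute when everything except the response variable is the same as in the regression model?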