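Is it possible to create multiple random forest models by tuning hyperparameters on the training data, check test-data performance for all of the models, and store the results in a CSV file?
For example, I have one model where mtry is 6 and nodesize is 3, and another where mtry is 10 and nodesize is 4. What I need to do is test the performance of both models on the test data and store key model metrics such as the confusion matrix, sensitivity, and specificity.
I tried the following code: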
train_performance <- data.frame('TN'=0, 'FP'=0, 'FN'=0, 'TP'=0, 'accuracy'=0, 'kappa'=0, 'sensitivity'=0, 'specificity'=0)
modellist <- list()
for (mtry in c(6,11)){
  for (nodesize in c(2,3)){
    fit_model <- randomForest(dv~., train_final, mtry = mtry, importance=TRUE, nodesize=nodesize,
                              sampsize = ceiling(.8*nrow(train_final)), proximity=TRUE, na.action = na.omit,
                              ntree=500)
    Key_col <- paste0(mtry,"-",nodesize)
    modellist[[Key_col]] <- fit_model
    pred_train <- predict(fit_model, train_final)
    cf <- confusionMatrix(pred_train, train_final$DV, mode = 'everything', positive = '1')
    train_performance$TN <- cf$table[1]
    train_performance$FP <- cf$table[2]
    train_performance$FN <- cf$table[3]
    train_performance$TP <- cf$table[4]
    train_performance$accuracy=cf$overall[1]
    train_performance$kappa=cf$overall[2]
    train_performance$sensitivity=cf$byClass[1]
    train_performance$specificity=cf$byClass[2]
    train_performance$key=Key_col
  }
}
Answer 0 (score: 1)
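Below is an example approach using the caret package that shows how to tune and train random forest models; it reports the accuracy metrics for every model tried: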
library(randomForest)
library(mlbench)
library(caret)
# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[,1:60]
y <- dataset[,61]
# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
Output:
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.8138384 0.6209924 0.0747572 0.1569159
Tuning with caret:
Random search: One search strategy we can use is to try random values within a range.
# Random Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(seed)
mtry <- sqrt(ncol(x))
rf_random <- train(Class~., data=dataset, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
plot(rf_random)
Output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
11 0.8218470 0.6365181 0.09124610 0.1906693
14 0.8140620 0.6215867 0.08475785 0.1750848
17 0.8030231 0.5990734 0.09595988 0.1986971
24 0.8042929 0.6002362 0.09847815 0.2053314
30 0.7933333 0.5798250 0.09110171 0.1879681
34 0.8015873 0.5970248 0.07931664 0.1621170
45 0.7932612 0.5796828 0.09195386 0.1887363
47 0.7903896 0.5738230 0.10325010 0.2123314
49 0.7867532 0.5673879 0.09256912 0.1899197
50 0.7775397 0.5483207 0.10118502 0.2063198
60 0.7790476 0.5513705 0.09810647 0.2005012
Grid search: Another search strategy is to define a grid of algorithm parameters to try.
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(seed)
tunegrid <- expand.grid(.mtry=c(1:15))
rf_gridsearch <- train(Class~., data=dataset, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
plot(rf_gridsearch)
Output:
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
1 0.8377273 0.6688712 0.07154794 0.1507990
2 0.8378932 0.6693593 0.07185686 0.1513988
3 0.8314502 0.6564856 0.08191277 0.1700197
4 0.8249567 0.6435956 0.07653933 0.1590840
5 0.8268470 0.6472114 0.06787878 0.1418983
6 0.8298701 0.6537667 0.07968069 0.1654484
7 0.8282035 0.6493708 0.07492042 0.1584772
8 0.8232828 0.6396484 0.07468091 0.1571185
9 0.8268398 0.6476575 0.07355522 0.1529670
10 0.8204906 0.6346991 0.08499469 0.1756645
11 0.8073304 0.6071477 0.09882638 0.2055589
12 0.8184488 0.6299098 0.09038264 0.1884499
13 0.8093795 0.6119327 0.08788302 0.1821910
14 0.8186797 0.6304113 0.08178957 0.1715189
15 0.8168615 0.6265481 0.10074984 0.2091663
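To get a table like this into a CSV file, as the question asks, note that the object returned by `train()` keeps the resampling metrics as a data frame in its `results` element. The following is a minimal sketch rather than a definitive recipe: the reduced grid, the plain 5-fold cross-validation, the object name `rf_small`, and the file name are all assumptions chosen to keep the run short.

```r
library(caret)
library(mlbench)

data(Sonar)
set.seed(7)

# Assumption: a small grid and plain 5-fold CV, only to keep the sketch quick
control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = c(2, 6, 10))
rf_small <- train(Class ~ ., data = Sonar, method = "rf", metric = "Accuracy",
                  tuneGrid = tunegrid, trControl = control)

# One row per mtry value: Accuracy, Kappa, and their standard deviations
results <- rf_small$results

# Hypothetical file name; row.names = FALSE avoids an extra index column
write.csv(results, "rf_tuning_results.csv", row.names = FALSE)
```

These are cross-validated estimates computed on the training data, so they complement a separate check on held-out test data rather than replace it.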
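There are many other ways to tune random forest models and store their results; the two shown above are the most widely used.
You can also set these parameters manually and train and tune the models yourself.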
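The manual route is the one that covers nodesize, because caret's built-in "rf" method tunes only mtry. The sketch below is one possible way to do it, not the only one: it loops over mtry and nodesize, scores each model on held-out test data, and collects one row per model instead of overwriting a single row as the question's loop does. The Sonar data stands in for the asker's data, and the 80/20 split, the parameter values, and the file name are assumptions.

```r
library(randomForest)
library(caret)
library(mlbench)

# Assumption: Sonar stands in for the asker's data, split 80/20
data(Sonar)
set.seed(7)
in_train <- createDataPartition(Sonar$Class, p = 0.8, list = FALSE)
train_final <- Sonar[in_train, ]
test_final  <- Sonar[-in_train, ]

# Every mtry/nodesize combination to try (four models here)
grid <- expand.grid(mtry = c(6, 10), nodesize = c(3, 4))

modellist <- list()
rows <- vector("list", nrow(grid))

for (i in seq_len(nrow(grid))) {
  fit <- randomForest(Class ~ ., data = train_final,
                      mtry = grid$mtry[i], nodesize = grid$nodesize[i],
                      ntree = 500)
  key <- paste0(grid$mtry[i], "-", grid$nodesize[i])
  modellist[[key]] <- fit

  # Evaluate on the held-out test set, not on the training data
  pred <- predict(fit, test_final)
  cf <- confusionMatrix(pred, test_final$Class, positive = "M")

  # cf$table has predictions in rows and reference classes in columns
  rows[[i]] <- data.frame(
    key = key,
    TP = cf$table["M", "M"], FP = cf$table["M", "R"],
    FN = cf$table["R", "M"], TN = cf$table["R", "R"],
    accuracy = unname(cf$overall["Accuracy"]),
    kappa = unname(cf$overall["Kappa"]),
    sensitivity = unname(cf$byClass["Sensitivity"]),
    specificity = unname(cf$byClass["Specificity"])
  )
}

# One row per model, written out once after the loop
test_performance <- do.call(rbind, rows)
write.csv(test_performance, "rf_test_performance.csv", row.names = FALSE)
```

Indexing the confusion matrix and metric vectors by name rather than by position keeps the TP/FP/FN/TN columns correct regardless of how the factor levels are ordered.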