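I am comparing several ML methods to see which one works best for my data. To make the problem reproducible, I am using the "Occupancy" data from the UCI ML repository, available here: http://archive.ics.uci.edu/ml/machine-learning-databases/00357/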
On the last line, the predict function fails with the following error:
Error in predict.randomForest(modelFit, newdata) : variables in the training data missing in newdata
Here is my code:
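rm(list=ls()) # remove all variables from workspace
library(readr)         # read_csv
library(dplyr)         # %>% and select
library(caret)         # trainControl, createFolds
library(caretEnsemble) # caretList, caretModelSpec, caretStack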
set.seed(01010)
# data from http://archive.ics.uci.edu/ml/machine-learning-databases/00357/
a<-read_csv("~/Documents/PhD/Analysis/1. Risk Score/occupancy_data/datatest.csv")
b<-read_csv("~/Documents/PhD/Analysis/1. Risk Score/occupancy_data/datatest2.csv")
c<-read_csv("~/Documents/PhD/Analysis/1. Risk Score/occupancy_data/datatraining.csv")
data <- rbind(a,b,c)
data$class<-as.factor(data$class)
data <- data %>% select(-Temperature,-date)
levels(data$class) <- c("A", "B") # classProbs=TRUE creates one column of class probabilities per class, and '0' and '1' are not valid variable names
data_perm<-data[sample(nrow(data)),]
train <- data_perm[1:floor(0.6*nrow(data_perm)),]
xvalidate <- data_perm[(floor(0.6*nrow(data_perm))+1):floor(0.8*nrow(data_perm)),-ncol(data)]
yvalidate <- data_perm[(floor(0.6*nrow(data_perm))+1):floor(0.8*nrow(data_perm)),ncol(data)]
### Ensemble: Setting up training controls
# for the weak trainers
control_stacking <- trainControl(method="repeatedcv",
                                 index=createFolds(train$class, 5),
                                 savePredictions="final",
                                 classProbs=TRUE)
# for the model combiner
stackControl <- trainControl(method="repeatedcv",
                             number=3,
                             repeats=2,
                             savePredictions=TRUE)
# define the hyperparameter list for each of the weak models to be included in the stack
models <- caretList(class~., data=train, trControl=control_stacking,
                    methodList=c('naive_bayes'),
                    tuneList=list(test=caretModelSpec(method='svmRadial')),
                    verbose=FALSE)
temp <- caretStack(models, method="rf", metric="Accuracy", trControl=stackControl, verbose=FALSE)
table(predict(temp, newdata=xvalidate))
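To sanity-check the 60/20/20 split used in the script, the same index arithmetic can be run on a small, made-up row count. This is only a sketch: n is hypothetical, and the leftover 20% (which the script never uses) is given the name test_pos purely for illustration.

```r
set.seed(1)
n <- 10L                                          # hypothetical row count
perm <- sample(n)                                 # shuffled row indices, as for data_perm
train_pos <- 1:floor(0.6 * n)                     # first 60% of the shuffled rows
valid_pos <- (floor(0.6 * n) + 1):floor(0.8 * n)  # next 20%
test_pos  <- setdiff(seq_len(n), c(train_pos, valid_pos))  # remaining 20% (hypothetical name)

train_rows <- perm[train_pos]
valid_rows <- perm[valid_pos]
test_rows  <- perm[test_pos]

# The three pieces should be disjoint and together cover every row exactly once.
stopifnot(length(intersect(train_rows, valid_rows)) == 0)
stopifnot(setequal(c(train_rows, valid_rows, test_rows), seq_len(n)))
lengths(list(train = train_rows, valid = valid_rows, test = test_rows))
```

Because floor() is applied to a floating-point product, it may be worth running the same check with the real nrow(data_perm) to confirm the boundaries fall where expected.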
Note that it works fine without the tuneList line, and it also works with the methods gbm and rf.
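Since the error only appears when the tuneList line is present, one possible explanation (an assumption, not confirmed here) is that the custom list name test does not match the column name under which that model's predictions are passed to the random forest at predict time. The sketch below shows the kind of name comparison that would expose such a mismatch. Both name vectors are hypothetical placeholders. The real values would have to be read from the fitted objects, for example via names(models).

```r
# Hypothetical: predictor names the meta-model was trained on
trained_on <- c("naive_bayes", "test")
# Hypothetical: names of the base-model prediction columns at predict time
newdata_cols <- c("naive_bayes", "svmRadial")

# Predictors the meta-model expects but would not find in newdata
missing_in_newdata <- setdiff(trained_on, newdata_cols)
# Columns in newdata that the meta-model never saw
unexpected_in_newdata <- setdiff(newdata_cols, trained_on)

missing_in_newdata
unexpected_in_newdata
```

If a mismatch like this does show up, it might be worth trying a list name that equals the method name, i.e. tuneList=list(svmRadial=caretModelSpec(method='svmRadial')), and checking whether predict then succeeds.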
Any help would be greatly appreciated!