Why does my chart print the variable importance of the wrong model?

Time: 2016-03-20 12:40:42

Tags: r charts markdown random-forest

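This seems like a very obvious problem, but I have been through the code and it looks fine. The challenge is that the code runs correctly when I run it line by line, but when I knit it as an R Markdown document it picks the wrong random forest and prints that model's importance.
I have tried reinstalling knitr, but that did not help.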

The data is based on the Titanic train dataset.

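I have two models, one called modRF and the other mod2. I want to draw the chart for mod2, but the output is for modRF. You can see this by changing the line imp<-importance(mod2$finalModel) to modRF$...
As I said, when I run this code line by line it works, but in R Markdown (knit to HTML) it generates the wrong chart. Can anyone explain why?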
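
One way to confirm which model a chart was drawn from is to compare the two importance tables directly rather than by eye. The following is a minimal sketch, not the code from this post: `fit_a` and `fit_b` are hypothetical stand-ins for `modRF$finalModel` and `mod2$finalModel`, fitted on the built-in `iris` data so that the snippet runs on its own.

```r
library(randomForest)

# Hypothetical stand-ins for the two models in the question:
# two forests fitted to different sets of predictors.
set.seed(4321)
fit_a <- randomForest(Species ~ ., data = iris)
set.seed(2072)
fit_b <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris)

# importance() returns one row per predictor the forest was trained on.
imp_a <- importance(fit_a)
imp_b <- importance(fit_b)

# The row names show which predictors each model actually saw.
rownames(imp_a)
rownames(imp_b)

# FALSE here, because the two tables have different predictors.
identical(imp_a, imp_b)
```

In the knitted document, printing `rownames(importance(mod2$finalModel))` just before the plot would show which predictors that model was actually trained on.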

PS: Each random forest model takes less than a minute to run on my machine, so running this code should not take too long.

Thanks in advance for your help,
J

Here is the code to reproduce the problem:

suppressMessages(library(caret))
suppressMessages(library(randomForest))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
setwd("~/Kaggle/Titanic")

totaltrain<-read.csv("train.csv")

#Adding features for EDA
totaltrain$CabinYes<-as.numeric(!(totaltrain$Cabin)=="")
ageid<-data.frame("minage"=c(0,20,30,40,50,60),
                  "AgeLabel"=c("Under 20","20-30","30-40","40-50","50-60","60+"))
#Equivalent of a VLOOKUP with approximate match (TRUE)
totaltrain$AgeBracket<-ageid[findInterval(totaltrain$Age,ageid$minage),2]
        #findInterval returns, for each Age, the index of the last minage that is <= Age.
        #That index selects the matching row of ageid, and column 2 returns the label
a<-c(1,2,3,5,7,8,12,13,14)
rates<-totaltrain[,a]
rates$AgeBracket<-as.character(rates$AgeBracket)
rates$AgeBracket[is.na(rates$AgeBracket)]<-"Unknown"
rates$AgeBracket<-as.factor(rates$AgeBracket)

rates$Survived<-as.factor(rates$Survived)
rates$Pclass<-as.factor(rates$Pclass)
rates$CabinYes<-as.factor(rates$CabinYes)

```{r,cache=TRUE}
set.seed(4321)
inTrain <- createDataPartition(y=rates$Survived,
                               p=0.75, list=FALSE)
training<-rates[inTrain,]
testing<-rates[-inTrain,]
modRF<-train(Survived~.-PassengerId,data=training,method="rf",trControl=
                     trainControl(method="cv",number = 3,
                                  allowParallel = TRUE))
pred<-predict(modRF,newdata=testing)
testing$PredRight<-pred==testing$Survived
sum(testing$PredRight)/length(pred)
```

b<-c(1,2,3,5,6,7,8,12,13)
rates2<-totaltrain[,b]
rates2$Age[is.na(rates2$Age)]<-0
#Model 2
set.seed(2072)

inTrain <- createDataPartition(y=rates$Survived,
                               p=0.75, list=FALSE)
training<-rates[inTrain,]
testing<-rates[-inTrain,]
mod2<-train(Survived~.-PassengerId,data=training,method="rf",trControl=
                     trainControl(method="cv",number = 3,
                                  allowParallel = TRUE))
imp<-importance(mod2$finalModel)
impdf<-data.frame(Variables=row.names(imp),Importance=round(imp[,1],2))
rankimp<-impdf %>% mutate(Rank = paste0('#',dense_rank(-Importance)))
ggplot(rankimp, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
        geom_bar(stat='identity') + 
        geom_text(aes(x = Variables, y = 0.5, label = Rank),
                  hjust=0, vjust=0.55, size = 4, colour = 'red') +
        labs(x = 'Variables') +
        coord_flip() 
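
The `findInterval` lookup used above to build `AgeBracket` can be illustrated on a handful of values. This is a minimal base R sketch: the `ageid` table is copied from the code above, while the `ages` vector is made up for illustration.

```r
# Lower bound of each age band and its label, as in the code above.
ageid <- data.frame(
  minage   = c(0, 20, 30, 40, 50, 60),
  AgeLabel = c("Under 20", "20-30", "30-40", "40-50", "50-60", "60+"),
  stringsAsFactors = FALSE
)

# Hypothetical sample ages, including one missing value.
ages <- c(5, 20, 29.5, 61, NA)

# For each age, findInterval() gives the index of the last minage that is <= age.
# Intervals are closed on the left, so an age of exactly 20 lands in the "20-30" band.
idx <- findInterval(ages, ageid$minage)
idx

# Indexing with NA yields NA, which is why the code above recodes NA as "Unknown".
labels <- ageid$AgeLabel[idx]
labels
```

A missing `Age` therefore produces a missing `AgeBracket`, which is why the extra recoding step is needed before the column is converted to a factor.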

0 answers:

No answers