Question

这篇文章提到，由于自举和交叉验证，Caret rpart比rpart更准确：

Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?

虽然当我比较两种方法时，Caret rpart的准确度为0.4879，rpart的准确度为0.7347（我在下面复制了我的代码）。

除了Caret rpart的分类树只有几个节点（分裂）与rpart相比

有没有人理解这些差异？

谢谢！

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading libraries and the data

This is an R Markdown document. First we load the libraries and the data and split the trainingdata into a training and a testset.

```{r section1, echo=TRUE}

# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)

# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the datasets
training <- read.csv(url(wwwTrain))
testing  <- read.csv(url(wwwTest))

# create a partition with the training dataset 
inTrain  <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet  <- training[-inTrain, ]
dim(TrainSet)

# set seed for reproducibility        
set.seed(12345)

```
## Cleaning the data

```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)

# remove variables that are mostly NA
AllNA    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet  <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)

# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)


```

## Prediction modelling

First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}

mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)

mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)

```

Second we build a similar model using rpart:
```{r section7, echo=TRUE}

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree

```

Answer 1

一个简单的解释是你没有调整任何一个模型，并且在默认设置下，rpart表现得更好。

当您使用相同的参数时，您应该期望相同的性能。

让我们用caret进行一些调整：

set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50, 
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
#output
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628          
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7009          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603

比使用默认设置（rpart）

的cp = 0.01好一点

如果我们设置由插入符号选择的最佳cp怎么样：

modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune))

predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class" )
confusionMatrix(predictDecTree, TestSet$classe)
#part of ouput
Accuracy : 0.7628

为什么rpart比R中的Caret rpart更准确

1 个答案: