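I am trying to fit a decision tree model with the caret package, but I can't get it to work.
First, I checked that the model works with the rpart package directly, and there it runs without problems: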
# setup
set.seed(123)
library(rpart)
library(caret)
# reading the file containing spam data
spamD <- readr::read_tsv(
"https://raw.githubusercontent.com/WinVector/zmPDSwR/master/Spambase/spamD.tsv"
)
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> spam = col_character()
#> )
#> See spec(...) for full column specifications.
# creating training and testing datasets
spamTrain <- dplyr::filter(.data = spamD, rgroup >= 10)
spamTest <- dplyr::filter(.data = spamD, rgroup < 10)
# training the model (works)
(treemodel <- rpart::rpart(formula = spam == "spam" ~ .,
data = dplyr::select(spamTrain, -rgroup)))
#> n= 4143
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 4143 989.338600 0.39415880
#> 2) char.freq.bang< 0.0795 2381 308.352800 0.15287690
#> 4) word.freq.remove< 0.045 2210 199.699500 0.10045250
#> 8) char.freq.dollar< 0.164 2138 156.482700 0.07951356
#> 16) word.freq.free< 0.115 1968 110.044200 0.05945122 *
#> 17) word.freq.free>=0.115 170 36.476470 0.31176470 *
#> 9) char.freq.dollar>=0.164 72 14.444440 0.72222220 *
#> 5) word.freq.remove>=0.045 171 24.081870 0.83040940
#> 10) word.freq.george>=0.08 14 0.000000 0.00000000 *
#> 11) word.freq.george< 0.08 157 13.566880 0.90445860 *
#> 3) char.freq.bang>=0.0795 1762 355.060700 0.72020430
#> 6) capital.run.length.average< 2.3995 625 150.198400 0.40160000
#> 12) word.freq.free< 0.075 454 85.374450 0.25110130
#> 24) word.freq.remove< 0.045 409 60.611250 0.18092910
#> 48) word.freq.internet< 0.08 377 43.368700 0.13262600 *
#> 49) word.freq.internet>=0.08 32 6.000000 0.75000000 *
#> 25) word.freq.remove>=0.045 45 4.444444 0.88888890 *
#> 13) word.freq.free>=0.075 171 27.239770 0.80116960 *
#> 7) capital.run.length.average>=2.3995 1137 106.545300 0.89533860
#> 14) word.freq.hp>=0.41 51 6.745098 0.15686270 *
#> 15) word.freq.hp< 0.41 1086 70.681400 0.93001840
#> 30) word.freq.edu>=0.52 15 0.000000 0.00000000 *
#> 31) word.freq.edu< 0.52 1071 57.525680 0.94304390 *
However, if I run the same model through the caret package, it fails:
# using `caret` package to do the same (doesn't work)
caret::train(
formula = spam == "spam" ~ .,
data = dplyr::select(spamTrain, -rgroup),
method = "rpart"
)
#> Something is wrong; all the RMSE metric values are missing:
#> RMSE Rsquared MAE
#> Min. : NA Min. : NA Min. : NA
#> 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
#> Median : NA Median : NA Median : NA
#> Mean :NaN Mean :NaN Mean :NaN
#> 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
#> Max. : NA Max. : NA Max. : NA
#> NA's :3 NA's :3 NA's :3
#> Error: Stopping
#> In addition: There were 26 warnings (use warnings() to see them)
Answer 0 (score: 1)
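As you can see from ?caret::train, there is no formula argument; the formula method calls it form instead.
You also need to reformat the response (use spam itself rather than spam == "spam") and filter out the NAs: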
caret::train(
  form = spam ~ .,
  # drop the grouping column, then remove rows where word.freq.cs is missing
  data = dplyr::filter(
    dplyr::select(spamTrain, -rgroup),
    !is.na(word.freq.cs)
  ),
  method = "rpart"
)
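For what it's worth, here is a small self-contained sketch of the same idea that does not depend on downloading the spam data. The `toy` data frame, its columns `x1` and `x2`, and the `non_spam` label are made up for illustration; only `form`, `data`, `method`, and `na.action` are arguments of the formula method of `caret::train`. It assumes the caret and rpart packages are installed. Converting the outcome to a factor makes it explicit that a classification tree is wanted, and `na.action = na.omit` is an alternative to filtering the NAs by hand.

```r
library(caret)  # the "rpart" method also needs the rpart package installed

set.seed(123)

# The formula method's first argument is `form`, not `formula`
stopifnot("form" %in% names(formals(getS3method("train", "formula"))))

# Hypothetical stand-in for spamTrain: character outcome, numeric predictors
toy <- data.frame(
  spam = rep(c("spam", "non_spam"), each = 50),
  x1   = c(rnorm(50, mean = 2), rnorm(50, mean = 0)),
  x2   = runif(100),
  stringsAsFactors = FALSE
)
toy$x2[c(3, 57)] <- NA  # inject a couple of missing values

# A factor outcome tells caret to treat this as classification
toy$spam <- factor(toy$spam)

fit <- caret::train(
  form      = spam ~ .,
  data      = toy,
  method    = "rpart",
  na.action = na.omit  # drop incomplete rows instead of failing on them
)

fit$modelType
```

With a factor outcome, the resampling summary should report Accuracy and Kappa rather than RMSE, which is a quick way to confirm that caret is no longer fitting a regression.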
Best!