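In a homework assignment, we were asked to run cross-validation on a CART model. I tried using the cvFit function from the cvTools package, but got a strange error message. Here is a minimal example: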
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart(formula=Species~., data=iris))
The error I see is:
Error in nobs(y) : argument "y" is missing, with no default
Output of traceback():
5: nobs(y)
4: cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K,
R = R, foldType = foldType, folds = folds, names = names,
predictArgs = predictArgs, costArgs = costArgs, envir = envir,
seed = seed)
3: cvFit(call, data = data, x = x, y = y, cost = cost, K = K, R = R,
foldType = foldType, folds = folds, names = names, predictArgs = predictArgs,
costArgs = costArgs, envir = envir, seed = seed)
2: cvFit.default(rpart(formula = Species ~ ., data = iris))
1: cvFit(rpart(formula = Species ~ ., data = iris))
It looks like y is mandatory for cvFit.default. But:
> cvFit(rpart(formula=Species~., data=iris), y=iris$Species)
Error in cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K, :
'x' must have 0 observations
What am I doing wrong? Which package would let me cross-validate a CART tree without having to write the code myself? (I am so lazy...)
Answer 0 (score: 16)
The caret package makes cross-validation a snap:
> library(caret)
> data(iris)
> tc <- trainControl("cv",10)
> rpart.grid <- expand.grid(.cp=0.2)
>
> (train.rpart <- train(Species ~., data=iris, method="rpart",trControl=tc,tuneGrid=rpart.grid))
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.94 0.91 0.0798 0.12
Tuning parameter 'cp' was held constant at a value of 0.2
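For readers wondering what the "Cross-Validation (10 fold)" and "135, 135, ..." lines in that output correspond to, here is a rough hand-rolled sketch using only rpart and base R. It is not caret's internal code: the fold assignment below is a plain random split (caret's own assignment may differ, for example by stratifying on the class), and the names K, folds and acc are made up for the illustration.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the folds are reproducible
K <- 10

# Assign each of the 150 rows to one of K folds (15 rows per fold).
folds <- sample(rep(seq_len(K), length.out = nrow(iris)))

# For each fold: fit on the other 9 folds (135 rows), score on the held-out 15.
acc <- sapply(seq_len(K), function(k) {
  train <- iris[folds != k, ]
  test  <- iris[folds == k, ]
  fit   <- rpart(Species ~ ., data = train,
                 control = rpart.control(cp = 0.2))  # same cp as the caret call
  pred  <- predict(fit, newdata = test, type = "class")
  mean(pred == test$Species)  # held-out accuracy for this fold
})

mean(acc)  # cross-validated accuracy estimate
sd(acc)    # spread across folds
```

The exact numbers depend on the random fold assignment, so they will not necessarily match the caret output above.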
Answer 1 (score: 4)
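Finally, I was able to get it to work. As Joran noted, the cost parameter needs to be adapted. In my case I am using 0/1 loss, which means I use a simple function that evaluates != instead of - between y and yHat. Also, predictArgs must include c(type='class'), otherwise the predict call used internally will return a vector of probabilities instead of the most probable class. To sum up: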
library(rpart)
library(cvTools)
data(iris)
cvFit(rpart, formula=Species~., data=iris,
cost=function(y, yHat) (y != yHat) + 0, predictArgs=c(type='class'))
(This uses another variant of cvFit. Additional arguments to rpart can be passed by setting the args= parameter.)
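To see why both adjustments matter, here is a small standalone check that uses rpart only (no cvTools). It fits a tree on the full iris data purely for illustration, and zero_one is just a name I picked for the 0/1 loss helper:

```r
library(rpart)

# Fit a classification tree on all of iris (illustration only, no resampling).
fit <- rpart(Species ~ ., data = iris)

# Without type = "class", predict() on a classification tree returns a
# matrix of class probabilities, one column per species.
probs <- predict(fit, newdata = iris)

# With type = "class", it returns a factor of predicted labels.
labels <- predict(fit, newdata = iris, type = "class")

# Hypothetical helper: 0/1 loss per observation. Subtraction is not
# meaningful for factors, so compare with != and coerce to numeric.
zero_one <- function(y, yHat) (y != yHat) + 0

loss <- zero_one(iris$Species, labels)
mean(loss)  # misclassification rate on the training data
```

Comparing the labels with != gives one 0 or 1 per observation, whereas the probability matrix has no sensible elementwise comparison against a factor, which is why type='class' is needed.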