R caretEnsemble: incorrect CV fold sizes

Date: 2017-08-29 14:23:49

Tags: r cross-validation r-caret ensemble-learning

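I am trying to ensemble models using the caretEnsemble package in R. Here is a minimal reproducible example. Let me know if any additional information is needed.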

library(caret)
library(caretEnsemble)
library(xgboost)
library(plyr)


# Load iris data and convert to binary classification problem
data(iris)
data = iris
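# twoClassSummary and classProbs=TRUE need a factor outcome whose levels are valid R names
data$target = factor(ifelse(data$Species == "setosa","setosa","other"), levels = c("setosa","other"))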
data = subset(data,select = -c(Species))

# Train control for models. 5 fold CV
set.seed(123)
index=createFolds(data$target, k=5,returnTrain = FALSE)
myControl = trainControl(method='cv', number=5,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary,
                          index=index)

# Layer 1 models
model1 = train(target ~ Sepal.Length,data=data, trControl = myControl, method = "glm", family = "binomial", metric = "ROC")
model2 = train(target ~ Sepal.Length,data=data, trControl = myControl, method = "xgbTree", metric = "ROC",
               tuneGrid=expand.grid(nrounds = 50, max_depth = 1, eta = .05,
                                    gamma = .5, colsample_bytree = 1,
                                    min_child_weight = 1, subsample = 1))

# Stack models
all.models <- list(model1, model2)
names(all.models) <- c("glm","xgb")
class(all.models) <- "caretList"

stacked <- caretStack(all.models, method = "glm", family = "binomial", metric = "ROC",
                          trControl=trainControl(method='cv', number=5,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary)
                          )

stacked

This is the main output I am concerned with.

A glm ensemble of 2 base models: glm, xgb

Ensemble results:
Generalized Linear Model 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 480, 480, 480, 480, 480 
Resampling results:

  ROC        Sens  Spec 
  0.9509688  0.92  0.835

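My concern is that the base dataset has 150 rows, so each fold of the 5-fold CV should contain 30 rows. If you look at `index`, you will see that this is the case. However, if you look at the results for `stacked`, the sample size for each of the 5 folds of the meta/stacked model is 480. That is 480 * 5 = 2400 in total, 16 times the size of the original dataset. I don't know why this happens.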

My main questions are:
1) Is the number of observations in each fold correct?
2) If so, why does this happen?

1 Answer:

Answer 0: (score: 0)

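In case anyone else stumbles upon this: I figured out the problem. The index I created identifies the out-of-sample (held-out) rows, so the code should be: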

myControl = trainControl(method='cv', number=5,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary,
                          indexOut=index)

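Instead of index = it should be indexOut =. Previously, each model was trained on 20% of the data and predicted on the other 80%, which explains the overlap. With this option set correctly, there is no overlap.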
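
To make the sizes concrete, here is a rough sketch of the arithmetic and of one way to spell out both sides of each split. It is not part of the original code: the names `holdout`, `train_rows`, and `explicitControl` are made up for illustration, and the binary target is rebuilt as a factor. It relies on the documented behaviour of `trainControl`, where `index` lists the rows used for training in each resample and `indexOut` lists the rows held out. Passing both is an assumption about what is wanted here, namely that each training set is exactly the complement of its held-out fold.

```r
library(caret)

data(iris)
n <- nrow(iris)  # 150 rows
# Binary target as a factor (illustrative; mirrors the setosa-vs-rest setup)
target <- factor(ifelse(iris$Species == "setosa", "setosa", "other"))

set.seed(123)
# Held-out rows for each of the 5 folds, as returned with returnTrain = FALSE
holdout <- createFolds(target, k = 5, returnTrain = FALSE)
# Training rows are the complement of each held-out fold
train_rows <- lapply(holdout, function(out) setdiff(seq_len(n), out))

# If the held-out folds are passed as `index`, each resample trains on its
# fold and predicts the remaining rows, so every row is predicted 4 times.
preds_wrong <- sum(n - lengths(holdout))   # 5 * 120 = 600 saved predictions
# A 5-fold CV of the stacked model over those predictions trains on 4/5 of them.
stack_train_wrong <- preds_wrong * 4 / 5   # 480, the size shown in the output

# With complementary training rows, each row is held out exactly once.
preds_right <- sum(lengths(holdout))       # 150 saved predictions

# Hypothetical control object that states both sides of every split
explicitControl <- trainControl(method = "cv", number = 5,
                                classProbs = TRUE, savePredictions = TRUE,
                                summaryFunction = twoClassSummary,
                                index = train_rows, indexOut = holdout)
```

Under these assumptions the 480 in the stacked output is 4/5 of 600 saved predictions rather than 4/5 of 150 rows. Supplying `index` together with `indexOut` also avoids depending on whatever training folds caret would generate by default when only `indexOut` is given.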