I have been using the caretEnsemble and caret packages for stacking. My data is a document-term matrix with some additional features such as POS tags, and the goal is sentiment analysis with two classes. `sentitr` is the vector of sentiments corresponding to the training observations, and `sentitest` is the vector for the test set.
I use a 60:40 train/test split.
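For reference, a minimal sketch of how I create that split (assuming `dtm` is the full document-term matrix and `senti` the full sentiment factor; those two names are illustrative, not from my actual script):

```r
library(caret)

set.seed(42)
# createDataPartition does a stratified split, keeping the class
# proportions roughly equal in the training and test halves
idx <- createDataPartition(senti, p = 0.6, list = FALSE)

trainset  <- dtm[idx, ]
testset   <- dtm[-idx, ]
sentitr   <- senti[idx]
sentitest <- senti[-idx]
```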
control <- trainControl(method="cv", number=10, savePredictions = "final", classProbs = TRUE,
summaryFunction = twoClassSummary,
index=createResample(sentitr, 10))
algorithmList <- c('pda', 'nnet', 'gbm', 'svmLinear', 'rf', 'C5.0', 'glmnet')
models <- caretList(trainset, sentitr, trControl=control, methodList=algorithmList)
# some model info
summary(models)
res = resamples(models)
summary(res)
modelCor(res)
# pda and nnet extremely closely correlated
stackcontrol <- trainControl(method="cv", number=5, savePredictions = "final", classProbs = TRUE,
summaryFunction = twoClassSummary)
# stacks
stack.c5.0 <- caretStack(models, method="C5.0", metric="ROC", trControl=stackcontrol)
summary(stack.c5.0)
stack.c50.pred = predict(stack.c5.0, newdata = testset, type = "raw")
stackc50.conf = confusionMatrix(stack.c50.pred, sentitest)
I ran the model 10 times, each time randomly assigning the data to the 60/40 train/test split. On the test set I got the following classification accuracies (extracted from the confusion matrices):
Iteration  Accuracy
 1         0.3225
 2         0.2550
 3         0.7500
 4         0.2675
 5         0.2950
 6         0.7825
 7         0.2575
 8         0.2875
 9         0.2900
10         0.3275
These are the outputs. As you can see, in two of the ten iterations the model achieves roughly 75-80% accuracy; this is what I expect and matches the results I get from fitting the individual models. The remaining iterations, however, produce very poor accuracy. It looks to me as if the model is randomly swapping accuracy and error rate, i.e. the predicted class labels appear to be flipped. Any idea what causes this behaviour?
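Since the bad accuracies look roughly like mirror images of the good ones, one quick check I can run is whether inverting every predicted label recovers the expected accuracy (a sketch, assuming the two factor levels are named `neg` and `pos` — the level names are illustrative):

```r
# accuracy of the raw stacked predictions
acc <- mean(stack.c50.pred == sentitest)

# accuracy if every predicted label were swapped
flipped <- factor(ifelse(stack.c50.pred == "pos", "neg", "pos"),
                  levels = levels(sentitest))
acc_flipped <- mean(flipped == sentitest)

# if acc_flipped is high while acc is low, the labels are being inverted
c(accuracy = acc, flipped = acc_flipped)
```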
In every iteration where the predictions come out this badly, training the caretStack produces the following warnings:
2: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
3: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
4: In predict.C5.0(modelFit, newdata, trial = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials
5: In predict.C5.0(modelFit, newdata, type = "prob", trials = submodels$trials[j]) :
'trials' should be <= 9 for this object. Predictions generated using 9 trials