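I'm new to R and machine learning, so please forgive the basic question.
I'm experimenting with the "spam" dataset from the kernlab library, using functions from the caret library.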
Goal:
Predict "type" from the remaining 57 variables in "spam".
I tried two different ways of preprocessing.
First, preprocessing the dataset before calling train():
# preprocess all 57 predictors, leave out response variable #58
preproc = preProcess(trainset[-58], method = "BoxCox")
preprocTrain = predict(preproc, trainset[,-58])
preprocTrain$type = trainset$type
preprocTest = predict(preproc, testset[,-58])
preprocTest$type = testset$type
set.seed(123)
fit2 = train(type~., data=preprocTrain, method = "glm")
predict2 = predict(fit2, newdata = preprocTest)
confmat2 = confusionMatrix(predict2, preprocTest$type)
fit2$results
confmat2$overall
Note:
fit2 Accuracy = 0.93 and confmat2 Accuracy = 0.92
Then, using preProcess inside train():
set.seed(123)
fit3 = train(type~., data=trainset, method="glm", preProcess = "BoxCox")
# Predict using the pre-processed test set from before
predict3 = predict(fit3, newdata = preprocTest)
confmat3 = confusionMatrix(predict3, preprocTest$type)
fit3$results
confmat3$overall
Now,
fit3 Accuracy = 0.93
but confmat3 Accuracy = 0.75
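Please help me understand why there is such a drastic drop. Shouldn't the confmat3 accuracy be the same as the confmat2 accuracy? Where is the difference? Also, in the second prediction, I get the following warnings: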
Warning messages:
1: In predict.BoxCoxTrans(object$bc[[i]], newdata[, i]) :
  newdata should have values > 0
2: In predict.BoxCoxTrans(object$bc[[i]], newdata[, i]) :
  newdata should have values > 0
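One check that may help isolate the difference: as far as I understand caret, a model fitted with train(..., preProcess = ...) stores the transformation and re-applies it inside predict(). Passing the already-transformed preprocTest would then apply Box-Cox a second time, which would also fit with the warnings above. The sketch below predicts on the raw test set instead. The createDataPartition split and the name predict3raw are assumptions for illustration, since the split used to build trainset and testset is not shown here.

```r
library(caret)
library(kernlab)

data(spam)

# Hypothetical 75/25 split; the split behind trainset/testset is not shown in the post.
set.seed(123)
inTrain = createDataPartition(spam$type, p = 0.75, list = FALSE)
trainset = spam[inTrain, ]
testset = spam[-inTrain, ]

# Same model as fit3: Box-Cox is estimated and stored inside the train object.
set.seed(123)
fit3 = train(type ~ ., data = trainset, method = "glm", preProcess = "BoxCox")

# Predict on the RAW test set, so the stored Box-Cox step is applied once, by predict().
predict3raw = predict(fit3, newdata = testset)
confmat3raw = confusionMatrix(predict3raw, testset$type)
confmat3raw$overall["Accuracy"]
```

If the accuracy from this raw-data prediction comes out close to confmat2, that would point to double preprocessing as the cause of the drop.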
谢谢!