This question is a continuation of the same thread here. Below is a minimal working example from this book:
Wehrens R. Chemometrics with R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. 1st edition. Heidelberg; New York: Springer; 2011. (p. 250).
The example comes from this book and its companion package ChemometricsWithR. It highlights some pitfalls when modeling with cross-validation techniques.
Goal:
A cross-validated approach that uses the same set of repeated CV folds to perform the known strategy of PLS followed by LDA, or by cousins such as logistic regression, SVM, C5.0, or CART, in the spirit of the caret package. So PLS is needed every time before calling the waiting classifier, so that the classification is done in the PLS score space rather than on the observations themselves. The closest approach in the caret package is PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single cross-validation split to test the performance of the classifier; there is no 10-fold CV and no repetition. The code below was taken from the book mentioned above, but with some corrections, as it otherwise throws errors:
library(ChemometricsWithR)
library(MASS) # for lda(); attached explicitly in case it is not loaded as a dependency
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var
odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test
prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)
Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))
tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test
prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])
predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class
library(caret)
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret
Output:
Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153
    P-Value [Acc > NIR] : 0.0001874

                  Kappa : 0.4072
 Mcnemar's Test P-Value : 0.0015385

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643
However, the confusion matrix does not match the one in the book; in any case, the book's code is actually broken, but this version works for me!
Remarks:
Although this is only a single CV split, the intent is first to agree on the methodology: the sd and mean of the training set are applied to the test set, PLUS the transformation into PLS scores for a specific number of components ncomp. I would like this to happen in every round of CV in caret. If the approach as coded here is correct, it can serve as a good starting point for a minimal working example when modifying the code of the caret package.
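To make this concrete, below is a minimal sketch (my own, not from the book) of the repeated CV I would like caret to perform internally: in every fold the mean and sd come from the training rows only, PLS is refit on those rows, and the held-out rows are projected into the PLS score space before LDA sees them. The helper name cv_pls_lda and its defaults are hypothetical:
library(pls)               # plsr(), scores()
library(MASS)              # lda()
library(ChemometricsWithR) # classvec2classmat(), prostate data
# Hypothetical helper: repeated k-fold CV of PLS followed by LDA on the scores,
# with centering/scaling statistics taken from the training rows of each fold.
cv_pls_lda <- function(X, y, ncomp = 16, k = 10, nrep = 5) {
  acc <- numeric(0)
  for (r in seq_len(nrep)) {
    folds <- sample(rep(seq_len(k), length.out = nrow(X)))
    for (i in seq_len(k)) {
      tr  <- folds != i
      mu  <- colMeans(X[tr, , drop = FALSE])     # training means only
      s   <- apply(X[tr, , drop = FALSE], 2, sd) # training sds only
      Xtr <- scale(X[tr, ],  center = mu, scale = s)
      Xte <- scale(X[!tr, ], center = mu, scale = s) # same stats on the holdout
      fit <- plsr(classvec2classmat(y[tr]) ~ Xtr, ncomp = ncomp)
      cls <- lda(scores(fit)[, 1:ncomp], y[tr])      # LDA on the PLS scores
      prd <- predict(cls, Xte %*% fit$projection)$class
      acc <- c(acc, mean(prd == y[!tr]))
    }
  }
  mean(acc) # average holdout accuracy over all folds and repeats
}
# e.g. cv_pls_lda(as.matrix(prostate), prostate.type)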
Side note:
Using scaling and centering can get quite confusing. I think some of the PLS functions in R scale internally, with or without centering; I am not sure either way, so care should be taken when building a custom model in caret to avoid missing, or doubling up on, scaling or centering (I am wary of these things).
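For instance, to my understanding (worth double-checking in ?mvr), pls::plsr() mean-centers internally by default while scaling is off by default, so a sketch like the following would avoid scaling twice when the data were already scaled outside:
# Assumption to verify: mvr()/plsr() defaults are center = TRUE, scale = FALSE.
# If the predictors were already scaled externally, keep scale = FALSE here so
# the data are not scaled a second time; the internal centering is then merely redundant.
prostate.pls2 <- plsr(prostate.clmat ~ prostate, ncomp = 16,
                      subset = odd, scale = FALSE, validation = "CV")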
The danger of multiple centering/scaling
The code below merely illustrates how multiple rounds of centering/scaling change the data. Only centering is shown here, but the same issue applies to scaling.
set.seed(1)
x <- rnorm(200, 2, 1) # mean 2, so clearly not centered
xCentered1 <- scale(x, center = TRUE, scale = FALSE)          # centered once
xCentered2 <- scale(xCentered1, center = TRUE, scale = FALSE) # centered twice
xCentered3 <- scale(xCentered2, center = TRUE, scale = FALSE) # centered three times
sapply(list(xNotCentered = x, xCentered1 = xCentered1,
            xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Output:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
If I have missed something somewhere along the way, please comment. Thanks.
Answer 0 (score: 8)
If you want to fit these kinds of models with caret, you need the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.
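(As a side note, a quick way to check whether the installed copy is recent enough for custom models:)
packageVersion("caret")     # custom models need a recent CRAN release
# install.packages("caret") # update from CRAN if the version is old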
My approach is to fit PLS and the other model (I use random forest in the example below) jointly and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.
The "trick" is to attach the PLS loadings to the random forest object so they can be used at prediction time. Here is the code defining the model (classification only):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Here is the call to train:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <- prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <- prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]

## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
               preProc = c("center", "scale"),
               tuneLength = 10,
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 5))

## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))
Just for kicks, here is what I got:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
and
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Max
Answer 1 (score: 4)
Based on Max's valuable comments, I felt the need for an IRIS referee, a data set famous for classification; more importantly, the Species outcome has more than two classes, which makes it a good data set to test the PLS-LDA custom model in caret:
data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results:
## The output is a set of integers for the rows of iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing  <- iris[-inTrain,] # 36
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                    data = training,
                    method = "pls",
                    tuneLength = 4,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)

set.seed(2)
plsldaFitIris <- train(Species ~ .,
                       data = training,
                       method = modelInfo,
                       tuneLength = 4,
                       trControl = ctrl,
                       preProc = c("center", "scale"))
plsldaFitIris
plot(plsldaFitIris)
Now compare the two models:
getTrainPerf(plsFitIris)
TrainAccuracy TrainKappa method
1 0.8574242 0.7852462 pls
getTrainPerf(plsldaFitIris)
TrainAccuracy TrainKappa method
1 0.975303 0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy Kappa
0.750 0.625
postResample(predict(plsldaFitIris, testing), testing$Species)
Accuracy Kappa
0.9444444 0.9166667
So, finally, there is the EXPECTED difference, along with an improvement in the metrics. This supports Max's point that two-class problems yield identical results because of the Bayesian probability approach of the plsda function.
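A quick way to see that disagreement directly (assuming both iris fits are still in the workspace) is to cross-tabulate the two sets of hold-out predictions; off-diagonal counts mark the cases where the two models differ:
# Plain PLS vs. the PLS-LDA custom model on the same hold-out rows
table(pls    = predict(plsFitIris,    testing),
      plslda = predict(plsldaFitIris, testing))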
Answer 2 (score: 3)
Both plsr and lda center the data in their own way. I took a close look at caret::preProcess(): as it is currently defined, you will not be able to use PLS as a preprocessing method, because PLS is supervised while caret::preProcess() uses only unsupervised methods (there is no way to hand over the dependent variable). That would probably make patching rather difficult.
So, within the caret framework, you need to use a custom model.
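A minimal sketch of the point: preProcess() is fit on the predictors alone, with no argument for the outcome, so a supervised projection such as PLS simply cannot be expressed there (reusing trainX/testX from the answer above):
library(caret)
# preProcess() sees only x; there is no y argument, hence unsupervised only
pp <- preProcess(trainX, method = c("center", "scale"))
testX_pp <- predict(pp, testX) # training-set statistics applied to the test set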
Answer 3 (score: 0)
If the scenario is a custom model of the PLS-LDA type, based on the code kindly provided by Max (the maintainer of caret), something in this code is not correct, but I could not figure out what. I used the same Sonar data set as in the caret vignette and tried to reproduce the results, once using method="pls" and another time using the custom PLS-LDA model below, and the results were exactly identical down to the last digit, which is nonsensical. For benchmarking, a known data set is needed (I think a cross-validated PLS-LDA on the iris data set would fit here, as it is famous for this type of analysis and a cross-validated treatment of it should exist somewhere). Everything should be the same (the set.seed(xxx) and the number of K-fold CV repeats) except for the code in question, in order to compare fairly and judge the code below:
modelInfo <- list(label = "PLS-LDA",
library = c("pls", "MASS"),
type = "Classification",
parameters = data.frame(parameter = c("ncomp"),
class = c("numeric"),
label = c("#Components")),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the lda object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- lda(scores, y, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$class
},
prob = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$posterior
},
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Per Zach's request, the code below uses method="pls" in caret, exactly as in the caret vignette on CRAN:
library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x61
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results:
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] # 157
testing  <- Sonar[-inTrain,] # 51

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might differ slightly from the vignette due to randomness
Now, the code below is a test run that classifies the Sonar data with the custom PLS-LDA model, the code in question; it is expected to give numbers that differ from those obtained with PLS alone:
set.seed(108)
plsldaFitSon <- train(Class ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 15,
                      trControl = ctrl,
                      metric = "ROC",
                      preProc = c("center", "scale"))
Now compare the results between the two models:
getTrainPerf(plsFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 pls
getTrainPerf(plsldaFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom
postResample(predict(plsFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
So the results are exactly identical, which should not be possible. It is as if the lda model was never added?
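One sanity check worth running (a sketch, assuming the fits above are still in the workspace): confirm that the final model really is an lda fit, and whether the test-set predictions are literally identical:
class(plsldaFitSon$finalModel)  # should include "lda" if the custom fit slot ran
identical(predict(plsFitSon,    testing),
          predict(plsldaFitSon, testing)) # TRUE means literally identical classes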