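I'm using the excellent R package caret, and I'd like to run the train function over a list of several training datasets. I realize the documentation for train says the data argument must be a data frame, so what I'm attempting may simply not be possible and might be better proposed as an enhancement to caret, but I'd like to know whether anyone has tried this.
Using the Sonar data for illustration, I created a list of two data frames (named both), each of which is a separate training dataset. I then used mapply to apply train to each element of the list. Unfortunately, I get strange results. Specifically, I expected the metrics in pls1.3..A[[2]] to be identical to those in pls1.3..B2. As you can see, they are not. Oddly, pls1.3..A[[1]] does match pls1.3..B1. Is there something obvious I'm doing wrong, or is this just not possible (for now)? (I'm running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)
Reproducible code (with the output commented out) follows: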
require(doMC)
registerDoMC(cores = 2)
library(caret)
library(mlbench)
data(Sonar)
set.seed(1234)
inTrain <- createDataPartition(y = Sonar$Class,
                               p = .75,
                               list = FALSE)
training <- Sonar[ inTrain,]
training2 <- Sonar[-inTrain,]
both <- list(training, training2)
#both_test <- list(training[c(1:100),], training2[c(1:35),]) #SILLY test data for functionality testing only
set.seed(1234)
labels <- list()
for (i in seq_along(both)) {
  labels[[i]] <- both[[i]]$Class
}
#NEW CODE -- ADDED BASED ON @Josh W's comment -- removing the label (Class) variable from the feature matrix
both <- lapply(both, function(x) {
  x[, 1:60]
})
#NEW CODE -- changed from using the formula implementation of caret to the x (feature matrix), y (label/outcome vector)
pls1.3..A <- mapply(function(x, y) train(x, y,
                                         method = "pls",
                                         preProc = c("center", "scale")),
                    x = both, y = labels, SIMPLIFY = FALSE)
pls1.3..A
#[[1]]
#Partial Least Squares
#157 samples
# 60 predictor
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6889679 0.3756821 0.06015197 0.11605511
# 2 0.7393776 0.4742204 0.04962609 0.09775688
# 3 0.7410997 0.4793703 0.04856698 0.09412599
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
#[[2]]
#Partial Least Squares
#51 samples
#60 predictors
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6452693 0.2929118 0.08076455 0.1525176
# 2 0.6468405 0.2902136 0.09686340 0.1790924
# 3 0.6559113 0.3087227 0.08025215 0.1547317
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
set.seed(1234)
pls1.3..B1 <- train(both[[1]],
                    labels[[1]],
                    method = "pls",
                    preProc = c("center", "scale"))
pls1.3..B1
#Partial Least Squares
#157 samples
# 60 predictor
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6889679 0.3756821 0.06015197 0.11605511
# 2 0.7393776 0.4742204 0.04962609 0.09775688
# 3 0.7410997 0.4793703 0.04856698 0.09412599
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 3.
set.seed(1234)
pls1.3..B2 <- train(both[[2]],
                    labels[[2]],
                    method = "pls",
                    preProc = c("center", "scale"))
pls1.3..B2
#Partial Least Squares
#51 samples
#60 predictors
# 2 classes: 'M', 'R'
#Pre-processing: centered, scaled
#Resampling: Bootstrapped (25 reps)
#Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
#Resampling results across tuning parameters:
# ncomp Accuracy Kappa Accuracy SD Kappa SD
# 1 0.6127279 0.2518488 0.11925682 0.1959400
# 2 0.6792163 0.3618657 0.09386771 0.1776549
# 3 0.6673662 0.3343716 0.07524373 0.1476405
#Accuracy was used to select the optimal model using the largest value.
#The final value used for the model was ncomp = 2.
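One thing worth checking in the comparison above is the state of the random number generator. The mapply version calls set.seed(1234) once, so the second train() call draws its bootstrap resamples from wherever the first call left the generator, whereas pls1.3..B2 is fitted right after a fresh set.seed(1234). That would be consistent with the first elements matching and the second ones not, though it is offered here as a likely factor rather than a confirmed diagnosis. The sketch below uses only base R; draw_resamples is a hypothetical stand-in for the bootstrap index generation, not a caret function.

```r
# Hypothetical stand-in for bootstrap resampling: draws `reps` sets of
# row indices with replacement. Not part of caret.
draw_resamples <- function(n, reps = 3) {
  lapply(seq_len(reps), function(i) sample.int(n, n, replace = TRUE))
}

sizes <- c(157, 51)  # row counts of the two training sets above

# Seed once, then iterate: the second call starts from an advanced RNG state.
set.seed(1234)
seeded_once <- lapply(sizes, draw_resamples)

# Seed before every call, as in the standalone pls1.3..B1 / pls1.3..B2 runs.
seeded_each <- lapply(sizes, function(n) {
  set.seed(1234)
  draw_resamples(n)
})

# The first elements are drawn from the same state and agree.
first_match <- identical(seeded_once[[1]], seeded_each[[1]])
# The second elements start from different states and are expected to differ.
second_match <- identical(seeded_once[[2]], seeded_each[[2]])
print(c(first = first_match, second = second_match))
```

If this is the cause, calling set.seed(1234) inside the function passed to mapply, immediately before train(), should bring pls1.3..A[[2]] in line with pls1.3..B2.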
Answer 0 (score: 0)
If you use the following, you will get (close to) the results you expect:
set.seed(1234)
pls1.3..B <- train(labels[[2]] ~ .,
                   data = both[[2]],
                   method = "pls",
                   preProc = c("center", "scale"))
pls1.3..B
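I believe this is because of the way you specified the formula. A formula of the form object ~ . uses everything in the data that is not the column object. As specified in the mapply call, it is basically external object ~ entire data.frame, including the class label. So I believe it amounts to training with your response variable included in the dataset.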
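The way "." expands can be checked without fitting any model. The following is a minimal base R sketch, assuming a made-up data frame (toy) and a made-up external label vector (external_labels); neither comes from the question's code. It compares the predictor terms when the response is a column of the data with the terms when the response is an object outside the data.

```r
# Toy data frame standing in for a training set that still contains its label.
toy <- data.frame(
  V1    = c(0.1, 0.4, 0.3, 0.9),
  V2    = c(1.2, 0.7, 0.5, 0.2),
  Class = factor(c("M", "R", "M", "R"))
)

# Response is a column of the data: "." leaves that column out.
inside <- attr(terms(Class ~ ., data = toy), "term.labels")

# Response is an external object: "." expands to every column of the data,
# so Class ends up among the predictors.
external_labels <- toy$Class
outside <- attr(terms(external_labels ~ ., data = toy), "term.labels")

print(inside)
print(outside)
```

Dropping the label column from the data before using an external response, as the question's lapply step does, keeps the label out of the predictors.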