Question

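I'm using the excellent R package caret, and I'd like to run the `train` function over a list of several training datasets. I realize the documentation for `train` says the `data` argument must be a data frame, so what I'm attempting may not be possible and might be better raised as an enhancement request for caret, but I'm wondering whether anyone has tried this.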

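Using the Sonar data for illustration, I created a list (named `both`) of two data frames, each of which is a separate training dataset. I then use `mapply` to apply `train` to each element of the list. Unfortunately, the results are not what I expected. Specifically, I expected the metrics in `pls1.3..A[[2]]` to be identical to those in `pls1.3..B2`. As you can see, they are not. Oddly, `pls1.3..A[[1]]` does match `pls1.3..B1`. Is there something obvious I'm doing wrong, or is this simply not possible (for now)? (I'm running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)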

Reproducible code (with the output commented out) follows:

    library(doMC)
    registerDoMC(cores = 2)

    library(caret) 
    library(mlbench) 
    data(Sonar) 
    set.seed(1234) 
    inTrain <- createDataPartition(y = Sonar$Class, 
                                   p = .75, 
                                    list = FALSE) 

    training <- Sonar[ inTrain,] 
    training2  <- Sonar[-inTrain,] 

    both <- list(training, training2) 
    #both_test <- list(training[c(1:100),], training2[c(1:35),]) #SILLY test data for functionality testing only 

    set.seed(1234) 

    # One outcome vector (Class) per training set
    labels <- lapply(both, function(x) x$Class)

    #NEW CODE -- ADDED BASED ON @Josh W's comment -- removing the label (Class) variable from the feature matrix
    both <- lapply(both, function(x) x[, 1:60])

    #NEW CODE -- changed from using the formula implementation of caret to the x (feature matrix), y (label/outcome vector)

    pls1.3..A <- mapply(function(x,y) train(x, y, method = "pls", preProc = c("center", "scale")), x = both, y = labels, SIMPLIFY = FALSE) 
    pls1.3..A 

    #[[1]]
    #Partial Least Squares 

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD  
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3. 

    #[[2]]
    #Partial Least Squares 

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD 
    #  1      0.6452693  0.2929118  0.08076455   0.1525176
    #  2      0.6468405  0.2902136  0.09686340   0.1790924
    #  3      0.6559113  0.3087227  0.08025215   0.1547317

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3.          

    set.seed(1234)
    pls1.3..B1 <- train(both[[1]],
                    labels[[1]],
                    method = "pls",
                    preProc = c("center", "scale"))
    pls1.3..B1
    #Partial Least Squares 

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD  
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3. 

    set.seed(1234)
    pls1.3..B2 <- train(both[[2]],
                    labels[[2]],
                    method = "pls",
                    preProc = c("center", "scale"))
    pls1.3..B2

    #Partial Least Squares 

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R' 

    #Pre-processing: centered, scaled 
    #Resampling: Bootstrapped (25 reps) 

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ... 

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD 
    #  1      0.6127279  0.2518488  0.11925682   0.1959400
    #  2      0.6792163  0.3618657  0.09386771   0.1776549
    #  3      0.6673662  0.3343716  0.07524373   0.1476405

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 2.

Answer 1

You get (something close to) what you expect if you use the following:

    set.seed(1234)
    pls1.3..B <- train(labels[[2]] ~ .,
                       data = both[[2]],
                       method = "pls",
                       preProc = c("center", "scale"))
    pls1.3..B
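
Note also that the call above resets the seed immediately before `train`, whereas the `mapply` version sets it once before both fits, so the second fit may draw different bootstrap resamples. Here is a minimal base-R sketch of that effect, assuming `sample()` as a stand-in for caret's resampling. `boot_mean` and `data_list` are hypothetical names used only for illustration.

```r
# Stand-in for one resampled fit: draw a bootstrap sample and summarise it.
boot_mean <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # reseed on every call when requested
  idx <- sample(length(x), replace = TRUE)
  mean(x[idx])
}

# Two toy "training sets" (made-up numbers, not the Sonar data).
data_list <- list(a = c(2.1, 3.4, 5.9, 7.7, 8.2),
                  b = c(1.5, 4.4, 6.3, 9.8, 0.7))

# Seed set once: the second element starts from the RNG state left by the first.
set.seed(1234)
seeded_once <- lapply(data_list, boot_mean)

# Seed reset for every element.
seeded_each <- lapply(data_list, boot_mean, seed = 1234)

# Standalone run on the second element, seeded immediately beforehand.
set.seed(1234)
solo_b <- boot_mean(data_list$b)

print(identical(seeded_each$b, solo_b))          # TRUE
print(identical(seeded_once$a, seeded_each$a))   # TRUE
```

If the same applies to `train`, calling `set.seed(1234)` inside the function passed to `mapply` should make each list element reproduce its standalone fit. That is an untested assumption here, not something confirmed against caret.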

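I believe the difference comes from the way you specified the formula. An `object ~ .` formula uses everything in the data that is not the column `object`. As specified in the `mapply` call, it is basically `external object ~ entire data.frame`, including the class label. So I believe this is like training with your response variable included in the dataset.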
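
The dot-expansion behaviour described above can be checked with base R alone. This is a small sketch on a made-up data frame, not the Sonar data. `df` and `y_ext` are hypothetical names, and `terms()` is used only to show which predictors the formula resolves to.

```r
# Toy data frame that still contains the outcome column `Class`.
df <- data.frame(
  Class = factor(c("M", "R", "M", "R")),
  V1 = c(0.1, 0.4, 0.2, 0.5),
  V2 = c(1.2, 0.7, 1.1, 0.6)
)

# Hypothetical external copy of the outcome, analogous to `labels[[i]]`.
y_ext <- df$Class

# `.` expands to every column of `data` that is not already in the formula.
leaky <- attr(terms(y_ext ~ ., data = df), "term.labels")
clean <- attr(terms(Class ~ ., data = df), "term.labels")

print(leaky)  # "Class" "V1" "V2": the outcome leaks in as a predictor
print(clean)  # "V1" "V2"
```

With an external response on the left-hand side, the `Class` column is treated as just another predictor unless it is dropped from the data first.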
