Question

While trying to select/remove data frame columns by name with select from dplyr and pass the result to randomForest, I ran into some strange behavior:

library(MASS)
library(dplyr)
library(purrr)
library(randomForest)

train = base::sample(1:nrow(Boston), nrow(Boston)/2)
glimpse(Boston)
p <- ncol(Boston) - 1
ps <- 1:p
map_dbl(ps, ~mean(randomForest(x = select(Boston[train,], -medv), 
                           y = select(Boston[train,], medv), 
                           xtest = select(Boston[-train,], -medv),
                           ytest = select(Boston[-train,], medv),
                           mtry = .x, ntree = 500)$test$mse))

This ends with the following error:

Error in randomForest.default(x = select(Boston[train, ], -medv), y = select(Boston[train, :
  length of response must be the same as predictors
In addition: Warning message:
In randomForest.default(x = select(Boston[train, ], -medv), y = select(Boston[train, :
  The response has five or fewer unique values. Are you sure you want to do regression?

However, when I define x, y, xtest and ytest with base R, the call works:

map_dbl(ps, ~mean(randomForest(x = Boston[train, -14], 
                           y = Boston[train, 14], 
                           xtest = Boston[-train, -14],
                           ytest = Boston[train, 14],
                           mtry = .x, ntree = 500)$test$mse))

 [1] 119.9225 132.5212 136.7131 139.7398 142.9167 144.2151 145.0587 146.9056 148.7087 148.1903 150.3910
[12] 151.5579 151.2323

So I checked whether these two ways of subsetting my data set give the same result... and they do:

all(select(Boston[train,], -medv) == Boston[train, -14])
all(select(Boston[train,], medv) == Boston[train, 14])
all(select(Boston[-train,], -medv) == Boston[-train, -14])
all(select(Boston[-train,], medv) == Boston[-train, 14])

All of these return TRUE. Why does the first subsetting method, using select, end up causing an error in the randomForest model? And what would be another way of removing columns by name? (Something like a negated column name in base R indexing obviously doesn't work.)

Answer 1

The problem lies with the y arguments (y and ytest) in randomForest. They need to be vectors, not data.frames.

If you use dplyr::select, a data.frame is always returned.

str(dplyr::select(Boston, medv))
'data.frame':   506 obs. of  1 variable:
 $ medv: num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Compare that with selecting a single column via base R:

str(Boston[, 14])
 num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

To get the same result as dplyr when selecting a single column in base R, you need to add drop = FALSE to the data.frame subsetting call.

str(Boston[, 14, drop = FALSE])
'data.frame':   506 obs. of  1 variable:
 $ medv: num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
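
This difference likely accounts for both messages in the question. The sketch below assumes that randomForest compares length(y) with the number of rows of x, which is what the error text suggests; it was not checked against the package source here. What is certain is that length() of a data.frame counts columns, not rows. The names y_df and y_vec are made up for illustration.

```r
library(MASS)  # provides the Boston data set

# One-column data.frame, the same shape that dplyr::select(Boston, medv) returns
y_df  <- Boston[, "medv", drop = FALSE]
# Plain numeric vector, what Boston[, 14] returns
y_vec <- Boston[, "medv"]

length(y_df)          # 1: a data.frame's length is its number of columns
length(y_vec)         # 506: one element per observation
nrow(Boston)          # 506

# unique() on a data.frame returns a data.frame, so its length is still the
# column count, which would look like "five or fewer unique values"
length(unique(y_df))  # 1
```

With a one-column data.frame as the response, the length seen by the function is 1 rather than the number of training rows, which matches the "length of response must be the same as predictors" message.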

To make your code work, you can use as_vector from purrr to coerce the data.frame containing medv into a vector.

map_dbl(ps, ~mean(randomForest(x = dplyr::select(Boston[train,], -medv), 
                               y = as_vector(dplyr::select(Boston[train,], medv)), 
                               xtest = dplyr::select(Boston[-train,], -medv),
                               ytest = as_vector(dplyr::select(Boston[-train,], medv)),
                               mtry = .x, ntree = 500)$test$mse)) 

[1] 22.36214 15.52031 13.24707 12.22685 12.32809 11.82220 11.91149 11.65336 12.05399 12.16599 12.63174 12.79196 12.41167

Answer 2

Running the following code, we can see that the second and fourth comparisons are in fact different.

identical(select(Boston[train,], -medv), Boston[train, -14])
# [1] TRUE
identical(select(Boston[train,], medv), Boston[train, 14])
# [1] FALSE
identical(select(Boston[-train,], -medv), Boston[-train, -14])
# [1] TRUE
identical(select(Boston[-train,], medv), Boston[-train, 14])
# [1] FALSE

The key point is that select(Boston[train,], medv) returns a data frame, while Boston[train, 14] returns a vector. It looks like we need to supply vectors for the y and ytest arguments.
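
As a small illustration of why the all(... == ...) checks in the question returned TRUE while identical() returns FALSE here: == compares values element by element and ignores the container, whereas identical() also compares the object type. The sketch uses only base R indexing to build the two shapes; df_col and vec_col are made-up names.

```r
library(MASS)  # provides the Boston data set

df_col  <- Boston[, "medv", drop = FALSE]  # one-column data.frame
vec_col <- Boston[, "medv"]                # numeric vector

# Element-wise comparison: every value matches, so all() is TRUE
all(df_col == vec_col)

# Whole-object comparison: a data.frame is not a numeric vector
identical(df_col, vec_col)

# Extracting the column from the data.frame makes the two objects identical
identical(df_col[[1]], vec_col)
```

So all(x == y) is a test of values only; it cannot tell a one-column data frame from a vector.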

Therefore, the following will work, because pull from the dplyr package returns a vector.

map_dbl(ps, ~mean(randomForest(x = select(Boston[train,], -medv), 
                               y = pull(Boston[train,], medv), 
                               xtest = select(Boston[-train,], -medv),
                               ytest = pull(Boston[-train,], medv),
                               mtry = .x, ntree = 500)$test$mse))

We can also use pluck from the purrr package.

map_dbl(ps, ~mean(randomForest(x = select(Boston[train,], -medv), 
                               y = pluck(Boston[train,], "medv"), 
                               xtest = select(Boston[-train,], -medv),
                               ytest = pluck(Boston[-train,], "medv"),
                               mtry = .x, ntree = 500)$test$mse))

One last thing: I think that in your second example the ytest argument should be Boston[-train, 14]; you are missing the minus sign.
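
On the question of dropping a column by name without dplyr, here is a minimal base R sketch. Unary minus is not defined for character strings, so a negated name cannot be used as an index; one common workaround is to build the vector of names to keep with setdiff(). The seed and the variable names (predictors, x, y, xtest, ytest) are arbitrary choices for illustration.

```r
library(MASS)  # provides the Boston data set

set.seed(1)  # arbitrary seed, only so the split is reproducible
train <- sample(seq_len(nrow(Boston)), nrow(Boston) %/% 2)

# Every column name except the response
predictors <- setdiff(names(Boston), "medv")

x     <- Boston[train, predictors]   # data.frame of predictor columns
y     <- Boston[train, "medv"]       # numeric vector (drop = TRUE is the default)
xtest <- Boston[-train, predictors]
ytest <- Boston[-train, "medv"]

# subset() also accepts a negated bare column name in its select argument
x_alt <- subset(Boston[train, ], select = -medv)
```

These four objects have the same shapes as the working base R call in the question, so they should be usable as the x, y, xtest and ytest arguments of randomForest.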

Removing columns with tidyverse and base R - difference