Question

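I am trying to use a random forest model as one of several models I am testing, including neural networks (nnet and neuralnet), all through the convenient caret package. Random forest models support factors, so for this model, instead of using dummyVars() to convert the factors into numeric contrasts, I thought I would just leave them as factors. This works fine in the training step (train()):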

library(caret)

#Set dependent
seed = 123
y = "Sepal.Length"

#Partition (iris) data into train and test sets
set.seed(seed)
train.idx = createDataPartition(y = iris[,y], p = .8, list = FALSE)
train.set = iris[train.idx,]
test.set = iris[-train.idx,]

train.set = data.frame(train.set)
test.set = data.frame(test.set)

#Select features
features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")
mod.features = paste(features, collapse = " + ")

#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))

#Train model
mod <- train(mod.formula, data = train.set,
             method = "rf")

But when I try to use extractPrediction(), it fails:

#Test model with extractPrediction()
testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])

Error in predict.randomForest(modelFit, newdata) : variables in the training data missing in newdata

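As far as I can tell, this is because during the call to train(), one-hot encoding/contrasts were created for the factors, so some new variable names were created. The basic predict() method seems to work fine even with factors: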

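#Test model with predict()
#The argument is `newdata` (lowercase); a misspelled `newData` is silently
#absorbed by `...`. Predicting from the train object (not mod$finalModel)
#lets caret apply the same factor encoding it used in train().
testPred = predict(mod, newdata = test.set[, features])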
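
The renaming that a formula interface introduces can be seen with base R alone. This is a minimal sketch, assuming that train()'s formula method expands factors with model.matrix(); the variable names expanded_names, supplied_names, and missing_names are illustrative only.

```r
# Expand the formula the way a formula interface typically does
f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
expanded <- colnames(model.matrix(f, data = iris))

# Columns the fitted model would see (dropping the intercept)
expanded_names <- setdiff(expanded, "(Intercept)")
print(expanded_names)

# Columns supplied as testX in the failing extractPrediction() call
supplied_names <- c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Names the model expects but testX does not contain
missing_names <- setdiff(expanded_names, supplied_names)
print(missing_names)
```

If the assumption holds, the fitted forest knows the factor only by its contrast columns, while testX still carries a single Species column, which would be consistent with the error message above.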

When I use dummyVars() to convert my factors to numeric contrasts, extractPrediction() works fine:

#Train and test model using dummyVar
data.dummies = dummyVars(~.,data = iris)
data = predict(data.dummies, newdata = iris)

set.seed(seed)
train.idx = createDataPartition(y = data[,y], p = .8, list = FALSE)
train.set = data[train.idx,]
test.set = data[-train.idx,]

features = c("Sepal.Width", "Petal.Length", "Petal.Width", "Species.setosa",
             "Species.versicolor", "Species.virginica")
mod.features = paste(features, collapse = " + ")

#Create formula
mod.formula = as.formula(paste(y, mod.features, sep = " ~ "))

train.set = data.frame(train.set)
test.set = data.frame(test.set)

mod <- train(mod.formula, data = train.set,
             method = "rf")

testPred = extractPrediction(models = list(mod),
                             testX = test.set[,features],
                             testY = test.set[,y])

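Can anyone explain why this happens? It would be great to get extractPrediction() working with factors for use in my multi-model testing pipeline. I suppose I could just convert everything with dummyVars() at the start, but I am curious why extractPrediction() does not work with factors in this case, even though predict() does.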

Answer 1

If you use the default (x/y) interface of train() rather than the formula interface, you should be in business.

set.seed(1234)
mod_formula <- train(
    Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
  , data = iris
  , method = "rf")

test_formula <- extractPrediction(
    models = list(mod_formula)
)

set.seed(1234)
mod_default <- train(
    y = iris$Sepal.Length
  , x = iris[, c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species')]
  , method = "rf")

test_default <- extractPrediction(
  models = list(mod_default)
)
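
To tie this back to the held-out test set in the question, here is a minimal sketch, assuming the x/y interface leaves Species as a factor so that the columns in testX match those seen during training. The names train_set, test_set, mod_xy, and test_pred are illustrative, and the 5-fold cross-validation in trainControl() is only there to keep the run short.

```r
library(caret)

features <- c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# Same 80/20 split as in the question
set.seed(123)
train_idx <- createDataPartition(y = iris$Sepal.Length, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Default (x/y) interface: predictors are passed as a data frame, factor included
mod_xy <- train(
    x = train_set[, features]
  , y = train_set$Sepal.Length
  , method = "rf"
  , trControl = trainControl(method = "cv", number = 5))

# Pass the held-out rows with the same columns used for training
test_pred <- extractPrediction(
    models = list(mod_xy)
  , testX = test_set[, features]
  , testY = test_set$Sepal.Length)

# Rows are labelled by dataType, so training and test predictions can be separated
table(test_pred$dataType)
```

The design choice here is simply to keep the predictor columns identical between train() and extractPrediction(), so no contrast columns need to be recreated at prediction time.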

Does extractPrediction() support factors?
