Question

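I use caret a lot for machine learning tasks in R, and I like it very much.

But I am facing the following problem:

I train a model with caret, using lm()
When I want predictions for new data, I call: predict(model, new_data)
When new_data contains missing values in my predictors, predict() returns no prediction for those rows, instead of, say, NA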

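Is it possible to either:

return a prediction for every row in new_data, with NA where no prediction is possible, or
return the predictions together with the row numbers of the data frame that they correspond to?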

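E.g. as in the mlr package, where an id column shows which row each prediction corresponds to.

Here is the link to the mlr predict page, which has more details: mlr-package: predict with row-id

Any help is greatly appreciated!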

Answer 1

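You can identify the cases with missing values before running caret::train() by creating a new column in the data set that holds the row names, since these default to the row numbers of the data frame.

Using the Sonar data set from the mlbench package as an illustration: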

library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)

# add row numbers
Sonar$rowId <- rownames(Sonar)
# create training & testing data sets

inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set column 60 to NA for some values in test data
testing[48:51,60] <- NA
testing[!complete.cases(testing),"rowId"]

...and the output:

> testing[!complete.cases(testing),"rowId"]
[1] "193" "194" "200" "206"

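You can then run predict() on only those rows of the test data set that are complete cases. Again using the Sonar data set, with a random forest model and 3-fold cross-validation to speed up processing: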

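fitControl <- trainControl(method = "cv", number = 3)
# fit on the training set only; drop the rowId helper column so it is not used as a predictor
fit <- train(Class ~ ., data = training[, names(training) != "rowId"],
             method = "rf", trControl = fitControl)
# predict only for the test rows that have no missing values
predicted <- predict(fit, testing[complete.cases(testing), ])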
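To get one result per row of the test set, the predictions can be matched back to the stored row IDs, leaving NA where a row had missing predictors. The following is only a sketch of that idea, not part of the original answer: the names `predictors`, `complete`, `idx` and `results` are made up for illustration, and it uses the x/y interface of train() so that the rowId column never enters the model.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(95014)

# keep the original row numbers in a column
Sonar$rowId <- rownames(Sonar)
# predictor columns only (illustrative helper, excludes outcome and id)
predictors <- setdiff(names(Sonar), c("Class", "rowId"))

inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing  <- Sonar[-inTraining, ]
# set column 60 to NA for some rows of the test data
testing[48:51, 60] <- NA

fitControl <- trainControl(method = "cv", number = 3)
fit <- train(x = training[, predictors], y = training$Class,
             method = "rf", trControl = fitControl)

# predict only for rows whose predictors are all present
complete <- complete.cases(testing[, predictors])
preds <- predict(fit, testing[complete, predictors])

# position of each test row within preds, NA for incomplete rows
idx <- rep(NA_integer_, nrow(testing))
idx[complete] <- seq_along(preds)

# one line per test row: its id and its prediction (NA if not predictable)
results <- data.frame(rowId = testing$rowId,
                      predicted = preds[idx],
                      stringsAsFactors = FALSE)
```

Here `results` has as many rows as `testing`, so `results[is.na(results$predicted), "rowId"]` lists the rows that could not be predicted.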

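Another way to handle this situation is to use an imputation strategy to eliminate the missing values in the model's independent variables. My article on Github, Strategies for Handling Missing Values, links to several research papers on the topic.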
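As a rough illustration of the imputation route, one option is caret's preProcess() with method = "medianImpute", which fills each missing value with the median of that predictor in the training data. Median imputation is just an assumption chosen here for simplicity, not a recommendation from the article mentioned above, and the names `imputer`, `testing_imputed` and `results` are illustrative.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(95014)

Sonar$rowId <- rownames(Sonar)
predictors <- setdiff(names(Sonar), c("Class", "rowId"))

inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing  <- Sonar[-inTraining, ]
# set column 60 to NA for some rows of the test data
testing[48:51, 60] <- NA

# learn the medians from the training predictors only
imputer <- preProcess(training[, predictors], method = "medianImpute")
# fill the missing test values with those training medians
testing_imputed <- predict(imputer, testing[, predictors])

fitControl <- trainControl(method = "cv", number = 3)
fit <- train(x = training[, predictors], y = training$Class,
             method = "rf", trControl = fitControl)

# every test row now gets a prediction; flag the ones that relied on imputation
results <- data.frame(rowId = testing$rowId,
                      predicted = predict(fit, testing_imputed),
                      imputed = !complete.cases(testing[, predictors]),
                      stringsAsFactors = FALSE)
```

Keeping the `imputed` flag next to each prediction makes it easy to treat those rows with more caution later.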
