Question

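R's predict function accepts a newdata argument, which is documented as:

newdata: An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.

But I found that this is not entirely true, depending on how the model was fitted. For example, the following code works as expected: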

x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train = sample(1:length(x), size=length(x)/2, replace=F)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))

But if the model is fitted like this:

m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))

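the last two lines produce exactly the same result. predict appears to ignore the newdata argument, i.e. it does not actually compute predictions for the new data at all.

Of course, the culprit is the difference between lm(y ~ x, data=dataTrain) and lm(dataTrain$y ~ dataTrain$x). But I could not find any documentation that mentions the difference between the two. Is this a known issue?

I am using R 2.15.2.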

Answer 1

See ?predict.lm and its Note section, which I quote below:

Note:

     Variables are first looked for in ‘newdata’ and then searched for
     in the usual way (which will include the environment of the
     formula used in the fit).  A warning will be given if the
     variables found are not of the same length as those in ‘newdata’
     if it was supplied.

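Although the Note does not spell out the behaviour in terms of "same name" and so on, as far as the formula is concerned the terms you passed to it are of the form foo$var, and there are no variables with names like that in newdata or on the search path that R traverses to look for them.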
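To see which names predict will try to resolve, one can inspect the term labels stored in each fitted model. The following is a minimal sketch that recreates the question's data; the seed is an arbitrary addition for reproducibility and is not part of the original code.

```r
set.seed(1)  # arbitrary seed, assumed here only so the sketch is reproducible
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size = length(x) / 2)
dataTrain <- data[train, ]

m  <- lm(y ~ x, data = dataTrain)
m2 <- lm(dataTrain$y ~ dataTrain$x)

# Term labels recorded in the model: these are the names looked up in newdata
labels_m  <- attr(terms(m),  "term.labels")
labels_m2 <- attr(terms(m2), "term.labels")
print(labels_m)         # the plain column name
print(labels_m2)        # the whole expression, including the data frame name
print(names(coef(m2)))  # the coefficient carries the same label
```

The second model's predictor is labelled with the full expression dataTrain$x, so a column called x in newdata never matches it.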

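In your second case, you are completely misusing the model formula notation; the idea is to describe the model succinctly and symbolically. Succinctness and repeating the data object ad nauseam are not compatible.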

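The behaviour you observed is exactly consistent with the documented behaviour. Put simply, you fitted the model with the terms dataTrain$x and dataTrain$y and then tried to predict for the terms x and y. As far as R is concerned, these are different names and hence different things, and it is right not to match them.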
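The difference can be checked directly. The sketch below rebuilds the question's data (with an arbitrary seed added for reproducibility) and compares each model's predictions on dataTest against what one would expect. Note that no length warning is raised here only because the training and test sets happen to have the same number of rows.

```r
set.seed(42)  # arbitrary seed, assumed here only so the sketch is reproducible
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_len(nrow(data)), size = nrow(data) / 2)
dataTrain <- data[train, ]
dataTest  <- data[-train, ]

# Terms are dataTrain$y and dataTrain$x: newdata has no such variables,
# so R falls back to the dataTrain object found via the formula's environment.
m2 <- lm(dataTrain$y ~ dataTrain$x)
p2 <- predict(m2, newdata = dataTest)
same_as_fitted <- isTRUE(all.equal(unname(p2), unname(fitted(m2))))

# Terms are y and x: the x column of newdata is found and used.
m <- lm(y ~ x, data = dataTrain)
p <- predict(m, newdata = dataTest)
manual <- coef(m)[["(Intercept)"]] + coef(m)[["x"]] * dataTest$x
uses_newdata <- isTRUE(all.equal(unname(p), manual))

print(same_as_fitted)  # TRUE: m2 just reproduces its fitted values
print(uses_newdata)    # TRUE: m predicts from dataTest$x
```

Fitting with bare column names plus the data argument keeps the formula free of any reference to a particular data frame, which is what lets newdata supply the values at prediction time.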
