Question

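I am working with the Ames, Iowa housing price dataset.

I have a training set and a test set. The test set is missing the dependent variable SalePrice (there is no SalePrice column).

I have fitted a linear model, and now I am trying to predict SalePrice values on the test set. But when I do, I always get the same predicted SalePrice values, no matter which model I use.

Then, when I try to calculate the RMSE, I get NA.

Here is my model: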

lm2 <- lm(SalePrice ~ 
       GarageCars + 
       Neighborhood + 
       I(OverallQual^2) + OverallQual + 
       OverallQual*GrLivArea + 
       log2(LotArea) + 
       log2(GrLivArea) + 
       KitchenQual +
       I(TotalBsmtSF^2) +
       TotalBsmtSF
       , data=train)

# Add an empty column to the test set, 
# to be later filled in by predictions 
# (Is this even necessary?):
test[, "SalePrice"] <- NA

# My predictions:
predictions <- predict(lm2, newdata = test)
head(predictions)
   1        2        3        4        5        6 
121093.5 170270.7 170029.5 187012.1 239359.2 172962.1

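No matter which model I use, I get the same values. I suspect I simply don't understand predict(). I suspect I am getting predictions based on my training set rather than my test set.

I know the variable names need to match the names used in the model exactly, but what other aspects of prediction am I not understanding? Do I need to apply the same predictor transformations to the test set? Do I have to create variables to hold them?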
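
To make the transformation question concrete, here is a minimal sketch on R's built-in mtcars data (not the Ames data; the model, the split, and the names train_demo, test_demo, and fit_demo are purely illustrative assumptions). It suggests that transformations written inside the formula, such as log2() and I(x^2), are re-applied by predict() to the raw columns in newdata, and that newdata does not need a response column at all:

```r
# Illustrative only: built-in mtcars data, not the Ames housing data.
train_demo <- mtcars[1:24, ]
test_demo  <- mtcars[25:32, ]
test_demo$mpg <- NULL   # mimic a test set with no response column

# The transformations live inside the formula itself.
fit_demo <- lm(mpg ~ log2(hp) + I(wt^2) + wt, data = train_demo)

# predict() evaluates log2(hp) and I(wt^2) on the raw newdata columns.
pred_test  <- predict(fit_demo, newdata = test_demo)
pred_train <- predict(fit_demo)   # no newdata: fitted values for the training rows

length(pred_test)   # one prediction per row of test_demo
names(pred_test)    # row names of test_demo, not of train_demo
```

If the names and length of the predictions match the test set, the predictions are being computed from newdata rather than from the training rows.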

Then I calculate the RMSE:

# Formula function for calculating RMSE:
rmse <- function(actual, pred) sqrt(mean((actual-pred)^2))

# Calculate rmse on test set:
rmse(test$SalePrice, predictions)
[1] NA
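
For reference, a minimal sketch of how this rmse() behaves when the actual values are all NA, as test$SalePrice is after being filled in above. The vectors are made-up numbers for illustration only, and actual_demo stands in for a hypothetical held-out slice of the training data where SalePrice is known:

```r
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Made-up numbers for illustration only.
pred_demo   <- c(100, 200, 300)
actual_na   <- rep(NA_real_, 3)   # like test$SalePrice after filling with NA
actual_demo <- c(110, 190, 300)   # like known prices in a held-out slice

rmse(actual_na, pred_demo)     # NA: arithmetic with NA propagates NA
rmse(actual_demo, pred_demo)   # a number, because the actual values are known
```

Under these assumptions, an RMSE can only be computed where true SalePrice values exist, for example on a validation split taken from the training set.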

Can you tell me what I am doing wrong? Let me know if you need to see the data.

Linear predictions are always the same regardless of the features in the model
