Question

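In the past I have used the lm function with both matrix data and data.frame data. But I think this is the first time I have tried to use predict with a model that was fitted without a data.frame, and I can't figure out how to make it work.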

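I have read some other questions (such as Getting Warning: " 'newdata' had 1 row but variables found have 32 rows" on predict.lm), and I am fairly sure my problem is related to the coefficient names I get after fitting the model. For some reason the coefficient names are the matrix name pasted together with the column names, and I have not been able to find a way around it.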

library(tidyverse)
library(MASS)

set.seed(1)
label <- sample(c(T,F), nrow(Boston), replace = T, prob = c(.6,.4))

x.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.train <- Boston %>% dplyr::filter(., label) %>%
  dplyr::select(medv) %>% as.matrix()
x.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(-medv) %>% as.matrix()
y.test <- Boston %>% dplyr::filter(., !label) %>%
  dplyr::select(medv) %>% as.matrix()

fit_lm <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)
predict(object = fit_lm, newdata = x.test %>% as.data.frame()) %>% length() 
predict(object = fit_lm2, newdata = x.test %>% as.data.frame()) %>% length()
# the two calls return different numbers of predictions
# the first returns as many results as there are rows in x.train, not x.test

Any help would be welcome.

Answer 1

I don't use this package, so I can't fix your tidyverse code. But I can explain why predict fails in the first case.

Let me use the built-in dataset trees for a demonstration:

head(trees, 2)
#  Girth Height Volume
#1   8.3     70   10.3
#2   8.6     65   10.3

The usual way to use lm is

fit <- lm(Girth ~ ., trees)

The variable names (on the RHS of ~) are

attr(terms(fit), "term.labels")
#[1] "Height" "Volume"

When you use newdata, you need to provide these variables to predict.

predict(fit, newdata = data.frame(Height = 1, Volume = 2))
#       1 
#11.16125

Now, if you fit the model using a matrix:

X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)
attr(terms(fit2), "term.labels")
#[1] "X"
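This single term label is also why the coefficient names look like the matrix name pasted onto the column names, as described in the question. A minimal sketch, reusing the X, y and fit2 objects from the example above:

```r
# Refit the matrix-based model from the example above
X <- as.matrix(trees[2:3])
y <- trees[[1]]
fit2 <- lm(y ~ X)

# The model has one term, "X"; each coefficient is named by pasting
# that term label onto the matrix column name
coef_names <- names(coef(fit2))
coef_names
# "(Intercept)" "XHeight" "XVolume"
```

So the pasted names are only labels; what predict looks up in newdata is the term X itself.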

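The variable you now need to provide in newdata for predict is X, not Height or Volume. Note that because X is a matrix variable, you need to protect it with I() when putting it into a data frame.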

newdat <- data.frame(X = I(cbind(1, 2)))
str(newdat)
#'data.frame':  1 obs. of  1 variable:
# $ X: AsIs [1, 1:2] 1 2

predict(fit2, newdat)
#       1 
#11.16125

It does not matter that cbind(1, 2) has no column names. What matters is that this matrix is named X in newdat.
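Applied to the Boston example from the question, the same idea might look like the sketch below. It is written in base R rather than tidyverse, and the names newdat, pred_wrong, pred_ok and pred_ref are introduced here only for illustration. It assumes the model is fitted in the same environment where x.train lives, which is why predict silently falls back to the training matrix when newdata has no column called x.train.

```r
library(MASS)  # provides the Boston data set

set.seed(1)
label <- sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE, prob = c(.6, .4))

# Base R equivalents of the train/test matrices from the question
x.train <- as.matrix(Boston[label, names(Boston) != "medv"])
y.train <- Boston$medv[label]
x.test  <- as.matrix(Boston[!label, names(Boston) != "medv"])

fit_lm  <- lm(y.train ~ x.train)
fit_lm2 <- lm(medv ~ ., data = Boston, subset = label)

# newdata has no variable named "x.train", so predict finds the training
# matrix in the formula's environment and warns about the row mismatch
pred_wrong <- suppressWarnings(
  predict(fit_lm, newdata = as.data.frame(x.test))
)
length(pred_wrong) == nrow(x.train)  # one prediction per training row

# Supply the test matrix under the name the model expects, protected by I()
newdat  <- data.frame(x.train = I(x.test))
pred_ok <- predict(fit_lm, newdata = newdat)
length(pred_ok) == nrow(x.test)      # one prediction per test row

# The formula/data.frame model gives matching predictions
pred_ref <- predict(fit_lm2, newdata = as.data.frame(x.test))
all.equal(unname(pred_ok), unname(pred_ref))
```

If renaming feels awkward, fitting with a data.frame and a formula, as fit_lm2 does, avoids the issue entirely.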
