Question

The standard way of doing a linear regression is something like this:

l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)

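and then using predict(l, new_data) to make predictions, where new_data is a data frame with columns matching the formula. But lm() returns an lm object, which is a list full of things that are irrelevant in most situations. This includes a copy of the original data, plus a bunch of named vectors and arrays whose length/size matches that of the data: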

R> str(l)
List of 12
 $ coefficients : Named num [1:3] 3.587 -0.257 0.364
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
 $ residuals    : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ effects      : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
  ..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
 $ rank         : int 3
 $ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
  ..$ qr   : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
  .. ..- attr(*, "assign")= int [1:3] 0 1 2
  ..$ qraux: num [1:3] 1.08 1.1 1.01
  ..$ pivot: int [1:3] 1 2 3
  ..$ tol  : num 1e-07
  ..$ rank : int 3
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 147
 $ xlevels      : Named list()
 $ call         : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
 $ terms        :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
  .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
  .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
  .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
  .. ..- attr(*, "order")= int [1:2] 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
 $ model        :'data.frame':  150 obs. of  3 variables:
  ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
  .. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
  .. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
  .. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
  .. .. ..- attr(*, "order")= int [1:2] 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
 - attr(*, "class")= chr "lm"

That stuff takes up a lot of space, and the lm object ends up being almost an order of magnitude larger than the original dataset:

R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes

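That is not a problem with a small dataset, but it really is a problem with a 170Mb dataset that produces a 450Mb lm object. Even with all the return options set to FALSE, the lm object is still more than four times the size of the original dataset: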

R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes
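
To see where the remaining bytes go, one can measure each component of the slimmed-down fit separately. This is only a sketch: the names fit_small and component_sizes are made up for illustration (fit_small is used instead of ls to avoid masking base::ls), and no sizes are quoted because they vary by R version and platform.

```r
# Same call as above, stored under a name that does not mask base::ls.
fit_small <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data = iris,
                model = FALSE, x = FALSE, y = FALSE, qr = FALSE)

# Size in bytes of every remaining component, largest first.
component_sizes <- sort(sapply(fit_small, object.size), decreasing = TRUE)
print(component_sizes)
```

The model and qr components are gone, but residuals, effects and fitted.values are still stored, each with one named entry per observation.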

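Is there a way to fit a model in R and then predict output values for new input data without storing lots of unnecessary extra data? In other words, is there a way to store just the model coefficients, but still be able to use those coefficients to predict on new data?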

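Edit: I guess, in addition to not storing all that excess data, I am also really interested in a way of using lm so that it does not even calculate that data. It is just wasted CPU time...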

Answer 1

You can use biglm:

library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)

Since biglm does not store the data in the output object, you need to provide your data when making predictions:

p <- predict(m, newdata=iris)

The size of a biglm object depends on the number of parameters, not on the number of observations:

> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes

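biglm also lets you update the model with a new chunk of data using the update method. With this, you can also estimate models when the full dataset does not fit in memory.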
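
The idea behind chunked fitting can be illustrated in base R alone. The sketch below is not how biglm works internally (biglm uses an incremental QR decomposition); it swaps in the simpler normal-equations approach, accumulating X'X and X'y chunk by chunk, purely to show that only objects whose size depends on the number of parameters need to stay in memory. The three-way split of iris is a stand-in for data read in pieces, and all variable names are illustrative.

```r
form <- Sepal.Width ~ Petal.Length + Petal.Width

# Stand-in for data arriving in pieces (e.g. read from disk in blocks).
chunks <- split(iris, rep(1:3, length.out = nrow(iris)))

xtx <- 0  # running sum of X'X (p x p)
xty <- 0  # running sum of X'y (p x 1)
for (chunk in chunks) {
  X <- model.matrix(form, chunk)
  y <- chunk$Sepal.Width
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, y)
}

# Solve the normal equations and compare with a fit on the full data.
beta_chunked <- drop(solve(xtx, xty))
beta_lm <- coef(lm(form, data = iris))
print(all.equal(unname(beta_chunked), unname(beta_lm)))
```

Normal equations are numerically less stable than a QR-based fit when predictors are nearly collinear, which is one reason to prefer biglm for real work.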

Answer 2

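The only components of the lm object that you need to compute predicted values are terms and coefficients. However, predict.lm will complain if you delete the qr component (which is needed to compute term-by-term effects and standard errors), so you will need to roll your own prediction function. Something like this should do.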

m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
     m$rank <- m$assign <- NULL

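# Minimal replacement for predict.lm: point predictions only.
predict0 <- function(object, newdata)
{
    # Drop the response so newdata only needs the predictor columns.
    tt <- delete.response(terms(object))
    mm <- model.matrix(tt, newdata)
    drop(mm %*% object$coefficients)
}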

predict0(m, iris[1:10,])
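
The same idea can be taken one step further by copying only the needed pieces into a fresh list instead of deleting components one by one. This is a sketch under assumptions: slim_lm and predict_slim are hypothetical helper names, default contrasts are assumed, and xlevels is kept so that factor predictors are encoded the same way as at fit time.

```r
# Keep only what point prediction needs.
slim_lm <- function(fit) {
  list(coefficients = coef(fit),
       terms = delete.response(terms(fit)),  # predictors only
       xlevels = fit$xlevels)                # factor levels seen at fit time
}

predict_slim <- function(object, newdata) {
  mf <- model.frame(object$terms, newdata, xlev = object$xlevels)
  X <- model.matrix(object$terms, mf)
  drop(X %*% object$coefficients)
}

fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
slim <- slim_lm(fit)

# newdata does not need the response column.
newdata <- iris[c(1, 51, 101), c("Petal.Length", "Species")]
print(all.equal(unname(predict_slim(slim, newdata)),
                unname(predict(fit, newdata))))
```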

Answer 3

I think there are two ways to approach this:

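1. Use lm and trim the fat afterwards. For a very nice and instructive discussion, see e.g. here and here. This will not solve the "computation time" issue.
2. Do not use lm.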

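If you go for the second option, you can easily write the matrix operations yourself so that you only get the predicted values. If you prefer a canned routine, you can try other packages that implement least squares, e.g. fastLm from the RcppArmadillo package (or its Eigen version, or biglm, as others pointed out), which store much less information. This approach has some benefits, such as keeping the formula interface.

l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes

fastLm is also fast, if computation time is a concern.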
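
Writing the matrix operations yourself might look like the sketch below. It assumes a QR-based solve via base R's qr and qr.coef; fit_coef is a hypothetical helper name, and the hand-built design matrix assumes the column order intercept, Petal.Length, Petal.Width.

```r
# Return only the coefficient vector; nothing of length n is kept.
fit_coef <- function(formula, data) {
  mf <- model.frame(formula, data)
  X <- model.matrix(attr(mf, "terms"), mf)
  y <- model.response(mf)
  qr.coef(qr(X), y)
}

beta <- fit_coef(Sepal.Width ~ Petal.Length + Petal.Width, iris)

# Predict by multiplying a design matrix (intercept column first) by beta.
X_new <- cbind(1, as.matrix(iris[1:5, c("Petal.Length", "Petal.Width")]))
pred <- drop(X_new %*% beta)

# Check against lm on the same rows.
ref <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
print(all.equal(unname(pred), unname(predict(ref, iris[1:5, ]))))
```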

For comparison, here is a small benchmark:

{{1}}
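
A comparison along these lines can be sketched with base R only, timing lm against the lower-level lm.fit and .lm.fit, which skip the formula handling. The simulated data, its size and all variable names are assumptions made for illustration. Timings depend on the machine, so none are quoted here, and biglm and fastLm are left out to keep the sketch free of extra packages.

```r
set.seed(1)
n <- 1e5
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Design matrix built by hand for the matrix-based fitters.
X <- cbind(1, d$x1, d$x2)

t_lm       <- system.time(f1 <- lm(y ~ x1 + x2, data = d))
t_lmfit    <- system.time(f2 <- lm.fit(X, d$y))
t_dotlmfit <- system.time(f3 <- .lm.fit(X, d$y))

# Elapsed seconds for each approach.
elapsed <- c(lm = t_lm[["elapsed"]],
             lm.fit = t_lmfit[["elapsed"]],
             .lm.fit = t_dotlmfit[["elapsed"]])
print(elapsed)
```

For more stable numbers, each call would normally be repeated several times rather than timed once.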

Linear regression in R without copying data in memory?

3 answers: