Solving the normal equations gives different coefficients from using `lm`?

Time: 2016-10-18 16:34:10

Tags: r regression linear-regression lm least-squares

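I want to compute a simple regression both with lm and with plain matrix algebra. However, the regression coefficients I get from matrix algebra are only half of those I get from lm, and I have no idea why.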

Here is the code:

boot_example <- data.frame(
  x1 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  x2 = c(0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L),
  x3 = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L),
  x4 = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L),
  x5 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L),
  x6 = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L),
  preference_rating = c(9L, 7L, 5L, 6L, 5L, 6L, 5L, 7L, 6L)
  )
dummy_regression <- lm("preference_rating ~ x1+x2+x3+x4+x5+x6", data = boot_example)
dummy_regression

Call:
lm(formula = "preference_rating ~ x1+x2+x3+x4+x5+x6", data = boot_example)

Coefficients:
(Intercept)           x1           x2           x3           x4           x5           x6  
     4.2222       1.0000      -0.3333       1.0000       0.6667       2.3333       1.3333 

###The same by matrix algebra
X <- matrix(c(
c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), #upper var
c(0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L), #upper var
c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), #country var
c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), #country var
c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), #price var
c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L) #price var
), 
nrow = 9, ncol=6)

Y <- c(9L, 7L, 5L, 6L, 5L, 6L, 5L, 7L, 6L)

#Using standardized (mean=0, std=1) "z" -transformation Z = (X-mean(X))/sd(X) for all predictors
X_std <- apply(X, MARGIN = 2, FUN = function(x){(x-mean(x))/sd(x)})

##If constant shall be computed as well, uncomment next line 
#X_std <- cbind(c(rep(1,9)),X_std)

#using matrix algebra formula
solve(t(X_std) %*% X_std) %*% (t(X_std) %*% Y)

           [,1]
[1,]  0.5000000
[2,] -0.1666667
[3,]  0.5000000
[4,]  0.3333333
[5,]  1.1666667
[6,]  0.6666667

Does anyone see a mistake in my matrix computation?

Thanks in advance!

1 Answer:

Answer 0: (score: 3)

lm does not standardize the predictors. To get the same result as lm, use the unstandardized X and include an intercept column:

X1 <- cbind(1, X)  ## include intercept

solve(crossprod(X1), crossprod(X1,Y))

#           [,1]
#[1,]  4.2222222
#[2,]  1.0000000
#[3,] -0.3333333
#[4,]  1.0000000
#[5,]  0.6666667
#[6,]  2.3333333
#[7,]  1.3333333
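
The factor of exactly one half in the question can be traced to the standardization itself. Each dummy column contains three ones among nine values, so its sample standard deviation is 0.5, and a slope on a standardized predictor equals the raw slope multiplied by that predictor's standard deviation. Dropping the intercept does not change the slopes here, because the standardized columns are centered. A minimal sketch that checks this on the question's data (the names `sds`, `beta_raw` and `beta_std` are introduced here for illustration only):

```r
# Design matrix from the question: six dummy columns, filled column-wise
X <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 1, 1, 1, 0, 0, 0,
  1, 0, 0, 1, 0, 0, 1, 0, 0,
  0, 1, 0, 0, 1, 0, 0, 1, 0,
  1, 0, 0, 0, 0, 1, 0, 1, 0,
  0, 1, 0, 1, 0, 0, 0, 0, 1
), nrow = 9, ncol = 6)
Y <- c(9, 7, 5, 6, 5, 6, 5, 7, 6)

# Sample standard deviation of each predictor column
sds <- apply(X, 2, sd)

# Raw-scale fit with an intercept column, which is what lm does
X1 <- cbind(1, X)
beta_raw <- drop(solve(crossprod(X1), crossprod(X1, Y)))

# Standardized fit without an intercept, as in the question
X_std <- scale(X)
beta_std <- drop(solve(crossprod(X_std), crossprod(X_std, Y)))

# Standardized slopes should equal raw slopes times the column standard deviations
all.equal(beta_std, beta_raw[-1] * sds)
```

Dividing `beta_std` by `sds` therefore recovers the slopes on the original scale.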

I don't want to repeat why we should use crossprod. See the "follow-up" section of Ridge regression with glmnet gives different coefficients than what I compute by “textbook definition”?
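
As a quick check of the crossprod remark: `crossprod(A)` is formally equivalent to `t(A) %*% A`, and `crossprod(A, b)` to `t(A) %*% b`, and passing the right-hand side to `solve` avoids forming an explicit inverse. A small sketch with made-up data (the matrix `A` and vector `b` are hypothetical, not taken from the question):

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
A <- cbind(1, matrix(rnorm(40), 10))  # hypothetical 10 x 5 design matrix with an intercept column
b <- rnorm(10)                        # hypothetical response vector

# Normal equations written two ways
beta_explicit  <- solve(t(A) %*% A) %*% (t(A) %*% b)    # explicit transpose and inverse
beta_crossprod <- solve(crossprod(A), crossprod(A, b))  # no explicit transpose or inverse

all.equal(beta_explicit, beta_crossprod)

# For comparison, lm.fit solves the same least-squares problem via a QR decomposition
beta_qr <- lm.fit(A, b)$coefficients
all.equal(drop(beta_crossprod), unname(beta_qr))
```

All three routes should agree up to floating-point tolerance on a well-conditioned problem like this one.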