按组运行回归并将拟合值和/或残差分配回原始data.frame?

时间:2013-11-25 23:30:32

标签: r

我试图从ols获得分组预测,并将预测分配回原始data.frame中的列。在sas中,这可以通过以下方式完成:

proc reg data = ds;
by group_var1 group_var2;
model y= x;
output out = ds r = resd p = pred;
run;

上面的代码会正确地将预测分配给pred变量,并将残差分配给原始数据集中正确“块”中的resd变量。在R中,我最接近的是如下:

d <- data.frame(x = rnorm(20), y = rnorm(20), g1 = c(rep('a', 10), rep('b', 10)), g2 = rep(c(rep('c', 5), rep('d', 5)), 2))
fun <- function(d) predict(lm(y ~ x, data = d), d)
d['predict'] <- unlist(by(d, d[, c('g1', 'g2')], fun))

             x           y g1 g2     predict
1  -0.53089730  0.26437295  a  c -0.01569909
2  -1.70298591 -0.58804638  a  c -0.01960800
3   0.31134574 -0.96575392  a  c -0.01289022
4   1.03821508  0.36409789  a  c -0.01046612
5  -1.05180195  0.84922972  a  c -0.01743631
6   0.94785058  0.16559779  a  d  0.55257659
7  -0.11779401 -2.31900972  a  d  0.65193420
8  -0.87618526 -2.29891776  a  d -0.54113668
9   0.28450262 -0.68698073  a  d -1.48699542
10  0.44388469 -1.54596297  a  d -1.22664063
11 -1.15656711  1.70991300  b  c -0.26021797
12 -1.26949128 -0.05968582  b  c -1.67447336
13  0.08648475 -1.56257791  b  c -2.68096176
14  1.16149361 -1.40203666  b  c -1.14057100
15  0.86558927 -0.73587454  b  c -0.92904930
16 -1.15168500  0.55817377  b  d  1.34287585
17  1.17898623 -0.84767449  b  d  0.12289709
18 -0.61372747  1.55786932  b  d  1.06128454
19 -0.31233192  1.15423216  b  d  0.90352047
20  0.61869842  1.42415426  b  d  0.41617707

但是,我觉得这种方法并不健全,因为unlist只是简单地展平输出而没有任何匹配组的方法。我确实看到了预测错误分配给其他群体的情况。是否有一种聪明而有力的方法来实现这一目标?

1 个答案:

答案 0 :(得分:3)

R中有很多选项,例如使用plyr

library(plyr)
ddply(d,.(g1,g2),fun)
 g1 g2          1           2          3          4          5
1  a  c  0.4218236 -0.20147871  0.4544890 -0.2330185  0.2070659
2  a  d  1.0728608  1.08735907  1.0239526  0.8740484  1.4750465
3  b  c -0.1288420  0.20727212  0.2407177  0.4017079 -0.5237893
4  b  d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023

或使用data.table

library(data.table)
DT <- as.data.table(d)
DT[,as.list(predict(lm(y~x))),by= c('g1', 'g2')]

  g1 g2          1           2          3          4          5
1:  a  c  0.4218236 -0.20147871  0.4544890 -0.2330185  0.2070659
2:  a  d  1.0728608  1.08735907  1.0239526  0.8740484  1.4750465
3:  b  c -0.1288420  0.20727212  0.2407177  0.4017079 -0.5237893
4:  b  d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023