Rolling regression with data.table - any update?

Date: 2015-12-12 20:30:08

Tags: r data.table regression

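I am trying to run a rolling regression within a data.table. There are a number of questions that get at what I am trying to do, but they are generally 3+ years old and offer inelegant answers. (see: here, for example)

I am wondering whether there has been an update to the data.table package that makes this more intuitive or faster?

Here is what I am trying to do. My code looks like this: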

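library(data.table)

DT <- data.table(
  Date = seq(as.Date("2000/1/1"), by = "day", length.out = 1000),
  x1 = rnorm(1000),
  x2 = rnorm(1000),
  x3 = rnorm(1000),
  y = rnorm(1000),
  # 100 labels (25 each of a-d), repeated explicitly to fill all 1000 rows
  country = rep(c("a", "b", "c", "d"), each = 25, length.out = 1000))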

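I want to regress y on x1, x2 and x3, by country, over a rolling window of 180 days, and store the coefficients by date.

Ideally, the syntax would look something like this: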

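# wished-for syntax (pseudocode, not valid data.table code)
DT[, .(coef.x1 = coef(lm(y ~ x1 + x2 + x3))[2],
       coef.x2 = coef(lm(y ~ x1 + x2 + x3))[3],
       coef.x3 = coef(lm(y ~ x1 + x2 + x3))[4]),
   by = c("country", ROLLING WINDOW)]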

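...but more elegant, and avoiding the repetition if possible! :)

For some reason, I have not been able to get the rollapply syntax to work for me.

Thanks!

Edit:

Thanks @michaelchirico.

Your suggestion comes close to what I am after. Perhaps the code can be modified to get there, but once again I am stuck.

Here is a more careful statement of what I need. Some code: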

DT <- data.table(
  Date = rep(seq(as.Date("2000/1/1"), by = "day", length.out = 10), times = 3), # same dates per country
  x1 = rep(rnorm(10), times = 3), # x1's repeat - same per country
  x2 = rep(rnorm(10), times = 3), # x2's repeat - same per country
  x3 = rep(rnorm(10), times = 3), # x3's repeat - same per country
  y = rnorm(30), # y's do not repeat and are unique per country per day
  country = rep(c("a", "b", "c"), each = 10))

#to calculate the coefficients by individual  country: 
a<-subset(DT,country=="a")
b<-subset(DT,country=="b")

window<-5 #declare window
coefs.a<-coef(lm(y~x1+x2+x3, data=a[1:window]))#initialize my coef variable
coefs.b<-coef(lm(y~x1+x2+x3, data=b[1:window]))#initialize my coef variable

##calculate coefficients per window

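for (i in 1:(length(a$Date) - window)) {
  # rows (i+1)..(i+window): a full window of `window` rows, shifted one day per step
  coefs.a <- rbind(coefs.a, coef(lm(y ~ x1 + x2 + x3, data = a[(i + 1):(i + window)])))
  coefs.b <- rbind(coefs.b, coef(lm(y ~ x1 + x2 + x3, data = b[(i + 1):(i + window)])))
}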

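This dataset differs from the previous one in that the dates and x1, x2, x3 all repeat. My y's are unique to each country.

In my actual dataset I have 120 countries. I can calculate this for each country separately, but it is terribly slow, and then I have to rejoin all the coefficients into a single dataset to analyze the results.

Is there a way, similar to what you proposed, to end up with a single data.table containing all observations?

Thanks again!!

2 Answers:

Answer 0: (score: 0)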

It's still not clear exactly what you're after, but here's a shot which should be close (only minor adjustments need be made depending on the details):

I can't really speak to speed.

TT <- DT[ , uniqueN(Date), by = country][ , max(V1)]
window <- 5
#pre-declare a matrix of windows; each column represents
#one of the possible windows of days
windows <- matrix(1:TT, nrow = TT + 1, ncol = max(TT - window + 1, 1))[1:window, ]

DT[ , {
  #not all possible windows necessarily apply to each
  #  country; subset to find only the relevant windows
  windowsj <- windows[ , 1:(uniqueN(Date) - window + 1)]
  #lapply returns a list (which can be readily assigned with :=)
  lapply(1:ncol(windowsj),
         function(ii){
           #subset to relevant rows
           .SD[windowsj[ , ii],
               #regress, extract
               lm(y ~ x1 + x2 + x3)$coefficients]})},
  by = country]

Comparing the result of this to your coefs.a and coefs.b:

    country         V1          V2         V3          V4          V5          V6
 1:       a -0.8764867  0.46169717  2.6712128  2.66304537  1.18928600  0.53553900
 2:       a -1.0135961  0.03985467  0.6015446  0.61316724  0.24177034  0.86369780
 3:       a -0.1807617 -0.25767309 -2.9492897 -3.05092528 -0.04310375  0.62317993
 4:       a -0.6664342 -0.30732907 -0.3362091 -0.25776715  1.04419854  1.02294125
 5:       b  0.9548685  0.77461810 -0.5100818 -0.57726788 -0.73285223 -1.64196684
 6:       b  0.7179429  0.46107110  0.1732915  0.23262455  0.23258149  3.63679221
 7:       b  0.1639778 -0.22249382  1.4539881  0.58725270  0.54879762 -0.27115275
 8:       b  0.6192641  0.12706750  0.2671673  0.79569434  0.69031761  2.27769679
 9:       c  0.2722200  0.07279085 -0.7709578 -0.74590575 -0.15773196  0.03178821
10:       c  0.8890314  0.74213624  0.4440650  0.34939003  0.50531166  0.16550026
11:       c  0.1589915  0.20531447  0.9931054  1.25495206 -0.01543296 -0.09887655
12:       c  0.7198967  0.70536869  0.4508445  0.02028332 -0.54705588 -0.64246579

> coefs.a
        (Intercept)          x1          x2         x3
coefs.a  -0.8764867 -1.01359605 -0.18076171 -0.6664342
          0.4616972  0.03985467 -0.25767309 -0.3073291
          2.6712128  0.60154458 -2.94928969 -0.3362091
          2.6630454  0.61316724 -3.05092528 -0.2577671
          1.1892860  0.24177034 -0.04310375  1.0441985
          0.5355390  0.86369780  0.62317993  1.0229412

(i.e. it's the same, just transposed)
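
If the transposed layout is awkward to work with, the same windows can be collected into a long table instead. The following is only a sketch under stated assumptions: it regenerates the 10-day example data with a seed (so the numbers will not match the output above), labels each window by its last date, and assumes every country has at least `window` rows already sorted by date. The names `ends`, `fits` and `coefs_long` are made up for illustration.

```r
library(data.table)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
DT <- data.table(
  Date    = rep(seq(as.Date("2000-01-01"), by = "day", length.out = 10), times = 3),
  x1      = rep(rnorm(10), times = 3),
  x2      = rep(rnorm(10), times = 3),
  x3      = rep(rnorm(10), times = 3),
  y       = rnorm(30),
  country = rep(c("a", "b", "c"), each = 10)
)
setkey(DT, country, Date)  # make sure rows are in date order within each country

window <- 5L

coefs_long <- DT[, {
  ends <- window:.N  # index of the last row of each window (assumes .N >= window)
  fits <- lapply(ends, function(e) {
    # fit on the `window` rows ending at row e; keep coefficients as a named list
    as.list(coef(lm(y ~ x1 + x2 + x3, data = .SD[(e - window + 1L):e])))
  })
  # one row per window: the window's end date plus one column per coefficient
  data.table(Date = Date[ends], rbindlist(fits))
}, by = country]
```

The result has one row per country and window, with columns `country`, `Date`, `(Intercept)`, `x1`, `x2` and `x3`, so it can be filtered or joined by date directly.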

Answer 1: (score: 0)

frollapply only accepts numeric vector input and output, so we have to roll our own, using sapply() over the row indices.

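window <- 180
DT[, 
   {
     # one regression per window start k; each window covers rows
     # k..(k + window - 1), i.e. exactly `window` rows
     data.table(t(sapply(seq_len(.N - window + 1),
                         function(k) lm(y ~ x1 + x2 + x3, 
                                        data = .SD[k:(k + window - 1)])$coefficients)))
   }, 
   by = country] 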
##      country (Intercept)         x1          x2          x3
##   1:       a  0.10163170 0.09561343 -0.11123725 -0.06489867
##   2:       a  0.11029460 0.08927926 -0.10657563 -0.06035072
##   3:       a  0.11328084 0.08856627 -0.10521865 -0.06278259
##   4:       a  0.12348242 0.07503412 -0.10483616 -0.06638923
##   5:       a  0.13285512 0.09268086 -0.11239769 -0.04068656
##  ---                                                       
## 280:       d  0.08249204 0.06252626 -0.06965884 -0.09680134
## 281:       d  0.07864977 0.05395658 -0.06137728 -0.10774067
## 282:       d  0.07937867 0.06996970 -0.07991358 -0.11377039
## 283:       d  0.07654691 0.06546692 -0.06824516 -0.10902969
## 284:       d  0.06123857 0.08590249 -0.05117317 -0.11728684
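
Since speed was a concern in the question, one possible variation is to skip the formula interface and call base R's `.lm.fit()` on a pre-built design matrix. This is a rough sketch rather than a benchmarked solution: the helper name `roll_coef_fast` is hypothetical, the data are regenerated with a seed and with 250 consecutive days per country (an assumption, not the question's exact layout), and the regressors are assumed to have full rank in every window.

```r
library(data.table)

set.seed(42)  # assumption: seed added only to make the sketch reproducible
DT <- data.table(
  Date    = rep(seq(as.Date("2000-01-01"), by = "day", length.out = 250), times = 4),
  x1      = rnorm(1000),
  x2      = rnorm(1000),
  x3      = rnorm(1000),
  y       = rnorm(1000),
  country = rep(c("a", "b", "c", "d"), each = 250)
)

window <- 180L

# Hypothetical helper: rolling OLS coefficients for one group's response y
# and design matrix X (intercept column included), one row per window.
roll_coef_fast <- function(y, X, window) {
  starts <- seq_len(nrow(X) - window + 1L)
  out <- t(vapply(starts, function(k) {
    idx <- k:(k + window - 1L)  # exactly `window` rows
    .lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
  }, numeric(ncol(X))))
  colnames(out) <- colnames(X)
  out
}

res <- DT[, {
  # build the design matrix once per country instead of once per window
  X <- cbind(`(Intercept)` = 1, x1 = x1, x2 = x2, x3 = x3)
  as.data.table(roll_coef_fast(y, X, window))
}, by = country]
```

With 250 rows per country and a 180-row window this gives 71 windows per country. Whether it is actually faster on 120 countries would need to be timed on the real data.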