data.table way to perform operations on a column grouped by the last n years

Asked: 2017-08-28 06:59:48

Tags: r data.table

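Below is what I am trying to achieve, with a reproducible example.

I have a data.table with months as the time ID. I want to run some calculations on the data for the last 5 years, the last 10 years, and so on, always ending at the latest month (i.e. the last 5*12 months, the last 10*12 months, etc.).

I have a way of doing this, but I suspect it goes through many unnecessary intermediate variables.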

library(lubridate) #For easy creation of time-series
library(data.table)
set.seed(5)
DT <- data.table(
  Month = as.Date(sapply(0:329, function(i)(as.Date('1990-01-01')%m+%months(i))), origin = '1970-01-01'), 
  Value = round(runif(330, min = 20, max = 40), digits = 2)
)

> DT
          Month Value
  1: 1990-01-01 24.00
  2: 1990-02-01 33.70
  3: 1990-03-01 38.34
  4: 1990-04-01 25.69
  5: 1990-05-01 22.09
 ---                 
326: 2017-02-01 20.91
327: 2017-03-01 38.96
328: 2017-04-01 28.91
329: 2017-05-01 26.09
330: 2017-06-01 35.16


## Create a vector of the first months marking the start of the 60 or 120 month period
last.month <- max(DT[['Month']])
first.months <- as.Date(sapply(seq(5, 25, by = 5), function(i)(last.month %m-% months(i*12 - 1))),
                        origin = '1970-01-01')

## Construction of table of interest
yrs <- paste0(seq(5, 25, by = 5), 'Yrs')
features <- data.table(
  Period = factor(yrs, levels = yrs), Feature.1 = as.numeric(NA), 
  Feature.2 = as.numeric(NA)
)
for(i in 1:nrow(features)){
  DT_n <- DT[Month>=first.months[i], ]
  set(features, i, 'Feature.1', DT_n[, mean(Value)]) #mean used as an example operation
  set(features, i, 'Feature.2', DT_n[, var(Value)]) #var used as an example operation
}

Finally, this is the table I am interested in:

> features
   Period Feature.1 Feature.2
1:   5Yrs  29.68817  35.80375
2:  10Yrs  29.25542  39.50981
3:  15Yrs  29.64950  37.41900
4:  20Yrs  29.63454  34.51793
5:  25Yrs  29.84373  35.90916
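
As an aside on the setup code above, the month column and the window start dates can probably be built without `sapply` and the `origin` round-trip. The following is a minimal sketch using base R `seq()` on dates rather than lubridate. The names `n.years` and `back` are introduced here for illustration and are not part of the original code.

```r
library(data.table)

set.seed(5)
DT <- data.table(
  # seq() on a Date with by = "month" returns a Date vector directly
  Month = seq(as.Date("1990-01-01"), by = "month", length.out = 330),
  Value = round(runif(330, min = 20, max = 40), digits = 2)
)

n.years <- seq(5, 25, by = 5)       # hypothetical helper: window lengths in years
last.month <- max(DT$Month)

# back[k] is last.month minus (k - 1) months, so back[12 * n] is the
# first month of a window covering the last n years
back <- seq(last.month, by = "-1 month", length.out = 12 * max(n.years))
first.months <- back[12 * n.years]
```

Because `first.months` stays a Date vector throughout, it can be used in `DT[Month >= first.months[i]]` exactly as before.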

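What is the best way to achieve this with data.table? Any improvement in terms of fewer unnecessary variables or better efficiency would be appreciated.

Thanks!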

2 Answers:

Answer 0: (score: 1)

Here is another data.table approach you can try. After constructing the first.months and yrs vectors, you can put them into a separate data.table:

m <- data.table(firstmonths = first.months, yrs = yrs, key = "yrs")

Then use a non-equi join to compute the results:

rbindlist(lapply(yrs, function(y) {
  DT[m[y], on = .(Month >= firstmonths), .(mean = mean(Value), 
                                           var = var(Value), 
                                           Period = y)]
}))

#       mean      var Period
#1: 29.68817 35.80375   5Yrs
#2: 29.25542 39.50981  10Yrs
#3: 29.64950 37.41900  15Yrs
#4: 29.63454 34.51793  20Yrs
#5: 29.84373 35.90916  25Yrs
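
The `lapply` over `yrs` above can likely be folded into a single non-equi join with `by = .EACHI`, which evaluates `j` once per row of the lookup table. This is a sketch under a few assumptions: the lookup table `windows` and its column `firstmonth` are names made up for this example, it is deliberately left unkeyed so the periods keep their original order, and the data is rebuilt with base R `seq()` so the block runs on its own.

```r
library(data.table)

set.seed(5)
DT <- data.table(
  Month = seq(as.Date("1990-01-01"), by = "month", length.out = 330),
  Value = round(runif(330, min = 20, max = 40), digits = 2)
)

n.years <- seq(5, 25, by = 5)
# One row per window: label plus the first month of that window
windows <- data.table(
  Period = paste0(n.years, "Yrs"),
  firstmonth = seq(max(DT$Month), by = "-1 month",
                   length.out = 12 * max(n.years))[12 * n.years]
)

# For each row of `windows`, aggregate all rows of DT with Month >= firstmonth
res <- DT[windows, on = .(Month >= firstmonth),
          .(Period = i.Period, mean = mean(Value), var = var(Value)),
          by = .EACHI]
res[, Month := NULL]   # drop the join column, which holds the window start date
```

The `i.` prefix makes it explicit that `Period` is taken from the lookup table rather than from `DT`.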

Answer 1: (score: 1)

Another approach:

rbindlist(lapply(first.months, 
                 function(m) data.table(val_mean = mean(DT[Month >= m]$Value),
                                        val_var = var(DT[Month >= m]$Value)))
          )[, Period := yrs][]

Which gives:

   val_mean  val_var Period
1: 29.68817 35.80375   5Yrs
2: 29.25542 39.50981  10Yrs
3: 29.64950 37.41900  15Yrs
4: 29.63454 34.51793  20Yrs
5: 29.84373 35.90916  25Yrs

Or a variant of the above that uses setNames together with the idcol argument of rbindlist:

rbindlist(setNames(lapply(first.months,
                          function(m) data.table(val_mean = mean(DT$Value[DT$Month >= m]),
                                                 val_var = var(DT$Value[DT$Month >= m]))),
                   yrs),
          idcol = 'Period')

Which gives:

   Period val_mean  val_var
1:   5Yrs 29.68817 35.80375
2:  10Yrs 29.25542 39.50981
3:  15Yrs 29.64950 37.41900
4:  20Yrs 29.63454 34.51793
5:  25Yrs 29.84373 35.90916
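
If the table is known to contain exactly one row per month with no gaps, the date arithmetic can arguably be skipped altogether by taking the last `12 * n` rows. The sketch below relies on that assumption, which holds for the example data but should be checked on real data. `n.years` is a name introduced here for illustration.

```r
library(data.table)

set.seed(5)
DT <- data.table(
  Month = seq(as.Date("1990-01-01"), by = "month", length.out = 330),
  Value = round(runif(330, min = 20, max = 40), digits = 2)
)

n.years <- seq(5, 25, by = 5)
yrs <- paste0(n.years, "Yrs")

setorder(DT, Month)   # make sure the most recent months come last

# Assumes one row per month: the last 12 * n rows are the last n years
features <- rbindlist(
  lapply(setNames(12 * n.years, yrs), function(k) {
    tail(DT, k)[, .(Feature.1 = mean(Value), Feature.2 = var(Value))]
  }),
  idcol = "Period"
)
```

With gaps or duplicate months in the data, the `Month >= m` filters shown above are the safer choice.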