in data.table

Time: 2019-09-12 17:35:11

Tags: r data.table

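I have a dataset of growth rates and want to create a chain index, with the value in the base year set to 100. The problem is that the calculation is iterative: to get the value at t, I need to multiply the chain index at t-1 by (1 + growth_rate) at t. I cannot work out how to do this for each group in a data.table.

Consider this example dataset.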

library(data.table)

x1 <- c(NA, runif(9))
x2 <- c(NA, runif(9))


DT <- data.table(
  time = rep(1:10, 2),
  growth_rate = c(x1, x2),
  idx = c(rep("group1",10),
          rep("group2",10))
)
    time growth_rate    idx
 1:    1          NA group1
 2:    2  0.82593921 group1
 3:    3  0.48084893 group1
 4:    4  0.65483959 group1
 5:    5  0.87944148 group1
 6:    6  0.78886104 group1
 7:    7  0.87714854 group1
 8:    8  0.87268452 group1
 9:    9  0.93289483 group1
10:   10  0.05558125 group1
11:    1          NA group2
12:    2  0.36341183 group2
13:    3  0.21488630 group2
14:    4  0.17622914 group2
15:    5  0.50420764 group2
16:    6  0.08646833 group2
17:    7  0.28408027 group2
18:    8  0.20252834 group2
19:    9  0.16940959 group2
20:   10  0.60843486 group2

I tried

first_value = DT[, .(first_value = .I[c(1L)]), by="idx"]$first_value

DT[first_value,ChainIndex := 100]

DT[,ChainIndex := shift(ChainIndex, type="lag", n=1)*(1+growth_rate), by=idx]

and a loop (which I would like to avoid, because my dataset contains many groups and rows)

for (row in 1:nrow(DT))
{ 

  if (row %in% first_value)
  {DT[row, ChainIndex := as.numeric(100)]}

  else 
  {DT[row, ChainIndex := shift(ChainIndex, type = "lag", n=1)*(1+growth_rate), by=idx]}

}

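However, neither approach carries the chaining through every row. In the end, each group should have a ChainIndex of 100 in the first year and ChainIndex(t-1) * (1 + growth_rate) in every other year. Can someone help me?

2 answers:

Answer 0: (score: 2)

Not sure, since your expected output is missing, but this might work...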

# set the growth rate of the first row of each group to 0
DT[ is.na(growth_rate), growth_rate := 0 ]
# chain index = 100 * cumulative product of (1 + growth_rate), per group
DT[, chain := 100 * cumprod( growth_rate + 1 ), by = .(idx) ]
# reset the first rows back to NA (note: this also turns any genuine 0 growth rate into NA)
DT[ growth_rate == 0, growth_rate := NA_real_ ][]

    time growth_rate    idx      chain
 1:    1          NA group1   100.0000
 2:    2  0.82593921 group1   182.5939
 3:    3  0.48084893 group1   270.3940
 4:    4  0.65483959 group1   447.4587
 5:    5  0.87944148 group1   840.9725
 6:    6  0.78886104 group1  1504.3829
 7:    7  0.87714854 group1  2823.9502
 8:    8  0.87268452 group1  5288.3677
 9:    9  0.93289483 group1 10221.8586
10:   10  0.05558125 group1 10790.0023
11:    1          NA group2   100.0000
12:    2  0.36341183 group2   136.3412
13:    3  0.21488630 group2   165.6390
14:    4  0.17622914 group2   194.8295
15:    5  0.50420764 group2   293.0640
16:    6  0.08646833 group2   318.4047
17:    7  0.28408027 group2   408.8572
18:    8  0.20252834 group2   491.6624
19:    9  0.16940959 group2   574.9547
20:   10  0.60843486 group2   924.7772

Another (more data.table-like) way to set the first growth_rate of each idx group to 0 is:

DT[ DT[, .(.I[1L]), by=idx]$V1, growth_rate := 0][]
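
As a variation on the above, here is a minimal sketch that leaves growth_rate untouched by replacing the NA only inside the calculation, and then checks the result against the recursion described in the question. It assumes data.table 1.12.4 or later for fcoalesce(); the fixed seed, the helper chain_recursive and the check column are illustrative additions, not part of the original answer.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
DT <- data.table(
  time = rep(1:10, 2),
  growth_rate = c(NA, runif(9), NA, runif(9)),
  idx = rep(c("group1", "group2"), each = 10)
)

# Treat the missing first-period growth rate as 0 on the fly,
# leaving the growth_rate column itself unchanged
DT[, chain := 100 * cumprod(1 + fcoalesce(growth_rate, 0)), by = idx]

# Reference (hypothetical helper): the recursion
# chain[t] = chain[t-1] * (1 + growth_rate[t]), starting from 100
chain_recursive <- function(g) {
  Reduce(function(prev, rate) prev * (1 + rate), g[-1],
         accumulate = TRUE, init = 100)
}
DT[, check := chain_recursive(growth_rate), by = idx]

# Both columns agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(DT$chain, DT$check)))
```

Because the NA is replaced only inside cumprod(), there is no need to reset rows afterwards, so genuine zero growth rates are not affected.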

Answer 1: (score: 1)

The cumprod function solves this easily.
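
The reason a cumulative product is enough can be seen by unrolling the recursion from the question. The symbols C_t (chain index at time t) and g_t (growth rate at time t) are notation introduced here for illustration:

```latex
C_1 = 100, \qquad C_t = C_{t-1}\,(1 + g_t) \quad (t \ge 2)
\;\Longrightarrow\;
C_t = 100 \prod_{s=2}^{t} (1 + g_s)
```

Treating the missing first growth rate as 0 contributes a factor of 1, so the product can start at s = 1 without changing the result.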

library(data.table)

x1 <- c(NA, runif(9))
x2 <- c(NA, runif(9))

DT <- data.table(
  time = rep(1:10, 2),
  growth_rate = c(x1, x2),
  idx = c(rep("group1",10),
          rep("group2",10))
)

DT[
  i  = order(time),
  j  = `:=`(
    value = 100 * cumprod(1 + ifelse(is.na(growth_rate), 0, growth_rate))),
  by = idx]

print(DT)

    time growth_rate    idx     value
 1:    1          NA group1  100.0000
 2:    2  0.95908608 group1  195.9086
 3:    3  0.25566986 group1  245.9965
 4:    4  0.55565852 group1  382.6866
 5:    5  0.15934976 group1  443.6676
 6:    6  0.73005207 group1  767.5681
 7:    7  0.38046874 group1 1059.6037
 8:    8  0.11186212 group1 1178.1333
 9:    9  0.24389118 group1 1465.4696
10:   10  0.05880406 group1 1551.6452
11:    1          NA group2  100.0000
12:    2  0.39967710 group2  139.9677
13:    3  0.25459351 group2  175.6026
14:    4  0.07636151 group2  189.0119
15:    5  0.65243776 group2  312.3303
16:    6  0.37214618 group2  428.5629
17:    7  0.93790246 group2  830.5131
18:    8  0.57050829 group2 1304.3276
19:    9  0.06343531 group2 1387.0681
20:   10  0.20862719 group2 1676.4482

If you need to impute missing values in time, it gets more complicated.
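
One possible way to handle gaps in time, offered as a rough sketch and not as the answerer's method: expand each group to a full time grid with CJ(), join the data onto it, and treat an unobserved period as zero growth so the index is carried forward. That imputation rule is an assumption made here for illustration, and the small dataset is hypothetical; other rules (for example interpolating the growth rate) may suit your data better.

```r
library(data.table)

# Hypothetical data: group "a" has no row for time 3
DT <- data.table(
  idx = c("a", "a", "a", "b", "b", "b", "b"),
  time = c(1L, 2L, 4L, 1L, 2L, 3L, 4L),
  growth_rate = c(NA, 0.10, 0.20, NA, 0.50, 0.00, 0.10)
)

# Full idx x time grid; periods absent from DT get growth_rate = NA after the join
grid <- CJ(idx = unique(DT$idx), time = seq(min(DT$time), max(DT$time)))
full <- DT[grid, on = .(idx, time)]

# Assumption: an unobserved period contributes no growth (factor 1)
full[, chain := 100 * cumprod(1 + fcoalesce(growth_rate, 0)), by = idx]

print(full)
```

CJ() returns the grid sorted by idx and time, so the cumulative product runs in time order within each group.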