R: computing a lagged variable by group

Time: 2019-04-15 14:04:59

Tags: r dplyr data.table lag

Using the following dataset:

set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 6)
dest <- rep(c("GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR", "DEU"), 6)
year <- rep(c(rep(1998, 10), rep(1999, 10), rep(2000, 10)), 2)
type <- rep(c(1,2,3,4,5), 12)
# type <- sample(1:10, size=length(origin), replace=TRUE)
a <- sample(100:10000, size=length(origin), replace=TRUE)
b <- sample(1000:100000, size=length(origin), replace=TRUE)
data.df <- as.data.frame(cbind(origin, dest, year, type, a,b))
rm(origin, year, dest, type, a,b)

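For example, I would like to compute the following:

  • $(a^{t+1}_{ijk} - a^{t}_{ijk}) \cdot b^{t}_{ik}$

where i is the type, j the origin and k the dest. I decided to start by computing the lag of a with dplyr, stored as lag.a: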

data.df <- data.df %>%
            group_by(origin, dest, type) %>%
            mutate(lag.a = lag(a, n = 1, default = NA))

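I think this approach is correct, even though I don't really understand how R can work this out on its own...

Incidentally, doing this gives me a result corresponding to the first part, $a^{t+1}_{ijk} - a^{t}_{ijk}$. My problem is that I don't know how to do the rest now, $\text{lag.a}^{t+1}_{ijk} \cdot b^{t}_{ik}$... Any ideas?

If possible, I would like a solution (dplyr or data.table) that does not mutate the lag variable into the dataset, so that the dataset does not get bloated.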

1 answer:

Answer 0 (score: 1)

There are a few problems in your code. First, create your data.frame like this:

data.df <- data.frame(origin, dest, year, type, a, b)

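This preserves the class of every vector. With cbind(), all the vectors are first coerced into a single character matrix, so year, type, a and b lose their numeric class. Note that if you don't want origin and dest to become factors, just pass the argument stringsAsFactors = FALSE to the data.frame() function (this is the default from R 4.0.0 onward).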
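To make the class issue concrete, here is a minimal base-R sketch. The vectors and the names via_cbind and via_df are hypothetical toy values chosen for illustration, not the question's data.

```r
# Toy vectors (hypothetical values, not the question's data)
origin <- c("DEU", "GBR", "ITA")
year   <- c(1998, 1999, 2000)
a      <- c(150L, 420L, 990L)

# cbind() builds a matrix first; a matrix holds a single type, so mixing
# character and numeric vectors coerces everything to character
via_cbind <- as.data.frame(cbind(origin, year, a), stringsAsFactors = FALSE)

# data.frame() keeps each vector as its own column with its own class
via_df <- data.frame(origin, year, a, stringsAsFactors = FALSE)

sapply(via_cbind, class)  # every column is "character"
sapply(via_df, class)     # "character", "numeric", "integer"
```

With the cbind() version, an expression such as a - lag(a) cannot work as arithmetic, because a is no longer numeric.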

Next, create the new variable as follows:

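library(dplyr)

data.df2 <- data.df %>%
  group_by(origin, dest, type) %>%
  arrange(year) %>%                        # sort by year so lag() sees the previous period
  mutate(new_var = (a - lag(a)) * b) %>%   # change in a times b, computed within each group
  ungroup()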

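Here, new_var is the variable you want. You are right that dplyr does not know that the lagged value should come from the previous time period: lag() simply takes the previous row within each group. That is why you have to use arrange(year).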
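The effect of the row order can be checked on a small example. The sketch below uses a made-up single-group data frame (toy, with invented values) stored out of year order. The column names diff_back and diff_fwd are hypothetical. diff_back mirrors the expression above, (a - lag(a)) * b, which pairs b at period t with the change from t-1 to t. diff_fwd is one possible reading of the formula in the question, where b at period t multiplies the change from t to t+1. Which alignment is intended is an assumption the reader has to settle.

```r
library(dplyr)

# Toy data for a single (origin, dest, type) group, deliberately stored
# out of year order (values are made up for illustration)
toy <- data.frame(
  origin = "DEU", dest = "GBR", type = 1,
  year = c(2000, 1998, 1999),
  a    = c(30, 10, 15),
  b    = c(4, 2, 3)
)

# Without sorting, lag() just uses the previous row as stored
unsorted <- toy %>%
  group_by(origin, dest, type) %>%
  mutate(diff_back = (a - lag(a)) * b) %>%
  ungroup()

# With arrange(year), the previous row is the previous period
res <- toy %>%
  arrange(year) %>%
  group_by(origin, dest, type) %>%
  mutate(
    diff_back = (a - lag(a)) * b,   # (a_t - a_{t-1}) * b_t
    diff_fwd  = (lead(a) - a) * b   # (a_{t+1} - a_t) * b_t
  ) %>%
  ungroup()

unsorted$diff_back  # NA -40 15: the 1998 row is compared with 2000
res$diff_back       # NA 15 60
res$diff_fwd        # 10 45 NA
```

Because lag() and lead() are called inside the mutate() expression, no separate lag column is stored in the result.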