Creating multiple lagged-variable columns for different groups

Time: 2018-01-25 15:52:10

Tags: r dplyr time-series purrr

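I am trying to add lagged-variable columns to my data frame. I am having trouble because I have several groups (countries, in my example) and I want the lags computed separately within each one.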

library(tidyverse)

df <- tribble(
  ~year, ~country, ~variable, 
  #--|--|----
  1997, "USA", 28,
  1998, "USA", 40,
  1999, "USA", 30,
  2000, "USA", 39,
  2001, "USA", 55,
  2002, "USA", 53,
  2003, "USA", 64,
  2004, "USA", 40,
  2005, "USA", 30,
  2006, "USA", 39,
  2007, "USA", 55,
  2008, "USA", 53,
  2009, "USA", 71,
  2010, "USA", 44,
  2011, "USA", 40,
  2012, "USA", 17,
  2013, "USA", 39,
  2014, "USA", 55,
  2015, "USA", 53,
  1997, "France", 13,
  1998, "France", 42,
  1999, "France", 37,
  2000, "France", 11,
  2001, "France", 55,
  2002, "France", 53,
  2003, "France", 31,
  2004, "France", 10,
  2005, "France", 30,
  2006, "France", 37,
  2007, "France", 54,
  2008, "France", 58,
  2009, "France", 50,
  2010, "France", 40,
  2011, "France", 49,
  2012, "France", 14,
  2013, "France", 34,
  2014, "France", 53,
  2015, "France", 50
)
nlags <- 1:10
df_lags <- map(.x = nlags,
               .f = ~ lag(df$variable, .x)) %>% 
  as.data.frame
names(df_lags) <- paste0("lag_", nlags)

df <- df %>% 
  bind_cols(df_lags)

This is roughly right, but the lag also carries across groups! Afterwards, row 20 looks like this:

---------------------------------
| 1997 | France | 13 | 53 | ... | 
---------------------------------

But the 53 is taken from the USA group (its 2015 value), whereas it should simply be NA.

I tried this:

df %>% 
  group_by(country) %>% 
  map(.x = nlags,
      .f = ~ lag(variable, .x))

But this does not work:

Error in lag(variable, .x) : object 'variable' not found

Any ideas?

2 Answers:

Answer 0: (score: 3)

This is easier with data.table:

library(data.table)
setDT(df)[, paste0("lag_", nlags) := shift(variable, nlags), country]
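
To make the grouped behaviour easy to check, here is a minimal sketch of the same idea on a six-row subset of the question's data. The names `dt` and the shortened `nlags` are made up for illustration, and the rows are assumed to be already sorted by year within each country, since `shift()` works by position.

```r
library(data.table)

# Small illustrative data set (first three years per country from the question)
dt <- data.table(
  year     = rep(1997:1999, times = 2),
  country  = rep(c("USA", "France"), each = 3),
  variable = c(28, 40, 30, 13, 42, 37)
)
nlags <- 1:2  # shortened from 1:10 to keep the output readable

# shift() accepts a vector of lags and returns one vector per lag;
# `by = country` restarts the lag inside each group, so the first
# rows of every country are padded with NA instead of borrowing
# values from the previous country.
dt[, paste0("lag_", nlags) := shift(variable, n = nlags, type = "lag"),
   by = country]

print(dt)
```

With this subset, `lag_1` for France in 1997 should be NA rather than the last USA value.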

Answer 1: (score: 2)

This may be useful. We can split the data frame by country, apply the same operation to each country, and then combine the results. df2 is the final output.

library(tidyverse)

nlags <- 1:10

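df2 <- df %>%
  split(.$country) %>%                 # one data frame per country
  map_dfr(function(d) {
    # Lag within this country only, so values never cross groups
    d_lags <- map(nlags, ~ lag(d$variable, .x)) %>%
      as.data.frame() %>%
      setNames(paste0("lag_", nlags))
    bind_cols(d, d_lags)               # map_dfr() row-binds the pieces
  })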
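
For comparison, a possible variant that stays inside a single `group_by()` pipeline is sketched below. It is not from either answer: `add_lags()` is a hypothetical helper, `df_small` is a six-row subset of the question's data, and the approach assumes dplyr 1.0.0 or later, where `mutate()` expands an unnamed data frame result into several columns.

```r
library(dplyr)

# Six-row subset of the question's data, for illustration only
df_small <- tibble::tibble(
  year     = rep(1997:1999, times = 2),
  country  = rep(c("USA", "France"), each = 3),
  variable = c(28, 40, 30, 13, 42, 37)
)
nlags <- 1:2  # shortened from 1:10

# Hypothetical helper: returns a tibble with one lag column per entry of `lags`
add_lags <- function(x, lags) {
  out <- lapply(lags, function(n) dplyr::lag(x, n))
  names(out) <- paste0("lag_", lags)
  tibble::as_tibble(out)
}

df_lagged <- df_small %>%
  arrange(country, year) %>%           # lag() is positional, so sort first
  group_by(country) %>%
  mutate(add_lags(variable, nlags)) %>% # evaluated once per country
  ungroup()

print(df_lagged)
```

Because `mutate()` runs `add_lags()` separately for each country, the first rows of each group get NA instead of values from another group.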