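Create multiple columns of leads or lags from other time-series columns

Time: 2018-04-17 04:00:22

Tags: r

I've run into a problem and need your help. I have a dataset that can be treated as a panel, with one difference: there may be multiple time series for each ID. See the example below.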

set.seed(100)
## create data
mydf<-data.frame(ID = c(rep('A',7),rep('B',3)),
                 year =c(c(2001:2003),c(2006:2009),c(2001:2003)),
                 x = rnorm(10),
                 y = rnorm(10))

 mydf
    ID year           x           y
 1:  A 2001 -0.50219235  0.08988614
 2:  A 2002  0.13153117  0.09627446
 3:  A 2003 -0.07891709 -0.20163395
 4:  A 2006  0.88678481  0.73984050
 5:  A 2007  0.11697127  0.12337950
 6:  A 2008  0.31863009 -0.02931671
 7:  A 2009 -0.58179068 -0.38885425
 8:  B 2001  0.71453271  0.51085626
 9:  B 2002 -0.82525943 -0.91381419
10:  B 2003 -0.35986213  2.31029682

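For a particular reason, I want to keep, for each ID, every time series that has at least three consecutive observations. This can produce several time series for one ID: as you can see, ID == A has two time series that satisfy the condition. I want to create lead and lag variables for x and y.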
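The "at least three consecutive observations" rule can be sketched as follows. This is only an illustration under assumptions: the column name run_id, the object names dt and dt_kept, and the extra ID "C" (added so that something is actually dropped) are hypothetical and not part of the original data.

```r
library(data.table)

# Toy data with the same ID/year layout as the example, plus a hypothetical
# ID "C" whose runs are too short (2001-2002, then 2005 alone)
dt <- data.table(
  ID   = c(rep("A", 7), rep("B", 3), rep("C", 3)),
  year = c(2001:2003, 2006:2009, 2001:2003, 2001L, 2002L, 2005L)
)
setorder(dt, ID, year)

# run_id (hypothetical name) starts a new run whenever the year does not
# directly follow the previous year within the same ID
dt[, run_id := cumsum(c(TRUE, diff(year) != 1L)), by = ID]

# Keep only the runs that have at least three consecutive observations
dt_kept <- dt[, if (.N >= 3L) .SD, by = .(ID, run_id)]
```

Here ID A is split into two runs (2001-2003 and 2006-2009), both of which are kept, while every run of the made-up ID C is discarded.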

If each ID had only one consecutive time series, I could simply use:

library(data.table)
setDT(mydf)  # := requires a data.table
anscols.Lead1=paste("Lead.1",c('x','y'),sep="_")
mydf[,(anscols.Lead1):=shift(.SD,1,NA,type="lead"),.SDcols=c('x','y'),by=ID]

Or, if I only need to operate on a single column, I can also use:

library(plm)  # provides pdata.frame() and a panel-aware lag()
tp.mydf<-pdata.frame(mydf,c("ID","year"))
tp.mydf$lag1x<-lag(tp.mydf$x)

However, with non-consecutive time series and multiple columns, the data.table approach does not work. It gives this result:

mydf
    ID year           x           y    Lead.1_x    Lead.1_y
 1:  A 2001 -0.50219235  0.08988614  0.13153117  0.09627446
 2:  A 2002  0.13153117  0.09627446 -0.07891709 -0.20163395
 3:  A 2003 -0.07891709 -0.20163395  0.88678481  0.73984050
 4:  A 2006  0.88678481  0.73984050  0.11697127  0.12337950
 5:  A 2007  0.11697127  0.12337950  0.31863009 -0.02931671
 6:  A 2008  0.31863009 -0.02931671 -0.58179068 -0.38885425
 7:  A 2009 -0.58179068 -0.38885425          NA          NA
 8:  B 2001  0.71453271  0.51085626 -0.82525943 -0.91381419
 9:  B 2002 -0.82525943 -0.91381419 -0.35986213  2.31029682
10:  B 2003 -0.35986213  2.31029682          NA          NA

What I want is:

mydf
    ID year           x           y    Lead.1_x    Lead.1_y
 1:  A 2001 -0.50219235  0.08988614  0.13153117  0.09627446
 2:  A 2002  0.13153117  0.09627446 -0.07891709 -0.20163395
 3:  A 2003 -0.07891709 -0.20163395          NA          NA
 4:  A 2006  0.88678481  0.73984050  0.11697127  0.12337950
 5:  A 2007  0.11697127  0.12337950  0.31863009 -0.02931671
 6:  A 2008  0.31863009 -0.02931671 -0.58179068 -0.38885425
 7:  A 2009 -0.58179068 -0.38885425          NA          NA
 8:  B 2001  0.71453271  0.51085626 -0.82525943 -0.91381419
 9:  B 2002 -0.82525943 -0.91381419 -0.35986213  2.31029682
10:  B 2003 -0.35986213  2.31029682          NA          NA

Does anyone know how to solve this?

================== EDIT, based entirely on Shah's answer, added only to make things clear for anyone checking:

library(dplyr)
library(data.table)
mydf.newgrp<-mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1))) 
setDT(mydf.newgrp)
anscols.Lead1=paste("Lead.1",c('x','y'),sep="_")
mydf.newgrp[,(anscols.Lead1):=shift(.SD,1,NA,type="lead"),.SDcols=c('x','y'),by=group]
mydf.newgrp

2 Answers:

Answer 0 (score: 4)

With dplyr, we can create a new grouping variable (group) that starts a new group wherever the difference between two consecutive year values is not 1. We then group by ID and group and compute the lead values.

library(dplyr)

mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1))) %>%
  mutate(Lead_x = lead(x), Lead_y = lead(y)) %>%
  select(-group)

#   group ID     year       x       y  Lead_x  Lead_y
#   <int> <fct> <int>   <dbl>   <dbl>   <dbl>   <dbl>
# 1     1 A      2001 -0.502   0.0899  0.132   0.0963
# 2     1 A      2002  0.132   0.0963 -0.0789 -0.202
# 3     1 A      2003 -0.0789 -0.202  NA      NA
# 4     2 A      2006  0.887   0.740   0.117   0.123
# 5     2 A      2007  0.117   0.123   0.319  -0.0293
# 6     2 A      2008  0.319  -0.0293 -0.582  -0.389
# 7     2 A      2009 -0.582  -0.389  NA      NA
# 8     3 B      2001  0.715   0.511  -0.825  -0.914
# 9     3 B      2002 -0.825  -0.914  -0.360   2.31
#10     3 B      2003 -0.360   2.31   NA      NA

If we need to select a lot of columns, we can use mutate_at:

cols <- c("x", "y")

mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1))) %>%
  mutate_at(cols, .funs = funs(lead = lead(.))) %>%
  select(-group)

#   group ID     year       x       y  x_lead  y_lead
#   <int> <fct> <int>   <dbl>   <dbl>   <dbl>   <dbl>
# 1     1 A      2001 -0.502   0.0899  0.132   0.0963
# 2     1 A      2002  0.132   0.0963 -0.0789 -0.202
# 3     1 A      2003 -0.0789 -0.202  NA      NA
# 4     2 A      2006  0.887   0.740   0.117   0.123
# 5     2 A      2007  0.117   0.123   0.319  -0.0293
# 6     2 A      2008  0.319  -0.0293 -0.582  -0.389
# 7     2 A      2009 -0.582  -0.389  NA      NA
# 8     3 B      2001  0.715   0.511  -0.825  -0.914
# 9     3 B      2002 -0.825  -0.914  -0.360   2.31
#10     3 B      2003 -0.360   2.31   NA      NA

The values of the grouping variable group appear in the first column of the output above, because select(-group) cannot drop a variable that is still used for grouping.
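For readers on dplyr 1.0.0 or later, where funs() is deprecated and mutate_at() is superseded, roughly the same result could be sketched with across(). This is an assumption-laden variant rather than the answerer's code: the name res is hypothetical, and the run index is computed within each ID before regrouping.

```r
library(dplyr)

set.seed(100)
mydf <- data.frame(ID = c(rep("A", 7), rep("B", 3)),
                   year = c(2001:2003, 2006:2009, 2001:2003),
                   x = rnorm(10),
                   y = rnorm(10))

cols <- c("x", "y")

res <- mydf %>%
  group_by(ID) %>%
  # start a new run whenever the year does not follow the previous one
  mutate(group = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(ID, group) %>%
  # apply lead() to every column in cols, naming the results x_lead, y_lead
  mutate(across(all_of(cols), lead, .names = "{.col}_lead")) %>%
  ungroup() %>%
  select(-group)
```

Calling ungroup() before select(-group) lets the helper column be dropped, so res contains only the original columns plus x_lead and y_lead.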

Answer 1 (score: 2)

With data.table, we can change by so that it includes the grouping variable:

library(data.table)
setDT(mydf)[, paste0("Lead.1_", names(mydf)[3:4]) := 
    shift(.SD, type = 'lead'), by = .(ID, cumsum(year - shift(year, fill = year[1]) != 1))]
mydf
#    ID year           x           y    Lead.1_x    Lead.1_y
# 1:  A 2001 -0.50219235  0.08988614  0.13153117  0.09627446
# 2:  A 2002  0.13153117  0.09627446 -0.07891709 -0.20163395
# 3:  A 2003 -0.07891709 -0.20163395          NA          NA
# 4:  A 2006  0.88678481  0.73984050  0.11697127  0.12337950
# 5:  A 2007  0.11697127  0.12337950  0.31863009 -0.02931671
# 6:  A 2008  0.31863009 -0.02931671 -0.58179068 -0.38885425
# 7:  A 2009 -0.58179068 -0.38885425          NA          NA
# 8:  B 2001  0.71453271  0.51085626 -0.82525943 -0.91381419
# 9:  B 2002 -0.82525943 -0.91381419 -0.35986213  2.31029682
#10:  B 2003 -0.35986213  2.31029682          NA          NA

If there are other columns that should not be shifted, we can specify .SDcols:

nm1 <- names(mydf)[3:4]
setDT(mydf)[, paste0("Lead.1_", nm1) := 
    shift(.SD, type = 'lead'), 
   by = .(ID, cumsum(year - shift(year, fill = year[1]) != 1)), .SDcols = nm1]
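Since the question asks for lags as well as leads, the same grouping idea might be extended as in the sketch below. It is not part of the original answer: the helper column run and the Lag.1_ prefix are hypothetical names chosen for illustration.

```r
library(data.table)

set.seed(100)
mydf <- data.table(ID = c(rep("A", 7), rep("B", 3)),
                   year = c(2001:2003, 2006:2009, 2001:2003),
                   x = rnorm(10),
                   y = rnorm(10))

cols <- c("x", "y")

# run (hypothetical helper column) marks consecutive-year runs within each ID
mydf[, run := cumsum(c(TRUE, diff(year) != 1L)), by = ID]

# one-step lead and lag of every column in cols, computed within each run
mydf[, paste0("Lead.1_", cols) := shift(.SD, 1L, type = "lead"),
     by = .(ID, run), .SDcols = cols]
mydf[, paste0("Lag.1_", cols) := shift(.SD, 1L, type = "lag"),
     by = .(ID, run), .SDcols = cols]

# drop the helper column once the shifted columns exist
mydf[, run := NULL]
```

Storing the run index in its own column keeps the by clause short and lets both shift() calls reuse it; the lag columns are NA in the first row of each run, just as the lead columns are NA in the last.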