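I've run into a problem and need your help. I have a dataset that can be treated as a panel, but with one difference: there may be multiple time series for each ID. See the example below: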
set.seed(100)
## create data
mydf<-data.frame(ID = c(rep('A',7),rep('B',3)),
year =c(c(2001:2003),c(2006:2009),c(2001:2003)),
x = rnorm(10),
y = rnorm(10))
mydf
ID year x y
1: A 2001 -0.50219235 0.08988614
2: A 2002 0.13153117 0.09627446
3: A 2003 -0.07891709 -0.20163395
4: A 2006 0.88678481 0.73984050
5: A 2007 0.11697127 0.12337950
6: A 2008 0.31863009 -0.02931671
7: A 2009 -0.58179068 -0.38885425
8: B 2001 0.71453271 0.51085626
9: B 2002 -0.82525943 -0.91381419
10: B 2003 -0.35986213 2.31029682
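For some special reason, I want to keep, for each ID, every time series that has at least three consecutive observations. This can produce multiple time series for one ID; as you can see, ID == A has two time series that satisfy this condition. I want to create lead and lag variables for x and y.
If each ID had only one continuous time series, I could simply use: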
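library(data.table)
setDT(mydf)  # := needs a data.table, not a plain data.frame
anscols.Lead1 <- paste("Lead.1", c('x','y'), sep = "_")
mydf[, (anscols.Lead1) := shift(.SD, 1, NA, type = "lead"), .SDcols = c('x','y'), by = ID]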
Or, if I only need to operate on a single column, I could also use:
library(plm)
tp.mydf <- pdata.frame(mydf, c("ID","year"))
tp.mydf$lag1x <- lag(tp.mydf$x)
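However, with non-consecutive time series and multiple columns, the data.table approach does not work. It gives the result below, where the lead for A 2003 is wrongly taken from A 2006: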
mydf
ID year x y Lead.1_x Lead.1_y
1: A 2001 -0.50219235 0.08988614 0.13153117 0.09627446
2: A 2002 0.13153117 0.09627446 -0.07891709 -0.20163395
3: A 2003 -0.07891709 -0.20163395 0.88678481 0.73984050
4: A 2006 0.88678481 0.73984050 0.11697127 0.12337950
5: A 2007 0.11697127 0.12337950 0.31863009 -0.02931671
6: A 2008 0.31863009 -0.02931671 -0.58179068 -0.38885425
7: A 2009 -0.58179068 -0.38885425 NA NA
8: B 2001 0.71453271 0.51085626 -0.82525943 -0.91381419
9: B 2002 -0.82525943 -0.91381419 -0.35986213 2.31029682
10: B 2003 -0.35986213 2.31029682 NA NA
What I want is:
mydf
ID year x y Lead.1_x Lead.1_y
1: A 2001 -0.50219235 0.08988614 0.13153117 0.09627446
2: A 2002 0.13153117 0.09627446 -0.07891709 -0.20163395
3: A 2003 -0.07891709 -0.20163395 NA NA
4: A 2006 0.88678481 0.73984050 0.11697127 0.12337950
5: A 2007 0.11697127 0.12337950 0.31863009 -0.02931671
6: A 2008 0.31863009 -0.02931671 -0.58179068 -0.38885425
7: A 2009 -0.58179068 -0.38885425 NA NA
8: B 2001 0.71453271 0.51085626 -0.82525943 -0.91381419
9: B 2002 -0.82525943 -0.91381419 -0.35986213 2.31029682
10: B 2003 -0.35986213 2.31029682 NA NA
Does anyone know how to solve this?
================== Edit, based entirely on Shah's answer, added only to make things clear for readers checking this:
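library(dplyr)
library(data.table)
# group increments at every break in consecutive years
mydf.newgrp <- mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1)))
setDT(mydf.newgrp)
anscols.Lead1 <- paste("Lead.1", c('x','y'), sep = "_")
# group by ID as well, so runs from different IDs can never be merged
mydf.newgrp[, (anscols.Lead1) := shift(.SD, 1, NA, type = "lead"), .SDcols = c('x','y'), by = .(ID, group)]
mydf.newgrp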
Answer 0 (score: 4)
Using dplyr, we can create a new grouping variable (group) that increments wherever the difference between two consecutive year values is not 1. We then group by ID and group and compute the lead values.
library(dplyr)
mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1))) %>%
  mutate(Lead_x = lead(x), Lead_y = lead(y)) %>%
  select(-group)
#    group ID     year       x       y  Lead_x  Lead_y
#    <int> <fct> <int>   <dbl>   <dbl>   <dbl>   <dbl>
#  1     1 A      2001 -0.502   0.0899  0.132   0.0963
#  2     1 A      2002  0.132   0.0963 -0.0789 -0.202
#  3     1 A      2003 -0.0789 -0.202  NA      NA
#  4     2 A      2006  0.887   0.740   0.117   0.123
#  5     2 A      2007  0.117   0.123   0.319  -0.0293
#  6     2 A      2008  0.319  -0.0293 -0.582  -0.389
#  7     2 A      2009 -0.582  -0.389  NA      NA
#  8     3 B      2001  0.715   0.511  -0.825  -0.914
#  9     3 B      2002 -0.825  -0.914  -0.360   2.31
# 10     3 B      2003 -0.360   2.31   NA      NA
If we need to do this for many columns, we can use mutate_at:
cols <- c("x", "y")
mydf %>%
  group_by(ID, group = cumsum(c(T, diff(year) != 1))) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list gives the same x_lead / y_lead names
  mutate_at(cols, list(lead = lead)) %>%
  select(-group)
#    group ID     year       x       y  x_lead  y_lead
#    <int> <fct> <int>   <dbl>   <dbl>   <dbl>   <dbl>
#  1     1 A      2001 -0.502   0.0899  0.132   0.0963
#  2     1 A      2002  0.132   0.0963 -0.0789 -0.202
#  3     1 A      2003 -0.0789 -0.202  NA      NA
#  4     2 A      2006  0.887   0.740   0.117   0.123
#  5     2 A      2007  0.117   0.123   0.319  -0.0293
#  6     2 A      2008  0.319  -0.0293 -0.582  -0.389
#  7     2 A      2009 -0.582  -0.389  NA      NA
#  8     3 B      2001  0.715   0.511  -0.825  -0.914
#  9     3 B      2002 -0.825  -0.914  -0.360   2.31
# 10     3 B      2003 -0.360   2.31   NA      NA
The grouping variable group still appears in the output because select() keeps grouping columns; add ungroup() before select(-group) to drop it.
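To see what the grouping expression cumsum(c(T, diff(year) != 1)) actually produces, it can help to run it on the year column alone. The following is a minimal base R sketch that uses only the years from the question's example data; the names starts and run_id are introduced here for illustration and are not part of the answer.

```r
# Years for ID A followed by ID B, as in the question's example data
year <- c(2001:2003, 2006:2009, 2001:2003)

# TRUE marks the first row of each run: the first row overall, plus every row
# whose year is not exactly one more than the previous row's year
starts <- c(TRUE, diff(year) != 1)

# The running count of run starts serves as the run id
run_id <- cumsum(starts)
run_id
```

Note that diff() runs over the whole column, so the jump from A 2009 to B 2001 also starts a new run. If one ID ended in some year and the next ID began in the following year, the year difference alone would not separate them, which is why the answer also groups by ID.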
Answer 1 (score: 2)
Using data.table, we can change by so that it includes the grouping variable:
library(data.table)
setDT(mydf)[, paste0("Lead.1_", names(mydf)[3:4]) :=
shift(.SD, type = 'lead'), by = .(ID, cumsum(year - shift(year, fill = year[1]) != 1))]
mydf
# ID year x y Lead.1_x Lead.1_y
# 1: A 2001 -0.50219235 0.08988614 0.13153117 0.09627446
# 2: A 2002 0.13153117 0.09627446 -0.07891709 -0.20163395
# 3: A 2003 -0.07891709 -0.20163395 NA NA
# 4: A 2006 0.88678481 0.73984050 0.11697127 0.12337950
# 5: A 2007 0.11697127 0.12337950 0.31863009 -0.02931671
# 6: A 2008 0.31863009 -0.02931671 -0.58179068 -0.38885425
# 7: A 2009 -0.58179068 -0.38885425 NA NA
# 8: B 2001 0.71453271 0.51085626 -0.82525943 -0.91381419
# 9: B 2002 -0.82525943 -0.91381419 -0.35986213 2.31029682
#10: B 2003 -0.35986213 2.31029682 NA NA
If there are other columns that do not need to be shifted, we can specify .SDcols:
nm1 <- names(mydf)[3:4]
setDT(mydf)[, paste0("Lead.1_", nm1) :=
shift(.SD, type = 'lead'),
by = .(ID, cumsum(year - shift(year, fill = year[1]) != 1)), .SDcols = nm1]
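The question asks for lags as well as leads, and both answers show only the lead. As a rough, dependency-free sketch of the same run-based idea (this swaps in base R's ave() for dplyr or data.table, and it is not taken from either answer), both directions can be computed per run. The data frame df, its made-up x values, and the helpers lead1 and lag1 are assumptions introduced for illustration.

```r
# Toy data with the same ID/year layout as the question; the x values are made up
df <- data.frame(ID = c(rep("A", 7), rep("B", 3)),
                 year = c(2001:2003, 2006:2009, 2001:2003),
                 x = as.numeric(1:10))

# A new run starts at row 1, at every change of ID, and at every year gap
n <- nrow(df)
new_run <- c(TRUE, df$ID[-1] != df$ID[-n] | diff(df$year) != 1)
df$run <- cumsum(new_run)

# Hypothetical helpers: shift a vector by one position, padding with NA
lead1 <- function(v) c(v[-1], NA)
lag1  <- function(v) c(NA, v[-length(v)])

# ave() applies the helper within each run and returns a full-length vector
df$Lead.1_x <- ave(df$x, df$run, FUN = lead1)
df$Lag.1_x  <- ave(df$x, df$run, FUN = lag1)
df
```

With data.table, the equivalent change to the answer's code would presumably be shift(.SD, type = 'lag') with the same by expression.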