Let's say we have the following:
library(data.table); library(zoo)
dt <- data.table(grp = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3), period = seq.Date(from = as.Date('2014-01-01'), to = as.Date('2014-05-01'), by = 'month'), x=c(1:15), y=c(11:25))
dt[, period:=as.yearmon(period, '%Y-%m-%d')]
which returns:
grp period x y
1: 1 Jan 2014 1 11
2: 1 Feb 2014 2 12
3: 1 Mar 2014 3 13
4: 1 Apr 2014 4 14
5: 1 May 2014 5 15
6: 2 Jan 2014 6 16
7: 2 Feb 2014 7 17
8: 2 Mar 2014 8 18
9: 2 Apr 2014 9 19
10: 2 May 2014 10 20
11: 3 Jan 2014 11 21
12: 3 Feb 2014 12 22
13: 3 Mar 2014 13 23
14: 3 Apr 2014 14 24
15: 3 May 2014 15 25
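I want to update columns x and y, for every period after March 2014, with the values those columns have in March 2014 within the same group. My expected output is as follows: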
grp period x y
1: 1 Jan 2014 1 11
2: 1 Feb 2014 2 12
3: 1 Mar 2014 3 13
4: 1 Apr 2014 3 13
5: 1 May 2014 3 13
6: 2 Jan 2014 6 16
7: 2 Feb 2014 7 17
8: 2 Mar 2014 8 18
9: 2 Apr 2014 8 18
10: 2 May 2014 8 18
11: 3 Jan 2014 11 21
12: 3 Feb 2014 12 22
13: 3 Mar 2014 13 23
14: 3 Apr 2014 13 23
15: 3 May 2014 13 23
I tried the following code, but it only uses the values from row 3, whatever the group:
dt[which(period > dt[3, period]),`:=`(x=dt[3, x], y = dt[3, y]), by=grp]
Can you suggest a fix?
Answer 0 (score: 4)
You can replace all x and y values after March 2014 with NA, and then fill them forward with na.locf():
dt[period > "March 2014",`:=`(x=NA,y=NA)][,`:=`(x=na.locf(x), y=na.locf(y))]
# grp period x y
# 1: 1 Jan 2014 1 11
# 2: 1 Feb 2014 2 12
# 3: 1 Mar 2014 3 13
# 4: 1 Apr 2014 3 13
# 5: 1 May 2014 3 13
# 6: 2 Jan 2014 6 16
# 7: 2 Feb 2014 7 17
# 8: 2 Mar 2014 8 18
# 9: 2 Apr 2014 8 18
#10: 2 May 2014 8 18
#11: 3 Jan 2014 11 21
#12: 3 Feb 2014 12 22
#13: 3 Mar 2014 13 23
#14: 3 Apr 2014 13 23
#15: 3 May 2014 13 23
Answer 1 (score: 3)
Looking at this again, I think this is a very clean way to do it (assuming the data is sorted by period within each group):
cols = c("x", "y")
dt[period >= "Mar 2014", (cols) := .SD[1L], by=grp, .SDcols = cols]
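The .SD[1L] form takes the first row of each group among the rows selected by i, which is why it assumes the table is sorted. If the order cannot be guaranteed, one possible variant picks the earliest selected period explicitly. This is a sketch under the assumption that every group has a March 2014 row; month_starts and cutoff are names introduced here, and the rows are shuffled only to show that order no longer matters.

```r
library(data.table)
library(zoo)

# Rebuild the example data from the question.
month_starts <- seq.Date(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")
dt <- data.table(grp = rep(1:3, each = 5),
                 period = rep(month_starts, times = 3),
                 x = 1:15, y = 11:25)
dt[, period := as.yearmon(period)]

# Shuffle the rows so that the table is no longer sorted.
set.seed(1)
dt <- dt[sample(.N)]

cutoff <- as.yearmon("Mar 2014")   # hypothetical name for the reference month
cols <- c("x", "y")

# Within each group, among rows at or after the cutoff, take the value
# that belongs to the earliest period and assign it to all of those rows.
dt[period >= cutoff,
   (cols) := lapply(.SD, function(v) v[which.min(period)]),
   by = grp, .SDcols = cols]

setorder(dt, grp, period)
print(dt)
```

Although period is not listed in .SDcols, it is still visible inside j, so which.min(period) can locate the reference row within each group.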
Another way is to use a rolling join:
dt[period >= "Mar 2014", c("x", "y") :=
.SD[period == "Mar 2014"][.SD, on=.(grp, period), roll=TRUE, .(x.x, x.y)]
]
How the second option works
Everything below is covered in the main documentation, which you can open by typing ?data.table.
DT[i, (cols) := e] overwrites the columns cols, in the rows selected by i, with the result of e.
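Looking closely at e, we see .SD, which only exists inside DT[i, ...]. To work with it outside DT[i, ...], we can use DT[i] in place of .SD. From there, we can build e up step by step to see how it works: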
> mySD = dt[period >= "Mar 2014"]
> mySD
grp period x y
1: 1 Mar 2014 3 13
2: 1 Apr 2014 4 14
3: 1 May 2014 5 15
4: 2 Mar 2014 8 18
5: 2 Apr 2014 9 19
6: 2 May 2014 10 20
7: 3 Mar 2014 13 23
8: 3 Apr 2014 14 24
9: 3 May 2014 15 25
> mySD[period == "Mar 2014"]
grp period x y
1: 1 Mar 2014 3 13
2: 2 Mar 2014 8 18
3: 3 Mar 2014 13 23
> mySD[period == "Mar 2014"][mySD, on=.(grp, period)]
grp period x y i.x i.y
1: 1 Mar 2014 3 13 3 13
2: 1 Apr 2014 NA NA 4 14
3: 1 May 2014 NA NA 5 15
4: 2 Mar 2014 8 18 8 18
5: 2 Apr 2014 NA NA 9 19
6: 2 May 2014 NA NA 10 20
7: 3 Mar 2014 13 23 13 23
8: 3 Apr 2014 NA NA 14 24
9: 3 May 2014 NA NA 15 25
> mySD[period == "Mar 2014"][mySD, on=.(grp, period), roll=TRUE]
grp period x y i.x i.y
1: 1 Mar 2014 3 13 3 13
2: 1 Apr 2014 3 13 4 14
3: 1 May 2014 3 13 5 15
4: 2 Mar 2014 8 18 8 18
5: 2 Apr 2014 8 18 9 19
6: 2 May 2014 8 18 10 20
7: 3 Mar 2014 13 23 13 23
8: 3 Apr 2014 13 23 14 24
9: 3 May 2014 13 23 15 25
> mySD[period == "Mar 2014"][mySD, on=.(grp, period), roll=TRUE, .(x.x, x.y)]
x.x x.y
1: 3 13
2: 3 13
3: 3 13
4: 8 18
5: 8 18
6: 8 18
7: 13 23
8: 13 23
9: 13 23
Answer 2 (score: 1)
One option with dplyr: filter the data for period greater than or equal to Mar 2014, and assign the x and y values of period Mar 2014 to all of those rows, grouped by grp.
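A minimal sketch of this idea, assuming the dplyr package is installed. It uses mutate() with ifelse() instead of a literal filter step so that the rows before March 2014 are kept in the result. The names month_starts, cutoff and res are introduced here for illustration, and the sketch assumes every group has exactly one March 2014 row.

```r
library(data.table)
library(zoo)
library(dplyr)

# Rebuild the example data from the question.
month_starts <- seq.Date(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")
dt <- data.table(grp = rep(1:3, each = 5),
                 period = rep(month_starts, times = 3),
                 x = 1:15, y = 11:25)
dt[, period := as.yearmon(period)]

cutoff <- as.yearmon("Mar 2014")   # hypothetical name for the reference month

res <- dt %>%
  group_by(grp) %>%
  mutate(
    # For rows after the cutoff, use the group's value at the cutoff;
    # otherwise keep the row's own value.
    x = ifelse(period > cutoff, x[period == cutoff], x),
    y = ifelse(period > cutoff, y[period == cutoff], y)
  ) %>%
  ungroup()

print(res)
```

Unlike the data.table answers, this returns a new object and leaves dt unchanged.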