How to update specific rows with a constant value by group?

Time: 2017-01-04 10:00:43

Tags: r data.table

Let's say we have the following:

library(data.table); library(zoo)
dt <- data.table(grp = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3), period = seq.Date(from = as.Date('2014-01-01'), to = as.Date('2014-05-01'), by = 'month'), x=c(1:15), y=c(11:25))
dt[, period:=as.yearmon(period, '%Y-%m-%d')]

which returns:

    grp   period  x  y
 1:   1 Jan 2014  1 11
 2:   1 Feb 2014  2 12
 3:   1 Mar 2014  3 13
 4:   1 Apr 2014  4 14
 5:   1 May 2014  5 15
 6:   2 Jan 2014  6 16
 7:   2 Feb 2014  7 17
 8:   2 Mar 2014  8 18
 9:   2 Apr 2014  9 19
10:   2 May 2014 10 20
11:   3 Jan 2014 11 21
12:   3 Feb 2014 12 22
13:   3 Mar 2014 13 23
14:   3 Apr 2014 14 24
15:   3 May 2014 15 25

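For every period after March 2014, I would like to update columns x and y with that group's March 2014 values. My expected output is as follows: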

    grp   period  x  y
 1:   1 Jan 2014  1 11
 2:   1 Feb 2014  2 12
 3:   1 Mar 2014  3 13
 4:   1 Apr 2014  3 13
 5:   1 May 2014  3 13
 6:   2 Jan 2014  6 16
 7:   2 Feb 2014  7 17
 8:   2 Mar 2014  8 18
 9:   2 Apr 2014  8 18
10:   2 May 2014  8 18
11:   3 Jan 2014 11 21
12:   3 Feb 2014 12 22
13:   3 Mar 2014 13 23
14:   3 Apr 2014 13 23
15:   3 May 2014 13 23

I tried the following code, but it only uses the values from row 3, regardless of group.

dt[which(period > dt[3, period]),`:=`(x=dt[3, x], y = dt[3, y]), by=grp]

Any suggestions?

3 answers:

Answer 0: (score: 4)

You can replace all x and y values after March 2014 with NA, and then use na.locf() to carry the last non-NA value forward:

dt[period > "March 2014",`:=`(x=NA,y=NA)][,`:=`(x=na.locf(x), y=na.locf(y))]
#    grp   period  x  y
# 1:   1 Jan 2014  1 11
# 2:   1 Feb 2014  2 12
# 3:   1 Mar 2014  3 13
# 4:   1 Apr 2014  3 13
# 5:   1 May 2014  3 13
# 6:   2 Jan 2014  6 16
# 7:   2 Feb 2014  7 17
# 8:   2 Mar 2014  8 18
# 9:   2 Apr 2014  8 18
#10:   2 May 2014  8 18
#11:   3 Jan 2014 11 21
#12:   3 Feb 2014 12 22
#13:   3 Mar 2014 13 23
#14:   3 Apr 2014 13 23
#15:   3 May 2014 13 23
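The one-liner above fills forward across the whole table, which is safe here because every group has a March 2014 row before its NA values. As a minimal sketch of a grouped variant (not part of the original answer), the fill can be restricted to each grp. The names dt2, months and cutoff are introduced for illustration, and the table is rebuilt so the block runs on its own:

```r
library(data.table)
library(zoo)

# Rebuild the example data from the question (dt2 is an illustrative name).
months <- seq(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")
dt2 <- data.table(
  grp    = rep(1:3, each = 5),
  period = as.yearmon(rep(months, times = 3)),
  x      = 1:15,
  y      = 11:25
)

# Locale-independent way to write "March 2014".
cutoff <- as.yearmon("2014-03")

# Step 1: blank out everything after the cutoff.
dt2[period > cutoff, `:=`(x = NA_integer_, y = NA_integer_)]

# Step 2: carry the last observed value forward, separately within each group.
dt2[, `:=`(x = na.locf(x, na.rm = FALSE),
           y = na.locf(y, na.rm = FALSE)), by = grp]

print(dt2)
```

Grouping the second step means a group without a row at or before the cutoff would keep its NA values instead of inheriting values from the previous group.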

Answer 1: (score: 3)

Looking at this again, I think this is a pretty clean way (assuming the rows are sorted by period within each group):

cols = c("x", "y")
dt[period >= "Mar 2014", (cols) := .SD[1L], by=grp, .SDcols = cols]
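Because .SD[1L] takes the first row of each group's subset, the result depends on row order. A minimal sketch, assuming the same example data, that sorts explicitly with setorder() before applying the same update; dt2, months and cutoff are illustrative names, and the rows are reversed on purpose to show that the sort matters:

```r
library(data.table)
library(zoo)

# Rebuild the example data, then reverse the rows to simulate unsorted input.
months <- seq(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")
dt2 <- data.table(
  grp    = rep(1:3, each = 5),
  period = as.yearmon(rep(months, times = 3)),
  x      = 1:15,
  y      = 11:25
)
dt2 <- dt2[rev(seq_len(nrow(dt2)))]

# Sort by group and period so the first row at or after the cutoff is March.
setorder(dt2, grp, period)

cols   <- c("x", "y")
cutoff <- as.yearmon("2014-03")

# Within each group, overwrite x and y from March onward with the March row.
dt2[period >= cutoff, (cols) := .SD[1L], by = grp, .SDcols = cols]

print(dt2)
```

Note that setorder() reorders the table by reference, so this is only appropriate when the original row order does not need to be preserved.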

Another way is to use a rolling join:

dt[period >= "Mar 2014", c("x", "y") := 
  .SD[period == "Mar 2014"][.SD, on=.(grp, period), roll=TRUE, .(x.x, x.y)]
]

How the second option works

Everything below is covered in the main documentation, which you can open by typing ?data.table.

DT[i, (cols) := e] overwrites the columns named in cols, in the rows selected by i, with the result of e.
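As a minimal sketch of that form on a toy table (the table and its column names id, a and b are made up for illustration and are not from the question):

```r
library(data.table)

# Toy table; all names here are illustrative.
DT <- data.table(id = 1:4,
                 a  = c(10L, 20L, 30L, 40L),
                 b  = c("p", "q", "r", "s"))
cols <- c("a", "b")

# i selects rows with id >= 3; e is a list with one element per column in cols.
# Rows not selected by i are left untouched.
DT[id >= 3, (cols) := list(0L, "z")]

print(DT)
```

The parentheses around cols tell data.table to use the character vector's contents as the column names, rather than creating a column literally called cols.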

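Looking closely at e, we see .SD, which only works inside DT[i, ...]. To take e out of DT[i, ...], we can use DT[i] in place of .SD. From there, we can build e up step by step to see how it works: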

> mySD = dt[period >= "Mar 2014"]
> mySD
   grp   period  x  y
1:   1 Mar 2014  3 13
2:   1 Apr 2014  4 14
3:   1 May 2014  5 15
4:   2 Mar 2014  8 18
5:   2 Apr 2014  9 19
6:   2 May 2014 10 20
7:   3 Mar 2014 13 23
8:   3 Apr 2014 14 24
9:   3 May 2014 15 25
> mySD[period == "Mar 2014"]
   grp   period  x  y
1:   1 Mar 2014  3 13
2:   2 Mar 2014  8 18
3:   3 Mar 2014 13 23
> mySD[period == "Mar 2014"][mySD, on=.(grp, period)]
   grp   period  x  y i.x i.y
1:   1 Mar 2014  3 13   3  13
2:   1 Apr 2014 NA NA   4  14
3:   1 May 2014 NA NA   5  15
4:   2 Mar 2014  8 18   8  18
5:   2 Apr 2014 NA NA   9  19
6:   2 May 2014 NA NA  10  20
7:   3 Mar 2014 13 23  13  23
8:   3 Apr 2014 NA NA  14  24
9:   3 May 2014 NA NA  15  25
> mySD[period == "Mar 2014"][mySD, on=.(grp, period), roll=TRUE]
   grp   period  x  y i.x i.y
1:   1 Mar 2014  3 13   3  13
2:   1 Apr 2014  3 13   4  14
3:   1 May 2014  3 13   5  15
4:   2 Mar 2014  8 18   8  18
5:   2 Apr 2014  8 18   9  19
6:   2 May 2014  8 18  10  20
7:   3 Mar 2014 13 23  13  23
8:   3 Apr 2014 13 23  14  24
9:   3 May 2014 13 23  15  25
> mySD[period == "Mar 2014"][mySD, on=.(grp, period), roll=TRUE, .(x.x, x.y)]
   x.x x.y
1:   3  13
2:   3  13
3:   3  13
4:   8  18
5:   8  18
6:   8  18
7:  13  23
8:  13  23
9:  13  23

Answer 2: (score: 1)

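An option with dplyr: filter the rows where period is greater than or equal to Mar 2014, and assign the x and y values from the Mar 2014 period to all of those rows, grouped by grp.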
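The description can be sketched roughly as follows. This is not the answerer's original code, only a minimal sketch under the assumption that a grouped mutate() with ifelse() is an acceptable way to express it; df, months, cutoff and res are illustrative names:

```r
library(dplyr)
library(zoo)

# Rebuild the example data as a plain data frame (df is an illustrative name).
months <- seq(as.Date("2014-01-01"), as.Date("2014-05-01"), by = "month")
df <- data.frame(
  grp    = rep(1:3, each = 5),
  period = as.yearmon(rep(months, times = 3)),
  x      = 1:15,
  y      = 11:25
)

# Locale-independent way to write "March 2014".
cutoff <- as.yearmon("2014-03")

res <- df %>%
  group_by(grp) %>%
  mutate(
    # From the cutoff onward, use the group's March value; otherwise keep the row's own.
    x = ifelse(period >= cutoff, x[period == cutoff], x),
    y = ifelse(period >= cutoff, y[period == cutoff], y)
  ) %>%
  ungroup()

print(res)
```

This sketch assumes each group has exactly one row for the cutoff month; with zero or several matching rows, x[period == cutoff] would not have length one and the result would need different handling.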