In R: duplicate rows, except the first row, based on a condition

Time: 2015-02-27 14:45:16

Tags: r data.table

I have a data.table dt:

library(data.table)

names <- c("john","mary","mary","mary","mary","mary","mary","tom","tom","tom","mary","john","john","john","tom","tom")
# as.Date() is vectorised, so one call converts the whole character vector
dates <- as.Date(c("2010-06-01","2010-06-01","2010-06-05","2010-06-09","2010-06-13","2010-06-17","2010-06-21","2010-07-09","2010-07-13","2010-07-17","2010-06-01","2010-08-01","2010-08-05","2010-08-09","2010-09-03","2010-09-04"))
shifts_missed <- c(2,11,11,11,11,11,11,6,6,6,1,5,5,5,0,2)
shift <- c("Day","Night","Night","Night","Night","Night","Night","Day","Day","Day","Day","Night","Night","Night","Night","Day")

df <- data.frame(names=names, dates=dates, shifts_missed=shifts_missed, shift=shift)
dt <- as.data.table(df)

names   dates       shifts_missed   shift
john    2010-06-01  2               Day
mary    2010-06-01  11              Night
mary    2010-06-05  11              Night
mary    2010-06-09  11              Night
mary    2010-06-13  11              Night
mary    2010-06-17  11              Night
mary    2010-06-21  11              Night
tom     2010-07-09  6               Day
tom     2010-07-13  6               Day
tom     2010-07-17  6               Day
mary    2010-06-01  1               Day
john    2010-08-01  5               Night
john    2010-08-05  5               Night
john    2010-08-09  5               Night
tom     2010-09-03  0               Night
tom     2010-09-04  2               Day

Ultimately, what I want to end up with is the following:

names   dates       shifts_missed   shift    count
john    2010-06-01  2               Day      1
mary    2010-06-01  11              Night    1
mary    2010-06-05  11              Night    1
mary    2010-06-09  11              Night    1
mary    2010-06-13  11              Night    1
mary    2010-06-17  11              Night    1
mary    2010-06-21  11              Night    1
tom     2010-07-09  6               Day      1
tom     2010-07-13  6               Day      1
tom     2010-07-17  6               Day      1
mary    2010-06-01  1               Day      1
john    2010-08-01  5               Night    1
john    2010-08-05  5               Night    1
john    2010-08-09  5               Night    1
tom     2010-09-03  0               Night    0
tom     2010-09-04  2               Day      1
john    2010-06-01  2               Night    1
mary    2010-06-05  11              Day      1
mary    2010-06-09  11              Day      1
mary    2010-06-13  11              Day      1
mary    2010-06-17  11              Day      1
mary    2010-06-21  11              Day      1
tom     2010-07-09  6               Night    1
tom     2010-07-13  6               Night    1
tom     2010-07-17  6               Night    1
john    2010-08-05  5               Day      1
john    2010-08-09  5               Day      1
tom     2010-09-04  2               Night    1

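As you can see, the second half of the data is almost a duplicate of the first half. However, a row with shifts_missed = 0 should not be duplicated, and when shifts_missed is odd the first row should not be duplicated but the remaining rows should. The count column should then be 1 for every row, except where shifts_missed = 0.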
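The rule above can be restated as a small helper. This is only a sketch of one reading of the rule: the function name `should_duplicate` is hypothetical, and it assumes it is given the non-empty shifts_missed values of a single run of rows that share the same value.

```r
# Hypothetical helper: for one run of rows sharing the same shifts_missed
# value, return TRUE for the rows that should be duplicated.
should_duplicate <- function(shifts_missed) {
  n <- length(shifts_missed)
  first <- shifts_missed[1]
  if (first == 0) {
    rep(FALSE, n)                  # zero: never duplicated
  } else if (first %% 2 == 0) {
    rep(TRUE, n)                   # even: every row is duplicated
  } else {
    c(FALSE, rep(TRUE, n - 1))     # odd: skip the first row only
  }
}

should_duplicate(c(11, 11, 11))  # FALSE TRUE TRUE
should_duplicate(c(6, 6, 6))     # TRUE TRUE TRUE
should_duplicate(0)              # FALSE
```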

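I have seen some answers that rely on !duplicated or unique, but the values in shifts_missed are not unique. I'm sure this isn't overly complicated, and it is probably a multi-step process, but I can't figure out how to isolate the first row of each run with an odd shifts_missed value.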

1 Answer:

Answer 0: (score: 1)

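# Flag rows to duplicate: all rows of a group if its first shifts_missed is even, all but the first row if it is odd
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]
# Append a copy of the flagged rows, skipping those with shifts_missed == 0
rbind(dt, dt[is.in & shifts_missed != 0])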

Adding the extra column should be straightforward.
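For completeness, here is one way the remaining steps might look. It is a sketch, not the answerer's code: the small `dt` is a stand-in built from a few of the question's rows, the names `dup` and `result` are made up for illustration, and flipping shift on the copied rows is an assumption read off the desired output (the question's text does not state it).

```r
library(data.table)

# Stand-in for dt: a subset of the question's rows
dt <- data.table(
  names         = c("john", "mary", "mary", "mary", "tom", "tom"),
  dates         = as.Date(c("2010-06-01", "2010-06-01", "2010-06-05",
                            "2010-06-09", "2010-09-03", "2010-09-04")),
  shifts_missed = c(2, 11, 11, 11, 0, 2),
  shift         = c("Day", "Night", "Night", "Night", "Night", "Day")
)

# Same flag as in the answer
dt[, is.in := if (shifts_missed[1] %% 2 == 0) TRUE else c(FALSE, rep(TRUE, .N - 1)),
   by = .(names, shift)]

# Copy the flagged rows and flip their shift (assumption based on the desired output)
dup <- dt[is.in & shifts_missed != 0]
dup[, shift := ifelse(shift == "Day", "Night", "Day")]

# Stack originals and copies, add count (0 only where shifts_missed is 0),
# then drop the helper column
result <- rbind(dt, dup)
result[, count := as.integer(shifts_missed != 0)]
result[, is.in := NULL]
print(result)
```

Note that grouping by .(names, shift) treats all of a person's rows on one shift as a single run. That works for the data shown, but if the same person could have several separate runs on the same shift, a finer grouping (for example, also by shifts_missed) may be needed.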