Find the sum of one column based on consecutive values in another column of a data.table

Asked: 2016-02-18 21:21:17

Tags: r sum data.table conditional-statements

I have a data.table like the following:

    library(data.table)

    # dput(DT) output, with the unparseable .internal.selfref pointer removed
    DT <- structure(list(
      ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
      Job = structure(c(6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L),
        .Label = c("f1", "f2", "f3", "f4", "f5", "h1", "h2", "h3"), class = "factor"),
      Duration = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L),
      Outsourced = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L),
        .Label = c("N", "Y"), class = "factor")),
      .Names = c("ID", "Job", "Duration", "Outsourced"),
      row.names = c(NA, -14L), class = c("data.table", "data.frame"))
    setDT(DT)  # restore data.table's internal self-reference

Printed, it looks like this:

         ID      Job     Duration Outsourced
 1:       1       h1        2          N
 2:       1       h2        3          Y
 3:       1       h3        4          Y
 4:       1       f1        4          Y
 5:       1       f2        3          N
 6:       1       f3        2          N
 7:       1       f4        1          N
 8:       1       f5        0          N
 9:       2       h1        2          N
10:       2       h2        3          Y
11:       2       f1        4          Y
12:       2       f2        5          N
13:       2       f3        4          N
14:       2       f4        0          N

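For every run of jobs with consecutive "Y" values in the Outsourced column, I want the sum of Duration. Jobs that belong to different IDs should not be treated as consecutive, and a single ID may contain more than one run of consecutive "Y" jobs.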

So for this example, the correct answer would look like this:

        ID V1
1:       1 11
2:       2  7

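Currently I use rle to find the run lengths of "Y" in the Outsourced column and then try to do the rest with a series of if statements, but I think this could be done more elegantly. Thanks.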
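As a rough illustration of the rle step described above (a sketch only, using the Outsourced values from the sample table as a plain character vector), the run-length encoding shows why extra logic is needed: rle knows nothing about ID, so the middle run of "N" values spans the boundary between ID 1 and ID 2.

```r
# Outsourced column from the sample data, copied as a plain character vector
outsourced <- c("N", "Y", "Y", "Y", "N", "N", "N", "N",
                "N", "Y", "Y", "N", "N", "N")

# Run-length encode the whole column, ignoring ID
runs <- rle(outsourced)

print(runs$values)   # the value of each run, in order
print(runs$lengths)  # how many rows each run covers
```

The third run covers rows 5 to 9, which mixes the last four rows of ID 1 with the first row of ID 2. A "Y" run crossing an ID boundary would be merged in the same way.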

1 answer:

Answer 0 (score: 1)

Following the suggestion from @docendo discimus above, I managed to get what I wanted by adding a "unique" statement:

DT[, NewCol := sum(Duration), by = list(ID, rleid(Outsourced))][Outsourced == "N", NewCol := NA]
DT[!is.na(NewCol), unique(NewCol), ID]
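To make the grouping step easier to follow, here is a minimal sketch that rebuilds the sample data (with Job and Outsourced as character columns for brevity, which is an assumption and not the original factor layout) and prints one row per run. The column names run and total and the object run_totals are illustrative. Note that rleid() numbers runs across the whole column, so it is the ID in the grouping that keeps runs from crossing an ID boundary.

```r
library(data.table)

# Sample data from the question (character columns instead of factors)
DT <- data.table(
  ID = c(rep(1L, 8), rep(2L, 6)),
  Job = c("h1", "h2", "h3", "f1", "f2", "f3", "f4", "f5",
          "h1", "h2", "f1", "f2", "f3", "f4"),
  Duration = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L),
  Outsourced = c("N", "Y", "Y", "Y", "N", "N", "N", "N",
                 "N", "Y", "Y", "N", "N", "N")
)

# rleid() increases by one each time Outsourced changes value
DT[, run := rleid(Outsourced)]

# Sum Duration per (ID, run); grouping by ID splits any run shared by two IDs
run_totals <- DT[, .(Outsourced = Outsourced[1L], total = sum(Duration)),
                 by = .(ID, run)]
print(run_totals)
```

Filtering run_totals on Outsourced == "Y" leaves the totals 11 for ID 1 and 7 for ID 2, matching the expected output in the question.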

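Edit: to cover cases where an ID contains several outsourced activities with the same duration, the second statement should be changed to: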

DT[!is.na(NewCol), sum(rle(NewCol)$values), ID]
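One possible caveat: once the "N" rows are filtered out, two separate "Y" runs in the same ID that happen to have equal totals sit next to each other, and rle() would then treat them as a single run. The sketch below is an alternative, not the method from the answer; it keeps the run id in the grouping so equal totals are never merged. The table DT2 and the names runs, y_runs and y_totals are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical data: one ID with two separate "Y" runs that both sum to 7
DT2 <- data.table(
  ID = c(1L, 1L, 1L, 1L),
  Duration = c(3L, 4L, 1L, 7L),
  Outsourced = c("Y", "Y", "N", "Y")
)

# One row per run of identical Outsourced values within each ID
runs <- DT2[, .(Outsourced = Outsourced[1L], V1 = sum(Duration)),
            by = .(ID, run = rleid(Outsourced))]

# Keep only the "Y" runs: one row per run, even when totals coincide
y_runs <- runs[Outsourced == "Y", .(ID, V1)]
print(y_runs)

# If a single figure per ID is wanted, add up the run totals
y_totals <- y_runs[, .(V1 = sum(V1)), by = ID]
print(y_totals)
```

Because the aggregation happens per (ID, run) before filtering, no temporary NewCol column is added to the original table.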