I have a data.table, shown below:
dput(DT)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), Job = structure(c(6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L), .Label = c("f1", "f2", "f3", "f4", "f5", "h1", "h2", "h3"), class = "factor"), Duration = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L), Outsourced = structure(c(1L,2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("N","Y"), class = "factor")), .Names = c("ID", "Job", "Duration", "Outsourced"), row.names = c(NA, -14L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x103003178>)
which gives:
    ID Job Duration Outsourced
 1:  1  h1        2          N
 2:  1  h2        3          Y
 3:  1  h3        4          Y
 4:  1  f1        4          Y
 5:  1  f2        3          N
 6:  1  f3        2          N
 7:  1  f4        1          N
 8:  1  f5        0          N
 9:  2  h1        2          N
10:  2  h2        3          Y
11:  2  f1        4          Y
12:  2  f2        5          N
13:  2  f3        4          N
14:  2  f4        0          N
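For every run of jobs with consecutive "Y" values in the Outsourced column, I want the sum of Duration. In addition, activities that belong to different IDs should not be treated as consecutive. A single ID may contain more than one run of consecutive "Y" jobs.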
So for this example, the correct answer would be something like:
   ID V1
1:  1 11
2:  2  7
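Currently I use rle to find the run lengths of "Y" in the Outsourced column, and then I try to do the rest with ifs, but I think this could be done more elegantly...
Thanks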
Answer 0 (score: 1)
Following the suggestion from @docendo discimus above, I managed to get what I wanted by adding a "unique" statement:
DT[, NewCol := sum(Duration), by = list(ID, rleid(Outsourced))][Outsourced == "N", NewCol := NA]
DT[!is.na(NewCol), unique(NewCol), ID]
Edit: to cover cases where there are several outsourced activities with the same duration, the second statement should be changed to:
DT[!is.na(NewCol), sum(rle(NewCol)$values), ID]
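For comparison, here is a minimal sketch of a variant that labels each run explicitly before summing, so that two separate "Y" runs in the same ID stay separate even when their sums happen to be equal. It is not the method from the answer above. The column name run and the result name res are assumptions introduced for illustration, and the example table is rebuilt with character columns instead of factors for brevity.

```r
library(data.table)

# Rebuild the example table from the question (same values as the dput output,
# but Job and Outsourced are stored as character instead of factor).
DT <- data.table(
  ID         = c(rep(1L, 8), rep(2L, 6)),
  Job        = c("h1", "h2", "h3", "f1", "f2", "f3", "f4", "f5",
                 "h1", "h2", "f1", "f2", "f3", "f4"),
  Duration   = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L),
  Outsourced = c("N", "Y", "Y", "Y", "N", "N", "N", "N",
                 "N", "Y", "Y", "N", "N", "N")
)

# Label each run; a new run starts whenever ID or Outsourced changes.
# "run" is a hypothetical helper column, not part of the original data.
DT[, run := rleid(ID, Outsourced)]

# One row per run of "Y": its ID and the total Duration of that run.
res <- DT[Outsourced == "Y", .(V1 = sum(Duration)), by = .(ID, run)]
res[, run := NULL]  # drop the helper label from the result

print(res)  # ID 1 has V1 = 11, ID 2 has V1 = 7
```

Because the grouping is on the run label rather than on the summed value, this yields one row per run. If a single total per ID is wanted instead, the per-run rows could be summed again by ID.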