How to return NA for missing combinations when using group by in data.table

Time: 2017-02-12 07:56:51

Tags: r data.table

I have a data.table like this:

library(data.table)
tt1 <- structure(list(start = c(3, 4, 4, 4, 22, 4, 16), 
                      end = c(5, 40,40, 40, 25, 40, 18), 
                      u = c(1L, 2L, 2L, 2L, 3L, 2L, 4L), 
                      duration = c(2, 36, 36, 36, 3, 36, 2), 
                      i.start = c(3, 3, 29, 20, 20, 14, 14), 
                      i.end = c(5, 5, 31, 22, 22, 16, 16), 
                      q = c(7L, 7L, 8L, 9L, 1L, 10L, 10L), 
                      i.duration = c(2, 2, 2, 2, 2, 2, 2)), row.names = c(NA,-7L),
                 class = c("data.table", "data.frame"), 
                 .Names = c("start", "end", "u", "duration", "i.start", "i.end", "q", "i.duration"))

setDT(tt1)
> tt1
   start end u duration i.start i.end  q i.duration
1:     3   5 1        2       3     5  7          2
2:     4  40 2       36       3     5  7          2
3:     4  40 2       36      29    31  8          2
4:     4  40 2       36      20    22  9          2
5:    22  25 3        3      20    22  1          2
6:     4  40 2       36      14    16 10          2
7:    16  18 4        2      14    16 10          2

For each combination of (i.start, i.end), I want to keep only the rows with duration <= 2 and aggregate them by group. I was able to do this with:

> tt1[duration<=2, mean(duration), by =c("i.start","i.end"),nomatch=NA]
   i.start i.end V1
1:       3     5  2
2:      14    16  2

However, I also want the (i.start, i.end) groups that only have duration > 2 to be returned, with NA, alongside the results above.

How can I do this?

1 Answer:

Answer 0: (score: 2)

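If you want to keep all the groups, you probably need to subset within each group, rather than in the i expression as you do now. Rows filtered out in i are removed before the grouping happens, so a group with no remaining rows never appears in the result.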
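To illustrate the difference, here is a minimal sketch on a made-up table. The names `dt`, `g` and `x` are illustrative assumptions and are not part of the question's data.

```r
library(data.table)

# Toy table (made up for illustration): group "b" has no row with x <= 2
dt <- data.table(g = c("a", "a", "b"), x = c(1, 5, 9))

# Filtering in i drops rows first, so group "b" never reaches `by`
in_i <- dt[x <= 2, .(m = mean(x)), by = g]

# Filtering inside j is evaluated per group, so every group is kept
in_j <- dt[, .(m = mean(x[x <= 2])), by = g]

nrow(in_i)  # 1
nrow(in_j)  # 2
```

The second form returns one row per group regardless of how many rows pass the condition.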

You could do either

tt1[, mean(duration[duration <= 2]), by = .(i.start, i.end)]
#    i.start i.end  V1
# 1:       3     5   2
# 2:      29    31 NaN
# 3:      20    22 NaN
# 4:      14    16   2
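The NaN values here come from taking the mean of an empty vector: when no row in a group satisfies duration <= 2, the subset has length zero. A small base R sketch of that behaviour, using a made-up vector `d` that stands in for one such group:

```r
# Hypothetical durations for one group in which nothing passes the filter
d <- c(36, 3)

kept <- d[d <= 2]   # the filter removes every element
length(kept)        # 0

m <- mean(kept)     # mean of an empty numeric vector
is.nan(m)           # TRUE
is.na(m)            # TRUE as well, because is.na() also flags NaN
```

Since `is.na()` is TRUE for NaN too, code that only tests for missingness will treat NaN and NA alike; the distinction matters mainly for printing or for `is.nan()` checks.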

Or combine it with an if/else statement

tt1[, if(any(duration <= 2)) mean(duration[duration <= 2]) else NA_real_, by = .(i.start, i.end)]
#    i.start i.end V1
# 1:       3     5  2
# 2:      29    31 NA
# 3:      20    22 NA
# 4:      14    16  2

Another (odd) way to achieve this is to first compute only the means you need, and then join the result back to all the possible groups

res <- tt1[duration <= 2, mean(duration), keyby = .(i.start, i.end)]
res[unique(tt1[, .(i.start, i.end)]), on = .(i.start, i.end)]
#    i.start i.end V1
# 1:       3     5  2
# 2:      29    31 NA
# 3:      20    22 NA
# 4:      14    16  2

Or similarly

tt1[duration <= 2][unique(tt1[, .(i.start, i.end)]), on=.(i.start, i.end), 
  mean(duration), by=.EACHI]
#    i.start i.end V1
# 1:       3     5  2
# 2:      29    31 NA
# 3:      20    22 NA
# 4:      14    16  2
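As a quick sanity check, the sketch below rebuilds only the columns of tt1 that the aggregation uses and compares the if/else approach with the join approach. Naming the result column `V1` explicitly and the helper names `a` and `b` are choices made for this illustration.

```r
library(data.table)

# Same values as the question's tt1, restricted to the columns used here
tt1 <- data.table(
  duration = c(2, 36, 36, 36, 3, 36, 2),
  i.start  = c(3, 3, 29, 20, 20, 14, 14),
  i.end    = c(5, 5, 31, 22, 22, 16, 16)
)

# Approach 1: subset inside each group, NA when nothing qualifies
a <- tt1[, .(V1 = if (any(duration <= 2)) mean(duration[duration <= 2]) else NA_real_),
         by = .(i.start, i.end)]

# Approach 2: aggregate the filtered rows, then join onto all groups
res <- tt1[duration <= 2, .(V1 = mean(duration)), keyby = .(i.start, i.end)]
b <- res[unique(tt1[, .(i.start, i.end)]), on = .(i.start, i.end)]

# Both keep the groups in order of first appearance in tt1
isTRUE(all.equal(a$V1, b$V1))  # TRUE
```

Comparing the value columns directly avoids differences in attributes such as keys between the two results.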