Subset a data.table based on observations that occur x times in a column

Date: 2019-01-11 15:42:12

Tags: r data.table

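Given data similar to this:

dt <- data.table(id = c("a","a","b","b","b","c","c","c","c","d","d","d","d","d"),
                 quantity = c(6,6,7,7,7,8,8,1,1,9,9,9,2,2))
threshold <- 3

    id quantity
 1:  a        6
 2:  a        6
 3:  b        7
 4:  b        7
 5:  b        7
 6:  c        8
 7:  c        8
 8:  c        1
 9:  c        1
10:  d        9
11:  d        9
12:  d        9
13:  d        2
14:  d        2

I want to subset it in two ways:

First subset: keep all rows of every id for which some quantity value occurs at least threshold times (3 times per id). The output should look like this:

   id quantity
1:  b        7
2:  b        7
3:  b        7
4:  d        9
5:  d        9
6:  d        9
7:  d        2
8:  d        2

Second subset: keep only the rows whose quantity value occurs at least threshold times (3 times) within its id. The output should look like this:

   id quantity
1:  b        7
2:  b        7
3:  b        7
4:  d        9
5:  d        9
6:  d        9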

Thank you very much.

3 answers:

Answer 0 (score: 3)

# normally I'd use .SD, not .I, but you don't have anything else in your table
second = dt[, if (.N >= threshold) .I, by = .(id, quantity)][, -"V1"]

first = dt[unique(second$id), on = 'id']
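
The comment above mentions .SD. As a rough sketch of what that variant might look like, assume the table has one more column (the `obs` column and the names `dt2`, `second2` and `first2` are hypothetical, added only for illustration). Returning .SD for groups that reach the threshold keeps the remaining columns of the matching rows:

```r
library(data.table)

# Same data as in the question, plus a hypothetical extra column `obs`
dt2 <- data.table(
  id       = c("a","a","b","b","b","c","c","c","c","d","d","d","d","d"),
  quantity = c(6,6,7,7,7,8,8,1,1,9,9,9,2,2),
  obs      = 1:14
)
threshold <- 3

# Second subset: keep only (id, quantity) groups with at least `threshold` rows;
# groups for which the `if` yields NULL are dropped from the result
second2 <- dt2[, if (.N >= threshold) .SD, by = .(id, quantity)]

# First subset: keep every row of any id that appears in second2
first2 <- dt2[id %in% second2$id]
```

With the question's data this should leave 6 rows in second2 and 8 rows in first2.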

Answer 1 (score: 3)

For the first subset, you can do the following:

dt[id %in% dt[, .N, by = .(id, quantity)][N >= threshold, unique(id)]]

which gives:

   id quantity
1:  b        7
2:  b        7
3:  b        7
4:  d        9
5:  d        9
6:  d        9
7:  d        2
8:  d        2

For the second subset:

dt[dt[, .N, by = .(id, quantity)][N >= threshold, .(id, quantity)]
   , on = .(id, quantity)]

which gives:

   id quantity
1:  b        7
2:  b        7
3:  b        7
4:  d        9
5:  d        9
6:  d        9
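
Both subsets rely on the same per-(id, quantity) count, so one possible variation (a sketch, not part of the answer above) is to store that count once in a helper column and filter on it. The names `cnt` and `n` are made up for this example:

```r
library(data.table)

dt <- data.table(
  id       = c("a","a","b","b","b","c","c","c","c","d","d","d","d","d"),
  quantity = c(6,6,7,7,7,8,8,1,1,9,9,9,2,2)
)
threshold <- 3

# Work on a copy so the original table is left untouched,
# and add the size of each (id, quantity) group as column `n`
cnt <- copy(dt)[, n := .N, by = .(id, quantity)]

# Second subset: rows whose (id, quantity) combination is frequent enough
second <- cnt[n >= threshold, .(id, quantity)]

# First subset: all rows of any id that has at least one frequent quantity
first <- cnt[, if (any(n >= threshold)) .SD, by = id][, .(id, quantity)]
```

This trades a temporary column for not having to compute the grouped count twice.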

Answer 2 (score: 2)

Using base::rle():

First subset:

dt[, .SD[max(rle(quantity)[["lengths"]]) >= threshold], id]

   id quantity
1:  b        7
2:  b        7
3:  b        7
4:  d        9
5:  d        9
6:  d        9
7:  d        2
8:  d        2

Second subset:

dt[,{
      tmp <- rle(quantity)
      ind <- tmp[["lengths"]] >= threshold
      rep(tmp[["values"]][ind], tmp[["lengths"]][ind])
    }, 
   by = id]


   id V1
1:  b  7
2:  b  7
3:  b  7
4:  d  9
5:  d  9
6:  d  9
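
One thing to keep in mind with rle() is that it measures runs of consecutive equal values, so the approach above works on the question's data because equal quantities sit next to each other within each id. The following sketch uses a small made-up table (`dt3`, not from the question) to show how the results can differ when they do not:

```r
library(data.table)

# Hypothetical data where equal quantities are NOT adjacent within the id
dt3 <- data.table(id = c("e", "e", "e", "e"), quantity = c(7, 1, 7, 7))
threshold <- 3

# rle() sees runs of length 1, 1 and 2, so no run reaches the threshold
runs_unsorted <- dt3[, max(rle(quantity)[["lengths"]]), by = id]$V1

# Sorting within each group first makes equal values adjacent
runs_sorted <- dt3[, max(rle(sort(quantity))[["lengths"]]), by = id]$V1

# Counting per (id, quantity) does not depend on row order at all
counts <- dt3[, .N, by = .(id, quantity)]
```

If the rows are not already ordered, sorting by id and quantity beforehand, or using a count-based approach, avoids this pitfall.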