Filter out rows in each group that do not meet a condition

Time: 2015-12-09 15:49:54

Tags: r data.table

Here is the code used for this question:

library(data.table)
set.seed(1337)
myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], 500, TRUE),
                   Data = sample(1:26, 500, replace = TRUE))
setkey(myDT, Key1, Key2)
# showing what myDT looks like
> myDT
     Key1 Key2 Data
  1:    a    A    6
  2:    a    A    3
  3:    a    B    2
  4:    a    B   20
  5:    a    B   13
 ---               
496:    z    D   23
497:    z    E    3
498:    z    E   18
499:    z    E   11
500:    z    E    2

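I want to subset myDT so that I keep only the maximum Data value for each Key1, Key2 pair. For example, writing a pair as (Key1, Key2): for (a, A) I want to drop the row where Data is 3 and keep the row where Data is 6. For (z, E) I want to keep only the row where Data is 18.

While typing up this question I found a solution (which I will post below), but please let me know how you would approach this problem.

3 answers:

Answer 0 (score: 5)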

My answer

myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)]
# if you are on 1.9.6 or lower use this one
myDT[order(-Data), .SD[1], by = .(Key1, Key2)]

Or from comments

unique(myDT[order(-Data)], by = c("Key1", "Key2"))
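
The variants above all sort by Data in descending order and then keep the first row of each (Key1, Key2) group. As a minimal sketch, the table `toy` below is a hypothetical five-row example (not data from the question) that is small enough to check by eye, and the plain `max()` aggregation is included only as a cross-check:

```r
library(data.table)

# Hypothetical toy table (an assumption for illustration, not the question's data)
toy <- data.table(Key1 = c("a", "a", "a", "z", "z"),
                  Key2 = c("A", "A", "B", "E", "E"),
                  Data = c(6L, 3L, 20L, 3L, 18L))

# Sort descending by Data, then keep the first row of every group
res_sd <- toy[order(-Data), .SD[1], by = .(Key1, Key2)]

# Same idea: after sorting, unique() keeps the first row per key combination
res_unique <- unique(toy[order(-Data)], by = c("Key1", "Key2"))

# Plain aggregation, used here only as a cross-check
res_max <- toy[, .(Data = max(Data)), by = .(Key1, Key2)]

# Put all three results in the same row order before comparing them
setorder(res_sd, Key1, Key2)
setorder(res_unique, Key1, Key2)
setorder(res_max, Key1, Key2)
print(res_max)
```

On this toy table all three results should contain the rows (a, A, 6), (a, B, 20) and (z, E, 18). Note that the sort-based variants return groups in order of descending Data unless they are re-sorted or `keyby` is used.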

Benchmark on 50M rows.

library(dplyr)
library(data.table)
library(microbenchmark)
set.seed(1337)
n = 5e7
myDT <- data.table(Key1 = sample(letters, n, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], n, TRUE),
                   Data = sample(1:26, n, replace = TRUE))
setkey(myDT, Key1, Key2)

microbenchmark(times = 10L,
               CathG = myDT[, .SD[which.max(Data)], by = .(Key1, Key2)],
               jangorecki = myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)],
               jangorecki.keeporder = myDT[order(-Data), head(.SD, 1), keyby = .(Key1, Key2)],
               nist = myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data)),
               David = unique(myDT[order(-Data)], by = c("Key1", "Key2")))

#Unit: milliseconds
#                 expr       min        lq      mean   median        uq       max neval
#                CathG  659.6150  689.3035  733.9177  739.795  780.0075  811.1456    10
#           jangorecki 2844.7565 3026.3385 3089.6764 3097.332 3219.1951 3343.9919    10
# jangorecki.keeporder 2935.3733 3194.1606 3232.9297 3214.581 3308.0735 3411.4319    10
#                 nist  803.1921  844.5002 1011.7878 1007.755 1188.6127 1228.3869    10
#                David 3410.4853 3501.5918 3590.2382 3590.190 3652.8091 3803.9038    10

The previously posted benchmark on small data showed very different results, so I would say performance depends heavily on the data: not just its volume but also its cardinality (the number of unique values), which may matter even more in some cases.

Answer 1 (score: 4)

Another approach, based on this Q, is:
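
To explore the cardinality effect yourself, one option is to time the same expression on two tables of equal size that differ only in the number of groups. This is a rough sketch, not part of the original benchmark: `make_dt()` is a hypothetical helper, the sizes are deliberately small so that it runs quickly, and the timings will vary by machine and data.table version.

```r
library(data.table)

# make_dt() is a hypothetical helper, not part of the original benchmark.
# It builds n rows spread over (at most) n_groups distinct values of Key1.
make_dt <- function(n, n_groups) {
  data.table(Key1 = sample(n_groups, n, replace = TRUE),
             Data = sample(1:26, n, replace = TRUE))
}

set.seed(1337)
n <- 1e5                            # kept small so the sketch runs quickly
few_groups  <- make_dt(n, 10)       # low cardinality
many_groups <- make_dt(n, 5000)     # high cardinality

# Time the same grouped expression on both tables
t_few  <- system.time(few_groups[, .SD[which.max(Data)], by = Key1])
t_many <- system.time(many_groups[, .SD[which.max(Data)], by = Key1])

# Elapsed seconds for each case
print(c(few = t_few[["elapsed"]], many = t_many[["elapsed"]]))
```

For serious comparisons, `microbenchmark` with several repetitions, as used above, is the better tool; `system.time` is used here only to keep the sketch dependency-free.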

 myDT[, .SD[which.max(Data)], by = .(Key1, Key2)]
 #    Key1 Key2 Data
 # 1:    a    A    6
 # 2:    a    B   20
 # 3:    a    C   25
 # 4:    a    E    7
 # 5:    b    A   25
 #---               
#119:    z    A   23
#120:    z    B   26
#121:    z    C   24
#122:    z    D   25
#123:    z    E   18
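
A commonly used variation on this idea, not taken from this answer, is to collect row numbers with `.I` and subset once at the end instead of subsetting `.SD` inside every group. The sketch below uses a hypothetical toy table with an extra column `Extra` (an assumption for illustration) to show that both forms select the same rows:

```r
library(data.table)

# Hypothetical toy table (an assumption for illustration, not the question's data)
toy <- data.table(Key1  = c("a", "a", "a", "z", "z"),
                  Key2  = c("A", "A", "B", "E", "E"),
                  Data  = c(6L, 3L, 20L, 3L, 18L),
                  Extra = c("p", "q", "r", "s", "t"))

# .I holds the row numbers of the current group within the whole table,
# so this collects one row number per (Key1, Key2) group
idx <- toy[, .I[which.max(Data)], by = .(Key1, Key2)]$V1

# Subset once using the collected row numbers
res_I <- toy[idx]

# The .SD form from the answer, for comparison
res_SD <- toy[, .SD[which.max(Data)], by = .(Key1, Key2)]

print(res_I)
```

Whether this is faster than the `.SD` form depends on the data and the data.table version, so it is worth benchmarking on your own table.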

Answer 2 (score: 2)

Using dplyr

A faster and nicer way to solve the problem:
library(dplyr)
myDT %>% group_by(Key1, Key2) %>% summarise(Data = max(Data))

To keep all existing columns in the data, you can use slice instead of summarise:

myDT %>% group_by(Key1,Key2) %>% slice(which.max(Data))

Note that this returns one row per group; in case of a tie, it is the first row that holds the maximum value of column Data.
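
The tie behaviour can be seen on a small example. The sketch below uses a hypothetical data frame `toy` with a tie in group (a, A) and a helper column `Row` (both are assumptions for illustration). The `filter(Data == max(Data))` form is shown as one possible alternative if you would rather keep every tied row:

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration): group (a, A) has a tie at 6
toy <- data.frame(Key1 = c("a", "a", "a", "z"),
                  Key2 = c("A", "A", "A", "E"),
                  Data = c(6L, 6L, 3L, 18L),
                  Row  = 1:4)

# which.max() returns the position of the FIRST maximum, so one row per group
first_max <- toy %>%
  group_by(Key1, Key2) %>%
  slice(which.max(Data)) %>%
  ungroup()

# Comparing against max() keeps every row that ties for the maximum
all_max <- toy %>%
  group_by(Key1, Key2) %>%
  filter(Data == max(Data)) %>%
  ungroup()

print(first_max$Row)
print(all_max$Row)
```

Here `first_max` keeps original rows 1 and 4, while `all_max` keeps rows 1, 2 and 4.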