以下是用于此问题的代码:
set.seed(1337)
myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE),
Key2 = sample(LETTERS[1:5], 500, TRUE),
Data = sample(1:26, 500, replace = TRUE))
setkey(myDT, Key1, Key2)
# showing what myDT looks like
> myDT
Key1 Key2 Data
1: a A 6
2: a A 3
3: a B 2
4: a B 20
5: a B 13
---
496: z D 23
497: z E 3
498: z E 18
499: z E 11
500: z E 2
我想将myDT
配对,只获取每个Key1,Key2对的最大数据值。例如。 (使用(Key1,Key2)表示一对)(a,A)我想摆脱Data为3的行并保持Data为6的行。对于(z,E)我想只保留数据为18的行。
在输入这个问题时,我找到了一个解决方案(我将在下面发布),但请帮助我知道如何解决这个问题。
答案 0 :(得分:5)
My answer
myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)]
# if you are on 1.9.6 or lower use this one
myDT[order(-Data), .SD[1], by = .(Key1, Key2)]
Or from comments
unique(myDT[order(-Data)], by = c("Key1", "Key2"))
Benchmark on 50M rows.
library(dplyr)
library(data.table)
library(microbenchmark)
set.seed(1337)
n = 5e7
myDT <- data.table(Key1 = sample(letters, n, replace = TRUE),
Key2 = sample(LETTERS[1:5], n, TRUE),
Data = sample(1:26, n, replace = TRUE))
setkey(myDT, Key1, Key2)
microbenchmark(times = 10L,
CathG = myDT[, .SD[which.max(Data)], by = .(Key1, Key2)],
jangorecki = myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)],
jangorecki.keeporder = myDT[order(-Data), head(.SD, 1), keyby = .(Key1, Key2)],
nist = myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data)),
David = unique(myDT[order(-Data)], by = c("Key1", "Key2")))
#Unit: milliseconds
# expr min lq mean median uq max neval
# CathG 659.6150 689.3035 733.9177 739.795 780.0075 811.1456 10
# jangorecki 2844.7565 3026.3385 3089.6764 3097.332 3219.1951 3343.9919 10
# jangorecki.keeporder 2935.3733 3194.1606 3232.9297 3214.581 3308.0735 3411.4319 10
# nist 803.1921 844.5002 1011.7878 1007.755 1188.6127 1228.3869 10
# David 3410.4853 3501.5918 3590.2382 3590.190 3652.8091 3803.9038 10
Previously posted benchmark on small data shows much different results, so I would say it heavily depends on data, not just volume but also cardinality (count of unique values) - maybe even more in some cases.
答案 1 :(得分:4)
另一种基于this Q的方法是:
myDT[, .SD[which.max(Data)], by = .(Key1, Key2)]
# Key1 Key2 Data
# 1: a A 6
# 2: a B 20
# 3: a C 25
# 4: a E 7
# 5: b A 25
#---
#119: z A 23
#120: z B 26
#121: z C 24
#122: z D 25
#123: z E 18
答案 2 :(得分:2)
使用dplyr
解决问题的更快更好的方法myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data))
要保留数据中的所有现有列,您可以使用slice
代替summarise
:
myDT %>% group_by(Key1,Key2) %>% slice(which.max(Data))
请注意,这将为每个组返回1行,如果是tie,则它将是列Data
的第一个最大行。