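I want to get the unique rows from a data.table, given a subset of columns and a condition in i. What is the best way to do this? ("Best" in terms of computation speed and short or readable syntax.)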
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))
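Say I want to find the unique combinations of c1 and c2 where c4 is 10 (the code below filters with c4 >= 10, which is equivalent here because c4 never exceeds 10). I can think of a few ways to do it, but I don't know which is best. Whether the columns to extract are keyed may also matter.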
## works but gives an extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)]
## this removes extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)][,V1 := NULL]
## this seems like it could work
## but no j-expression with a keyby throws an error
jk[c4 >= 10, , keyby = list(c1,c2)]
## using unique with .SD
jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")]
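On the remark about keys, here is a minimal sketch of one way to check whether keying changes the result. It assumes a keyed copy of the table; the names jk_keyed, res_plain and res_keyed are made up for illustration, while copy, setkey and fsetequal are data.table functions.

```r
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))

# jk_keyed is a hypothetical name: a copy of jk sorted and keyed by c1, c2
jk_keyed <- copy(jk)
setkey(jk_keyed, c1, c2)

# Apply the same filter and column selection to both tables
res_plain <- unique(jk[c4 >= 10, list(c1, c2)])
res_keyed <- unique(jk_keyed[c4 >= 10, list(c1, c2)])

# TRUE when both hold the same rows, regardless of row order
fsetequal(res_plain, res_keyed)
```

The keyed table returns its rows in key order, so the two results can differ in row order while containing the same combinations. Whether the key also helps with speed would have to be measured separately.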
Answer 0: (score: 4)
For me at least, the simplest is unique(jk[c4 >= 10, list(c1, c2)]), as suggested by @Justin, or unique(jk[c4 >= 10, c("c1", "c2"), with = FALSE]). The latter is by far the fastest of the four suggestions, at least on my laptop:
library(microbenchmark)

microbenchmark(
  a = jk[c4 >= 10, list(c1, c2), keyby = list(c1, c2)][, c("c1", "c2"), with = FALSE],
  b = jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")],
  c = unique(jk[c4 >= 10, list(c1, c2)]),
  d = unique(jk[c4 >= 10, c("c1", "c2"), with = FALSE])
)
Unit: microseconds
 expr      min       lq    median        uq      max neval
    a 1378.742 1456.676 1494.9380 1531.1395 2515.796   100
    b  906.404  943.072  963.7790  997.4930 3805.846   100
    c 1167.125 1201.988 1232.3500 1272.2250 2077.047   100
    d  627.768  653.314  669.8625  683.8045  739.808   100
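Timings only mean something if the expressions return the same rows, so a quick consistency check can help. The sketch below compares variants b, c and d from the benchmark. The names res_b, res_c and res_d are made up for illustration, and fsetequal is data.table's set-equality helper, which ignores row order.

```r
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))

# Variants b, c and d from the benchmark (res_* names are hypothetical)
res_b <- jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")]
res_c <- unique(jk[c4 >= 10, list(c1, c2)])
res_d <- unique(jk[c4 >= 10, c("c1", "c2"), with = FALSE])

# Each comparison is TRUE when the two tables hold the same rows
fsetequal(res_b, res_c)
fsetequal(res_c, res_d)
```

With only 60 rows the absolute timings are tiny, so it may be worth repeating the benchmark on a larger table before concluding which variant is fastest.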