Filtering a data.table by a per-group quantity

Time: 2015-07-28 04:17:31

Tags: r data.table

Suppose I have a data.table like this:

library(data.table)

sample <- data.table(id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                     name = c("apple", "apple", "orange", "orange",
                              "pear", "pear", "pear", "banana", "banana"),
                     atr = c("pretty", "ugly", "bruised", "delicious",
                             "pear-shaped", "bruised", "infested",
                             "too-ripe", "perfect"),
                     N = c(10, 9, 15, 4, 5, 7, 7, 4, 12))

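I basically want to return unique(sample[, list(id, name)]), except that I also want an atr column holding the atr value of the row with the largest N in each group. If there is a tie for the highest N, I don't care which of the tied values is chosen, but I only want one.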

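This almost works: merge(sample[, list(N = max(N)), by = list(id, name)], sample, by = c("id", "name", "N")). However, pear has two atr values tied for the maximum N, so it returns two pear rows. Besides not giving the desired result, I also assume (or hope) there is a way to do this that doesn't involve a join.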
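To make the problem concrete, here is a minimal sketch of the join approach from the question, using the same data. The names max_n, joined and rows_per_group are only illustrative and do not appear in the original post.

```r
library(data.table)

# Same data as in the question.
sample <- data.table(id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                     name = c("apple", "apple", "orange", "orange",
                              "pear", "pear", "pear", "banana", "banana"),
                     atr = c("pretty", "ugly", "bruised", "delicious",
                             "pear-shaped", "bruised", "infested",
                             "too-ripe", "perfect"),
                     N = c(10, 9, 15, 4, 5, 7, 7, 4, 12))

# Per-group maximum of N, joined back to the full table on id, name and N.
max_n <- sample[, list(N = max(N)), by = list(id, name)]
joined <- merge(max_n, sample, by = c("id", "name", "N"))

# Count the rows returned per group: pear shows up twice because two of
# its rows share the maximum N of 7.
rows_per_group <- joined[, .N, by = list(id, name)]
print(rows_per_group)
```

Every row that matches the group maximum survives the join, which is why a tie produces duplicates.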

2 answers:

Answer 0 (score: 4)

You can use atr[N == max(N)][1] to return only the first of the tied values, like this:

library(data.table)

sample[, .(atr = atr[N == max(N)][1]), by = .(id, name)]
#    id   name     atr
# 1:  1  apple  pretty
# 2:  2 orange bruised
# 3:  3   pear bruised
# 4:  4 banana perfect

Note: as Frank points out, atr[N == max(N)][1] is the same as simply writing atr[which.max(N)].
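As a quick check of that note, the sketch below runs both forms on the question's data and compares the resulting atr columns. The names first_of_ties and with_which_max are illustrative only, and the comparison assumes N contains no missing values, as in this example.

```r
library(data.table)

# Same data as in the question.
sample <- data.table(id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                     name = c("apple", "apple", "orange", "orange",
                              "pear", "pear", "pear", "banana", "banana"),
                     atr = c("pretty", "ugly", "bruised", "delicious",
                             "pear-shaped", "bruised", "infested",
                             "too-ripe", "perfect"),
                     N = c(10, 9, 15, 4, 5, 7, 7, 4, 12))

# Form used in the answer: all atr values at the maximum, then the first one.
first_of_ties <- sample[, .(atr = atr[N == max(N)][1]), by = .(id, name)]

# Shorter form: which.max() returns the index of the first maximum.
with_which_max <- sample[, .(atr = atr[which.max(N)]), by = .(id, name)]

# Compare the selected atr values of the two results.
same_atr <- identical(first_of_ties$atr, with_which_max$atr)
print(same_atr)
```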

Answer 1 (score: 3)

I would just use order:

> unique(sample[order(-N), .(id, name, atr)], by = c("id", "name"))
   id   name     atr
1:  2 orange bruised
2:  4 banana perfect
3:  1  apple  pretty
4:  3   pear bruised

If you want to preserve the overall ordering by group, use order(id, name, -N) instead.
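A minimal sketch of that variant on the question's data follows. The name result is illustrative only. Sorting by id and name first keeps the groups in their original order, while -N puts each group's largest N at the top so that unique keeps it.

```r
library(data.table)

# Same data as in the question.
sample <- data.table(id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                     name = c("apple", "apple", "orange", "orange",
                              "pear", "pear", "pear", "banana", "banana"),
                     atr = c("pretty", "ugly", "bruised", "delicious",
                             "pear-shaped", "bruised", "infested",
                             "too-ripe", "perfect"),
                     N = c(10, 9, 15, 4, 5, 7, 7, 4, 12))

# Sort by group, then by descending N within each group, and keep the
# first row of every (id, name) pair.
result <- unique(sample[order(id, name, -N), .(id, name, atr)],
                 by = c("id", "name"))
print(result)
```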

You can also split it into two lines:

setorder(sample, -N) #done by reference, as with all set* functions in data.table
unique(sample[ , .(id, name, atr)], by = c("id", "name"))

Or, perhaps better, depending on your end goal:

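setkey(setorder(sample, -N), id, name)
# by is given explicitly: since data.table 1.9.8 unique() no longer defaults to the key
unique(sample[ , .(id, name, atr)], by = c("id", "name"))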

(Note: the order of the two calls is crucial. setorder resets the key to NULL, so it has to run first and setkey second.)
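The sketch below is one way to inspect this on a copy of the question's data. The names dt, top and reversed are illustrative only. It prints the key after each call order rather than assuming the outcome of the reversed order, and it passes by explicitly to unique so that it does not rely on the key-based default of older data.table versions.

```r
library(data.table)

# Same data as in the question, under an illustrative name.
dt <- data.table(id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                 name = c("apple", "apple", "orange", "orange",
                          "pear", "pear", "pear", "banana", "banana"),
                 atr = c("pretty", "ugly", "bruised", "delicious",
                         "pear-shaped", "bruised", "infested",
                         "too-ripe", "perfect"),
                 N = c(10, 9, 15, 4, 5, 7, 7, 4, 12))

# Order used in the answer: sort by descending N first, then set the key.
setkey(setorder(dt, -N), id, name)
print(key(dt))

# One row per (id, name) group.
top <- unique(dt[, .(id, name, atr)], by = c("id", "name"))
print(top)

# Reversed call order on a separate copy, to see what happens to the key.
reversed <- copy(dt)
setkey(reversed, id, name)
setorder(reversed, -N)
print(key(reversed))
```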