Remove entries from a data frame when two columns are duplicated, keeping the entry with the highest value in a third column

Time: 2017-06-28 01:35:16

Tags: r dataframe filter subset

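I have a data frame in which some entries share the same ID and TYPE values, although ID + TYPE is supposed to be my unique key. I need to filter out the entries with duplicated ID and TYPE values, using a third column to decide which entry to keep and which to discard.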

The data frame looks like this:

mydf <- data.frame(ID=c('A1','B6','C3','C3','E8','D4','G1','B6','C1','C1'),
                     TYPE=c('class','genus','order','order','class','genus','species','genus','family','order'),
                     STRING=c('a;a;a','b;b','c;c;c;c','c;c;c','e;e;e;e','d;d','g;g;g','b;b;b;b;b','c;c;c;c','c;c'),
                     VALUE=c(34,435,876,23,5,7,77,42,233,500))


mydf
   ID    TYPE    STRING VALUE
1  A1   class     a;a;a    34
2  B6   genus       b;b   435
3  C3   order   c;c;c;c   876
4  C3   order     c;c;c    23
5  E8   class   e;e;e;e     5
6  D4   genus       d;d     7
7  G1 species     g;g;g    77
8  B6   genus b;b;b;b;b    42
9  C1  family   c;c;c;c   233
10 C1   order       c;c   500

So the entries for C3 + order and B6 + genus are duplicated. I would like to test two ways of choosing which one to keep:

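1- The one of the two (or more, in my real data frame) with the highest VALUE

2- The one of the two (or more) with the fewest ;-separated elements in STRING (not necessarily the shortest nchar)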

With 1, I should obtain the following (without entries 4 and 8):

mydf
   ID    TYPE    STRING VALUE
1  A1   class     a;a;a    34
2  B6   genus       b;b   435
3  C3   order   c;c;c;c   876
5  E8   class   e;e;e;e     5
6  D4   genus       d;d     7
7  G1 species     g;g;g    77
9  C1  family   c;c;c;c   233
10 C1   order       c;c   500

With 2, I should obtain the following (without entries 3 and 8):

mydf
   ID    TYPE    STRING VALUE
1  A1   class     a;a;a    34
2  B6   genus       b;b   435
4  C3   order     c;c;c    23
5  E8   class   e;e;e;e     5
6  D4   genus       d;d     7
7  G1 species     g;g;g    77
9  C1  family   c;c;c;c   233
10 C1   order       c;c   500

Any clues on how to obtain these subsets and filter out those entries? Thanks a lot!

2 Answers:

Answer 0: (score: 3)

With dplyr, you can do the following:

library(dplyr)
mydf %>% group_by(ID, TYPE) %>% filter(VALUE == max(VALUE))

# A tibble: 8 x 4
# Groups:   ID, TYPE [8]
#      ID    TYPE  STRING VALUE
#  <fctr>  <fctr>  <fctr> <dbl>
#1     A1   class   a;a;a    34
#2     B6   genus     b;b   435
#3     C3   order c;c;c;c   876
#4     E8   class e;e;e;e     5
#5     D4   genus     d;d     7
#6     G1 species   g;g;g    77
#7     C1  family c;c;c;c   233
#8     C1   order     c;c   500
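
One caveat worth noting: `filter(VALUE == max(VALUE))` keeps every row that ties for the group maximum, so a key can still appear more than once. The sketch below assumes a hypothetical extra row (not part of the original data) that ties with C3/order, and uses `slice(which.max(VALUE))` as one possible way to keep exactly one row per key.

```r
library(dplyr)

mydf <- data.frame(
  ID     = c("A1", "B6", "C3", "C3", "E8", "D4", "G1", "B6", "C1", "C1"),
  TYPE   = c("class", "genus", "order", "order", "class", "genus",
             "species", "genus", "family", "order"),
  STRING = c("a;a;a", "b;b", "c;c;c;c", "c;c;c", "e;e;e;e", "d;d",
             "g;g;g", "b;b;b;b;b", "c;c;c;c", "c;c"),
  VALUE  = c(34, 435, 876, 23, 5, 7, 77, 42, 233, 500),
  stringsAsFactors = FALSE
)

# Hypothetical row, added only for illustration: it ties with row 3 on VALUE.
tied <- rbind(
  mydf,
  data.frame(ID = "C3", TYPE = "order", STRING = "c", VALUE = 876,
             stringsAsFactors = FALSE)
)

# The equality filter keeps both tied rows for C3/order.
with_ties <- tied %>%
  group_by(ID, TYPE) %>%
  filter(VALUE == max(VALUE)) %>%
  ungroup()

# which.max() returns the first maximum only, so each key appears once.
one_per_key <- tied %>%
  group_by(ID, TYPE) %>%
  slice(which.max(VALUE)) %>%
  ungroup()

nrow(with_ties)    # 9
nrow(one_per_key)  # 8
```

Note that `slice()` returns the rows ordered by group, so the row order can differ from the input.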

library(stringr)
mydf %>% 
    group_by(ID, TYPE) %>% 
    filter(str_count(STRING, ";") == min(str_count(STRING, ";")))

# A tibble: 8 x 4
# Groups:   ID, TYPE [8]
#      ID    TYPE  STRING VALUE
#  <fctr>  <fctr>  <fctr> <dbl>
#1     A1   class   a;a;a    34
#2     B6   genus     b;b   435
#3     C3   order   c;c;c    23
#4     E8   class e;e;e;e     5
#5     D4   genus     d;d     7
#6     G1 species   g;g;g    77
#7     C1  family c;c;c;c   233
#8     C1   order     c;c   500
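
To illustrate why counting separators is not the same as comparing `nchar`, here is a small base R sketch. The two strings are made up for this example and do not come from the question's data.

```r
# Hypothetical strings: the first has fewer elements but more characters.
x <- c("aaaa;bbbb", "a;b;c")

# Number of ;-separated elements in each string.
n_elements <- lengths(strsplit(x, ";", fixed = TRUE))

# Number of semicolons in each string (always n_elements - 1 here).
n_semicolons <- lengths(regmatches(x, gregexpr(";", x, fixed = TRUE)))

n_elements             # 2 3
n_semicolons           # 1 2
nchar(x)               # 9 5

which.min(n_elements)  # 1: fewest elements
which.min(nchar(x))    # 2: fewest characters, a different string
```

Taking the minimum of the semicolon count therefore selects the same entry as taking the minimum of the element count, while `nchar` can select a different one.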

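For the second part, if you care about efficiency, count the semicolons once and store the result in a temporary column: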

mydf %>% 
    group_by(ID, TYPE) %>% 
    mutate(n_semicolon = str_count(STRING, ";")) %>% 
    filter(n_semicolon == min(n_semicolon)) %>% 
    select(-n_semicolon)

# A tibble: 8 x 4
# Groups:   ID, TYPE [8]
#      ID    TYPE  STRING VALUE
#  <fctr>  <fctr>  <fctr> <dbl>
#1     A1   class   a;a;a    34
#2     B6   genus     b;b   435
#3     C3   order   c;c;c    23
#4     E8   class e;e;e;e     5
#5     D4   genus     d;d     7
#6     G1 species   g;g;g    77
#7     C1  family c;c;c;c   233
#8     C1   order     c;c   500 
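
Without extra packages, the same grouped filters can be sketched in base R with `ave()`. This is only one possible approach and is not part of the answer above; it assumes the columns are plain character and numeric vectors, and the variable names `g`, `n`, `by_value` and `by_count` are chosen here for illustration.

```r
mydf <- data.frame(
  ID     = c("A1", "B6", "C3", "C3", "E8", "D4", "G1", "B6", "C1", "C1"),
  TYPE   = c("class", "genus", "order", "order", "class", "genus",
             "species", "genus", "family", "order"),
  STRING = c("a;a;a", "b;b", "c;c;c;c", "c;c;c", "e;e;e;e", "d;d",
             "g;g;g", "b;b;b;b;b", "c;c;c;c", "c;c"),
  VALUE  = c(34, 435, 876, 23, 5, 7, 77, 42, 233, 500),
  stringsAsFactors = FALSE
)

# One grouping factor per ID + TYPE key; drop = TRUE removes unused
# combinations so that max()/min() are never called on empty groups.
g <- interaction(mydf$ID, mydf$TYPE, drop = TRUE)

# Criterion 1: keep the rows whose VALUE equals the maximum of their group.
by_value <- mydf[mydf$VALUE == ave(mydf$VALUE, g, FUN = max), ]

# Criterion 2: keep the rows with the fewest ;-separated elements per group.
n <- lengths(strsplit(mydf$STRING, ";", fixed = TRUE))
by_count <- mydf[n == ave(n, g, FUN = min), ]

rownames(by_value)  # "1" "2" "3" "5" "6" "7" "9" "10"
rownames(by_count)  # "1" "2" "4" "5" "6" "7" "9" "10"
```

The original row names are preserved, which makes it easy to compare the result with the expected output in the question.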

Answer 1: (score: 1)

We can use data.table:

1)

library(data.table)
setDT(mydf)[, .SD[which.max(VALUE)], .(ID, TYPE)]
#   ID    TYPE  STRING VALUE
#1: A1   class   a;a;a    34
#2: B6   genus     b;b   435
#3: C3   order c;c;c;c   876
#4: E8   class e;e;e;e     5
#5: D4   genus     d;d     7
#6: G1 species   g;g;g    77
#7: C1  family c;c;c;c   233
#8: C1   order     c;c   500

2)

setDT(mydf)[, n := nchar(gsub("[^;]+", "", STRING))
     ][, .SD[n == min(n)], .(ID, TYPE)][, n := NULL][]
#    ID    TYPE  STRING VALUE
#1: A1   class   a;a;a    34
#2: B6   genus     b;b   435
#3: C3   order   c;c;c    23
#4: E8   class e;e;e;e     5
#5: D4   genus     d;d     7
#6: G1 species   g;g;g    77
#7: C1  family c;c;c;c   233
#8: C1   order     c;c   500
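
Keep in mind that `setDT()` converts `mydf` to a data.table by reference, so the original object is changed. If the original data frame should stay untouched, a possible variant is to work on a copy made with `as.data.table()`. The sketch below assumes STRING is a character column; the names `dt`, `by_value` and `by_count` are chosen here for illustration.

```r
library(data.table)

mydf <- data.frame(
  ID     = c("A1", "B6", "C3", "C3", "E8", "D4", "G1", "B6", "C1", "C1"),
  TYPE   = c("class", "genus", "order", "order", "class", "genus",
             "species", "genus", "family", "order"),
  STRING = c("a;a;a", "b;b", "c;c;c;c", "c;c;c", "e;e;e;e", "d;d",
             "g;g;g", "b;b;b;b;b", "c;c;c;c", "c;c"),
  VALUE  = c(34, 435, 876, 23, 5, 7, 77, 42, 233, 500),
  stringsAsFactors = FALSE
)

# as.data.table() returns a copy, so mydf itself stays a plain data.frame.
dt <- as.data.table(mydf)

# Criterion 1: the row with the highest VALUE per ID + TYPE key.
by_value <- dt[, .SD[which.max(VALUE)], by = .(ID, TYPE)]

# Criterion 2: the rows with the fewest ;-separated elements per key.
# The helper column n is added to the copy and removed from the result.
by_count <- dt[, n := lengths(strsplit(STRING, ";", fixed = TRUE))
               ][, .SD[n == min(n)], by = .(ID, TYPE)][, n := NULL][]

is.data.table(mydf)  # FALSE
nrow(by_value)       # 8
nrow(by_count)       # 8
```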