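I have a data frame in which some entries share the same ID and TYPE values, although ID + TYPE is supposed to be my unique key. I need to filter out the entries with a duplicated ID and TYPE, using a third column to decide which entry to keep and which to discard.
The data frame looks like this: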
mydf <- data.frame(ID=c('A1','B6','C3','C3','E8','D4','G1','B6','C1','C1'),
TYPE=c('class','genus','order','order','class','genus','species','genus','family','order'),
STRING=c('a;a;a','b;b','c;c;c;c','c;c;c','e;e;e;e','d;d','g;g;g','b;b;b;b;b','c;c;c;c','c;c'),
VALUE=c(34,435,876,23,5,7,77,42,233,500))
mydf
ID TYPE STRING VALUE
1 A1 class a;a;a 34
2 B6 genus b;b 435
3 C3 order c;c;c;c 876
4 C3 order c;c;c 23
5 E8 class e;e;e;e 5
6 D4 genus d;d 7
7 G1 species g;g;g 77
8 B6 genus b;b;b;b;b 42
9 C1 family c;c;c;c 233
10 C1 order c;c 500
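So the entries for C3 + order and B6 + genus are duplicated. I want to test two ways of choosing which one to keep:
1- the one of the two (or more, in my real data frame) with the highest VALUE
2- the one of the two (or more) with the fewest elements in STRING, where elements are separated by ';' (not necessarily the shortest nchar)
With option 1, I should obtain the following (without entries 4 and 8):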
mydf
ID TYPE STRING VALUE
1 A1 class a;a;a 34
2 B6 genus b;b 435
3 C3 order c;c;c;c 876
5 E8 class e;e;e;e 5
6 D4 genus d;d 7
7 G1 species g;g;g 77
9 C1 family c;c;c;c 233
10 C1 order c;c 500
With option 2, I should obtain the following (without entries 3 and 8):
mydf
ID TYPE STRING VALUE
1 A1 class a;a;a 34
2 B6 genus b;b 435
4 C3 order c;c;c 23
5 E8 class e;e;e;e 5
6 D4 genus d;d 7
7 G1 species g;g;g 77
9 C1 family c;c;c;c 233
10 C1 order c;c 500
Any clue on how to obtain these subsets by filtering out those entries? Many thanks!
Answer 0 (score: 3)
With dplyr, you can do the following. For option 1 (keep the highest VALUE per ID + TYPE):
library(dplyr)
mydf %>% group_by(ID, TYPE) %>% filter(VALUE == max(VALUE))
# A tibble: 8 x 4
# Groups: ID, TYPE [8]
# ID TYPE STRING VALUE
# <fctr> <fctr> <fctr> <dbl>
#1 A1 class a;a;a 34
#2 B6 genus b;b 435
#3 C3 order c;c;c;c 876
#4 E8 class e;e;e;e 5
#5 D4 genus d;d 7
#6 G1 species g;g;g 77
#7 C1 family c;c;c;c 233
#8 C1 order c;c 500
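One thing to be aware of: `filter(VALUE == max(VALUE))` keeps every row that equals the group maximum, so a tie on VALUE would still leave duplicates for that key. The following is a minimal sketch, not part of the original answer, that assumes dplyr 1.0.0 or later (for `slice_max()`) and uses a hypothetical toy data frame named `tied` to show the difference.

```r
library(dplyr)

# Hypothetical toy data: the two C3 + order rows tie on VALUE
tied <- data.frame(ID    = c("C3", "C3", "B6"),
                   TYPE  = c("order", "order", "genus"),
                   VALUE = c(876, 876, 42))

# filter() keeps all rows equal to the group maximum, so both C3 rows survive
kept_all <- tied %>%
  group_by(ID, TYPE) %>%
  filter(VALUE == max(VALUE)) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per group
kept_one <- tied %>%
  group_by(ID, TYPE) %>%
  slice_max(VALUE, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(kept_all)  # 3
nrow(kept_one)  # 2
```

If ties cannot occur in your data, the two forms should give the same rows.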
# Option 2: keep the row with the fewest ';' separators in STRING
library(stringr)
mydf %>%
group_by(ID, TYPE) %>%
filter(str_count(STRING, ";") == min(str_count(STRING, ";")))
# A tibble: 8 x 4
# Groups: ID, TYPE [8]
# ID TYPE STRING VALUE
# <fctr> <fctr> <fctr> <dbl>
#1 A1 class a;a;a 34
#2 B6 genus b;b 435
#3 C3 order c;c;c 23
#4 E8 class e;e;e;e 5
#5 D4 genus d;d 7
#6 G1 species g;g;g 77
#7 C1 family c;c;c;c 233
#8 C1 order c;c 500
For the second part, if you care about efficiency, compute the semicolon count once and reuse it:
mydf %>%
group_by(ID, TYPE) %>%
mutate(n_semicolon = str_count(STRING, ";")) %>%
filter(n_semicolon == min(n_semicolon)) %>%
select(-n_semicolon)
# A tibble: 8 x 4
# Groups: ID, TYPE [8]
# ID TYPE STRING VALUE
# <fctr> <fctr> <fctr> <dbl>
#1 A1 class a;a;a 34
#2 B6 genus b;b 435
#3 C3 order c;c;c 23
#4 E8 class e;e;e;e 5
#5 D4 genus d;d 7
#6 G1 species g;g;g 77
#7 C1 family c;c;c;c 233
#8 C1 order c;c 500
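The question asks for the fewest elements, while the code above counts separators. For strings like the ones in the example the two rankings agree, because the number of elements is the number of semicolons plus one. A small base R sketch of that relationship follows; the vector `strings` is a hypothetical sample, and empty strings or trailing semicolons would need separate handling.

```r
# Hypothetical sample taken from the STRING column of the example
strings <- c("a;a;a", "b;b", "c;c;c;c")

# Number of ';'-separated elements in each string
n_elements <- lengths(strsplit(strings, ";", fixed = TRUE))

# Number of ';' characters in each string
n_semicolons <- lengths(regmatches(strings, gregexpr(";", strings, fixed = TRUE)))

n_elements        # 3 2 4
n_semicolons + 1  # 3 2 4
```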
Answer 1 (score: 1)
We can use data.table.
1)
library(data.table)
setDT(mydf)[, .SD[which.max(VALUE)], .(ID, TYPE)]
# ID TYPE STRING VALUE
#1: A1 class a;a;a 34
#2: B6 genus b;b 435
#3: C3 order c;c;c;c 876
#4: E8 class e;e;e;e 5
#5: D4 genus d;d 7
#6: G1 species g;g;g 77
#7: C1 family c;c;c;c 233
#8: C1 order c;c 500
2)
setDT(mydf)[, n := nchar(gsub("[^;]+", "", STRING))
][, .SD[n == min(n)], .(ID, TYPE)][, n := NULL][]
# ID TYPE STRING VALUE
#1: A1 class a;a;a 34
#2: B6 genus b;b 435
#3: C3 order c;c;c 23
#4: E8 class e;e;e;e 5
#5: D4 genus d;d 7
#6: G1 species g;g;g 77
#7: C1 family c;c;c;c 233
#8: C1 order c;c 500
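For completeness, both selection rules can also be sketched in base R without extra packages. This is an illustrative alternative rather than something taken from the answers above; the names `key`, `n_elem`, `keep_value` and `keep_short` are made up for the example, and the key is built with `paste()` on the assumption that ID and TYPE contain no spaces that could make two different keys collide.

```r
mydf <- data.frame(
  ID     = c('A1','B6','C3','C3','E8','D4','G1','B6','C1','C1'),
  TYPE   = c('class','genus','order','order','class','genus','species','genus','family','order'),
  STRING = c('a;a;a','b;b','c;c;c;c','c;c;c','e;e;e;e','d;d','g;g;g','b;b;b;b;b','c;c;c;c','c;c'),
  VALUE  = c(34,435,876,23,5,7,77,42,233,500),
  stringsAsFactors = FALSE
)

# One grouping label per ID + TYPE combination
key <- paste(mydf$ID, mydf$TYPE)

# Option 1: keep rows whose VALUE equals the maximum within their key
keep_value <- mydf[mydf$VALUE == ave(mydf$VALUE, key, FUN = max), ]

# Option 2: keep rows with the fewest ';'-separated elements within their key
n_elem <- lengths(strsplit(mydf$STRING, ";", fixed = TRUE))
keep_short <- mydf[n_elem == ave(n_elem, key, FUN = min), ]

rownames(keep_value)  # "1" "2" "3" "5" "6" "7" "9" "10"
rownames(keep_short)  # "1" "2" "4" "5" "6" "7" "9" "10"
```

As with the `filter()` version, tied rows within a key would all be kept by this sketch.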