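How can I make the rows of the data frame below unique with respect to the second column when that column contains blank/missing values?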
> head(interproscan)
V1 V14
1 sp0000001-mRNA-1
2 sp0000001-mRNA-1
3 sp0000001-mRNA-1
4 sp0000005-mRNA-1 GO:0003723
5 sp0000006-mRNA-1 GO:0016021
6 sp0000006-mRNA-1 GO:0016021
> head(unique(interproscan[ , 1:2] ))
V1 V14
1 sp0000001-mRNA-1
4 sp0000005-mRNA-1 GO:0003723
5 sp0000006-mRNA-1 GO:0016021
7 sp0000006-mRNA-2 GO:0016021
9 sp0000006-mRNA-3 GO:0016021
The desired result is:
V1 V14
1 sp0000001-mRNA-1
4 sp0000005-mRNA-1 GO:0003723
5 sp0000006-mRNA-1 GO:0016021
Thanks in advance.
Answer 0 (score: 1)
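You need to modify V1 so that the rows group the way you intend. I use gsub to drop the trailing -number suffix, then keep the first row of each (stripped V1, V14) group.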
library(dplyr)
ans <- df %>%
  # group on V1 with its trailing "-<number>" suffix removed, plus V14
  group_by(gene = gsub("-\\d+$", "", V1), V14) %>%
  arrange(V1) %>%   # unnecessary for the toy example, but useful on the full data
  slice(1) %>%      # keep the first row of each group
  ungroup() %>%
  select(-gene)     # discard the intermediate grouping variable
Output
# A tibble: 3 x 3
id V1 V14
<int> <chr> <chr>
1 1 sp0000001-mRNA-1
2 4 sp0000005-mRNA-1 GO:0003723
3 5 sp0000006-mRNA-1 GO:0016021
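As a quick check of the grouping key, the sketch below applies two suffix-stripping patterns to a few IDs. The ID `sp0000006-mRNA-12` is a made-up example that does not appear in the question's data; it is there only to illustrate why a pattern anchored with `\\d+$` may be safer than a bare `-\\d` once an ID ends in more than one digit.

```r
# Two IDs from the question plus one hypothetical multi-digit isoform
ids <- c("sp0000006-mRNA-1", "sp0000006-mRNA-3", "sp0000006-mRNA-12")

# Anchored pattern: removes the whole trailing "-<number>" suffix
key_anchored <- gsub("-\\d+$", "", ids)

# Unanchored single-digit pattern: removes "-1" from "-12" and leaves the "2"
key_loose <- gsub("-\\d", "", ids)

print(key_anchored)
print(key_loose)
```

With the anchored pattern all three IDs collapse to the same key, so they would land in the same group; with the loose pattern the hypothetical third ID ends up with a different key.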
Data
df <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L), V1 = c("sp0000001-mRNA-1",
"sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000005-mRNA-1", "sp0000006-mRNA-1",
"sp0000006-mRNA-1", "sp0000006-mRNA-2", "sp0000006-mRNA-3"),
V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021",
"GO:0016021", "GO:0016021")), class = "data.frame", .Names = c("id",
"V1", "V14"), row.names = c(NA, -8L))
id V1 V14
1 1 sp0000001-mRNA-1
2 2 sp0000001-mRNA-1
3 3 sp0000001-mRNA-1
4 4 sp0000005-mRNA-1 GO:0003723
5 5 sp0000006-mRNA-1 GO:0016021
6 6 sp0000006-mRNA-1 GO:0016021
7 7 sp0000006-mRNA-2 GO:0016021
8 9 sp0000006-mRNA-3 GO:0016021
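For readers who prefer not to load dplyr, the same idea can be sketched in base R with `duplicated()`. This is an alternative under the same assumptions as the answer (the grouping key is V1 without its trailing number, paired with V14), not the answer's own code. The name `key` is introduced here purely for illustration. Unlike the dplyr pipeline, it keeps the first row in the original row order rather than sorting by V1 first.

```r
# Same eight rows as the Data section above
df <- data.frame(
  id  = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L),
  V1  = c("sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000001-mRNA-1",
          "sp0000005-mRNA-1", "sp0000006-mRNA-1", "sp0000006-mRNA-1",
          "sp0000006-mRNA-2", "sp0000006-mRNA-3"),
  V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021",
          "GO:0016021", "GO:0016021"),
  stringsAsFactors = FALSE
)

# Illustrative grouping key: V1 without its trailing "-<number>", plus V14
key <- data.frame(gene = gsub("-\\d+$", "", df$V1), V14 = df$V14)

# duplicated() flags every repeat of a key after its first occurrence,
# so negating it keeps exactly one row per key
res <- df[!duplicated(key), ]
print(res)
```

On this toy data the rows kept are those with id 1, 4 and 5, matching the three rows in the Output section.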
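Answer 1 (score: 0)
Try converting the object to a plain data frame (or a data table) and calling unique() on it. Note that unique() compares entire rows, so this reproduces the target only when V1 does not vary within a V14 value, as in the six-row sample data below: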
interproscan <- data.frame(interproscan)
unique(interproscan)
Output:
V1 V14
1 sp0000001-mRNA-1
4 sp0000005-mRNA-1 GO:0003723
5 sp0000006-mRNA-1 GO:0016021
Sample data:
require(data.table)
interproscan <- fread("V1, V14
sp0000001-mRNA-1,
sp0000001-mRNA-1,
sp0000001-mRNA-1,
sp0000005-mRNA-1, GO:0003723
sp0000006-mRNA-1, GO:0016021
sp0000006-mRNA-1, GO:0016021")
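If the goal is literally one row per distinct value of the second column, with the blank counted as a value of its own, a base R sketch using `duplicated()` on `V14` alone could look like the following. It assumes the blanks are stored as empty strings (a column of `NA` values would behave the same way, since `duplicated()` treats repeated `NA`s as duplicates). The eight rows mirror the data reconstructed in Answer 0, and `first_per_go` is a name chosen here for illustration.

```r
# Rows reconstructed from the question's output (an assumption about the full data)
interproscan <- data.frame(
  V1  = c("sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000001-mRNA-1",
          "sp0000005-mRNA-1", "sp0000006-mRNA-1", "sp0000006-mRNA-1",
          "sp0000006-mRNA-2", "sp0000006-mRNA-3"),
  V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021",
          "GO:0016021", "GO:0016021"),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each distinct V14 value, ignoring V1 entirely
first_per_go <- interproscan[!duplicated(interproscan$V14), ]
print(first_per_go)
```

This keeps rows 1, 4 and 5 here, but be aware that it would also collapse unrelated genes that happen to share a GO term, which may or may not be what is wanted on the full table.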