I have two data frames, for example:
gene_bacteriadf
seqnames ranges strand
[1] scaffold_1 1-50 -
[2] scaffold_1 60-100 -
[3] scaffold_1 200-350 -
[4] scaffold_2 1550-1650 +
[5] scaffold_2 1900-2300 -
[6] scaffold_5 250-255 +
and overlapdf
seqnames ranges strand hit with_busco with_bacteria Overlap_with
scaffold_2 1550-1650 + | TRUE 101 201 101 0.502487562189055
The idea is simply to remove the rows that match on the seqnames, ranges and strand columns. I tried:
genes_bacteriadf[!(alist(genes_bacteriadf$seqnames, genes_bacteriadf$start, genes_bacteriaf$end, genes_bacteriadf$width) %in% (alistoverlapsdf$seqnames,overlapsdf$start,overlapsdf$end,overlapsdf$width), ]
But it does not work.
In this example, 1550-1650 on scaffold_2 does match, so I should get a new data frame such as:
seqnames ranges strand
[1] scaffold_1 1-50 -
[2] scaffold_1 60-100 -
[3] scaffold_1 200-350 -
[5] scaffold_2 1900-2300 -
[6] scaffold_5 250-255 +
Does anyone have an idea?
Answer 0 (score: 1)
This is a job for dplyr's anti_join, especially since the column names are the same in both data frames.
library(dplyr)
gene_bacteriadf %>%
anti_join(overlapdf)
Joining, by = c("seqnames", "ranges", "strand")
seqnames ranges strand
1 scaffold_1 1-50 -
2 scaffold_1 60-100 -
3 scaffold_1 200-350 -
4 scaffold_2 1900-2300 -
5 scaffold_5 250-255 +
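For anyone who wants to reproduce this, here is a minimal sketch. It assumes both objects are plain data frames whose seqnames, ranges and strand columns are character vectors. The data frames are a hypothetical reconstruction from the printed output in the question, and the extra columns of overlapdf are left out. The names key_cols, filtered and filtered_base are illustrative only. Passing `by` explicitly restricts the match to the three key columns, and the base R variant shows one way to get the same result without dplyr.

```r
library(dplyr)

# Hypothetical reconstruction of the question's data (key columns only).
gene_bacteriadf <- data.frame(
  seqnames = c("scaffold_1", "scaffold_1", "scaffold_1",
               "scaffold_2", "scaffold_2", "scaffold_5"),
  ranges   = c("1-50", "60-100", "200-350",
               "1550-1650", "1900-2300", "250-255"),
  strand   = c("-", "-", "-", "+", "-", "+"),
  stringsAsFactors = FALSE
)

overlapdf <- data.frame(
  seqnames = "scaffold_2",
  ranges   = "1550-1650",
  strand   = "+",
  stringsAsFactors = FALSE
)

# Columns that together identify a row (an assumption based on the question).
key_cols <- c("seqnames", "ranges", "strand")

# dplyr: keep the rows of gene_bacteriadf that have no match in overlapdf.
filtered <- anti_join(gene_bacteriadf, overlapdf, by = key_cols)

# Base R: build one composite key per row, then drop the rows whose key
# also appears in overlapdf.
gene_key    <- do.call(paste, c(gene_bacteriadf[key_cols], sep = "\t"))
overlap_key <- do.call(paste, c(overlapdf[key_cols], sep = "\t"))
filtered_base <- gene_bacteriadf[!(gene_key %in% overlap_key), ]

print(filtered)
```

The attempt in the question probably fails because `%in%` applied to two lists compares whole columns with each other instead of comparing rows. Building a single key per row, or letting anti_join do the matching, avoids that.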