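I'm trying to write a function that compares the ranges in two data frames and finds the overlaps.
My data frames have three columns: V1, V2, V3
(the first column is the chromosome, the second is the start coordinate, and the third is the end coordinate).
For example, df1: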
V1 V2 V3
chr1 10 25
chr1 20 100
chr1 98 101
chr2 10 15
chr2 35 46
chr3 50 55
chr3 60 90
chr4 5 100
df2:
V1 V2 V3
chr1 95 105
chr1 200 205
chr2 45 50
chr2 49 51
chr2 55 90
chr3 50 100
chr4 101 110
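I want to write a function that finds any overlap.
The function find_overlap(df1, df2)
should return df1 with a new column indicating whether each row overlaps anything in df2, like this: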
df1:
V1 V2 V3 overlap
chr1 10 25 0
chr1 20 100 1
chr1 98 101 1
chr2 10 15 0
chr2 35 46 1
chr3 50 55 1
chr3 60 90 1
chr4 5 100 0
And if I call find_overlap(df2, df1):
df2:
V1 V2 V3 overlap
chr1 95 105 1
chr1 200 205 0
chr2 45 50 1
chr2 49 51 0
chr2 55 90 0
chr3 50 100 1
chr4 101 110 0
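I'd appreciate it if you could show me how to do this in R.
(It would be even better if the function returned a vector of overlaps instead of adding a new column.)
Thanks.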
Answer 0 (score: 4)
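Here is one approach using the foverlaps() function from the data.table package, and another using overlapsAny from GenomicRanges. There are quite a few posts on Stackoverflow, though, that should help you.
You can install GenomicRanges (a Bioconductor package) by following the instructions here.
data.table: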
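library(data.table)
# Convert to data.tables by reference and key both on chromosome, start, end.
# foverlaps() needs the second table keyed, with the interval columns last.
# setkey() sorts the rows; the example data are already in sorted order.
setDT(df1)
setDT(df2)
setkey(df1, V1, V2, V3)
setkey(df2, V1, V2, V3)
# Returns 1 for each row of df1 that overlaps any row of df2, else 0.
any_overlaps_dt = function(df1, df2) {
  # With which=TRUE and mult="first", the result is the index of the first
  # overlapping row of df2, or NA when there is no overlap.
  olaps = foverlaps(df1, df2, mult="first", type="any", which=TRUE)
  as.integer(!is.na(olaps))
}
olaps_12 = any_overlaps_dt(df1, df2)
# [1] 0 1 1 0 1 1 1 0
olaps_21 = any_overlaps_dt(df2, df1)
# [1] 1 0 1 0 0 1 0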
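GenomicRanges:
library(GenomicRanges)
# Returns 1 for each row of df1 that overlaps any row of df2, else 0.
any_overlaps_GR = function(df1, df2) {
  # Build GRanges objects from chromosome (column 1), start (2) and end (3).
  gr1 = GRanges(Rle(df1[[1]]), IRanges(df1[[2]], df1[[3]]))
  gr2 = GRanges(Rle(df2[[1]]), IRanges(df2[[2]], df2[[3]]))
  as.integer(overlapsAny(gr1, gr2, type="any", ignore.strand=TRUE))
}
olaps_12 = any_overlaps_GR(df1, df2)
# [1] 0 1 1 0 1 1 1 0
olaps_21 = any_overlaps_GR(df2, df1)
# [1] 1 0 1 0 0 1 0
See the GenomicRanges documentation for alternative ways of creating GRanges objects from data.frames.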
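For reference, the overlap test that both packages apply can be written out in base R. This is only a minimal sketch, not part of the answer above: the name find_overlap_base is hypothetical, it assumes closed intervals (so ranges that share an endpoint count as overlapping) with chromosome, start and end in columns 1 to 3, and it compares every row against every row on the same chromosome, so it will be much slower than foverlaps() or overlapsAny on large inputs.

```r
# Hypothetical helper: for each row of `a`, return 1 if it overlaps
# any row of `b` on the same chromosome, else 0.
find_overlap_base <- function(a, b) {
  vapply(seq_len(nrow(a)), function(i) {
    same_chr <- as.character(b[[1]]) == as.character(a[[1]][i])
    # Two closed intervals overlap when each starts no later than the other ends.
    hit <- same_chr & b[[2]] <= a[[3]][i] & b[[3]] >= a[[2]][i]
    as.integer(any(hit))
  }, integer(1))
}

# The example data from the question
df1 <- data.frame(
  V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr4"),
  V2 = c(10, 20, 98, 10, 35, 50, 60, 5),
  V3 = c(25, 100, 101, 15, 46, 55, 90, 100)
)
df2 <- data.frame(
  V1 = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr3", "chr4"),
  V2 = c(95, 200, 45, 49, 55, 50, 101),
  V3 = c(105, 205, 50, 51, 90, 100, 110)
)

find_overlap_base(df1, df2)
# [1] 0 1 1 0 1 1 1 0
find_overlap_base(df2, df1)
# [1] 1 0 1 0 0 1 0
```

Unlike the data.table version, this sketch does not sort its inputs, so the result is always in the original row order of the first argument.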