Find whether there are any overlaps in R

Time: 2014-12-29 06:28:09

Tags: r overlap

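I am trying to write a function that compares the ranges in two data frames and finds the overlaps. My data frames have three columns, V1, V2 and V3 (the first column is the chromosome number, the second is the start coordinate, and the third is the end coordinate).
Say, df1: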

 V1    V2  V3
chr1   10  25
chr1   20  100
chr1   98  101
chr2   10  15
chr2   35  46
chr3   50  55
chr3   60  90
chr4   5   100

df2:
 V1   V2  V3
chr1  95  105
chr1  200 205
chr2  45  50
chr2  49  51
chr2  55  90
chr3  50  100
chr4  101 110 

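I want to write a function that finds any overlap.

The function find_overlap(df1, df2) should return df1 with a new column that flags whether each row overlaps any range in df2, like: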

df1:
 V1    V2  V3  overlap 
chr1   10  25    0
chr1   20  100   1
chr1   98  101   1
chr2   10  15    0
chr2   35  46    1
chr3   50  55    1
chr3   60  90    1
chr4   5   100   0

And if I call find_overlap(df2, df1):

df2:
 V1   V2  V3   overlap
chr1  95  105   1
chr1  200 205   0
chr2  45  50    1
chr2  49  51    0
chr2  55  90    0
chr3  50  100   1
chr4  101 110   0

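I would appreciate it if you could tell me how to do this in R.

(It would be even better if the function returned the overlap vector instead of adding a new column.)
Thanks.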

1 Answer:

Answer 0: (score: 4)

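There are quite a few posts on Stack Overflow that should help you. In any case, here is one solution using the foverlaps() function from the data.table package and another using overlapsAny() from the GenomicRanges package.

GenomicRanges is a Bioconductor package; you can install it by following the instructions here.

Using foverlaps() from data.table: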

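require(data.table)
setDT(df1)   # convert to data.table by reference
setDT(df2)

# setkey() sorts each table; the last two key columns (V2, V3) are
# treated by foverlaps() as the interval start and end
setkey(df1, V1,V2,V3)
setkey(df2, V1,V2,V3)

any_overlaps_dt = function(df1, df2) {
    # row index of the first overlapping range in df2 for each row of df1, NA if none
    olaps = foverlaps(df1, df2, mult="first", type="any", which=TRUE)
    as.integer(!is.na(olaps))
}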

olaps_12 = any_overlaps_dt(df1, df2)
# [1] 0 1 1 0 1 1 1 0

olaps_21 = any_overlaps_dt(df2, df1)
# [1] 1 0 1 0 0 1 0
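
One thing to keep in mind with the code above: setkey() reorders the rows of df1 and df2 by reference, so the returned vectors follow the sorted order. The example data happens to be sorted already. If your input is not, a variant like the following sketch should keep the original row order, because foverlaps() only requires the second table to be keyed. The helper name any_overlaps_keep_order is hypothetical, and it assumes the first three columns are chromosome, start and end.

```r
library(data.table)

# Hypothetical helper, not part of the original answer.
# Assumes columns 1-3 of both inputs are chromosome, start, end.
any_overlaps_keep_order <- function(x, y) {
  x <- as.data.table(x)          # x is never modified, so its row order is kept
  y <- copy(as.data.table(y))    # work on a copy so the caller's y is not re-sorted
  setkeyv(y, names(y)[1:3])      # only y has to be keyed for foverlaps()
  hits <- foverlaps(x, y, by.x = names(x)[1:3],
                    mult = "first", type = "any", which = TRUE)
  as.integer(!is.na(hits))       # 1 if the row overlaps anything in y, else 0
}

# Small unsorted example (made-up rows taken from the question's ranges)
x <- data.frame(V1 = c("chr2", "chr1", "chr1"),
                V2 = c(35L, 20L, 10L),
                V3 = c(46L, 100L, 25L),
                stringsAsFactors = FALSE)
y <- data.frame(V1 = c("chr1", "chr2"),
                V2 = c(95L, 45L),
                V3 = c(105L, 50L),
                stringsAsFactors = FALSE)

res <- any_overlaps_keep_order(x, y)
```

Here res lines up with the rows of x as they were passed in, not with a sorted copy.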

Using overlapsAny() from GenomicRanges:

require(GenomicRanges)

any_overlaps_GR = function(df1, df2) {
    gr1 = GRanges(Rle(df1[[1]]), IRanges(df1[[2]], df1[[3]]))
    gr2 = GRanges(Rle(df2[[1]]), IRanges(df2[[2]], df2[[3]]))
    as.integer(overlapsAny(gr1, gr2, type="any", ignore.strand=TRUE))
}

olaps_12 = any_overlaps_GR(df1, df2)
# [1] 0 1 1 0 1 1 1 0

olaps_21 = any_overlaps_GR(df2, df1)
# [1] 1 0 1 0 0 1 0

Check the GenomicRanges documentation for alternative ways of creating GRanges objects from data.frames.
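
If you would rather not install any package, the same check can be sketched in base R. This is not from the answer above, only a minimal illustration: it assumes closed intervals (two ranges on the same chromosome overlap when each one starts no later than the other ends) and that the first three columns are chromosome, start and end. It compares every row of one table against all rows of the other, so it is only reasonable for small inputs. The data frames are rebuilt from the tables in the question.

```r
# Example data copied from the question
df1 <- data.frame(
  V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3", "chr3", "chr4"),
  V2 = c(10L, 20L, 98L, 10L, 35L, 50L, 60L, 5L),
  V3 = c(25L, 100L, 101L, 15L, 46L, 55L, 90L, 100L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  V1 = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr3", "chr4"),
  V2 = c(95L, 200L, 45L, 49L, 55L, 50L, 101L),
  V3 = c(105L, 205L, 50L, 51L, 90L, 100L, 110L),
  stringsAsFactors = FALSE
)

# Illustrative base-R version of the find_overlap() the question asks for.
# Returns an integer vector: 1 if the row of `a` overlaps any row of `b`, else 0.
find_overlap <- function(a, b) {
  a_chr <- as.character(a[[1]])   # as.character() guards against factor columns
  b_chr <- as.character(b[[1]])
  vapply(seq_len(nrow(a)), function(i) {
    # same chromosome, and b starts before a ends, and b ends after a starts
    hit <- b_chr == a_chr[i] & b[[2]] <= a[[3]][i] & b[[3]] >= a[[2]][i]
    as.integer(any(hit))
  }, integer(1))
}

find_overlap(df1, df2)
# [1] 0 1 1 0 1 1 1 0

find_overlap(df2, df1)
# [1] 1 0 1 0 0 1 0
```

For large genomic tables the foverlaps() or overlapsAny() versions above are the better choice, since this sketch does quadratic work.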