Getting the frequency of overlapping genomic ranges in R

Date: 2018-08-21 16:31:46

Tags: r dataframe frequency genomicranges

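I am using the GenomicRanges R package to find overlaps between two sets of genomic ranges. The output of the findOverlaps function gives two pieces of information: (1) the row numbers of the ranges in list A that overlap, and (2) the row numbers of the ranges in list B that they overlap with.

I am interested in the overlaps for list A, and would like to add a column to list A giving the number of overlaps for each row.

Here is a reproducible example that you can run directly in R: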

    # Define SetA
    chrA = c(7,3,22)
    startA = c(127991052,37327681,50117297)
    stopA = c(127991052,37327681,50117297)
    SetA = data.frame(chrA,startA,stopA)

    # Define SetB
    chrB = c(1,3,22,22)
    startB = c(105278917,37236502,46384621,49214228)
    stopB = c(105451039,37411958,50796976,50727239)
    SetB = data.frame(chrB,startB,stopB)

    # Find overlaps between SetA and SetB
    library(GenomicRanges)
    gr0 = with(SetA, GRanges(chrA, IRanges(start=startA, end=stopA)))
    gr1 = with(SetB, GRanges(chrB, IRanges(start=startB, end=stopB)))

    hits = findOverlaps(gr0, gr1)
    # The first column of hits (queryHits) holds the row numbers of the
    # SetA ranges that overlap with SetB
    hits = data.frame(hits)
    hits

I want to add a column to SetA that gives how often each row overlaps with SetB. Here is my attempt, along with the output I need:

    # Calculate frequencies for the first column of hits
    # (table() only lists rows of SetA that have at least one hit)
    OverlapFreq = data.frame(table(hits$queryHits))
    OverlapFreq

    # Expected output:
    SetA$OverlapFreq = c(0,1,2)
    SetA

Any suggestions on how to achieve this would be appreciated!

2 answers:

Answer 0 (score: 1)

I figured out the answer: simply use the countOverlaps function from the same package:

    OverlapFreq = countOverlaps(gr0, gr1)
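To connect this back to the question, here is a minimal sketch of attaching those counts to SetA as a new column. It rebuilds the example data so it runs on its own, and it stores the chromosomes as character labels, which is my own choice and not part of the original code. The `tabulate()` line is an assumed alternative for anyone who already has the `findOverlaps` hits: unlike `table()`, it keeps rows with zero overlaps. The name `freq_from_hits` is made up for illustration.

```r
library(GenomicRanges)

# Example data from the question (chromosomes stored as character labels,
# an assumption made here for clarity)
SetA <- data.frame(chrA   = c("7", "3", "22"),
                   startA = c(127991052, 37327681, 50117297),
                   stopA  = c(127991052, 37327681, 50117297))
SetB <- data.frame(chrB   = c("1", "3", "22", "22"),
                   startB = c(105278917, 37236502, 46384621, 49214228),
                   stopB  = c(105451039, 37411958, 50796976, 50727239))

gr0 <- GRanges(SetA$chrA, IRanges(start = SetA$startA, end = SetA$stopA))
gr1 <- GRanges(SetB$chrB, IRanges(start = SetB$startB, end = SetB$stopB))

# One count per range in gr0, in the same order as the rows of SetA
SetA$OverlapFreq <- countOverlaps(gr0, gr1)

# Same counts derived from the findOverlaps() hits;
# nbins = length(gr0) keeps the rows that have no hit at all
hits <- findOverlaps(gr0, gr1)
freq_from_hits <- tabulate(queryHits(hits), nbins = length(gr0))

SetA
```

For the example data both approaches should give 0, 1 and 2 for the three rows of SetA, which matches the expected output in the question.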

Answer 1 (score: 0)

You can also use the plyranges version of the function:

    library(plyranges)

    # direct
    gr0$n_overlaps <- count_overlaps(gr0, gr1)

    # dplyr style
    overlaps <- gr0 %>% mutate(n_overlaps = count_overlaps(., gr1))

I would also recommend plyranges for join operations.

    # return overlapping ranges
    find_overlaps(gr0,gr1)