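I am using the GenomicRanges R package to find overlaps between two sets of genomic ranges. The output of the findOverlaps function provides two pieces of information: 1. the row numbers of the ranges in set A that overlap set B (queryHits), and 2. the row numbers of the ranges in set B that they overlap (subjectHits).
I am interested in the overlaps for set A, and I would like to add a column to set A giving the number of overlaps for each row.
Here is a reproducible example that you can run directly in R: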
#Define SetA
chrA = c(7,3,22)
startA = c(127991052,37327681,50117297)
stopA = c(127991052,37327681,50117297)
SetA = data.frame(chrA,startA,stopA)
#Define SetB
chrB = c(1,3,22,22)
startB = c(105278917,37236502,46384621,49214228)
stopB = c(105451039,37411958,50796976,50727239)
SetB = data.frame(chrB,startB,stopB)
#Find Overlaps between SetA and SetB
library(GenomicRanges)
gr0 = with(SetA, GRanges(chrA, IRanges(start=startA, end=stopA)))
gr1 = with(SetB, GRanges(chrB, IRanges(start=startB, end=stopB)))
hits = findOverlaps(gr0, gr1)
hits = as.data.frame(hits) #the first col of hits (queryHits) holds the row numbers (from SetA) of genomic ranges that overlap with SetB
I would like to add a column to SetA indicating how many times each row overlaps with SetB. Here is my attempt, followed by the output I need:
#Calculate frequencies:
OverlapFreq = data.frame(table(hits$queryHits)) #calculate frequencies for the first col in hits
OverlapFreq
#expected output:
SetA$OverlapFreq = c(0,1,2)
SetA
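Any suggestions on how to achieve this would be appreciated!
Answer 0 (score: 1)
I figured out the answer: simply use the countOverlaps function from the same package. It returns one count per range in the query, including zeros for ranges with no overlap: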
OverlapFreq = countOverlaps(gr0,gr1)
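As a minimal sketch of how this fits the example from the question, the counts can be written straight into SetA. The sketch assumes the Bioconductor package GenomicRanges is installed; the variable name freq_from_hits is only illustrative, and the chromosome values are converted to character here as a precaution. It also shows an equivalent count derived from the findOverlaps result, which is where the table() attempt falls short because table() silently drops rows with zero hits.

```r
# Assumes the Bioconductor package GenomicRanges is installed.
library(GenomicRanges)

# Same data as in the question
SetA <- data.frame(chrA   = c(7, 3, 22),
                   startA = c(127991052, 37327681, 50117297),
                   stopA  = c(127991052, 37327681, 50117297))
SetB <- data.frame(chrB   = c(1, 3, 22, 22),
                   startB = c(105278917, 37236502, 46384621, 49214228),
                   stopB  = c(105451039, 37411958, 50796976, 50727239))

gr0 <- with(SetA, GRanges(as.character(chrA), IRanges(start = startA, end = stopA)))
gr1 <- with(SetB, GRanges(as.character(chrB), IRanges(start = startB, end = stopB)))

# One count per row of SetA, zeros included
SetA$OverlapFreq <- countOverlaps(gr0, gr1)

# Equivalent count from the Hits object: tabulate() keeps empty bins,
# whereas table() would omit rows that have no overlap
hits <- findOverlaps(gr0, gr1)
freq_from_hits <- tabulate(queryHits(hits), nbins = length(gr0))

SetA$OverlapFreq  # 0 1 2
```

Using countOverlaps directly is the shorter route; the tabulate variant is mainly useful when the Hits object is already needed for something else.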
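Answer 1 (score: 0)
You can also use the plyranges versions of these functions:
library(plyranges) # provides count_overlaps, find_overlaps, mutate and %>%
# direct
gr0$n_overlaps <- count_overlaps(gr0, gr1)
# dplyr style
overlaps <- gr0 %>% mutate(n_overlaps = count_overlaps(., gr1))
I would also recommend plyranges for join operations.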
# return overlapping ranges
find_overlaps(gr0,gr1)