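Joining individual genomic intervals into population regions

Date: 2015-11-16 15:08:53

Tags: r overlap overlapping bioconductor

I want to join individual genomic intervals into common regions.

My input: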

dfin <- "chr start end sample type
        1   10    20   NE1    loss
        1   5     15   NE2    gain
        1   25    30   NE1    gain
        2   40    50   NE1    loss
        2   40    60   NE2    loss
        3   20    30   NE1    gain"
dfin <- read.table(text=dfin, header=T)

My expected output:

dfout <- "chr start end samples type
        1   5     20   NE1-NE2  both
        1   25    30   NE1      gain
        2   40    60   NE1-NE2  loss
        3   20    30   NE1      gain"
dfout <- read.table(text=dfout, header=T)

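Intervals in dfin never overlap within the same animal, only between animals (columns sample and samples, respectively). The column type has two factor levels in dfin (loss and gain) and is expected to have three in dfout (loss, gain and both, where both occurs when a joined region in dfout is based on both a loss and a gain).

Any ideas on how to solve this?

*Updated for @David Arenburg

2 answers: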

Answer 0: (score: 3)

Here is an attempt that uses data.table::foverlaps to group the intervals and then computes all the rest

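library(data.table)

# Key on chr/start/end (this also sorts dfin); foverlaps() needs the table
# it joins against to be keyed
setkey(setDT(dfin), chr, start, end)

# Self-overlap join: for each interval, collect the ids of the intervals it
# overlaps, then give identical overlap sets the same group index
res <- foverlaps(dfin, dfin, which = TRUE)[, toString(xid), by = yid
                                           ][, indx := .GRP, by = V1]$indx

# Collapse each group into a single region
dfin[, .(
          chr = chr[1L],
          start = min(start), 
          end = max(end), 
          samples = paste(unique(sample), collapse = "-"),
          type = if(uniqueN(type) > 1L) "both" else as.character(type[1L])
         ),
       by = res]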

#    res chr start end samples type
# 1:   1   1     5  20 NE2-NE1 both
# 2:   2   1    25  30     NE1 gain
# 3:   3   2    40  60 NE1-NE2 loss
# 4:   4   3    20  30     NE1 gain
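
The grouping step above labels intervals by the set of intervals each one overlaps, so it relies on all members of a region sharing the same overlap set. As a hedged alternative sketch in base R (not the answer's method), regions can also be labelled by sorting on start and opening a new group whenever a start lies beyond the largest end seen so far on that chromosome. The names `d` and `grp` are illustrative, and treating intervals that share an endpoint as overlapping is an assumption.

```r
# Example input copied from the question
dfin <- read.table(text = "chr start end sample type
1 10 20 NE1 loss
1 5 15 NE2 gain
1 25 30 NE1 gain
2 40 50 NE1 loss
2 40 60 NE2 loss
3 20 30 NE1 gain", header = TRUE, stringsAsFactors = FALSE)

# Sort so that overlapping intervals on a chromosome become adjacent
d <- dfin[order(dfin$chr, dfin$start, dfin$end), ]

# Per chromosome, flag (1/0) the rows that open a new region: a row opens a
# region when its start is greater than the running maximum of earlier ends
opens_region <- ave(seq_len(nrow(d)), d$chr, FUN = function(i) {
  prev_max_end <- c(-Inf, head(cummax(d$end[i]), -1))
  d$start[i] > prev_max_end
})

# Running region id over the whole sorted table (hypothetical column name)
d$grp <- cumsum(opens_region)
d
```

Because the comparison uses the running maximum end, chains of intervals (A overlaps B, B overlaps C, but A does not overlap C) end up in one region.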

Answer 1: (score: 1)

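(Extended comment) You could use reduce from the "IRanges" package.

I could not figure out how to stop reduce from losing the "RangedData" object, but with the mapping saved we can still work something out (there may be a more appropriate, "IRanges"-style way to extract the mapping, but I could not find it):

tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning
maps = rep(seq_along(start(tmp)), width(tmp))
maps
#[1] 1 1 2 3 3 4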

tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-"))
#        1         2         3         4 
#"NE1-NE2"     "NE1" "NE1-NE2"     "NE1"
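
The mapping vector `maps` used above was extracted from a "RangedData" object, a class that, as far as I know, recent Bioconductor releases no longer provide. As a sketch under the assumption that the GenomicRanges package is installed, a comparable mapping can be built from `reduce(..., with.revmap = TRUE)` on a GRanges object. The names `gr`, `red` and `revmap` are illustrative and do not come from the answer.

```r
library(GenomicRanges)

# Example input copied from the question
dfin <- read.table(text = "chr start end sample type
1 10 20 NE1 loss
1 5 15 NE2 gain
1 25 30 NE1 gain
2 40 50 NE1 loss
2 40 60 NE2 loss
3 20 30 NE1 gain", header = TRUE, stringsAsFactors = FALSE)

# One range per input row, with the chromosome as the sequence name
gr <- GRanges(seqnames = as.character(dfin$chr),
              ranges = IRanges(start = dfin$start, end = dfin$end))

# Merge overlapping ranges per chromosome; with the default min.gapwidth,
# directly adjacent ranges are merged as well.
# revmap records which input rows went into each merged range.
red <- reduce(gr, with.revmap = TRUE)
revmap <- mcols(red)$revmap

# Invert revmap: for every input row, the index of its merged region
maps <- integer(length(gr))
maps[unlist(revmap)] <- rep(seq_along(revmap), lengths(revmap))
maps
```

Filling `maps` through `unlist(revmap)` keeps it aligned with the row order of `dfin`, so it can be passed straight to `tapply` as in the call above.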

With the mapping of the joined intervals, we can aggregate "sample" and "type" to get the final form. For example:

{{1}}
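
One way this aggregation might look, as a minimal base R sketch rather than the answer's own code: it assumes `dfin` as defined in the question and copies the `maps` vector from the output printed earlier in this answer. The name `idx` is illustrative.

```r
# Example input copied from the question
dfin <- read.table(text = "chr start end sample type
1 10 20 NE1 loss
1 5 15 NE2 gain
1 25 30 NE1 gain
2 40 50 NE1 loss
2 40 60 NE2 loss
3 20 30 NE1 gain", header = TRUE, stringsAsFactors = FALSE)

# Region index for each row of dfin, copied from the output shown above
maps <- c(1, 1, 2, 3, 3, 4)

# Row numbers of dfin belonging to each joined region
idx <- split(seq_len(nrow(dfin)), maps)

# One output row per region: widest extent, all samples, and "both" when a
# region mixes gain and loss
dfout <- data.frame(
  chr     = sapply(idx, function(i) dfin$chr[i][1]),
  start   = sapply(idx, function(i) min(dfin$start[i])),
  end     = sapply(idx, function(i) max(dfin$end[i])),
  samples = sapply(idx, function(i) paste(unique(dfin$sample[i]), collapse = "-")),
  type    = sapply(idx, function(i) {
    types <- unique(as.character(dfin$type[i]))
    if (length(types) > 1) "both" else types
  }),
  stringsAsFactors = FALSE
)
dfout
```

On the example data this should give the four regions listed in the expected output of the question.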