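Averaging signal over several intervals with GRanges in R

Date: 2014-02-12 10:05:12

Tags: r bioconductor

I have a genome-wide ChIP-seq signal imported from a bedGraph file into a GRanges object. I would like to plot the average signal over fixed-width intervals covering all peaks. How can I extract the signal into numeric vectors so that I can average them?

For example: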

library(GenomicRanges)
set.seed(1)

signal <- GRanges(
    seqnames = Rle(c("chr1"), c(10)),
    ranges = IRanges(1:10*10, end = 1:10*10+5),
    score = runif(10))

intervals <- GRanges(
    seqnames = Rle(c("chr1"), c(5)),
    ranges = IRanges(1:5*20 + floor(runif(5)*4), width = 10))

So the signal looks like this:

GRanges with 10 ranges and 1 metadata column:
       seqnames     ranges strand |              score
          <Rle>  <IRanges>  <Rle> |          <numeric>
   [1]     chr1 [ 10,  15]      * |    0.2655086631421
   [2]     chr1 [ 20,  25]      * |   0.37212389963679
   [3]     chr1 [ 30,  35]      * |  0.572853363351896
   [4]     chr1 [ 40,  45]      * |  0.908207789994776
   [5]     chr1 [ 50,  55]      * |  0.201681931037456
   [6]     chr1 [ 60,  65]      * |  0.898389684967697
   [7]     chr1 [ 70,  75]      * |  0.944675268605351
   [8]     chr1 [ 80,  85]      * |  0.660797792486846
   [9]     chr1 [ 90,  95]      * |   0.62911404389888
  [10]     chr1 [100, 105]      * | 0.0617862704675645
  ---
  seqlengths:
   chr1
     NA

and the intervals look like this:

GRanges with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]     chr1 [ 20,  29]      *
  [2]     chr1 [ 40,  49]      *
  [3]     chr1 [ 62,  71]      *
  [4]     chr1 [ 81,  90]      *
  [5]     chr1 [103, 112]      *
  ---
  seqlengths:
   chr1
     NA

So I would like to average these vectors:

Rle(c(0.372, 0), c(6, 4))            # [ 20, 29]
Rle(c(0.908, 0), c(6, 4))            # [ 40, 49]
Rle(c(0.898, 0, 0.945), c(4, 4, 2))  # [ 62, 71]
Rle(c(0.661, 0, 0.629), c(5, 4, 1))  # [ 81, 90]
Rle(c(0.062, 0), c(3, 7))            # [103,112]

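How can I do this without for loops and a lot of tedious, error-prone interval arithmetic? I expected the GenomicRanges package to include this kind of functionality, but I cannot find it in the manual. I have been trying subsetByOverlaps, but it does not seem to carry the signal scores over into the result, nor does it help with extracting the Rle vectors above.

3 answers:

Answer 0 (score: 2)

I think I may have figured it out. I can apply a getScores() function to each range in intervals. The function uses findOverlaps and is adapted from this answer: https://stackoverflow.com/a/9913411/959926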

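# Per-base scores for one interval: 0 where no signal range overlaps.
# Relies on the global `signal` object defined in the question.
getScores <- function(interval) {
    scores <- Rle(0, width(interval))
    # One width-1 range per base of the interval
    bases <- GRanges(
        seqnames = seqnames(interval),
        ranges = IRanges(start(interval):end(interval), width = 1))
    overlaps <- findOverlaps(signal, bases)
    # Position of each covered base relative to the interval start
    idx <- start(bases)[subjectHits(overlaps)] - start(interval) + 1
    scores[idx] <- score(signal)[queryHits(overlaps)]
    scores
}
# Sum the per-interval vectors position by position, then divide by their number
perInterval <- lapply(split(intervals, seq_along(intervals)), getScores)
Reduce(`+`, perInterval) / length(intervals)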

This seems to work so far, but any improvements are welcome. For example, it is slow when the signal and/or the intervals are long.
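One possible way around the slowness is to build a per-base coverage vector once and index it with all intervals at the same time, instead of calling findOverlaps for each interval. The following is only a sketch under some assumptions: a single chromosome (chr1), equal-width intervals, non-overlapping signal ranges (overlapping ranges would have their scores summed), and a chromosome length of 120 that is made up here so the last interval fits. The names cov, perBase and averageProfile are illustrative.

```r
library(GenomicRanges)
set.seed(1)

# Same toy data as in the question.
signal <- GRanges(
    seqnames = Rle("chr1", 10),
    ranges = IRanges(1:10 * 10, end = 1:10 * 10 + 5),
    score = runif(10))
intervals <- GRanges(
    seqnames = Rle("chr1", 5),
    ranges = IRanges(1:5 * 20 + floor(runif(5) * 4), width = 10))

# Assumption: a chromosome length of 120, chosen only so that the last
# interval [103, 112] does not run past the end of the coverage vector.
seqlengths(signal) <- c(chr1 = 120L)

# Per-base signal: each base gets the score of the range covering it,
# and 0 where nothing covers it.
cov <- coverage(signal, weight = "score")

# Pull out the bases under every interval and arrange them as a
# (position within interval) x (interval) matrix. Requires equal widths.
stopifnot(length(unique(width(intervals))) == 1)
perBase <- matrix(as.numeric(cov$chr1[ranges(intervals)]),
                  nrow = width(intervals)[1])

# Average across intervals at each relative position.
averageProfile <- rowMeans(perBase)
```

Each column of perBase corresponds to one of the Rle vectors listed in the question, and averageProfile is their position-wise mean. With several chromosomes, the same indexing would have to be repeated per chromosome.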

Answer 1 (score: 1)

How about this solution?

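overlaps <- findOverlaps(signal, intervals)
# Keep the overlapping signal ranges in `sites` rather than overwriting `signal`
sites <- signal[queryHits(overlaps)]
# One mean per interval, taken over the scores of the signal ranges that overlap it
averagedSignal <- aggregate(score(sites), list(subjectHits(overlaps)), mean)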

Answer 2 (score: 0)

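 overlaps <- findOverlaps(signal, intervals)
 sites <- signal[queryHits(overlaps)]
 # aggregate() returns a data.frame: Group.1 = interval index, x = mean score
 means <- aggregate(score(sites), list(subjectHits(overlaps)), mean)
 # Intervals without any overlapping signal range get NA
 intervals$averagedSignal <- means$x[match(seq_along(intervals), means$Group.1)]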