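I have a genome-wide ChIP-seq signal that I imported from a bedGraph file into a GRanges object. I want to plot the average signal over fixed-width intervals covering all of the peaks. How can I extract the signal into numeric vectors so that I can average them?
For example: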
library(GenomicRanges)
set.seed(1)
signal <- GRanges(
seqnames = Rle(c("chr1"), c(10)),
ranges = IRanges(1:10*10, end = 1:10*10+5),
score = runif(10))
intervals <- GRanges(
seqnames = Rle(c("chr1"), c(5)),
ranges = IRanges(1:5*20 + floor(runif(5)*4), width = 10))
So the signal looks like this:
GRanges with 10 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 [ 10, 15] * | 0.2655086631421
[2] chr1 [ 20, 25] * | 0.37212389963679
[3] chr1 [ 30, 35] * | 0.572853363351896
[4] chr1 [ 40, 45] * | 0.908207789994776
[5] chr1 [ 50, 55] * | 0.201681931037456
[6] chr1 [ 60, 65] * | 0.898389684967697
[7] chr1 [ 70, 75] * | 0.944675268605351
[8] chr1 [ 80, 85] * | 0.660797792486846
[9] chr1 [ 90, 95] * | 0.62911404389888
[10] chr1 [100, 105] * | 0.0617862704675645
---
seqlengths:
chr1
NA
And the intervals look like this:
GRanges with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 20, 29] *
[2] chr1 [ 40, 49] *
[3] chr1 [ 62, 71] *
[4] chr1 [ 81, 90] *
[5] chr1 [103, 112] *
---
seqlengths:
chr1
NA
So I would like to average these vectors:
Rle(c(0.372, 0), c(6, 4)) # [ 20, 29]
Rle(c(0.908, 0), c(6, 4)) # [ 40, 49]
Rle(c(0.898, 0, 0.945), c(4, 4, 2)) # [ 62, 71]
Rle(c(0.661, 0, 0.629), c(5, 4, 1)) # [ 81, 90]
Rle(c(0.061, 0), c(3, 7)) # [103,112]
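How can I do this without for loops and a lot of tedious, error-prone interval arithmetic? I was hoping the GenomicRanges package would include this kind of functionality, but I cannot find it in the manual. I have been trying subsetByOverlaps, but it does not seem to carry the signal scores over into the result, nor does it seem to help with extracting the Rle vectors above.
Answer 0 (score: 2)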
I think I may have figured it out. I can apply a getScores() function to each interval in turn. The function uses findOverlaps and is adapted from this answer: https://stackoverflow.com/a/9913411/959926
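# Return the per-base signal across one interval as an Rle (0 where no signal overlaps)
getScores <- function(interval) {
  scores <- Rle(0, width(interval))
  # One width-1 range per base of the interval
  bases <- GRanges(
    seqnames = seqnames(interval),
    ranges = IRanges(start(interval):end(interval), width = 1))
  overlaps <- findOverlaps(signal, bases)
  # Copy each overlapping signal score to its position within the interval
  scores[start(bases)[subjectHits(overlaps)] - start(interval) + 1] <- score(signal)[queryHits(overlaps)]
  scores
}
# Sum the per-interval profiles position by position, then divide by the number of intervals
Reduce('+', lapply(split(intervals, seq_along(intervals)), getScores)) / length(intervals)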
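This seems to work so far, but any improvements are welcome. For example, it is slow when the signal and/or the intervals are long.
Answer 1 (score: 1)
How about this solution?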
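One possible way to address the speed concern is to build the per-base signal once with coverage() and then slice it by the intervals, instead of calling findOverlaps once per interval. The following is only a sketch under stated assumptions: it hard-codes the interval starts shown in the question, assumes a chromosome length of 200 (an illustrative value; coverage must extend at least to the end of the last interval), and assumes the installed GenomicRanges version supports subsetting an RleList by a GRanges object. The names cov, profiles, profileMatrix and averageProfile are illustrative.

```r
library(GenomicRanges)

set.seed(1)
signal <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(1:10 * 10, end = 1:10 * 10 + 5),
  score = runif(10),
  seqlengths = c(chr1 = 200L))  # assumed length, not from the original post
intervals <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(c(20, 40, 62, 81, 103), width = 10))

# Per-base signal for each chromosome: the score where a range covers a base, 0 elsewhere
cov <- coverage(signal, weight = "score")

# One Rle per interval (assumes RleList-by-GRanges subsetting is available)
profiles <- cov[intervals]

# Stack the fixed-width profiles into an intervals x positions matrix
profileMatrix <- do.call(rbind, lapply(profiles, as.numeric))

# Average signal at each position across all intervals
averageProfile <- colMeans(profileMatrix)
```

Because all intervals have the same width, the profiles line up column by column, so colMeans() gives the average at each relative position. If signal ranges overlapped each other, coverage() would sum their scores, which differs from the overwrite behaviour of getScores() above.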
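overlaps <- findOverlaps(signal, intervals)
# Signal ranges that overlap an interval, one row per overlap
sites <- signal[queryHits(overlaps)]
# Mean score of the overlapping signal ranges, grouped by interval index
averagedSignal <- aggregate(score(sites), list(subjectHits(overlaps)), mean)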
Answer 2 (score: 0)
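overlaps <- findOverlaps(signal, intervals)
sites <- signal[queryHits(overlaps)]
agg <- aggregate(score(sites), list(interval = subjectHits(overlaps)), mean)
# aggregate() returns a data frame; intervals with no overlapping signal stay NA
avg <- replace(rep(NA_real_, length(intervals)), agg$interval, agg$x)
intervals$averagedSignal <- avg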
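Note that the aggregate() call above averages the scores of the signal ranges that overlap each interval, so each overlapping range counts equally no matter how many bases it covers, and uncovered bases are ignored. If a per-base mean is wanted instead (with uncovered bases counted as zero, as in the Rle vectors in the question), one possible sketch for the single-chromosome example is shown below. The chromosome length of 200 and the hard-coded interval starts are assumptions made for illustration, and perBaseMean is an illustrative name.

```r
library(GenomicRanges)

set.seed(1)
signal <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(1:10 * 10, end = 1:10 * 10 + 5),
  score = runif(10),
  seqlengths = c(chr1 = 200L))  # assumed length, not from the original post
intervals <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(c(20, 40, 62, 81, 103), width = 10))

# Per-base signal on chr1: the score where a range covers a base, 0 elsewhere
cov <- coverage(signal, weight = "score")

# Views onto the chr1 coverage, one per interval
v <- Views(cov[["chr1"]], ranges(intervals))

# Mean of the per-base signal inside each interval
perBaseMean <- viewMeans(v)
intervals$perBaseMean <- perBaseMean
```

For the first interval, [20, 29], six of the ten bases carry the score of the second signal range and the other four are zero, so the per-base mean is 0.6 times that score, whereas the aggregate() approach returns the score itself.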