Using GRanges

Date: 2016-05-30 14:15:17

Tags: r optimization matrix bioinformatics bioconductor

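I have a list of genomic windows (see the example below), and I am trying to determine which positions of each window overlap a series of ChIP-Seq peaks from different samples. My goal is to assess where the peaks fall across all of these samples, so for each window I build a matrix of 1s and 0s in which the rows are samples and the columns are genomic positions. The matrix therefore has as many columns as there are positions in the window.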

GRanges object with 6 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr22 [16056636, 16057635]      *
  [2]    chr22 [16847853, 16848852]      *
  [3]    chr22 [16848853, 16849852]      *
  [4]    chr22 [16849853, 16850852]      *
  [5]    chr22 [16850853, 16851852]      *
  [6]    chr22 [16851853, 16852852]      *

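So on one hand I have a GRanges object, grwindows, and on the other a GRangesList, grlist, that holds one GRanges object per sample. At the moment I build these matrices window by window in a very inefficient way that involves several nested loops: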

for (window in seq_len(nwin)) { # nwin: number of windows
    grwin <- grwindows[window]
    # wsize: window size; rows are named so they can be indexed by sample
    matpeaks <- matrix(0,nrow=length(samples),ncol=wsize,dimnames=list(samples,NULL))
    for (sample in samples) { # We add every sample to the matrix
      newrow <- numeric(wsize)
      grpeaks <- grlist[[sample]] # grlist: GRangesList object
      overlap <- findOverlaps(query=grwin,subject=grpeaks)
      # Ranges of the overlap; newer IRanges releases spell this
      # overlapsRanges(ranges(grwin),ranges(grpeaks),overlap)
      spans <- ranges(overlap,ranges(grwin),ranges(grpeaks))
      for (i in seq_along(spans)) { # Peaks overlapping that window
         # +1 turns the offset from the window start into a 1-based column index
         newrow[(start(spans[i])-start(grwin)+1):(end(spans[i])-start(grwin)+1)] <- 1
      }
      matpeaks[sample,] <- newrow
    }
 }

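I am not very familiar with GRanges and have little experience programming in R, so I am hoping you can help me optimize this code. I came up with something that shortens the execution time slightly, but it still has too many for loops and the improvement is very limited: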

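for (window in seq_len(nwin)) {
  grwin <- grwindows[window]
  # Fresh matrix for this window (assumed to have the same shape as in the first version)
  matpeaks <- matrix(0,nrow=length(grlist),ncol=wsize)
  # Ranges of the overlap between this window and one sample's peaks
  fun <- function(x) {
    overlap <- findOverlaps(query=grwin,subject=x)
    ranges(overlap,ranges(grwin),ranges(x))
  }
  spans <- lapply(grlist,fun)
  for (sam in seq_len(nrow(matpeaks))) {
    for (i in seq_along(spans[[sam]])) {
      # +1 turns the offset from the window start into a 1-based column index
      matpeaks[sam,(start(spans[[sam]][i])-start(grwin)+1):(end(spans[[sam]][i])-start(grwin)+1)] <- 1
    }
  }
}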
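
One direction that might cut down the looping, given here only as a rough sketch and not as a tested solution: flatten the GRangesList into a single GRanges, call findOverlaps once for all windows and all samples, and clip each overlapping peak to its window with plain integer arithmetic. The toy windows and peaks, and the names peaks, peak_sample, from, to and matlist, are made up for illustration. The sketch assumes the windows and the peaks use the same chromosome names.

```r
library(GenomicRanges)

# Toy data (hypothetical): two 10-bp windows and two samples of peaks.
grwindows <- GRanges("chr22", IRanges(start = c(101, 201), width = 10))
grlist <- GRangesList(
  s1 = GRanges("chr22", IRanges(start = c(95, 205), end = c(103, 207))),
  s2 = GRanges("chr22", IRanges(start = 150, end = 160))
)

# Flatten all samples into one GRanges, remembering each peak's sample index.
peaks <- unlist(grlist, use.names = FALSE)
peak_sample <- rep(seq_along(grlist), lengths(grlist))

# A single overlap query covers every window and every sample.
hits <- findOverlaps(grwindows, peaks)
win <- queryHits(hits)
pk <- subjectHits(hits)

# Clip each overlapping peak to its window; +1 gives 1-based column indices.
from <- pmax(start(peaks)[pk], start(grwindows)[win]) - start(grwindows)[win] + 1L
to <- pmin(end(peaks)[pk], end(grwindows)[win]) - start(grwindows)[win] + 1L

# One samples-by-positions matrix per window; only actual overlaps are visited.
matlist <- lapply(seq_along(grwindows), function(w) {
  mat <- matrix(0L, nrow = length(grlist), ncol = width(grwindows)[w],
                dimnames = list(names(grlist), NULL))
  for (k in which(win == w)) {
    mat[peak_sample[pk[k]], from[k]:to[k]] <- 1L
  }
  mat
})
```

With the toy data, matlist[[1]]["s1", ] has 1s in columns 1 to 3 and matlist[[2]]["s1", ] has 1s in columns 5 to 7, while the s2 rows stay at 0. There is still a loop, but it runs only over peaks that actually overlap a window. Whether that is fast enough for real data is something I have not measured.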

Thank you for your attention; I hope I have made myself clear.

0 answers:

No answers