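I have a list of genomic windows (see the example below), and I am trying to determine which positions in each window overlap a set of ChIP-seq peaks from different samples. My goal is to assess where the peaks fall across all these samples, so for each window I build a matrix of 1s and 0s in which rows are samples and columns are genomic positions. Each matrix therefore has as many columns as there are positions in the window.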
GRanges object with 6 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr22 [16056636, 16057635]      *
  [2]    chr22 [16847853, 16848852]      *
  [3]    chr22 [16848853, 16849852]      *
  [4]    chr22 [16849853, 16850852]      *
  [5]    chr22 [16850853, 16851852]      *
  [6]    chr22 [16851853, 16852852]      *
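So on the one hand I have a GRanges object, grwindows, and on the other a GRangesList (grlist in the code) containing one GRanges object per sample. At the moment I generate these matrices window by window in a very inefficient way that involves several nested loops.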
for (window in 1:nwin) { # nwin: number of windows
  matpeaks <- matrix(0, nrow = length(samples), ncol = wsize) # wsize: window size
  for (sample in samples) { # add one row per sample
    newrow <- numeric(wsize)
    grwin <- grwindows[window]
    grpeaks <- grlist[[sample]] # peaks of this sample (one element of the GRangesList)
    overlap <- findOverlaps(query = grwin, subject = grpeaks)
    spans <- ranges(overlap, ranges(grwin), ranges(grpeaks)) # ranges of the overlaps
    if (length(spans) > 0) {
      for (i in 1:length(spans)) { # peaks overlapping this window
        # +1 turns the 0-based offset from the window start into R's 1-based index
        newrow[(start(spans[i]) - start(grwin) + 1):(end(spans[i]) - start(grwin) + 1)] <- 1
      }
    }
    matpeaks[sample, ] <- newrow
  }
}
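To make the expected result concrete, here is a minimal toy case. The coordinates, the sample names A and B, and the helper object positions are made up for illustration. It checks every position of a single 10-bp window with overlapsAny, which would be far too slow on real data but shows the matrix I am after.

```r
library(GenomicRanges)

# Toy inputs (made-up coordinates): one 10-bp window, two samples
grwin <- GRanges("chr22", IRanges(start = 101, end = 110))
grlist <- GRangesList(
  A = GRanges("chr22", IRanges(start = 95, end = 103)),
  B = GRanges("chr22", IRanges(start = c(105, 120), end = c(106, 130)))
)

# One width-1 range per position of the window (hypothetical helper object)
positions <- GRanges("chr22", IRanges(start = start(grwin):end(grwin), width = 1))

# Row i flags the window positions covered by at least one peak of sample i
rows <- lapply(seq_along(grlist), function(i) {
  as.integer(overlapsAny(positions, grlist[[i]]))
})
matpeaks <- do.call(rbind, rows)
rownames(matpeaks) <- names(grlist)
print(matpeaks)
```

In this toy case row A should have 1s in columns 1 to 3 and row B in columns 5 and 6, with 0s everywhere else.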
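I am not very familiar with GRanges and not experienced in R programming, so I hope you can help me optimize this code. I came up with something that cuts the execution time slightly, but there are still too many for loops and the improvement is very limited: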
for (window in 1:nwin) {
  grwin <- grwindows[window]
  fun <- function(x) {
    overlap <- findOverlaps(query = grwin, subject = x)
    ranges(overlap, ranges(grwin), ranges(x)) # ranges of the overlaps
  }
  spans <- lapply(grlist, fun) # one set of overlap ranges per sample
  for (sam in 1:nrow(matpeaks)) {
    if (length(spans[[sam]]) > 0) {
      for (i in 1:length(spans[[sam]])) {
        # +1 turns the 0-based offset from the window start into R's 1-based index
        matpeaks[sam, (start(spans[[sam]][i]) - start(grwin) + 1):(end(spans[[sam]][i]) - start(grwin) + 1)] <- 1
      }
    }
  }
}
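One direction that might remove the inner loops, sketched on made-up data: call findOverlaps once per sample against all windows, clip each hit to its window with pmax and pmin, and fill a samples x positions x windows array by matrix indexing. The array arr and the names first, last, win_idx and pos_idx are hypothetical, and the sketch assumes every window has width wsize. I am not sure this is the idiomatic GRanges way.

```r
library(GenomicRanges)

# Toy inputs (made-up coordinates): two 10-bp windows, two samples
grwindows <- GRanges("chr22", IRanges(start = c(101, 111), width = 10))
grlist <- GRangesList(
  A = GRanges("chr22", IRanges(start = c(95, 108), end = c(103, 112))),
  B = GRanges("chr22", IRanges(start = 105, end = 106))
)
wsize <- 10L
nwin <- length(grwindows)

# samples x positions x windows; arr[, , k] plays the role of matpeaks for window k
arr <- array(0L, dim = c(length(grlist), wsize, nwin))

for (s in seq_along(grlist)) {
  grpeaks <- grlist[[s]]
  hits <- findOverlaps(grwindows, grpeaks) # all windows against this sample at once
  w <- queryHits(hits)   # window index of each hit
  p <- subjectHits(hits) # peak index of each hit

  # Clip each peak to its window and convert to 1-based offsets inside the window
  first <- pmax(start(grpeaks)[p], start(grwindows)[w]) - start(grwindows)[w] + 1L
  last <- pmin(end(grpeaks)[p], end(grwindows)[w]) - start(grwindows)[w] + 1L
  len <- last - first + 1L

  # Expand every clipped interval into explicit (window, position) pairs
  win_idx <- rep(w, len)
  pos_idx <- sequence(len) + rep(first - 1L, len)

  # Set all covered cells of this sample in one assignment
  arr[cbind(rep(s, length(win_idx)), pos_idx, win_idx)] <- 1L
}

print(arr[, , 1]) # matrix for the first window
```

This still loops over samples, but no longer over windows or over individual overlaps. The whole array has to fit in memory, so with many windows it would probably need to be processed in chunks.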
Thank you for your attention. I hope I have made myself clear.