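I'm working on a project where I iterate over regions of the mouse genome and compute some overlaps using GenomicRanges / rtracklayer. The bins have already been computed in 150 base/position increments. For those unfamiliar, this means that chromosome Y, which happens to be one of the shortest mouse chromosomes, has approximately 106,000 regions, while the larger chromosomes contain roughly 1,300,000 regions!

The goal here is to iterate over the bins, extend each bin by 100,000 positions in both directions, and then find which genes overlap these windows. I want to compute the distance from the center of the bin to the start of each gene contained in the extended bin window.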
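To make the arithmetic concrete, here is a minimal worked example in base R for a single bin. All coordinates are made up for illustration (they are not taken from the data set), and the bin center uses the floor((width - 1) / 2) convention, computed here from the bin itself.

```r
# Toy numbers, chosen only for illustration.
bin_start <- 250001L
bin_width <- 150L
bin_end   <- bin_start + bin_width - 1L
flank     <- 100000L

# Center of the bin, using floor((width - 1) / 2) as the offset from the start.
bin_center <- bin_start + (bin_width - 1L) %/% 2L

# Extended window, clipped so it cannot start before position 1.
win_start <- max(1L, bin_start - flank)
win_end   <- bin_end + flank

# A hypothetical gene on the '+' strand.
gene_start <- 300000L
gene_end   <- 320000L

# The gene overlaps the window if the two intervals intersect.
gene_in_window <- gene_start <= win_end && gene_end >= win_start

# Signed distance from the bin center to the gene start.
distance <- gene_start - bin_center
```

With these numbers the window spans 150001 to 350150, the gene falls inside it, and the distance from the bin center (250075) to the gene start is 49925.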
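OK, so here is the code I've written up to this point. It runs without errors and computes exactly what I need. The problem is that it is slow and takes forever to iterate over 1M+ bins.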
progress <- function(w) {
  # just a function that prints out the window being processed
  cat(sprintf(paste0(as.character(Sys.time()), ": Window ", w, " completed!\n")))
}

extend <- function(x, upstream=0, downstream=0) {
  # this will expand a `GenomicRanges` object range
  if (any(strand(x) == "*"))
    warning("'*' ranges were treated as '+'")
  on_plus <- strand(x) == "+" | strand(x) == "*"
  new_start <- start(x) - ifelse(on_plus, upstream, downstream)
  new_end <- end(x) + ifelse(on_plus, downstream, upstream)
  ranges(x) <- IRanges(new_start, new_end)
  trim(x)
}

feature.overlap <- function(x, window, genes, extend.upstream=100000, extend.downstream=100000) {
  # # test case
  # x = chrY; window = 2668 ; genes = gene; extend.upstream = 100000 ; extend.downstream = 100000

  # extend window of signal in both directions
  x.window = extend(x[window], extend.upstream, extend.downstream)
  names(x.window) <- window

  # compute signal window overlap with genes
  overlaps <- subsetByOverlaps(genes, x.window)

  if(length(overlaps) == 0){
    # no gene overlaps this window: return one row with NA gene fields
    values <- data.frame(signal_window=names(x.window),
                         signal_start=max(0, start(x.window)),
                         signal_center=max(0, start(x.window)) + floor((width(x.window) - 1)/2),
                         signal_end=end(x.window),
                         signal_score=x.window$score,
                         symbol=NA,
                         gene_id=NA,
                         gene_chr=NA,
                         gene_start=NA,
                         gene_end=NA,
                         gene_strand=NA)
  } else {
    # one row per overlapping gene
    hits <- findOverlaps(x.window, genes)
    s.idx <- unique(subjectHits(hits))
    q.idx <- unique(queryHits(hits))
    values <- data.frame(signal_window=names(x.window)[q.idx],
                         signal_start=max(0, start(x.window)[q.idx]),
                         signal_center=max(0, start(x.window)[q.idx]) + floor((width(x.window)[q.idx] - 1)/2),
                         signal_end=end(x.window)[q.idx],
                         signal_score=x.window$score[q.idx],
                         mcols(overlaps)[,c(2,1)],
                         gene_chr=chrom(genes)[s.idx],
                         # strand-aware gene start: start() on '+', end() otherwise
                         gene_start=ifelse(strand(genes)[s.idx] == '+', start(genes)[s.idx], end(genes)[s.idx]),
                         gene_end=end(genes)[s.idx],
                         gene_strand=strand(genes)[s.idx])
  }
  return(values)
}

# Import data
library(rtracklayer)
merged_wig <- import.wig('~/file/linked/below.wig', format='wig', genome='mm9')
merged_wig <- keepSeqlevels(merged_wig, paste0('chr', c(seq(1,19), 'X', 'Y')))
chrY <- merged_wig[seqnames(merged_wig) == 'chrY']
# Generate gene info needed for computing overlap
library(TxDb.Mmusculus.UCSC.mm9.knownGene); library(Mus.musculus)
gene <- genes(TxDb.Mmusculus.UCSC.mm9.knownGene)
values(gene) <- merge(values(gene), as.data.frame(org.Mm.egSYMBOL), by='gene_id', all.x=T)
gene <- keepSeqlevels(gene, paste0('chr', c(seq(1,19), 'X', 'Y')))
# BEGIN LOOP GENOME WINDOWS *** TIME CONSUMING ***
window.overlaps <- list()
ptm <- proc.time()
for(i in 1:100) { # ideally 1:length(chrY) but this takes very long so I've only posted a few windows
  result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
  window.overlaps[[i]] <- result
  progress(i)
}
proc.time() - ptm
all.overlaps = do.call(rbind, window.overlaps)
The code above will run using this file (88 MB).
I tried to speed up the loop using the foreach / doParallel libraries:
library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls<-foreach(i = 1:100, chrY=chrY, gene=gene, .packages=c('rtracklayer', 'GenomicRanges')) %dopar% {
  result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
  progress(i)
  result
}
proc.time() - ptm
stopCluster(cl)
However, this code does not work. The error returned is "Error: this S4 class is not subsettable", and progress() prints no output. (Bug fixed, see the edit.)
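A likely cause, offered as an assumption rather than a confirmed diagnosis: foreach treats every named argument that is not one of its dot-prefixed options as a variable to iterate over, so chrY=chrY and gene=gene ask it to step through the GRanges objects element by element instead of passing them along whole. The small sketch below uses toy vectors and %do% (no cluster needed) to show that behavior; the names paired, shared and doubled are made up for the example.

```r
library(foreach)

# Named arguments are iterated in lockstep: i and x advance together.
paired <- foreach(i = 1:3, x = c("a", "b", "c"), .combine = c) %do% {
  paste0(i, x)
}

# An object that should be used whole is simply referenced inside the body,
# not listed as a foreach() argument.
shared <- c(10, 20, 30)
doubled <- foreach(i = 1:3, .combine = c) %do% {
  shared[i] * 2
}
```

Here paired is c("1a", "2b", "3c") and doubled is c(20, 40, 60).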
Again, the goal here is to write this in a more efficient way. Once I have values, I can easily compute the metrics I need.
Any help would be greatly appreciated! Thanks!
EDIT: I implemented a working foreach loop with %dopar%, but it appears to be slower than the implementation above.
library(foreach)
library(doParallel)
cl<-makeCluster(8)
registerDoParallel(cl)
ptm <- proc.time()
ls <- foreach(i = 1:100, .combine='rbind', .packages=c('rtracklayer', 'GenomicRanges')) %dopar% {
  # feature.overlap() as defined above takes no `counts` argument, so it is not passed here
  result = feature.overlap(chrY, i, gene, extend.upstream=100000, extend.downstream=100000)
  progress(i)
  result
}
proc.time() - ptm
stopCluster(cl)
For 100 windows this takes approximately 10 seconds, whereas the same windows processed with the for loop above take 6 seconds.
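One direction that might avoid the per-window loop altogether: findOverlaps() accepts all windows in a single call, so the extension and the distance can be computed for every bin at once. The following is only a sketch under stated assumptions, not a benchmarked solution. The toy bins and genes objects and their coordinates are hypothetical stand-ins for chrY and gene, and unlike feature.overlap() it returns no NA rows for windows without genes.

```r
library(GenomicRanges)

# Hypothetical stand-ins for chrY (bins with a score) and gene.
bins <- GRanges("chrY",
                IRanges(start = c(200001, 200151, 900001), width = 150),
                score = c(1.5, 2.0, 0.5))
genes <- GRanges("chrY",
                 IRanges(start = c(250000, 280000), end = c(260000, 300000)),
                 strand = c("+", "-"),
                 symbol = c("GeneA", "GeneB"))

flank <- 100000L

# Extend every bin in both directions at once, clipping at position 1.
windows <- bins
start(windows) <- pmax(1L, start(bins) - flank)
end(windows) <- end(bins) + flank

# One overlap query for all windows instead of one per window.
hits <- findOverlaps(windows, genes)
q <- queryHits(hits)
s <- subjectHits(hits)

# Bin centers and strand-aware gene starts, computed once as plain vectors.
bin_center <- start(bins) + (width(bins) - 1L) %/% 2L
tss <- ifelse(as.character(strand(genes)) == "+", start(genes), end(genes))

# One row per (window, gene) pair; windows without genes are omitted.
result <- data.frame(signal_window = q,
                     signal_center = bin_center[q],
                     signal_score  = bins$score[q],
                     symbol        = genes$symbol[s],
                     gene_start    = tss[s],
                     distance      = tss[s] - bin_center[q])
```

In this toy data the first two bins each overlap both genes and the third overlaps none, so result has four rows. Whether this scales to 1M+ bins on the real data is something I have not measured.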