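I have 2 tables. Both have the form chromosome, start coordinate, end coordinate. The first table contains genes; the second contains short sequences that may or may not fall within those genes. In my real dataset the gene table has about 50,000 rows and the sequence table about 7,000,000 rows, and both tables have various additional columns. I want to find the overlaps between the two tables.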
chromosome=as.character(rep(c(1,2,3,4,5), each=10000))
start=floor(runif(50000, min=0, max=50000000))
end=start+floor(runif(10000, min=0, max=10000))
genes=cbind(chromosome, start, end)
startseq=floor(runif(7000000, min=0, max=50000000))
endseq=startseq+4
sequences=cbind(chromosome, startseq, endseq)
I tried to find all intersections with the following:
for (g in 1:nrow(sequences)) {
seqrow=as.vector(sequences[g,])
rownr=which(genes[,1]==seqrow[1] & genes[,2] < seqrow[2] & genes[,3] > seqrow[3])
print(rownr)
}
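I intend to use these row numbers to perform operations on the extra columns in my real dataset. The problem is that the procedure described is quite slow. In what ways can I speed up this intersection?
Answer 0: (score: 1)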
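You want Bioconductor for this task, specifically the GenomicRanges package. The findOverlaps call in the code below returns an object of class "Hits", which contains the indices of the overlapping ranges. You could also use the intersect function, but that returns the intersecting intervals rather than the ids of the intersecting sequences. In short, Bioconductor and GenomicRanges provide many useful set functions, and they are very fast.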
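To make the difference between the two functions concrete, here is a minimal sketch on made-up coordinates (the toy ranges and the variable names `toy_genes`, `toy_seqs`, `hits` and `shared` are illustrative assumptions, not part of the original data). `findOverlaps` reports which rows overlap, while `intersect` reports the overlapping interval itself.

```r
library(GenomicRanges)

# Two toy genes and three toy sequences; all coordinates are invented.
toy_genes <- GRanges(seqnames = c("1", "2"),
                     ranges = IRanges(start = c(100, 500), end = c(200, 600)))
toy_seqs <- GRanges(seqnames = c("1", "1", "2"),
                    ranges = IRanges(start = c(150, 300, 150),
                                     end = c(154, 304, 154)))

# Hits object: pairs of (query index, subject index) that overlap.
hits <- findOverlaps(toy_seqs, toy_genes)
queryHits(hits)    # row of toy_seqs involved in each overlap
subjectHits(hits)  # row of toy_genes involved in each overlap

# GRanges object: the shared interval, with no record of which rows produced it.
shared <- intersect(toy_seqs, toy_genes)
shared
```

Only the first toy sequence lies on the same chromosome and within the span of a gene, so there is a single hit, and the intersection is that sequence's own interval.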
## biocLite() has been replaced by BiocManager in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")
library(GenomicRanges)
set.seed(8675309)
# Genes: 50,000 ranges, 10,000 per chromosome
gene_chr <- as.character(rep(1:5, each = 10000))
gene_start <- floor(runif(50000, min = 0, max = 50000000))
gene_end <- gene_start + floor(runif(50000, min = 0, max = 10000))
# Short sequences: 7,000,000 ranges on randomly chosen chromosomes
seq_chr <- as.character(sample(1:5, size = 7000000, replace = TRUE))
seq_start <- floor(runif(7000000, min = 0, max = 50000000))
seq_end <- seq_start + 4
# Each GRanges gets its own chromosome vector, matching its ranges in length
genes <- GRanges(seqnames = gene_chr, ranges = IRanges(start = gene_start, end = gene_end))
seqs <- GRanges(seqnames = seq_chr, ranges = IRanges(start = seq_start, end = seq_end))
x <- findOverlaps(seqs, genes)
head(x)
# Example output (exact indices depend on the random data):
# Hits object with 6 hits and 0 metadata columns:
#       queryHits subjectHits
#       <integer>   <integer>
#   [1]         2       41673
#   [2]         2       47476
#   [3]         3       20048
#   [4]         4        9624
#   [5]         4        5662
#   [6]         4        1531
#   -------
#   queryLength: 7000000
#   subjectLength: 50000
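The question mentions using the matching row numbers to work on extra columns. A minimal sketch of that step follows, assuming the tables are data frames with hypothetical id columns `gene_id` and `seq_id` (neither name appears in the original data). It also shows `type = "within"`, which keeps only sequences lying entirely inside a gene. That is close to the condition in the question's loop, though "within" also accepts ranges that share an endpoint, whereas the loop uses strict inequalities.

```r
library(GenomicRanges)

# Toy tables; coordinates and the id columns are invented for illustration.
genes_df <- data.frame(chromosome = c("1", "1", "2"),
                       start = c(100, 400, 100),
                       end = c(200, 500, 200),
                       gene_id = c("geneA", "geneB", "geneC"),
                       stringsAsFactors = FALSE)
seqs_df <- data.frame(chromosome = c("1", "1", "2", "2"),
                      start = c(150, 198, 120, 300),
                      end = c(154, 202, 124, 304),
                      seq_id = c("s1", "s2", "s3", "s4"),
                      stringsAsFactors = FALSE)

genes_gr <- GRanges(genes_df$chromosome, IRanges(genes_df$start, genes_df$end))
seqs_gr <- GRanges(seqs_df$chromosome, IRanges(seqs_df$start, seqs_df$end))

# Default: any overlap at all (s2 straddles the end of geneA and still counts).
any_hits <- findOverlaps(seqs_gr, genes_gr)

# type = "within": the query (sequence) must lie entirely inside the subject (gene).
within_hits <- findOverlaps(seqs_gr, genes_gr, type = "within")

# The hit indices are row numbers into the original tables, so they can be
# used directly to pull the extra columns side by side.
pairs <- data.frame(seq_id = seqs_df$seq_id[queryHits(within_hits)],
                    gene_id = genes_df$gene_id[subjectHits(within_hits)],
                    stringsAsFactors = FALSE)
pairs
```

Because the `GRanges` objects are built in the same row order as the data frames, `queryHits` and `subjectHits` index straight back into them. If the tables were reordered or filtered first, the indices would need to be mapped accordingly.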