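I have a large data frame from which I want to build a special matrix. But when I try to create this matrix, my code becomes very slow after about 500 iterations of the for loop. I use print(i) to track the script's progress. It starts out very fast but gets slower and slower. If it keeps running like this, I think I would have to wait months!
Can anyone help me speed up this code or make it run in parallel?
Here are my code and the steps:
A) Information about the sizes and lengths of the variables: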
> dim(tars.hsa)
[1] 737857 2
> length(ind.mirs)
[1] 737857
> length(lclus)
[1] 737857
A look at the entries of lclus:
[8977] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[99001] 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[99025] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[99049] 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[99073] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[99097] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5
[99121] 5 5 5 5 5 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 2 2 2
> tars.hsa[1:10,]
miR.Family Gene.Symbol
2 miR-23abc/miR-23b-3p A1BG
19 miR-218/miR-218a A1BG
121 miR-376abd/miR-376b-3p A1CF
130 miR-376abd/miR-376b-3p A1CF
143 miR-376abd/miR-376b-3p A1CF
147 miR-376abd/miR-376b-3p A1CF
153 miR-376abd/miR-376b-3p A1CF
157 miR-376abd/miR-376b-3p A1CF
201 miR-433 A1CF
215 miR-433 A1CF
>
Here is my code.
Step 1:
# separate out individual miRs of one cluster
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")
lclus <- unlist(lapply(ind.mirs, length))
Step 2 (the slowest step, after a few hundred iterations):
new.tars <- NULL
for( i in 1:length(ind.mirs)){
add.tar <- matrix(data=rep(tars.hsa[i,], times=lclus[i]), ncol=2, byrow=T)
add.tar[,1] <- ind.mirs[[i]] # substitute miR cluster w/ individual miR
new.tars <- rbind(new.tars, add.tar)
print(i)
}
Answer 0 (score: 2)
To avoid the loop in step 2, you can use vectorized functions. Here is an example.
# Read data.
tars.hsa<-read.table(textConnection('miR.Family Gene.Symbol
2 miR-23abc/miR-23b-3p A1BG
19 miR-218/miR-218a A1BG
121 miR-376abd/miR-376b-3p A1CF
130 miR-376abd/miR-376b-3p A1CF
143 miR-376abd/miR-376b-3p A1CF
147 miR-376abd/miR-376b-3p A1CF
153 miR-376abd/miR-376b-3p A1CF
157 miR-376abd/miR-376b-3p A1CF
201 miR-433 A1CF
215 miR-433 A1CF'),stringsAsFactors=FALSE)
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")
# lclus <- unlist(lapply(ind.mirs, length))
lclus <- sapply(ind.mirs, length)
# A single vectorized statement.
new.tars<-cbind(unlist(ind.mirs),
rep(tars.hsa$Gene.Symbol,times=lclus))
# [,1] [,2]
# [1,] "miR-23abc" "A1BG"
# [2,] "miR-23b-3p" "A1BG"
# [3,] "miR-218" "A1BG"
# [4,] "miR-218a" "A1BG"
# [5,] "miR-376abd" "A1CF"
# [6,] "miR-376b-3p" "A1CF"
# [7,] "miR-376abd" "A1CF"
# [8,] "miR-376b-3p" "A1CF"
# [9,] "miR-376abd" "A1CF"
# [10,] "miR-376b-3p" "A1CF"
# [11,] "miR-376abd" "A1CF"
# [12,] "miR-376b-3p" "A1CF"
# [13,] "miR-376abd" "A1CF"
# [14,] "miR-376b-3p" "A1CF"
# [15,] "miR-376abd" "A1CF"
# [16,] "miR-376b-3p" "A1CF"
# [17,] "miR-433" "A1CF"
# [18,] "miR-433" "A1CF"
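As a quick sanity check, the vectorized statement above can be wrapped in a small helper and run on a toy data frame. This is only a sketch: the helper name expand_mirs and the toy data are assumptions made for illustration and do not come from the original answer.

```r
# Hypothetical helper (not from the original answer): expand each
# "/"-separated miR family into one row per individual miR.
expand_mirs <- function(df) {
  ind.mirs <- strsplit(df[, "miR.Family"], split = "/")
  lclus <- lengths(ind.mirs)  # number of individual miRs in each row
  # Repeat every gene symbol once per individual miR of its row.
  cbind(unlist(ind.mirs), rep(df$Gene.Symbol, times = lclus))
}

# Toy input, assumed for illustration only.
toy <- data.frame(miR.Family = c("miR-218/miR-218a", "miR-433"),
                  Gene.Symbol = c("A1BG", "A1CF"),
                  stringsAsFactors = FALSE)
res <- expand_mirs(toy)
print(res)
```

The result is a character matrix with one row per individual miR, so the two-member family contributes two rows and the single-member family contributes one.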
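Answer 1 (score: 2)
I'm sure someone can improve on the following, but it is best to avoid loops where possible and use built-in functions instead. I would suggest something like: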
#Make up some simple data to run it on
tars.hsa <- data.frame(miR.Family=c("a","b/b","c/c","d/d/d"), Gene.Symbol=paste0("A",1:4,"BG"),stringsAsFactors=FALSE)
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")
lclus <- sapply(ind.mirs, length)
new.tars <- matrix(NA,sum(lclus),2)
new.tars[,1] <- do.call(c,ind.mirs)
new.tars[,2] <- rep(tars.hsa$Gene.Symbol, times=lclus) # times (not each) repeats element i lclus[i] times
Obviously I can't test how fast this is on your data, so please let us know how it works.
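Since the real data is not available here, one way to get a rough feel for the speed is to time the vectorized version on synthetic data. The sketch below is an assumption-laden stand-in: the row count, the made-up miR names, and the variable names sim, fam and k are all invented for illustration, and timings will differ on the real 737857-row data frame.

```r
# Synthetic stand-in for the real data (an assumption): n rows, each with
# one to three "/"-separated family members.
set.seed(1)
n <- 20000
k <- sample(1:3, n, replace = TRUE)
fam <- vapply(k, function(m) {
  paste(paste0("miR-", sample(500, m)), collapse = "/")
}, character(1))
sim <- data.frame(miR.Family = fam,
                  Gene.Symbol = paste0("G", seq_len(n)),
                  stringsAsFactors = FALSE)

# Time the vectorized expansion; no loop and no repeated rbind.
timing <- system.time({
  ind.mirs <- strsplit(sim[, "miR.Family"], split = "/")
  lclus <- lengths(ind.mirs)
  new.tars <- cbind(unlist(ind.mirs), rep(sim$Gene.Symbol, times = lclus))
})
print(timing)
```

Increasing n toward the size of the real data gives an idea of how the run time scales.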
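Answer 2 (score: 1)
The problem is rbind, because it reallocates the whole of new.tars and copies it to a new location at every step of the loop. To speed things up, preallocate the matrix beforehand and fill it in inside the loop. Something like: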
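new.tars <- matrix(NA_character_, nrow=sum(lclus), ncol=2) # one row per individual miR
end <- cumsum(lclus)     # last output row belonging to input row i
start <- end - lclus + 1 # first output row belonging to input row i
for( i in seq_along(ind.mirs)){
...
new.tars[start[i]:end[i], ] <- add.tar
print(i)
}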
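A complete version of the preallocation idea might look like the following. It is a minimal sketch on toy data, assuming every miR.Family entry is non-empty (so each input row yields at least one output row); the start and end index vectors are names introduced here for illustration.

```r
# Toy input (an assumption for illustration).
tars.hsa <- data.frame(miR.Family = c("miR-23abc/miR-23b-3p", "miR-433"),
                       Gene.Symbol = c("A1BG", "A1CF"),
                       stringsAsFactors = FALSE)
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split = "/")
lclus <- lengths(ind.mirs)

# Allocate the full result once; input row i owns output rows start[i]:end[i].
new.tars <- matrix(NA_character_, nrow = sum(lclus), ncol = 2)
end <- cumsum(lclus)
start <- end - lclus + 1
for (i in seq_along(ind.mirs)) {
  rows <- start[i]:end[i]
  new.tars[rows, 1] <- ind.mirs[[i]]            # individual miRs
  new.tars[rows, 2] <- tars.hsa$Gene.Symbol[i]  # gene symbol, recycled
}
print(new.tars)
```

Because the matrix is allocated once, each iteration writes into existing rows rather than copying everything accumulated so far, which is what makes the rbind version slow down as it goes.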
Answer 3 (score: 1)
Another approach:
tars.hsa <- data.frame(
miR.Family=c('miR-23abc/miR-23b-3p', 'miR-218/miR-218a', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-433', 'miR-433'),
Gene.Symbol=c('A1BG','A1BG','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF')
)
do.call(rbind, apply(tars.hsa, 1, function(row) {
data.frame(stringsAsFactors=FALSE, row.names=NULL,
miR.Family=strsplit(row['miR.Family'], '/'),
Gene.Symbol=row['Gene.Symbol']
)
}))
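If a data frame with the original column names is wanted, the same expansion can also be done without building one data frame per row, by repeating row indices and then overwriting the family column. This is a base R sketch offered as a possible alternative, not the method of the answer above; the variable name out and the toy data are assumptions.

```r
# Toy input (an assumption for illustration).
tars.hsa <- data.frame(miR.Family = c("miR-23abc/miR-23b-3p", "miR-433"),
                       Gene.Symbol = c("A1BG", "A1CF"),
                       stringsAsFactors = FALSE)
ind.mirs <- strsplit(tars.hsa$miR.Family, "/")

# Repeat each row once per individual miR, then replace the cluster
# string with the individual miR names.
out <- tars.hsa[rep(seq_len(nrow(tars.hsa)), times = lengths(ind.mirs)), ]
out$miR.Family <- unlist(ind.mirs)
rownames(out) <- NULL  # drop the duplicated-row names such as "1.1"
print(out)
```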