My code becomes very slow when I try to build a big matrix. Why?

Asked: 2014-05-01 15:29:53

Tags: r

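I have a big data frame from which I want to build a special matrix. But when I try to create this matrix, my code becomes very slow after about 500 iterations of the for loop. I use print(i) to track my script's progress. It starts out very fast but gets slower and slower. If it keeps running like this, I think I will have to wait for months!

Could anyone help me speed up this code or make it run in parallel?

Here are my code and steps:

A) Information about the size and length of the variables: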

> dim(tars.hsa)
[1] 737857      2
> length(ind.mirs)
[1] 737857
> length(lclus)
[1] 737857

To get an idea of the entries of lclus:

    [98977]  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
    [99001]  2  2  1  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    [99025]  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
    [99049]  2  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
    [99073]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
    [99097]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  5  5  5
    [99121]  5  5  5  5  5  5  5  5  5  5  3  5  5  5  5  5  5  5  5  5  5  2  2  2


> tars.hsa[1:10,]
                miR.Family Gene.Symbol
2     miR-23abc/miR-23b-3p        A1BG
19        miR-218/miR-218a        A1BG
121 miR-376abd/miR-376b-3p        A1CF
130 miR-376abd/miR-376b-3p        A1CF
143 miR-376abd/miR-376b-3p        A1CF
147 miR-376abd/miR-376b-3p        A1CF
153 miR-376abd/miR-376b-3p        A1CF
157 miR-376abd/miR-376b-3p        A1CF
201                miR-433        A1CF
215                miR-433        A1CF
> 

Here is my code:

Step 1:

# separate out individual miRs of one cluster
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")
lclus    <- unlist(lapply(ind.mirs, length))

Step 2 (the slowest step, after a few iterations):

new.tars <- NULL
for( i in 1:length(ind.mirs)){
  add.tar <- matrix(data=rep(tars.hsa[i,], times=lclus[i]), ncol=2, byrow=T)
  add.tar[,1] <-  ind.mirs[[i]]  # substitute miR cluster w/ individual miR
  new.tars <- rbind(new.tars, add.tar)
  print(i)
}

4 answers:

Answer 0 (score: 2)

To avoid the loop in the second step, you can use vectorized functions. Here is an example.

# Read data.
tars.hsa<-read.table(textConnection('miR.Family Gene.Symbol
2     miR-23abc/miR-23b-3p        A1BG
19        miR-218/miR-218a        A1BG
121 miR-376abd/miR-376b-3p        A1CF
130 miR-376abd/miR-376b-3p        A1CF
143 miR-376abd/miR-376b-3p        A1CF
147 miR-376abd/miR-376b-3p        A1CF
153 miR-376abd/miR-376b-3p        A1CF
157 miR-376abd/miR-376b-3p        A1CF
201                miR-433        A1CF
215                miR-433        A1CF'),stringsAsFactors=FALSE)


ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")
# lclus    <- unlist(lapply(ind.mirs, length))
lclus    <- sapply(ind.mirs, length)

# A single vectorized statement.
new.tars<-cbind(unlist(ind.mirs),
                rep(tars.hsa$Gene.Symbol,times=lclus))


#       [,1]          [,2]
#  [1,] "miR-23abc"   "A1BG"
#  [2,] "miR-23b-3p"  "A1BG"
#  [3,] "miR-218"     "A1BG"
#  [4,] "miR-218a"    "A1BG"
#  [5,] "miR-376abd"  "A1CF"
#  [6,] "miR-376b-3p" "A1CF"
#  [7,] "miR-376abd"  "A1CF"
#  [8,] "miR-376b-3p" "A1CF"
#  [9,] "miR-376abd"  "A1CF"
# [10,] "miR-376b-3p" "A1CF"
# [11,] "miR-376abd"  "A1CF"
# [12,] "miR-376b-3p" "A1CF"
# [13,] "miR-376abd"  "A1CF"
# [14,] "miR-376b-3p" "A1CF"
# [15,] "miR-376abd"  "A1CF"
# [16,] "miR-376b-3p" "A1CF"
# [17,] "miR-433"     "A1CF"
# [18,] "miR-433"     "A1CF"
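
As a quick sanity check, the vectorized result can be compared against a plain loop on a small sample before running it on the full data. This is only a sketch: the toy data is made up, and `expand_loop` is a hypothetical helper written for the comparison, not something from the question.

```r
# Toy data, made up for illustration only.
tars.hsa <- data.frame(
  miR.Family  = c("miR-23abc/miR-23b-3p", "miR-218/miR-218a", "miR-433"),
  Gene.Symbol = c("A1BG", "A1BG", "A1CF"),
  stringsAsFactors = FALSE
)

ind.mirs <- strsplit(tars.hsa$miR.Family, split = "/")
lclus    <- sapply(ind.mirs, length)

# Vectorized version: one row per individual miR, gene repeated per cluster size.
vec.tars <- cbind(unlist(ind.mirs), rep(tars.hsa$Gene.Symbol, times = lclus))

# Reference version (hypothetical helper): grow the result one block at a time.
expand_loop <- function(mirs, genes) {
  out <- NULL
  for (i in seq_along(mirs)) {
    out <- rbind(out, cbind(mirs[[i]], genes[i]))
  }
  out
}
loop.tars <- expand_loop(ind.mirs, tars.hsa$Gene.Symbol)

# Both approaches should give the same 5 x 2 character matrix.
stopifnot(identical(dim(vec.tars), c(5L, 2L)), identical(vec.tars, loop.tars))
```

If the check passes on a sample, the vectorized statement can replace the loop on the full data frame.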

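Answer 1 (score: 2)

I'm sure someone can improve on the following, but it is best to avoid loops where possible and use built-in functions. I would suggest something like: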

#Make up some simple data to run it on
tars.hsa <- data.frame(miR.Family=c("a","b/b","c/c","d/d/d"), Gene.Symbol=paste0("A",1:4,"BG"),stringsAsFactors=FALSE)
ind.mirs <- strsplit(tars.hsa[, "miR.Family"], split="/")

lclus    <- sapply(ind.mirs, length)
new.tars <- matrix(NA,sum(lclus),2)
new.tars[,1] <- do.call(c,ind.mirs)
# times= takes one repeat count per element; each= would only use lclus[1]
new.tars[,2] <- rep(tars.hsa$Gene.Symbol, times=lclus )

Obviously I can't test how fast this is on your data, so please let us know how it works.
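
One way to get a rough idea of the speed without the real data is to time the same statements on synthetic input. The sketch below is an assumption-laden stand-in: the row count `n`, the made-up miR names, and the gene labels are all invented, and timings will differ on the real 737857-row data frame.

```r
# Synthetic stand-in for the real data (sizes and names are assumptions).
set.seed(1)
n <- 20000
fam <- vapply(
  sample(1:3, n, replace = TRUE),
  function(k) paste(paste0("miR-", sample(1000, k)), collapse = "/"),
  character(1)
)
tars.hsa <- data.frame(miR.Family  = fam,
                       Gene.Symbol = paste0("G", seq_len(n)),
                       stringsAsFactors = FALSE)

ind.mirs <- strsplit(tars.hsa$miR.Family, split = "/")
lclus    <- sapply(ind.mirs, length)

# Time the vectorized construction of the expanded matrix.
timing <- system.time({
  new.tars <- matrix(NA_character_, sum(lclus), 2)
  new.tars[, 1] <- unlist(ind.mirs)
  new.tars[, 2] <- rep(tars.hsa$Gene.Symbol, times = lclus)
})
print(timing)
```

Increasing `n` step by step shows how the run time scales before committing to the full data set.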

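Answer 2 (score: 1)

The problem is rbind: at every step of the loop it reallocates the whole of new.tars and copies it to a new location, so the total amount of copying grows roughly quadratically with the number of rows. To speed things up, preallocate the matrix beforehand and fill it in inside the loop. Something like: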

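# Allocate the full result once: one row per individual miR, two columns.
new.tars <- matrix(NA_character_, nrow = sum(lclus), ncol = 2)
row <- 0  # number of rows filled so far
for (i in seq_along(ind.mirs)) {
  ...  # build add.tar as in the question
  new.tars[row + seq_len(lclus[i]), ] <- add.tar
  row <- row + lclus[i]
  print(i)
}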
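
For reference, a complete preallocated loop might look like the sketch below. The toy data is made up, and the `genes` and `row` variables are names introduced here for illustration; the loop fills the two columns directly rather than building add.tar first.

```r
# Toy data, made up for illustration only.
tars.hsa <- data.frame(
  miR.Family  = c("a", "b1/b2", "c1/c2/c3"),
  Gene.Symbol = c("G1", "G2", "G3"),
  stringsAsFactors = FALSE
)
ind.mirs <- strsplit(tars.hsa$miR.Family, split = "/")
lclus    <- sapply(ind.mirs, length)
genes    <- tars.hsa$Gene.Symbol

# Allocate the result once: one row per individual miR, two columns.
new.tars <- matrix(NA_character_, nrow = sum(lclus), ncol = 2)

row <- 0  # number of rows filled so far
for (i in seq_along(ind.mirs)) {
  rows <- row + seq_len(lclus[i])   # target rows for this cluster
  new.tars[rows, 1] <- ind.mirs[[i]]
  new.tars[rows, 2] <- genes[i]     # gene symbol recycled across the block
  row <- row + lclus[i]
}
```

Because the matrix is allocated once, each iteration only writes its own rows instead of copying everything accumulated so far.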

Answer 3 (score: 1)

Another approach:

tars.hsa <- data.frame(
    miR.Family=c('miR-23abc/miR-23b-3p', 'miR-218/miR-218a', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-376abd/miR-376b-3p', 'miR-433', 'miR-433'),
    Gene.Symbol=c('A1BG','A1BG','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF','A1CF')
)

do.call(rbind, apply(tars.hsa, 1, function(row) {
    data.frame(stringsAsFactors=FALSE, row.names=NULL,
        miR.Family=strsplit(row['miR.Family'], '/'),
        Gene.Symbol=row['Gene.Symbol']
    )
}))