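Row-wise operation needs to be sped up

Asked: 2012-02-23 15:13:30

Tags: r

I need to perform more than 15 million row-wise operations, but my code is too slow. Here is a small reproducible example: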

costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3),c(9.6,5.5,7.2,4.9),c(2.6,8.2,6.4,8.3),c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1 # Example only; in practice costMatrix2 differs from costMatrix1

tbl_Filter <- rbind(c(0,0,0,4),c(1,2,3,4),c(1,0,3,0),c(1,2,0,0),c(1,2,0,4))

tbl_Sums <- data.frame(matrix(0, nrow=nrow(tbl_Filter), ncol=2)) # one result row per filter row
colnames(tbl_Sums) <- c("Sum1","Sum2")

for (i in 1:nrow(tbl_Filter))
{
  tbl_Sums[i,1] <- sum(costMatrix1[tbl_Filter[i,],tbl_Filter[i,]])
  tbl_Sums[i,2] <- sum(costMatrix2[tbl_Filter[i,],tbl_Filter[i,]])
}

I thought replacing the for loop with ddply would be the solution, but I could not get it to work.

4 answers:

Answer 0: (score: 5)

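If you have large arrays to work with, you are better off sticking with base R.

Here is how to use sapply to solve the summation for a single matrix, and then reuse it on each input matrix: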

sumOne <- function(cost, filter){
  sapply(1:nrow(filter), function(i)sum(cost[filter[i,], filter[i,]]))
}


cbind(
    sumOne(costMatrix1, tbl_Filter),
    sumOne(costMatrix2, tbl_Filter)
)

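Result (the second column is exactly twice the first, so this output was presumably produced with costMatrix2 set to 2 * costMatrix1 rather than the plain copy in the question):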

     [,1]  [,2]
[1,]  5.7  11.4
[2,] 85.5 171.0
[3,] 15.3  30.6
[4,] 22.9  45.8
[5,] 43.9  87.8

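This should be much faster than your loop. Not because for loops are inherently slower than sapply (they are not), but because sapply automatically preallocates memory for the result, and [<- is slow, particularly when it assigns into a data frame as in the question.

Answer 1: (score: 4)

If you have multiple CPU cores, using snowfall may help speed things up. Setup (before parallelizing):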
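To illustrate the point about preallocation, here is a minimal sketch of the original for loop rewritten to fill plain preallocated numeric vectors and build the data frame only once at the end. The names n, sum1, sum2 and idx are introduced here for illustration, and no timing is claimed for it; it only shows where the per-iteration data frame assignment can be avoided.

```r
costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1  # example only, as in the question
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

n <- nrow(tbl_Filter)
sum1 <- numeric(n)  # preallocate the result vectors once
sum2 <- numeric(n)

for (i in seq_len(n)) {
  idx <- tbl_Filter[i, ]                 # zeros are dropped by R's indexing
  sum1[i] <- sum(costMatrix1[idx, idx])  # cheap assignment into a numeric vector
  sum2[i] <- sum(costMatrix2[idx, idx])
}

# build the data frame a single time, after the loop
tbl_Sums <- data.frame(Sum1 = sum1, Sum2 = sum2)
```

Extracting tbl_Filter[i, ] once per iteration also avoids repeating the same row lookup four times.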

newfun = function(n) {
  a <- sum(costMatrix1[tbl_Filter[n,],tbl_Filter[n,]])
  b <- sum(costMatrix2[tbl_Filter[n,],tbl_Filter[n,]])
  c(a,b)
  }

nvec = matrix(data = 1:nrow(tbl_Filter), ncol = 1)

start = proc.time()  # named 'start' rather than 't' to avoid confusion with the t() function
out = t(apply(nvec,1,function(x) newfun(x)))
proc.time() - start

Now, parallelized:

## load 'snowfall' package
require(snowfall)

## Initialize parallel operation --> choose number of CPUs here!
sfInit( parallel=TRUE, cpus=2 )

##################################################################
## 'Export' functions and variables to all "slaves" so that parallel calculations
## can occur

sfExport(list=list('newfun'))

sfExport('costMatrix1')
sfExport('costMatrix2')
sfExport('tbl_Filter')
sfExport('nvec')

## call function using sfApply; will return values as a list object
 out = sfApply(nvec, 1, function(x) newfun(x))

## stop parallel computing job
sfStop()

tbl_Sums = as.data.frame(t(out))
colnames(tbl_Sums) <- c("Sum1","Sum2")

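Answer 2: (score: 2)

I am not sure how the speed compares, but you could also set this up as a matrix multiplication. This relies on the fact that tbl_Filter has positive numbers in exactly the columns you want to sum (each nonzero entry k sits in column k).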

> ttt <- apply((tbl_Filter>0)*1,1,function(x) x %*% t(x))
> t(rbind(as.numeric(costMatrix1), as.numeric(costMatrix2)) %*% ttt)
     [,1]  [,2]
[1,]  5.7  11.4
[2,] 85.5 171.0
[3,] 15.3  30.6
[4,] 22.9  45.8
[5,] 43.9  87.8
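A possibly lighter variant of the same idea, offered as a sketch rather than a benchmarked solution: if m is the 0/1 mask for one row of tbl_Filter and C is the cost matrix, the wanted sum is the quadratic form m C m', so all rows can be computed at once as rowSums((M %*% C) * M). This avoids building ttt, which holds 16 values per filter row and would become very large at 15 million rows. It rests on the same assumption as above (each nonzero entry k sits in column k); mask and sumAll are names introduced here for illustration.

```r
costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1  # example only, as in the question
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

# 0/1 mask: 1 where a row/column index is selected, 0 elsewhere
mask <- (tbl_Filter > 0) * 1

# For each row i: sum over j,k of mask[i,j] * cost[j,k] * mask[i,k]
sumAll <- function(cost, mask) rowSums((mask %*% cost) * mask)

sums <- cbind(Sum1 = sumAll(costMatrix1, mask),
              Sum2 = sumAll(costMatrix2, mask))
```

Only an N x 4 intermediate is created per cost matrix, so memory grows with the number of filter rows times the matrix width rather than its square.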

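Answer 3: (score: 0)

Besides the snowfall library mentioned above, there is also multicore, which implements only a parallel version of lapply (called mclapply) rather than of apply. However, it is easy to rewrite the code to accommodate this: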

newfun = function(n) {
  a <- sum(costMatrix1[tbl_Filter[n,],tbl_Filter[n,]])
  b <- sum(costMatrix2[tbl_Filter[n,],tbl_Filter[n,]])
  c(a,b)
}

nvec = matrix(data = 1:nrow(tbl_Filter), ncol = 1)

# single-core version using apply
out = t(apply(nvec,1,newfun))

# multicore version using mclapply
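library(multicore)  # on current R, the base 'parallel' package also provides mclapply
out.list = mclapply(1:nrow(nvec), function(i) newfun(nvec[i,]))
out = do.call("rbind", out.list)

# if the number of rows is huge, this will be much faster than do.call:
# (rbindlist needs list-like items, so each numeric vector is converted first)
library(data.table)
out = rbindlist(lapply(out.list, as.list))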
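For readers without multicore or snowfall installed, the same per-row function can be sketched with the base parallel package instead. This is a swapped-in alternative, not what the answers above used. It assumes two worker processes are available; sumRow and the argument names cost1, cost2 and filter are hypothetical names introduced here. The data are passed as explicit arguments so that nothing has to be exported to the workers separately.

```r
library(parallel)

costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1  # example only, as in the question
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

# hypothetical helper: both sums for filter row n
sumRow <- function(n, cost1, cost2, filter) {
  idx <- filter[n, ]
  c(sum(cost1[idx, idx]), sum(cost2[idx, idx]))
}

cl <- makeCluster(2)  # assumption: 2 workers; adjust to your machine
out.list <- parLapply(cl, seq_len(nrow(tbl_Filter)), sumRow,
                      cost1 = costMatrix1, cost2 = costMatrix2,
                      filter = tbl_Filter)
stopCluster(cl)

# combine the per-row results into a 5 x 2 matrix
out <- do.call(rbind, out.list)
colnames(out) <- c("Sum1", "Sum2")
```

With only five rows the cluster start-up cost dominates; any gain would show up only at sizes closer to the 15 million rows mentioned in the question.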