I need to perform more than 15 million row-wise operations, but my code is too slow. Here is a small reproducible example:
costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3),c(9.6,5.5,7.2,4.9),c(2.6,8.2,6.4,8.3),c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1 #Example, the costMatrix2 is actually different from costMatrix1
tbl_Filter <- rbind(c(0,0,0,4),c(1,2,3,4),c(1,0,3,0),c(1,2,0,0),c(1,2,0,4))
tbl_Sums <- data.frame(matrix(0, nrow=10, ncol=2))
colnames(tbl_Sums) <- c("Sum1","Sum2")
for (i in 1:nrow(tbl_Filter))
{
tbl_Sums[i,1] <- sum(costMatrix1[tbl_Filter[i,],tbl_Filter[i,]])
tbl_Sums[i,2] <- sum(costMatrix2[tbl_Filter[i,],tbl_Filter[i,]])
}
I think replacing the for loop with ddply is the solution, but I can't get it to work.
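Answer 0 (score: 5)
If you have large arrays to work with, you are better off sticking with base R.
Here is how to use sapply to compute the sums for a single matrix. You then reuse it on each input matrix: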
sumOne <- function(cost, filter){
sapply(1:nrow(filter), function(i)sum(cost[filter[i,], filter[i,]]))
}
cbind(
sumOne(costMatrix1, tbl_Filter),
sumOne(costMatrix2, tbl_Filter)
)
Result:
     [,1]  [,2]
[1,]  5.7  11.4
[2,] 85.5 171.0
[3,] 15.3  30.6
[4,] 22.9  45.8
[5,] 43.9  87.8
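This should be much faster than your loop. Not because for loops are inherently slower than sapply (they aren't), but because sapply automatically allocates the memory for the result up front, and because [<- (applied here to a data frame on every iteration) is slow.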
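To illustrate the point about the loop itself not being the problem, here is a minimal sketch of the original for loop rewritten to fill preallocated numeric vectors and build the data frame only once at the end. The names n, idx, Sum1 and Sum2 are introduced here for illustration, and no timings are claimed:

```r
costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
costMatrix2 <- costMatrix1  # placeholder, as in the question
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

n <- nrow(tbl_Filter)
Sum1 <- numeric(n)  # preallocated plain vectors instead of a data frame
Sum2 <- numeric(n)
for (i in seq_len(n)) {
  idx <- tbl_Filter[i, ]  # zeros are dropped by R's indexing
  Sum1[i] <- sum(costMatrix1[idx, idx])
  Sum2[i] <- sum(costMatrix2[idx, idx])
}
# Build the data frame once, after the loop
tbl_Sums <- data.frame(Sum1 = Sum1, Sum2 = Sum2)
```

Assigning into a plain numeric vector avoids calling the data frame assignment method on every iteration, which is the part of the original loop that the answer identifies as slow.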
Answer 1 (score: 4)
If you have multiple CPU cores, using snowfall may help you speed things up. Setup (before parallelization):
newfun = function(n) {
a <- sum(costMatrix1[tbl_Filter[n,],tbl_Filter[n,]])
b <- sum(costMatrix2[tbl_Filter[n,],tbl_Filter[n,]])
c(a,b)
}
nvec = matrix(data = 1:nrow(tbl_Filter), ncol = 1)
## time the single-core run ('start' avoids reusing the name of the t() function)
start = proc.time()
out = t(apply(nvec,1,function(x) newfun(x)))
proc.time() - start
Now, parallelized:
## load the 'snowfall' package
library(snowfall)
## initialize parallel operation --> choose the number of CPUs here!
sfInit(parallel=TRUE, cpus=2)
##################################################################
## export the function and the variables it uses to all workers so that
## the parallel calculations can run
sfExport('newfun', 'costMatrix1', 'costMatrix2', 'tbl_Filter', 'nvec')
## call the function using sfApply; as with apply, each row of nvec yields one column of the result
out = sfApply(nvec, 1, function(x) newfun(x))
## stop the parallel computing job
sfStop()
tbl_Sums = as.data.frame(t(out))
colnames(tbl_Sums) <- c("Sum1","Sum2")
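As an alternative to snowfall (not what this answer uses), the same idea can be sketched with the parallel package that ships with R. The helper sumChunk and the variables cl, chunks and parts are names made up for this example; the rows are split into one block per worker so that each worker receives a single task rather than one task per row:

```r
library(parallel)

costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

# Hypothetical helper: sums the selected sub-matrix for a block of filter rows
sumChunk <- function(rows, cost, filter) {
  vapply(rows, function(i) sum(cost[filter[i, ], filter[i, ]]), numeric(1))
}

cl <- makeCluster(2)                                   # two worker processes
chunks <- splitIndices(nrow(tbl_Filter), length(cl))   # one block of row indices per worker
parts <- parLapply(cl, chunks, sumChunk, cost = costMatrix1, filter = tbl_Filter)
stopCluster(cl)

Sum1 <- unlist(parts)  # blocks come back in order, so rows stay aligned
```

Because the data are passed as arguments to parLapply, no separate export step is needed in this sketch. Whether it beats the single-core version depends on the size of the data and the cost of starting the workers.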
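Answer 2 (score: 2)
Not sure how the speed compares, but you can also set this up as a matrix multiplication. This relies on the fact that each row of tbl_Filter has a positive number exactly in the positions of the rows and columns you want to sum.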
> ttt <- apply((tbl_Filter>0)*1,1,function(x) x %*% t(x))
> t(rbind(as.numeric(costMatrix1), as.numeric(costMatrix2)) %*% ttt)
     [,1]  [,2]
[1,]  5.7  11.4
[2,] 85.5 171.0
[3,] 15.3  30.6
[4,] 22.9  45.8
[5,] 43.9  87.8
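A related formulation, offered here as a sketch rather than as part of the answer above: with x the 0/1 indicator vector of one filter row, the sum of the selected sub-matrix equals the quadratic form x' C x. Computing this for all rows at once needs only temporaries with as many columns as the cost matrix, instead of one column of length 16 per filter row. The names ind, Sum1 and loopSum are introduced for this example:

```r
costMatrix1 <- rbind(c(4.2,3.6,2.1,2.3), c(9.6,5.5,7.2,4.9),
                     c(2.6,8.2,6.4,8.3), c(4.8,3.3,6.8,5.7))
tbl_Filter <- rbind(c(0,0,0,4), c(1,2,3,4), c(1,0,3,0), c(1,2,0,0), c(1,2,0,4))

# 0/1 indicator matrix: ind[i, j] is 1 when row/column j is selected in filter row i
ind <- (tbl_Filter > 0) * 1

# Row i of (ind %*% C) * ind holds the terms x_j * (x' C)_j, so its row sum is x' C x
Sum1 <- rowSums((ind %*% costMatrix1) * ind)

# Cross-check against the direct indexing approach
loopSum <- sapply(seq_len(nrow(tbl_Filter)),
                  function(i) sum(costMatrix1[tbl_Filter[i, ], tbl_Filter[i, ]]))
stopifnot(isTRUE(all.equal(Sum1, loopSum)))
```

Like the answer's version, this assumes that a positive entry in position j of a filter row always means "select row and column j".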
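Answer 3 (score: 0)
Besides the snowfall library mentioned above, there is also multicore, which only implements a parallel version of lapply (called mclapply) rather than of apply. However, rewriting the code to accommodate this is easy: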
newfun = function(n) {
a <- sum(costMatrix1[tbl_Filter[n,],tbl_Filter[n,]])
b <- sum(costMatrix2[tbl_Filter[n,],tbl_Filter[n,]])
c(a,b)
}
nvec = matrix(data = 1:nrow(tbl_Filter), ncol = 1)
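# single-core version using apply
out = t(apply(nvec,1,newfun))
# multicore version using mclapply
# (mclapply from the former 'multicore' package is now provided by 'parallel', which ships with R)
library(parallel)
out.list = mclapply(1:nrow(nvec), function(i) newfun(nvec[i,]))
out = do.call("rbind", out.list)
# if the number of rows is huge, this will be much faster than do.call;
# rbindlist expects list-like items, so each numeric vector is converted with as.list first:
library(data.table)
out = rbindlist(lapply(out.list, as.list))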