Speeding up a matrix computation

Time: 2018-10-17 01:51:23

Tags: r matrix apply

I have two matrices with the same dimensions (in reality they are 2,500 x 15,000; the example below is scaled down to 25 x 150):

set.seed(1)
a.mat <- matrix(rnorm(25*150),nrow=25,ncol=150,dimnames=list(paste0("p",1:25),paste0("c",1:150)))
b.mat <- matrix(rnorm(25*150),nrow=25,ncol=150,dimnames=list(paste0("p",1:25),paste0("c",1:150)))

And I'm running this computation on them:

res.mat <- do.call(cbind,lapply(1:ncol(a.mat),function(i){
  t.mat <- a.mat-a.mat[,i]
  t.mat <- log10(abs(t.mat)+1) * sign(t.mat)
  return(suppressWarnings(cor(t.mat,b.mat[,i])))
}))
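For reference, here is a minimal sketch of the same per-column computation with the Pearson correlation written out as matrix algebra, so that `b.mat` is centered only once instead of inside every `cor` call. The names `signed.log10`, `b.centered`, `b.ss` and `res.fast` are hypothetical and not part of the code above. It has not been benchmarked at the full 2,500 x 15,000 size, so whether it is actually faster is an assumption to verify.

```r
set.seed(1)
a.mat <- matrix(rnorm(25 * 150), nrow = 25, ncol = 150,
                dimnames = list(paste0("p", 1:25), paste0("c", 1:150)))
b.mat <- matrix(rnorm(25 * 150), nrow = 25, ncol = 150,
                dimnames = list(paste0("p", 1:25), paste0("c", 1:150)))

# Hypothetical helper: the signed log10 transform used in the loop above.
signed.log10 <- function(x) log10(abs(x) + 1) * sign(x)

# Center every column of b.mat once and keep its per-column sum of squares.
b.centered <- sweep(b.mat, 2, colMeans(b.mat))
b.ss <- colSums(b.centered^2)

# One column of the result per column i of a.mat.
res.fast <- vapply(seq_len(ncol(a.mat)), function(i) {
  t.mat <- signed.log10(a.mat - a.mat[, i])
  t.centered <- sweep(t.mat, 2, colMeans(t.mat))
  # Pearson correlation = cross product / sqrt(product of sums of squares).
  num <- drop(crossprod(t.centered, b.centered[, i]))
  den <- sqrt(colSums(t.centered^2) * b.ss[i])
  # Column i of t.mat is constant (all zeros), so this is 0/0 = NaN there,
  # where cor() would return NA with a warning.
  num / den
}, numeric(ncol(a.mat)))
```

Like `res.mat`, the result is an `ncol(a.mat)` x `ncol(a.mat)` matrix. `vapply` is used so the output shape is checked and no `do.call(cbind, ...)` step is needed.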

Any idea whether, and how, this can be made faster than the way I'm currently doing it? Perhaps by running it in parallel with multidplyr?
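Since each column is processed independently, one way to parallelize the loop as written is base R's `parallel` package rather than multidplyr. The following is only a sketch under assumptions: the helper name `cor.one.column` and the worker count of 2 are hypothetical, and at the full size each worker receives its own copy of both matrices (roughly 300 MB per matrix of doubles), so memory may become the limiting factor.

```r
library(parallel)

set.seed(1)
a.mat <- matrix(rnorm(25 * 150), nrow = 25, ncol = 150,
                dimnames = list(paste0("p", 1:25), paste0("c", 1:150)))
b.mat <- matrix(rnorm(25 * 150), nrow = 25, ncol = 150,
                dimnames = list(paste0("p", 1:25), paste0("c", 1:150)))

# Hypothetical helper: one iteration of the original loop.
# The matrices are passed as arguments so the workers do not rely on globals.
cor.one.column <- function(i, a.mat, b.mat) {
  t.mat <- a.mat - a.mat[, i]
  t.mat <- log10(abs(t.mat) + 1) * sign(t.mat)
  suppressWarnings(cor(t.mat, b.mat[, i]))
}

cl <- makeCluster(2)  # assumption: 2 workers; adjust to the machine
res.list <- parLapply(cl, seq_len(ncol(a.mat)), cor.one.column,
                      a.mat = a.mat, b.mat = b.mat)
stopCluster(cl)

# Same layout as res.mat: one column per column of a.mat.
res.par <- do.call(cbind, res.list)
```

`parLapply` returns the results in the order of the input indices, so the column order matches the serial version.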

As for multidplyr, here's what I'm trying:

library(dplyr)

df <- do.call(rbind,lapply(1:ncol(a.mat),function(i){
    cbind(reshape2::melt(a.mat) %>% dplyr::rename(id=Var1,cell=Var2,a.value=value),
          do.call("rbind",replicate(ncol(a.mat),data.frame(cell.i=colnames(a.mat)[i],a.value.i=a.mat[,i]),simplify=F)),
          do.call("rbind",replicate(ncol(a.mat),data.frame(b.value.i=b.mat[,i]),simplify=F)))
  })) %>% dplyr::mutate(t.value=a.value-a.value.i) %>% dplyr::mutate(t.value=log10(abs(t.value)+1)*sign(t.value)) %>% dplyr::group_by(cell.i)

Then:

group.size <- 3
n.groups <- ceiling(ncol(a.mat)/group.size)

for(i in 1:n.groups){
  start.idx <- (i-1)*group.size+1
  end.idx <- min(i*group.size,ncol(a.mat))
  current.df <- df %>% dplyr::filter(cell.i %in% colnames(a.mat)[start.idx:end.idx])
  current.df <- current.df %>% multidplyr::partition(cell.i) %>% multidplyr::cluster_library("tidyverse") %>% multidplyr::cluster_library("MASS") %>%
    multidplyr::cluster_assign_value("myFunction", myFunction) %>%
    do(results = myFunction(.)) %>% dplyr::collect() %>% .$results %>% dplyr::bind_rows()
}

Where:

myFunction <- function(df)
{
  return(df %>% dplyr::group_by(e.cell) %>% dplyr::mutate(cor=cor(t.value,b.value.i)))
}
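Note that the data frame built earlier has the columns `id`, `cell`, `a.value`, `cell.i`, `a.value.i`, `b.value.i` and `t.value`, and no `e.cell`. As a point of comparison, here is a minimal plain-dplyr sketch (no multidplyr) of the per-group correlation, assuming the intended grouping is by `cell.i` and `cell`. The small matrices `a` and `b` and the names `long.df` and `cor.df` are hypothetical, chosen only to keep the example short.

```r
library(dplyr)

set.seed(1)
a <- matrix(rnorm(5 * 4), nrow = 5,
            dimnames = list(paste0("p", 1:5), paste0("c", 1:4)))
b <- matrix(rnorm(5 * 4), nrow = 5,
            dimnames = list(paste0("p", 1:5), paste0("c", 1:4)))

# Long-format rows for a single reference column i.
i <- 2
t.mat <- a - a[, i]
t.mat <- log10(abs(t.mat) + 1) * sign(t.mat)

long.df <- data.frame(
  cell.i    = colnames(a)[i],
  cell      = rep(colnames(a), each = nrow(a)),  # matches column-major order
  t.value   = as.vector(t.mat),
  b.value.i = rep(unname(b[, i]), times = ncol(a)),
  stringsAsFactors = FALSE
)

# One correlation per (cell.i, cell) pair; summarise() collapses each group
# to a single row, whereas mutate() would repeat the value on every row.
cor.df <- long.df %>%
  group_by(cell.i, cell) %>%
  summarise(cor = suppressWarnings(cor(t.value, b.value.i)), .groups = "drop")
```

The row where `cell` equals `cell.i` is `NA`, because `t.value` is constant (all zeros) for that group.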

However, this gives the following error message:

Error in checkForRemoteErrors(lapply(cl, recvResult)) :
  3 nodes produced errors; first error: Can't convert an environment to function
Call `rlang::last_error()` to see a backtrace
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments

So, is it worth using multidplyr in this way? And if so, what am I doing wrong here?

0 answers:

No answers yet