Question

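I come from a Java/Python comp sci theory background, so I am still getting used to the various R packages and how they can save run time in functions.

Basically, I am working on a few projects that all involve taking individual factors from a long-list data set (15,000 to 200,000 factors), running calculations against individual factors in an equally large data set, and storing the results in an exponentially longer data frame.

So far I have been using nested while loops and concatenating onto an ever-growing list, but this takes days. I recently learned about the 'lapply' and 'data.frame' options in R, and I would love to see an example of how to apply them (no pun intended) to the following basic correlation function: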

Corr<-function(miRdf, mRNAdf)
{
j=1  
k=1
m=1
n=1
c=0
corrList=NULL
while(n<=71521)
{
  while(m<=1477)
  {    
  corr=cor(as.numeric(miRdf[k,2:13]), as.numeric(mRNAdf[j,2:13]), use ="complete.obs")
  corrList<-c(corrList, corr)
  j=j+1
  c=c+1
  print(c)  #just a counter to see how far the function has run
  m=m+1
}
k=k+1
n=n+1
j=1
m=1         #to reset the inner while loop
}
corrList<-matrix(unlist(corrList), ncol=1477, byrow=FALSE)
colnames(corrList)<-miRdf[,1]
rownames(corrList)<-mRNAdf[,1]
write.csv(corrList, "testCorrWhole.csv")
}

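As you can see, the nested while loop means that 105,636,517 (71521x1477) miRNA-to-mRNA expression-value correlation scores have to be computed and stored in a data frame of 1477 cols x 71521 rows to produce the score matrix.

My question is: can anyone shed some light on how to turn the monster above into an efficient function that uses 'lapply' in place of the while loops, and the 'data.table' set() function to get rid of the inefficiency of concatenating a list on every pass through the loop?

Thanks in advance!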

Answer 1

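Your variable names end in "df", which makes it look as though your data are data.frames. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and matrix operations are generally much faster than data.frame operations. So you can see that if you had provided a small example of your data set (e.g., dput(mRNAdf[1:10,])), people could have helped you better; that is what they were asking for.

In large numerical calculations it makes sense to "hoist" any repeated computation out of the loop, so that it is performed only once. In your case the repeated computations include subsetting to columns 2:13 and coercing to numeric. With that in mind, and guessing that you actually have a data.frame in which each column is already a numeric vector, I would start with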

mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])

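On the help page ?cor we see that the arguments can be matrices and, if they are, the correlations are computed between columns. You are interested in the result when the arguments are transposed relative to their current representation. So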

result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")

This is fast enough for your purposes:

> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
   user  system elapsed 
  9.124   0.200   9.340 
> dim(ans)
[1] 71521  1477

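result holds the same values as your corrList, except that it is a matrix rather than a list; the row and column names have probably been carried over. You can write it to a file as above, with write.csv(result, "testCorrWhole.csv").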
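As a sanity check on the approach above, here is a minimal sketch on made-up toy data (5 and 3 rows of 12 values; the helper loopCor is hypothetical and only exists for the comparison). It checks that the single matrix call reproduces a pair-by-pair loop. It also illustrates one point worth verifying on real data: with matrix arguments, use = "complete.obs" drops a sample for every pair as soon as any row has a missing value in it, whereas the original loop drops it only for the pair being computed. If the data contain NAs, use = "pairwise.complete.obs" appears to be the option that matches the loop.

```r
set.seed(1)

# Hypothetical toy data: rows are genes, columns are the 12 samples
mRNAmatrix <- matrix(rnorm(5 * 12), nrow = 5)
miRmatrix  <- matrix(rnorm(3 * 12), nrow = 3)

# Reference implementation: one cor() call per pair of rows,
# written into a preallocated matrix (no growing vector)
loopCor <- function(a, b, use = "everything") {
  out <- matrix(NA_real_, nrow = nrow(a), ncol = nrow(b))
  for (i in seq_len(nrow(a))) {
    for (k in seq_len(nrow(b))) {
      out[i, k] <- cor(a[i, ], b[k, ], use = use)
    }
  }
  out
}

# Without missing values, the single matrix call matches the loop
loopResult <- loopCor(mRNAmatrix, miRmatrix)
vecResult  <- cor(t(mRNAmatrix), t(miRmatrix))
all.equal(loopResult, vecResult)

# Introduce one missing value and compare the two 'use' settings
mRNAmatrix[1, 2] <- NA
loopNA   <- loopCor(mRNAmatrix, miRmatrix, use = "complete.obs")
pairwise <- cor(t(mRNAmatrix), t(miRmatrix), use = "pairwise.complete.obs")
casewise <- cor(t(mRNAmatrix), t(miRmatrix), use = "complete.obs")

all.equal(loopNA, pairwise)          # agrees with the per-pair loop
isTRUE(all.equal(loopNA, casewise))  # casewise deletion gives different values
```

If the real data have no missing values, the two settings give the same result and the choice does not matter.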

Answer 2

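Update: parallel processing is shown below; it saves about 60%.

Using apply() may not be fast enough. Still, here is how to do it. Performance does need some thought, since this example (1M output correlations in a 1000x1000 grid) takes over a minute on a laptop.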

miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)

corRow<-function(y){
    apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
  }

system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user  system elapsed 
# 72.94    0.00   73.39
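Since the question asked specifically about lapply, here is a minimal sketch of the same computation written that way, on deliberately small made-up inputs (20 and 30 rows; the names miRvals, mRNAvals, corList and corMat are illustrative only). It hoists the column subsetting out of the loops and binds the pieces together once at the end instead of growing a vector with c(). It is not expected to be faster than a single cor() call on the transposed matrices; it only shows the pattern.

```r
set.seed(42)

# Hypothetical small inputs with the same layout as above:
# column 1 is an index, columns 2:13 hold the values
miRdf  <- matrix(rnorm(20 * 13, 10, 1), ncol = 13)
mRNAdf <- matrix(rnorm(30 * 13, 10, 1), ncol = 13)

# Hoist the column subsetting so it happens once, not once per pair
miRvals  <- miRdf[, 2:13]
mRNAvals <- mRNAdf[, 2:13]

# One list element per mRNA row; each element is that row's
# correlation with every miR row
corList <- lapply(seq_len(nrow(mRNAvals)), function(i) {
  vapply(seq_len(nrow(miRvals)),
         function(k) cor(miRvals[k, ], mRNAvals[i, ], use = "complete.obs"),
         numeric(1))
})

# Bind once at the end: rows are mRNA rows, columns are miR rows
corMat <- do.call(rbind, corList)
dim(corMat)
```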

Using parallel::parApply on a 4-core Win64 laptop:

require(parallel) ## Library to allow parallel processing

miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)

corRow<-function(y){
    apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
  }


# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))

# Stop the cluster
stopCluster(cl)

# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))

## WITH CLUSTER (4)
#  user  system elapsed
#  0.04    0.03   29.94

## WITHOUT CLUSTER
#  user  system elapsed
# 73.96    0.00   74.46
