Question

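The output of one simulation is a large data.frame with a fixed set of columns and rows. I ran several hundred simulations, and each result is stored in a separate RData file (for efficient reading).

Now I want to gather all these files and collect the statistics for each field of this data.frame into a "cells" structure, which is basically a list of vectors. This is how I do it: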

#colscount, rowscount - number of columns and rows in each simulation
#simcount - number of simulations
#colnames - column names of the simulation's data frame
#simfilenames - vector of file names, one per simulation

cells<-as.list(rep(NA, colscount))
for(i in 1:colscount)
{
  cells[[i]]<-as.list(rep(NA,rowscount)) 
  for(j in 1:rowscount)
  {
    cells[[i]][[j]]<-rep(NA,simcount)
  }
}
names(cells)<-colnames

addcells<-function(simnr)
# This function reads and appends simdata to "simnr" position in each cell in the "cells" structure
{
  simdata<-readRDS(simfilenames[[simnr]])
  for(i in 1:colscount)
  {
    for(j in 1:rowscount)
    {
      if (!is.na(simdata[j,i]))
      {
        cells[[i]][[j]][simnr]<<-simdata[j,i]  # <<- updates the global "cells" rather than a local copy
      }
    }
  }
}
library(plyr)
a_ply(1:simcount,1,addcells)
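
To make the layout concrete, here is a minimal sketch of the same "cells" structure on toy dimensions. The sizes and the name `col_names` are made up for illustration; the preallocation uses `replicate()` instead of the nested loops above, which is only an alternative spelling and not a speed fix.

```r
# Toy dimensions, chosen only for illustration (not the real simulation sizes).
colscount <- 2
rowscount <- 3
simcount  <- 4
col_names <- c("a", "b")   # hypothetical column names

# cells[[column]][[row]] is a vector with one slot per simulation.
cells <- setNames(
  replicate(colscount,
            replicate(rowscount, rep(NA, simcount), simplify = FALSE),
            simplify = FALSE),
  col_names
)

# Storing the value of column "a", row 2 from simulation 1:
cells[["a"]][[2]][1] <- 3.5
```

After this, `cells[["a"]][[2]]` holds the values of one data frame field across all simulations, with `NA` for simulations that have not been filled in yet.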

The problem is that reading a single file is fast:

> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
   user  system elapsed 
  0.088   0.004   0.093

while filling the cells from a single simulation is not:

> system.time(addcells(1))
   user  system elapsed 
147.328   0.296 147.644

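I would expect both commands to have comparable execution times (or at least the latter to be at most 10 times slower). I guess I am doing something very inefficient, but what? The whole cells data structure is quite large; it takes about 1 GB of memory.

I need to transpose the data this way because I later compute a lot of descriptive statistics on the results (such as means, sd, quantiles, and histograms), so it is important that the data for each cell is stored as a (one-dimensional) vector.

Here is the profiling output: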
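
As an illustration of the kind of per-cell statistics meant here, the following is a rough sketch on a made-up cell vector. The data and the name `cell_summary` are assumptions for the example, not part of the actual analysis.

```r
set.seed(1)
# Hypothetical cell: values of one field across 10 simulations, one missing.
cell <- c(rnorm(9), NA)

cell_summary <- list(
  mean      = mean(cell, na.rm = TRUE),
  sd        = sd(cell, na.rm = TRUE),
  quantiles = quantile(cell, probs = c(0.25, 0.5, 0.75), na.rm = TRUE),
  histogram = hist(cell, plot = FALSE)   # hist() discards non-finite values
)
```

Each of these functions takes a plain vector, which is why every cell needs to be a one-dimensional vector.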

> summaryRprof('/tmp/temp/rprof.out')
$by.self
                self.time self.pct total.time total.pct
"[.data.frame"      71.98    47.20     129.52     84.93
"names"             11.98     7.86      11.98      7.86
"length"            10.84     7.11      10.84      7.11
"addcells"          10.66     6.99     151.52     99.36
".subset"           10.62     6.96      10.62      6.96
"["                  9.68     6.35     139.20     91.28
"match"              6.06     3.97      11.36      7.45
"sys.call"           4.68     3.07       4.68      3.07
"%in%"               4.50     2.95      15.86     10.40
"all"                4.28     2.81       4.28      2.81
"=="                 2.34     1.53       2.34      1.53
".subset2"           1.28     0.84       1.28      0.84
"is.na"              1.06     0.70       1.06      0.70
"nargs"              0.62     0.41       0.62      0.41
"gc"                 0.54     0.35       0.54      0.35
"!"                  0.42     0.28       0.42      0.28
"dim"                0.34     0.22       0.34      0.22
".Call"              0.12     0.08       0.12      0.08
"readRDS"            0.10     0.07       0.12      0.08
"cat"                0.10     0.07       0.10      0.07
"readLines"          0.04     0.03       0.04      0.03
"strsplit"           0.04     0.03       0.04      0.03
"addParaBreaks"      0.02     0.01       0.04      0.03

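It looks like indexing the list structure takes a lot of time. But I cannot make it an array, because not all cells are numeric, and R does not easily support hash maps...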
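
Since the profile above attributes most of the self time to `[.data.frame`, one thing that may be worth trying is to pull each column out of the data frame once and then index the resulting plain vector, so that `simdata[j,i]` is no longer called for every single field. The sketch below is untested against the real data; it uses small synthetic data frames held in a list named `sims` (an assumption for the example) in place of the files read with `readRDS()`.

```r
# Synthetic stand-ins for the simulation results (assumption: the real data
# frames would come from readRDS() on each file instead).
set.seed(42)
rowscount <- 5
simcount  <- 3
sims <- lapply(seq_len(simcount), function(s) {
  data.frame(x = rnorm(rowscount),
             label = sample(letters, rowscount),
             stringsAsFactors = FALSE)
})
colscount <- ncol(sims[[1]])

# Preallocate cells[[column]][[row]] as NA vectors of length simcount.
cells <- setNames(
  replicate(colscount,
            replicate(rowscount, rep(NA, simcount), simplify = FALSE),
            simplify = FALSE),
  names(sims[[1]])
)

for (simnr in seq_len(simcount)) {
  simdata <- sims[[simnr]]
  for (i in seq_len(colscount)) {
    column <- simdata[[i]]                 # one data frame access per column
    for (j in which(!is.na(column))) {
      cells[[i]][[j]][simnr] <- column[j]  # plain vector indexing
    }
  }
}
```

This keeps the list-of-vectors layout, so non-numeric columns still work. Whether it brings the run time close to that of `readRDS()` would have to be measured on the real files.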

How to efficiently grow large data in R
