Question

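I have a program that pulls data out of a MySQL database, decodes a pair of binary columns, and then sums together a subset of the rows within that pair of binary columns. Running the program on a sample data set takes 12-14 seconds, with 9-10 of those taken up by unlist. I'm wondering if there is any way to speed things up.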

Structure of the table

The rows I get from the database look like this:

| array_length | mz_array        | intensity_array |
|--------------+-----------------+-----------------|
|           98 | 00c077e66340... | 002091c37240... |
|           74 | c04a7c7340...   | db87734000...   |

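Here array_length is the number of little-endian doubles in each of the two arrays (they are guaranteed to be the same length). So the first row has 98 doubles in each of mz_array and intensity_array. array_length has a mean of 825 and a median of 620, with 13,000 rows.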

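Decoding the binary arrays

Each row gets decoded by being passed to the following function. Once the binary arrays have been decoded, array_length is no longer needed.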

DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         what="double",
         endian="little",
         n=array_length)
}
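To see what DecodeSpectrum returns, here is a minimal round-trip sketch. It assumes each binary column value arrives as an R raw vector (the question does not show the exact type RODBC hands back), and it fabricates the bytes with writeBin; the numbers are made up for illustration.

```r
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         what="double",
         endian="little",
         n=array_length)
}

# Made-up values, encoded as little-endian 8-byte doubles.
mz <- c(100.5, 200.25, 300.125)
intensity <- c(10, 20, 30)
mz_raw <- writeBin(mz, raw(), endian="little")
intensity_raw <- writeBin(intensity, raw(), endian="little")

# The result is a matrix with one row per value and one column per array.
spectrum <- DecodeSpectrum(length(mz), mz_raw, intensity_raw)
print(dim(spectrum))
print(colnames(spectrum))
```

Column 1 holds the decoded mz_array and column 2 the decoded intensity_array, which is the layout SumInWindow relies on.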

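Summing the arrays

The next step is to sum the values in intensity_array, but only if their corresponding entry in mz_array is within a certain window. The arrays are ordered by mz_array, ascending. I am using the following function to sum up the intensity_array values: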

SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

Here spectrum is the output of DecodeSpectrum, a matrix.
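A small worked example may make the window logic concrete. The matrix below is hand-built with made-up values rather than decoded from the database. Both comparisons are strict, so a value sitting exactly on a bound is excluded.

```r
SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

# Column 1: m/z values (ascending). Column 2: intensities.
spectrum <- cbind(mz_array=c(500, 650, 900, 1250),
                  intensity_array=c(1, 2, 4, 8))

inside <- SumInWindow(spectrum, 600, 1200)   # rows with mz 650 and 900
on_bounds <- SumInWindow(spectrum, 650, 900) # nothing strictly between
print(inside)     # 6
print(on_bounds)  # 0
```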

Operating on a list of rows

Each row is processed by:

ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]
  ## Decode a single spectrum and sum the intensities within the window.
  SumDecode <- function (...) {
    SumInWindow(DecodeSpectrum(...), lower, upper)
  }

  do.call("mapply", c(SumDecode, spectra))
}
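The do.call("mapply", c(SumDecode, spectra)) line expands to mapply(SumDecode, array_length=..., mz_array=..., intensity_array=...), so SumDecode is called once per row with that row's three values. The sketch below runs ProcessSegment on a hand-built stand-in for a fetched segment. It assumes the segment is a named list of columns in which the binary columns are lists of raw vectors; that shape is an assumption for illustration, not something the question confirms.

```r
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin, what="double", endian="little", n=array_length)
}

SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]
  SumDecode <- function (...) {
    SumInWindow(DecodeSpectrum(...), lower, upper)
  }
  do.call("mapply", c(SumDecode, spectra))
}

# Hypothetical helper: encode doubles the way the table stores them.
Encode <- function(x) writeBin(x, raw(), endian="little")

# Two made-up rows, laid out column-wise like a fetched segment.
spectra <- list(
  array_length    = c(2L, 3L),
  mz_array        = list(Encode(c(500, 700)), Encode(c(650, 900, 1300))),
  intensity_array = list(Encode(c(1, 2)),     Encode(c(4, 8, 16)))
)

sums <- ProcessSegment(spectra, c(600, 1200))
print(sums)  # 2 12
```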

Finally, the rows are fetched and passed to the ProcessSegment function:

ProcessAllSegments <- function(conn, window_bounds) {
  ## batchSize is not defined in this excerpt; it is presumably set elsewhere
  ## in the full script.
  nextSeg <- function() odbcFetchRows(conn, max=batchSize, buffsize=batchSize)

  while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
    print(ProcessSegment(res$data, window_bounds))
  }
}

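I'm doing the fetches in segments so that R doesn't have to load the entire data set into memory at once (doing so caused out-of-memory errors). I'm using the RODBC driver because the RMySQL driver isn't able to return unsullied binary values (as far as I could tell).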
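The batching pattern itself can be tried without a database. The sketch below uses a hypothetical MakeFetcher closure as a stand-in for odbcFetchRows: it hands out a fixed number of items per call and returns NULL when nothing is left. It is not the RODBC API, only an illustration of processing one bounded batch at a time.

```r
# Hypothetical stand-in for a database cursor: returns `rows` in batches.
MakeFetcher <- function(rows, batch_size) {
  pos <- 0
  function() {
    if (pos >= length(rows)) return(NULL)  # no more data
    idx <- seq(pos + 1, min(pos + batch_size, length(rows)))
    pos <<- max(idx)                        # remember where we stopped
    rows[idx]
  }
}

fetch <- MakeFetcher(as.list(1:7), batch_size=3)

# Only one batch is held in memory per iteration.
sizes <- integer(0)
while (!is.null(batch <- fetch())) {
  sizes <- c(sizes, length(batch))
}
print(sizes)  # 3 3 1
```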

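Performance

For a sample data set of about 140MiB, the whole process takes around 14 seconds to complete, which is not that bad for 13,000 rows. Still, I think there's room for improvement, especially when looking at the Rprof output: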

$by.self
                 self.time self.pct total.time total.pct
"unlist"             10.26    69.99      10.30     70.26
"SumInWindow"         1.06     7.23      13.92     94.95
"mapply"              0.48     3.27      14.44     98.50
"as.vector"           0.44     3.00      10.60     72.31
"array"               0.40     2.73       0.40      2.73
"FUN"                 0.40     2.73       0.40      2.73
"list"                0.30     2.05       0.30      2.05
"<"                   0.22     1.50       0.22      1.50
"unique"              0.18     1.23       0.36      2.46
">"                   0.18     1.23       0.18      1.23
".Call"               0.16     1.09       0.16      1.09
"lapply"              0.14     0.95       0.86      5.87
"simplify2array"      0.10     0.68      11.48     78.31
"&"                   0.10     0.68       0.10      0.68
"sapply"              0.06     0.41      12.36     84.31
"c"                   0.06     0.41       0.06      0.41
"is.factor"           0.04     0.27       0.04      0.27
"match.fun"           0.04     0.27       0.04      0.27
"<Anonymous>"         0.02     0.14      13.94     95.09
"unique.default"      0.02     0.14       0.06      0.41

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"      14.66    100.00      0.00     0.00
"do.call"                 14.50     98.91      0.00     0.00
"ProcessSegment"          14.50     98.91      0.00     0.00
"mapply"                  14.44     98.50      0.48     3.27
"<Anonymous>"             13.94     95.09      0.02     0.14
"SumInWindow"             13.92     94.95      1.06     7.23
"sapply"                  12.36     84.31      0.06     0.41
"DecodeSpectrum"          12.36     84.31      0.00     0.00
"simplify2array"          11.48     78.31      0.10     0.68
"as.vector"               10.60     72.31      0.44     3.00
"unlist"                  10.30     70.26     10.26    69.99
"lapply"                   0.86      5.87      0.14     0.95
"array"                    0.40      2.73      0.40     2.73
"FUN"                      0.40      2.73      0.40     2.73
"unique"                   0.36      2.46      0.18     1.23
"list"                     0.30      2.05      0.30     2.05
"<"                        0.22      1.50      0.22     1.50
">"                        0.18      1.23      0.18     1.23
".Call"                    0.16      1.09      0.16     1.09
"nextSeg"                  0.16      1.09      0.00     0.00
"odbcFetchRows"            0.16      1.09      0.00     0.00
"&"                        0.10      0.68      0.10     0.68
"c"                        0.06      0.41      0.06     0.41
"unique.default"           0.06      0.41      0.02     0.14
"is.factor"                0.04      0.27      0.04     0.27
"match.fun"                0.04      0.27      0.04     0.27

$sample.interval
[1] 0.02

$sampling.time
[1] 14.66

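I'm surprised to see unlist taking up so much time; this tells me that there might be some redundant copying or rearranging going on. I'm new to R, so it's entirely possible that this is normal, but I'd like to know if there's anything glaringly wrong.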
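For readers who want to produce the same kind of report, this is a minimal sketch of collecting a profile with Rprof and summaryRprof. The Busy function is a made-up workload standing in for ProcessAllSegments; the 0.02 interval matches the $sample.interval shown above.

```r
# Made-up CPU-bound workload so the profiler has something to sample.
Busy <- function(seconds) {
  start <- proc.time()[["elapsed"]]
  total <- 0
  while (proc.time()[["elapsed"]] - start < seconds) {
    total <- total + sum(unlist(lapply(1:1000, sqrt)))
  }
  total
}

profile_file <- tempfile()
Rprof(profile_file, interval=0.02)  # start sampling the call stack
invisible(Busy(0.5))
Rprof(NULL)                         # stop sampling

profile <- summaryRprof(profile_file)
print(names(profile))
print(head(profile$by.self))
```

by.self ranks functions by time spent in their own code, while by.total also counts time spent in the functions they call, which is why wrappers such as ProcessAllSegments top the second table with zero self time.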

Update: sample data posted

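I've posted the full version of the program here and the sample data I use here. The sample data is the gzipped output of mysqldump. You need to set the proper environment variables for the script to connect to the database:

MZDB_HOST
MZDB_DB
MZDB_USER
MZDB_PW

To run the script, you must specify the run_id and the window boundaries. I run the program like this:

Rscript ChromatoGen.R -i 1 -m 600 -M 1200

These window bounds are pretty arbitrary, but they select roughly a half to a third of the range. If you want to print the results, put a print() around the call to ProcessSegment within ProcessAllSegments. Using those parameters, the first 5 should be:

[1] 7139.682 4522.314 3435.512 5255.024 5947.999

You probably want to limit the number of results, unless you want 13,000 numbers filling your screen :) The simplest way is to add LIMIT 5 to the end of the query.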
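As a rough illustration of how a script can pick up the MZDB_HOST, MZDB_DB, MZDB_USER and MZDB_PW variables, the sketch below reads them with Sys.getenv and fails early if any are unset. ReadDbSettings is a hypothetical helper, not taken from the posted script, and the values set here are placeholders so the example runs on its own.

```r
# Hypothetical helper: collect the connection settings from the environment.
ReadDbSettings <- function() {
  vars <- c("MZDB_HOST", "MZDB_DB", "MZDB_USER", "MZDB_PW")
  values <- Sys.getenv(vars, unset=NA)
  missing <- vars[is.na(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse=", "))
  }
  values
}

# Placeholder values for demonstration only; normally set in the shell.
Sys.setenv(MZDB_HOST="localhost", MZDB_DB="example_db",
           MZDB_USER="example_user", MZDB_PW="example_password")

settings <- ReadDbSettings()
print(names(settings))
```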

Answer 1

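I've figured it out!

The problem was in the sapply() call. sapply does a fair amount of renaming and attribute setting, which slows things down massively for arrays of this size. Replacing DecodeSpectrum with the following code brought the sample time from 14.66 seconds down to 3.36 seconds, roughly a 4-fold speedup!

Here's the new body of DecodeSpectrum: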

DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  ## needed to tell `vapply` how long the result should be. No, there isn't an
  ## easier way to do this.
  resultLength <- rep(1.0, array_length)

  vapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         resultLength,
         what="double",
         endian="little",
         n=array_length,
         USE.NAMES=FALSE)
}
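A quick way to check that the vapply version is a drop-in replacement is to decode the same bytes both ways and compare. In the sketch below, DecodeWithSapply and DecodeWithVapply are illustrative names for the two bodies shown in this thread, and the input is random made-up data. The timings will vary by machine and R version, so no expected numbers are given.

```r
DecodeWithSapply <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin, what="double", endian="little", n=array_length)
}

DecodeWithVapply <- function(array_length, mz_array, intensity_array) {
  vapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin, numeric(array_length),
         what="double", endian="little", n=array_length,
         USE.NAMES=FALSE)
}

set.seed(1)
n <- 825  # the mean array_length quoted in the question
mz_raw <- writeBin(sort(runif(n, 300, 2000)), raw(), endian="little")
intensity_raw <- writeBin(runif(n, 0, 100), raw(), endian="little")

old <- DecodeWithSapply(n, mz_raw, intensity_raw)
new <- DecodeWithVapply(n, mz_raw, intensity_raw)

# Same numbers; only the column names differ because USE.NAMES=FALSE.
same_values <- isTRUE(all.equal(unname(old), new))
print(same_values)

print(system.time(for (i in 1:2000) DecodeWithSapply(n, mz_raw, intensity_raw)))
print(system.time(for (i in 1:2000) DecodeWithVapply(n, mz_raw, intensity_raw)))
```

SumInWindow indexes columns by position, so dropping the column names does not affect it.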

The Rprof output now looks like this:

$by.self
               self.time self.pct total.time total.pct
"<Anonymous>"           0.64    19.75       2.14     66.05
"DecodeSpectrum"        0.46    14.20       1.12     34.57
".Call"                 0.42    12.96       0.42     12.96
"FUN"                   0.38    11.73       0.38     11.73
"&"                     0.16     4.94       0.16      4.94
">"                     0.14     4.32       0.14      4.32
"c"                     0.14     4.32       0.14      4.32
"list"                  0.14     4.32       0.14      4.32
"vapply"                0.12     3.70       0.66     20.37
"mapply"                0.10     3.09       2.54     78.40
"simplify2array"        0.10     3.09       0.30      9.26
"<"                     0.08     2.47       0.08      2.47
"t"                     0.04     1.23       2.72     83.95
"as.vector"             0.04     1.23       0.08      2.47
"unlist"                0.04     1.23       0.08      2.47
"lapply"                0.04     1.23       0.04      1.23
"unique.default"        0.04     1.23       0.04      1.23
"NextSegment"           0.02     0.62       0.50     15.43
"odbcFetchRows"         0.02     0.62       0.46     14.20
"unique"                0.02     0.62       0.10      3.09
"array"                 0.02     0.62       0.04      1.23
"attr"                  0.02     0.62       0.02      0.62
"match.fun"             0.02     0.62       0.02      0.62
"odbcValidChannel"      0.02     0.62       0.02      0.62
"parent.frame"          0.02     0.62       0.02      0.62

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"       3.24    100.00      0.00     0.00
"t"                        2.72     83.95      0.04     1.23
"do.call"                  2.68     82.72      0.00     0.00
"mapply"                   2.54     78.40      0.10     3.09
"<Anonymous>"              2.14     66.05      0.64    19.75
"DecodeSpectrum"           1.12     34.57      0.46    14.20
"vapply"                   0.66     20.37      0.12     3.70
"NextSegment"              0.50     15.43      0.02     0.62
"odbcFetchRows"            0.46     14.20      0.02     0.62
".Call"                    0.42     12.96      0.42    12.96
"FUN"                      0.38     11.73      0.38    11.73
"simplify2array"           0.30      9.26      0.10     3.09
"&"                        0.16      4.94      0.16     4.94
">"                        0.14      4.32      0.14     4.32
"c"                        0.14      4.32      0.14     4.32
"list"                     0.14      4.32      0.14     4.32
"unique"                   0.10      3.09      0.02     0.62
"<"                        0.08      2.47      0.08     2.47
"as.vector"                0.08      2.47      0.04     1.23
"unlist"                   0.08      2.47      0.04     1.23
"lapply"                   0.04      1.23      0.04     1.23
"unique.default"           0.04      1.23      0.04     1.23
"array"                    0.04      1.23      0.02     0.62
"attr"                     0.02      0.62      0.02     0.62
"match.fun"                0.02      0.62      0.02     0.62
"odbcValidChannel"         0.02      0.62      0.02     0.62
"parent.frame"             0.02      0.62      0.02     0.62

$sample.interval
[1] 0.02

$sampling.time
[1] 3.24

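Some additional performance could probably be squeezed out by messing with the do.call('mapply', ...) call, but I'm satisfied enough with the performance as it is that I'm not willing to spend more time on it.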
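One possible direction, offered only as an unmeasured sketch: skip building the two-column matrix entirely and sum straight from the two decoded vectors, iterating over row indices with vapply. SumDecodeDirect and ProcessSegmentDirect are hypothetical names, the segment layout (a named list of columns with raw vectors for the binary values) is an assumption, and whether this is actually faster on the real data would need profiling.

```r
# Decode both arrays and sum the intensities whose m/z lies in the window.
SumDecodeDirect <- function(array_length, mz_array, intensity_array,
                            lower, upper) {
  mz <- readBin(mz_array, what="double", endian="little", n=array_length)
  intensity <- readBin(intensity_array, what="double", endian="little",
                       n=array_length)
  sum(intensity[mz > lower & mz < upper])
}

ProcessSegmentDirect <- function(spectra, window_bounds) {
  vapply(seq_along(spectra$array_length), function(i) {
    SumDecodeDirect(spectra$array_length[[i]],
                    spectra$mz_array[[i]],
                    spectra$intensity_array[[i]],
                    window_bounds[1], window_bounds[2])
  }, numeric(1))
}

# Two made-up rows for demonstration.
Encode <- function(x) writeBin(x, raw(), endian="little")
spectra <- list(
  array_length    = c(2L, 3L),
  mz_array        = list(Encode(c(500, 700)), Encode(c(650, 900, 1300))),
  intensity_array = list(Encode(c(1, 2)),     Encode(c(4, 8, 16)))
)

direct_sums <- ProcessSegmentDirect(spectra, c(600, 1200))
print(direct_sums)  # 2 12
```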

R: `unlist` using a lot of time when summing a subset of a matrix