Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs Armadillo

Question

Background

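Coming from R programming, I'm in the process of expanding into compiled code in the form of C/C++ with Rcpp. As a hands-on exercise on the effect of loop interchange (and C/C++ in general), I implemented equivalents of R's rowSums() and colSums() functions for matrices with Rcpp (I know these exist as Rcpp sugar and in Armadillo; this was just an exercise).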

Problem

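I have my C++ implementations of rowSums() and colSums(), along with Rcpp sugar and arma::sum() versions, in this matsums.cpp file. Mine are just simple loops like this:

NumericVector Cpp_colSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nc);
  for (int j = 0; j < nc; j++) {
    double sum = 0.0;
    for (int i = 0; i < nr; i++) {
      sum += x(i, j);
    }
    ans[j] = sum;
  }
  return ans;
}

NumericVector Cpp_rowSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nr);
  for (int j = 0; j < nc; j++) {
    for (int i = 0; i < nr; i++) {
      ans[i] += x(i, j);
    }
  }
  return ans;
}

(R matrices are stored in column-major order, so having the columns in the outer loop should be the more efficient approach. That is what I was testing originally.)

When running benchmarks on these I came across something I did not expect: there is a clear performance difference between the row sums and the column sums (see the benchmarks below):

With the built-in R functions, colSums() is about twice as fast as rowSums().
With my own Rcpp and the sugar versions, this is reversed: rowSums() is about twice as fast as colSums().
Finally, adding the Armadillo implementations, the operations are roughly equal (col sum maybe a bit faster, as I would have expected them to be in R too).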

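So my primary question is: why is Cpp_rowSums() significantly faster than Cpp_colSums()?

As a secondary interest, I'm also curious why the same difference is reversed in the R implementations. (I skimmed the C source, but couldn't really spot the significant differences.) (And thirdly, how does Armadillo get equal performance?)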

Benchmarks

I tested all 4 implementations of both functions on a 10,000 x 10,000 symmetric matrix:

Rcpp::sourceCpp("matsums.cpp")

set.seed(92136)
n <- 1e4 # build n x n test matrix
x <- matrix(rnorm(n), 1, n)
x <- crossprod(x, x) # symmetric

bench::mark(
  rowSums(x),
  colSums(x),
  Cpp_rowSums(x),
  Cpp_colSums(x),
  Sugar_rowSums(x),
  Sugar_colSums(x),
  Arma_rowSums(x),
  Arma_colSums(x)
)[, 1:7]

#> # A tibble: 8 x 7
#>   expression            min     mean   median      max `itr/sec` mem_alloc
#>   <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rowSums(x)        192.2ms  207.9ms  194.6ms  236.9ms      4.81    78.2KB
#> 2 colSums(x)         93.4ms   97.2ms   96.5ms  101.3ms     10.3     78.2KB
#> 3 Cpp_rowSums(x)     73.5ms   76.3ms     76ms   80.4ms     13.1     80.7KB
#> 4 Cpp_colSums(x)    126.5ms  127.6ms  126.8ms  130.3ms      7.84    80.7KB
#> 5 Sugar_rowSums(x)   73.9ms   75.6ms   74.3ms   79.4ms     13.2     80.7KB
#> 6 Sugar_colSums(x)  124.2ms  125.8ms  125.6ms  127.9ms      7.95    80.7KB
#> 7 Arma_rowSums(x)    73.2ms   74.7ms   73.9ms   79.3ms     13.4     80.7KB
#> 8 Arma_colSums(x)    62.8ms   64.4ms   63.7ms   69.6ms     15.5     80.7KB

(Again, you can find the C++ source file matsums.cpp here.)

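Platform:

> sessioninfo::platform_info()
 setting  value
 version  R version 3.5.1 (2018-07-02)
 os       Windows >= 8 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 tz       Europe/Helsinki
 date     2018-08-09

Update

Investigating further, I also wrote the same functions using R's traditional C interface: the source is here. I compiled the functions and tested again: the same phenomenon of row sums being faster than col sums persisted (benchmarks). I then also looked at the disassembly with objdump, but it seems to me (with my very limited understanding of asm) that the compiler doesn't really optimise the main loop bodies (rows, cols) any further than the C code?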

Answer 1

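First of all, let me show the timing statistics on my laptop. I use a 5000 x 5000 matrix, which is sufficient for benchmarking, and the microbenchmark package with 100 evaluations.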

Unit: milliseconds
             expr       min        lq      mean    median        uq       max
       colSums(x)  71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
   Cpp_colSums(x)  71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
 Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
  Arma_colSums(x)  39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
       rowSums(x) 177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
   Cpp_rowSums(x)  54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
 Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
  Arma_rowSums(x)  49.54407  49.77677  50.13739  49.90375  50.06791  58.29755

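C code in R core is not always better than what we can write ourselves. That Cpp_rowSums is faster than rowSums shows this. I don't feel competent to explain why R's version is slower than it should be. I will focus on this: how can we further optimize our own colSums and rowSums to beat Armadillo? Note that I write C, use R's old C interface, and compile with R CMD SHLIB.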

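Is there any substantial difference between `colSums` and `rowSums`?

If we have an n x n matrix that is much larger than the capacity of a CPU cache, colSums loads n x n data from RAM to cache, but rowSums loads twice as much, i.e., 2 x n x n.

Think about the resulting vector that holds the sums: how many times is this length-n vector loaded into cache from RAM? For colSums, it is loaded only once, but for rowSums, it is loaded n times. Each time you add a matrix column to it, it is loaded into cache but then evicted, since it is too big.

For a large n:

colSums causes n x n + n data to be loaded from RAM to cache;
rowSums causes n x n + n x n data to be loaded from RAM to cache.

In other words, rowSums is in theory less memory efficient, and is likely to be slower.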
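The two counts above can be written down as a tiny model. This is only a sketch of the arithmetic in this answer, not a measurement: the function names are hypothetical, and the model assumes the matrix and the accumulator vector are both too large to stay in cache.

```c
/* Modeled number of doubles moved from RAM to cache by colSums on an
   n x n matrix: the matrix once, plus the length-n result vector once. */
unsigned long long colsums_traffic(unsigned long long n) {
  return n * n + n;
}

/* Same model for the naive rowSums: the matrix once, plus the length-n
   accumulator vector reloaded once for each of the n columns. */
unsigned long long rowsums_traffic(unsigned long long n) {
  return n * n + n * n;
}
```

For n = 5000 and 8-byte doubles this model gives roughly 200 MB of traffic for colSums against 400 MB for rowSums.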

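How to improve the performance of `colSums`?

Since the data flow between RAM and cache is already optimal, the only remaining improvement is loop unrolling. Unrolling the inner loop (the summation loop) by a depth of 2 is sufficient, and we will see a 2x boost.

Loop unrolling works because it enables the CPU's instruction pipeline. If we do just one addition per iteration, no pipelining is possible; with two additions, instruction-level parallelism starts to work. We could also unroll the loop by a depth of 4, but my experience is that a depth-2 unrolling is sufficient to gain most of the benefit from loop unrolling.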
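As a minimal sketch of what depth-2 unrolling means for a summation loop (the function names are hypothetical and this is a stand-alone illustration, not the code used for the timings): the unrolled version keeps two independent accumulators, so the two additions in one iteration do not wait on each other.

```c
#include <stddef.h>

/* Plain reduction: every addition depends on the previous one. */
double sum_depth1(const double *x, size_t n) {
  double a0 = 0.0;
  for (size_t i = 0; i < n; i++) a0 += x[i];
  return a0;
}

/* Depth-2 unrolling with two independent accumulators. */
double sum_depth2(const double *x, size_t n) {
  double a0 = 0.0, a1 = 0.0;
  size_t i = n % 2;              /* 1 if there is a leftover element */
  if (i > 0) a0 = x[0];          /* handle the leftover element first */
  for (; i < n; i += 2) {
    a0 += x[i];                  /* 1st addition of this iteration */
    a1 += x[i + 1];              /* 2nd addition, independent of the 1st */
  }
  return a0 + a1;                /* combine the two accumulators */
}
```

Because the additions are grouped differently, the two versions may differ in the last bits for general floating-point input.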

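How to improve the performance of `rowSums`?

Optimizing the data flow is the first step. We first need to do cache blocking to reduce the data transfer from 2 x n x n down to n x n.

Chop this n x n matrix into a number of row chunks, each being 2040 x n (the last chunk may be smaller), then apply the ordinary rowSums chunk by chunk. For each chunk, the accumulator vector has length 2040, about half of what a 32KB CPU cache can hold. The other half is reserved for the matrix column being added to this accumulator vector. In this way, the accumulator vector can be held in the cache until all matrix columns in this chunk are processed. As a result, the accumulator vector is loaded into cache only once, hence the overall memory performance is as good as that of colSums.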
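The row-chunking idea can be sketched without any unrolling as follows. This is an illustrative stand-alone version with a hypothetical function name and signature; it assumes a dense column-major matrix whose leading dimension equals its number of rows.

```c
#include <stddef.h>

/* Row sums of an nr x nc column-major matrix A, processed in row chunks
   of at most `chunk` rows so the active accumulator slice stays small. */
void rowsums_blocked(const double *A, size_t nr, size_t nc,
                     size_t chunk, double *sum) {
  if (chunk == 0 || chunk > nr) chunk = nr;    /* fall back to one chunk */
  for (size_t r0 = 0; r0 < nr; r0 += chunk) {
    size_t len = nr - r0;
    if (len > chunk) len = chunk;              /* last chunk may be shorter */
    for (size_t i = 0; i < len; i++) sum[r0 + i] = 0.0;
    for (size_t j = 0; j < nc; j++) {          /* all columns of this chunk */
      const double *col = A + j * nr + r0;     /* rows r0.. of column j */
      for (size_t i = 0; i < len; i++) sum[r0 + i] += col[i];
    }
  }
}
```

Passing chunk equal to the number of rows reproduces the unblocked loop, which makes it easy to compare the two on the same input.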

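Now we can further apply loop unrolling to the rowSums in each chunk. Unrolling both the outer loop and the inner loop by a depth of 2, we will see a boost. Once the outer loop is unrolled, the chunk size should be reduced to 1360, as we now need space in the cache to hold three length-1360 vectors per outer loop iteration.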
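The chunk sizes 2040 and 1360 follow from simple arithmetic, sketched below. The helper name is hypothetical, and the numbers assume that "32KB" means 32768 bytes and that a double takes 8 bytes.

```c
#include <stddef.h>

/* Largest number of rows per chunk such that `n_vectors` double vectors
   of that length fit in `cache_bytes` bytes. */
size_t max_chunk_rows(size_t cache_bytes, size_t n_vectors) {
  return cache_bytes / (n_vectors * sizeof(double));
}
```

Under these assumptions the bound is 2048 rows for two vectors (accumulator plus one column) and 1365 rows for three vectors (accumulator plus two columns); the values 2040 and 1360 used in this answer sit just below those bounds.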

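C code: let's beat Armadillo

Writing code with loop unrolling can be a nasty job, as we now need to write several different versions of a function.

For colSums, we need two versions:

colSums_1x1: both inner and outer loops are unrolled with depth 1, i.e., this is a version with no loop unrolling;
colSums_2x1: no outer loop unrolling, while the inner loop is unrolled with depth 2.

For rowSums we can have up to four versions, rowSums_sxt, where s = 1 or 2 is the unrolling depth for the inner loop and t = 1 or 2 is the unrolling depth for the outer loop.

Code writing can be very tedious if we write each version one by one. After many years of frustration with this, I developed an "automatic code / version generation" trick using inlined template functions and macros.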

#include <stdlib.h>
#include <Rinternals.h>

static inline void colSums_template_sx1 (size_t s,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t nrc = nr % s, i;
  double *A_end = A + LDA * nc, a0, a1;

  for (; A < A_end; A += LDA) {
    a0 = 0.0; a1 = 0.0;  // accumulator register variables
    if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
    for (i = nrc; i < nr; i += s) {  // main loop of depth-s
      a0 += A[i];  // 1st iteration
      if (s > 1) a1 += A[i + 1];  // 2nd iteration
      }
    if (s > 1) a0 += a1;  // combine two accumulators
    *sum++ = a0;  // write-back
    }

  }

#define macro_define_colSums(s, colSums_sx1) \
SEXP colSums_sx1 (SEXP matA) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); \
  double *sum = REAL(result); \
  colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); \
  UNPROTECT(1); \
  return result; \
  }

macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)

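The template function computes (in R syntax) sum <- colSums(A[1:nr, 1:nc]) for a matrix A with LDA (leading dimension of A) rows. The parameter s selects the version of inner loop unrolling. The template function looks horrible at first glance, as it contains many if statements. However, it is declared static inline. If it is called with a known constant 1 or 2 passed to s, an optimizing compiler is able to evaluate those if statements at compile time, eliminate unreachable code, and drop "set-but-not-used" variables (register variables that are initialized and modified but never written back to RAM).

The macro is used for function declaration. Accepting a constant s and a function name, it generates a function with the desired loop unrolling version.

The following is for rowSums.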
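The same template-plus-macro pattern can be tried outside of R on a simpler function. The sketch below applies it to a dot product; every name in it (dot_template, DEFINE_DOT, dot_1x, dot_2x) is hypothetical and only serves to illustrate the pattern described above.

```c
#include <stddef.h>

/* "Template": s is expected to be a compile-time constant (1 or 2). */
static inline double dot_template(size_t s, const double *x,
                                  const double *y, size_t n) {
  double a0 = 0.0, a1 = 0.0;
  size_t i = n % s;                        /* number of leftover iterations */
  if (i > 0) a0 = x[0] * y[0];             /* "fractional loop" */
  for (; i < n; i += s) {                  /* main loop of depth s */
    a0 += x[i] * y[i];
    if (s > 1) a1 += x[i + 1] * y[i + 1];  /* dead code when s == 1 */
  }
  if (s > 1) a0 += a1;                     /* combine the accumulators */
  return a0;
}

/* Macro that stamps out one named version per unrolling depth. */
#define DEFINE_DOT(s, name) \
double name(const double *x, const double *y, size_t n) { \
  return dot_template(s, x, y, n); \
}

DEFINE_DOT(1, dot_1x)
DEFINE_DOT(2, dot_2x)
```

Each generated function passes a literal constant for s, which is what gives the compiler the chance to specialize the inlined template body.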

static inline void rowSums_template_sxt (size_t s, size_t t,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t ncr = nc % t, nrr = nr % s, i;
  double *A_end = A + LDA * nc, *B;
  double a0, a1;

  for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization

  if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
    if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      sum[i] += A[i];
      if (s > 1) sum[i + 1] += A[i + 1];
      }
    A += LDA;
    }

  for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
    if (t > 1) B = A + LDA;
    if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
      a0 = A[0]; if (t > 1) a0 += A[LDA];
      sum[0] += a0;
      }
    for(i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      a0 = A[i]; if (t > 1) a0 += B[i];
      sum[i] += a0;
      if (s > 1) {
        a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
        sum[i + 1] += a1;
        }
      }
    }

  }

#define macro_define_rowSums(s, t, rowSums_sxt) \
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); \
  double *sum = REAL(result); \
  size_t block_size = (size_t)asInteger(chunk_size); \
  size_t i, block_size_i; \
  if (block_size > nrow_A) block_size = nrow_A; \
  for (i = 0; i < nrow_A; i += block_size_i) { \
    block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; \
    rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); \
    A += block_size_i; sum += block_size_i; \
    } \
  UNPROTECT(1); \
  return result; \
  }

macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)

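Note that the template function now accepts s and t, and the function defined by the macro applies row chunking.

Even though I've left some comments in the code, it is probably still not easy to follow, but I can't take more time to explain in greater detail.

To use them, copy and paste them into a C file called "matSums.c" and compile it with R CMD SHLIB -c matSums.c.

For the R side, define the following functions in "matSums.R".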

colSums_zheyuan <- function (A, s) {
  dyn.load("matSums.so")
  if (s == 1) result <- .Call("colSums_1x1", A)
  if (s == 2) result <- .Call("colSums_2x1", A)
  dyn.unload("matSums.so")
  result
  }

rowSums_zheyuan <- function (A, chunk.size, s, t) {
  dyn.load("matSums.so")
  if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
  if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
  if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
  if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
  dyn.unload("matSums.so")
  result
  }

Now let's run a benchmark, again with a 5000 x 5000 matrix.

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")

microbenchmark("col0" = colSums(A),
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = rowSums(A),
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

On my laptop, I get:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
 col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
 col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
 row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
 row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
 row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
 row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
 row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100

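Note how loop unrolling speeds up both colSums and rowSums. With full optimization ("col2" and "row4"), we beat Armadillo (see the timing table at the beginning of this answer).

The row chunking strategy does not clearly yield a benefit in this case. Let's try a matrix with millions of rows.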

A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

I get:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
 row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
 row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
 row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100

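In this case we observe the gains from cache blocking.

Final thoughts

Basically, this answer has addressed all the issues, except for the following:

why R's rowSums is less efficient than it should be;
why, without any optimization, rowSums ("row1") is faster than colSums ("col1").

Again, I cannot explain the first one, and actually I don't care much about it, since we can easily write a version that is faster than R's built-in version.

The second one is definitely worth pursuing. I copy in my comments from our discussion room for the record.

This issue comes down to this: "Why is adding up a single vector slower than adding two vectors element-wise?"

I see similar phenomena from time to time. The first time I encountered this strange behavior was a few years ago, when I coded my own matrix-matrix multiplication. I found that DAXPY is faster than DDOT.

DAXPY does this: y += a * x, where x and y are vectors and a is a scalar; DDOT does this: a += x * y.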
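Written as plain loops, the two operations look like this. These are reference-style sketches with hypothetical names and simplified signatures (no stride arguments), not the BLAS routines themselves.

```c
#include <stddef.h>

/* DAXPY-style update: y[i] += a * x[i]; iterations are independent. */
void daxpy_ref(size_t n, double a, const double *x, double *y) {
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* DDOT-style reduction: every iteration updates the same accumulator. */
double ddot_ref(size_t n, const double *x, const double *y) {
  double acc = 0.0;
  for (size_t i = 0; i < n; i++) acc += x[i] * y[i];
  return acc;
}
```

The structural difference is that ddot_ref carries acc from one iteration to the next, while each iteration of daxpy_ref touches its own element of y.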

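Given that DDOT is a reduction operation, I expected it to be faster than DAXPY. But no, DAXPY is faster.

However, as soon as I unroll the loops in the triple loop nest of the matrix-matrix multiplication, DDOT is much faster than DAXPY.

A very similar thing happens in your case. A reduction operation, a = x[1] + x[2] + ... + x[n], is slower than an element-wise add, y[i] += x[i]. But once loop unrolling is done, the advantage of the latter is lost.

I am not sure whether the following explanation is true, as I have no evidence.

The reduction operation has a dependency chain, so the computation is strictly serial; on the other hand, the element-wise operation has no dependency chain, so the CPU may do better with it.

As soon as we unroll the loops, each iteration has more arithmetic to do and the CPU can do pipelining in both cases. The true advantage of the reduction operation can then be observed.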

In reply to Jaap on using `rowSums2` and `colSums2` from `matrixStats`

Still using the 5000 x 5000 example above.

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")
library(matrixStats)  ## NEW

microbenchmark("col0" = base::colSums(A),
               "col*" = matrixStats::colSums2(A),  ## NEW
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = base::rowSums(A),
               "row*" = matrixStats::rowSums2(A),  ## NEW
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
 col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
 col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
 col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
 row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
 row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
 row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
 row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
 row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
 row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100

I see no performance advantage in rowSums2 and colSums2.

Answer 2

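"Why is Cpp_rowSums() significantly faster than Cpp_colSums()?" - when fetching "row major", the CPU prefetcher can predict what you are doing and fetch the next bunch of data you are going to need from main memory into the CPU cache before you need it. This speeds up access to the data.

When you access "column major", the prefetcher has a harder time predicting what you will need next, so it won't be stuffing things into the cache ahead of time as effectively (if at all), and this slows you down.

CPUs love accessing data linearly. If you don't do what they love, you pay the price of cache misses and main memory access latency.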
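To make the linear versus non-linear access pattern concrete, here is a generic sketch for a matrix stored in column-major order, which is how R stores matrices according to the question. The function names are hypothetical. Note that in the question's Cpp_colSums and Cpp_rowSums the inner loop runs over the row index i in both cases, so this sketch illustrates the general point rather than the difference between those two functions.

```c
#include <stddef.h>

/* Total of a column-major nr x nc matrix, walking down each column:
   consecutive reads touch consecutive addresses. */
double total_contiguous(const double *A, size_t nr, size_t nc) {
  double acc = 0.0;
  for (size_t j = 0; j < nc; j++)
    for (size_t i = 0; i < nr; i++) acc += A[i + j * nr];
  return acc;
}

/* Same total, walking along each row: consecutive reads are nr doubles
   apart in memory. */
double total_strided(const double *A, size_t nr, size_t nc) {
  double acc = 0.0;
  for (size_t i = 0; i < nr; i++)
    for (size_t j = 0; j < nc; j++) acc += A[i + j * nr];
  return acc;
}
```

Both functions visit the same elements; only the order of the visits, and therefore the memory access pattern, differs.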
