Fastest way to count the all-TRUE rows of a LogicalMatrix in R / C++ / Rcpp

Time: 2015-09-27 16:40:10

Tags: c++ r performance matrix rcpp

I need to count the number of rows in a LogicalMatrix that are all TRUE.

Because I need to be able to do this between 1 and 25 million times on a fairly regular basis, speed really matters:

My current best:

I think the most efficient / fastest single-process way to do this is the Rcpp function hm2 shown further down.

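My limited profiling abilities suggest that the vast majority of the time is spent in if(r_ttl == xcols){... . I can't think of a different algorithm that would be faster (I tried breaking out of the loop as soon as a FALSE was found, and it was much slower).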
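The two approaches described above, summing each row versus leaving the inner loop at the first FALSE, can be sketched in plain C++ as follows. This is only an illustration: a `std::vector<int>` in column-major order stands in for the `LogicalMatrix` (an assumption of this sketch, as are the function names), and NA values are not handled.

```cpp
#include <cstddef>
#include <vector>

// x holds an nrow-by-ncol matrix of 0/1 values in column-major order,
// so element (row, col) lives at index col * nrow + row.
// Variant 1: sum every row, then compare the total with the column count.
int count_all_true_sum(const std::vector<int>& x, int nrow, int ncol) {
  int n_all_true = 0;
  for (int row = 0; row < nrow; ++row) {
    int r_ttl = 0;
    for (int col = 0; col < ncol; ++col) {
      r_ttl += x[static_cast<std::size_t>(col) * nrow + row];
    }
    if (r_ttl == ncol) ++n_all_true;
  }
  return n_all_true;
}

// Variant 2: leave the inner loop at the first FALSE found in a row.
int count_all_true_break(const std::vector<int>& x, int nrow, int ncol) {
  int n_all_true = 0;
  for (int row = 0; row < nrow; ++row) {
    bool all_true = true;
    for (int col = 0; col < ncol; ++col) {
      if (!x[static_cast<std::size_t>(col) * nrow + row]) {
        all_true = false;
        break;
      }
    }
    if (all_true) ++n_all_true;
  }
  return n_all_true;
}
```

Both variants return the same count; which one is faster depends on the data and the compiler, and the early exit adds a data-dependent branch to the inner loop.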

Details that can be assumed:

I can assume that:

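  1. The matrix always has fewer than 10 million rows.
  2. All output matrices from upstream will have the same number of cols (for a given session/process/thread).
  3. There will never be more than 2326 cols per matrix.

Minimal example: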

    m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
    head(m)
    #>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
    #> [1,] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
    #> [2,] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
    #> [3,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
    #> [4,]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
    #> [5,]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
    #> [6,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    int hm(const LogicalMatrix& x){
      const int xrows = x.nrow();
      const int xcols = x.ncol();
      int n_all_true = 0;
    
      for(size_t row = 0; row < xrows; row++) {
        int r_ttl = 0;
        for(size_t col = 0; col < xcols; col++) {
          r_ttl += x(row,col);
        }
        if(r_ttl == xcols){
          n_all_true++;
        }
      }
      return n_all_true;
    }
    

I don't understand why, but on my machine it is faster if I hard-code the number of cols (if anyone can explain why, that would be great too):

    // [[Rcpp::export]]
    int hm2(const LogicalMatrix& x){
      const int xrows = x.nrow();
      // const int xcols = x.ncol();
      int n_all_true = 0;
    
      for(size_t row = 0; row < xrows; row++) {
        int r_ttl = 0;
        for(size_t col = 0; col < 10; col++) {
          r_ttl += x(row,col);
        }
        if(r_ttl == 10){
          n_all_true += 1;
        }
      }
      return n_all_true;
    }
    

Timings:

    microbenchmark(hm(m), hm2(m), times = 1000)
    #>  Unit: microseconds
    #>   expr     min       lq     mean  median       uq      max neval
    #>  hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711  1000
    #> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280  683.221  1000
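Both hm and hm2 walk along a row in the inner loop. R stores matrices column by column, so consecutive `x(row, col)` reads in that loop are `nrow` elements apart in memory. The sketch below is not from the original post and has not been benchmarked here; it shows a column-wise alternative that keeps one flag per row and sweeps each column contiguously. A `std::vector<int>` in column-major order stands in for the `LogicalMatrix`, the function name is hypothetical, and NA values are not handled.

```cpp
#include <cstddef>
#include <vector>

// x holds an nrow-by-ncol matrix of 0/1 values in column-major order.
int count_all_true_colwise(const std::vector<int>& x, int nrow, int ncol) {
  // still_true[row] stays 1 only while every column seen so far was TRUE.
  std::vector<int> still_true(nrow, 1);
  for (int col = 0; col < ncol; ++col) {
    const int* column = x.data() + static_cast<std::size_t>(col) * nrow;
    for (int row = 0; row < nrow; ++row) {
      still_true[row] &= column[row];  // contiguous reads and writes
    }
  }
  int n_all_true = 0;
  for (int row = 0; row < nrow; ++row) n_all_true += still_true[row];
  return n_all_true;
}
```

The trade-off is an extra buffer of `nrow` ints in exchange for sequential memory access; whether that pays off for a 10-column matrix would need to be measured.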
    

4 Answers:

Answer 0: (score: 4)

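Using OpenMP (I now see that the question asks for a single-threaded solution) and minimal code changes, it is still possible to go about 30% faster, at least on my 4-core Xeon. I have a feeling that a logical AND reduction might do better, but I will leave that for another day: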

library(Rcpp)
library(microbenchmark)

m_rows <- 10L
m_cols <- 50000L
rebuild = FALSE

cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)

hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}

cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)

macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < ',NCOL,'; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == ',NCOL,'){
  n_all_true++;
  }
  }
  return n_all_true;
  }')
}

macroExpand_omp <- function(NCOL) {
  paste0('int hm_npjc_omp(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;

  #pragma omp parallel for reduction(+:n_all_true)
  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < ',NCOL,'; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == ',NCOL,'){
  n_all_true++;
  }
  }
  return n_all_true;
  }')
}

cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows),  plugins = "openmp", rebuild = rebuild)

cppFunction('int hm_omp(const LogicalMatrix& x) {
const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}',  plugins = "openmp", rebuild = rebuild)

# using != as inner loop control - no difference, using pre-increment in n_all_true, no diff, static vs dynamic OpenMP, attempted to direct clang and gcc to unroll loops: didn't seem to work

set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
                     hm_omp(m), hm_npjc_omp(m),
                     times = 1000))

I used GCC 4.9; Clang 3.7 gave similar results. This gives:

Unit: microseconds
           expr      min        lq       mean   median        uq       max neval
          hm(m)  614.074  640.9840  643.24836  641.462  642.9920   976.694  1000
         hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060  1000
      hm_jmu(m)  591.179  612.3590  625.84484  612.881  613.8825  6874.428  1000
     hm_npjc(m)   62.958   63.8965   64.89338   64.346   65.0550   144.487  1000
      hm_omp(m)   91.892   92.6050  165.21507   93.758   98.8230 10026.583  1000
 hm_npjc_omp(m)   43.129   43.6820  129.15842   44.458   47.0860 17636.875  1000

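The OpenMP magic is just a matter of including -fopenmp at compile and link time (handled by Rcpp via plugins = "openmp"), plus the line #pragma omp parallel for reduction(+:n_all_true) schedule(static). In this case the outer loop is parallelized and the result is a sum, so the reduction clause tells the compiler to split the problem up and reduce the partial sums from each part into a single sum. schedule(static) describes how the compiler and/or runtime distributes the loop iterations among the threads. Here the widths of both the inner and outer loops are known, so static is preferred; if, say, the inner loop size varied a lot, dynamic might balance the work between threads better.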
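Stripped of the Rcpp wrapper, the reduction described above can be sketched as below. This is a minimal illustration rather than the answer's exact code: a column-major `std::vector<int>` stands in for the `LogicalMatrix`, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// x holds an nrow-by-ncol matrix of 0/1 values in column-major order.
int count_all_true_omp(const std::vector<int>& x, int nrow, int ncol) {
  int n_all_true = 0;
  // Each thread works on a private copy of n_all_true that starts at 0;
  // the private copies are added together when the loop ends.
  // Without -fopenmp the pragma is ignored and the loop runs serially,
  // giving the same result.
  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for (int row = 0; row < nrow; ++row) {
    int r_ttl = 0;
    for (int col = 0; col < ncol; ++col) {
      r_ttl += x[static_cast<std::size_t>(col) * nrow + row];
    }
    if (r_ttl == ncol) ++n_all_true;
  }
  return n_all_true;
}
```

The loop variable is a signed int here, which also keeps the loop acceptable to older OpenMP implementations that require signed loop counters.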

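It is possible to tell OpenMP explicitly how many loop iterations each thread should get, but it is usually better to let the compiler decide.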

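On another note, I tried hard to use compiler flags such as -funroll-loops to replace the ugly but fast hard-coding of the inner loop width (which is not a general solution to the problem). I tested these to no avail: see https://github.com/jackwasey/optimization-comparison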

Answer 1: (score: 3)

Here is your function, along with the output from compiling it via cppFunction:

require(Rcpp)
cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}')
# file.*.cpp: In function ‘int hm(const LogicalMatrix&)’:
# file.*.cpp:12:29: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
#    for(size_t row = 0; row < xrows; row++) {
#                              ^
# file.*.cpp:14:31: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
#      for(size_t col = 0; col < xcols; col++) {
#                                ^

Note the warnings. I get some improvement by using int instead of size_t for row and col. Beyond that, I can't find much room for improvement.
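The warning exists because, in a comparison such as `row < xrows` with `row` a `size_t` and `xrows` an `int`, the `int` is converted to the unsigned type before the comparison is made. The sketch below (hypothetical helper names) shows the conversion explicitly. It illustrates why compilers flag the comparison; it does not by itself explain the timing difference.

```cpp
#include <cstddef>

// Mixed comparison: n is converted to std::size_t before comparing,
// so a negative n wraps around to a very large unsigned value.
bool less_mixed(std::size_t i, int n) {
  return i < static_cast<std::size_t>(n);  // what `i < n` does implicitly
}

// Same-type comparison: no conversion and no -Wsign-compare warning.
bool less_signed(int i, int n) {
  return i < n;
}
```

With non-negative dimensions, as returned by `nrow()` and `ncol()`, both forms give the same answer, so the original loops are still correct.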

Here is my code, with benchmarks and a reproducible example:

require(Rcpp)
require(microbenchmark)

cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}')

hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}

set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), times=1000)
# Unit: microseconds
#       expr      min        lq   median        uq       max neval
#      hm(m)  578.844  594.1460  607.357  636.4410   858.347  1000
#     hm3(m) 6389.014 6452.9595 6476.197 6735.5465 33720.870  1000
#  hm_jmu(m)  409.920  415.0395  424.401  449.0075   650.127  1000

Answer 2: (score: 1)

I was very curious why "baking in" the number of cols as a constant would make a difference, so I played around with the idea.

Before:

library(Rcpp)
library(microbenchmark)
cppFunction('int hm(const LogicalMatrix& x)
            {
            const int xrows = x.nrow();
            const int xcols = x.ncol();
            int n_all_true = 0;

            for(size_t row = 0; row < xrows; row++) {
            int r_ttl = 0;
            for(size_t col = 0; col < xcols; col++) {
            r_ttl += x(row,col);
            }
            if(r_ttl == 10){
            n_all_true++;
            }
            }
            return n_all_true;
            }')

hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}

cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < xcols; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == xcols){
  n_all_true++;
  }
  }
  return n_all_true;
  }')

Baking in the number of cols

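I have simply taken Joshua's solution here, but generating a tailored function via code-gen works well on my machine. This feels like a hack to me, but I thought I would post it anyway: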

macroExpand <- function(NCOL) {
paste0('int hm_npjc(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;

  for(int row = 0; row < xrows; row++) {
  int r_ttl = 0;
  for(int col = 0; col < ',NCOL,'; col++) {
  r_ttl += x(row,col);
  }
  if(r_ttl == ',NCOL,'){
  n_all_true++;
  }
  }
  return n_all_true;
  }')
}

cppFunction(macroExpand(10L))
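The same idea of a column count known at compile time can also be expressed with a C++ template instead of pasting the number into a source string. This is a swapped-in technique, not what the answer does; the sketch uses a column-major `std::vector<int>` as a stand-in for the `LogicalMatrix`, and the function names and the set of supported widths are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// NCOL is a compile-time constant, so the inner loop has a fixed trip
// count that the compiler is free to unroll.
template <int NCOL>
int count_all_true_fixed(const std::vector<int>& x, int nrow) {
  int n_all_true = 0;
  for (int row = 0; row < nrow; ++row) {
    int r_ttl = 0;
    for (int col = 0; col < NCOL; ++col) {
      r_ttl += x[static_cast<std::size_t>(col) * nrow + row];
    }
    if (r_ttl == NCOL) ++n_all_true;
  }
  return n_all_true;
}

// Run-time dispatch to the instantiations that were compiled in; the
// widths 2, 3 and 10 are purely illustrative.
int count_all_true_dispatch(const std::vector<int>& x, int nrow, int ncol) {
  switch (ncol) {
    case 2:  return count_all_true_fixed<2>(x, nrow);
    case 3:  return count_all_true_fixed<3>(x, nrow);
    case 10: return count_all_true_fixed<10>(x, nrow);
    default: return -1;  // width not compiled in
  }
}
```

Since every matrix in a session is stated to have the same number of cols, a single instantiation per session would be enough; a general fallback loop could replace the `-1` branch.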

Results:

set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m), times=1000)
#> Unit: microseconds
#>        expr      min        lq      mean    median        uq       max neval
#>       hm(m)  596.808  600.1870  722.5140  629.1750  709.3875  1680.379  1000
#>      hm3(m) 2189.164 2353.6700 2972.1463 2509.4630 2956.7675 49930.471  1000
#>   hm_jmu(m)  574.137  576.5160  678.6475  600.4775  665.2800  2240.988  1000
#>  hm_npjc(m)   81.978   83.1855  102.7646   89.2160  101.0400   380.884  1000

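I would like to note that I really don't understand why the compiler does not optimize to the same solution; if anyone has some insight into that, it would be great.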

Provenance

devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.2.2 (2015-08-14)
#>  system   x86_64, darwin13.4.0        
#>  ui       RStudio (0.99.691)          
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2015-09-27
#> Packages ------------------------------------------------------------------
#>  package        * version    date       source                         
#>  clipr            0.1.1      2015-09-04 CRAN (R 3.2.0)                 
#>  colorspace       1.2-6      2015-03-11 CRAN (R 3.2.0)                 
#>  devtools         1.9.1      2015-09-11 CRAN (R 3.2.0)                 
#>  digest           0.6.8      2014-12-31 CRAN (R 3.2.0)                 
#>  evaluate         0.8        2015-09-18 CRAN (R 3.2.0)                 
#>  formatR          1.2.1      2015-09-18 CRAN (R 3.2.0)                 
#>  ggplot2          1.0.1      2015-03-17 CRAN (R 3.2.0)                 
#>  gtable           0.1.2      2012-12-05 CRAN (R 3.2.0)                 
#>  htmltools        0.2.6      2014-09-08 CRAN (R 3.2.0)                 
#>  knitr            1.10.5     2015-05-06 CRAN (R 3.2.0)                 
#>  magrittr         1.5        2014-11-22 CRAN (R 3.2.0)                 
#>  MASS             7.3-43     2015-07-16 CRAN (R 3.2.2)                 
#>  memoise          0.2.1      2014-04-22 CRAN (R 3.2.0)                 
#>  microbenchmark * 1.4-2      2014-09-28 CRAN (R 3.2.0)                 
#>  munsell          0.4.2      2013-07-11 CRAN (R 3.2.0)                 
#>  plyr             1.8.3      2015-06-12 CRAN (R 3.2.0)                 
#>  proto            0.3-10     2012-12-22 CRAN (R 3.2.0)                 
#>  Rcpp           * 0.12.1     2015-09-10 CRAN (R 3.2.0)                 
#>  reprex           0.0.0.9001 2015-09-26 Github (jennybc/reprex@1d6584a)
#>  reshape2         1.4.1      2014-12-06 CRAN (R 3.2.0)                 
#>  rmarkdown        0.7        2015-06-13 CRAN (R 3.2.0)                 
#>  rstudioapi       0.3.1      2015-04-07 CRAN (R 3.2.0)                 
#>  scales           0.3.0      2015-08-25 CRAN (R 3.2.0)                 
#>  stringi          0.5-5      2015-06-29 CRAN (R 3.2.0)                 
#>  stringr          1.0.0      2015-04-30 CRAN (R 3.2.0)

Answer 3: (score: 0)

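How about exploiting the fact that TRUE is coerced to 1 by many numeric operators, so that the work is vectorized inside functions that are already written in C? For example: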

set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
sum(rowSums(m) == ncol(m))
## [1] 47

microbenchmark::microbenchmark(sum(rowSums(m) == ncol(m)))
## Unit: milliseconds
##                       expr      min       lq     mean   median       uq     max neval
## sum(rowSums(m) == ncol(m)) 1.715399 1.840763 1.873422 1.861552 1.905841 2.02524   100

See Chapter 3 of the R Inferno.

Edit, to compare the answers directly:

(Here I pasted the two C++ functions into a file on my desktop called test.cpp, with the usual Rcpp header info.)

require(Rcpp)
sourceCpp("~/Desktop/test.cpp")

set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)

hm3 <- function(m) {
    nc <- ncol(m)
    sum(rowSums(m) == nc)
}

microbenchmark::microbenchmark(hm(m), hm2(m), hm3(m), times = 1000)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq        max neval
##  hm(m) 4.996005 5.036732 5.169672 5.089707 5.194580   9.961581  1000
## hm2(m) 5.031222 5.074990 5.228239 5.128106 5.242909  10.109776  1000
## hm3(m) 1.626933 1.878014 2.205195 1.922608 2.014012 226.894190  1000

I note here that the reference to the R Inferno is not really appropriate, because it does not apply to C++, but it is still a mantra to live by. :-)