Performance improvement of a sparse-sparse matrix multiplication algorithm in R

Time: 2018-10-07 23:50:26

Tags: r openmp sparse-matrix rcpp armadillo

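R on my Mac is linked against OpenBLAS. When I perform sparse-sparse multiplication in R, or in Rcpp with Armadillo, the "%CPU" usage suggests that it is not multithreaded the way dense multiplication is. In terms of speed, single-threaded sparse-sparse multiplication in R or Armadillo also appears to be slower than in Matlab.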

To address this, I implemented F. G. Gustavson's algorithm (https://dl.acm.org/citation.cfm?id=355796) for sparse matrix multiplication in Rcpp, using Armadillo's SpMat (sp_mat) container.
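For readers who do not know the algorithm, here is a minimal sketch of the column-wise idea in plain C++, independent of Armadillo. The `Csc` struct and the function name `spgemm_gustavson` are hypothetical and exist only for this illustration. The sketch assumes 0-based compressed sparse column (CSC) storage, which is the layout sp_mat uses.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal CSC container, for illustration only.
struct Csc {
    int nrow;
    int ncol;
    std::vector<int> colptr;     // size ncol + 1
    std::vector<int> rowind;     // size nnz
    std::vector<double> values;  // size nnz
};

// C = A * B, built one column of C at a time.
// Row indices inside each column of C come out in first-touch order,
// so they are generally NOT sorted.
Csc spgemm_gustavson(const Csc& A, const Csc& B) {
    assert(A.ncol == B.nrow);
    Csc C;
    C.nrow = A.nrow;
    C.ncol = B.ncol;
    C.colptr.assign(B.ncol + 1, 0);

    std::vector<double> x(A.nrow, 0.0);  // dense accumulator for one column of C
    std::vector<int> mark(A.nrow, -1);   // mark[k] == j means row k was already seen in column j

    for (int j = 0; j < B.ncol; ++j) {
        C.colptr[j] = static_cast<int>(C.rowind.size());
        for (int bp = B.colptr[j]; bp < B.colptr[j + 1]; ++bp) {
            const int l = B.rowind[bp];   // column of A selected by this entry of B
            const double b = B.values[bp];
            for (int ap = A.colptr[l]; ap < A.colptr[l + 1]; ++ap) {
                const int k = A.rowind[ap];
                if (mark[k] != j) {       // first contribution to C(k, j)
                    mark[k] = j;
                    C.rowind.push_back(k);
                    x[k] = A.values[ap] * b;
                } else {                  // further contributions accumulate
                    x[k] += A.values[ap] * b;
                }
            }
        }
        // Gather the accumulated values for column j.
        for (int p = C.colptr[j]; p < static_cast<int>(C.rowind.size()); ++p) {
            C.values.push_back(x[C.rowind[p]]);
        }
    }
    C.colptr[B.ncol] = static_cast<int>(C.rowind.size());
    return C;
}
```

Initializing the marker array to -1 matters with 0-based column indices: a zero or uninitialized marker could be mistaken for "already seen in column 0".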

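If I skip ordering the row indices (which is what a direct implementation of the algorithm gives), I see an improvement (see below), but the standard ordering step makes it slower than R (edited following mtall's comment). I am not an expert in Rcpp / RcppArmadillo / C++, and I am looking for help with two things: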

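  • How can I make the single-threaded sp_sp_gc_ord function more efficient and faster?

  • My lame attempt at multithreading sp_sp_gc_ord with OpenMP crashes R. I have commented out the omp directive below. I have looked at the OpenMP discussions in the Rcpp Gallery (http://gallery.rcpp.org/tags/openmp/) but could not find the problem.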
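Both points can be illustrated together with a hedged sketch in plain C++ (no Armadillo); this is not a tested drop-in replacement. The commented-out directive in the code below shares the counter `ip` and the scratch vectors `x` and `xb` across threads, and each column's start offset depends on how much the previous columns wrote, so the loop iterations are not independent. That is a likely, though unconfirmed, cause of the crash. The sketch assumes a hypothetical `Csc` struct and function name. It builds each output column in its own buffer with per-thread scratch space, sorts only that column's row indices instead of transposing twice, and concatenates the columns serially at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical minimal CSC container, for illustration only.
struct Csc {
    int nrow;
    int ncol;
    std::vector<int> colptr;     // size ncol + 1
    std::vector<int> rowind;     // size nnz
    std::vector<double> values;  // size nnz
};

// C = A * B with sorted row indices in every column.
// Compiles and runs serially if OpenMP is not enabled.
Csc spgemm_sorted_columns(const Csc& A, const Csc& B) {
    assert(A.ncol == B.nrow);
    const int n = B.ncol;

    // Phase 1: every column of C gets its own buffer, so no shared counter is needed.
    std::vector<std::vector<std::pair<int, double>>> cols(n);

    #pragma omp parallel
    {
        // Scratch space is declared inside the parallel region: one copy per thread.
        std::vector<double> x(A.nrow, 0.0);
        std::vector<int> mark(A.nrow, -1);
        std::vector<int> touched;

        #pragma omp for schedule(dynamic, 64)
        for (int j = 0; j < n; ++j) {
            touched.clear();
            for (int bp = B.colptr[j]; bp < B.colptr[j + 1]; ++bp) {
                const int l = B.rowind[bp];
                const double b = B.values[bp];
                for (int ap = A.colptr[l]; ap < A.colptr[l + 1]; ++ap) {
                    const int k = A.rowind[ap];
                    if (mark[k] != j) {
                        mark[k] = j;
                        touched.push_back(k);
                        x[k] = A.values[ap] * b;
                    } else {
                        x[k] += A.values[ap] * b;
                    }
                }
            }
            // Sort only the rows touched in this column.
            std::sort(touched.begin(), touched.end());
            cols[j].reserve(touched.size());
            for (int k : touched) {
                cols[j].emplace_back(k, x[k]);
            }
        }
    }

    // Phase 2 (serial): prefix sum over column sizes, then concatenate.
    Csc C;
    C.nrow = A.nrow;
    C.ncol = n;
    C.colptr.assign(n + 1, 0);
    for (int j = 0; j < n; ++j) {
        C.colptr[j + 1] = C.colptr[j] + static_cast<int>(cols[j].size());
    }
    C.rowind.reserve(C.colptr[n]);
    C.values.reserve(C.colptr[n]);
    for (int j = 0; j < n; ++j) {
        for (const auto& entry : cols[j]) {
            C.rowind.push_back(entry.first);
            C.values.push_back(entry.second);
        }
    }
    return C;
}
```

The per-column buffers cost extra allocations and a final copy, so whether this beats the double transpose or R's `%*%` would have to be measured. The final sizes are known after phase 1, so the density guess `p` is no longer needed.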




#### Rcpp functions

#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace arma;

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
sp_mat sp_sp_gc_ord(const arma::sp_mat &A, const arma::sp_mat &B, double p){

  // This function evaluates A * B where both A & B are sparse and the resultant
  // product is also sparse

  // define matrix sizes
  const int mA= A.n_rows;
  const int nB= B.n_cols;

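  // estimated upper bound on the number of non-zeros in the resultant matrix;
  // p is the assumed density of C (computed in double to avoid int overflow of mA * nB)
  const uword nnzC = static_cast<uword>(std::ceil(static_cast<double>(mA) * nB * p));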

  // initialize colptr, row_index and value vectors for the resultant sparse matrix
  urowvec colptrC(nB+1);
  uvec rowvalC(nnzC);
  colvec nzvalC(nnzC);


  // counters and other variables
  unsigned int i, jp, j, kp, k, vp; 
  unsigned int ip = 0;
  double nzB, nzA; 
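  ivec xb(mA);
  xb.fill(-1);              // -1 = "row not seen yet"; otherwise a stale value could match column 0
  vec x(mA, fill::zeros);   // dense accumulator for the current column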

  // loop logic: outer loop over columns of B and inner loop over columns of A and then aggregate

  //  #pragma omp parallel for shared(colptrC,rowvalC,nzvalC,x,xb,ip,A,B) private(j,nzA,nzB,kp,i,jp,kp,k,vp) default(none) schedule(auto) 
  for(i=0; i< nB; i++) { 

    colptrC.at(i) = ip;

    for ( jp = B.col_ptrs[i]; jp < B.col_ptrs[i+1]; jp++) {

      j = B.row_indices[jp];
      nzB = B.values[jp];

      for ( kp = A.col_ptrs[j]; kp < A.col_ptrs[j+1]; kp++ ){

        k = A.row_indices[kp];
        nzA = A.values[kp];

        if (xb.at(k) != i){
          rowvalC.at(ip) = k;
          ip +=1;
          // Rcpp::print(wrap(ip));
          xb.at(k) = i;
          x.at(k) = nzA * nzB;
        } else {
          x.at(k) += nzA * nzB;
        }
      }
    }

    // put in the value vector of resultant matrix
    // (vp < ip rather than vp <= ip-1: ip is unsigned, so ip-1 wraps around when ip == 0)
    for ( vp= colptrC.at(i); vp < ip; vp++ ) {
      nzvalC.at(vp) = x(rowvalC.at(vp));
    }
  }

  // resize and put in the spMat container
  colptrC.at(nB) = ip;
  if (ip == 0) return sp_mat(mA, nB);   // empty product: subvec(0, ip-1) would be invalid
  sp_mat C(rowvalC.subvec(0,(ip-1)),colptrC,nzvalC.subvec(0,(ip-1)),mA,nB);

  // Gustavson's algorithm produces unordered rows for each column: a standard way to address this is: (X.t()).t()

  return (C.t()).t();
}

// [[Rcpp::export]]
sp_mat sp_sp_arma(const sp_mat &A, const sp_mat &B){

  return A * B;
}

// [[Rcpp::export]]
mat dense_dense_arma(const mat &A, const mat &B){

  return A * B;
}

#### End 


#### Microbenchmark 


library(Matrix)   # provides the dgCMatrix class

## define two matrices
m<- 1000
n<- 6000
p<- 2000

A<-  matrix(runif(m*n),m,n)
B<-  matrix(runif(n*p),n,p)
A[abs(A)> .01] = B[abs(B)> .01] = 0
A <- as(A,'dgCMatrix')
B<- as(B,'dgCMatrix')
Adense<- as.matrix(A)
Bdense<- as.matrix(B)

## sp_sp_gc is the function without ordering 

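library(microbenchmark)
microbenchmark(sp_sp_gc(A,B,.5),sp_sp_gc_ord(A,B,.5),sp_sp_arma(A,B),A %*% B,
               dense_dense_arma(Adense,Bdense),Adense %*% Bdense,Adense %*% B, times=100)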

Unit: milliseconds
                             expr       min        lq      mean    median        uq       max neval
              sp_sp_gc(A, B, 0.5)  16.09809  21.75001  25.76436  24.44657  26.96300  99.30778   100
          sp_sp_gc_ord(A, B, 0.5)  36.78781  44.64558  49.82102  47.64348  51.87361 116.85013   100
                 sp_sp_arma(A, B)  47.45203  52.77132  59.37077  59.24010  62.41710  86.15647   100
                          A %*% B  23.64307  28.99649  32.88566  32.10017  35.21816  59.16251   100
 dense_dense_arma(Adense, Bdense) 286.22358 302.95170 345.66766 317.75786 340.50143 862.15116   100
                Adense %*% Bdense 292.32099 317.10795 342.48345 329.80950 342.21333 697.56468   100
                     Adense %*% B 167.87248 186.63499 219.11872 195.19197 212.50286 843.17172   100   



0 answers:
