Why doesn't parallelizing my Rcpp code with OpenMP make it faster?

Time: 2019-04-08 12:31:52

Tags: openmp rcpp

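I am trying to use OpenMP to parallelize my loop so that it runs faster. The problem is that the parallel version is no faster than the sequential version: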

#include <Rcpp.h>
#include <iostream>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
#include "test.h"

using namespace std;

// [[Rcpp::export]]
std::vector<double> parallel_random_sum(int n, int ncores) {

  std::vector<double> res(n);

#pragma omp parallel num_threads(ncores)
{
#pragma omp for
  for (int j = 0; j < n; ++j) {
    double lres(0);
   // cout << "j = "<<j <<" test = " << lres<<endl;
    lres += j;
    res[j] = lres / n;
  } 
}

return res;
}

// [[Rcpp::export]]
std::vector<double> not_parallel_random_sum(int n) {

  std::vector<double> res(n);

  for (int j = 0; j < n; ++j) {
    double lres(0);
  //  cout << "j = "<<j <<" test = " << lres<<endl;
    lres += j;
    res[j] = lres / n;
  }

return res;
}

/*** R
microbenchmark::microbenchmark(
  parallel_random_sum(1e7, 8),
  not_parallel_random_sum(1e7),
  times = 20
) 
  */

Results:

  1. parallel_random_sum(1e+07, 8): 62.02360 ms

  2. not_parallel_random_sum(1e+07): 65.56082 ms

1 Answer:

Answer 0 (score: 0)

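The code you are parallelizing does too little work per iteration, so the overhead of parallelization is comparable to the gain. If you add some artificial workload to the loop with a short sleep, you can see the performance improvement: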

#include <chrono>
#include <thread>
#include <Rcpp.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

// [[Rcpp::export]]
Rcpp::NumericVector parallel_sleep(int n, int ncores) {

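  // Allocate the result as an R vector, then wrap it in RcppParallel's
  // thread-safe accessor so worker threads can write to it directly.
  Rcpp::NumericVector res_(n);
  RcppParallel::RVector<double> res(res_);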

#pragma omp parallel num_threads(ncores)
{
#pragma omp for
  for (int j = 0; j < n; ++j) {
    double lres(0);
    // Artificial workload: sleep 10 microseconds per iteration.
    std::this_thread::sleep_for(std::chrono::microseconds(10));
    lres += j;
    res[j] = lres / n;
  }
}

return res_;
}

// [[Rcpp::export]]
Rcpp::NumericVector not_parallel_sleep(int n) {

  Rcpp::NumericVector res(n);

  for (int j = 0; j < n; ++j) {
    double lres(0);
    std::this_thread::sleep_for(std::chrono::microseconds(10));
    lres += j;
    res[j] = lres / n;
  }

  return res;
}

/*** R
N <- 1e4
bench::mark(
  parallel_sleep(N, 8),
  not_parallel_sleep(N)
) 
*/

Results:

# A tibble: 2 x 14
  expression         min     mean   median   max `itr/sec` mem_alloc  n_gc n_itr total_time result   memory    time  gc     
  <chr>         <bch:tm> <bch:tm> <bch:tm> <bch>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>   <list>    <lis> <list> 
1 parallel_sle…   73.2ms   81.3ms   82.3ms  87ms     12.3     80.7KB     0     7      569ms <dbl [1… <Rprofme… <bch… <tibbl…
2 not_parallel…  667.8ms  667.8ms  667.8ms 668ms      1.50    80.7KB     0     1      668ms <dbl [1… <Rprofme… <bch… <tibbl…
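
One way to act on the overhead argument is to let OpenMP fall back to serial execution when the loop is small. The following is a minimal standalone C++ sketch (no Rcpp), assuming a compiler with OpenMP support. The function name `fill_scaled` and the cutoff `kMinParallelN` are hypothetical and chosen only for illustration; a sensible cutoff would have to come from benchmarking on the target machine.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cutoff below which threading overhead likely dominates.
// This value is an assumption for illustration, not a measured constant.
constexpr int kMinParallelN = 100000;

// Fills res[j] = j / n, the same cheap per-element work as in the question.
// Assumes n >= 0.
std::vector<double> fill_scaled(int n, int ncores) {
  std::vector<double> res(static_cast<std::size_t>(n));
  double* out = res.data();
  // The `if` clause makes OpenMP run the loop serially when n is small.
  // If OpenMP is not enabled at compile time, the pragma is ignored and
  // the loop simply runs serially.
#pragma omp parallel for num_threads(ncores) if(n >= kMinParallelN)
  for (int j = 0; j < n; ++j) {
    out[j] = static_cast<double>(j) / n;
  }
  return res;
}
```

Even above the cutoff, a loop this cheap may be limited by memory bandwidth rather than by computation, so adding threads may not help much.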

Note that I also used RcppParallel's data structures to avoid a deep copy when returning the data (cf. @coatless's comment).
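
The idea of writing through a non-copying wrapper can be sketched in plain C++ without Rcpp. `DoubleView` and `fill_through_view` are hypothetical names, and this is not RcppParallel's actual implementation; it only illustrates, under that assumption, how a wrapper that holds a pointer and a length lets worker threads write into storage the caller already owns, so nothing has to be copied on return.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: it stores only a pointer and a length,
// so constructing or passing it never copies the underlying elements.
struct DoubleView {
  double* data;
  std::size_t size;
  double& operator[](std::size_t i) const { return data[i]; }
};

// Writes j / n into the caller's storage through the view.
// Each iteration touches a distinct element, so no synchronization is needed.
void fill_through_view(DoubleView out) {
  const int n = static_cast<int>(out.size);
#pragma omp parallel for
  for (int j = 0; j < n; ++j) {
    out[static_cast<std::size_t>(j)] = static_cast<double>(j) / n;
  }
}
```

A caller would allocate the buffer once, for example `std::vector<double> buf(n);`, and pass `DoubleView{buf.data(), buf.size()}`; the results then live in `buf` itself.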