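I am trying to use OpenMP to parallelize my loop so that it runs faster. The problem is that the parallel version is not faster than the sequential one: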
#include <Rcpp.h>
#include <iostream>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
#include "test.h"
using namespace std;
// [[Rcpp::export]]
std::vector<double> parallel_random_sum(int n, int ncores) {
std::vector<double> res(n);
#pragma omp parallel num_threads(ncores)
{
#pragma omp for
for (int j = 0; j < n; ++j) {
double lres(0);
// cout << "j = "<<j <<" test = " << lres<<endl;
lres += j;
res[j] = lres / n;
}
}
return res;
}
// [[Rcpp::export]]
std::vector<double> not_parallel_random_sum(int n) {
std::vector<double> res(n);
for (int j = 0; j < n; ++j) {
double lres(0);
// cout << "j = "<<j <<" test = " << lres<<endl;
lres += j;
res[j] = lres / n;
}
return res;
}
/*** R
microbenchmark::microbenchmark(
parallel_random_sum(1e7, 8),
not_parallel_random_sum(1e7),
times = 20
)
*/
Results ==>
parallel_random_sum(1e+07, 8)     62.02360 ms
not_parallel_random_sum(1e+07)    65.56082 ms
Answer 0 (score: 0)
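The code you want to parallelize is simply not expensive enough: each iteration does one addition and one division, so the overhead of parallelization (starting the threads and distributing the iterations) is comparable to the gain. If you add some artificial workload to the loop via a short sleep, you can see the performance gain: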
#include <chrono>
#include <thread>
#include <Rcpp.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::export]]
Rcpp::NumericVector parallel_sleep(int n, int ncores) {
Rcpp::NumericVector res_(n);
RcppParallel::RVector<double> res(res_);
#pragma omp parallel num_threads(ncores)
{
#pragma omp for
for (int j = 0; j < n; ++j) {
double lres(0);
std::this_thread::sleep_for(std::chrono::microseconds(10));
lres += j;
res[j] = lres / n;
}
}
return res_;
}
// [[Rcpp::export]]
Rcpp::NumericVector not_parallel_sleep(int n) {
Rcpp::NumericVector res(n);
for (int j = 0; j < n; ++j) {
double lres(0);
std::this_thread::sleep_for(std::chrono::microseconds(10));
lres += j;
res[j] = lres / n;
}
return res;
}
/*** R
N <- 1e4
bench::mark(
parallel_sleep(N, 8),
not_parallel_sleep(N)
)
*/
Result:
# A tibble: 2 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <lis> <list>
1 parallel_sle… 73.2ms 81.3ms 82.3ms 87ms 12.3 80.7KB 0 7 569ms <dbl [1… <Rprofme… <bch… <tibbl…
2 not_parallel… 667.8ms 667.8ms 667.8ms 668ms 1.50 80.7KB 0 1 668ms <dbl [1… <Rprofme… <bch… <tibbl…
Note that I am also using data structures from RcppParallel in order to avoid a deep copy when returning the data (cf. @coatless' comment).
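The sleep above is only a stand-in for real work. As a rough sketch of the same idea with a CPU-bound load, the plain C++ version below (no Rcpp) makes the per-iteration cost tunable. The names `busy_work`, `parallel_busy`, `sequential_busy` and the `work` parameter are made up for this illustration and are not part of the original code. It assumes the compiler is invoked with OpenMP enabled (e.g. `-fopenmp` for GCC/Clang); without that flag the pragma is ignored and both functions run sequentially.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: burns CPU time so that each iteration has real work.
// `work` is the number of sqrt calls performed per element.
double busy_work(int j, int work) {
    double acc = 0.0;
    for (int k = 1; k <= work; ++k) {
        acc += std::sqrt(static_cast<double>(j) + k);
    }
    return acc;
}

// Parallel version: same loop structure as above, with a tunable cost.
// Each thread writes only to its own res[j], so no synchronization is needed.
std::vector<double> parallel_busy(int n, int work, int ncores) {
    std::vector<double> res(n);
    #pragma omp parallel for num_threads(ncores)
    for (int j = 0; j < n; ++j) {
        res[j] = busy_work(j, work) / n;
    }
    return res;
}

// Sequential reference: identical arithmetic, no OpenMP.
std::vector<double> sequential_busy(int n, int work) {
    std::vector<double> res(n);
    for (int j = 0; j < n; ++j) {
        res[j] = busy_work(j, work) / n;
    }
    return res;
}
```

Timing both functions for a small `work` (say 1) and a large one (say 1000) should show where the crossover lies on a given machine: with a cheap loop body the two versions are expected to be close, as in the question, and the parallel version should pull ahead only once each iteration costs noticeably more than the threading overhead.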