I was playing around with OpenMP and stumbled upon something I do not understand. I use the parallel code below (it works correctly). With two or more threads, its execution time almost halves. However, with OpenMP and a single thread the execution time is 35 seconds, while when I comment out the pragmas it drops to 25 seconds! Can I do anything to reduce this huge overhead? I use gcc 4.8.1 and compile with "-O2 -Wall -fopenmp".
I have read similar threads (OpenMP with 1 thread slower than sequential version, OpenMP overhead), and opinions differ: some say there is no overhead, others say there is a lot. I am curious whether there is a better way to use OpenMP in my particular case (an outer for loop with parallel for loops inside).
for (size_t k = 0; k < maxk; ++k) { // maxk is ~5000
    // init reduction variables
    const bool is_time_for_reduction = /* init from k */;
    double mmin = INFINITY, mmax = -INFINITY;
    double sum = 0.0;
    #pragma omp parallel shared(m1, m2)
    {
        // w, h are both between 1000 and 2000
        #pragma omp for
        for (size_t i = 0; i < h; ++i) { // w, h - consts
            for (size_t j = 0; j < w; ++j) {
                // computations with matrices m1 and m2, using only m1, m2 and the constants w, h
            }
        }
        if (is_time_for_reduction) {
            #pragma omp for reduction(max:mmax) reduction(min:mmin) reduction(+:sum)
            for (size_t i = 0; i < h; ++i) {
                for (size_t j = 0; j < w; ++j) {
                    // reductions
                }
            }
        }
    }
    if (is_time_for_reduction) {
        // use "reduced" variables
    }
}
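For reference, below is a minimal self-contained sketch of the structure above that actually compiles (e.g. with gcc -std=c99 -O2 -Wall -fopenmp). The real computation on m1 and m2 is not shown in the question, so a hypothetical element-wise update and a made-up reduction schedule (every 10th k) stand in for it, and maxk is reduced so the test runs quickly; only the loop and pragma structure matters for measuring the overhead:

#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t w = 1500, h = 1500, maxk = 100; // smaller maxk for a quick test
    double *m1 = malloc(w * h * sizeof *m1);
    double *m2 = malloc(w * h * sizeof *m2);
    if (!m1 || !m2) return 1;
    for (size_t i = 0; i < w * h; ++i) { m1[i] = (double)i; m2[i] = 1.0; }

    double t0 = omp_get_wtime();
    for (size_t k = 0; k < maxk; ++k) {
        const bool is_time_for_reduction = (k % 10 == 0); // hypothetical schedule
        double mmin = INFINITY, mmax = -INFINITY, sum = 0.0;

        #pragma omp parallel shared(m1, m2)
        {
            #pragma omp for
            for (size_t i = 0; i < h; ++i)
                for (size_t j = 0; j < w; ++j)
                    m1[i * w + j] += 0.5 * m2[i * w + j]; // stand-in computation

            if (is_time_for_reduction) {
                #pragma omp for reduction(max:mmax) reduction(min:mmin) reduction(+:sum)
                for (size_t i = 0; i < h; ++i)
                    for (size_t j = 0; j < w; ++j) {
                        const double v = m1[i * w + j];
                        if (v < mmin) mmin = v;
                        if (v > mmax) mmax = v;
                        sum += v;
                    }
            }
        }
        if (is_time_for_reduction)
            printf("k=%zu min=%g max=%g sum=%g\n", k, mmin, mmax, sum);
    }
    printf("threads=%d, wall time=%.3f s\n",
           omp_get_max_threads(), omp_get_wtime() - t0);
    free(m1);
    free(m2);
    return 0;
}

Running it once with OMP_NUM_THREADS=1 and once without the pragmas (or without -fopenmp) reproduces the kind of comparison described above.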
Answer 0 (score: 0)
I see no reason to change the original sequential code. I would try this:
for (size_t k = 0; k < maxk; ++k) {
    // init reduction variables
    const bool is_time_for_reduction = /* init from k */;
    double mmin = INFINITY, mmax = -INFINITY;
    double sum = 0.0;
    #pragma omp parallel for
    for (size_t i = 0; i < h; ++i) { // w, h - consts
        for (size_t j = 0; j < w; ++j) {
            // computations with matrices m1 and m2, using only m1, m2 and the constants w, h
        }
    }
    if (is_time_for_reduction) {
        #pragma omp parallel for reduction(max:mmax) reduction(min:mmin) reduction(+:sum)
        for (size_t i = 0; i < h; ++i) {
            for (size_t j = 0; j < w; ++j) {
                // reductions
            }
        }
        // use "reduced" variables
    }
}
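Whichever variant is used, it may help to check how much of the 10-second gap is plain fork/join cost. Below is a small micro-benchmark sketch (the region count is only meant to match the order of magnitude of the k-loop above; it is not taken from the question) that enters an empty parallel region repeatedly and prints the elapsed time:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int regions = 2 * 5000; // on the order of the region entries in the loops above
    double t0 = omp_get_wtime();
    for (int k = 0; k < regions; ++k) {
        #pragma omp parallel
        {
            // empty body: only thread start-up/wake-up and the implicit
            // barrier at the end of the region are being timed here
        }
    }
    double t1 = omp_get_wtime();
    printf("%d empty parallel regions, %d thread(s): %.4f s\n",
           regions, omp_get_max_threads(), t1 - t0);
    return 0;
}

Running this once with OMP_NUM_THREADS=1 and once with the default thread count gives a rough upper bound on how much of the overhead can be attributed to entering and leaving the parallel regions themselves.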