I have a piece of OpenMP code here that integrates the function 4.0/(1+x^2) over the interval [0,1]. The analytical answer is pi = 3.14159...
The method of integration is just a plain Riemann-sum approximation. The code gives the correct answer when I use anywhere from 1 up to 11 OpenMP threads.
However, as soon as I use 12 or more OpenMP threads, it starts to give increasingly wrong answers.
Why does this happen? First, here is the C++ code. I am using gcc in an Ubuntu 10.10 environment, and the code is compiled with g++ -fopenmp integration_OpenMP.cpp.
// f(x) = 4/(1+x^2)
// Domain of integration: [0,1]
// Integral over the domain = pi =(approx) 3.14159
#include <iostream>
#include <omp.h>
#include <vector>
#include <algorithm>
#include <functional>
#include <numeric>
int main (void)
{
    // Information common to serial and parallel computation.
    int num_steps = 2e8;
    double dx = 1.0/num_steps;

    // Serial computation: method of integration is just a plain Riemann sum.
    double start = omp_get_wtime();
    double serial_sum = 0;
    double x = 0;
    for (int i = 0; i < num_steps; ++i)
    {
        serial_sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
    double end = omp_get_wtime();
    std::cout << "Time taken for the serial computation: " << end-start << " seconds";
    std::cout << "\t\tPi serial: " << serial_sum << std::endl;

    // OpenMP computation. Method of integration: just a plain Riemann sum.
    std::cout << "How many OpenMP threads do you need for parallel computation? ";
    int t; // number of OpenMP threads
    std::cin >> t;

    start = omp_get_wtime();
    double parallel_sum = 0; // will be modified atomically
    #pragma omp parallel num_threads(t)
    {
        int threadIdx = omp_get_thread_num();
        int begin = threadIdx * num_steps/t; // integer index of left endpoint of subinterval
        int end = begin + num_steps/t;       // integer index of right endpoint of subinterval
        double dx_local = dx;
        double temp = 0;
        double x = begin*dx;
        for (int i = begin; i < end; ++i)
        {
            temp += 4.0*dx_local/(1.0+x*x);
            x += dx_local;
        }
        #pragma omp atomic
        parallel_sum += temp;
    }
    end = omp_get_wtime();

    std::cout << "Time taken for the parallel computation: " << end-start << " seconds";
    std::cout << "\tPi parallel: " << parallel_sum << std::endl;
    return 0;
}
Below is the output for various thread counts, starting with 11 threads.
OpenMP: ./a.out
Time taken for the serial computation: 1.27744 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 11
Time taken for the parallel computation: 0.366467 seconds Pi parallel: 3.14159
OpenMP: ./a.out
Time taken for the serial computation: 1.28167 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 12
Time taken for the parallel computation: 0.351284 seconds Pi parallel: 3.16496
OpenMP: ./a.out
Time taken for the serial computation: 1.28178 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 13
Time taken for the parallel computation: 0.434283 seconds Pi parallel: 3.21112
OpenMP: ./a.out
Time taken for the serial computation: 1.2765 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 14
Time taken for the parallel computation: 0.375078 seconds Pi parallel: 3.27163
Answer 0 (score: 4)
Why not just use parallel for with static partitioning?
#pragma omp parallel shared(dx) num_threads(t)
{
    // Assumes schedule(static): each thread handles one contiguous block of
    // iterations starting near i = threadIdx*num_steps/t, so x starts at
    // (threadIdx*num_steps/t)*dx = threadIdx*1.0/t.
    double x = omp_get_thread_num() * 1.0 / t;
    #pragma omp for schedule(static) reduction(+ : parallel_sum)
    for (int i = 0; i < num_steps; ++i)
    {
        parallel_sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
}
Then you don't have to manage the partitioning yourself or collect the result atomically.
To initialize x correctly, note that x = (begin * dx) = (threadIdx * num_steps/t) * (1.0 / num_steps) = threadIdx * 1.0 / t.
Edit: I just tested this final version on my machine and it appears to run correctly.
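If you want the result to be independent of the schedule entirely, here is a minimal sketch (assuming num_steps, dx, t, and parallel_sum as declared in the question) that computes x from the loop index instead of carrying it across iterations:

#pragma omp parallel for reduction(+ : parallel_sum) num_threads(t)
for (int i = 0; i < num_steps; ++i)
{
    double x = i * dx; // left endpoint of the i-th step; no cross-iteration state
    parallel_sum += 4.0*dx/(1.0+x*x);
}

Because each iteration is self-contained, any schedule (static, dynamic, guided) produces the same sum up to floating-point rounding.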
Answer 1 (score: 2)
The problem is in the computation of begin: when you set num_steps = 2e8, then for threadIdx == 11 the product threadIdx * num_steps overflows a 32-bit integer, so your begin is computed incorrectly.
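A minimal standalone sketch of the overflow (the values are taken from the question; the exact garbage value is platform-dependent, since signed overflow is undefined behavior in C++):

#include <iostream>
int main()
{
    int num_steps = 2e8; // 200,000,000 still fits in a 32-bit int
    int threadIdx = 11;  // the thread index that first appears with 12 threads
    int t = 12;
    int begin = threadIdx * num_steps / t;                    // 2.2e9 exceeds INT_MAX (~2.147e9)
    long long begin64 = (long long)threadIdx * num_steps / t; // 64-bit arithmetic: 183333333
    std::cout << "32-bit begin: " << begin << "\n";   // typically a large negative number
    std::cout << "64-bit begin: " << begin64 << "\n";
    return 0;
}

This is also exactly why 1 through 11 threads work: with at most 11 threads, the largest product is 10 * 2e8 = 2e9, which still fits in a 32-bit int.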
I suggest you use long long int for threadIdx, begin, and end.
Edit:
Note also that this way of computing begin and end can drop steps (and hence precision). For example, with 313 threads each thread performs num_steps/t = 638977 steps, which covers only 313 * 638977 = 199,999,801 of the 200,000,000 steps, so 199 steps are lost.
The correct way to compute begin and end is:
long long int begin = threadIdx * num_steps/t;
long long int end = (threadIdx + 1) * num_steps/t;
For the same reason you cannot fix this just with parentheses, i.e. threadIdx * (num_steps/t): that avoids the overflow but still drops those steps, so you have to use long long.
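For reference, a minimal sketch of the question's parallel region with both fixes applied (64-bit indices plus the corrected end computation; t, num_steps, dx, and parallel_sum are assumed to be declared as in the question):

#pragma omp parallel num_threads(t)
{
    long long threadIdx = omp_get_thread_num();
    long long begin = threadIdx * num_steps / t;       // promoted to 64-bit, no overflow
    long long end   = (threadIdx + 1) * num_steps / t; // consecutive subintervals tile [0, num_steps)
    double temp = 0;
    double x = begin * dx;
    for (long long i = begin; i < end; ++i)
    {
        temp += 4.0*dx/(1.0+x*x);
        x += dx;
    }
    #pragma omp atomic
    parallel_sum += temp;
}

With this partitioning, thread k's end equals thread k+1's begin, so every one of the num_steps steps is counted exactly once whether or not t divides num_steps.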