I am currently parallelizing a nested for loop using C++ and OpenMP. Without going into the actual details of the program, I have put together a basic example using the concepts involved:
float var = 0.f;
std::vector<float> distance; // some float array
std::vector<float> temp;     // some float array
for (int i = 0; i < distance.size(); i++) {
    // some work
    for (int j = 0; j < temp.size(); j++) {
        var += temp[i] / distance[j];
    }
}
I tried to parallelize the above code in the following way:
float var = 0.f;
std::vector<float> distance; // some float array
std::vector<float> temp;     // some float array
#pragma omp parallel for default(shared)
for (int i = 0; i < distance.size(); i++) {
    // some work
    #pragma omp parallel for reduction(+:var)
    for (int j = 0; j < temp.size(); j++) {
        var += temp[i] / distance[j];
    }
}
I then compared the output of the serial program with that of the parallel program, and the results were not correct. I know this is mainly because floating-point arithmetic is not associative. But is there any workaround that gives the exact results?
Answer 0 (score: 2)
Although the lack of associativity of floating point arithmetic might be an issue in some cases, the code you show here exposes a much more essential problem which you need to address first: the status of the var variable in the outer loop.
Indeed, since var is modified inside the i loop, even if only in the j part of the i loop, it needs to be "privatized" somehow. The exact status it needs depends on the value you expect it to hold upon exit of the enclosing parallel region:
- If you don't care about its value after the parallel region, just declare it private (or better, declare it inside the parallel region).
- If you need its value accumulated over the i loop, and considering it accumulates a sum of values, most likely you'll need to declare it reduction(+:var), as illustrated right after this list, although lastprivate might also be what you want (impossible to say without further details).
- If private or lastprivate was all you needed, but you also need its initial value upon entrance of the parallel region, then you'll have to consider adding firstprivate too (no need of that if you went for reduction, as it is already taken care of).
That should be enough for fixing your issue.
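For instance, if what you want after the loops is the accumulated total (which the += suggests, though that is an assumption), the fix boils down to changing the outer directive of your snippet to:

#pragma omp parallel for default(shared) reduction(+:var)

Each thread then accumulates into its own private copy of var, and the per-thread copies are summed back into the shared var at the end of the parallel region.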
Now, in your snippet, you also parallelized the inner loop. It is usually a bad idea to go for nested parallelism. So unless you have a very compelling reason for doing so, you will likely get much better performance by only parallelizing the outer loop and leaving the inner loop alone. That doesn't mean the inner loop won't benefit from the parallelization, but rather that several instances of the inner loop will be computed in parallel (each one being sequential admittedly, but the whole process is parallel).
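A minimal sketch of that simpler structure, wrapped in a hypothetical helper function and assuming (as the question's indexing implies) that distance and temp have the same length and that the final sum is the value you need:

#include <vector>

// Sketch only: the function name and signature are assumptions, not from the question.
float accumulate_ratios(const std::vector<float>& distance, const std::vector<float>& temp)
{
    // Assumes distance.size() == temp.size(), as the question's indexing implies.
    float var = 0.f;
    #pragma omp parallel for default(shared) reduction(+:var)
    for (int i = 0; i < (int)distance.size(); i++) {
        // some work
        for (int j = 0; j < (int)temp.size(); j++) {  // inner loop runs sequentially within each thread
            var += temp[i] / distance[j];             // same indexing as in the question
        }
    }
    return var;
}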
A nice side effect of removing the inner loop's parallelization (in addition to making the code faster) is that now all accumulations inside the private var variables are done in the same order as when not in parallel. Therefore, your (hypothetical) floating point arithmetic issues inside the outer loop will have disappeared, and only if you needed the final reduction upon exit of the parallel region might you still face them there.