Question

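I am trying to speed up my MPI project by using OpenMP. I have a dataset of 1000 2D points, and I am using a brute-force algorithm to find the minimum distance between any two points. However, when I try to split the work across threads, performance gets much worse. How do I use OpenMP correctly?

Here is my attempt: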

double calcDistance(double input[][2], int start, int stop){

    double temp;
    //declare and initialize minimum
    double minimum =  pow (((input[start+1][0]) - (input[start][0])),2) + pow(((input[start+1][1]) - (input[start][1])),2);
    minimum = sqrt(minimum);

    closestIndex1 = start;
    closestIndex2 = start+1;
    //Brute Force Algorithm to find minimum distance in dataset.

        #pragma omp parallel for
        for(int i=start; i<stop;i++){
            for(int j=start; j<stop;j++){

                    temp = pow(((input[j][0]) - (input[i][0])),2) + pow(((input[j][1]) - (input[i][1])),2);
                    temp = sqrt(temp);
                    if(temp < minimum && i < j){
                            minimum = temp; 
                            closestIndex1 = i;
                            closestIndex2 = j;
                    }//endif
            }//end j
          }//end i
return minimum;
}

I have to say WOW. Thank you, this was extremely helpful and really cleared up a bunch of my questions. Again, thank you, gha.st.

Answer 1

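Analysis

First of all, it is pure luck that your program appears to work as written. It does contain a data race, which produces invalid results on my machine. Consider the following test harness: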

::std::cout << ::xtd::target_info() << "\n\n"; // [target os] [target architecture] with [compiler]

static const int count = 30000;
auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
std::unique_ptr<double[][2]> input(new double[count][2]);
for(size_t i = 0; i < count; ++i)
{
    input[i][0] = gen();
    input[i][1] = gen();
}

::xtd::stopwatch sw; // does what its name suggests
sw.start();
double minimum = calcDistance(input.get(), 0, count);
sw.stop();
::std::cout << minimum << "\n";
::std::cout << sw << "\n";

When the function is run with the omp pragma removed, its output is:

Windows x64 with icc 14.0

0.0559233
7045 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
7272 ms

When the function is run with the omp pragma left in place, its output is:

Windows x64 with icc 14.0

0.324085
675.9 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
4338 ms

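Since the machine uses 24 threads (on 12 cores with HT enabled), the speedup is obvious, but it could be better, at least for msvc. The compiler that generates the faster program (icc) also exposes the data race by giving a different wrong result on each run.

Note: I was also able to see a wrong result from msvc when compiling an x86 debug build with 10k iterations.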
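
The harness above depends on ::xtd::target_info and ::xtd::stopwatch, which appear to be the answerer's own helpers rather than standard library facilities. As a minimal sketch for reproducing the measurements without them, the following uses only the standard library. The names makePoints and timeMs are hypothetical and introduced here for illustration; because standard library implementations of normal_distribution may differ, the generated points and the resulting minimum are not guaranteed to match the numbers quoted in this answer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <memory>
#include <random>

// Hypothetical helper: builds `count` random 2D points with the same
// seed (42) and distribution parameters as the harness above.
std::unique_ptr<double[][2]> makePoints(std::size_t count) {
    assert(count >= 2); // a pair distance needs at least two points
    auto gen = std::bind(std::normal_distribution<double>(0, 1000),
                         std::mt19937_64(42));
    std::unique_ptr<double[][2]> input(new double[count][2]);
    for (std::size_t i = 0; i < count; ++i) {
        input[i][0] = gen();
        input[i][1] = gen();
    }
    return input;
}

// Hypothetical helper: runs `f` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename F>
double timeMs(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A call such as timeMs([&] { minimum = calcDistance(pts.get(), 0, count); }) would then time a single run of the function under test.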

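Proper use of iteration-local variables

The temp variable in your code has a lifetime of one iteration of the innermost loop. By moving its scope to match its lifetime, we can eliminate one source of data races. I also took the liberty of removing two unused variables and changing the initialization of minimum to a constant: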

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    //#pragma omp parallel for // still broken
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}
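
If temp has to stay declared outside the loop for some reason, the private clause (part of OpenMP since its earliest versions) is another way to give each thread its own copy. The sketch below is an illustration under assumptions, not the answerer's code: the name calcDistanceRowMin and the rowMin buffer are hypothetical. To keep the example free of races on the result as well, each outer iteration writes only to its own slot of rowMin, and the slots are combined serially afterwards.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical variant: `temp` is declared outside the loops, as in the
// question, and made thread-private through the OpenMP `private` clause.
double calcDistanceRowMin(double input[][2], int start, int stop) {
    assert(stop - start >= 2);
    const double inf = std::numeric_limits<double>::infinity();
    // One slot per outer iteration; iteration i only ever touches slot i - start.
    std::vector<double> rowMin(static_cast<std::size_t>(stop - start), inf);
    double temp;
    #pragma omp parallel for private(temp)
    for (int i = start; i < stop; i++) {
        for (int j = start; j < stop; j++) {
            temp = std::pow(input[j][0] - input[i][0], 2)
                 + std::pow(input[j][1] - input[i][1], 2);
            temp = std::sqrt(temp);
            if (temp < rowMin[i - start] && i < j) rowMin[i - start] = temp;
        }
    }
    // Serial combination of the per-row minima after the parallel loop.
    double minimum = inf;
    for (double m : rowMin) {
        if (m < minimum) minimum = m;
    }
    return minimum;
}
```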

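Proper OMP minimum calculation

OMP supports reductions, which will in all probability perform quite well. To try it out, we will use the following pragma, which ensures that each thread works on its own minimum variable and that these per-thread copies are combined using the minimum operator: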

#pragma omp parallel for reduction(min: minimum)

The results validate this approach for ICC:

Windows x64 with icc 14.0

0.0559233
622.1 ms
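
For reference, this is how the pragma might be combined with the function from the previous section. It is a sketch assuming a compiler whose OpenMP implementation accepts min reductions (these were added to the C/C++ bindings in OpenMP 3.1); the name calcDistanceReduction is introduced here only to keep it apart from the other versions.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical name; same loop body as the previous section, with the
// data race on `minimum` removed by a min reduction.
double calcDistanceReduction(double input[][2], int start, int stop) {
    assert(stop - start >= 2);
    double minimum = std::numeric_limits<double>::infinity();
    // Each thread gets a private `minimum` initialized to the largest
    // representable value; the copies are min-combined at the end.
    #pragma omp parallel for reduction(min: minimum)
    for (int i = start; i < stop; i++) {
        for (int j = start; j < stop; j++) {
            double temp = std::pow(input[j][0] - input[i][0], 2)
                        + std::pow(input[j][1] - input[i][1], 2);
            temp = std::sqrt(temp);
            if (temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}
```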

But MSVC howls error C3036: 'min' : invalid operator token in OpenMP 'reduction' clause, because it does not support min reductions. To define our own reduction, we will use a technique called double-checked locking:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j)
            {
                #pragma omp critical
                if(temp < minimum && i < j) minimum = temp;
            }
        }
    }
    return minimum;
}

This is not only correct, but also brings MSVC to comparable performance (note that this is significantly faster than the incorrect version!):

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
653.1 ms

ICC's performance does not suffer much:

Windows x64 with icc 14.0

0.0559233
636.8 ms
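
Another hand-written reduction that uses only basic OpenMP constructs is to let each thread keep a local minimum and merge it into the shared one exactly once, inside a critical section. This is a swapped-in alternative to the double-checked locking shown above, not the answerer's method, and it was not part of the measurements in this answer; the names calcDistanceLocalMin and localMin are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical variant: per-thread minimum, merged once per thread.
double calcDistanceLocalMin(double input[][2], int start, int stop) {
    assert(stop - start >= 2);
    const double inf = std::numeric_limits<double>::infinity();
    double minimum = inf;
    #pragma omp parallel
    {
        double localMin = inf; // private to the executing thread
        // Work-share the outer loop; `nowait` lets a thread proceed to
        // its merge without waiting for the others at the loop's end.
        #pragma omp for nowait
        for (int i = start; i < stop; i++) {
            for (int j = start; j < stop; j++) {
                double temp = std::pow(input[j][0] - input[i][0], 2)
                            + std::pow(input[j][1] - input[i][1], 2);
                temp = std::sqrt(temp);
                if (temp < localMin && i < j) localMin = temp;
            }
        }
        // The shared variable is only read and written under the lock.
        #pragma omp critical
        {
            if (localMin < minimum) minimum = localMin;
        }
    }
    return minimum;
}
```

With this shape the shared minimum is never accessed outside the critical section, at the cost of threads not seeing each other's intermediate results.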

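Serial optimization

While the above is a proper parallelization of your serial code, it can be optimized significantly by considering that you are computing a whole lot of temp results that will never be used because of your i < j condition.

By simply changing the starting point of the inner loop, this computation is not only avoided entirely, the loop condition also becomes simpler.

Another trick we use is to delay the sqrt calculation until the last possible moment: since sqrt is a monotonically increasing (order-preserving) transformation on non-negative values, we can compare the squared distances instead.

Finally, calling pow to compute a square is fairly inefficient, as it incurs a lot of overhead that we do not need.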
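
The last two points can be illustrated in isolation. The following is a small sketch with hypothetical helper names (squaredDistance, distance, nearestIndex): the square is computed by plain multiplication instead of pow, and candidates are compared by squared distance, so sqrt is only needed when the actual distance is reported.

```cpp
#include <cassert>
#include <cmath>

// Squared Euclidean distance, using multiplication instead of pow.
double squaredDistance(const double a[2], const double b[2]) {
    double dx = b[0] - a[0];
    double dy = b[1] - a[1];
    return dx * dx + dy * dy;
}

// Actual distance; the single place where sqrt is paid for.
double distance(const double a[2], const double b[2]) {
    return std::sqrt(squaredDistance(a, b));
}

// Index of the point closest to `query`. Comparing squared distances
// selects the same point as comparing distances, because sqrt preserves
// the ordering of non-negative values.
int nearestIndex(double points[][2], int count, const double query[2]) {
    assert(count > 0);
    int best = 0;
    double bestSq = squaredDistance(points[0], query);
    for (int i = 1; i < count; i++) {
        double sq = squaredDistance(points[i], query);
        if (sq < bestSq) {
            bestSq = sq;
            best = i;
        }
    }
    return best;
}
```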

This leads to the final code:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++) {
        for(int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy;
            if(temp < minimum)
            {
                #pragma omp critical
                if(temp < minimum) minimum = temp;
            }
        }
    }
    return sqrt(minimum);
}

which results in the final performance:

Windows x64 with icc 14.0

0.0559233
132.7 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
120.1 ms
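
One side effect of starting the inner loop at i + 1 is that outer iterations no longer cost the same: small values of i do far more inner work than large ones. With the default loop schedule this can leave some threads idle towards the end. A possible refinement, offered only as an untested sketch, is to add a schedule(dynamic) clause to the final code; it was not measured in this answer, and whether it helps depends on the compiler, the thread count, and the data size. The name calcDistanceDynamic is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical variant of the final code with dynamic scheduling of the
// outer loop to even out the triangular workload.
double calcDistanceDynamic(double input[][2], int start, int stop) {
    assert(stop - start >= 2);
    double minimum = std::numeric_limits<double>::infinity();
    #pragma omp parallel for schedule(dynamic)
    for (int i = start; i < stop; i++) {
        for (int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy; // squared distance, sqrt deferred
            if (temp < minimum)
            {
                #pragma omp critical
                if (temp < minimum) minimum = temp;
            }
        }
    }
    return std::sqrt(minimum);
}
```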
