Question

How do I perform an OpenMP reduction (sum) inside a parallel region? (The result is needed only on the master thread.)

Algorithm prototype:

    #pragma omp parallel
    {
        t = omp_get_thread_num();

        while (iterate)
        {
            float f = get_local_result(t);

            // fsum is required on master only
            float fsum = // ? - SUM of f

            if (t == 0)
                MPI_Bcast(&fsum, ...);
        }
    }

If I put the OpenMP region inside the while iterate loop, the parallel-region overhead at each iteration kills performance...

Answer 1

This is the simplest way to do it:

    float sharedFsum = 0.f;
    float masterFsum;

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();

        while(iteration_condition)
        {
            float f = get_local_result(t);

            // Manual reduction
            #pragma omp atomic update
            sharedFsum += f;

            // Ensure the reduction is completed
            #pragma omp barrier

            #pragma omp master
            {
                MPI_Bcast(&sharedFsum, ...);

                // sharedFsum must be reinitialized for the next iteration
                sharedFsum = 0.f;
            }

            // Ensure no other threads update sharedFsum during the MPI_Bcast
            #pragma omp barrier
        }
    }

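If you have a lot of threads (e.g. hundreds), the atomic operations can be expensive. A better approach is to let the runtime perform the reduction for you. Here is a better version: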

    float sharedFsum = 0;

    #pragma omp parallel
    {
        const int threadCount = omp_get_num_threads();
        float masterFsum;

        while(iteration_condition)
        {
            // Execute get_local_result on each thread and
            // perform the reduction into sharedFsum
            #pragma omp for reduction(+:sharedFsum) schedule(static,1)
            for(int i=0 ; i<threadCount ; ++i)
                sharedFsum += get_local_result(i);

            #pragma omp master
            {
                MPI_Bcast(&sharedFsum, ...);

                // sharedFsum must be reinitialized for the next iteration
                sharedFsum = 0.f;
            }

            // Ensure no other threads update sharedFsum during the MPI_Bcast
            #pragma omp barrier
        }
    }
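
To try the reduction-based version without an MPI installation, here is a minimal sketch. It assumes a hypothetical get_local_result(i, iteration) and replaces MPI_Bcast with a hypothetical broadcast_sum stub that only records the value; run_iterations and its parameters are illustrative names as well, not part of the original code.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without -fopenmp. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for the question's get_local_result(). */
static float get_local_result(int i, int iteration)
{
    return (float)((i + 1) * (iteration + 1));
}

/* Hypothetical stand-in for MPI_Bcast: only records the last sum. */
static float last_broadcast = 0.f;
static void broadcast_sum(const float *fsum) { last_broadcast = *fsum; }

/* Runs n_iterations reductions inside ONE parallel region and returns
   the total of all per-iteration sums seen by the master thread. */
float run_iterations(int n_iterations, int *n_threads_out)
{
    assert(n_iterations > 0 && n_threads_out != 0);

    float sharedFsum = 0.f;
    float total = 0.f;

    #pragma omp parallel
    {
        const int threadCount = omp_get_num_threads();

        for (int it = 0; it < n_iterations; ++it)
        {
            /* One loop iteration per thread; the runtime combines the
               private partial sums into sharedFsum at the end of the loop,
               and the loop's implicit barrier makes the result visible. */
            #pragma omp for reduction(+:sharedFsum) schedule(static,1)
            for (int i = 0; i < threadCount; ++i)
                sharedFsum += get_local_result(i, it);

            #pragma omp master
            {
                broadcast_sum(&sharedFsum);
                total += sharedFsum;
                sharedFsum = 0.f;            /* reset for the next iteration */
                *n_threads_out = threadCount;
            }

            /* Keep the other threads out of the next reduction until the
               master has finished with sharedFsum. */
            #pragma omp barrier
        }
    }
    return total;
}
```

With GCC or Clang this can be built with -fopenmp; without that flag the pragmas are ignored and the same code runs serially.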

Side notes:

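t is not protected in your code: add a private(t) clause to the #pragma omp parallel to avoid undefined behavior due to a race condition. Alternatively, you can declare it as a scoped variable inside the region.
#pragma omp master should be preferred over a condition on the thread ID.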
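
Both side notes can be illustrated with a minimal sketch, assuming the hypothetical helper names sum_ids_private_clause and sum_ids_scoped. Each variant gives every thread its own t, and both use #pragma omp master rather than testing t == 0.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Variant 1: t is declared outside the region, so it must be listed in
   private(t); otherwise all threads would write the same shared t. */
int sum_ids_private_clause(int *n_threads_out)
{
    assert(n_threads_out != 0);
    int t;
    int sum = 0;
    int n_threads = 1;

    #pragma omp parallel private(t) reduction(+:sum)
    {
        t = omp_get_thread_num();
        sum += t;

        /* Only the master thread records the team size. */
        #pragma omp master
        n_threads = omp_get_num_threads();
    }
    *n_threads_out = n_threads;
    return sum;
}

/* Variant 2: t is declared inside the region, which makes it private
   to each thread without any extra clause. */
int sum_ids_scoped(int *n_threads_out)
{
    assert(n_threads_out != 0);
    int sum = 0;
    int n_threads = 1;

    #pragma omp parallel reduction(+:sum)
    {
        const int t = omp_get_thread_num();
        sum += t;

        #pragma omp master
        n_threads = omp_get_num_threads();
    }
    *n_threads_out = n_threads;
    return sum;
}
```

The scoped form is usually the easier one to keep correct, since a forgotten private clause compiles silently.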

"the parallel-region overhead at each iteration kills performance..."

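Most of the time, this is due to (implicit) synchronization/communication or to work imbalance. The code above may have the same problem, since it is quite synchronous. If it makes sense in your application, you can make it less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.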
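
As one hedged illustration of relaxing the synchronization, the sketch below uses double buffering only (no OpenMP tasks): iterations alternate between two accumulators, so the barrier after the broadcast can be dropped. get_local_result(t, iteration), broadcast_sum (a stand-in for MPI_Bcast) and run_double_buffered are hypothetical names, and whether this is actually faster depends on the relative cost of the two calls.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for the question's get_local_result(). */
static float get_local_result(int t, int iteration)
{
    return (float)((t + 1) * (iteration + 1));
}

/* Hypothetical stand-in for MPI_Bcast: only records the last sum. */
static float last_broadcast = 0.f;
static void broadcast_sum(const float *fsum) { last_broadcast = *fsum; }

float run_double_buffered(int n_iterations, int *n_threads_out)
{
    assert(n_iterations > 0 && n_threads_out != 0);

    float buf[2] = {0.f, 0.f};   /* two accumulators used alternately */
    float total = 0.f;
    int n_threads = 1;

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();

        for (int it = 0; it < n_iterations; ++it)
        {
            const int slot = it % 2;
            const float f = get_local_result(t, it);

            #pragma omp atomic update
            buf[slot] += f;

            /* All contributions to buf[slot] are complete after this. */
            #pragma omp barrier

            #pragma omp master
            {
                broadcast_sum(&buf[slot]);
                total += buf[slot];
                buf[slot] = 0.f;             /* ready for iteration it + 2 */
                n_threads = omp_get_num_threads();
            }

            /* No second barrier: the other threads move on to the other
               slot, and they cannot reach buf[slot] again before the next
               barrier, which the master only joins after the reset above. */
        }
    }
    *n_threads_out = n_threads;
    return total;
}
```

The other threads can now overlap one get_local_result call with the master's broadcast, at the price of a second accumulator and reasoning that is easier to get wrong.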
