Question

Given the following code...

for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j); 
}

...which I want to run on multiple CPUs/cores. The function f does not use velocity.

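A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.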

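I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I have read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that is what I would like to do, I think.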
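A note on the atomic attempt: #pragma omp atomic only applies to updates of a scalar lvalue, so it cannot wrap a += on a whole 3-vector object. A minimal sketch of what a component-wise version could look like follows. It assumes a plain struct Vec3 as a stand-in for Eigen::Vector3d and a made-up f; neither is from the original code, and this is not claimed to be the fastest option.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for Eigen::Vector3d: three plain doubles.
struct Vec3 { double x, y, z; };

// Hypothetical stand-in for f(j); the real f is not shown in the question.
inline Vec3 f(int j) { return Vec3{1.0 * j, 2.0 * j, 3.0 * j}; }

// Adds f(j) to velocity[j] for every index j of every cluster.
void accumulate_atomic(const std::vector<std::set<int>>& clusters,
                       std::vector<Vec3>& velocity)
{
    Vec3* v = velocity.data();
    const int n = static_cast<int>(clusters.size());

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j : clusters[i]) {
            assert(j >= 0 && static_cast<std::size_t>(j) < velocity.size());
            const Vec3 d = f(j);
            // atomic accepts only scalar updates, so one directive per component
            #pragma omp atomic
            v[j].x += d.x;
            #pragma omp atomic
            v[j].y += d.y;
            #pragma omp atomic
            v[j].z += d.z;
        }
    }
}
```

Each component is updated atomically on its own, which is enough for a sum because the three additions are independent of each other.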

I came up with this:

#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    } 

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}

Is this a good solution? Is it the best solution? (Is it even correct?)

Further thoughts: using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.

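I have tried to find a question with a similar problem, and this question looks almost the same. But I think my case might differ because the last step includes a for loop. Also, the question of whether this is the best approach still stands.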

Edit: As requested in a comment: the reduction clause...

    #pragma omp parallel reduction(+:velocity)
    for (omp_int i = 0; i < velocity_local.size(); ++i)
        velocity[i] += velocity_local[i];

...throws the following error:

error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause

(similar error with g++)
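One possible way around this error, assuming a compiler with OpenMP 4.0 or later (the classic MSVC /openmp mode implements OpenMP 2.0 and will not accept it), is to reduce into a plain local variable with a user-defined reduction and fold the result into the member afterwards. The sketch below is illustrative only: Vec3, f, add_into, the reduction name vec3_sum and the layout of ShapeMatching are assumptions, not the original class.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Vec3 { double x, y, z; };          // hypothetical stand-in for Eigen::Vector3d
using VecArray = std::vector<Vec3>;

inline Vec3 f(int j) { return Vec3{1.0 * j, 2.0 * j, 3.0 * j}; }  // hypothetical f

// Combiner for the reduction: element-wise out += in.
inline void add_into(VecArray& out, const VecArray& in)
{
    assert(out.size() == in.size());
    for (std::size_t k = 0; k < out.size(); ++k) {
        out[k].x += in[k].x;
        out[k].y += in[k].y;
        out[k].z += in[k].z;
    }
}

// Each thread's private copy starts as a zero-filled array of the same size.
#pragma omp declare reduction(vec3_sum : VecArray : add_into(omp_out, omp_in)) \
    initializer(omp_priv = VecArray(omp_orig.size()))

class ShapeMatching {                      // assumed layout, for illustration only
public:
    explicit ShapeMatching(std::size_t n) : velocity(n) {}

    void update(const std::vector<std::set<int>>& clusters)
    {
        VecArray sum(velocity.size());     // local variable, value-initialised to zero
        const int n = static_cast<int>(clusters.size());

        #pragma omp parallel for reduction(vec3_sum : sum)
        for (int i = 0; i < n; ++i) {
            for (int j : clusters[i]) {
                const Vec3 d = f(j);
                sum[j].x += d.x;
                sum[j].y += d.y;
                sum[j].z += d.z;
            }
        }
        add_into(velocity, sum);           // fold the reduced result into the member
    }

    VecArray velocity;
};
```

The reduction clause names the local sum rather than the data member, which is the part the compiler objected to. Every thread still gets a full-size private copy, so the memory cost is the same as in the hand-written version.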

Answer 1

You are doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with or without a critical section.

You have already done this correctly with a critical section (in your recent edit), so let me describe how to do it without one.

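// one contiguous buffer holding a private slice of size vsize per thread
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

    // one thread allocates; the implicit barrier after single makes
    // the buffer visible to all threads before they touch it
    #pragma omp single
    velocitya.resize(vsize*nthreads);
    // each thread zeroes its own slice
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1), 
              Eigen::Vector3d(0,0,0));

    // accumulate into the thread's own slice, so no synchronization is needed
    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    } 

    // parallel merge: each thread sums all slices for its share of the indices
    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}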

This method requires extra care/tuning because of false sharing, which I have not addressed here.
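One common tuning step is to pad each thread's slice so that neighbouring slices do not end and begin in the same cache line. The helper below is only a sketch: the name padded_stride and the 64-byte cache line are assumptions, and it helps fully only if the buffer itself starts on a cache-line boundary, which the default std::vector allocator does not guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Greatest common divisor (Euclid), kept local so the sketch needs no C++17.
inline std::size_t gcd_sz(std::size_t a, std::size_t b)
{
    while (b != 0) {
        const std::size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Smallest element count >= vsize whose total byte size is a whole number
// of cache lines. cache_line_bytes = 64 is an assumption about the hardware.
inline std::size_t padded_stride(std::size_t vsize, std::size_t elem_bytes,
                                 std::size_t cache_line_bytes = 64)
{
    assert(elem_bytes > 0 && cache_line_bytes > 0);
    // n * elem_bytes is a multiple of the line size exactly when n is a
    // multiple of cache_line_bytes / gcd(elem_bytes, cache_line_bytes)
    const std::size_t step = cache_line_bytes / gcd_sz(elem_bytes, cache_line_bytes);
    return (vsize + step - 1) / step * step;
}
```

With this, the buffer would be resized to stride*nthreads and indexed as ithread*stride + j, where stride = padded_stride(vsize, sizeof(Eigen::Vector3d)), instead of using vsize for both.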

As to which method is better, you will have to test.

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?