Question

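Suppose we have two nested loops. The inner loop should run in parallel, but the outer loop must be executed sequentially. The following code then does what we want: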

for (int i = 0; i < N; ++i) {
  #pragma omp parallel for schedule(static)
  for (int j = first(i); j < last(i); ++j) {
    // Do some work
  }
}

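Now suppose that each thread has to obtain some thread-local object to do the work in the inner loop, and that obtaining these thread-local objects is costly. We therefore do not want to do the following: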

for (int i = 0; i < N; ++i) {
  #pragma omp parallel for schedule(static)
  for (int j = first(i); j < last(i); ++j) {
    ThreadLocalObject &obj = GetTLO(omp_get_thread_num()); // Costly!
    // Do some work with the help of obj
  }
}

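How can I solve this problem?

Each thread should request its local object only once.
The inner loop should be parallelized across all threads.
The iterations of the outer loop should be executed one after another.

My idea is the following, but does it really do what I want?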

#pragma omp parallel
{
  ThreadLocalObject &obj = GetTLO(omp_get_thread_num()); // Costly, but now once per thread
  for (int i = 0; i < N; ++i) {
    #pragma omp for schedule(static)
    for (int j = first(i); j < last(i); ++j) {
      // Do some work with the help of obj
    }
  }
}
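
The pattern proposed above can be checked with a small self-contained sketch. Everything specific in it is an assumption made for illustration: ThreadLocalObject is reduced to a counter, the costly GetTLO call is replaced by a lookup in a preallocated vector, first(i) and last(i) are fixed to 0 and n_inner, and run_sweep and SweepResult are hypothetical names. The sketch relies on two OpenMP properties: an "omp for" inside an existing parallel region shares its iterations among the threads of that team, and without a nowait clause it ends with an implicit barrier, so sweep i + 1 cannot start before sweep i has finished on all threads.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_max_threads() { return 1; }
#endif

// Hypothetical stand-in for the question's ThreadLocalObject.
struct ThreadLocalObject {
  long iterations = 0;  // inner-loop iterations handled by this thread
};

struct SweepResult {
  int acquisitions;  // how many times the "costly" lookup ran
  long total_work;   // inner iterations summed over all threads
  bool in_order;     // true if every cell went through all sweeps in sequence
};

SweepResult run_sweep(int n_outer, int n_inner) {
  assert(n_outer >= 0 && n_inner >= 0);
  std::vector<ThreadLocalObject> objects(omp_get_max_threads());
  int acquisitions = 0;
  // Two buffers: sweep i reads the buffer written by sweep i - 1.
  std::vector<std::vector<int>> buf(2, std::vector<int>(n_inner, 0));

  #pragma omp parallel
  {
    // Stand-in for the costly GetTLO call: executed once per thread.
    ThreadLocalObject &obj = objects[omp_get_thread_num()];
    #pragma omp atomic
    ++acquisitions;

    // Every thread runs the outer loop; each inner loop is shared out.
    for (int i = 0; i < n_outer; ++i) {
      const std::vector<int> &prev = buf[(i + 1) % 2];
      std::vector<int> &next = buf[i % 2];
      #pragma omp for schedule(static)
      for (int j = 0; j < n_inner; ++j) {
        // Reads a neighbouring cell, possibly written by another thread
        // during the previous sweep.
        next[j] = prev[(j + 1) % n_inner] + 1;
        ++obj.iterations;
      }
      // Implicit barrier: all threads finish sweep i before sweep i + 1.
    }
  }

  long total = 0;
  for (const ThreadLocalObject &o : objects) total += o.iterations;

  // After n_outer ordered sweeps every cell must hold exactly n_outer.
  bool in_order = true;
  for (int v : buf[(n_outer + 1) % 2]) in_order = in_order && (v == n_outer);

  return {acquisitions, total, in_order};
}
```

Under these assumptions, run_sweep(5, 1000) should report in_order == true and total_work == 5000, with acquisitions equal to the team size rather than five times that. Note that first(i) and last(i) have to evaluate to the same bounds on every thread, because all threads of the team must encounter the same worksharing loops.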

Answer 1

I don't really see why the complexity of threadprivate is necessary when you can simply use a pool of objects. The basic idea should go along these lines:

#pragma omp parallel
{
  // Will hold a handle to the object pool
  auto pool = shared_ptr<ObjectPool>(nullptr);
  #pragma omp single copyprivate(pool)
  {
    // A single thread creates a pool of num_threads objects
    // Copyprivate broadcasts the handle
    pool = create_object_pool(omp_get_num_threads());
  }
  for (int i = 0; i < N; ++i) 
  {
    // Worksharing loop: reuses the enclosing team instead of opening a nested region
    #pragma omp for schedule(static)
    for (int j = first(i); j < last(i); ++j) 
    {
        // The object is not re-created, just a reference to it
        // is returned from the pool (pool is a shared_ptr, hence ->)
        auto & r = pool->get( omp_get_thread_num() );
        // Do work with r
    }
  }
}
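
The snippet above leaves ObjectPool and create_object_pool undefined, so here is a minimal sketch of how they might look. It is an assumption, not the answerer's actual implementation: PooledObject, total_uses, and run_with_pool are hypothetical names, the pool is just a vector with one slot per thread, and the inner bounds are fixed to 0 and n_inner. The reference is fetched once per thread, before the outer loop, which is one way to meet the "only once" requirement from the question.

```cpp
#include <cassert>
#include <memory>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical per-thread object; it only counts how often it is used.
struct PooledObject {
  long uses = 0;
};

// Hypothetical pool: one preallocated object per thread of the team.
class ObjectPool {
 public:
  explicit ObjectPool(int n) : objects_(n) {}
  PooledObject &get(int tid) { return objects_[tid]; }
  long total_uses() const {
    long sum = 0;
    for (const PooledObject &o : objects_) sum += o.uses;
    return sum;
  }

 private:
  std::vector<PooledObject> objects_;
};

std::shared_ptr<ObjectPool> create_object_pool(int n) {
  return std::make_shared<ObjectPool>(n);
}

long run_with_pool(int n_outer, int n_inner) {
  assert(n_outer >= 0 && n_inner >= 0);
  long total = 0;
  #pragma omp parallel
  {
    std::shared_ptr<ObjectPool> pool;  // private: declared inside the region
    #pragma omp single copyprivate(pool)
    {
      // One thread builds the pool; copyprivate hands the handle to the rest.
      pool = create_object_pool(omp_get_num_threads());
    }

    // Fetched once per thread, outside both loops.
    PooledObject &obj = pool->get(omp_get_thread_num());

    for (int i = 0; i < n_outer; ++i) {
      #pragma omp for schedule(static)
      for (int j = 0; j < n_inner; ++j) {
        ++obj.uses;  // stand-in for "do work with r"
      }
    }

    // The last loop's implicit barrier guarantees all counts are final here.
    #pragma omp single
    total = pool->total_uses();
  }
  return total;
}
```

With this sketch, run_with_pool(3, 100) should return 300 regardless of the number of threads, since every inner iteration is executed exactly once.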

Parallelizing the inner loop with OpenMP

1 answer: