Question

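I have OpenMP code that runs on the CPU by having each thread manage a region of memory addressed by the thread's ID number, which it gets from omp_get_thread_num(). This works well on the CPU, but can it work on the GPU?

An MWE is: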

#include <iostream>
#include <omp.h>

int main(){
  const int SIZE = 400000;

  int *m;
  m = new int[SIZE];

  #pragma omp target
  {
    #pragma omp parallel for
    for(int i=0;i<SIZE;i++)
      m[i] = omp_get_thread_num();
  }

  for(int i=0;i<SIZE;i++)
    std::cout<<m[i]<<"\n";
}

Answer 1

The answer seems to be no.

Compiling with PGI using:

pgc++ -fast -mp -ta=tesla,pinned,cc60 -Minfo=all test2.cpp

gives:

13, Parallel region activated
    Parallel loop activated with static block schedule
    Loop not vectorized/parallelized: contains call
14, Parallel region terminated

Compiling with GCC using:

g++ -O3 test2.cpp -fopenmp -fopt-info

gives:

test2.cpp:17: note: not vectorized: loop contains function calls or data references that cannot be analyzed
test2.cpp:17: note: bad data references.

Answer 2

It works on the GPU with GCC. You need to map m, for example like this:

#pragma omp target map(tofrom:m[0:SIZE])

I compiled it like this:

g++ -O3 -Wall -fopenmp -fno-stack-protector so.cpp
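For reference, here is a minimal sketch of the question's loop with the map clause applied. The function name fill_with_thread_ids and the use of std::vector are assumptions made for illustration and are not part of the original code. The serial fallback for omp_get_thread_num() is also an addition, there only so the sketch still builds when OpenMP is disabled.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Assumption: serial fallback so the sketch also builds without -fopenmp.
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper: fills each element with the id of the thread that wrote it.
std::vector<int> fill_with_thread_ids(int size) {
  std::vector<int> result(size, -1);
  int *m = result.data();  // raw pointer so the array section can be mapped

  // map(tofrom: ...) copies the buffer to the device and back again.
  // If no offload device is available, the region runs on the host.
  #pragma omp target map(tofrom: m[0:size])
  {
    #pragma omp parallel for
    for (int i = 0; i < size; i++)
      m[i] = omp_get_thread_num();
  }
  return result;
}
```

Calling fill_with_thread_ids(400000) and printing the result corresponds to the MWE in the question. The ids you see depend on the device and on how many threads the runtime starts.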

You can see an example running on a system without offloading here:

http://coliru.stacked-crooked.com/a/1e756410d6e2db61

The method I use to find out the number of teams and threads before doing any work is this:

#pragma omp target teams defaultmap(tofrom:scalar)
{
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
}

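On my system with GCC 7.2, Ubuntu 17.10, and gcc-offload-nvptx with a GTX 1060, I get nteams = 30 and nthreads = 8. See this answer, where I do a custom reduction for a target region using threads and teams. With -foffload=disable, I get nteams = 1 and nthreads = 8 (a CPU with 4 cores / 8 hardware threads).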
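The query above can be wrapped in a function, as in this rough sketch. The names LaunchShape and query_launch_shape are hypothetical, and the fallback definitions are an assumption added so the code still builds without OpenMP. The values returned depend on the compiler, the runtime, and the device.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Assumption: serial fallbacks so the sketch also builds without -fopenmp.
static int omp_get_num_teams() { return 1; }
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical result type for illustration.
struct LaunchShape {
  int nteams;
  int nthreads;
};

// Hypothetical helper: reports how many teams, and threads per team, a target region gets.
LaunchShape query_launch_shape() {
  int nteams = 0;
  int nthreads = 0;

  // defaultmap(tofrom:scalar) makes the scalars visible on the host after
  // the region instead of the default firstprivate treatment.
  #pragma omp target teams defaultmap(tofrom:scalar)
  {
    // Each team stores the same team count.
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();  // one thread per team records the count
  }
  return {nteams, nthreads};
}
```

Running this once before the real work gives the sizes needed to allocate per-team or per-thread scratch space, provided that later target regions are launched with the same configuration.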

I added -fopt-info to the compile options and got only this message:

note: basic block vectorized

Can I use `omp_get_thread_num()` on the GPU?
