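OpenACC data movement

Time: 2016-06-02 16:07:52

Tags: pragma openacc

I am new to OpenACC and I don't quite understand data movement and the "#pragma acc data" clause.

I have a program written in C. An excerpt from the code looks like this: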

#pragma acc data create(intersectionSet[0:intersectionsCount][0:4]) // line 122
#pragma acc kernels // line 123
for (int i = 0; i<intersectionsCount; i++){ // line 124
    intersectionSet[i][0] = 9; // line 125
}

The value of intersectionsCount is 210395. I compiled the code with the following command and ran it:

pgcc -o rect_openacc -fast -Minfo -acc -ta=nvidia,time rect.c

I got this output:

    time(us): 1,475,607
122: data region reached 1 time
    31: kernel launched 210395 times
        grid: [1]  block: [128]
         device time(us): total=1,475,315 max=15 min=7 avg=7
        elapsed time(us): total=5,451,647 max=24,028 min=24 avg=25
123: compute region reached 1 time
    124: kernel launched 1 time
        grid: [1644]  block: [128]
         device time(us): total=292 max=292 min=292 avg=292
        elapsed time(us): total=312 max=312 min=312 avg=312
156: data region reached 1 time

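After reading the output I have some questions:

  1. I don't know why it says line 31, because line 31 has no acc pragma. Does it mean something I cannot trace?
  2. The output line "31: kernel launched 210395 times" says the kernel was launched 210395 times. I don't know whether it is normal for the kernel to be launched that many times, because this part took 5,451,647 (us), which seems rather long to me. I think the for loop is simple and should not take much time. Am I using the pragma the wrong way?
  3. Update
    The program has several header files, but none of them contains an "acc data" or "acc kernels" pragma.

    After compiling the code with "-Minfo=all", the result is as follows: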

    breakStringToCharArray:
     11, include "stringHelper.h"
          50, Loop not vectorized/parallelized: contains call
    countChar:
     11, include "stringHelper.h"
          74, Loop not vectorized/parallelized: not countable
    extractCharToIntRequiredInt:
     11, include "stringHelper.h"
          93, Loop not vectorized/parallelized: contains call
    extractArray:
     12, include "fileHelper.h"
          49, Loop not vectorized/parallelized: contains call
    isRectOverlap:
     13, include "shapeHelper.h"
          23, Generating acc routine vector
              Generating Tesla code
    getRectIntersection:
     13, include "shapeHelper.h"
          45, Generating acc routine vector
              Generating Tesla code
    getRectIntersectionInGPU:
     13, include "shapeHelper.h"
          69, Generating acc routine vector
              Generating Tesla code
    max:
     13, include "shapeHelper.h"
          98, Generating acc routine vector
              Generating Tesla code
    min:
     13, include "shapeHelper.h"
         118, Generating acc routine vector
              Generating Tesla code
    main:
    64, Loop not vectorized/parallelized: contains call
    108, Loop not vectorized/parallelized: contains call
    122, Generating create(intersectionSet[:intersectionsCount][:4])
    124, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
    124, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    

    I created intersectionSet this way:

    intersectionSet = (int **)malloc(sizeof(int **) * intersectionsCount);
    for (i = 0; i<intersectionsCount; i++){
        intersectionSet[i] = (int *)malloc(sizeof(int *) * 4);
    }
    

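1 answer:

Answer 0: (score: 3)

What is happening is that, because you have a pointer to an array of pointers, "**" (at least I am guessing that is what intersectionSet is), the compiler must first allocate the array of pointers on the device and then loop over each element, allocating the individual device arrays. Finally, for each element it has to launch a kernel to set the pointer value on the device. Here is some pseudocode to help illustrate.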

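// Pseudocode only: these are not real runtime calls.
devPtrPtr = deviceMalloc(numElements*pointer size);   // device array of row pointers
for (i=0; i < numElements; ++i) {
   devPtr = deviceMalloc(elementSize * dataTypeSize);  // one device allocation per row
   // one tiny kernel launch per row to store the row pointer on the device
   call deviceKernelToSetPointer<<<1,128>>>(devPtrPtr[i],devPtr);
}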
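As a rough check of that explanation against the profile in the question, dividing the reported totals for the "31: kernel launched 210395 times" entry by the launch count gives the cost of each pointer-setting launch. This only uses numbers already in the posted output:

```latex
\frac{5{,}451{,}647\ \mu\text{s}}{210{,}395\ \text{launches}} \approx 25.9\ \mu\text{s per launch (elapsed)},
\qquad
\frac{1{,}475{,}315\ \mu\text{s}}{210{,}395\ \text{launches}} \approx 7.0\ \mu\text{s per launch (device)}
```

Each launch is cheap on its own, but one launch per row adds up to about 5.45 seconds of elapsed time, compared with 312 us for the compute kernel at line 124.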

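To help your code, I would switch the dimensions so that the column length is 4 and the row length is "intersectionsCount". With only 4 row pointers there are only 4 device allocations and 4 pointer-setting launches instead of 210395. This also helps data access on the device, because the "vector" loop should correspond to the stride-1 (contiguous) dimension to avoid memory divergence.

Hope this helps.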
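A minimal sketch of the switched layout, under some assumptions that are not in the question: it goes one step further than swapping the two dimensions and stores everything in a single contiguous block indexed as [field][intersection], the names FIELDS, alloc_intersections and fill_first_field are hypothetical, and the data clause is copyout rather than create so the result is visible on the host. Without an OpenACC compiler the pragmas are ignored and the loop simply runs on the CPU.

```c
#include <assert.h>
#include <stdlib.h>

#define FIELDS 4 /* assumed: four values stored per intersection */

/* Allocate one zeroed, contiguous block laid out as [FIELDS][count].
 * Element (field f, intersection i) lives at set[f * count + i]. */
int *alloc_intersections(int count)
{
    assert(count > 0);
    return calloc((size_t)FIELDS * (size_t)count, sizeof(int));
}

/* Set field 0 of every intersection to `value`.
 * Field 0 occupies set[0 .. count-1], so a single contiguous range
 * is all the data region has to manage: one device allocation and
 * no per-row pointer fix-up kernels. */
void fill_first_field(int *set, int count, int value)
{
    #pragma acc data copyout(set[0:count])
    {
        #pragma acc kernels
        for (int i = 0; i < count; i++) {
            /* neighbouring iterations touch neighbouring addresses (stride 1) */
            set[0 * count + i] = value;
        }
    }
}
```

A call such as `fill_first_field(set, intersectionsCount, 9)` would then stand in for the loop at lines 124-125 of the question; the caller still has to check the result of alloc_intersections for NULL and free it.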