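-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

Date: 2017-05-02 20:15:19

Tags: cuda openacc

I'm new to OpenACC. I like OpenMP a lot.

I have two 1080Ti cards with 9 GB each, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware would page in and page out on demand.

Here is my complete C test code: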

$ cat firstAcc.c 

#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 10

int main()
{
  float *a;
  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  float sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}

Following the "Enabling Unified Memory" section of this article, I compile it with:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
 20, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop
 28, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop containing reductions
     Generated a prefetch instruction for the loop

I need to understand these messages, but for now I don't think they are relevant. Then I run it:

$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)

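If I change GB to 8, this works fine. Given the Pascal 1080Ti and CUDA 8, I expected 10 GB to work (even though the GPU card has 9 GB).

Am I misunderstanding something, or doing something wrong? Thanks in advance.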

$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

$ cat /usr/local/cuda-8.0/version.txt 
CUDA Version 8.0.61

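2 Answers:

Answer 0: (score: 5)

Besides what Bob mentioned, there are a few more fixes I made.

First, you're not actually generating an OpenACC compute region, since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages, which show only host code optimizations.

Second, the "i" index should be declared as "long". Otherwise, you will overflow the index.

Finally, you need to add "cc60" to the target accelerator options to tell the compiler to target a Pascal-based GPU.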

% cat mi.c  
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
  float *a;
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (long i = 0; i < n; ++i) {  /* long: n exceeds INT_MAX */
    a[i] = 0.1f;
  }
  printf("done\n");
  double sum=0.0;
  #pragma acc parallel loop reduction (+:sum)
  for (long i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}

% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
     21, Accelerator kernel generated
         Generating Tesla code
         21, Generating reduction(+:sum)
         22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000

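Answer 1: (score: 3)

I believe the problem is here:

size_t n = GB*1024*1024*1024/sizeof(float);

When I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler does not warn, but the same bad thing is happening. After the declarations of s and n, if I add a printout like this: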

  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);  // add this line

Compiling with PGI 17.04 and running (on a P100, 16 GB), I get output like this:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$

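So it's evident that n and s are not what you intended.

We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me: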

$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
  float *a;
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  double sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$

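Note that I also made one other change above. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction over very small quantities.

And, as @MatColgrove points out in his answer, I missed a few other things too.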