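Allocating unified memory in my program fails at run time with "CUDA error: out of memory", although free memory remains

Asked: 2016-01-18 08:43:42

Tags: cuda nested out-of-memory

Before asking this, I read this question, which is similar to mine.

Here I describe my program in detail.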

#include <cstdio>
#include <cstdlib>
#include <iostream>

#define N 70000
#define M 1000

class ObjBox
{
public:
    int oid; float x; float y; float ts;
};

class Bucket
{
public:
    int bid; int nxt; ObjBox *arr_obj; int nO;
};

int main()
{
    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));

    for (int i = 0; i < N; i++)
    {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;

        // One separate managed allocation per bucket; this is the call that fails.
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));
        if (r != cudaSuccess)
        {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        }

        for (int j = 0; j < M; j++)
        {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    std::cout << "Bucket Array Initial Completed..." << std::endl;
    cudaFree(arr_bkt);
    return 0;
}

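In my main program I allocate an array of type Bucket, and each element holds a nested array of ObjBox. There are N (70000) Buckets in total, and each Bucket holds M (1000) ObjBoxes. The program compiles without problems, but at run time it fails with an out-of-memory error at the line cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));

I have been trying to solve this for a long time, and here is what I have found:

1. When N is smaller, such as 30000, 40000, or even 60000, the program works correctly. That is, it can allocate that much unified memory in this kind of structure;

2. There is still plenty of free memory. My server has 16 GB of host memory and 11 GB of GPU global memory, yet the Bucket array in this program consumes only about

 N * M * sizeof(ObjBox) = 70000 * 1000 * 16 bytes = 1120 MB;

3. The value of M has almost no bearing on the out-of-memory error: with N unchanged (70000) and M reduced to 100, the program still fails;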
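The arithmetic in point 2 can be checked with a small host-only sketch. The helper name payload_bytes is hypothetical and not part of my program; the 16-byte figure assumes that int and float are 4 bytes each, which holds on common platforms but is not guaranteed by the language.

```cpp
// Same members as the ObjBox class in the question: one int and three floats.
struct ObjBox { int oid; float x; float y; float ts; };

// Bytes needed for the nested ObjBox arrays alone: n buckets with m objects each.
// Hypothetical helper for illustration; a 64-bit type avoids overflow in n * m.
unsigned long long payload_bytes(unsigned long long n, unsigned long long m) {
    return n * m * sizeof(ObjBox);
}
```

With N = 70000 and M = 1000 and a 16-byte ObjBox, payload_bytes(70000, 1000) is 1,120,000,000 bytes, far below both the 16 GB of host memory and the 11 GB of device memory.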

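My GPU is a Tesla K40c. I raised the problem with my advisor, who passed it on to a friend; the friend ran the program on a Tesla K20 with CUDA 7.0, and there the structure was allocated without problems.

What is going on? How can I allocate this structure on a Tesla K40c? My advisor thinks some limit in the GPU driver settings may be responsible, but I have not been able to resolve it.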

1 Answer:

Answer 0 (score: 2)

If I modify your code with some instrumentation, like this:

#include <cstdio>
#include <cstdlib>
#include <iostream>

#define N 70000
#define M 1000

class ObjBox
{
    public:

        int oid; 
        float x; 
        float y; 
        float ts;
};

class Bucket
{
    public:

        int bid; 
        int nxt; 
        ObjBox *arr_obj; 
        int nO;
};

int main()
{

    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));

    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i; 
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;

        size_t allocsz = size_t(M) * sizeof(ObjBox);
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), allocsz);
        if (r != cudaSuccess) {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        } else {
            size_t total_mem, free_mem;
            cudaMemGetInfo(&free_mem, &total_mem);
            std::cout << i << ":Allocated " << allocsz;
            std::cout << " Currently " << free_mem << " bytes free" << std::endl;
        } 

        for (int j = 0; j < M; j++) {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    std::cout << "Bucket Array Initial Completed..." << std::endl;
    cudaFree(arr_bkt);

    return 0;
}

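and compile and run it on a unified memory system with 16 GB of physical host memory and 2 GB of physical device memory, using the Linux 352.39 driver, I get this: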

0:Allocated 16000 Currently 2099871744 bytes free
1:Allocated 16000 Currently 2099871744 bytes free
2:Allocated 16000 Currently 2099871744 bytes free
3:Allocated 16000 Currently 2099871744 bytes free
4:Allocated 16000 Currently 2099871744 bytes free
5:Allocated 16000 Currently 2099871744 bytes free
6:Allocated 16000 Currently 2099871744 bytes free
7:Allocated 16000 Currently 2099871744 bytes free
8:Allocated 16000 Currently 2099871744 bytes free
9:Allocated 16000 Currently 2099871744 bytes free
....
....
....
65445:Allocated 16000 Currently 1028161536 bytes free
65446:Allocated 16000 Currently 1028161536 bytes free
65447:Allocated 16000 Currently 1028161536 bytes free
65448:Allocated 16000 Currently 1028161536 bytes free
65449:Allocated 16000 Currently 1028161536 bytes free
65450:Allocated 16000 Currently 1028161536 bytes free
65451:Allocated 16000 Currently 1028161536 bytes free
CUDA Error on out of memory    

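i.e. it reports out of memory with plenty of free memory left on the device.

I think the key to understanding this is the number of allocations at the point of failure, rather than their size. 65451 is suspiciously close to 65535 (i.e. 2^16 - 1). Allowing for internal memory allocations that the runtime makes, I am going to guess that there is either an accidental or a deliberate limit of 65535 on the total number of managed memory allocations.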
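If that guess is right, the count matters because the loop issues N + 1 managed allocations (70001 for N = 70000, but only 60001 for N = 60000), regardless of M. One possible workaround, sketched below under that assumption, is to allocate a single pool of N * M ObjBox elements and point each bucket into it, which needs two allocations in total. This is a host-only illustration of the layout: the std::vector members stand in for buffers that would each come from one cudaMallocManaged call in the real program, and the names BucketPool and make_bucket_pool are hypothetical. I have not verified that it avoids the error on a K40c.

```cpp
#include <cstddef>
#include <vector>

struct ObjBox { int oid; float x; float y; float ts; };
struct Bucket { int bid; int nxt; ObjBox *arr_obj; int nO; };

// Two buffers instead of n + 1. In a CUDA program each vector would be
// replaced by one managed allocation (assumption: one cudaMallocManaged each).
struct BucketPool {
    std::vector<Bucket> buckets;  // n bucket headers
    std::vector<ObjBox> objects;  // n * m objects, stored contiguously
};

BucketPool make_bucket_pool(std::size_t n, std::size_t m) {
    BucketPool pool;
    pool.buckets.resize(n);
    // Every object starts with the same -1 sentinel values as in the question.
    pool.objects.assign(n * m, ObjBox{-1, -1.0f, -1.0f, -1.0f});
    for (std::size_t i = 0; i < n; ++i) {
        pool.buckets[i].bid = static_cast<int>(i);
        pool.buckets[i].nxt = -1;
        pool.buckets[i].nO = 0;
        // Bucket i owns the slice [i * m, (i + 1) * m) of the shared pool.
        pool.buckets[i].arr_obj = pool.objects.data() + i * m;
    }
    return pool;
}
```

Code that indexes arr_bkt[i].arr_obj[j] keeps working unchanged with this layout; the trade-off is that individual buckets can no longer be freed or resized independently.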

I would be interested to see whether you can reproduce this. If you can, I would consider filing a bug report with NVIDIA.