Before asking this, I read this question, which is similar to mine.
Here is my program in detail.
#include <cstdio>
#include <cstdlib>
#include <iostream>
using std::cout;
using std::endl;
#define N 70000
#define M 1000
class ObjBox
{
public:
    int oid; float x; float y; float ts;
};
class Bucket
{
public:
    int bid; int nxt; ObjBox *arr_obj; int nO;
};
int main()
{
    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));
    for (int i = 0; i < N; i++)
    {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;
        // One managed allocation per bucket; this is the call that fails.
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));
        if (r != cudaSuccess)
        {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        }
        // Mark every slot in this bucket as empty.
        for (int j = 0; j < M; j++)
        {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }
    cout << "Bucket Array Initial Completed..." << endl;
    cudaFree(arr_bkt);
    return 0;
}
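In my main program I allocate an array of type Bucket, each element of which holds a nested array of ObjBox. There are N (70000) Buckets in the array and M (1000) ObjBoxes in each Bucket. The program compiles without problems, but at run time I get an out-of-memory error at the line cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), M * sizeof(ObjBox));
I have been trying to solve this for a long time, and this is what I have found:
1. When N is smaller, such as 30000, 40000, or even 60000, the program works correctly. That is, it is possible to allocate this much unified memory inside a structure;
2. There is still plenty of memory available. My server has 16 GB of host memory and 11 GB of GPU global memory, yet in this program the Bucket array consumes only about
N * M * sizeof(ObjBox) = 70000 * 1000 * 16 bytes = 1120 MB;
3. The value of M has almost nothing to do with the out-of-memory error; with N unchanged (70000) and M reduced to 100, the program still fails;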
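My GPU is a Tesla K40c. I took the problem to my supervisor, who passed it on to a friend of hers; the friend ran the program on her Tesla K20 with CUDA 7.0, and there the structure was allocated without error.
What is going on, and how can I allocate the structure on the Tesla K40c? My supervisor thinks there may be some limit in the GPU driver settings, but I have not been able to resolve it.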
Answer 0 (score: 2)
If I add some instrumentation to your code, like this:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#define N 70000
#define M 1000
class ObjBox
{
public:
    int oid;
    float x;
    float y;
    float ts;
};
class Bucket
{
public:
    int bid;
    int nxt;
    ObjBox *arr_obj;
    int nO;
};
int main()
{
    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));
    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;
        size_t allocsz = size_t(M) * sizeof(ObjBox);
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), allocsz);
        if (r != cudaSuccess) {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(0);
        } else {
            // Report free device memory after each successful allocation.
            size_t total_mem, free_mem;
            cudaMemGetInfo(&free_mem, &total_mem);
            std::cout << i << ":Allocated " << allocsz;
            std::cout << " Currently " << free_mem << " bytes free" << std::endl;
        }
        for (int j = 0; j < M; j++) {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }
    std::cout << "Bucket Array Initial Completed..." << std::endl;
    cudaFree(arr_bkt);
    return 0;
}
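Compiling and running it on a unified memory system with 16 GB of physical host memory and 2 GB of physical device memory, using the Linux 352.39 driver, I get this: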
0:Allocated 16000 Currently 2099871744 bytes free
1:Allocated 16000 Currently 2099871744 bytes free
2:Allocated 16000 Currently 2099871744 bytes free
3:Allocated 16000 Currently 2099871744 bytes free
4:Allocated 16000 Currently 2099871744 bytes free
5:Allocated 16000 Currently 2099871744 bytes free
6:Allocated 16000 Currently 2099871744 bytes free
7:Allocated 16000 Currently 2099871744 bytes free
8:Allocated 16000 Currently 2099871744 bytes free
9:Allocated 16000 Currently 2099871744 bytes free
....
....
....
65445:Allocated 16000 Currently 1028161536 bytes free
65446:Allocated 16000 Currently 1028161536 bytes free
65447:Allocated 16000 Currently 1028161536 bytes free
65448:Allocated 16000 Currently 1028161536 bytes free
65449:Allocated 16000 Currently 1028161536 bytes free
65450:Allocated 16000 Currently 1028161536 bytes free
65451:Allocated 16000 Currently 1028161536 bytes free
CUDA Error on out of memory
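i.e. it reports out of memory with plenty of free memory still left on the device.
I think the key to understanding this is the number of allocations at the point of failure, not their size. 65451 is suspiciously close to 65535 (i.e. 2^16 - 1). Allowing for the internal memory allocations the runtime itself makes, my guess is that there is some accidental or deliberate limit of 65535 on the total number of managed memory allocations.
I would be very interested to know whether you can reproduce this. If you can, I would consider filing a bug report with NVIDIA.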