Question

I am testing dynamic allocation, i.e.

__device__ double *temp;
__global__
void test(){
    temp = new double[125000]; //1MB
}

and calling this function 100 times to see whether the available memory decreases:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6)); 

for(int t=0;t<100;t++){
    test<<<1, 1>>>();
    CUDA_CHECK(cudaDeviceSynchronize());
    fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));
}
CUDA_CHECK(cudaMemGetInfo(&free, &total));
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));

It actually does.

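Note: when I tried it WITHOUT calling the kernel and with cudaMemGetInfo inside the loop, the available memory went from 800 down to 650 Mo, and I concluded that the output to the console took about 150 Mo. Indeed, with the code as written above (no cudaMemGetInfo inside the loop), the value printed in the loop does not change. But that is huge!
After the loop my available memory has dropped by ~50 Mo (I get no decrease when I comment out the call to the kernel, as I would hope). When I add a delete(temp) inside the kernel, it does not seem to reduce the amount of wasted memory much; I still see a decrease of ~30 Mo. Why?
Using cudaFree(temp) or cudaDeviceReset() after the loop does not help much either. Why? And how do I deallocate the memory?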

Answer 1

It really sounds like you need to read this question and answer pair before going any further.

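The memory you are allocating with new inside the kernel comes from a static runtime heap. That heap is allocated as part of the "lazy" context establishment which the CUDA runtime initiates when your program runs. The first CUDA call that establishes a context will also load the module containing the kernel code and reserve local memory, runtime buffers and the runtime heap for the kernel calls that follow. That is where most of the memory consumption you have observed comes from. The runtime API contains a call which gives the user control over the size of these allocations.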
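As a rough sketch of that kind of control (this is not from the original answer): in current toolkits the relevant runtime calls are cudaDeviceGetLimit and cudaDeviceSetLimit with cudaLimitMallocHeapSize. The helper name deviceHeapBytes and the 128 MiB target below are assumptions made up for illustration. The documentation states that the limit has to be set before launching any kernel that uses in-kernel malloc/new.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: current size of the device malloc/new heap in bytes,
// or 0 if the query fails.
static size_t deviceHeapBytes() {
    size_t bytes = 0;
    if (cudaDeviceGetLimit(&bytes, cudaLimitMallocHeapSize) != cudaSuccess) {
        return 0;
    }
    return bytes;
}

int main() {
    printf("Device heap before resize: %zu bytes\n", deviceHeapBytes());

    // Assumed target size (128 MiB), chosen only for illustration.
    // This must happen before any kernel that uses new/malloc is launched.
    const size_t wanted = 128u * 1024u * 1024u;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, wanted);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device heap after resize:  %zu bytes\n", deviceHeapBytes());
    return 0;
}
```

The programming guide gives the default heap size as 8 MB, so 100 allocations of 1 MB each without a matching delete would not all fit unless the heap is enlarged first.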

You should find that doing something like this on CUDA version 4 or 5:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

cudaFree(0);

CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

// Kernel loop follows

[disclaimer: written in browser, use at own risk]

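should show that the available memory drops after the cudaFree(0) call, because that call should trigger the context initialisation sequence, which consumes memory on the GPU.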
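On the "how to deallocate" part, here is a minimal sketch, assuming the same global pointer as in the question. According to the CUDA programming guide, memory obtained with in-kernel new/malloc has to be released from device code with delete/free and cannot be passed to cudaFree. The kernel names allocate and release and the helper freeBytes are made up for this example. As far as I understand it, delete[] hands the block back to the device heap for reuse rather than back to the driver, so the figure reported by cudaMemGetInfo should not be expected to rise again.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ double *temp;  // same global pointer as in the question

__global__ void allocate() {
    temp = new double[125000];  // ~1 MB taken from the device runtime heap
}

__global__ void release() {
    delete[] temp;              // hands the block back to the device heap
    temp = nullptr;
}

// Hypothetical helper: free device memory in bytes as reported by the
// runtime, or 0 if the query fails.
static size_t freeBytes() {
    size_t freeMem = 0, totalMem = 0;
    if (cudaMemGetInfo(&freeMem, &totalMem) != cudaSuccess) {
        return 0;
    }
    return freeMem;
}

int main() {
    cudaFree(0);  // force context creation so its cost is not counted below
    const size_t before = freeBytes();

    for (int t = 0; t < 100; ++t) {
        allocate<<<1, 1>>>();
        release<<<1, 1>>>();   // each allocation is freed before the next one
    }
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel loop failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const size_t after = freeBytes();
    printf("Free before loop: %g Mo, after loop: %g Mo\n", before / 1e6, after / 1e6);
    return 0;
}
```

Any difference that remains between the two figures would then reflect what the context reserves on the first kernel launch (module, local memory, runtime heap), not the individual 1 MB blocks.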
