I am writing a CUDA program, and profiling shows that most of the time is spent doing dot products on large matrices:
==27530== API calls:
Time(%)      Time    Calls       Avg       Min       Max  Name
 64.90%  2.25369s       23  97.986ms  9.5590us  1.79533s  cudaMemcpy
 21.04%  730.65ms     1422  513.82us  3.0050us  21.028ms  cudaLaunch
  8.72%  302.72ms        5  60.543ms     477ns  170.92ms  cudaFree
  3.64%  126.54ms       18  7.0298ms  4.8882ms  35.518ms  cudaMallocHost
  1.39%  48.292ms       16  3.0182ms  3.0076ms  3.0601ms  cudaFreeHost
  0.11%  3.9026ms       23  169.68us  64.314us  1.7771ms  cudaMalloc
  0.09%  3.0171ms    17661     170ns     144ns  3.1750us  cudaSetupArgument
  0.04%  1.3514ms      810  1.6680us  1.4000us  9.9270us  cudaBindTexture
  0.02%  569.60us      810     703ns     596ns  4.8010us  cudaUnbindTexture
  0.02%  556.24us      945     588ns     484ns  4.2560us  cudaFuncSetCacheConfig
  0.01%  499.67us     1422     351ns     163ns  198.52us  cudaConfigureCall
  0.01%  256.21us     1310     195ns     150ns     335ns  cudaGetLastError
  0.01%  238.26us      166  1.4350us     165ns  49.141us  cuDeviceGetAttribute
  0.01%  175.44us      945     185ns     157ns     755ns  cudaPeekAtLastError
  0.00%  50.787us        2  25.393us  16.700us  34.087us  cuDeviceGetName
  0.00%  45.330us        2  22.665us  19.024us  26.306us  cuDeviceTotalMem
  0.00%  43.289us        2  21.644us  13.641us  29.648us  cudaMemset
  0.00%  43.029us        2  21.514us  14.059us  28.970us  cudaGetDeviceProperties
  0.00%  13.931us       12  1.1600us     339ns  5.5310us  cudaGetDevice
  0.00%  3.4750us        1  3.4750us  3.4750us  3.4750us  cudaDeviceSynchronize
  0.00%  1.5320us        1  1.5320us  1.5320us  1.5320us  cuDriverGetVersion
  0.00%  1.2690us        3     423ns     241ns     753ns  cuDeviceGetCount
  0.00%  1.0080us        1  1.0080us  1.0080us  1.0080us  cuInit
  0.00%  1.0060us        3     335ns     314ns     377ns  cuDeviceGet
It shows cudaMemcpy taking over two seconds. But there are only a handful of cudaMemcpy calls in my code, and all of the D->H and H->D copies use pinned memory. I don't believe my cudaMemcpy calls should take that long.
The function that consumes most of the time:
==27530== Profiling result:
Time(%)      Time    Calls       Avg       Min       Max  Name
 74.35%  2.34598s      112  20.946ms  20.743ms  21.161ms  knl_convolve_filter(float*, float*, int, int, int, float*)
And the kernel itself:
__global__ void knl_convolve_filter(float *feature, float *filter, int width, int height, int cell_size, float *convolution) {
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    if (x < width && y < height) {
        if (x & 1) {
            // odd, imaginary part
            float sum = 0.0f;
            size_t offset = (y * width + x - 1) * cell_size;
            for (int i = 0, total_cell_size = cell_size * 2; i < total_cell_size; i += 2) {
                float y = *(feature + offset + i) * *(filter + offset + i + 1) + *(feature + offset + i + 1) * *(filter + offset + i);
                sum += y;
            }
            *(convolution + y * width + x) = sum;
        } else {
            // even, real part
            float sum = 0.0f;
            size_t offset = (y * width + x) * cell_size;
            for (int i = 0, total_cell_size = cell_size * 2; i < total_cell_size; i += 2) {
                float x = *(feature + offset + i) * *(filter + offset + i) - *(feature + offset + i + 1) * *(filter + offset + i + 1);
                sum += x;
            }
            *(convolution + y * width + x) = sum;
        }
    }
}
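For reference, here is a minimal host-side sketch of how a kernel with this signature could be driven. The dimensions, launch configuration, and overall structure are illustrative assumptions rather than my actual code; the kernel implies an interleaved layout in which each even/odd column pair shares 2 * cell_size floats:

#include <cuda_runtime.h>

// knl_convolve_filter as defined above.

int main() {
    // Illustrative dimensions; width must be even for the even/odd column pairing.
    const int width = 64, height = 64, cell_size = 31;
    const size_t n = (size_t)width * height * cell_size;   // interleaved input floats
    const size_t out_n = (size_t)width * height;           // one output float per thread

    float *h_feature, *h_filter, *h_conv;
    float *d_feature, *d_filter, *d_conv;

    // Pinned host buffers, matching the cudaMallocHost calls in the API summary.
    cudaMallocHost((void **)&h_feature, n * sizeof(float));
    cudaMallocHost((void **)&h_filter, n * sizeof(float));
    cudaMallocHost((void **)&h_conv, out_n * sizeof(float));
    cudaMalloc((void **)&d_feature, n * sizeof(float));
    cudaMalloc((void **)&d_filter, n * sizeof(float));
    cudaMalloc((void **)&d_conv, out_n * sizeof(float));

    // ... fill h_feature and h_filter ...

    cudaMemcpy(d_feature, h_feature, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_filter, h_filter, n * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per output element (a real or an imaginary part).
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    knl_convolve_filter<<<grid, block>>>(d_feature, d_filter, width, height,
                                         cell_size, d_conv);

    cudaMemcpy(h_conv, d_conv, out_n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFreeHost(h_feature); cudaFreeHost(h_filter); cudaFreeHost(h_conv);
    cudaFree(d_feature); cudaFree(d_filter); cudaFree(d_conv);
    return 0;
}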
I am using a GTX 760 (CC 3.0) on Fedora 19 64-bit with CUDA 6.0. Am I making some big mistake here?
Answer 0 (score: 3)
It is difficult to give a definitive answer because you have not shown any host code, but it appears that there is one very slow cudaMemcpy call in the profiled sequence which consumes 1.79533 seconds by itself. The other 20-odd calls average only about 20 ms each. So the real question is "why does this particular cudaMemcpy call take 1.79533 seconds?", and I suspect the answer is that it is absorbing much of the lazy setup latency of the CUDA runtime API.
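If that is the cause, one standard way to confirm it is to force context creation before any timed work, for example with the common cudaFree(0) warm-up idiom (a minimal sketch; warm_up_cuda is a hypothetical helper, not something your code is known to contain):

#include <cuda_runtime.h>

// Warm-up sketch: trigger CUDA context creation explicitly so that the first
// "real" API call (often a cudaMemcpy) does not absorb the lazy setup cost.
void warm_up_cuda(int device) {
    cudaSetDevice(device);   // bind this host thread to the chosen device
    cudaFree(0);             // harmless call that forces context initialization
    cudaDeviceSynchronize(); // wait until initialization has fully completed
}

Calling warm_up_cuda(0) once at program start should move the setup cost out of the first cudaMemcpy, so that the per-call times in the API summary reflect the copies themselves.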
The nvprof profiling utility that ships with modern versions of the CUDA toolkit has an option to emit a detailed API timeline. Analysis of that timeline would definitively answer your question, but without host code or an API trace, this is as specific an answer as can be given.
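For example (assuming your application binary is named ./myapp, a placeholder), the runtime API trace can be printed with:

nvprof --print-api-trace ./myapp

This lists each runtime API call in order with its timestamp and duration, which makes it straightforward to see exactly which cudaMemcpy call is absorbing the startup latency.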