"The number in the square brackets shown after the kernel name correlates to the CUDA API that launched that kernel." (from GPU-Trace and API-Trace Modes)
What exactly is CUDA API [94] (and the others) in the NVIDIA CUDA Runtime API?
==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
133.81ms 135.78us - - - - - 409.60KB 3.0167GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.62ms 270.66us - - - - - 819.20KB 3.0267GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.90ms 3.7037ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [94]
138.71ms 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [105]
<...more output...>
1.24341s 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2191]
1.24711s 3.7046ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198]
1.25089s 248.13us - - - - - 819.20KB 3.3015GB/s GeForce GT 640M 1 2 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
Answer 0 (score: 3)
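It could perhaps be stated more clearly:
The number in the square brackets shown after the kernel name correlates to the CUDA API call that launched that kernel.
If you run a given code with the --print-api-trace option, you get a sequential listing of all CUDA API calls issued by that application. If you number those calls in order, the number associated with a particular kernel launch is the one shown in square brackets in the --print-gpu-trace output.
Here is a full worked example. Note the correlation of [105], [106], and [108] between the api-trace output and the gpu-trace output: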
$ cat t1.cu
__global__ void k(){}
int main(){
k<<<1,1>>>();
k<<<1,1>>>();
cudaDeviceSynchronize();
k<<<1,1>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
Start Duration Name
116.17ms 3.0990us cuDeviceGetPCIBusId
130.20ms 800ns cuDeviceGetCount
130.20ms 251ns cuDeviceGetCount
130.41ms 1.0500us cuDeviceGet
130.41ms 705ns cuDeviceGetAttribute
130.42ms 539ns cuDeviceGetAttribute
130.42ms 547ns cuDeviceGetAttribute
130.46ms 525ns cuDeviceGetCount
130.46ms 277ns cuDeviceGet
130.46ms 59.680us cuDeviceGetName
130.52ms 63.802us cuDeviceTotalMem
130.59ms 497ns cuDeviceGetAttribute
130.59ms 226ns cuDeviceGetAttribute
130.59ms 282ns cuDeviceGetAttribute
130.59ms 234ns cuDeviceGetAttribute
130.59ms 229ns cuDeviceGetAttribute
130.59ms 34.628us cuDeviceGetAttribute
130.62ms 372ns cuDeviceGetAttribute
130.63ms 220ns cuDeviceGetAttribute
130.63ms 284ns cuDeviceGetAttribute
130.63ms 237ns cuDeviceGetAttribute
130.63ms 222ns cuDeviceGetAttribute
130.63ms 231ns cuDeviceGetAttribute
130.63ms 288ns cuDeviceGetAttribute
130.63ms 219ns cuDeviceGetAttribute
130.63ms 3.1870us cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 275ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 213ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 336ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 213ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 210ns cuDeviceGetAttribute
130.65ms 215ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 320.65us cuDeviceGetAttribute
130.97ms 322ns cuDeviceGetAttribute
130.97ms 206ns cuDeviceGetAttribute
130.97ms 218ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.98ms 226ns cuDeviceGetAttribute
130.98ms 220ns cuDeviceGetAttribute
130.98ms 212ns cuDeviceGetAttribute
130.98ms 210ns cuDeviceGetAttribute
130.98ms 206ns cuDeviceGetAttribute
130.98ms 207ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 211ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 229ns cuDeviceGetAttribute
130.98ms 215ns cuDeviceGetAttribute
130.98ms 216ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 316.59us cuDeviceGetAttribute
131.30ms 266ns cuDeviceGetAttribute
131.30ms 252ns cuDeviceGetAttribute
131.30ms 212ns cuDeviceGetAttribute
131.30ms 235ns cuDeviceGetAttribute
131.30ms 209ns cuDeviceGetAttribute
131.30ms 272ns cuDeviceGetAttribute
131.30ms 207ns cuDeviceGetAttribute
131.30ms 735ns cuDeviceGetAttribute
131.30ms 254ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 610ns cuDeviceGetAttribute
131.31ms 273ns cuDeviceGetAttribute
131.31ms 412ns cuDeviceGetAttribute
131.31ms 216ns cuDeviceGetAttribute
131.31ms 211ns cuDeviceGetAttribute
131.31ms 205ns cuDeviceGetAttribute
131.31ms 59.911ms cudaLaunchKernel (k(void) [105])
191.23ms 11.222us cudaLaunchKernel (k(void) [106])
191.24ms 5.7860us cudaDeviceSynchronize
191.25ms 9.2890us cudaLaunchKernel (k(void) [108])
191.26ms 5.1790us cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
191.20ms 1.6000us (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [105]
191.22ms 896ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [106]
191.23ms 928ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [108]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
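The numbering can also be checked mechanically. In the api-trace listing above, 104 driver calls precede the first cudaLaunchKernel, so that launch is call number 105. The next launch is 106, the first cudaDeviceSynchronize takes 107, and the last launch is 108, which is why no kernel is tagged [107]. The following is only a rough sketch of that count in host-side C++. The names NumberedCall, numberApiCalls, and bracketId are hypothetical helpers, not part of nvprof or CUDA, and the code assumes its input is exactly the text printed by --print-api-trace, with one call per row after the "Start Duration Name" header.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of the API trace paired with its 1-based position in the listing.
// (Hypothetical helper type, not part of nvprof or CUDA.)
struct NumberedCall {
    int index;         // sequence number, counted from 1
    std::string name;  // everything after the Start and Duration columns
};

// Numbers every data row of an api-trace listing in order of appearance.
// Assumption: data rows are the lines following the "Start Duration Name" header.
std::vector<NumberedCall> numberApiCalls(std::istream& in) {
    std::vector<NumberedCall> calls;
    std::string line;
    bool inTable = false;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string start, duration;
        if (!(fields >> start >> duration)) continue;  // blank or one-word line
        if (!inTable) {
            // Skip the "==PID==" banner lines until the column header is seen.
            inTable = (start == "Start" && duration == "Duration");
            continue;
        }
        std::string name;
        std::getline(fields >> std::ws, name);  // rest of the row is the call name
        if (name.empty()) continue;
        calls.push_back({static_cast<int>(calls.size()) + 1, name});
    }
    return calls;
}

// Returns the number inside the last "[...]" of a trace name, or -1 if the
// brackets are missing or do not contain only digits (e.g. "[CUDA memcpy HtoD]").
int bracketId(const std::string& name) {
    const std::string::size_type open = name.rfind('[');
    const std::string::size_type close = name.rfind(']');
    if (open == std::string::npos || close == std::string::npos || close < open + 2) {
        return -1;
    }
    const std::string digits = name.substr(open + 1, close - open - 1);
    if (digits.find_first_not_of("0123456789") != std::string::npos) return -1;
    return std::stoi(digits);
}
```

Feeding the api-trace text into numberApiCalls and comparing bracketId(call.name) with call.index for each cudaLaunchKernel row should give matching values if the sequential-numbering explanation above holds. Rows without a bracketed number, such as cudaDeviceSynchronize, still consume an index, and bracketId returns -1 for them.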