In nvprof, I can see a stream ID for each CUDA execution stream I am using (0, 13, 15, etc.).
Given a stream variable, I would like to be able to print out that stream ID. Currently I cannot find any API that does this, and casting the cudaStream_t to an int or uint does not yield a sensible ID. sizeof() indicates that cudaStream_t is 8 bytes.
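For illustration, something along the following lines is what I mean; the printed value is just the 8-byte opaque handle and does not match the IDs that nvprof shows:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
  cudaStream_t s;
  cudaStreamCreate(&s);
  // sizeof(cudaStream_t) reports 8 on a 64-bit platform, and casting the
  // handle only exposes a pointer-like value, not the nvprof stream ID.
  printf("sizeof(cudaStream_t) = %zu\n", sizeof(cudaStream_t));
  printf("handle value = %p\n", (void *)s);
  cudaStreamDestroy(s);
  return 0;
}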
Answer (score: 5)
In short: I am not aware of a way to access those IDs directly, but you can give your streams explicit names for profiling.
cudaStream_t is an opaque "resource handle" type. A resource handle is something like a pointer; so it stands to reason that the stream ID is not contained in the pointer (handle) itself, but somehow in whatever it refers to.
Since the handle is opaque (CUDA provides no definition of what it points to), and since, as you point out, there is no direct API for this, I don't think you will find a way to extract the stream ID from a cudaStream_t at runtime.
For these assertions that cudaStream_t is an opaque resource handle, refer to the CUDA header file driver_types.h.
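For reference, the relevant declaration in driver_types.h looks roughly like this (the struct is only forward-declared, so there are no members you could inspect):

typedef struct CUstream_st *cudaStream_t;  /* opaque handle: CUstream_st is never defined in the public headers */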
However, the NV Tools Extension API gives you the ability to "name" a particular stream (or other resources). This allows you to associate a particular stream in your source code with a particular name in the profiler.
Here is a trivial example:
$ cat t138.cu
#include <stdio.h>
#include <nvToolsExtCudaRt.h>
const long long tdel = 1000000000ULL;   // ~1e9 device clock ticks, to keep each kernel busy for a while
__global__ void tkernel(){
  // spin until tdel clock ticks have elapsed, so each launch is clearly visible in the profiler timeline
  long long st = clock64();
  while (clock64() < st+tdel);
}
int main(){
  cudaStream_t s1, s2, s3, s4;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStreamCreate(&s3);
  cudaStreamCreate(&s4);
#ifdef USE_S_NAMES
  // associate a human-readable name with each stream for the profiler
  nvtxNameCudaStreamA(s1, "stream 1");
  nvtxNameCudaStreamA(s2, "stream 2");
  nvtxNameCudaStreamA(s3, "stream 3");
  nvtxNameCudaStreamA(s4, "stream 4");
#endif
  tkernel<<<1,1,0,s1>>>();
  tkernel<<<1,1,0,s2>>>();
  tkernel<<<1,1,0,s3>>>();
  tkernel<<<1,1,0,s4>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt
$ nvprof --print-gpu-trace ./t138
==28720== NVPROF is profiling process 28720, command: ./t138
==28720== Profiling application: ./t138
==28720== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
464.80ms 622.06ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 13 tkernel(void) [393]
464.81ms 621.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 14 tkernel(void) [395]
464.82ms 623.30ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 15 tkernel(void) [397]
464.82ms 622.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 16 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt -DUSE_S_NAMES
$ nvprof --print-gpu-trace ./t138
==28799== NVPROF is profiling process 28799, command: ./t138
==28799== Profiling application: ./t138
==28799== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
457.98ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 1 tkernel(void) [393]
457.99ms 544.31ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 2 tkernel(void) [395]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 3 tkernel(void) [397]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 4 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$