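I wrote a program that uses two streams. Both streams operate on some data and then write the results back to host memory. This is the general structure of how I do it: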
loop {
AsyncCpy(....,HostToDevice,Stream1);
AsyncCpy(....,HostToDevice,Stream2);
Kernel<<<...,Stream1>>>
Kernel<<<...,Stream2>>>
/* Write the results on the host memory */
AsyncCpy(....,DeviceToHost,Stream1);
AsyncCpy(....,DeviceToHost,Stream2);
}
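I want to do some work on the CPU once I know that StreamX has finished copying its results back to host memory. At the same time, I don't want to stall the loop that issues the async operations (memcpy or kernel execution).
Suppose I insert host functions, say host_ftn1(..) and host_ftn2(..):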
loop {
AsyncCpy(....,HostToDevice,Stream1);
AsyncCpy(....,HostToDevice,Stream2);
Kernel<<<...,Stream1>>>
Kernel<<<...,Stream2>>>
/* Write the results on the host memory to be processed by host_ftn1(..) */
AsyncCpy(....,DeviceToHost,Stream1);
/* Write the results on the host memory to be processed by host_ftn2(..) */
AsyncCpy(....,DeviceToHost,Stream2);
if(Stream1 results are copied to host)
host_ftn1(..);
if(Stream2 results are copied to host)
host_ftn2(..);
}
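This stalls the loop until the host functions (host_ftn1 and host_ftn2) have finished executing. I don't want to stop issuing GPU work, i.e. AsyncCpy(..) and Kernel<<<..,StreamX>>>, while the CPU is busy executing host_ftn1(..) and host_ftn2(..).
Is there any solution or approach to this problem?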
Answer 0 (score: 0)
As huseyin tugrul buyukisik suggested, a stream callback works in this case. I have tested it with two streams.
The final design is as follows:
loop {
AsyncCpy(....,HostToDevice,Stream1);
AsyncCpy(....,HostToDevice,Stream2);
Kernel<<<...,Stream1>>>
Kernel<<<...,Stream2>>>
/* Write the results on the host memory to be processed by host_ftn1(..) */
AsyncCpy(....,DeviceToHost,Stream1);
/* Write the results on the host memory to be processed by host_ftn2(..) */
AsyncCpy(....,DeviceToHost,Stream2);
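/* Callbacks are enqueued into the streams, not called directly, so the loop does not wait for them */
cudaStreamAddCallback(Stream1, callback1, .., 0); // Work to be done on the host once stream1 completes
cudaStreamAddCallback(Stream2, callback2, .., 0); // Work to be done on the host once stream2 completes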
}
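
For anyone who wants something compilable, here is a minimal sketch of the design above. It is not my original code: the kernel `scale`, the callback `host_ftn`, the helper `run_pipeline`, the buffer names, and the sizes are all hypothetical stand-ins, and it assumes pinned host buffers (`cudaMallocHost`) so the copies can really run asynchronously. The callback is enqueued with `cudaStreamAddCallback` right after the device-to-host copy, so the loop keeps issuing work while the host function runs on a thread owned by the runtime.

```cuda
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    std::fprintf(stderr, "%s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)

// Hypothetical kernel standing in for Kernel<<<...>>>: doubles each element.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

std::atomic<int> g_done{0};  // number of callbacks that saw valid results

// Hypothetical stand-in for host_ftn1/host_ftn2. It runs after all work
// previously enqueued in its stream has completed, so the results are
// already in host memory. It must not call CUDA API functions.
void CUDART_CB host_ftn(cudaStream_t, cudaError_t status, void *userData) {
    const float *h = static_cast<const float *>(userData);
    if (status == cudaSuccess && h[0] == 2.0f) g_done.fetch_add(1);
}

// Issues `iterations` rounds of copy-in, kernel, copy-out, callback on two
// streams and returns how many callbacks observed the expected result.
int run_pipeline(int iterations) {
    const int N = 1 << 16, S = 2;
    const size_t bytes = N * sizeof(float);
    cudaStream_t st[S];
    float *h_in[S], *h_out[S], *d[S];
    for (int s = 0; s < S; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMallocHost((void **)&h_in[s], bytes));   // pinned host memory
        CHECK(cudaMallocHost((void **)&h_out[s], bytes));
        CHECK(cudaMalloc((void **)&d[s], bytes));
        for (int i = 0; i < N; ++i) h_in[s][i] = 1.0f;
    }
    g_done = 0;
    for (int it = 0; it < iterations; ++it) {
        for (int s = 0; s < S; ++s)
            CHECK(cudaMemcpyAsync(d[s], h_in[s], bytes, cudaMemcpyHostToDevice, st[s]));
        for (int s = 0; s < S; ++s)
            scale<<<(N + 255) / 256, 256, 0, st[s]>>>(d[s], N);
        for (int s = 0; s < S; ++s) {
            CHECK(cudaMemcpyAsync(h_out[s], d[s], bytes, cudaMemcpyDeviceToHost, st[s]));
            // Enqueue the host work; this call returns immediately.
            CHECK(cudaStreamAddCallback(st[s], host_ftn, h_out[s], 0));
        }
    }
    CHECK(cudaDeviceSynchronize());  // waits for all stream work, callbacks included
    for (int s = 0; s < S; ++s) {
        CHECK(cudaFree(d[s]));
        CHECK(cudaFreeHost(h_in[s]));
        CHECK(cudaFreeHost(h_out[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }
    return g_done.load();
}

int main() {
    std::printf("callbacks completed: %d\n", run_pipeline(4));
    return 0;
}
```

Two things to keep in mind with this approach. A callback blocks later work in its own stream until it returns, which is why reusing the same host buffer on every iteration is safe here, but a long host function will hold up that stream (the other stream keeps going). Also, on newer toolkits `cudaLaunchHostFunc` serves the same purpose with a simpler `void fn(void *userData)` signature, so it may be worth trying instead.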