My code contains many instances of this "append" pattern. Essentially it amounts to a first kernel that filters a large dataset, where the selected entries returned will be very sparse, followed by a second kernel that performs a more involved computation on the greatly reduced dataset.
The cudaStreamSynchronize seems almost superfluous, but I can't see any way around it.
Example code:
/* Pseudocode. Won't compile. */
/* Please ignore silly mistakes/syntax and inefficient/incorrect simplifications */
__global__ void bar( const float * dataIn, float * dataOut, unsigned int * counter_ptr )
{
    /* < do some computation > */
    if (bConditionalComputedAboveIsTrue)
    {
        /* Atomically reserve the next free slot in dataOut; with a wrap value of
           (unsigned int)(-1), atomicInc behaves as a plain atomic increment. */
        const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1));
        dataOut[ ind ] = resultOfAboveComputation;
    }
}
int foo( float * d_datain, float* d_tempbuffer, float* d_output, cudaStream_t stream ){
    /* Initialize a counter that will be updated by the bar kernel */
    unsigned int * counter_ptr;
    cudaMalloc( &counter_ptr, sizeof( unsigned int ) );                 //< Create a counter
    cudaMemsetAsync( counter_ptr, 0, sizeof( unsigned int ), stream );  //< Initially set the counter to 0
    dim3 threadsInit(16,16,1);
    dim3 gridInit(256, 1, 1);
    /* Launch the filtering kernel. This will update the value in counter_ptr. */
    bar<<< gridInit, threadsInit, 0, stream >>>( d_datain, d_tempbuffer, counter_ptr );
    /* Download the count and synchronize the stream */
    unsigned int count;
    cudaMemcpyAsync( &count, counter_ptr, sizeof( unsigned int ), cudaMemcpyDeviceToHost, stream );
    cudaStreamSynchronize( stream ); //< Is there any way around this synchronize?
    /* Compute the grid parameters and launch a second kernel */
    dim3 bazThreads(128,1,1);
    dim3 bazGrid( (count + 127)/128, 1, 1 ); //< Here I use the counter modified in the prior kernel to set the grid parameters (ceiling division, so a count that is a multiple of 128 doesn't launch an extra block)
    baz<<< bazGrid, bazThreads, 0, stream >>>( d_tempbuffer, d_output );
    /* cleanup */
    cudaFree( counter_ptr );
    return 0;
}
Answer 0 (score: 1)
Instead of varying the number of blocks in the second kernel, use a fixed number of blocks and have the blocks adapt the amount of work they do.
E.g. launch more blocks and let them exit early if there is no work left, or launch just enough blocks to fill the device and let each block loop over the work. Grid-stride loops are a good way of doing this; a sketch follows below.
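For illustration, a minimal sketch of the grid-stride variant, assuming baz is changed to take counter_ptr; the per-element work and the numSMs sizing constant are placeholders, since the real computation is not shown in the question:
/* Sketch only: baz reads the count on the device, so the host never needs it
   and the cudaMemcpyAsync/cudaStreamSynchronize pair can be dropped. */
__global__ void baz( const float * dataIn, float * dataOut, const unsigned int * counter_ptr )
{
    const unsigned int count = *counter_ptr; //< written by bar earlier in the same stream
    /* Grid-stride loop: a fixed-size grid covers any count */
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x)
    {
        dataOut[i] = dataIn[i]; //< placeholder for the real per-element computation
    }
}
/* Host side: launch with a fixed grid sized to fill the device, e.g. */
// baz<<< 32 * numSMs, 128, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );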
There is also the option of using dynamic parallelism to move the kernel launch itself (and the decision about the grid size) onto the device.
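A sketch of the dynamic-parallelism variant, assuming compute capability 3.5 or higher and compilation with -rdc=true; launch_baz is a hypothetical helper kernel, and baz keeps its original two-argument signature from the question:
/* Sketch only: a one-thread launcher kernel reads the count on the device
   and launches baz from there, so the host never synchronizes. */
__global__ void launch_baz( const float * d_tempbuffer, float * d_output,
                            const unsigned int * counter_ptr )
{
    const unsigned int count = *counter_ptr;
    if (count == 0) return;
    dim3 bazThreads(128, 1, 1);
    dim3 bazGrid( (count + 127) / 128, 1, 1 );
    baz<<< bazGrid, bazThreads >>>( d_tempbuffer, d_output );
}
/* Host side: replace the memcpy/synchronize/launch sequence with */
// launch_baz<<< 1, 1, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );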