Question

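I have two arrays in CUDA:

int *main; // unsorted
int *source; // sorted

Part of my algorithm requires me to regularly insert new data from the source array into the main array. If a location in the main array is zero, it is assumed to be empty and can therefore be populated with a value from the source array.

I'm just wondering what the most efficient way of doing this is. I've tried a few approaches, but I still think there are more performance gains to be had here.

Currently I'm using a modified version of radix sort to "shuffle" the contents of the main array to the very end of the main array, leaving all zero values at the beginning of the array, which makes inserting the source trivial. The sort has been modified to iterate over one bit instead of 32 bits, which works on a simple switch on the input:

input[i] = source[i] > 1 ? 1 : 0

I'm wondering whether this is already a fairly efficient approach. I'm also wondering whether I would gain anything by using tactically deployed atomicAdd calls instead.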

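At the moment I'm not inserting that many items from the source array, but this may change in the future.

This feels like a common problem that has been solved before. I wondered whether the Thrust library could help, but looking through the relevant functions, it didn't feel quite right for what I'm trying to accomplish (it doesn't quite fit the code I already have).

Thoughts from experienced CUDA developers are appreciated!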

Answer 1

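You can separate your finding algorithm (which is a stream compaction procedure) from your insertion (which is a scatter procedure). However, you can also merge the functionality of both into a single kernel.

Assume srcPtr is a pointer whose content resides in global memory and has been set to zero before the kernel launch.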

// Assumes N is the length of the destination buffer and that the source buffer is shorter than N.
// Uses the *_sync warp intrinsics (CUDA 9 and later), which replace __ballot and __shfl.
// The full 0xFFFFFFFF mask requires blockDim.x to be a multiple of 32.
__global__ void find_and_insert( int* destination, int const* source, int const N, int* srcPtr ) {

int const idx = blockIdx.x * blockDim.x + threadIdx.x;

// Get the assigned element. Out-of-range threads still take part in the
// warp-wide intrinsics below, but never count as empty.
bool const pred = ( idx < N ) && ( destination[ idx ] == 0 );

// Intra-warp binary reduction to count the total number of lanes with empty elements.
unsigned int const predBallot = __ballot_sync( 0xFFFFFFFFu, pred );
int const intraWarpRed = __popc( predBallot );

// Warp-aggregated atomics to reduce the contention over the srcPtr content.
// Special registers need a doubled % inside inline PTX.
unsigned int laneID; asm( "mov.u32 %0, %%laneid;" : "=r"(laneID) ); // for 1D blocks: threadIdx.x & ( warpSize - 1 )
int posW = 0;
if( laneID == 0 )
    posW = atomicAdd( srcPtr, intraWarpRed );
posW = __shfl_sync( 0xFFFFFFFFu, posW, 0 );

// Threads that have found empty elements can fill out their assigned positions from the src. Intra-warp binary prefix sum is used here.
unsigned int laneMask; asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(laneMask) ); // equivalently: ( 1u << laneID ) - 1
int const positionToRead = posW + __popc( predBallot & laneMask );
// Nothing here checks positionToRead against the source length; the caller
// must guarantee that the source holds enough elements.
if( pred )
    destination[ idx ] = source[ positionToRead ];

}
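
To make the launch requirements concrete, here is a minimal host-side sketch of how the kernel above might be driven. It is untested in the same spirit as the kernel itself. The kernel is repeated in condensed form (using the `*_sync` intrinsics and `threadIdx.x & 31` as the lane ID, which holds for 1D blocks) so that the listing compiles on its own. The helper name `fill_empty_slots`, the block size of 256, and the trick of padding the device copy of the source with zeros up to N entries are assumptions made for illustration; they are not part of the original answer.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Condensed copy of the kernel above (1D blocks, blockDim.x a multiple of 32).
__global__ void find_and_insert(int* destination, int const* source, int N, int* srcPtr) {
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    bool const pred = (idx < N) && (destination[idx] == 0);
    unsigned int const ballot = __ballot_sync(0xFFFFFFFFu, pred);
    unsigned int const lane = threadIdx.x & 31u;
    int posW = 0;
    if (lane == 0) posW = atomicAdd(srcPtr, __popc(ballot));  // one atomic per warp
    posW = __shfl_sync(0xFFFFFFFFu, posW, 0);
    int const pos = posW + __popc(ballot & ((1u << lane) - 1u));
    if (pred) destination[idx] = source[pos];
}

// Hypothetical helper: fills the zero slots of mainArr from src and returns
// how many source values were consumed. Error checking is omitted for brevity.
int fill_empty_slots(std::vector<int>& mainArr, std::vector<int> const& src) {
    int const N = static_cast<int>(mainArr.size());
    if (N == 0) return 0;
    int const M = std::min(N, static_cast<int>(src.size()));

    int *dMain = nullptr, *dSrc = nullptr, *dCounter = nullptr;
    cudaMalloc(&dMain, N * sizeof(int));
    cudaMalloc(&dSrc, N * sizeof(int));
    cudaMalloc(&dCounter, sizeof(int));

    cudaMemcpy(dMain, mainArr.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    // Assumption: pad the source with zeros up to N entries, so that reads past
    // the real source stay in bounds and simply leave the slot empty.
    cudaMemset(dSrc, 0, N * sizeof(int));
    if (M > 0) cudaMemcpy(dSrc, src.data(), M * sizeof(int), cudaMemcpyHostToDevice);
    // The counter must be zero before every launch.
    cudaMemset(dCounter, 0, sizeof(int));

    int const block = 256;  // multiple of 32, as the full warp mask requires
    int const grid = (N + block - 1) / block;
    find_and_insert<<<grid, block>>>(dMain, dSrc, N, dCounter);

    int zeros = 0;  // number of empty slots the kernel found
    cudaMemcpy(&zeros, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(mainArr.data(), dMain, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dMain);
    cudaFree(dSrc);
    cudaFree(dCounter);
    return std::min(zeros, M);
}
```

Within one warp the source values land in slot order, but warps race on the atomic counter, so across warps the mapping from source position to empty slot is not fixed from run to run. If the insertion order matters, a separate scan followed by a scatter is the safer design.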

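A few things:

This kernel is only a suggestion of how you could do it. The threads within a warp cooperate to accomplish the task. You can extend the binary reduction and the prefix sum to the thread-block level.
I wrote this kernel in the browser and have not tested it, so be careful.
The overall design is nothing new. Similar approaches have been implemented before (e.g. this paper), and they are mainly based on the work done by Mark Harris and Michael Garland.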
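
The block-level extension mentioned in the first point could look roughly like the sketch below. It swaps the warp ballot for `cub::BlockScan` from the CUB library that ships with the CUDA Toolkit, so one atomic is issued per thread block instead of one per warp. The kernel name `find_and_insert_block`, the `BLOCK` template parameter, the source-length parameter `M`, and the host wrapper `fill_empty_slots_block` are all assumptions made for illustration, and the code is a sketch rather than a tuned implementation.

```cuda
#include <cuda_runtime.h>
#include <cub/block/block_scan.cuh>
#include <algorithm>
#include <vector>

// Must be launched with exactly BLOCK threads per (1D) block.
template <int BLOCK>
__global__ void find_and_insert_block(int* destination, int const* source,
                                      int N, int M, int* srcPtr) {
    using BlockScan = cub::BlockScan<int, BLOCK>;
    __shared__ typename BlockScan::TempStorage scanStorage;
    __shared__ int blockBase;

    int const idx = blockIdx.x * BLOCK + threadIdx.x;
    int const pred = (idx < N && destination[idx] == 0) ? 1 : 0;

    // Block-wide exclusive prefix sum of the predicate: rank is this thread's
    // offset among the block's empty slots, blockTotal is their count.
    int rank = 0, blockTotal = 0;
    BlockScan(scanStorage).ExclusiveSum(pred, rank, blockTotal);

    // One atomic per block reserves a contiguous range of source positions.
    if (threadIdx.x == 0) blockBase = atomicAdd(srcPtr, blockTotal);
    __syncthreads();

    int const pos = blockBase + rank;
    // Assumed guard: slots stay empty once the source is exhausted.
    if (pred && pos < M) destination[idx] = source[pos];
}

// Hypothetical host wrapper; error checking is omitted for brevity.
int fill_empty_slots_block(std::vector<int>& mainArr, std::vector<int> const& src) {
    constexpr int BLOCK = 128;
    int const N = static_cast<int>(mainArr.size());
    int const M = static_cast<int>(src.size());
    if (N == 0 || M == 0) return 0;

    int *dMain = nullptr, *dSrc = nullptr, *dCounter = nullptr;
    cudaMalloc(&dMain, N * sizeof(int));
    cudaMalloc(&dSrc, M * sizeof(int));
    cudaMalloc(&dCounter, sizeof(int));
    cudaMemcpy(dMain, mainArr.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dSrc, src.data(), M * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCounter, 0, sizeof(int));  // counter starts at zero

    find_and_insert_block<BLOCK><<<(N + BLOCK - 1) / BLOCK, BLOCK>>>(dMain, dSrc, N, M, dCounter);

    int zeros = 0;  // number of empty slots found
    cudaMemcpy(&zeros, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(mainArr.data(), dMain, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dMain);
    cudaFree(dSrc);
    cudaFree(dCounter);
    return std::min(zeros, M);
}
```

Fewer atomics means less contention on the counter, at the cost of a block-wide synchronization. Whether that trade pays off depends on how many empty slots each block typically finds, so it is worth measuring both variants.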

CUDA: efficiently inserting data into an unsorted, populated array