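I have a matrix stored as a one-dimensional array on the GPU, and I am trying to write an OpenCL kernel that performs a reduction over each row of that matrix. For example:
Suppose my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6]. What I want to do is: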
[1, 2, 3] = [ 6]
[4, 5, 6] [15]
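Obviously, since this is a reduction, the actual result for each row may contain more than one element (one partial sum per work-group):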
[1, 2, 3] = [3, 3]
[4, 5, 6] [9, 6]
I can then do the final computation in another kernel or on the CPU.
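As a rough illustration of that final CPU step, here is a minimal sketch in plain C. It assumes the partial sums are laid out row by row, with the same number of partials for every row; the function name `finish_row_sums` and its parameters are hypothetical and do not come from the original post.

```c
#include <stddef.h>

/* Hypothetical helper: collapse the per-row partial sums produced by the
 * GPU reduction into one value per row.
 * Assumed layout: partials[r * partials_per_row + p] is the p-th partial
 * sum of row r. */
void finish_row_sums(const float *partials, size_t rows,
                     size_t partials_per_row, float *row_sums)
{
    for (size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (size_t p = 0; p < partials_per_row; ++p)
            sum += partials[r * partials_per_row + p];
        row_sums[r] = sum;
    }
}
```

With the partial results from the example above, [3, 3] and [9, 6], this gives 6 and 15.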
So far, what I have is a kernel that performs the reduction over all the elements of the array, like this:
[1, 2, 3] = [21]
[4, 5, 6]
The reduction kernel that does this is the following (I actually got it from Stack Overflow):
    __kernel void
    sum2(__global float *inVector, __global float *outVector,
         const unsigned int inVectorSize, __local float *resultScratch)
    {
        const unsigned int localId = get_local_id(0);
        const unsigned int workGroupSize = get_local_size(0);

        if (get_global_id(0) < inVectorSize)
            resultScratch[localId] = inVector[get_global_id(0)];
        else
            resultScratch[localId] = 0;

        for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
        {
            barrier(CLK_LOCAL_MEM_FENCE);
            if (a > localId)
                resultScratch[localId] += resultScratch[localId + a];
        }

        if (localId == 0)
            outVector[get_group_id(0)] = resultScratch[0];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
Answer 0 (score: 0)
I think one solution is to modify the reduction kernel so that it can reduce just a part of the array.
    __kernel void
    sum2(__global float *inVector,
         __global float *outVector,
         unsigned int inVectorOffset,   // index of the first element to reduce
         unsigned int inVectorSize,     // number of elements to reduce
         __local float *resultScratch)  // one float per work-item in the group
    {
        const unsigned int localId = get_local_id(0);
        const unsigned int workGroupSize = get_local_size(0);

        // Load one element per work-item; pad with 0 past the end of the range.
        if (get_global_id(0) < inVectorSize)
            resultScratch[localId] = inVector[inVectorOffset + get_global_id(0)];
        else
            resultScratch[localId] = 0;

        // Tree reduction in local memory (work-group size must be a power of two).
        for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
        {
            barrier(CLK_LOCAL_MEM_FENCE);
            if (a > localId)
                resultScratch[localId] += resultScratch[localId + a];
        }

        // Work-item 0 writes this group's partial sum.
        if (localId == 0)
            outVector[get_group_id(0)] = resultScratch[0];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
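You can then reduce one row of the matrix by passing the start of the row as inVectorOffset and the number of elements in the row as inVectorSize. Launch the kernel once per row. Note that every launch writes its partial sums starting at outVector[0], so each row needs its own output buffer (or its own region of one) to avoid overwriting the previous row's results.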
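To make the indexing concrete, here is a minimal sketch in plain C that emulates the kernel on the CPU, one "launch" per row. It is not OpenCL host code: the names `emulate_sum2` and `reduce_rows` are hypothetical, the work-group size is assumed to be a power of two, and each row is assumed to write to its own region of the output array. Something like this can serve as a reference when checking the GPU output.

```c
#include <stddef.h>

/* Hypothetical CPU emulation of one launch of sum2 over the range
 * [offset, offset + size). wg is the work-group size (assumed to be a
 * power of two), scratch must hold wg floats, and out receives one
 * partial sum per work-group. Returns the number of partials written. */
size_t emulate_sum2(const float *in, size_t offset, size_t size,
                    size_t wg, float *scratch, float *out)
{
    /* Global size rounded up to a whole number of work-groups. */
    size_t groups = (size + wg - 1) / wg;

    for (size_t g = 0; g < groups; ++g) {
        /* Load phase: one element per work-item, 0 past the end. */
        for (size_t l = 0; l < wg; ++l) {
            size_t gid = g * wg + l;
            scratch[l] = (gid < size) ? in[offset + gid] : 0.0f;
        }
        /* Tree reduction, halving the active range at each step. */
        for (size_t a = wg >> 1; a > 0; a >>= 1)
            for (size_t l = 0; l < a; ++l)
                scratch[l] += scratch[l + a];

        out[g] = scratch[0];
    }
    return groups;
}

/* Reduce every row of a row-major rows x cols matrix. Row r writes its
 * partial sums to partials[r * per_row ...], so rows do not overwrite
 * each other. Returns the number of partials per row. */
size_t reduce_rows(const float *matrix, size_t rows, size_t cols,
                   size_t wg, float *scratch, float *partials)
{
    size_t per_row = (cols + wg - 1) / wg;
    for (size_t r = 0; r < rows; ++r)
        emulate_sum2(matrix, r * cols, cols, wg, scratch,
                     partials + r * per_row);
    return per_row;
}
```

For the 2x3 matrix [1, 2, 3, 4, 5, 6] and a work-group size of 2, this yields the partials [3, 3] for the first row and [9, 6] for the second, matching the example in the question.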