Question

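I'm new to OpenCL and I'm going through the Altera OpenCL examples. In their matrix multiplication example, they use the concept of blocks, where the dimensions of the input matrices are multiples of the block size. Here is the code: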

__kernel void matrixMult( // Input and output matrices
        __global float *restrict C,
        __global float *A,
        __global float *B, 
        // Widths of matrices.
        int A_width, int B_width)
{
    // Local storage for a block of input matrices A and B
    __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float B_local[BLOCK_SIZE][BLOCK_SIZE];

    // Block index
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);

    // Local ID index (offset within a block)
    int local_x = get_local_id(0);
    int local_y = get_local_id(1);

    // Compute loop bounds
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end   = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    float running_sum = 0.0f;
    for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
    {
        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            running_sum += A_local[local_y][k] * B_local[local_x][k];
        }
    }

    // Store result in matrix C
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}

Suppose the block size is 2. Then block_x and block_y are both 0, and local_x and local_y are both 0. A_local[0][0] would then be A[0] and B_local[0][0] would be B[0]. A_local and B_local each hold 4 elements.

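In that case, how would A_local and B_local access the other elements of the block in that iteration?
Also, would a separate thread/core be assigned for each local_x and local_y?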

Answer 1

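There is definitely a barrier missing in your code sample. The outer for loop as written only produces correct results if all work items execute instructions in lock-step, which would guarantee that local memory is fully populated before the inner for k loop reads it.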

That may be the case on Altera and other FPGAs, but it does not hold for CPUs and GPUs.

If you are getting unexpected results, or want to be compatible with other types of hardware, you should add barrier(CLK_LOCAL_MEM_FENCE);.

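float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    // Each work item loads one element of the A tile and one of the B tile
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];

    // Wait until the whole work group has finished writing both tiles
    barrier(CLK_LOCAL_MEM_FENCE);

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    }

    // Wait until every work item has finished reading the tiles
    // before the next iteration overwrites them
    barrier(CLK_LOCAL_MEM_FENCE);
}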
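
To make the effect of the barrier concrete, here is a rough serial emulation in plain C (not OpenCL) of a single BLOCK_SIZE x BLOCK_SIZE tile. It is only a sketch under stated assumptions: the function name emulate_single_tile and the with_barrier flag are made up for illustration, the work items are run one after another in a fixed order, and the emulated local arrays start at zero, whereas real local memory is uninitialised. Running the work items to completion one at a time is just one ordering the device is allowed to pick when there is no barrier.

```c
#define BLOCK_SIZE 2

/* Emulates one work group on a single tile (matrices are BLOCK_SIZE x BLOCK_SIZE,
 * stored row-major in flat arrays).
 * with_barrier != 0: all loads complete before any work item computes,
 *                    which is what barrier(CLK_LOCAL_MEM_FENCE) guarantees.
 * with_barrier == 0: each work item loads its element and computes right away,
 *                    one possible ordering when no barrier is present. */
static void emulate_single_tile(float *C, const float *A, const float *B,
                                int with_barrier)
{
    /* Assumption: zero-initialised so the emulation is deterministic. */
    float A_local[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};
    float B_local[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};

    if (with_barrier) {
        /* Load phase for the whole work group. */
        for (int ly = 0; ly < BLOCK_SIZE; ++ly)
            for (int lx = 0; lx < BLOCK_SIZE; ++lx) {
                A_local[ly][lx] = A[BLOCK_SIZE * ly + lx];
                B_local[lx][ly] = B[BLOCK_SIZE * ly + lx];
            }
    }

    for (int ly = 0; ly < BLOCK_SIZE; ++ly)
        for (int lx = 0; lx < BLOCK_SIZE; ++lx) {
            if (!with_barrier) {
                /* This work item only writes its own element ... */
                A_local[ly][lx] = A[BLOCK_SIZE * ly + lx];
                B_local[lx][ly] = B[BLOCK_SIZE * ly + lx];
            }
            /* ... but reads elements written by other work items. */
            float running_sum = 0.0f;
            for (int k = 0; k < BLOCK_SIZE; ++k)
                running_sum += A_local[ly][k] * B_local[lx][k];
            C[BLOCK_SIZE * ly + lx] = running_sum;
        }
}
```

With A = {1, 2, 3, 4} and B = {5, 6, 7, 8}, the emulation with the barrier gives C[0] = 19, while the ordering without it gives C[0] = 5, because work item (0, 0) reads tile elements that the other work items have not written yet.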

Answer 2

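A_local and B_local are both shared by all work items of the work group, so all of their elements are loaded in parallel (by all work items of the work group) at each step of the enclosing for loop.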
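
As a small illustration of that sharing, the following plain C sketch (not OpenCL) lists which global element each work item of work group (0, 0) loads in the first iteration, using the index expressions from the kernel in the question. The helper names a_load_index, b_load_index and print_first_tile_loads are hypothetical, and the 4-wide matrices in the usage note are an assumed example size.

```c
#include <stdio.h>

#define BLOCK_SIZE 2

/* Index into A read by work item (local_x, local_y) for tile offset a,
 * copied from the kernel: A[a + A_width * local_y + local_x]. */
static int a_load_index(int a, int A_width, int local_x, int local_y)
{
    return a + A_width * local_y + local_x;
}

/* Index into B read by work item (local_x, local_y) for tile offset b,
 * copied from the kernel: B[b + B_width * local_y + local_x]. */
static int b_load_index(int b, int B_width, int local_x, int local_y)
{
    return b + B_width * local_y + local_x;
}

/* Prints the loads of work group (0, 0) in the first loop iteration,
 * where a_start and b_start are both 0. */
static void print_first_tile_loads(int A_width, int B_width)
{
    for (int local_y = 0; local_y < BLOCK_SIZE; ++local_y)
        for (int local_x = 0; local_x < BLOCK_SIZE; ++local_x)
            printf("work item (%d,%d): A_local[%d][%d] = A[%d], "
                   "B_local[%d][%d] = B[%d]\n",
                   local_x, local_y,
                   local_y, local_x,
                   a_load_index(0, A_width, local_x, local_y),
                   local_x, local_y,
                   b_load_index(0, B_width, local_x, local_y));
}
```

Calling print_first_tile_loads(4, 4) shows that the four work items together read A[0], A[1], A[4] and A[5], so work item (0, 0) only fills A_local[0][0] and B_local[0][0] and relies on its neighbours for the rest of the tile.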

Then each work item uses some of the loaded values (not necessarily the ones it loaded itself) to do its share of the computation.

Finally, the work items store their individual results into the global output matrix.
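
The three steps above (load a tile, compute with it, store the result) can be sketched as a serial emulation in plain C. This is not the Altera code, just an illustration under assumptions: the function names emulate_work_group and tiled_matmul are made up, matrices are row-major, all dimensions are multiples of BLOCK_SIZE, and the output has B_width columns (which is what get_global_size(0) would be in the kernel). Splitting the loop body into a load phase and a compute phase plays the role that a barrier would play on real hardware.

```c
#define BLOCK_SIZE 2

/* Serially emulates every work item of work group (block_x, block_y). */
static void emulate_work_group(float *C, const float *A, const float *B,
                               int A_width, int B_width,
                               int block_x, int block_y)
{
    float A_local[BLOCK_SIZE][BLOCK_SIZE];
    float B_local[BLOCK_SIZE][BLOCK_SIZE];
    /* One private running_sum per emulated work item. */
    float running_sum[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};

    /* Same loop bounds as in the kernel. */
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end   = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    for (int a = a_start, b = b_start; a <= a_end;
         a += BLOCK_SIZE, b += BLOCK_SIZE * B_width) {
        /* Step 1: each work item loads one element of each tile. */
        for (int ly = 0; ly < BLOCK_SIZE; ++ly)
            for (int lx = 0; lx < BLOCK_SIZE; ++lx) {
                A_local[ly][lx] = A[a + A_width * ly + lx];
                B_local[lx][ly] = B[b + B_width * ly + lx];
            }
        /* Step 2: each work item accumulates using the shared tiles. */
        for (int ly = 0; ly < BLOCK_SIZE; ++ly)
            for (int lx = 0; lx < BLOCK_SIZE; ++lx)
                for (int k = 0; k < BLOCK_SIZE; ++k)
                    running_sum[ly][lx] += A_local[ly][k] * B_local[lx][k];
    }

    /* Step 3: each work item stores its own result in C. */
    for (int ly = 0; ly < BLOCK_SIZE; ++ly)
        for (int lx = 0; lx < BLOCK_SIZE; ++lx) {
            int global_x = block_x * BLOCK_SIZE + lx;
            int global_y = block_y * BLOCK_SIZE + ly;
            C[global_y * B_width + global_x] = running_sum[ly][lx];
        }
}

/* Runs all work groups; A is A_height x A_width, B is A_width x B_width. */
static void tiled_matmul(float *C, const float *A, const float *B,
                         int A_height, int A_width, int B_width)
{
    for (int block_y = 0; block_y < A_height / BLOCK_SIZE; ++block_y)
        for (int block_x = 0; block_x < B_width / BLOCK_SIZE; ++block_x)
            emulate_work_group(C, A, B, A_width, B_width, block_x, block_y);
}
```

Note that B_local is stored transposed (B_local[local_x][local_y]), which is why the inner loop reads B_local[lx][k] rather than B_local[k][lx].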

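It is a classic tiled implementation of matrix-matrix multiplication. However, I'm really surprised not to see any call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE), between the loading of A_local and B_local and their use in the inner k loop... But I may well have overlooked something here.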

OpenCL matrix multiplication Altera example
