Question

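Below is an OpenCL kernel that performs blocked matrix multiplication for multiple independent matrices. selectMatrixA and selectMatrixB store multiple matrices (all square and of the same size) in row-major order.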

// Matrix multiplication: C = A * B.


#define BLOCK_SIZE 20
#define MATRIX_SIZE (100 * 100) // elements per matrix; parenthesized so the macro expands safely

#define BLOCK_DIMX 5 // Number of blocks in the x dimension

__kernel void
batchedMatrixMul(__global float *selectMatrixC, __global float *selectMatrixA, __global   
float *selectMatrixB, int wA, int wB)
{
    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);


    __global float *C = selectMatrixC + (bx/BLOCK_DIMX) * MATRIX_SIZE;
    __global float *A = selectMatrixA + (bx/BLOCK_DIMX) * MATRIX_SIZE;
    __global float *B = selectMatrixB + (bx/BLOCK_DIMX) * MATRIX_SIZE;



    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float Csub = 0;

    // Identify the row and column of the C matrix to work on

    int Row = (by * BLOCK_SIZE)  + ty;
    int Col = ((bx %(BLOCK_DIMX)) * BLOCK_SIZE) + tx;

    // Declaration of the local memory array As used to store the sub-matrix of A
    __local float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the local memory array Bs used to store the sub-matrix of B
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int m = 0; m < wA / BLOCK_SIZE; ++m) 
    {

        // Load the matrices from global memory to local memory. Each thread loads one   
        //element of each matrix
        As[ty][tx] = A[Row * wA + m * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty)*wA + Col];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together each thread computes one element of the block 
        //sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done before loading 
        //two new sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);

    }

    // Write the block sub-matrix to device memory each thread writes one element
    C[Row * wA + Col] = Csub;

}

Here is how I launch the kernel:

localWorkSize[0] = BLOCK_SIZE;
localWorkSize[1] = BLOCK_SIZE;

// for a 100 X 100 matrix, MATRIX_DIMX = MATRIX_DIMY = 100
globalWorkSize[0] = MATRIX_DIMX * NUM_MATRICES;
globalWorkSize[1] = MATRIX_DIMY ;

cl_event             event;
errcode = clEnqueueNDRangeKernel(clCommandQueue, 
          clKernel, 2, NULL, globalWorkSize, 
          localWorkSize, 0, NULL, &event);
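
The launch geometry above can be checked on the host with a few lines of plain C. This is only a sketch: the helper names (`num_groups`, `matrix_of_group`, `tile_col_of_group`, `local_bytes_per_group`) are hypothetical and not part of the original code, and `MATRIX_DIM` stands in for `MATRIX_DIMX`/`MATRIX_DIMY`. It mirrors the kernel's `bx / BLOCK_DIMX` and `bx % BLOCK_DIMX` arithmetic.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 20                        /* tile edge, as in the kernel */
#define MATRIX_DIM 100                       /* matrix edge (assumed square) */
#define BLOCK_DIMX (MATRIX_DIM / BLOCK_SIZE) /* tiles per matrix row: 5 */

/* Hypothetical helper: number of work-groups along one dimension. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    /* OpenCL 1.x requires the global size to be a multiple of the local size. */
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: which matrix of the batch a group id along x belongs to. */
static size_t matrix_of_group(size_t bx) { return bx / BLOCK_DIMX; }

/* Hypothetical helper: tile column of that group inside its matrix. */
static size_t tile_col_of_group(size_t bx) { return bx % BLOCK_DIMX; }

/* Bytes of __local memory used by the two BLOCK_SIZE x BLOCK_SIZE float tiles. */
static size_t local_bytes_per_group(void)
{
    return 2u * BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
}
```

For 20000 matrices this gives 100000 groups along x and 5 along y, with 400 work-items per group. The actual device limits can be queried with clGetDeviceInfo (CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_DEVICE_LOCAL_MEM_SIZE).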

Here are some performance numbers when running on an NVIDIA Grid K520:

1. matrix size:100 X 100 . Number of matrices = 20000. Time taken for multiplication = 
0.262 seconds. As shown in the code, the block size was set to 20. Block size of 10 was 
slower. This calculates to around 152 GFLOPS

2. matrix size: 10000 X 10000. Number of matrices = 1. Time taken for multiplication = 10.6 
seconds. Here also the block size was 20. Using a block size of 50 is not possible due to   
the size of the local memory.

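Can someone help me understand why the code runs slowly, and why 2. is so much slower than 1.? I am new to OpenCL and want to learn how to optimize code based on the underlying architectural details.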

Answer 1

The reason your first test is so much faster is that the two tests differ in the amount of work they do. By a factor of 50, in fact.

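The Big-O for square matrix multiplication is O(n^3). See: why is the time complexity of square matrix multiplication defined as O(n^3)? As a result, the 10k-square matrix actually takes 1 million times more multiplication work than a single 100x100 multiplication. Executing the 100x100 multiplication 20000 times does not make up for the huge amount of work needed to multiply the large matrices once.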

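Matrix multiplication is just a lot of dot products. Your algorithm only breaks the dot products up into groups for easier handling, and does not use any special tricks to reduce the numbers in my calculations below.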
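
To make the "many dot products" view concrete, here is a minimal CPU reference in plain C. It is a sketch for counting work, not the kernel from the question; the name `matmul_naive` is hypothetical. Each output element is one dot product of length n, so an n x n product performs n^3 multiply-adds.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (unoptimized) C = A * B for n x n row-major matrices.
   Returns the number of multiply-adds performed, which is n^3. */
static unsigned long long matmul_naive(const float *A, const float *B,
                                       float *C, size_t n)
{
    assert(A && B && C);
    unsigned long long mads = 0;
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float acc = 0.0f; /* one dot product: row of A times column of B */
            for (size_t k = 0; k < n; ++k) {
                acc += A[row * n + k] * B[k * n + col];
                ++mads;
            }
            C[row * n + col] = acc;
        }
    }
    return mads;
}
```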

For the small-matrix test:

Total dot products: 10^4
MADs per dot product: 10^2
Total matrix-multiply operations: 20000 = 2*10^4
Total multiply-adds: 2* 10^(4+2+4) = 2*10^10 = 20,000,000,000

20 billion.

For the large-matrix test:

Total dot products: 10^8
MADs per dot product: 10^4
Total matrix-multiply operations: 1 (or 10^0)
Grand total multiply-adds: 10 ^ (8 + 4 + 0) = 10^12 = 1,000,000,000,000

1000 billion (1 trillion).

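Your 10000x10000 test is technically running faster: it performs 50 times more operations with only about 40 times more run time (10.6 s versus 0.262 s).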
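
The comparison can be reproduced with a short calculation in plain C. This is a sketch that uses only the sizes and timings from the question; the helper names `total_mads` and `gflops` are hypothetical, and each multiply-add is counted as 2 floating-point operations.

```c
#include <assert.h>

/* Multiply-adds for `count` independent n x n matrix products: count * n^3. */
static double total_mads(double n, double count)
{
    return count * n * n * n;
}

/* Achieved GFLOPS, counting each multiply-add as 2 floating-point operations. */
static double gflops(double mads, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * mads / seconds / 1e9;
}
```

With these, total_mads(100, 20000) is 2*10^10 and total_mads(10000, 1) is 10^12, a ratio of 50. The throughput comes out at roughly 152.7 GFLOPS for the small-matrix test and roughly 188.7 GFLOPS for the large one.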

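Read more about the "special tricks" here: http://en.wikipedia.org/wiki/Strassen_algorithm. That algorithm is not considered practical for GPU computing, though. More complicated algorithms also exist, but the brute-force approach on graphics hardware seems to be used most often.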

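Why is your kernel running slowly in general? There are many different optimizations you can use to speed things up. Below are a few that you can google and experiment with yourself. You will probably come across some that I have not mentioned here.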

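Optimize the work-group and block sizes. See CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which can be queried with clGetKernelWorkGroupInfo.
Use the float4 data type. OpenCL includes a dot function that computes the dot product of floatn data types.
Transpose matrix B before running the kernel. You can use another kernel to do the transpose.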
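
As a CPU-side illustration of the last two suggestions, the following plain C sketch multiplies A by a pre-transposed B, so that both operands are read contiguously in the inner loop, and accumulates four partial sums at a time to mimic a float4 dot product. It is not OpenCL code; the names `transpose` and `matmul_bt` are hypothetical, and it assumes n is a multiple of 4. Whether this helps on a given GPU has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Bt = transpose of the n x n row-major matrix B. */
static void transpose(const float *B, float *Bt, size_t n)
{
    for (size_t r = 0; r < n; ++r)
        for (size_t c = 0; c < n; ++c)
            Bt[c * n + r] = B[r * n + c];
}

/* C = A * B, given Bt = transpose(B). Assumes n % 4 == 0. */
static void matmul_bt(const float *A, const float *Bt, float *C, size_t n)
{
    assert(n % 4 == 0);
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            const float *a = A + row * n;  /* row of A, contiguous */
            const float *b = Bt + col * n; /* column of B, now contiguous */
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            for (size_t k = 0; k < n; k += 4) {
                /* four lanes, analogous to dot(float4, float4) */
                s0 += a[k]     * b[k];
                s1 += a[k + 1] * b[k + 1];
                s2 += a[k + 2] * b[k + 2];
                s3 += a[k + 3] * b[k + 3];
            }
            C[row * n + col] = (s0 + s1) + (s2 + s3);
        }
    }
}
```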

Answer 2

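In my opinion, the reason 2. is so much slower is that the access pattern of matrix multiplication is not very cache friendly. If you need the first value of the first row and the first value of the second row, they are stored far apart from each other in memory. As the matrix size increases, they are stored even farther apart. This leads to a lot of cache misses.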
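
The distance in question is easy to compute for a row-major layout. A small plain C sketch, with the hypothetical helper `offset_bytes`: vertically adjacent elements are n * sizeof(float) bytes apart, which is 400 bytes for n = 100 and 40,000 bytes for n = 10000, assuming 4-byte floats.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element (row, col) in an n x n row-major float matrix. */
static size_t offset_bytes(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return (row * n + col) * sizeof(float);
}
```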

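I have no personal experience with matrix multiplication, but I thought it might be possible to store the data in a Z-order curve to get a more cache-friendly pattern. Judging by the Wikipedia references, something similar has been done by Valsalam & al 2002.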
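
For reference, a Z-order (Morton) index interleaves the bits of the row and column, so that elements close together in 2D tend to be close together in memory. Below is a minimal plain C sketch using the common bit-spreading trick; the function names are hypothetical, coordinates are assumed to fit in 16 bits, and this is not taken from the cited work. Note that a dense Z-order array needs a power-of-two edge, so a 100 x 100 matrix would have to be padded to 128 x 128.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the lower 16 bits of v so that they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order index of element (row, col): column bits go to the even
   positions, row bits to the odd positions. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    assert(row <= 0xFFFFu && col <= 0xFFFFu);
    return (part1by1(row) << 1) | part1by1(col);
}
```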

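Another quick fix that I would try before spending a lot of time on Z-ordering is to use private variables and get rid of the barriers. Even though this requires more loads from global memory, the compiler may be able to optimize that code better.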

Optimizing batched matrix multiplication OpenCL code
