
Asked: 2014-09-17 21:05:33

Tags: opencl matrix-multiplication blas

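Below is an OpenCL kernel that performs blocked (tiled) matrix multiplication for multiple independent matrices. selectMatrixA and selectMatrixB each store multiple matrices (all square and of the same size) in row-major order.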

// Matrix multiplication: C = A * B.

#define BLOCK_SIZE 20
#define MATRIX_SIZE (100 * 100) // Number of elements in one matrix

#define BLOCK_DIMX 5 // Number of blocks in the x dimension

__kernel void
batchedMatrixMul(__global float *selectMatrixC, __global float *selectMatrixA,
                 __global float *selectMatrixB, int wA, int wB)
{
    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Select the matrices of the batch that this work-group operates on
    __global float *C = selectMatrixC + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *A = selectMatrixA + (bx / BLOCK_DIMX) * MATRIX_SIZE;
    __global float *B = selectMatrixB + (bx / BLOCK_DIMX) * MATRIX_SIZE;

    // Thread index within the block
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float Csub = 0;

    // Identify the row and column of the C matrix to work on
    int Row = (by * BLOCK_SIZE) + ty;
    int Col = ((bx % BLOCK_DIMX) * BLOCK_SIZE) + tx;

    // Declaration of the local memory array As used to store the sub-matrix of A
    __local float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the local memory array Bs used to store the sub-matrix of B
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int m = 0; m < wA / BLOCK_SIZE; ++m)
    {
        // Load the matrices from global memory to local memory. Each thread loads one
        // element of each matrix. Rows of B (and of C) are wB elements wide; for the
        // square matrices used here wA == wB.
        As[ty][tx] = A[Row * wA + m * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * wB + Col];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together; each thread computes one element of the block
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done before loading
        // two new sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory; each thread writes one element
    C[Row * wB + Col] = Csub;
}


localWorkSize[0] = BLOCK_SIZE;
localWorkSize[1] = BLOCK_SIZE;

// for a 100 X 100 matrix, MATRIX_DIMX = MATRIX_DIMY = 100
globalWorkSize[0] = MATRIX_DIMX * NUM_MATRICES;
globalWorkSize[1] = MATRIX_DIMY ;

cl_event             event;
errcode = clEnqueueNDRangeKernel(clCommandQueue, 
          clKernel, 2, NULL, globalWorkSize, 
          localWorkSize, 0, NULL, &event);
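
The mapping between this launch configuration and the kernel's index arithmetic can be checked on the host. The following is a minimal plain-C sketch, not part of the original post: the `WorkItemTarget` struct and the `locate` and `groups_x` helpers are hypothetical names introduced for illustration, and `MATRIX_DIM` is assumed to be 100 as in the first benchmark.

```c
#include <stddef.h>

#define BLOCK_SIZE 20
#define MATRIX_DIM 100                        /* assumed: 100 x 100 matrices */
#define BLOCK_DIMX (MATRIX_DIM / BLOCK_SIZE)  /* 5 blocks per matrix row */

/* Hypothetical helper type: where one work-item reads and writes. */
typedef struct {
    int matrix;  /* index of the matrix within the batch */
    int row;     /* row inside that matrix */
    int col;     /* column inside that matrix */
} WorkItemTarget;

/* Mirrors the kernel: bx, by are group ids, tx, ty are local ids. */
static WorkItemTarget locate(int bx, int by, int tx, int ty)
{
    WorkItemTarget t;
    t.matrix = bx / BLOCK_DIMX;
    t.row    = by * BLOCK_SIZE + ty;
    t.col    = (bx % BLOCK_DIMX) * BLOCK_SIZE + tx;
    return t;
}

/* Number of work-groups along x for globalWorkSize[0] = MATRIX_DIM * num_matrices. */
static size_t groups_x(size_t num_matrices)
{
    return (MATRIX_DIM * num_matrices) / BLOCK_SIZE;
}
```

Every run of BLOCK_DIMX consecutive groups along x belongs to one matrix, which is why the kernel divides bx by BLOCK_DIMX to pick the matrix and takes the remainder to pick the column block.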

Here are some performance numbers from running on an NVIDIA Grid K520:

1. matrix size:100 X 100 . Number of matrices = 20000. Time taken for multiplication = 
0.262 seconds. As shown in the code, the block size was set to 20. Block size of 10 was 
slower. This calculates to around 152 GFLOPS

2. matrix size: 10000 X 10000. Number of matrices = 1. Time taken for multiplication = 10.6 
seconds. Here also the block size was 20. Using a block size of 50 is not possible due to   
the size of the local memory.
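
The GFLOPS figure above can be reproduced with a short calculation. This is a sketch that assumes the usual convention of counting each multiply and each add separately, so one n x n product costs 2 * n^3 floating-point operations; `batched_gflops` is a hypothetical helper name.

```c
/* Throughput of `num_matrices` independent n x n matrix multiplications.
   Each of the n*n output elements needs n multiplies and n adds. */
static double batched_gflops(double n, double num_matrices, double seconds)
{
    double flops = 2.0 * n * n * n * num_matrices;
    return flops / seconds / 1e9;
}
```

With the timings quoted above, `batched_gflops(100, 20000, 0.262)` comes to roughly 152.7 and `batched_gflops(10000, 1, 10.6)` to roughly 188.7.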


2 Answers:

Answer 0 (score: 3)


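The Big-O for square matrix multiplication is O(n^3). See: why is the time complexity of square matrix multiplication defined as O(n^3)? A 10k x 10k square matrix therefore takes 1 million times more multiply work than a single 100x100 multiplication. Running 20000 of the 100x100 multiplications does not make up for the massive amount of work needed to multiply the large matrices once.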



For 20000 matrices of size 100 x 100:

Dot products per matrix multiplication: 10^4
MADs per dot product: 10^2
Total matrix-multiply operations: 20000 = 2*10^4
Total multiply-adds: 2 * 10^(4+2+4) = 2*10^10 = 20,000,000,000



For 1 matrix of size 10000 x 10000:

Dot products per matrix multiplication: 10^8
MADs per dot product: 10^4
Total matrix-multiply operations: 1 (or 10^0)
Grand total multiply-adds: 10^(8+4+0) = 10^12 = 1,000,000,000,000


Your 10000x10000 test is technically running faster: it crunches 50 times more operations in only about 40 times the run time.
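
The comparison can be written out as a small calculation. This is an illustrative sketch using the timings from the question; `total_mads` and `relative_throughput` are hypothetical helper names.

```c
#include <stdint.h>

/* Multiply-adds for `count` independent n x n products:
   n*n dot products per product, n multiply-adds per dot product. */
static uint64_t total_mads(uint64_t n, uint64_t count)
{
    return n * n * n * count;
}

/* Multiply-adds per second of run B divided by that of run A. */
static double relative_throughput(uint64_t mads_a, double secs_a,
                                  uint64_t mads_b, double secs_b)
{
    return ((double)mads_b / secs_b) / ((double)mads_a / secs_a);
}
```

Passing the small-matrix batch as run A (0.262 s) and the single large matrix as run B (10.6 s) gives a value a little above 1.2, meaning the large multiplication sustains somewhat more multiply-adds per second than the batch.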



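  • Optimize the work-group and block sizes. See opencl PREFERRED_WORK_GROUP_SIZE (the value is queried as CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE through clGetKernelWorkGroupInfo).
  • Use the float4 data type. OpenCL includes a dot function that computes the dot product of floatn data types.
  • Transpose matrix B before running the kernel. You can use another kernel to do the transpose.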
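
To illustrate the last suggestion, here is a plain-C CPU sketch of multiplying with a pre-transposed B. It is not the OpenCL kernel and not the answerer's code; `transpose` and `matmul_transposed_b` are hypothetical names. The point it shows is that once B is transposed, each dot product reads a row of A and a row of the transposed B, so both inner-loop accesses are contiguous.

```c
#include <stddef.h>

/* Write the transpose of the n x n row-major matrix `src` into `dst`. */
static void transpose(const float *src, float *dst, size_t n)
{
    for (size_t r = 0; r < n; ++r)
        for (size_t c = 0; c < n; ++c)
            dst[c * n + r] = src[r * n + c];
}

/* C = A * B, where `Bt` already holds the transpose of B.
   Both reads in the inner loop walk memory with stride 1. */
static void matmul_transposed_b(const float *A, const float *Bt,
                                float *C, size_t n)
{
    for (size_t r = 0; r < n; ++r) {
        for (size_t c = 0; c < n; ++c) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[r * n + k] * Bt[c * n + k];
            C[r * n + c] = sum;
        }
    }
}
```

Whether this helps the tiled kernel depends on the device, since that kernel already stages B through local memory; measuring both variants is the only reliable check.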

Answer 1 (score: 1)


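I don't have any personal experience with matrix multiplication, but I thought it might be possible to store the data along a Z-order curve to get a more cache-friendly access pattern. Judging from the references on Wikipedia, something similar has been done by Valsalam & al 2002.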
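
For reference, a Z-order (Morton) index interleaves the bits of the row and column numbers. The sketch below is a generic bit-interleaving routine, not taken from the cited work; `part1by1` and `morton_index` are hypothetical names, and it assumes row and column indices below 65536 and a matrix padded to a power-of-two dimension so the indices fill a contiguous range.

```c
#include <stdint.h>

/* Spread the low 16 bits of x so that a zero bit sits between each pair. */
static uint32_t part1by1(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-order index of element (row, col): column bits land in the even
   bit positions, row bits in the odd bit positions. */
static uint32_t morton_index(uint32_t row, uint32_t col)
{
    return (part1by1(row) << 1) | part1by1(col);
}
```

Elements of each 2 x 2 block, then each 4 x 4 block, and so on, end up adjacent in memory, which is the property the answer hopes will improve cache behavior.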
