I am new to OpenCL and I am trying to write kernel code for the following matrix operation:
A is a 2X2 matrix:
A = [1 2] ----> row1
    [3 4] ----> row2
I need to compute:
1) s1 = transpose(row1) X row1
2) s2 = transpose(row2) X row2
3) Sum = s1+s2
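Worked out for the 2x2 example, and assuming X denotes the outer product (a column vector times a row vector), the expected values would be:

```latex
\begin{aligned}
s_1 &= \begin{bmatrix} 1 \\ 2 \end{bmatrix}
       \begin{bmatrix} 1 & 2 \end{bmatrix}
     = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}, \\
s_2 &= \begin{bmatrix} 3 \\ 4 \end{bmatrix}
       \begin{bmatrix} 3 & 4 \end{bmatrix}
     = \begin{bmatrix} 9 & 12 \\ 12 & 16 \end{bmatrix}, \\
\mathrm{Sum} &= s_1 + s_2
     = \begin{bmatrix} 10 & 14 \\ 14 & 20 \end{bmatrix}
     = A^{\mathsf{T}} A .
\end{aligned}
```

Under that reading, each entry is Sum[r][c] = A[0][r]*A[0][c] + A[1][r]*A[1][c], so every output element can be computed independently of the others.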
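I wrote kernel code at the row level (i.e. I can compute transpose(row1) X row1), but it only handles the first row.
How can I use parallelism to do this computation for every row and obtain the final sum inside the kernel function?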
private static String programSource1 =
"__kernel"+
" void matrixMul(__global float* A, __global float* C, int rowLength)"+
"{"+
"int row = get_global_id(1);"+
"int col = get_global_id(0);"+
"C[row*rowLength+col] = A[col] * A[row];"+
"}";
Answer 0 (score: 1)
#define MAX_ROW_LENGTH 2 // or more

__kernel void matrixMul(__global float* A, __global float* C,
                        int rowLength)
{
    __local float buffer[MAX_ROW_LENGTH * MAX_ROW_LENGTH];
    __local float s1[MAX_ROW_LENGTH * MAX_ROW_LENGTH];

    int col = get_global_id(0);
    int row = get_global_id(1);
    int rows = get_global_size(1);

    // read the matrix from global to local memory
    buffer[row * rowLength + col] = A[row * rowLength + col];
    s1[row * rowLength + col] = 0.0f;

    // wait until every work-item has stored its element
    barrier(CLK_LOCAL_MEM_FENCE);

    // sum over all rows i of A[i][col] * A[i][row]
    for (int i = 0; i < rows; ++i)
    {
        s1[row * rowLength + col] +=
            buffer[i * rowLength + col] * buffer[i * rowLength + row];
    }

    C[row * rowLength + col] = s1[row * rowLength + col];
}
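Here is some kernel code that does what you need for small matrices. The kernel uses local memory to reduce global memory accesses. For a problem this small (a 2x2 matrix) this won't achieve anything, but for larger matrices it could speed things up a bit. Note that this is a short example without any optimization. It has some limitations: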
If you don't want to use local memory, replace the buffer accesses in the for loop with A and write directly to C instead of s1.
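The variant without local memory can be sketched on the host in plain C, which is also handy as a reference for checking the kernel's output. This is only a sketch: matrix_mul_work_item and matrix_mul_reference are hypothetical names, the two loops stand in for the 2D NDRange, and a square matrix is assumed. It accumulates into a private variable and stores to C once, a small deviation from writing to C inside the loop that avoids having to zero C first.

```c
/* Hypothetical host-side stand-in for one work-item of the kernel.
 * In OpenCL, col and row would come from get_global_id(0) and
 * get_global_id(1), and rows from get_global_size(1). */
void matrix_mul_work_item(const float *A, float *C,
                          int rowLength, int rows,
                          int row, int col)
{
    float sum = 0.0f;

    /* Read A directly (no local buffer) and accumulate over all rows. */
    for (int i = 0; i < rows; ++i) {
        sum += A[i * rowLength + col] * A[i * rowLength + row];
    }

    /* Store the result straight into C instead of going through s1. */
    C[row * rowLength + col] = sum;
}

/* Emulates the 2D NDRange by running the work-item body once per
 * (row, col) pair. Assumes a square matrix (rows == rowLength),
 * as in the 2x2 example from the question. */
void matrix_mul_reference(const float *A, float *C, int rowLength)
{
    for (int row = 0; row < rowLength; ++row) {
        for (int col = 0; col < rowLength; ++col) {
            matrix_mul_work_item(A, C, rowLength, rowLength, row, col);
        }
    }
}
```

Calling matrix_mul_reference with A = {1, 2, 3, 4} and rowLength = 2 fills C with {10, 14, 14, 20}, which is the sum of the two outer products from the question.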