I am developing a simple OpenCL kernel that does nothing but computation, with no memory accesses. Here is the kernel we execute on the GPU:
__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
const int XGL = get_global_id(0);
const int XGRid = get_group_id(0);
const int XGRnum = get_num_groups(0);
const int XLSize = get_local_size(0);
const int XLid = get_local_id(0);
// Private accumulators
float temp1 = 1.0f;
float temp2 = 1.0f;
float temp3 = 1.0f;
float temp4 = 1.0f;
float tempOut;
float MF = (float) M;
float NF = (float) N;
float PF = (float) P;
// Start of a new level of for loop
long baseIndex1 = XGRid*XLSize*8+XLid;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
temp1 += temp1 * MF;
// ... (the same dependent FMA repeated; ~1024 FMA operations in total) ...
temp1 += temp1 * MF;
tempOut = temp1 + temp2 + temp3 + temp4;
GOut[baseIndex1] = tempOut;
}
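(For reference, the long hand-unrolled chain above could also be written compactly as a loop and left to the compiler to unroll. This is only a hypothetical sketch with an assumed trip count, on compilers that honor #pragma unroll; whether it compiles to the same code as the hand-unrolled version is device-dependent.)

__kernel void ChainLoopSketch(__global float *GOut, const int M) {
    const float MF = (float) M;
    float temp1 = 1.0f;
    #pragma unroll
    for (int i = 0; i < 1024; ++i)   // assumed count, matching the ~1024 FMAs mentioned below
        temp1 += temp1 * MF;         // each iteration depends on the previous one
    GOut[get_global_id(0)] = temp1;
}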
The total number of "FMA operations" is roughly 1024. Because of the data dependency in this kernel, each instruction has to wait for the previous one to complete. I tried to optimize the kernel's execution by using more temporary variables to widen the gap between data-dependent operations, as shown below:
__kernel void WGS512MAPI8LLXOPS64(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
const int XGL = get_global_id(0);
const int XGRid = get_group_id(0);
const int XGRnum = get_num_groups(0);
const int XLSize = get_local_size(0);
const int XLid = get_local_id(0);
// Private accumulators
float temp1 = 1.0f;
float temp2 = 1.0f;
float temp3 = 1.0f;
float temp4 = 1.0f;
float tempOut;
float MF = (float) M;
float NF = (float) N;
float PF = (float) P;
// Start of a new level of for loop
long baseIndex1 = XGRid*XLSize*8+XLid;
temp1 += temp1 * MF; temp2 += temp2 * NF;
temp1 += temp1 * MF; temp2 += temp2 * NF;
temp1 += temp1 * MF; temp2 += temp2 * NF;
temp1 += temp1 * MF; temp2 += temp2 * NF;
temp1 += temp1 * MF; temp2 += temp2 * NF;
// ... (the same pair of independent FMAs repeated; same total operation count) ...
temp1 += temp1 * MF; temp2 += temp2 * NF;
tempOut = temp1 + temp2 + temp3 + temp4;
GOut[baseIndex1] = tempOut;
}
Running both kernels and computing the total GFLOPS for each (even though both perform the same total number of operations), the first kernel achieves about 1186.17 GFLOPS while the second reaches only about 600.58 GFLOPS, which completely contradicts my initial assumption. Since I have removed memory accesses entirely, I cannot come up with any other explanation for these numbers.
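For reference, GFLOPS figures like these are typically derived from the kernel's profiled execution time. Below is a minimal host-side sketch of such a measurement; the work sizes, variable names, and per-work-item flop count are illustrative assumptions, not the exact code used for the numbers above, and the command queue must have been created with CL_QUEUE_PROFILING_ENABLE.

#include <stdio.h>
#include <CL/cl.h>

/* queue and kernel are assumed to be already set up elsewhere,
   with the queue created using CL_QUEUE_PROFILING_ENABLE */
cl_event evt;
size_t globalSize = 1 << 20;   /* hypothetical total work size */
size_t localSize  = 512;       /* assumed from "WGS512" in the kernel name */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t0, t1;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

double seconds = (double)(t1 - t0) * 1e-9;            /* timestamps are in nanoseconds */
double flops   = (double) globalSize * 1024.0 * 2.0;  /* ~1024 FMAs per work-item, 2 flops each */
printf("Achieved: %.2f GFLOPS\n", flops / seconds * 1e-9);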
So I would like to know whether anyone can explain what is happening at the device level when these kernels are executed.