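I have been looking into OpenCL as a way to optimize code and run tasks in parallel, with the goal of getting better performance than pure Java. I have now run into a problem.
I put together a Java program using LWJGL that, as far as I can tell, should perform nearly the same task in two different ways: once in pure Java and once with an OpenCL kernel. The task here is adding the elements of two arrays together and storing the result in a third array. I am using System.currentTimeMillis() to measure how long each approach takes for arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop runs about 3 to 10 times faster than the CL program, depending on the array size. My code is as follows (imports omitted):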
public class TestCL {

    private static final int SIZE = 9999999; //Size of arrays to test, this value is changed sometimes in between tests
    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" +
            "const int itemId = get_global_id(0);\n" +
            "if(itemId < size){\n" +
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";

    public static void main(String[] args){
        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();
        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }
        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);
        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //Floatbuffer for the first array of floats
        //The kernel only reads the inputs, so they use CL_MEM_READ_ONLY (the flag describes kernel access)
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //And the second
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result
        //The kernel only writes the result, so it uses CL_MEM_WRITE_ONLY
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Write results to a FloatBuffer
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);

        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        CL.create();

        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);

        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }
}
Here are some sample console results from various tests:
CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88
CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10
CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943
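Is there something wrong with my code that makes CL slower, or is this actually expected in a case like this? If it is the latter, when is CL preferable?
Answer 0 (score: 0)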
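I modified the test to do something that I assumed would be computationally more expensive than a simple addition.
In the CPU test, the line: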
rData[i] = aData[i] + bData[i];
was changed to:
rData[i] = (float)(Math.sin(aData[i]) * Math.cos(bData[i]));
In the CL kernel, the line:
result[itemId] = a[itemId] + b[itemId];
was changed to:
result[itemId] = sin(a[itemId]) * cos(b[itemId]);
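Put together, the modified kernel source would look roughly like this as a Java string constant. This is a sketch that assumes only that one line of the question's kernel changes; the class name ModifiedKernel is hypothetical and exists only so the snippet compiles on its own.

```java
public class ModifiedKernel {
    // Same kernel as in the question, with only the arithmetic line replaced.
    // ModifiedKernel is a hypothetical holder class, not part of the original program.
    static final String KERNEL =
            "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" +
            "const int itemId = get_global_id(0);\n" +
            "if(itemId < size){\n" +
            "result[itemId] = sin(a[itemId]) * cos(b[itemId]);\n" +
            "}\n" +
            "}";

    public static void main(String[] args) {
        // Print the source so it can be inspected before handing it to the CL compiler.
        System.out.println(KERNEL);
    }
}
```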
I am now getting console results such as:
CPU processing time for 1000000 elements: 154
GPU processing time for 1000000 elements: 11
CPU processing time for 10000000 elements: 8699
GPU processing time for 10000000 elements: 98
(The CPU took longer than I cared to wait for the test with 100000000 elements.)
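Dividing the reported times by the element count makes the pattern easier to see. The sketch below does that arithmetic for the 10,000,000-element runs quoted in this post. The class and method names are made up for illustration, and the inputs are single runs measured with System.currentTimeMillis(), so the figures are rough.

```java
public class PerElementCost {
    // Converts a wall-clock time in milliseconds into nanoseconds per element.
    static double nanosPerElement(long millis, long elements) {
        return millis * 1_000_000.0 / elements;
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        // Times in ms are copied from the console output quoted in this post.
        System.out.println("add     CPU ns/element: " + nanosPerElement(24, n));
        System.out.println("add     GPU ns/element: " + nanosPerElement(88, n));
        System.out.println("sin*cos CPU ns/element: " + nanosPerElement(8699, n));
        System.out.println("sin*cos GPU ns/element: " + nanosPerElement(98, n));
    }
}
```

Worked out by hand, that is about 2.4 vs 8.8 ns per element for the addition and about 869.9 vs 9.8 ns per element for the sin/cos version. The GPU figure barely moves between the two workloads, which would be consistent with the GPU run being dominated by something other than the arithmetic, such as the result read-back that sits inside the timed section. That is an interpretation of these few numbers, not something I measured separately.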
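To check accuracy, I added a check that compares arbitrary elements of rData and res to make sure they are the same. I have omitted the results here; suffice it to say that they matched.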
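The check itself is not shown in this post. A minimal sketch of one way to do it follows, assuming the GPU results have been copied out of the FloatBuffer into a float[] (for example with FloatBuffer.get(float[])) and that a small tolerance is acceptable, since the Java version computes in double precision and casts to float while the kernel works on floats throughout. The names ResultCheck and firstMismatch and the tolerance value are hypothetical.

```java
public class ResultCheck {
    // Returns the index of the first element where the two arrays differ by
    // more than the tolerance, or -1 if every element agrees.
    static int firstMismatch(float[] expected, float[] actual, float tolerance) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        for (int i = 0; i < expected.length; i++) {
            // Written as !(diff <= tolerance) so that a NaN on either side counts as a mismatch.
            if (!(Math.abs(expected[i] - actual[i]) <= tolerance)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Tiny stand-in arrays; in the real test these would be rData and the copied GPU results.
        float[] cpu = {0.5f, -0.25f, 1.0f};
        float[] gpu = {0.5f, -0.25f, 1.0f};
        System.out.println("first mismatch index: " + firstMismatch(cpu, gpu, 1e-5f));
    }
}
```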
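Now that the function is more complex (two trigonometric functions multiplied together), the CL kernel appears to be far more efficient than the pure Java loop.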
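For the CPU side, a slightly more careful timing harness could look like the following. This is a sketch and not the code used for the numbers above: it assumes System.nanoTime() and a few untimed warm-up passes (to give the JIT compiler a chance to compile the loops) are wanted, and every name in it is hypothetical.

```java
public class CpuLoopTiming {
    // A workload fills r from a and b; both loops from this post fit this shape.
    interface Workload {
        void run(float[] a, float[] b, float[] r);
    }

    // The original CPU loop from the question.
    static void add(float[] a, float[] b, float[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] + b[i];
        }
    }

    // The heavier CPU loop from this answer.
    static void sinCos(float[] a, float[] b, float[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (float) (Math.sin(a[i]) * Math.cos(b[i]));
        }
    }

    // Runs the workload 'warmups' times untimed, then times one more pass in nanoseconds.
    static long timeNanos(Workload w, float[] a, float[] b, float[] r, int warmups) {
        for (int i = 0; i < warmups; i++) {
            w.run(a, b, r);
        }
        long start = System.nanoTime();
        w.run(a, b, r);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int size = 1_000_000;
        float[] a = new float[size], b = new float[size], r = new float[size];
        // Same arbitrary test data as in the question.
        for (int i = 0; i < size; i++) {
            a[i] = i;
            b[i] = size - i;
        }
        System.out.println("add ms:     " + timeNanos(CpuLoopTiming::add, a, b, r, 3) / 1_000_000.0);
        System.out.println("sin*cos ms: " + timeNanos(CpuLoopTiming::sinCos, a, b, r, 3) / 1_000_000.0);
    }
}
```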