So I'm having some trouble running code on certain OpenCL devices. I'm developing on a mid-2013 15" Retina MacBook Pro running OS X 10.9.5 (Mavericks), using Xcode 6.0.1.
After using clGetDeviceIDs to get all of the available devices and clGetDeviceInfo to query each one (a rough sketch of that query loop follows the listing), I get the following:
Device: Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
Hardware version: OpenCL 1.2
Software version: 1.1
OpenCL C version: OpenCL C 1.2
Parallel compute units: 8
Device: HD Graphics 4000
Hardware version: OpenCL 1.2
Software version: 1.2(Aug 17 2014 20:29:07)
OpenCL C version: OpenCL C 1.2
Parallel compute units: 16
Device: GeForce GT 650M
Hardware version: OpenCL 1.2
Software version: 8.26.28 310.40.55b01
OpenCL C version: OpenCL C 1.2
Parallel compute units: 2
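For reference, the listing above comes from a query loop roughly like the one below. This is only a minimal sketch of the kind of clGetDeviceInfo calls involved (the exact properties and buffer sizes are assumptions), not my actual code:
char name[256];
char hwVersion[256];
char swVersion[256];
char clcVersion[256];
cl_uint computeUnits;
for (cl_uint i = 0; i < deviceCount; i++)
{
    // Query a few basic properties of each device and print them
    clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(devices[i], CL_DEVICE_VERSION, sizeof(hwVersion), hwVersion, NULL);
    clGetDeviceInfo(devices[i], CL_DRIVER_VERSION, sizeof(swVersion), swVersion, NULL);
    clGetDeviceInfo(devices[i], CL_DEVICE_OPENCL_C_VERSION, sizeof(clcVersion), clcVersion, NULL);
    clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);
    std::cout << "Device: " << name << std::endl;
    std::cout << "Hardware version: " << hwVersion << std::endl;
    std::cout << "Software version: " << swVersion << std::endl;
    std::cout << "OpenCL C version: " << clcVersion << std::endl;
    std::cout << "Parallel compute units: " << computeUnits << std::endl;
}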
So according to this I should have one CPU and two GPUs: the HD Graphics 4000 and the GeForce GT 650M.
My problem is that when I try to call clGetKernelWorkGroupInfo, it returns a CL_INVALID_DEVICE error if I pass in the device ID of either GPU, yet it works fine if I pass in the CPU's ID, and my kernel code then runs without any problems.
This is strange, because every other call I make up to that point works for all three devices. I can create a context containing all three devices, create three separate command queues (one per device), and I can build the program and create the kernel. But as soon as I make that one call, it tells me my device is invalid.
If I comment out the call to clGetKernelWorkGroupInfo and specify my own global/local work sizes, I instead get a CL_INVALID_PROGRAM_EXECUTABLE error when I try to call clEnqueueNDRangeKernel.
Is there something wrong with the graphics cards installed in my machine, or is there something I need to do in my code that I'm not aware of? I just don't see how a device can be valid right up until one particular call and then suddenly be invalid.
EDIT: Here is my code (CheckError is just a function I wrote that prints a custom error message if an error occurs):
cl_int err; //Error catcher
cl_platform_id platform; //Computer platform
cl_context context; //Single context for whole platform
cl_uint deviceCount; //Number of devices (CPU + GPU) available on machine
cl_device_id *devices; //Array of pointers to devices;
cl_program program; //OpenCL program
cl_command_queue *commandQueues; //One command queue for each device
/*---Definitions---*/
int DATA_SIZE = 16384;
double results[DATA_SIZE]; // results returned from device;
int currDevice = 0; //Use this to just access first available device
/*---Get First Platform---*/
err = clGetPlatformIDs(1, &platform, NULL);
CheckError(err, "A valid platform could not be found on this machine");
/*---Get Device Count---*/
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
CheckError(err, "Could not determine the number of devices available on this platform");
/*---Get All Devices---*/
devices = new cl_device_id[deviceCount];
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);
CheckError(err, "Could not access the devices");
/*---Create a single context for all devices---*/
context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);
CheckError(err, "Could not create a context on this platform");
/*---For each device create a separate command queue---*/
commandQueues = new cl_command_queue[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
commandQueues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
string errMsg = "Was unable to successfully set up a command queue for device number " + to_string(i);
CheckError(err, errMsg);
}
/*---Read in cl file---*/
char *KernelSource = ReadFile("./Source/Sampling/Sampler.cl");
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **) & KernelSource, NULL, &err);
CheckError(err, "Failed to create compute program!");
// Build the program executable
err = clBuildProgram(program, deviceCount, devices, NULL, NULL, NULL);
if (err != CL_SUCCESS)
{
size_t len;
char buffer[2048];
printf("Error: Failed to build program executable!\n");
clGetProgramBuildInfo(program, devices[currDevice], CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
printf("%s\n", buffer);
exit(1);
}
// Create the compute kernel in the program we wish to run
cl_kernel kernel = clCreateKernel(program, "mySampler", &err);
CheckError(err, "Failed to create compute kernel!");
// Create the input array in device memory for our calculation
cl_mem input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(double) * DATA_SIZE, NULL, &err);
CheckError(err, "Failed to allocate device memory");
// Set the arguments to our compute kernel
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
CheckError(err, "Failed to set kernel arguments");
size_t global, local;
// Get the maximum work group size for executing the kernel on the device
err = clGetKernelWorkGroupInfo(kernel, devices[currDevice], CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
CheckError(err, "Failed to retrieve work group info!");
// Execute the kernel over the entire range of our 1d input data set
// using the maximum number of work group items for this device
global = DATA_SIZE;
err = clEnqueueNDRangeKernel(commandQueues[currDevice], kernel, 1, NULL, &global, &local, 0, NULL, NULL);
CheckError(err, "Failed to execute kernel!");
// Wait for the command commands to get serviced before reading back results
clFinish(commandQueues[currDevice]);
// Read back the results from the device to verify the output
err = clEnqueueReadBuffer(commandQueues[currDevice], input, CL_TRUE, 0, sizeof(double) * DATA_SIZE, results, 0, NULL, NULL );
CheckError(err, "Failed to read array");
std::cout<<"DONE!"<<std::endl;
for(int i = 0; i < DATA_SIZE; i++)
{
std::cout<<"RESULT: "<<i<<" "<<results[i]<<std::endl;
}
// Shutdown and cleanup
clReleaseMemObject(input);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(commandQueues[currDevice]);
clReleaseContext(context);
}
Answer 0 (score: 4)
I believe the program is failing to build for one or both of your GPUs. I've just checked this on my own OS X system: clBuildProgram() returns CL_SUCCESS if it was able to build the program for any of the devices you pass it, even if the build fails for the other devices.
You can check whether the build really succeeded for every device by adding this code after the clBuildProgram() call:
for (int i = 0; i < deviceCount; i++)
{
    cl_build_status status;
    clGetProgramBuildInfo(program, devices[i], CL_PROGRAM_BUILD_STATUS,
                          sizeof(status), &status, NULL);
    std::cout << "Build status for device " << i << " = " << status << std::endl;
}
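If a device reports a failure, you can also pull that device's build log, the same way your existing error path does for devices[currDevice], just inside a per-device loop. A minimal sketch (assuming a fixed-size buffer is large enough; otherwise query the log size first):
for (int i = 0; i < deviceCount; i++)
{
    cl_build_status status;
    clGetProgramBuildInfo(program, devices[i], CL_PROGRAM_BUILD_STATUS,
                          sizeof(status), &status, NULL);
    if (status != CL_BUILD_SUCCESS)
    {
        // Print this device's build log to see why compilation failed
        char log[4096];
        size_t logLen;
        clGetProgramBuildInfo(program, devices[i], CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, &logLen);
        std::cout << "Build log for device " << i << ":" << std::endl << log << std::endl;
    }
}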
I notice you are using double values. The HD 4000 does not support double precision, so a kernel that uses the double type will fail to build for it. When I compile a kernel that uses double together with your host code (plus the snippet above), I get the following output:
Build status for device 0 = 0
Build status for device 1 = -2
Build status for device 2 = 0
As you can see, the build succeeded for two of the devices (status 0 is CL_BUILD_SUCCESS), but not for device 1, the HD 4000 (status -2 is CL_BUILD_ERROR).
So, you need to be careful when building a program for multiple devices at once on Apple systems.
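If you want to keep a single program and context but avoid building for a device that cannot handle the kernel, one option is to check each device's extension string for cl_khr_fp64 (the standard double-precision extension) and only pass the capable devices to clBuildProgram. A rough, untested sketch (needs <vector> and <string>):
std::vector<cl_device_id> fp64Devices;
for (cl_uint i = 0; i < deviceCount; i++)
{
    // CL_DEVICE_EXTENSIONS is a space-separated list of supported extensions
    char extensions[4096] = {0};
    clGetDeviceInfo(devices[i], CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    if (std::string(extensions).find("cl_khr_fp64") != std::string::npos)
        fp64Devices.push_back(devices[i]);
}
// Build only for the devices that advertise double support
err = clBuildProgram(program, (cl_uint)fp64Devices.size(), fp64Devices.data(), NULL, NULL, NULL);
Then only enqueue the kernel on command queues whose device is in that list.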