我对OpenCL很陌生。我在大学时学到了一点,有点我的意思是我的图形教授教我们关于GPGPU和OpenCL的一天(而其他课程则专注于着色器和OpenGL等)。
我拿了一个示例程序并将其更改为使用我希望它运行的计算。但是我的程序在CPU上的运行速度明显快于我的GPU,我试图理解为什么。
我的程序接受一个输入浮点数组并有两个输出数组。在单线程环境下,它有三个参数。输入数组的大小为:samplesPerTrace tracesIn sizeof(float),输出数组的大小为:samplesPerTrace tracesOut sizeof(float)。< / p>
我的测试用例一直使用参数25000 2500 250,因为这是我将要使用的阵列的平均大小(可能略高于平均值)。这些值是随机填充的。
以下是OpenCL在内核上构建和运行的源代码;
const char* M_AND_S_OPENCL_SOURCE_TEXT =
"__kernel void sumAllCL(__global const float prestackTraces[],\n"
" __global float stackTracesOut[],\n"
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,\n"
" const unsigned int samplesPerTrace) {\n"
"\n"
" unsigned int k = get_global_id(0);\n" // Thread ID
"\n"
" unsigned int kTimesIn = k * nTracesIn;\n" // Store repeat ints
" unsigned int kTimesSamples = k * samplesPerTrace;\n"
"\n"
" for (int j = 0; j < ? ; j++) {\n" // ? position to be replaced (nTracesOut)"
"\n"
" int jTimesSamplesPT = j * samplesPerTrace;\n"
"\n"
" for (int i = 0; i < # ; i++) {\n" // # position to be replaced ()
"\n"
" int valueIndex = i + jTimesSamplesPT;\n"
" float value = prestackTraces[valueIndex];\n"
"\n"
" stackTracesOut[i + kTimesSamples] += value;\n"
" powerTracesOut[i + kTimesSamples] += (value * value);\n"
"\n"
" }\n"
" }\n"
"}\n";
注意那个?并且#在运行时用固定数字替换,我这样做因为我认为这将有助于编译器展开rl
通过上述参数(25000 2500 250~10 <1或2),我的CPU需要大约0.6秒才能完成程序,我的GPU需要大约40秒才能完成。这是一个更大的差异。 Fyi,我一直在搞乱第四个参数,看看哪个值运行得更快,这就是~10的含义。
我的显卡是MSI Radeon R9 390X 8GB,名称为夏威夷。当我有OpenCL打印出关于我的两个设备的信息时,这就是我得到的:
OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
MEMORY...
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS...
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel(R) Core(TM) i7-6700K CPU ? 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
MEMORY...
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)
VERSIONS...
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2
以下是与OpenCL相关的代码。我会发布一个完整的最小可验证的完整示例,但它让我超过了字符限制。
/*
* Prints the given int (numToInsert) at location inside chars.
*/
void PrintIntInStr(char* chars, int location, int numToInsert) {
std::stringstream strs;
strs << numToInsert;
std::string temp_str = strs.str();
char const* numToChars = temp_str.c_str();
int numberLength = strlen(numToChars);
int w;
for (w = 0; w < numberLength; w++) {
chars[location + w] = numToChars[w];
}
}
/*
* Initialize fastest OpenCL device.
*/
int InitOpenCL(int verbose, cl_int deviceType) {
cl_uint Nplat;
cl_int err;
char name[1024];
int MaxGflops = -1;
cl_platform_id winnerPlatform = 0;
// Reset (TODO)
_deviceID = NULL;
_context = NULL;
_queue = NULL;
// Get platforms
cl_platform_id platforms[4];
if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms\n");
else if (Nplat<1) Fatal("No OpenCL platforms found\n");
// Loop over platforms
for (unsigned int platform = 0; platform < Nplat; platform++) {
if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name\n");
if (verbose) printf("OpenCL Platform %d: %s\n", platform, name);
// Get GPU device IDs
cl_uint Ndev;
cl_device_id id[4];
if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev))
Fatal("Cannot get number of OpenCL devices: %d\n", platform);
else if (Ndev < 1) Fatal("No OpenCL devices found.\n");
// Find the fastest device
for (unsigned int devId = 0; devId < Ndev; devId++) {
// Print informatio about the device
cl_uint compUnits, freq, cacheLineSize;
cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize;
size_t maxWorkGrps;
char deviceVendor[50];
char deviceVersion[50];
char driverVersion[50];
char deviceOpenCLVersion[50];
// Computing Power...
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size\n");
// Memory...
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0;
// Versions...
clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL);
clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL);
int Gflops = compUnits * freq;
if (verbose) printf(" ----- OpenCL Device # %d: %s-----\n"
"Gflops: %f\n"
"Max Clock Frequency: %d\n"
"Max Compute Units: %d\n"
"Max Work Group Size: %zu\n"
" MEMORY...\n"
"Total Memory of Device: %s (CL_DEVICE_GLOBAL_MEM_SIZE)\n"
"Local Memory of Device: %s (CL_DEVICE_LOCAL_MEM_SIZE)\n"
"Max Memory Object Allocation: %s (CL_DEVICE_MAX_MEM_ALLOC_SIZE)\n"
"Cache Size: %s (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)\n"
"Cacheline Size: %s (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)\n"
" VERSIONS...\n"
"Device Vendor: %s\n"
"Device Version: %s\n"
"Driver Version: %s\n"
"Device OpenCL Version: %s\n",
devId,
name,
(1e-3 * Gflops),
freq,
compUnits,
maxWorkGrps,
byteConverter((unsigned long)memSize),
byteConverter((unsigned long)localMemSize),
byteConverter((unsigned long)maxAlloc),
byteConverter((unsigned long)globalCacheSize),
byteConverter((unsigned long)cacheLineSize),
deviceVendor,
deviceVersion,
driverVersion,
deviceOpenCLVersion);
if(Gflops > MaxGflops)
{
_deviceID = id[devId];
MaxGflops = Gflops;
winnerPlatform = platforms[platform];
}
}
}
// Print fastest device info (TODO: don't get name twice)
if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
printf("\n Selected Fastest Open CL Device: %s (#%lu)\n", name, (unsigned long)_deviceID);
// Check thread count
size_t mwgs;
if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL))
Fatal("Cannot get OpenCL max work group size\n");
// Create OpenCL context for fastest device
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)winnerPlatform,
(cl_context_properties)0
};
_context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err);
if (!_context || err) Fatal("Cannot create OpenCL Context\n");
// Properties for create command queue; currently nothing
// cl_command_queue_properties *propers;
cl_command_queue_properties prop = 0;
//prop |= CL_QUEUE_PROFILING_ENABLE;
//prop |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
// propers = ∝
_queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err); // Create OpenCL command queue for fastest device
// _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err);
if (!_queue || err) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command cue: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command cue: CL_INVALID_DEVICE\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command cue: CL_INVALID_VALUE\n");
else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command cue: CL_INVALID_QUEUE_PROPERTIES\n");
else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_RESOURCES\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_HOST_MEMORY\n");
else if (!_queue) Fatal("Cannot create OpenCL command cue: !queue\n");
else Fatal("Cannot create OpenCL command cue: ?????\n");
}
if (_VERBOSE) printf("Init complete.\n");
return mwgs;
}
/*
* Modify the source text to fit this run.
*/
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) {
size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1;
char* moveStackSourceCode = new char[sourceSize];
strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize);
moveStackSourceCode[sourceSize] = '\0';
// Print out the locations of the characters where we should insert other text if asked to do so
if (_FIND_INSERT_LOCATIONS) {
size_t z;
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '@') {
printf("Found @ at position %zu\n", z);
break;
}
}
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '#') {
printf("Found # at position %zu\n", z);
break;
}
}
}
// Insert the digit that for loops go to inside of the source
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_1, nTracesIn);
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_2, samplesPerT);
// Print the modified source code if verbose
if (_FIND_INSERT_LOCATIONS) {
printf("\n GPU Source Code: \n");
printf("%s\n", moveStackSourceCode);
}
return moveStackSourceCode;
}
/*
* Wait for event and then release it.
*/
static void WaitForEventAndRelease(cl_event *event) {
printf("WaitForEventAndRelease()\n");
cl_int status = CL_SUCCESS;
status = clWaitForEvents(1, event);
if (status) Fatal("clWaitForEvents Failed with Error Code");
printf("About to release event...\n");
status = clReleaseEvent(*event);
if (status) Fatal("clReleaseEvent Failed with Error Code");
}
// Runs the program via open CL
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[],
unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT,
size_t inXsamples, size_t outXsamples,
unsigned int localThreadCount)
{
cl_int err;
// Get the source code
char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT);
// Allocate device memory
// CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_PERSISTENT_MEM_AMD (?)
// Input...
cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
inXsamples * sizeof(cl_float), prestackTracesArray, &err);
if (err) FatalBufferCreation("Prestack traces", err);
// Output... TODO: How do we know that the output is zeroed out?
cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Stack traces", err);
cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Power traces", err);
// Compile the source code
char* gpuSourceText[1];
gpuSourceText[0] = modifiedGpuSource;
size_t sourceLength[1];
sourceLength[0] = strlen(modifiedGpuSource);
cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText,
(const size_t*)sourceLength, &err);
if (err != CL_SUCCESS) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY\n");
else Fatal("Cannot create program_S %d\n", err);
}
// Build the program
cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL);
if (buildCode != CL_SUCCESS) {
// Attempt to get compile errors
char log[1048576];
if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) {
log[0] = '\0'; // Failed to get the log file
}
if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM\n%s", log);
else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE\n%s", log);
else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE\n%s", log);
else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY\n%s", log);
else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD\n_OPTIONS\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE\n%s", log);
else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY\n%s", log);
else Fatal("Cannot build program: %d\n%s", buildCode, log);
}
// Compile the source code & build the kernel
cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err);
if (err) {
if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM\n");
else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE\n");
else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME\n");
else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMOR\n");
else Fatal("Cannot create kernel: %d\n", err);
}
// Set program parameters
cl_int returnValArgSet;
returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet);
// TODO: verbose
printf("About to run Kernel...\n");
// Start timer TODO: move?
double runTime = GetTime();
// Run the kernel (& also set the number of threads)
cl_event runEvent;
size_t Global[1] = { nTracesOut };
size_t Local[1] = { localThreadCount };
if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent);
else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent);
if (err) {
if (err == CL_INVALID_PROGRAM_EXECUTABLE) {
Fatal("Cannot run Kernel: No successfully built program executable available.\n");
} else if (err == CL_INVALID_COMMAND_QUEUE) {
Fatal("Cannot run Kernel: Command_queue is not a valid command-queue.\n");
} else if (err == CL_INVALID_KERNEL) {
Fatal("Cannot run Kernel: Kernel is not a valid kernel object.\n");
} else if (err == CL_INVALID_CONTEXT) {
Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same or if "
"the context associated with command_queue and events in event_wait_list are not the same.\n");
} else if (err == CL_INVALID_KERNEL_ARGS) {
Fatal("Cannot run Kernel: Kernel argument values have not been specified.\n");
} else if (err == CL_INVALID_WORK_DIMENSION) {
Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).\n");
} else if (err == CL_INVALID_WORK_GROUP_SIZE) {
Fatal("Cannot run Kernel: local_work_size is specified and number of work-items specified by global_work_size "
"is not evenly divisable by size of work-group given by local_work_size or does not match the "
"work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) "
"qualifier in program source.\n");
} else if (err == CL_INVALID_WORK_ITEM_SIZE) {
Fatal("Cannot run Kernel: If the number of work-items specified in any of local_work_size[0], ... "
"local_work_size[work_dim - 1] is greater than the corresponding values specified "
"by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], .... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1]. .\n");
} else if (err == CL_INVALID_GLOBAL_OFFSET) {
Fatal("Cannot run Kernel: Global_work_offset is not NULL.\n");
} else if (err == CL_OUT_OF_RESOURCES) {
Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.\n");
} else if (err == CL_MEM_OBJECT_ALLOCATION_FAILURE) {
Fatal("Cannot run Kernel: Failure to allocate memory for data store associated with image or buffer "
"objects specified as arguments to kernel.\n");
} else if (err == CL_INVALID_EVENT_WAIT_LIST) {
Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list "
"is not NULL and num_events_in_wait_list is 0, or if event objects in event_wait_list "
"are not valid events..\n");
} else if (err == CL_OUT_OF_HOST_MEMORY) {
Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.\n");
} else {
Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)");
}
}
// Flush the program & wait for the program to finish executing
if (clFlush(_queue)) printf("Flush Fail (Run)");
WaitForEventAndRelease(&runEvent);
// Copy the end result back to CPU memory side
if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy stackTracesOutCL from device to host\n");
if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy powerTracesOutCL from device to host\n");
// Release kernel and program
if (clReleaseKernel(kernel)) Fatal("Cannot release kernel\n");
if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program\n");
// Free device memory
clReleaseMemObject(prestackTracesCL);
clReleaseMemObject(stackTracesOutCL);
clReleaseMemObject(powerTracesOutCL);
// Release the context and queue
clReleaseCommandQueue(_queue);
clReleaseContext(_context);
// Return the time it took to run this program
return runTime;
}
double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
unsigned int localThreadCount, unsigned int deviceType) {
// Stores sizes of the various arrays
size_t tracesInxSample = nTracesIn * samplesPerTrace;
size_t tracesOutxSample = nTracesOut * samplesPerTrace;
// Allocate arrays
float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float));
float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
// Count how much memory all of this is
if (_VERBOSE)
{
// Make sure it is consistent with above allocation
unsigned long allocatedMemory = 0;
allocatedMemory += tracesInxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
printf("TOTAL MEMORY ALLOCATED: %s\n", byteConverter(allocatedMemory));
printf("Input Array Sizes: %s\n", byteConverter((unsigned int)(tracesInxSample * sizeof(float))));
printf("Output Array Sizes: %s\n", byteConverter((unsigned int)(tracesOutxSample * sizeof(float))));
}
// Fill in array with randoms
RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample);
// Init OpenCL using the desired device type
double preInitTime = GetTime();
int maxWorkGroupSize;
if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL);
else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU);
else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU);
printf("Max work size for the device is: %d\n", maxWorkGroupSize);
// --- ACTUAL TEST ---
// Run OpenCL
double startTime = GetTime();
double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays
nTracesOut, nTracesIn, samplesPerTrace, // ints
tracesInxSample, tracesOutxSample,
localThreadCount); // samples
// Display run time
double endTime = GetTime();
printf("Elapsed Time: %fsecs\n", (endTime - runTime));
printf(" %fsecs (Before Function Call)\n", (endTime - startTime));
printf(" %fsecs (Including Init)\n\n", (endTime - preInitTime));
// Free the 1D arrays
free(powerTracesOut1Dgpu);
free(stackTracesOut1Dgpu);
free(prestackTraces1D);
return (endTime - startTime);
}
我首先想到为什么它在我的GPU上运行速度比我的CPU慢得多,这可能是因为我在运行之前在显卡上运行了这么多数据。也许更好的实现可能涉及在多次运行中拆分工作负载,以便代码可以在执行更多数据时执行(我认为这是一件事)。然而,现在我想到这几乎肯定是错误的,因为正如我所说,我基于一个例子编写了这个程序,并且该示例进行了矩阵乘法,并且该示例在GPU上比我的CPU运行得快得多。我真的不知道区别是什么。
答案 0 :(得分:1)
问题在于缓存;我从数组中读过很多文字。所以我在写入数组之前创建了一个尽可能多地写入局部变量的版本,现在它在GPU上运行得更快。