I am trying to measure the execution time of a GPU and compare it with the CPU. I wrote a simple_add function to add all the elements of a short int vector. The kernel code is:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
    ///------------------------------------------------
    /// Add 16 bits of each
    int AA=A[get_global_id(0)];
    int BB=B[get_global_id(0)];
    int AH=0xFFFF0000 & AA;
    int AL=0x0000FFFF & AA;
    int BH=0xFFFF0000 & BB;
    int BL=0x0000FFFF & BB;
    int CL=(AL+BL)&0x0000FFFF;
    int CH=(AH+BH)&0xFFFF0000;
    C[get_global_id(0)]=CH|CL;
}
I wrote a CPU version of this function as well, and measured the execution time of both over 100 executions:
clock_t before_GPU = clock();
for(int i=0;i<100;i++)
{
    queue.enqueueNDRangeKernel(kernel_add,1,
        cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
    queue.finish();
}
clock_t after_GPU = clock();
clock_t before_CPU = clock();
for(int i=0;i<100;i++)
    AddImagesCPU(A,B,C);
clock_t after_CPU = clock();
After calling the whole measurement function 10 times, the results are as follows:
CPU time: 1359
GPU time: 1372
----------------
CPU time: 1336
GPU time: 1269
----------------
CPU time: 1436
GPU time: 1255
----------------
CPU time: 1304
GPU time: 1266
----------------
CPU time: 1305
GPU time: 1252
----------------
CPU time: 1313
GPU time: 1255
----------------
CPU time: 1313
GPU time: 1253
----------------
CPU time: 1384
GPU time: 1254
----------------
CPU time: 1300
GPU time: 1254
----------------
CPU time: 1322
GPU time: 1254
----------------
The problem is that I really expected the GPU to be much faster than the CPU, but it is not. I cannot understand why my GPU is not much faster than the CPU. Is there a problem in my code? Here are my GPU properties:
-----------------------------------------------------
------------- Selected Platform Properties-------------:
NAME: AMD Accelerated Parallel Processing
EXTENSION: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
VENDOR: Advanced Micro Devices, Inc.
VERSION: OpenCL 1.2 AMD-APP (937.2)
PROFILE: FULL_PROFILE
-----------------------------------------------------
------------- Selected Device Properties-------------:
NAME : ATI RV730
TYPE : 4
VENDOR : Advanced Micro Devices, Inc.
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.0 AMD-APP (937.2)
EXTENSIONS : cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
MAX_COMPUTE_UNITS : 8
MAX_WORK_GROUP_SIZE : 128
OPENCL_C_VERSION : OpenCL C 1.0
DRIVER_VERSION: CAL 1.4.1734
==========================================================
Just for comparison, here are my CPU specifications:
------------- CPU Properties-------------:
NAME : Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
TYPE : 2
VENDOR : GenuineIntel
PROFILE : FULL_PROFILE
VERSION : OpenCL 1.2 AMD-APP (937.2)
MAX_COMPUTE_UNITS : 4
MAX_WORK_GROUP_SIZE : 1024
OPENCL_C_VERSION : OpenCL C 1.2
DRIVER_VERSION: 2.0 (sse2,avx)
==========================================================
I also measured the wall-clock time with QueryPerformanceCounter, and the results are as follows:
CPU time: 1304449.6 micro-sec
GPU time: 1401740.82 micro-sec
----------------------
CPU time: 1620076.55 micro-sec
GPU time: 1310317.64 micro-sec
----------------------
CPU time: 1468520.44 micro-sec
GPU time: 1317153.63 micro-sec
----------------------
CPU time: 1304367.29 micro-sec
GPU time: 1251865.14 micro-sec
----------------------
CPU time: 1301589.17 micro-sec
GPU time: 1252889.4 micro-sec
----------------------
CPU time: 1294750.21 micro-sec
GPU time: 1257017.41 micro-sec
----------------------
CPU time: 1297506.93 micro-sec
GPU time: 1252768.9 micro-sec
----------------------
CPU time: 1293511.29 micro-sec
GPU time: 1252019.88 micro-sec
----------------------
CPU time: 1320753.54 micro-sec
GPU time: 1248895.73 micro-sec
----------------------
CPU time: 1296486.95 micro-sec
GPU time: 1255207.91 micro-sec
----------------------
I then tried again, measuring the execution time with OpenCL profiling.
The results for a single execution are roughly the same.
Answer 0 (score: 1)
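The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with 1/4 of the total number of threads (that is, NumberOfAllElements / 16). This also helps each work item load from memory faster.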
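As a rough illustration of the vector-type suggestion, the sketch below holds a 4-wide variant of the question's kernel as a source string, the way host code usually carries OpenCL C. The kernel name simple_add4, the constant kSimpleAdd4Source and the helper vec4_global_size are made up for this example, and the kernel has not been run on an RV730. It assumes the buffers hold a multiple of four 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// OpenCL C source for a hypothetical 4-wide variant of simple_add.
// Each work item handles four 32-bit words instead of one.
const std::string kSimpleAdd4Source = R"CLC(
void kernel simple_add4(global const int4* A, global const uint4* B, global int4* C)
{
    size_t i = get_global_id(0);
    uint4 a = as_uint4(A[i]);
    uint4 b = B[i];
    const uint4 HI = (uint4)(0xFFFF0000u);
    const uint4 LO = (uint4)(0x0000FFFFu);
    // Add the two 16-bit halves independently, as in the scalar kernel.
    uint4 lo = ((a & LO) + (b & LO)) & LO;
    uint4 hi = ((a & HI) + (b & HI)) & HI;
    C[i] = as_int4(hi | lo);
}
)CLC";

// Global work size for the 4-wide kernel: a quarter of the scalar launch,
// which used NumberOfAllElements / 4 work items.
std::size_t vec4_global_size(std::size_t number_of_all_elements)
{
    assert(number_of_all_elements % 16 == 0);  // assumption: no tail elements
    return number_of_all_elements / 16;
}
```

The source string would be passed to the program build step in place of the scalar kernel, and the result of vec4_global_size would replace NumberOfAllElements/4 in the enqueue call. When a local size of 64 is kept, the global size still has to be a multiple of it.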
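The kernel also does very little computation compared with its memory operations. Mapping the buffers to RAM should give better performance: do not copy the arrays, map them into memory with the map / unmap enqueue commands.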
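If it is still not faster, you can use the GPU and the CPU at the same time, one working on the first half of the data and the other on the second half, to finish in about 50% of the time.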
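One way to picture the half-and-half split is a small helper that decides how many items each device gets. This is only a sketch: WorkSplit and split_work are hypothetical names, and rounding the GPU share down to a multiple of the work-group size is an assumption made here because OpenCL 1.x requires the global size to be divisible by the local size when one is given.

```cpp
#include <cassert>
#include <cstddef>

struct WorkSplit {
    std::size_t gpu_count;  // items [0, gpu_count) go to the GPU
    std::size_t cpu_count;  // items [gpu_count, total) go to the CPU
};

// Give the GPU roughly half of the items, rounded down to a whole number
// of work groups; the CPU takes everything that is left.
WorkSplit split_work(std::size_t total, std::size_t local_size)
{
    assert(local_size > 0);
    std::size_t half = total / 2;
    std::size_t gpu = half - (half % local_size);
    return WorkSplit{gpu, total - gpu};
}
```

For example, split_work(1000, 64) gives 448 items to the GPU and 552 to the CPU. An even split only approaches the 50% figure when both devices run at similar speed, which is what the question's timings suggest.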
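Also, do not put clFinish inside the loop. Put it right after the end of the loop. That way the kernels are enqueued much faster, and since execution is already in order, a kernel will not start before the previous one has finished. I assume it is an in-order queue, so adding clFinish after every enqueue is extra overhead. A single clFinish after the last kernel is enough.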
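The loop restructuring could look roughly like the sketch below. It is written as a template so it compiles without the OpenCL headers; run_batched is a hypothetical helper, and CountingQueue is a stand-in used only to exercise it without a device. With the real wrapper, Queue would be cl::CommandQueue, whose enqueueNDRangeKernel(kernel, offset, global, local) and finish() members are the ones the question already calls.

```cpp
#include <cassert>

// Enqueue the same kernel `iterations` times and block only once at the end.
template <class Queue, class Kernel, class Range>
void run_batched(Queue& queue, const Kernel& kernel,
                 const Range& offset, const Range& global, const Range& local,
                 int iterations)
{
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i)
        queue.enqueueNDRangeKernel(kernel, offset, global, local);
    queue.finish();  // single synchronization point for the whole batch
}

// Hypothetical stand-in queue that only counts calls.
struct CountingQueue {
    int enqueued = 0;
    int finished = 0;
    void enqueueNDRangeKernel(int, int, int, int) { ++enqueued; }
    void finish() { ++finished; }
};
```

Calling run_batched on a CountingQueue with 100 iterations records 100 enqueues and one finish, where the loop in the question would record 100 of each.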
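ATI RV730: 64 VLIW units, each with at least 4 stream cores, at 750 MHz.
i3-2100: 2 cores (the extra hardware threads only help fill pipeline bubbles), each with AVX capable of streaming 8 operations at once, so 16 operations can be in flight, at a little over 3 GHz.
Simply multiplying the number of streaming operations by the frequency in GHz:
ATI RV730 = 64 x 4 x 0.75 = 192 units (more with multiply-add and with the 5th element of each VLIW)
i3-2100 = 16 x 3 = 48 units
So the GPU should be at least 4 times as fast (using int4, uint4). This holds for simple ALU and FPU operations such as bitwise operations and multiplications. The performance of special functions such as transcendentals may differ, because they run only on the 5th unit of each VLIW.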
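The back-of-the-envelope estimate above can be written out as a few lines of arithmetic. The function names are hypothetical, and the figure is only the rough "operations in flight times GHz" measure from this answer, not a benchmark.

```cpp
#include <cassert>

// Peak "operations in flight x GHz" figure used in the estimate.
double peak_units(int units, int lanes_per_unit, double ghz)
{
    assert(units > 0 && lanes_per_unit > 0 && ghz > 0.0);
    return units * lanes_per_unit * ghz;
}

// 64 VLIW units x 4 stream cores x 0.75 GHz
double rv730_units() { return peak_units(64, 4, 0.75); }

// 2 cores x 8 AVX lanes x 3 GHz
double i3_2100_units() { return peak_units(2, 8, 3.0); }
```

With these inputs rv730_units() is 192, i3_2100_units() is 48, and their ratio is 4, which is where the "at least 4 times" figure comes from.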
Answer 1 (score: 0)
I did some additional tests and realized that the GPU is optimized for floating-point operations. I changed the test code as follows:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
///------------------------------------------------
/// Add 16 bits of each
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
float AH=0xFFFF0000 & AA;
float AL=0x0000FFFF & AA;
float BH=0xFFFF0000 & BB;
float BL=0x0000FFFF & BB;
int CL=(int)(AL*cos(AL)+BL*sin(BL))&0x0000FFFF;
int CH=(int)(AH*cos(AH)+BH*sin(BL))&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
and got the result I expected (about 10 times faster):
CPU time: 741046.665 micro-sec
GPU time: 54618.889 micro-sec
----------------------------------------------------
CPU time: 741788.112 micro-sec
GPU time: 54875.666 micro-sec
----------------------------------------------------
CPU time: 739975.979 micro-sec
GPU time: 54560.445 micro-sec
----------------------------------------------------
CPU time: 755848.937 micro-sec
GPU time: 54582.111 micro-sec
----------------------------------------------------
CPU time: 724100.716 micro-sec
GPU time: 56893.445 micro-sec
----------------------------------------------------
CPU time: 744476.351 micro-sec
GPU time: 54596.778 micro-sec
----------------------------------------------------
CPU time: 727787.538 micro-sec
GPU time: 54602.445 micro-sec
----------------------------------------------------
CPU time: 731132.939 micro-sec
GPU time: 54710.000 micro-sec
----------------------------------------------------
CPU time: 727899.150 micro-sec
GPU time: 54583.444 micro-sec
----------------------------------------------------
CPU time: 727089.880 micro-sec
GPU time: 54594.778 micro-sec
----------------------------------------------------
With somewhat heavier floating-point operations, like the following:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
///------------------------------------------------
/// Add 16 bits of each
int AA=A[get_global_id(0)];
int BB=B[get_global_id(0)];
float AH=0xFFFF0000 & AA;
float AL=0x0000FFFF & AA;
float BH=0xFFFF0000 & BB;
float BL=0x0000FFFF & BB;
int CL=(int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL))+
BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL)))&0x0000FFFF;
int CH=(int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH))+
BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH)))&0xFFFF0000;
C[get_global_id(0)]=CH|CL;
}
the result is more or less the same:
CPU time: 3905725.933 micro-sec
GPU time: 354543.111 micro-sec
-----------------------------------------
CPU time: 3698211.308 micro-sec
GPU time: 354850.333 micro-sec
-----------------------------------------
CPU time: 3696179.243 micro-sec
GPU time: 354302.667 micro-sec
-----------------------------------------
CPU time: 3692988.914 micro-sec
GPU time: 354764.111 micro-sec
-----------------------------------------
CPU time: 3699645.146 micro-sec
GPU time: 354287.666 micro-sec
-----------------------------------------
CPU time: 3681591.964 micro-sec
GPU time: 357071.889 micro-sec
-----------------------------------------
CPU time: 3744179.707 micro-sec
GPU time: 354249.444 micro-sec
-----------------------------------------
CPU time: 3704143.214 micro-sec
GPU time: 354934.111 micro-sec
-----------------------------------------
CPU time: 3667518.628 micro-sec
GPU time: 354809.222 micro-sec
-----------------------------------------
CPU time: 3714312.759 micro-sec
GPU time: 354883.888 micro-sec
-----------------------------------------
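To put a number on the two listings above, the speedup is simply the CPU time divided by the GPU time. The helper name speedup is hypothetical; the inputs in the note are the first CPU/GPU pair of each listing.

```cpp
#include <cassert>

// Ratio of CPU time to GPU time; values above 1 mean the GPU was faster.
double speedup(double cpu_microseconds, double gpu_microseconds)
{
    assert(gpu_microseconds > 0.0);
    return cpu_microseconds / gpu_microseconds;
}
```

speedup(741046.665, 54618.889) comes out at roughly 13.6 for the first test, and speedup(3905725.933, 354543.111) at roughly 11.0 for the heavier one, so both are a little above the "about 10 times" mentioned in the text.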