My OpenCL test does not run much faster than the CPU

Time: 2017-02-21 06:20:48

Tags: c++ parallel-processing opencl gpu

I am trying to measure the execution time on the GPU and compare it with the CPU. I wrote a simple_add function that adds all the elements of a short int vector. The kernel code is:

    void kernel simple_add(global const int * A, global const uint * B, global int* C)
    {
        ///------------------------------------------------
        /// Add 16 bits of each
        int AA=A[get_global_id(0)];
        int BB=B[get_global_id(0)];
        int AH=0xFFFF0000 & AA;
        int AL=0x0000FFFF & AA;
        int BH=0xFFFF0000 & BB;
        int BL=0x0000FFFF & BB;
        int CL=(AL+BL)&0x0000FFFF;
        int CH=(AH+BH)&0xFFFF0000;
        C[get_global_id(0)]=CH|CL;
    }

I wrote a CPU version of this function and measured the execution time of both over 100 runs:

    clock_t before_GPU = clock();
    for(int i=0;i<100;i++)
    {
        queue.enqueueNDRangeKernel(kernel_add,1,
            cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
        queue.finish();
    }
    clock_t after_GPU = clock();


    clock_t before_CPU = clock();
    for(int i=0;i<100;i++)
        AddImagesCPU(A,B,C);
    clock_t after_CPU = clock();

After calling the whole measurement function 10 times, the results are:

    CPU time: 1359
    GPU time: 1372
    ----------------
    CPU time: 1336
    GPU time: 1269
    ----------------
    CPU time: 1436
    GPU time: 1255
    ----------------
    CPU time: 1304
    GPU time: 1266
    ----------------
    CPU time: 1305
    GPU time: 1252
    ----------------
    CPU time: 1313
    GPU time: 1255
    ----------------
    CPU time: 1313
    GPU time: 1253
    ----------------
    CPU time: 1384
    GPU time: 1254
    ----------------
    CPU time: 1300
    GPU time: 1254
    ----------------
    CPU time: 1322
    GPU time: 1254
    ----------------

The problem is that I really expected the GPU to be much faster than the CPU, but it is not. I cannot understand why my GPU is not much faster than my CPU. Is there something wrong with my code? Here are my GPU properties:

    -----------------------------------------------------
    ------------- Selected Platform Properties-------------:
    NAME:   AMD Accelerated Parallel Processing
    EXTENSION:      cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
    VENDOR:         Advanced Micro Devices, Inc.
    VERSION:        OpenCL 1.2 AMD-APP (937.2)
    PROFILE:        FULL_PROFILE
    -----------------------------------------------------
    ------------- Selected Device Properties-------------:
    NAME :  ATI RV730
    TYPE :  4
    VENDOR :        Advanced Micro Devices, Inc.
    PROFILE :       FULL_PROFILE
    VERSION :       OpenCL 1.0 AMD-APP (937.2)
    EXTENSIONS :    cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
    MAX_COMPUTE_UNITS :     8
    MAX_WORK_GROUP_SIZE :   128
    OPENCL_C_VERSION :      OpenCL C 1.0
    DRIVER_VERSION:         CAL 1.4.1734
    ==========================================================

Just for comparison, here are my CPU specifications:

    ------------- CPU Properties-------------:
    NAME :          Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
    TYPE :  2
    VENDOR :        GenuineIntel
    PROFILE :       FULL_PROFILE
    VERSION :       OpenCL 1.2 AMD-APP (937.2)
    MAX_COMPUTE_UNITS :     4
    MAX_WORK_GROUP_SIZE :   1024
    OPENCL_C_VERSION :      OpenCL C 1.2
    DRIVER_VERSION:         2.0 (sse2,avx)
    ==========================================================

I also measured the wall-clock time with QueryPerformanceCounter, and the results are:

    CPU time: 1304449.6  micro-sec
    GPU time: 1401740.82  micro-sec
    ----------------------
    CPU time: 1620076.55  micro-sec
    GPU time: 1310317.64  micro-sec
    ----------------------
    CPU time: 1468520.44  micro-sec
    GPU time: 1317153.63  micro-sec
    ----------------------
    CPU time: 1304367.29  micro-sec
    GPU time: 1251865.14  micro-sec
    ----------------------
    CPU time: 1301589.17  micro-sec
    GPU time: 1252889.4  micro-sec
    ----------------------
    CPU time: 1294750.21  micro-sec
    GPU time: 1257017.41  micro-sec
    ----------------------
    CPU time: 1297506.93  micro-sec
    GPU time: 1252768.9  micro-sec
    ----------------------
    CPU time: 1293511.29  micro-sec
    GPU time: 1252019.88  micro-sec
    ----------------------
    CPU time: 1320753.54  micro-sec
    GPU time: 1248895.73  micro-sec
    ----------------------
    CPU time: 1296486.95  micro-sec
    GPU time: 1255207.91  micro-sec
    ----------------------

I tried again, this time using OpenCL profiling to measure the execution time. The results for a single execution were roughly the same.
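
For reference, the per-word arithmetic of the simple_add kernel can be sketched on the CPU as below. The question does not show AddImagesCPU, so `add_packed16` and `add_images_cpu` are hypothetical names, and the sketch assumes the data is stored as two 16-bit values packed into each 32-bit word. Unsigned types are used here so that wrap-around is well defined in C++; the kernel itself uses int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds the two 16-bit lanes packed in a 32-bit word independently,
// using the same masks as the simple_add kernel. A carry out of the
// low lane is discarded instead of leaking into the high lane.
inline std::uint32_t add_packed16(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t lo = ((a & 0x0000FFFFu) + (b & 0x0000FFFFu)) & 0x0000FFFFu;
    const std::uint32_t hi = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return hi | lo;
}

// Hypothetical CPU counterpart of the kernel: one packed add per word.
inline void add_images_cpu(const std::vector<std::uint32_t>& a,
                           const std::vector<std::uint32_t>& b,
                           std::vector<std::uint32_t>& c) {
    assert(a.size() == b.size());
    c.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = add_packed16(a[i], b[i]);
    }
}
```

Each word costs two loads, one store and a handful of integer operations, which is the memory-bound profile the answers below discuss.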

2 answers:

Answer 0 (score: 1)

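The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with 1/4 of the total number of threads (that is, NumberOfAllElements / 16). This also helps each work item load from memory faster.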
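
A vectorized variant along these lines might look like the sketch below. It has not been run on an RV730; the kernel name `simple_add4` and the helper `vectorized_global_size` are hypothetical, and the buffers are assumed to hold a multiple of four 32-bit words. The kernel source is kept in a C++ string, as it would be when building the program from the host.

```cpp
#include <cassert>
#include <cstddef>

// OpenCL C source of a hypothetical int4 version of simple_add.
// Each work item handles four 32-bit words (eight 16-bit values).
static const char* kSimpleAdd4Source = R"CLC(
kernel void simple_add4(global const int4* A, global const uint4* B, global int4* C)
{
    size_t i = get_global_id(0);
    int4 a = A[i];
    int4 b = as_int4(B[i]);   // reinterpret the bits, no value conversion
    int4 lo = ((a & 0x0000FFFF) + (b & 0x0000FFFF)) & 0x0000FFFF;
    int4 hi = ((a & (int)0xFFFF0000) + (b & (int)0xFFFF0000)) & (int)0xFFFF0000;
    C[i] = hi | lo;
}
)CLC";

// Global work size for the int4 kernel: a quarter of the scalar kernel's size.
inline std::size_t vectorized_global_size(std::size_t scalar_global_size) {
    assert(scalar_global_size % 4 == 0);  // assumes no remainder to handle
    return scalar_global_size / 4;
}
```

With the question's launch of NumberOfAllElements / 4 work items, this gives NumberOfAllElements / 16. The result still has to be a multiple of the work-group size passed to enqueueNDRangeKernel (64 in the question).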

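Also, the kernel does very little computation compared with its memory operations. Mapping the buffers to RAM would give better performance. Don't copy the arrays; map them into memory with the map/unmap enqueue commands.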

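If it is still not faster, you could use the GPU and the CPU at the same time, one working on the first half and the other on the second half, to finish in about 50% of the time.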

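Also, don't put clFinish inside the loop. Put it right after the end of the loop. That way the kernels are enqueued much faster, and since execution is already in order, a kernel won't start before the previous one has finished. I assume it is an in-order queue, so calling clFinish after every enqueue is just extra overhead. A single clFinish after the last kernel is enough.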
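
The timing pattern can be sketched as follows. To keep the example independent of any OpenCL installation, `enqueue` and `finish` are placeholder callables standing in for `queue.enqueueNDRangeKernel(...)` and `queue.finish()`; `time_batch_us` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Enqueues `iterations` kernels back to back, then blocks exactly once.
// Returns the elapsed wall-clock time in microseconds.
inline double time_batch_us(int iterations,
                            const std::function<void()>& enqueue,
                            const std::function<void()>& finish) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        enqueue();   // non-blocking submit; an in-order queue keeps the order
    }
    finish();        // single synchronization point after the last enqueue
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

In the question's code this would amount to moving `queue.finish()` out of the 100-iteration loop and calling it once before `after_GPU` is taken.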

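ATI RV730: 64 VLIW units, each with at least 4 streaming cores, at 750 MHz.

i3-2100: 2 cores (the extra hardware threads only help fill pipeline bubbles), each with AVX capable of streaming 8 operations simultaneously. So it can have 16 operations in flight, at a little over 3 GHz.

Simply multiplying the number of streaming operations by the clock frequency in GHz:

ATI RV730 = 64 x 4 x 0.75 = 192 units (more with multiply-add instructions and the 5th element of each VLIW unit)

i3-2100 = 16 x 3 = 48 units

So the GPU should be at least 4x as fast (when using int4 and uint4). This holds for simple ALU and FPU operations such as bitwise operations and multiplications. The performance of special functions such as transcendentals may differ, since they run only on the 5th unit of each VLIW unit.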
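
The back-of-the-envelope estimate above can be written out as a small calculation. This is only the answer's rough model (units x lanes x clock), not a measured figure, and `relative_throughput` is a hypothetical helper name; the CPU clock is rounded down to 3 GHz as in the answer.

```cpp
#include <cassert>

// Rough peak-throughput score: execution units x lanes per unit x clock in GHz.
inline double relative_throughput(int units, int lanes_per_unit, double clock_ghz) {
    assert(units > 0 && lanes_per_unit > 0 && clock_ghz > 0.0);
    return units * lanes_per_unit * clock_ghz;
}

// ATI RV730: 64 VLIW units, 4 general-purpose lanes each, 0.75 GHz.
inline double gpu_score() { return relative_throughput(64, 4, 0.75); }

// i3-2100: 2 cores, 8 AVX lanes each, about 3 GHz.
inline double cpu_score() { return relative_throughput(2, 8, 3.0); }
```

Here `gpu_score() / cpu_score()` comes out at 4, matching the "at least 4x" figure. A memory-bound kernel like simple_add will not reach that ratio, because the model ignores memory bandwidth and transfer costs.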

Answer 1 (score: 0)

I did some extra tests and realized that the GPU is optimized for floating-point operations. I changed the test code as follows:

void kernel simple_add(global const int * A, global const uint * B, global int* C)
    {
        ///------------------------------------------------
        /// Add 16 bits of each
        int AA=A[get_global_id(0)];
        int BB=B[get_global_id(0)];
        float AH=0xFFFF0000 & AA;
        float AL=0x0000FFFF & AA;
        float BH=0xFFFF0000 & BB;
        float BL=0x0000FFFF & BB;
        int CL=(int)(AL*cos(AL)+BL*sin(BL))&0x0000FFFF;
        int CH=(int)(AH*cos(AH)+BH*sin(BH))&0xFFFF0000;
           C[get_global_id(0)]=CH|CL;               
     }

And I got the result I expected (more than 10 times faster):

                CPU time:      741046.665  micro-sec
                GPU time:       54618.889  micro-sec
                ----------------------------------------------------
                CPU time:      741788.112  micro-sec
                GPU time:       54875.666  micro-sec
                ----------------------------------------------------
                CPU time:      739975.979  micro-sec
                GPU time:       54560.445  micro-sec
                ----------------------------------------------------
                CPU time:      755848.937  micro-sec
                GPU time:       54582.111  micro-sec
                ----------------------------------------------------
                CPU time:      724100.716  micro-sec
                GPU time:       56893.445  micro-sec
                ----------------------------------------------------
                CPU time:      744476.351  micro-sec
                GPU time:       54596.778  micro-sec
                ----------------------------------------------------
                CPU time:      727787.538  micro-sec
                GPU time:       54602.445  micro-sec
                ----------------------------------------------------
                CPU time:      731132.939  micro-sec
                GPU time:       54710.000  micro-sec
                ----------------------------------------------------
                CPU time:      727899.150  micro-sec
                GPU time:       54583.444  micro-sec
                ----------------------------------------------------
                CPU time:      727089.880  micro-sec
                GPU time:       54594.778  micro-sec
                ----------------------------------------------------

With somewhat heavier floating-point work, like the following:

        void kernel simple_add(global const int * A, global const uint * B, global int* C)
            {
                ///------------------------------------------------
                /// Add 16 bits of each
                int AA=A[get_global_id(0)];
                int BB=B[get_global_id(0)];
                float AH=0xFFFF0000 & AA;
                float AL=0x0000FFFF & AA;
                float BH=0xFFFF0000 & BB;
                float BL=0x0000FFFF & BB;
                int CL=(int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL))+
                        BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL)))&0x0000FFFF;
                int CH=(int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH))+
                        BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH)))&0xFFFF0000;
                        C[get_global_id(0)]=CH|CL;

             }

the speedup was more or less the same:

                CPU time:     3905725.933  micro-sec
                GPU time:      354543.111  micro-sec
                -----------------------------------------
                CPU time:     3698211.308  micro-sec
                GPU time:      354850.333  micro-sec
                -----------------------------------------
                CPU time:     3696179.243  micro-sec
                GPU time:      354302.667  micro-sec
                -----------------------------------------
                CPU time:     3692988.914  micro-sec
                GPU time:      354764.111  micro-sec
                -----------------------------------------
                CPU time:     3699645.146  micro-sec
                GPU time:      354287.666  micro-sec
                -----------------------------------------
                CPU time:     3681591.964  micro-sec
                GPU time:      357071.889  micro-sec
                -----------------------------------------
                CPU time:     3744179.707  micro-sec
                GPU time:      354249.444  micro-sec
                -----------------------------------------
                CPU time:     3704143.214  micro-sec
                GPU time:      354934.111  micro-sec
                -----------------------------------------
                CPU time:     3667518.628  micro-sec
                GPU time:      354809.222  micro-sec
                -----------------------------------------
                CPU time:     3714312.759  micro-sec
                GPU time:      354883.888  micro-sec
                -----------------------------------------