Improving a very simple OpenCL kernel

Date: 2016-02-11 14:06:58

Tags: performance opencl

I have a test covering OpenCL tomorrow. We have some sample tests, but no solutions. Given this code:

void scalar_add(int n, double *a, double *b, double *result) {
    for (int i=0; i<n; i++)
        result[i] = a[i] + b[i];
}

The first task is to write an OpenCL kernel. My solution:

__kernel void scalar_add(
                          __global double *a,
                          __global double *b,
                          __global double *result
) {
    size_t i = get_global_id(0);  // one work-item per element
    result[i] = a[i] + b[i];
}

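For each element, I read once from a, once from b, and write once to result. I don't see how private or local memory could make this faster. The next question asks how to improve the speed with one simple change ("Welche kleine Änderung könnte auf einer Standard-Grafikkarte zu einer deutlichen Leistungssteigerung führen?", i.e. "Which small change could lead to a significant performance gain on a standard graphics card?"). Is there a way to improve the speed?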

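The kernel only reads from a and b, so maybe that can be exploited. I tried declaring the parameters a and b as "__local", but that does not compile or run.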

2 answers:

Answer 0: (score: 0)

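The code below should speed things up for you a bit. You need to launch it with a single work-group. A group size that is a multiple of 32 to 64 is usually a good choice. Trial and error is a very good way to find the sweet spot for the group size on your hardware. I am assuming you are using a GPU for this kernel.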

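I also added maxSize as the upper bound of the loop counter; it should equal the length of the vectors. You can pass it in as a parameter or hard-code it as a constant, as needed.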

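__kernel void scalar_add(
                          __global double *a,
                          __global double *b,
                          __global double *result,
                          const uint maxSize  // vector length; could also be a compile-time constant
) 
{
    size_t gid = get_global_id(0);
    // stride = total number of work-items (the group size when a single work-group is launched)
    size_t groupSize = get_global_size(0);
    for (size_t i = gid; i < maxSize; i += groupSize) {
        result[i] = a[i] + b[i];
    }
}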

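The difference is memory coalescing. When each work-item reads its next element, adjacent work-items read adjacent addresses, and the hardware combines those reads at a low level. Each read should be able to fetch 32 bytes or more, so you get 4 double values for the price of 1.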
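To make the access pattern concrete, here is a minimal CPU-side sketch in plain C (not OpenCL) that emulates the strided loop of the kernel above. The names `strided_index` and `scalar_add_strided` are made up for illustration, and the sequential outer loop only stands in for work-items that would run in parallel on a GPU. It assumes the stride equals the total number of work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Index touched by work-item `gid` on loop iteration `iter`
 * when the stride equals the total number of work-items. */
size_t strided_index(size_t gid, size_t global_size, size_t iter)
{
    return gid + iter * global_size;
}

/* CPU emulation of the strided kernel: every emulated work-item
 * runs the same loop, starting at its own id. */
void scalar_add_strided(size_t global_size, size_t max_size,
                        const double *a, const double *b, double *result)
{
    assert(global_size > 0);
    for (size_t gid = 0; gid < global_size; gid++) {
        /* The max_size bound keeps the last iteration in range. */
        for (size_t i = gid; i < max_size; i += global_size) {
            result[i] = a[i] + b[i];
        }
    }
}
```

On a given iteration, work-items 0, 1, 2, ... touch indices that are neighbours in memory, which is the pattern the hardware can coalesce. The `i < max_size` bound matters when the vector length is not a multiple of the stride.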

Answer 1: (score: 0)

The one noticeable latency here is kernel launch overhead. A separate kernel execution just for a '+' is overkill (though still faster) unless the data is already on the GPU (that is, not fed from the CPU to the GPU, which leaves much headroom to optimize).

If the data is on the GPU, you can simply move this addition into another kernel and save somewhere between hundreds of microseconds and several milliseconds. Of course, this only applies when the other kernel's work-items are completely independent of their neighbours' results, since there is no synchronization across the whole array of items (synchronization is only available within a compute unit, i.e. a work-group, not across all items).

If consecutive kernels seem to depend on each other, try to decompose one of them into two lighter-weight parts: one part must be independent of this addition, and the other must be independent of a third kernel. Merging each part into its neighbouring kernel should then give two kernel executions instead of three. Such a decomposition could look like:

__kernel void scalar_add(
                          __global double *a,
                          __global double *b,
                          __global double *otherParameters
) 
{
    size_t i = get_global_id(0);
    double result = a[i] + b[i];  // kept in a private variable, never written to global memory
    // other computations that are independent of the neighbours' results,
    // or where a and b do not depend on anything from neighbouring items
    otherParameters[i] = sin(result);  // ok: uses only this work-item's own sum
    // otherParameters[i] = cos(sum of item i+1);  // not ok: a neighbour's sum is not visible here
}

If you can do this, you get rid of the unnecessary global memory accesses to the result array.
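As a rough illustration of what is saved, the following plain C sketch (CPU code, not OpenCL) contrasts a two-pass version with a fused one. The function names are hypothetical, and squaring is used only as a stand-in for whatever independent per-element computation (such as the sin in the kernel above) follows the addition.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes, like two kernels: the sum makes a round trip through tmp
 * (one extra write and one extra read per element). */
void add_then_square(size_t n, const double *a, const double *b,
                     double *tmp, double *out)
{
    assert(n == 0 || (a && b && tmp && out));
    for (size_t i = 0; i < n; i++)
        tmp[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] * tmp[i];
}

/* One pass, like the fused kernel: the sum lives in a local variable only. */
void add_square_fused(size_t n, const double *a, const double *b, double *out)
{
    assert(n == 0 || (a && b && out));
    for (size_t i = 0; i < n; i++) {
        double sum = a[i] + b[i];
        out[i] = sum * sum;
    }
}
```

Both functions produce the same output, but the fused one never touches an intermediate array, which on a GPU corresponds to skipping the global memory traffic for result and one kernel launch.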