Question

My kernel code:

__kernel void OUT__1__1527__(__constant float *A, __constant float *B, __global float *res)
{
    int i = get_global_id(0);
    float C = 0;
    if (i <= 5 - 1) {
        C += (A[i] * B[i]);
        *res = C;
    }
}

The values of A and B are both {1,2,3,4,5}. With this kernel I get the result 25, which is 5 * 5; I expect the result to be 55 (1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 + 5 * 5).

What code do I need to insert to synchronize, and where should it go?

Answer 1

There is no magic fix for this. It is a classic reduction problem (you want to combine multiple results into a single variable).

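If this is not the bottleneck of your algorithm (that is, your kernel does other, more expensive work elsewhere), you can use atomics (but not with floating-point values). But if this is the core of the kernel, you should change your algorithm completely.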

You can start by reading: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
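To make the reduction idea concrete, here is a minimal sketch in plain C that models what a single work-group would do. It is not taken from the linked article and is not an OpenCL kernel: the name `dot_tree_reduction` and the size `WG_SIZE` are made up for illustration, each loop over `lid` stands in for all work-items executing one step, and the comments mark where a real kernel would need `barrier(CLK_LOCAL_MEM_FENCE)`.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 8  /* hypothetical work-group size; must be a power of two */

/* CPU model of one work-group computing dot(a, b) for n <= WG_SIZE elements.
   Each `for (lid ...)` loop plays the role of all work-items running one step. */
float dot_tree_reduction(const float *a, const float *b, size_t n)
{
    float scratch[WG_SIZE];  /* stands in for a __local buffer */
    assert(n <= WG_SIZE);

    /* Step 1: every work-item stores its own product; out-of-range items store 0. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        scratch[lid] = (lid < n) ? a[lid] * b[lid] : 0.0f;
    /* a real kernel needs barrier(CLK_LOCAL_MEM_FENCE) here */

    /* Step 2: halve the number of active work-items until one sum remains. */
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* a real kernel needs barrier(CLK_LOCAL_MEM_FENCE) here */
    }

    return scratch[0];  /* in a kernel, only work-item 0 would write this to *res */
}
```

With the question's inputs, `dot_tree_reduction(A, B, 5)` adds up all five products instead of keeping only one of them. In a real kernel, inputs larger than one work-group would need one partial sum per group plus a second pass (or a host-side sum) to combine them.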

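Your code is also wrong: it never actually "merges" any data into res. It only sets res to the value of C, and C is private to each thread, so nothing is ever summed. Only the last thread actually wins the data race, which gives the answer 25.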
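The overwrite can be reproduced on the CPU with a small model in plain C. This is only a sketch: `work_item` and `run_in_order` are hypothetical names, and the sequential loop merely imitates one possible order in which the work-items might finish.

```c
#include <assert.h>
#include <stddef.h>

/* CPU model of the question's kernel body for a single work-item i. */
static void work_item(size_t i, const float *a, const float *b, float *res)
{
    float c = 0.0f;  /* private: every work-item starts from its own zero */
    c += a[i] * b[i];
    *res = c;        /* overwrites *res, never accumulates into it */
}

/* Runs n work-items in the given order and returns what is left in res. */
float run_in_order(const size_t *order, size_t n, const float *a, const float *b)
{
    float res = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        assert(order[k] < n);
        work_item(order[k], a, b, &res);
    }
    return res;
}
```

Running the work-items in the order 0, 1, 2, 3, 4 leaves 25 in `res`; running them in reverse leaves 1. The result depends only on which work-item writes last, never on the sum.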

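There is a trick, which I do not recommend, for using atomics on floats. It is based on unions and on repeated reads and writes of global memory: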

inline void AtomicAdd(volatile __global float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}

__kernel void OUT__1__1527__(__constant float *A, __constant float *B, __global float *res)
{
    int i = get_global_id(0);
    if (i <= 5 - 1) {
        /* *res must already hold 0 (set by the host) before the kernel runs. */
        AtomicAdd(res, (A[i] * B[i]));
    }
}

OpenCL dot product: here I am trying to store the result in a local variable, which is reset to zero every time
