Question

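I have an OpenCL kernel that performs some computation on the matrices it is given. The kernel receives matrices of roughly 1280x720. It works flawlessly and produces the output I want. The only issue is that it uses just 9%-10% of the processing capacity of my dedicated NVIDIA graphics card. This suggests I could cut the processing time by roughly a factor of 10, but I haven't figured out how to do that. I thought I could use local memory instead of global memory, but I don't know how. What can I do to use the full processing capacity of my GPU, or is that not possible?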

Here is the kernel code (removing the if-else statement doesn't improve anything; I've already tried that):

__kernel void matrix_dot_vector(const unsigned int size, const unsigned int width, const unsigned int height, __global const float4 *frame1, __global const float4 *frame2, __global const float2 *flow, __global float4 *result)
{
    int x = get_global_id(1);
    int y = get_global_id(0);

    if(flow[x + size * y].x >= 1 || 
       flow[x + size * y].y >= 1 || 
       flow[x + size * y].x <= -1 || 
       flow[x + size * y].y <= -1)
    {
        // For Windows
        int x_coord_0 = min((int) (x + 0.5f * flow[x + size * y].x), (int)(width - 1));
        int y_coord_0 = min((int) (y + 0.5f * flow[x + size * y].y), (int)(height - 1));
        int x_coord_1 = min((int) (x - 0.5f * flow[x + size * y].x), (int)(width - 1));
        int y_coord_1 = min((int) (y - 0.5f * flow[x + size * y].y), (int)(height - 1));

        result[x + size * y] = 0.5f * frame1[x_coord_0 + size * y_coord_0] + 0.5f * frame2[x_coord_1 + size * y_coord_1];
    }
    else
    {
        result[x + size * y] = frame1[x + size * y];
    }
}
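
One thing that may be worth checking in the kernel above is how work-items map to memory. On NVIDIA hardware, work-items whose get_global_id(0) values are adjacent are grouped into the same warp, and the kernel uses get_global_id(0) as y, the row index. The plain C sketch below models only the index arithmetic `x + size * y`; it is not OpenCL code and measures nothing. The helper names `linear_index` and `neighbour_stride` and the sample position (10, 20) are made up for illustration, and whether this access pattern explains the 9%-10% utilization is an assumption that would need profiling to confirm.

```c
#include <stddef.h>

/* Linear element index the kernel computes for pixel (x, y). */
static size_t linear_index(size_t x, size_t y, size_t size) {
    return x + size * y;
}

/*
 * Distance, in elements, between the addresses touched by two work-items
 * whose get_global_id(0) values differ by 1.
 *   dim0_is_x == 0: y = get_global_id(0), x = get_global_id(1)  (as in the question)
 *   dim0_is_x != 0: x = get_global_id(0), y = get_global_id(1)  (hypothetical swap)
 */
static size_t neighbour_stride(size_t size, int dim0_is_x) {
    const size_t g0 = 10, g1 = 20; /* arbitrary work-item position (illustrative) */
    size_t a, b;
    if (dim0_is_x) {
        a = linear_index(g0, g1, size);
        b = linear_index(g0 + 1, g1, size);
    } else {
        a = linear_index(g1, g0, size);
        b = linear_index(g1, g0 + 1, size);
    }
    return b - a;
}
```

Assuming size is 1280, neighbour_stride(1280, 0) returns 1280, so neighbouring work-items read float4 values 1280 * 16 = 20480 bytes apart, while neighbour_stride(1280, 1) returns 1, which corresponds to contiguous reads. If that turns out to matter here, swapping the two get_global_id calls (and the global work size order on the host) would be the experiment to try before reaching for local memory.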

Kind regards to anyone who answers this question. Thanks!

OpenCL kernel doesn't use the full GPU power

0 answers: