Question

I am trying to normalize a matrix column by column on the GPU with good performance. I wrote this function:

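// colSize (the number of columns) is assumed to be defined elsewhere,
// e.g. as a macro or a __constant__ variable; it is not shown in the post.
__global__ void test(float * x)      // data of Matrix3Xf x
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        // norm holds the reciprocal of the column's Euclidean length
        float norm = 1 / norm3df(x[col], x[col + 1], x[col + 2]);

        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}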

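I found the data type float3 in the Cg Toolkit, but it is no longer up to date (see https://developer.nvidia.com/cg-toolkit)... Any ideas on how to make this faster? My environment is Visual Studio 2017 and CUDA 9.2.

Thanks in advance.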

Answer 1

Best result so far:

__global__ void test( float * x, int colSize )     // data of Matrix3Xf x
{      
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float norm = x[col] * x[col] + x[col + 1] * x[col + 1] + x[col + 2] * x[col + 2];
        norm = rsqrtf(norm);

        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}
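
For readers who want to try the kernel above, here is a minimal host-side sketch of how it might be launched. The helper name normalizeColumnsOnGpu, the checkCuda function, and the block size of 256 threads are assumptions made for illustration and are not part of the original post; the input is assumed to hold 3 * colSize floats in column-major order, as Matrix3Xf::data() would provide.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Kernel from the answer above, repeated so that the sketch compiles on its own.
__global__ void test(float* x, int colSize)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float norm = x[col] * x[col] + x[col + 1] * x[col + 1] + x[col + 2] * x[col + 2];
        norm = rsqrtf(norm); // reciprocal square root of the squared length

        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}

// Hypothetical helper: turns a CUDA error code into a C++ exception.
static void checkCuda(cudaError_t err)
{
    if (err != cudaSuccess)
        throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: normalizes every 3-element column of `data` in place.
// Error paths do not free the device buffer; this keeps the sketch short.
void normalizeColumnsOnGpu(std::vector<float>& data)
{
    if (data.size() % 3 != 0)
        throw std::invalid_argument("expected 3 floats per column");
    if (data.empty())
        return; // a launch with zero blocks would be an invalid configuration

    const int colSize = static_cast<int>(data.size() / 3);
    const size_t bytes = data.size() * sizeof(float);

    float* d_x = nullptr;
    checkCuda(cudaMalloc(&d_x, bytes));
    checkCuda(cudaMemcpy(d_x, data.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;                              // assumed block size
    const int blocks = (colSize + threads - 1) / threads; // one thread per column, rounded up
    test<<<blocks, threads>>>(d_x, colSize);
    checkCuda(cudaGetLastError());                        // reports launch errors

    // A blocking cudaMemcpy waits for the kernel to finish before copying back.
    checkCuda(cudaMemcpy(data.data(), d_x, bytes, cudaMemcpyDeviceToHost));
    checkCuda(cudaFree(d_x));
}
```

Calling normalizeColumnsOnGpu on a vector holding the column (3, 0, 4) should leave approximately (0.6, 0, 0.8) in its place. Note that a zero-length column yields non-finite values, because rsqrtf(0) is infinite; the original kernel does not guard against this case and neither does this sketch.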
