I am trying to normalize a matrix column by column on the GPU with good performance. I wrote this function:
__global__ void test(float * x) { // data of Matrix3Xf x
    // colSize (number of columns) is defined outside this snippet
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize) // stay within the bounds of x
    {
        int col = id * 3;
        float norm = 1 / norm3df(x[col], x[col + 1], x[col + 2]);
        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}
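I found the data type float3 in the Cg Toolkit, but that toolkit is no longer maintained (see https://developer.nvidia.com/cg-toolkit). Any ideas on how to make this faster? My environment is Visual Studio 2017 and CUDA 9.2.
Thanks in advance.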
Answer 0 (score: -1)
This has worked best so far:
__global__ void test( float * x, int colSize ) // data of Matrix3Xf x
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float norm = x[col] * x[col] + x[col + 1] * x[col + 1] + x[col + 2] * x[col + 2];
        norm = rsqrtf(norm);
        x[col] = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}
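For anyone who wants to try the kernel, here is a minimal sketch of a host program around it. The kernel body is the one from this answer. Everything else is my own assumption for illustration and not part of the question: the number of columns, the fill pattern, the block size of 256, and the final length check. The buffer is laid out as 3 consecutive floats per column, which is how a column-major Matrix3Xf stores its data.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the answer, repeated so that the sketch is self-contained.
__global__ void test(float* x, int colSize)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float norm = x[col] * x[col] + x[col + 1] * x[col + 1] + x[col + 2] * x[col + 2];
        norm = rsqrtf(norm);
        x[col]     = x[col] * norm;
        x[col + 1] = x[col + 1] * norm;
        x[col + 2] = x[col + 2] * norm;
    }
}

int main()
{
    const int colSize = 1 << 20;                 // assumed number of columns
    std::vector<float> host(3 * colSize);
    for (size_t i = 0; i < host.size(); ++i)
        host[i] = static_cast<float>(i % 7 + 1); // arbitrary non-zero test data

    const size_t bytes = host.size() * sizeof(float);
    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // One thread per column; round the block count up so every column is covered.
    const int threads = 256;                     // assumed block size, not tuned
    const int blocks = (colSize + threads - 1) / threads;
    test<<<blocks, threads>>>(dev, colSize);

    // Catch launch errors first, then errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Every column should now have a length close to 1.
    float maxErr = 0.0f;
    for (int c = 0; c < colSize; ++c) {
        const float* v = &host[3 * c];
        float len = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        maxErr = std::fmax(maxErr, std::fabs(len - 1.0f));
    }
    std::printf("max |length - 1| = %g\n", maxErr);
    return 0;
}
```

Two caveats. A column that is all zeros makes rsqrtf return infinity, so that column comes out as NaN; if that can happen in your data, guard for it. The CUDA math library also provides rnorm3df, which returns the reciprocal of the 3D norm directly. I have not measured whether it is any faster than the explicit sum of squares plus rsqrtf used here.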