CUDA pointers: calling a kernel

Time: 2014-01-13 14:52:30

Tags: c pointers cuda kernel call

Suppose I use pointers in a C function, for example:

void processCalcNorm(float* a, float* b, float* c, float* d, float* e, float* f)
{
    *a = *a + *b;
    *c = *c + *d;
    *e = *e + *f;
}

for(id = 0; id < 1000; id++)
{
    processCalcNorm(&xcord[id], &lvelox[id], &ycord[id], &lveloy[id], &zcord[id], &lveloz[id]);
}

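How should I make this call if the function is run as a CUDA kernel?

1 Answer:

Answer 0: (score: 1)

Something like this should work (coded in the browser, not tested):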

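__global__ void processCalcNorm_kernel(float* a, float* b, float* c, float* d, float* e, float* f, int len)
{
    // Global index of this thread: one thread per array element
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    // Guard: the grid may contain more threads than there are elements
    if (idx < len) {
        a[idx] = a[idx] + b[idx];
        c[idx] = c[idx] + d[idx];
        e[idx] = e[idx] + f[idx];
    }
}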

#define DATA_LEN 1000
#define nTPB 256
...
processCalcNorm_kernel<<<(DATA_LEN+nTPB-1)/nTPB, nTPB>>>(d_xcord,d_lvelox,d_ycord,d_lveloy,d_zcord,d_lveloz,DATA_LEN);
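
The grid size in the launch above is a rounded-up division, so the grid usually holds a few more threads than there are elements. A minimal sketch of that arithmetic in plain C (host side only, no CUDA calls) is given here; the helper names `blocks_needed`, `idle_threads` and `print_launch_geometry` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Ceiling division: smallest block count with blocks * tpb >= len.
   Same expression as (DATA_LEN+nTPB-1)/nTPB in the launch line. */
static int blocks_needed(int len, int tpb)
{
    assert(tpb > 0 && len >= 0);
    return (len + tpb - 1) / tpb;
}

/* Threads launched past the end of the data; these are the threads
   that the idx < len guard inside the kernel turns away. */
static int idle_threads(int len, int tpb)
{
    return blocks_needed(len, tpb) * tpb - len;
}

/* Hypothetical helper: print the launch geometry for one configuration. */
static void print_launch_geometry(int len, int tpb)
{
    printf("len=%d threads/block=%d blocks=%d idle threads=%d\n",
           len, tpb, blocks_needed(len, tpb), idle_threads(len, tpb));
}
```

For DATA_LEN 1000 and nTPB 256, `blocks_needed` gives 4 blocks, that is 1024 threads, so 24 threads fall outside the arrays. Without the `idx < len` check those threads would read and write past the end of each buffer.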

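The d_... variables are device copies of the host variables with similar names, set up with the appropriate cudaMalloc and cudaMemcpy calls, like this (using xcord as an example):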

float *d_xcord;
cudaMalloc((void **)&d_xcord, DATA_LEN*sizeof(float));
cudaMemcpy(d_xcord, xcord, DATA_LEN*sizeof(float), cudaMemcpyHostToDevice);

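(and create a similar sequence for each of the other variables)

Note that the for loop from the original C code is no longer needed: a single kernel call effectively handles every iteration of the loop on the GPU, with each thread taking one index.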
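
To make the loop-to-thread correspondence concrete, here is a rough CPU-only sketch in plain C that walks over every (block, thread) pair sequentially and runs the same body as the kernel. It is an emulation for illustration, not how the GPU actually schedules work; `kernel_body` and `emulate_launch` are hypothetical names, and the `const` qualifiers on the read-only inputs are an addition not present in the answer's kernel.

```c
#include <assert.h>

/* Work done by one logical thread: same index computation, same guard
   and same arithmetic as processCalcNorm_kernel. */
static void kernel_body(int block_idx, int block_dim, int thread_idx,
                        float *a, const float *b,
                        float *c, const float *d,
                        float *e, const float *f, int len)
{
    int idx = thread_idx + block_dim * block_idx;
    if (idx < len) {
        a[idx] = a[idx] + b[idx];
        c[idx] = c[idx] + d[idx];
        e[idx] = e[idx] + f[idx];
    }
}

/* Hypothetical CPU stand-in for the <<<grid, block>>> launch: visits
   every (block, thread) pair one after another instead of in parallel. */
static void emulate_launch(int block_dim,
                           float *a, const float *b,
                           float *c, const float *d,
                           float *e, const float *f, int len)
{
    assert(block_dim > 0);
    int grid_dim = (len + block_dim - 1) / block_dim; /* rounded-up block count */
    for (int blk = 0; blk < grid_dim; blk++) {
        for (int thr = 0; thr < block_dim; thr++) {
            kernel_body(blk, block_dim, thr, a, b, c, d, e, f, len);
        }
    }
}
```

Each index below `len` is visited exactly once, which is what the original `for(id = 0; id < 1000; id++)` loop did. Because every index touches only its own elements, the visits do not depend on each other, and that independence is what lets the GPU run them as separate threads.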