Using CUDA

Time: 2016-02-13 03:18:58

Tags: cuda

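Given an n x n matrix, I want to compute the sum of an element's neighbors (up, down, left, right) and replace the middle element with that sum, using CUDA, i.e. everything should be done in parallel.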

For example, with n = 3:

1 2 3

4 5 6

7 8 9

If we take the middle element (coordinates (1,1)), the sum of its neighbors is 2 + 4 + 6 + 8 = 20. This sum should replace the middle element:

1 2 3

4 20 6

7 8 9

Here is the code I wrote. It works for n = 3, but it does not work when n is larger (for example, n = 5). Please suggest a way to generalize it.

Please help me.

#include <stdio.h>
#include <stdlib.h>

#define N 3
#define BLOCK_DIM 3

__global__ void matrixAdd (int *a, int *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    int index = col + row * N;

    int sum = 0;
    if (row == 1 && col == 1 && col < N && row < N) {
        sum = sum + a[index - 1];
        sum = sum + a[index + 1];
        sum = sum + a[index - 3];
        sum = sum + a[index + 3];
    }
    c[index] = sum;
}

void printMatrix(int a[N][N] )
{
    for(int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            printf("%d\t", a[i][j] );
        }
        printf("\n");
    }
}

int main() {
    int a[N][N], c[N][N];
    int *dev_a, *dev_c;

    int size = N * N * sizeof(int);

    for(int i=0; i<N; i++)
        for (int j=0; j<N; j++){
            a[i][j] = rand() % 256;
        }

    printf("Matrix A\n");
    printMatrix(a);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    dim3 dimGrid((N+dimBlock.x-1)/dimBlock.x, (N+dimBlock.y-1)/dimBlock.y);

    printf("dimGrid.x = %d, dimGrid.y = %d\n", dimGrid.x, dimGrid.y);

    matrixAdd<<<dimGrid,dimBlock>>>(dev_a,dev_c);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    printf("Matrix c\n");
    printMatrix(c);

    cudaFree(dev_a);
    cudaFree(dev_c);
}

1 answer:

Answer 0 (score: 0)

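If what you want is for each output element to be the sum of its top, bottom, left and right neighbors in the input, only a few small changes are needed for the code to handle arbitrary sizes. The original handles only the element at (1,1) and hard-codes the row stride as 3. Change this: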

if (row == 1 && col == 1 && col < N && row < N) {
    sum = sum + a[index - 1];
    sum = sum + a[index + 1];
    sum = sum + a[index - 3];
    sum = sum + a[index + 3];
}
c[index] = sum;

to this:

if (row > 0 && col > 0 && col < N-1 && row < N-1 ) {  //note the change
    sum = sum + a[index - 1];
    sum = sum + a[index + 1];
    sum = sum + a[index - N];                         //note the change
    sum = sum + a[index + N];                         //note the change
    c[index] = sum;                                   //note the change
}
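
As a minimal sketch of the same change with the matrix size passed at run time rather than through the `N` macro: the names `neighborSumInterior` and `launchNeighborSum`, the parameter `n`, and the 16x16 block size are illustrative assumptions, not part of the original post.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each interior element of c receives the sum of the
// four neighbours of the matching element of a. Border elements of c, and
// threads that fall outside the matrix, are left untouched.
__global__ void neighborSumInterior(const int *a, int *c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row > 0 && col > 0 && row < n - 1 && col < n - 1) {
        int index = row * n + col;
        c[index] = a[index - 1] + a[index + 1]    // left and right
                 + a[index - n] + a[index + n];   // above and below
    }
}

// Hypothetical launcher. dev_a and dev_c must each point to n * n ints in
// device memory, with n >= 1.
cudaError_t launchNeighborSum(const int *dev_a, int *dev_c, int n) {
    dim3 block(16, 16);  // assumed block size; it need not divide n
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    neighborSumInterior<<<grid, block>>>(dev_a, dev_c, n);
    return cudaGetLastError();  // reports launch-configuration errors
}
```

Because the write now sits inside the guard, the grid can be rounded up past the matrix edge without any thread writing out of bounds. In the original kernel the unconditional `c[index] = sum;` would do exactly that once the grid is larger than the matrix, as happens with N = 5 and BLOCK_DIM = 3.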

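This excludes the border, since no operation is defined for border elements, which is consistent with the sample code. If you want the border region filled with the input data (consistent with the example given in the question text, which is not how the sample code behaves), there are several ways to do it. One is to copy the input data into the output buffer before running the modified kernel above: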

cudaMemcpy(dev_c, a, size, cudaMemcpyHostToDevice);
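
Another possible way, sketched here under assumptions, is to let the kernel copy the border elements itself so that no extra `cudaMemcpy` is needed. The names `neighborSumKeepBorder` and `runNeighborSum`, the run-time size `n`, and the 16x16 block size are illustrative and do not come from the original answer.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: interior elements get the sum of their four
// neighbours, border elements are copied unchanged from the input.
__global__ void neighborSumKeepBorder(const int *a, int *c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;  // thread lies outside the matrix

    int index = row * n + col;
    bool interior = row > 0 && col > 0 && row < n - 1 && col < n - 1;
    c[index] = interior
        ? a[index - 1] + a[index + 1] + a[index - n] + a[index + n]
        : a[index];
}

// Hypothetical host helper. h_a and h_c point to n * n ints in host memory,
// with n >= 1. Returns the first CUDA error encountered, if any.
cudaError_t runNeighborSum(const int *h_a, int *h_c, int n) {
    size_t bytes = (size_t)n * n * sizeof(int);
    int *dev_a = nullptr, *dev_c = nullptr;

    cudaError_t err = cudaMalloc(&dev_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_c, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev_a, h_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);  // assumed block size
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        neighborSumKeepBorder<<<grid, block>>>(dev_a, dev_c, n);
        err = cudaGetLastError();
    }
    // This blocking copy on the default stream waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c, dev_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);  // freeing a null pointer is a no-op
    cudaFree(dev_c);
    return err;
}
```

For the 3x3 matrix in the question, this approach should yield the matrix shown there, with 20 in the middle and the border unchanged. The trade-off is one extra branch per thread in exchange for skipping a separate copy.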