Rewriting a simple C++ code snippet as CUDA code

Time: 2012-06-16 02:37:07

Tags: parallel-processing cuda

I wrote the following simple C++ code.

#include <iostream>
#include <omp.h>

using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    cout << "Enter my Number Value" << endl;
    cin >> myNumber;

    // count every (i, j, k) in [0, 100000]^3 whose sum equals myNumber
    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
        for(int j = 0; j <= 100000; ++j)
            for(int k = 0; k <= 100000; ++k)
                if(i + j + k == myNumber)
                    ++numOfHits;

    cout << "Number of Hits: " << numOfHits << endl;

    return 0;
}


1 answer:

Answer 0: (score: 1)


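First, you need to set up MS Visual Studio with CUDA, which is easy if you follow this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/

Next, you will want to read the NVIDIA CUDA Programming Guide (a free PDF), the documentation, and the CUDA examples (which I highly recommend for learning CUDA).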


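This is a very arithmetic-heavy, data-light computation. In fact, the result can be computed fairly simply without this brute-force approach, but that is not the answer you are looking for. I suggest something like this for the kernel: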

__global__ void kernel(const int* myNumber, int* numOfHits){

    //a shared value is stored on-chip, which is beneficial since it is written to many times
    //it is shared by all threads of one block
    //shared variables cannot have an initializer, so one thread zeroes it
    __shared__ int s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();

    const int target = *myNumber;

    //threadIdx and blockIdx identify the current thread uniquely
    //we increment i and j by the number of threads in one dimension of the block (16 here)
    //times the number of blocks in that dimension, which can be quite large (but not 100,000)
    for(int i = threadIdx.x + blockIdx.x*blockDim.x; i <= 100000; i += blockDim.x*gridDim.x){
        //j restarts for every i; otherwise the inner loop would only run once per thread
        for(int j = threadIdx.y + blockIdx.y*blockDim.y; j <= 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification: k is fixed by i and j
            int k = target - i - j;
            if(0 <= k && k <= 100000){
                //atomic, otherwise the value may change during the 'read, modify, write' process
                atomicAdd(&s_hits, 1);
            }
        }
    }

    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();

    //again atomic, and only one thread per thread block adds in s_hits
    //note: int counters overflow when there are more than 2^31 - 1 hits
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}


dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
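
As for the remark that the result can be computed without brute force: one possible way, sketched here under my own assumptions, is to loop over i only and count the valid j values for each i with a little interval arithmetic. The name `countHitsLinear` and the `limit` parameter are hypothetical and not part of the question's code.

```cpp
#include <algorithm>
#include <cassert>

// Counts triples (i, j, k), each in [0, limit], with i + j + k == myNumber,
// using a single loop over i. `limit` is a hypothetical parameter; the
// question hard-codes 100000.
long long countHitsLinear(long long myNumber, long long limit) {
    assert(limit >= 0);
    long long hits = 0;
    for (long long i = 0; i <= limit; ++i) {
        const long long rest = myNumber - i;                 // value j + k must add up to
        const long long jMin = std::max(0LL, rest - limit);  // smallest j that keeps k <= limit
        const long long jMax = std::min(limit, rest);        // largest j that keeps k >= 0
        if (jMin <= jMax)
            hits += jMax - jMin + 1;                          // every j in [jMin, jMax] gives one hit
    }
    return hits;
}
```

For example, `countHitsLinear(150000, 100000)` returns 7500150001, which no longer fits in a 32-bit `int`, so a 64-bit counter is the safer choice for `numOfHits` when myNumber is near the middle of the range.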


Disclaimer: I have not tested my code, and I am not an expert, so it may be silly.