Question

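I was getting very frustrated with CUDA, so I decided to write the simplest piece of code I could, just to get my bearings. But something still seems to be going over my head. In my code, I just add two arrays and store the result in a third array, like this: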

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int* these, int* those, int* answers)
{
    int tid = blockIdx.x;
    answers[tid] = these[tid] + those[tid];
}

int main()
{
    int these[50];
    int those[50];
    int answers[50];

    int *devthese;
    int *devthose;
    int *devanswers;

    cudaMalloc((void**)&devthese, 50 * sizeof(int));
    cudaMalloc((void**)&devthose, 50 * sizeof(int));
    cudaMalloc((void**)&devanswers, 50 * sizeof(int));


    int i;
    for(i = 0; i < 50; i++)
    {
        these[i] = i;
        those[i] = 2 * i;
    }

    cudaMemcpy(devthese, these, 50 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(devthose, those, 50 * sizeof(int), cudaMemcpyHostToDevice);
    add<<<50,1>>>(devthese, devthose, devanswers);

    cudaMemcpy(answers, devanswers, 50 * sizeof(int), cudaMemcpyDeviceToHost);
    for(i = 0; i < 50; i++)
    {
        fprintf(stderr,"%i\n",answers[i]);
    }
    return 0;
}

However, the int values being printed do not follow the sequence of multiples of 3 that I was expecting. Can anyone explain what is going wrong?

Answer 1

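From the comments, it is apparent that the problem was caused by compiling for an incorrect target architecture, which produced an executable that could not run on the OP's GPU.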
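The code in the question never checks the return status of any CUDA call, so a failure like this goes unnoticed and the host simply prints whatever happened to be in `answers`. Below is a minimal sketch of the same addition with error checking added. The helper `ok`, the wrapper `add_checked`, and the constant `N` are names made up for this illustration and are not from the original post. With a mismatched architecture, the failure would typically show up at the kernel launch check.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 50  // same array length as in the question

// Hypothetical helper: print a message if a CUDA call failed, return true on success.
static bool ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void add(const int* these, const int* those, int* answers)
{
    int tid = blockIdx.x;  // one block per element, as in the question
    answers[tid] = these[tid] + those[tid];
}

// Hypothetical wrapper: fills answers[i] with i + 2*i.
// Returns 0 on success and 1 if any CUDA call reported an error.
int add_checked(int* answers)
{
    int these[N], those[N];
    for (int i = 0; i < N; i++) {
        these[i] = i;
        those[i] = 2 * i;
    }

    // Report the compute capability so it can be compared with the build target.
    cudaDeviceProp prop;
    if (ok(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties"))
        fprintf(stderr, "device 0: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);

    int *d_these = NULL, *d_those = NULL, *d_answers = NULL;
    size_t bytes = N * sizeof(int);

    // && short-circuits, so nothing further runs after the first failure.
    bool good =
        ok(cudaMalloc((void**)&d_these, bytes), "cudaMalloc(these)") &&
        ok(cudaMalloc((void**)&d_those, bytes), "cudaMalloc(those)") &&
        ok(cudaMalloc((void**)&d_answers, bytes), "cudaMalloc(answers)") &&
        ok(cudaMemcpy(d_these, these, bytes, cudaMemcpyHostToDevice), "copy these") &&
        ok(cudaMemcpy(d_those, those, bytes, cudaMemcpyHostToDevice), "copy those");

    if (good) {
        add<<<N, 1>>>(d_these, d_those, d_answers);
        good = ok(cudaGetLastError(), "kernel launch") &&          // launch errors
               ok(cudaDeviceSynchronize(), "kernel execution") &&  // errors while running
               ok(cudaMemcpy(answers, d_answers, bytes, cudaMemcpyDeviceToHost),
                  "copy answers");
    }

    // cudaFree on a null pointer performs no operation, so this is safe on any path.
    cudaFree(d_these);
    cudaFree(d_those);
    cudaFree(d_answers);
    return good ? 0 : 1;
}
```

Calling `add_checked(answers)` from `main` and printing the array only when it returns 0 avoids printing uninitialized values. If the launch check fails, rebuilding with an `nvcc -arch=sm_XX` value that matches the reported compute capability (for example `sm_35` for capability 3.5) is the likely fix.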

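This community wiki answer has been added to get the question off the unanswered queue. If the OP returns, it can be deleted and replaced with a more comprehensive answer.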

Simple CUDA kernel not returning expected values
