Question

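I am trying to learn CUDA and am working on a simple program. The program looks at a pre-populated array filled with the values 0, 1, 2 and counts, in a shared array, how often each pair of adjacent values occurs (i.e., how many 00, 01, 02, 10, 11, 12, 20, 21, 22 combinations there are). Unfortunately, it seems to count only 1 for each combination and then stop.

The pre-populated array holds (0,1,0,2,0,0,2,0,1,0). The expected output is (1,2,2,2,2,0,0,2,0,0); the actual output is (1,1,1,1,1,0,0,1,0,0).

int *a is the pre-populated array and int *b is the "shared" array of combination counts.

The global kernel is currently launched with a single block of 10 threads. (Later I would like to change this to multiple blocks, but I want to get the threads working first.)

Any suggestions?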

I tried using __shared__ and __syncthreads for the shared array, but my problem may need something else.

__device__ int GetIndex(int a, int b) {
    if (a == 0 && b == 0) return 0;
    if (a == 0 && b == 1) return 1;
    if (a == 0 && b == 2) return 2;
    if (a == 1 && b == 0) return 3;
    if (a == 1 && b == 1) return 4;
    if (a == 1 && b == 2) return 5;
    if (a == 2 && b == 0) return 6;
    if (a == 2 && b == 1) return 7;
    if (a == 2 && b == 2) return 8;
}

__global__ void CalculateRecurrences(int *a, int *b) {

    __shared__ int s[TOTAL_COMBINATIONS];
    int e = threadIdx.x + blockIdx.x * blockDim.x;

    for (int i = 0; i < 10; i++)
    {
        s[i] = b[i];        
    }
    __syncthreads();
    if (e < 10) {
        int index;
        int next = a[e + 1];
        printf("%i %i", a[e], next);
        index = GetIndex(a[e], next);
        s[index] += 1;
    }

    for (int i = 0; i < 10; i++)
    {
        b[i] = s[i];        
    }
    __syncthreads();
}

Thanks in advance. Please let me know if anything needs clarification.

Answer 1

There are several problems here.

Recall that every thread executes the kernel you wrote. So a snippet like this:

for (int i = 0; i < 10; i++)
{
    s[i] = b[i];        
}

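is executed by all 10 threads. Every one of the 10 threads therefore reads all 10 elements of the input array and writes them to the shared array. What a waste! You have 10 threads and 10 elements; you can tell each thread to handle just one element by replacing the for loop above with: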

if (e < 10)
    s[e] = b[e];
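
If the array ever grows beyond the number of threads in the block, the same one-element-per-thread idea can be kept by letting each thread step through the array in strides of blockDim.x. The following is only a minimal sketch of that pattern, not code from the question: the names CopyThroughShared, RunCopy and N are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 10;  // hypothetical array length, matching the 10 elements in the question

// Thread t handles elements t, t + blockDim.x, t + 2 * blockDim.x, ...
// so the block as a whole touches every element exactly once.
__global__ void CopyThroughShared(const int *in, int *out) {
    __shared__ int s[N];
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        s[i] = in[i];
    }
    __syncthreads();  // the shared array is fully populated past this point
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        out[i] = s[i];
    }
}

// Host helper (illustrative): assumes host_in holds exactly N values.
// Launches a single block with `threads` threads and returns the copied array.
std::vector<int> RunCopy(const std::vector<int> &host_in, int threads) {
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_in, host_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    CopyThroughShared<<<1, threads>>>(d_in, d_out);
    std::vector<int> host_out(N);
    cudaMemcpy(host_out.data(), d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

With 10 threads each loop body runs once per thread, which is exactly the if (e < 10) form above; with fewer threads, say RunCopy(input, 4), each thread simply makes two or three passes.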

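Likewise, all 10 threads try to execute the next block of code, and they access memory in a way that is not thread-safe. The simplest fix is to use atomicAdd instead of +=.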
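
To see why += misbehaves: s[index] += 1 is really a read, an add, and a write. When several threads hit the same slot at the same time, they can all read the old value and all write back old + 1, which is consistent with counts that get stuck at 1. Here is a minimal sketch in which every thread increments one counter both ways; the names IncrementBoth and CountWithAtomicAdd are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Every thread increments both counters.
// The plain += is a data race, so its final value is unspecified.
// atomicAdd performs the read-modify-write as one indivisible operation.
__global__ void IncrementBoth(int *racy, int *safe) {
    *racy += 1;          // racy: updates from other threads can be lost
    atomicAdd(safe, 1);  // safe: ends up equal to the number of threads
}

// Host helper (illustrative): launches one block of `threads` threads and
// returns the atomic counter. If racy_out is non-null, it receives whatever
// value the racy counter happened to end up with.
int CountWithAtomicAdd(int threads, int *racy_out) {
    int *d = nullptr;
    cudaMalloc(&d, 2 * sizeof(int));
    cudaMemset(d, 0, 2 * sizeof(int));
    IncrementBoth<<<1, threads>>>(d, d + 1);
    int h[2] = {0, 0};
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (racy_out != nullptr) {
        *racy_out = h[0];
    }
    return h[1];
}
```

Calling CountWithAtomicAdd(10, &racy) should return 10 for the atomic counter, while the value left in racy is not guaranteed to be anything in particular.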

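You also have an illegal memory access here. If a has valid indices 0-9 and e takes the values 0-9, then e+1 reaches 10 for the last thread, which is past the end of a: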

int next = a[e + 1];  // Undefined behavior!!!

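Finally, as above, you have every thread run the final loop that copies the elements of s to b. You have 10 threads and 10 elements to work with, so let each thread operate on its own index: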

b[e] = s[e];

Edit: Putting it all together, the code might look something like this (untested):

__global__ void CalculateRecurrences(int *a, int *b) {
    int e = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ int s[10];

    // Each in-range thread initializes one slot of shared memory.
    // No early return here: every thread in the block should reach __syncthreads().
    if (e < 10) {
        s[e] = 0;  // You're counting; assign 0s everywhere, we don't need array b for this
    }
    // Wait for all threads to complete initialization of shared array
    __syncthreads();

    // Each thread compares its indexed value to the value in the next index
    if (e < 9) {
        int next = a[e + 1];
        printf("%i %i\n", a[e], next);
        int index = GetIndex(a[e], next);
        // Since multiple threads may receive same index, need atomicAdd:
        atomicAdd(&s[index], 1);
    }
    // Each thread may be updating different indices than its own.
    // Thus need to wait for all threads to complete
    __syncthreads();

    // Each thread writes its indexed value to global output array
    if (e < 10) {
        b[e] = s[e];
    }
}

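Remember that all threads execute the same kernel code, so wherever possible you should map thread indices to array indices as shown above. I'll also point out that in the example above there is probably no benefit to using a shared array; you could initialize and update the b array directly, but perhaps you are just practicing with shared memory.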
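
As a rough sketch of that last suggestion, the kernel below updates the output array in global memory directly. Because atomicAdd on global memory is also safe across blocks, the same kernel can be launched with more than one block. Several things here are assumptions rather than part of the original code: the array length n is passed in as a parameter, the output has 9 entries (one per combination), the if-chain in GetIndex is replaced by the equivalent arithmetic first * 3 + second, the input is non-empty and contains only 0, 1, 2, and the names PairIndex, CountPairs and CountPairsOnGpu are made up for illustration. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kCombinations = 9;  // 3 x 3 ordered pairs of values in {0, 1, 2}

// Same mapping as the question's GetIndex: (0,0)->0, (0,1)->1, ..., (2,2)->8.
__device__ int PairIndex(int first, int second) {
    return first * 3 + second;
}

// One thread per adjacent pair (a[e], a[e + 1]).
// counts must be zeroed before the launch.
__global__ void CountPairs(const int *a, int n, int *counts) {
    int e = threadIdx.x + blockIdx.x * blockDim.x;
    if (e + 1 < n) {  // the last element has no successor
        atomicAdd(&counts[PairIndex(a[e], a[e + 1])], 1);
    }
}

// Host helper (illustrative): copies the input to the device, launches enough
// blocks to cover it, and returns the 9 pair counts.
std::vector<int> CountPairsOnGpu(const std::vector<int> &host_a, int threads_per_block) {
    int n = static_cast<int>(host_a.size());
    int *d_a = nullptr, *d_counts = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_counts, kCombinations * sizeof(int));
    cudaMemcpy(d_a, host_a.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, kCombinations * sizeof(int));

    int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up
    CountPairs<<<blocks, threads_per_block>>>(d_a, n, d_counts);

    std::vector<int> host_counts(kCombinations);
    cudaMemcpy(host_counts.data(), d_counts, kCombinations * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_counts);
    return host_counts;
}
```

For the input (0,1,0,2,0,0,2,0,1,0), counting the nine adjacent pairs by hand under this indexing gives (1,2,2,2,0,0,2,0,0), and the result does not depend on how the threads are split across blocks, e.g. CountPairsOnGpu(input, 4) uses three blocks.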
