Question

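I want to write a kernel in CUDA that generates the Halton sequence in parallel, with each thread generating and storing one value.

Looking at the sequence, it seems that generating each subsequent value involves the work done to generate the previous value. Generating each value from scratch would involve redundant work and would also cause a large gap between the execution times of the threads.

Is there a way to do this with a parallel kernel that improves on the serial algorithm? I am very new to parallel programming, so pardon the question if the answer is some well-known pattern.

Note: I did find this link in a textbook (which uses it without describing how it works), but the file link there is dead.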

Answer 1

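The Halton sequence is generated by:

1. obtaining the representation of i in the base-p number system;
2. reversing the order of its bits (mirroring them around the radix point).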

For example, the base-2 Halton sequence:

index      binary     reversed     result
1             1           1           1 /   10 = 1 / 2
2            10          01          01 /  100 = 1 / 4
3            11          11          11 /  100 = 3 / 4
4           100         001         001 / 1000 = 1 / 8
5           101         101         101 / 1000 = 5 / 8
6           110         011         011 / 1000 = 3 / 8
7           111         111         111 / 1000 = 7 / 8

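So there is indeed a lot of repeated work in the bit-by-bit reversal. The first thing we can do is reuse previous results.

When computing the element with index i in the base-p Halton sequence, we first determine the leading bit and the remaining part of the base-p representation of i (this can be done by scheduling the threads in a base-p fashion). Then we have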

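out[i] = out[remaining_part] + leading_bit / p^(length_of_i_in_base_p_representation)
//"^" denotes exponentiation here, used for convenience; out[0] = 0
//e.g. base 2, i = 5 = 101: out[5] = out[1] + 1 / 2^3 = 1/2 + 1/8 = 5/8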

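To avoid unnecessary global memory reads, each thread should handle all elements that share the same "remaining part" but have different "leading bits". If we are generating the part of the Halton sequence between p^n and p^(n+1), there are conceptually p^n parallel tasks. However, it is not a problem if we assign a group of tasks to a single thread.

This can be optimized further by mixing recomputation with loading from memory.

Example code:

The total number of threads should be p^m.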

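const int m = 3; //any value

//integer power p^e; "^" in C is XOR, not exponentiation
__device__ int ipow(int p, int e)
{
    int r = 1;
    while (e-- > 0) r *= p;
    return r;
}

//out[0 .. p^m - 1] must already hold the first p^m elements
__device__ void halton(float* out, int p, int N)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x; //globally unique and continuous thread id, assuming a 1D launch
    const int step = ipow(p, m); //total number of threads
    int w = step; //w is the weight of the leading bit
    for(int n = m; n <= N; ++n) //n is the position of the leading bit
    {
        const float scale = 1.0f / ((float)w * p); //a bit at position n contributes l / p^(n+1)
        for(int r = tid; r < w; r += step) //r is the remaining part
        for(int l = 1; l < p; ++l) //l is the leading bit
            out[l*w + r] = out[r] + l * scale;
        w *= p;
    }
}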

Note: this example does not compute the first p^m elements of the Halton sequence, but those values are still required.

CUDA - Generating the Halton sequence in parallel
