Question

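I want to generate a very large set of quasirandom numbers. (By "very large" I mean much larger than the maximum number of concurrent threads a current CUDA device can support, so that either each thread has to loop or the kernel has to be launched with a large grid size. I also want the low-discrepancy property of quasirandom numbers.) For pseudorandom numbers this seems simple, because each call to curand_init can take a different sequence argument.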

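To generate N quasirandom numbers, where N is greater than gridDim.x * blockDim.x, is there a more efficient solution than either of the following?

1. Run curand_init N times for N states, giving each call a unique offset in [0, N).
2. Run curand_init only gridDim.x * blockDim.x times, for that many states, but give each call an offset of, e.g., 10 * threadID if I want each thread to generate 10 numbers.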

(Ignore any overhead caused by large offsets, i.e. ignore the cost of skipahead().)

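I looked at the code in the CUDA 6.0 samples, and MC_EstimatePiInlineQ appears to do what I am looking for in two dimensions. However, when the number of points to generate exceeds gridDim.x * blockDim.x, I believe this code actually generates the same points more than once. That is a problem, because gridDim.x is not necessarily large enough to cover the problem size in this sample; it is tuned to roughly 10 blocks per multiprocessor on the device.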

The relevant code is as follows (slightly modified for brevity):

// RNG init kernel
template <typename rngState_t, typename rngDirectionVectors_t>
__global__ void initRNG(rngState_t *const rngStates,
                        rngDirectionVectors_t *const rngDirections)
{
    // Determine thread ID
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int step = gridDim.x * blockDim.x;

    // Initialise the RNG
    curand_init(rngDirections[0], tid, &rngStates[tid]);
    curand_init(rngDirections[1], tid, &rngStates[tid + step]);
}

and

// Estimator kernel
template <typename Real, typename rngState_t>
__global__ void computeValue(unsigned int *const results,
                             rngState_t *const rngStates,
                             const unsigned int numSims)
{
    // Determine thread ID
    unsigned int bid = blockIdx.x;
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int step = gridDim.x * blockDim.x;

    // Initialise the RNG
    rngState_t localState1 = rngStates[tid];
    rngState_t localState2 = rngStates[tid + step];

    // Count the number of points which lie inside the unit quarter-circle
    unsigned int pointsInside = 0;

    for (unsigned int i = tid ; i < numSims ; i += step)
    {
        Real x = curand_uniform(&localState1);
        Real y = curand_uniform(&localState2);

        // Do something.
    }

    // Do some more.
}

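Assuming gridDim.x * blockDim.x < N, at least the thread with tid = 0 will go through the for loop twice. On its second iteration it generates the second random number relative to its initialisation offset of 0; that is the same as the first random number relative to an initialisation offset of 1, which is exactly what tid = 1 produced on its first iteration. So this point has already been generated! The same holds for every thread except the one with the highest tid (i.e. gridDim.x * blockDim.x - 1), if that thread loops more than once at all. At best this is wasted work, and for my use case it would be harmful.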

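I created a stripped-down version of the sample mentioned above, modelling a hypothetical device where we have only 4 threads per block and only 2 blocks, but want to generate 16 points. Note that lines 9-15 of the output are identical to lines 2-8. Only line 16 is a new point.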
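The overlap described above can be reproduced on the host without a GPU. The following is only a sketch of the index arithmetic, not the stripped-down CUDA program itself: it assumes that a quasirandom state initialised with offset o returns sequence elements o, o+1, o+2, ... on successive draws, and the function names are made up for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Host-side model only: no cuRAND calls are made here.
// Assumption: a state initialised with offset o yields sequence
// elements o, o+1, o+2, ... on successive draws.

// Returns the sequence indices consumed by the sample's scheme
// (initialisation offset = tid, grid-stride loop over numSims),
// listed pass by pass and, within a pass, thread by thread.
std::vector<unsigned> sampleSchemeIndices(unsigned numThreads, unsigned numSims)
{
    std::vector<unsigned> indices;
    for (unsigned pass = 0; pass * numThreads < numSims; ++pass) {
        for (unsigned tid = 0; tid < numThreads; ++tid) {
            // Mirrors the loop condition i < numSims with i = tid + pass * step.
            if (tid + pass * numThreads >= numSims) break;
            // The pass-th draw of a state that started at offset tid.
            indices.push_back(tid + pass);
        }
    }
    return indices;
}

// Number of distinct sequence elements among those consumed.
std::size_t countDistinct(const std::vector<unsigned>& indices)
{
    return std::set<unsigned>(indices.begin(), indices.end()).size();
}
```

Under that assumption, calling sampleSchemeIndices(8, 16) for the 2 blocks of 4 threads lists 16 indices of which only 9 are distinct, matching the repeated output lines noted above.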

Answer 1

This was just a case of reading the docs, but in practice I found that limiting the number of states you generate really is much faster.

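This corresponds to option 2 in the question: the offset each thread passes to curand_init should be n * tid, where n is at least as large as the number of random numbers you want each thread to generate. If n is not known when the states are generated, you can instead call skipahead(n * tid, &state) before calling curand, curand_uniform, etc.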
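To see why an offset of n * tid avoids duplicates, here is a small host-side sketch of the index arithmetic. It makes no cuRAND calls; it assumes that a state initialised with offset o returns sequence elements o, o+1, o+2, ... on successive draws, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Host-side model only: no cuRAND calls are made here.
// Assumption: a state initialised with offset o yields sequence
// elements o, o+1, o+2, ... on successive draws.

// Each thread starts at offset perThread * tid and draws perThread
// values in a row. Returns the sequence indices consumed, thread by thread.
std::vector<unsigned> stridedOffsetIndices(unsigned numThreads, unsigned perThread)
{
    std::vector<unsigned> indices;
    for (unsigned tid = 0; tid < numThreads; ++tid) {
        const unsigned offset = perThread * tid; // models the offset given at initialisation
        for (unsigned k = 0; k < perThread; ++k)
            indices.push_back(offset + k);
    }
    return indices;
}

// True if the indices are exactly 0 .. size-1 with no repeats.
bool coversPrefixExactlyOnce(const std::vector<unsigned>& indices)
{
    std::set<unsigned> seen(indices.begin(), indices.end());
    return seen.size() == indices.size() &&
           (indices.empty() || *seen.rbegin() == indices.size() - 1);
}
```

With 8 threads and 2 draws per thread, the model consumes indices 0 to 15 exactly once each, so no point is generated twice and no element of the sequence is skipped.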

CURAND - Generating multiple quasirandom numbers per thread
