Is the code below correct in how it passes the random generator state by reference in the functions CalculateValue(curandState *localState) and GetExponential(curandState *localState) (CUDA toolkit 3.2, curand.lib)?
Thanks.
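#include <stdio.h>
#include <stdlib.h>
#include <curand_kernel.h>

/* Abort with a message if a CUDA runtime call fails */
#define CUDA_CALL(x) do { if ((x) != cudaSuccess) { \
    printf("Error at %s:%d\n", __FILE__, __LINE__); \
    exit(EXIT_FAILURE); } } while (0)

/* Launch configuration: not shown in the original post, placeholder values */
const int threadsPerBlock = 64;
const int totalBlocks = 64;
const int totalThreads = threadsPerBlock * totalBlocks;

__device__ double GetExponential(curandState *localState) {
    /* curand_uniform_double returns a value in (0, 1], so the log is finite */
    double u1 = curand_uniform_double(localState);
    /* The post omits the return; a unit-rate inverse transform is assumed */
    return -log(u1);
}

__device__ double CalculateValue(curandState *localState) {
    double x = GetExponential(localState);
    return x;
}

__global__ void RunMonteCarloKernel(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[i];
    results[i] = CalculateValue(&localState);
    /* Copy state back to global memory */
    state[i] = localState;
}

__global__ void setup_kernel(curandState *state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets different seed, a different sequence number, no offset */
    curand_init(i, i, 0, &state[i]);
}

int main(void) {
    double *devResults;
    curandState *devStates;
    /* Allocate space for results and prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, totalThreads * sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads * sizeof(curandState)));
    /* Setup prng states */
    setup_kernel<<<totalBlocks, threadsPerBlock>>>(devStates);
    for (int i = 0; i < 1000; i++) {
        RunMonteCarloKernel<<<totalBlocks, threadsPerBlock>>>(devStates, devResults);
    }
    CUDA_CALL(cudaFree(devStates));
    CUDA_CALL(cudaFree(devResults));
    return 0;
}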
Answer 0 (score: 3)
Are you seeing a problem? It looks fine to me.
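You may want to look at the EstimatePiInlineP sample in the MonteCarloCURAND directory of the 3.2 SDK. It uses C++-style pass-by-reference to avoid taking the address of a local variable. You do need to store the state back to memory at the end of the kernel (as your code already does).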
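Passing by C++ reference can help the compiler, because it makes clear that the function can operate on the data directly in the original registers. Taking the address of a local array on the GPU can hurt performance if the compiler cannot determine that all threads handle the pointer identically (that is, perform the same operations on it); in that case it spills the array to local memory. The code will still work, but it may be slower.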
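As a rough illustration of the pass-by-reference style described above, here is a minimal sketch. It is not the SDK sample's code: the names GetExponentialRef, CalculateValueRef, RunKernelRef, SetupKernel, the launch sizes, and the unit-rate `-log(u)` transform are all assumptions made for the example. Whether it produces different machine code than the pointer version depends on the compiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <curand_kernel.h>

// Hypothetical launch configuration, chosen only for illustration
const int kThreadsPerBlock = 64;
const int kBlocks = 4;
const int kTotalThreads = kThreadsPerBlock * kBlocks;

// Stop with a message if a CUDA runtime call fails
static void Check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// The state arrives as a C++ reference; only the curand call itself takes a pointer
__device__ double GetExponentialRef(curandState &localState) {
    // curand_uniform_double returns a value in (0, 1], so the log is finite
    double u = curand_uniform_double(&localState);
    return -log(u);  // unit-rate exponential, assumed for illustration
}

__device__ double CalculateValueRef(curandState &localState) {
    return GetExponentialRef(localState);
}

__global__ void SetupKernel(curandState *state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(i, i, 0, &state[i]);  // same seeding scheme as in the question
}

__global__ void RunKernelRef(curandState *state, double *results) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[i];           // load the state once
    results[i] = CalculateValueRef(localState);  // no '&' at the call site
    state[i] = localState;                       // store the advanced state back
}

int main() {
    curandState *devStates = NULL;
    double *devResults = NULL;
    Check(cudaMalloc((void **)&devStates, kTotalThreads * sizeof(curandState)), "cudaMalloc states");
    Check(cudaMalloc((void **)&devResults, kTotalThreads * sizeof(double)), "cudaMalloc results");

    SetupKernel<<<kBlocks, kThreadsPerBlock>>>(devStates);
    Check(cudaGetLastError(), "SetupKernel launch");
    RunKernelRef<<<kBlocks, kThreadsPerBlock>>>(devStates, devResults);
    Check(cudaGetLastError(), "RunKernelRef launch");

    // cudaMemcpy waits for the kernels above to finish before copying
    double hostResults[4];
    Check(cudaMemcpy(hostResults, devResults, sizeof(hostResults), cudaMemcpyDeviceToHost), "cudaMemcpy");
    for (int i = 0; i < 4; i++) {
        printf("results[%d] = %f\n", i, hostResults[i]);
    }

    Check(cudaFree(devStates), "cudaFree states");
    Check(cudaFree(devResults), "cudaFree results");
    return 0;
}
```

The kernel still copies the state into a local variable and writes it back at the end; the only change from the pointer version is that the helper functions take `curandState &`, so the call sites never take the address of the local copy.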