Question

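I am trying to use dynamic parallelism in CUDA. In my case the parent kernel has a variable that needs to be passed to the child kernel for further computation. I have gone through the web resource here

and it says that local variables cannot be passed to the child kernel, and it describes the ways to pass a variable. I tried to pass the variable as follows: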

#include <stdio.h>
#include <cuda.h>


__global__ void square(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if(N==10)
  {
  a[idx] = a[idx] * a[idx];
  }
}
// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int n=N; // this value of n can be changed locally and need to be passed
  printf("%d\n",n);
  cudaMalloc((void **) &n, sizeof(int));

  square <<< 1, N >>> (arr, n);

}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:

  first <<< 1, 1 >>> (a_d, N);
  //cudaThreadSynchronize();
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
}

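But the value is not passed from the parent kernel to the child kernel. How can I pass the value of the local variable? Is there any way to do this?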

Answer 1

This operation is not valid:

int n=N; // this value of n can be changed locally and need to be passed

cudaMalloc((void **) &n, sizeof(int)); // illegal

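It is not valid in host code, and it is not valid in device code either. n is an int variable; you are not supposed to assign a pointer to it. When you attempt this in a 64-bit environment, you are trying to write a 64-bit pointer on top of a 32-bit int. It won't work.

It is not clear why you would need it anyway. n is an integer parameter, presumably specifying the size of your arr array of float. You don't need to allocate anything on top of it.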

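If you had run this code with cuda-memcheck, you could easily have discovered the error. You can also do proper cuda error checking in device code in exactly the same way as you do in host code.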
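As a rough illustration of device-side error checking, the sketch below has a parent kernel query `cudaGetLastError()` right after launching a child and hand the status back to the host. The kernel names `child` and `parent`, the `status` output pointer, and the host helper `deviceLaunchStatus` are hypothetical names introduced only for this example; they are not part of the question's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: writes its index so the launch has something to do.
__global__ void child(int *out, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) out[idx] = idx;
}

// Parent kernel: launches the child, then checks the launch status
// with the same runtime call that host code would use.
__global__ void parent(int *out, int n, int threads, int *status)
{
  child<<<1, threads>>>(out, n);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("child launch failed: %s\n", cudaGetErrorString(err));
  *status = (int)err;  // report the device-side status to the host
}

// Hypothetical host helper: returns the status seen by the parent kernel,
// or -1 if something failed on the host side.
int deviceLaunchStatus(int threads)
{
  const int n = 8;
  int *out_d = NULL, *status_d = NULL;
  int status_h = -1;
  if (cudaMalloc((void **)&out_d, n * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMalloc((void **)&status_d, sizeof(int)) != cudaSuccess) {
    cudaFree(out_d);
    return -1;
  }
  parent<<<1, 1>>>(out_d, n, threads, status_d);
  if (cudaDeviceSynchronize() == cudaSuccess)
    cudaMemcpy(&status_h, status_d, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(out_d);
  cudaFree(status_d);
  return status_h;
}
```

Calling `deviceLaunchStatus(8)` from `main` should return 0 (`cudaSuccess`) when the child launch is accepted. Like any dynamic parallelism code, this needs a device of compute capability 3.5 or higher, relocatable device code (`-rdc=true`), and linking against `cudadevrt`.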

When I comment out the cudaMalloc line in the first kernel, your code runs correctly for me.
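A minimal sketch of that fix, assuming the goal is simply to hand the current value of `n` to the child: kernel arguments are passed by value, so a local `int` can go straight into the child's launch. What must not be passed is a pointer to the parent's local memory. The host helper `squareOnDevice` is a hypothetical name added for this example, and the child uses a bounds check (`idx < n`) in place of the question's `N==10` test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: receives n by value, like any other kernel argument.
__global__ void square(float *a, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * a[idx];
}

// Parent kernel: n is a local variable. Its value is copied into the
// child's argument list at launch, so no cudaMalloc is needed for it.
__global__ void first(float *arr, int N)
{
  int n = N;  // may be changed locally before the launch
  square<<<1, n>>>(arr, n);
}

// Hypothetical host helper: squares N floats in place via the parent kernel.
// Returns 0 on success, -1 on any CUDA error.
int squareOnDevice(float *a_h, int N)
{
  float *a_d = NULL;
  size_t size = N * sizeof(float);
  if (cudaMalloc((void **)&a_d, size) != cudaSuccess) return -1;
  cudaError_t err = cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    first<<<1, 1>>>(a_d, N);
    err = cudaDeviceSynchronize();  // also waits for the child grid
  }
  if (err == cudaSuccess)
    err = cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  return err == cudaSuccess ? 0 : -1;
}
```

If the child needs to see data that the parent produces beyond a single value, that data should live in global memory (for example a buffer allocated by the host and passed in as a pointer), since the parent's local memory is not visible to the child.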
