Question

I am trying to create a neural network using CUDA:

My kernel looks like:

__global__ void feedForward(float *input, float *output, float **weight) {

//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;

//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
        * input[weightIndex];
}

When copying the output back to the host, I get the error

Error unspecified launch failure at line xx

Line xx:

CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));

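Am I doing something wrong here?

Is it because of how I use both the block index and the thread index to index into the weight matrix? Or is the problem somewhere else?

I am allocating the weight matrix as follows: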

cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);

My kernel call is:

feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);

After that I call: cudaThreadSynchronize();

I am new to CUDA programming. Any help would be appreciated.

Thanks

Answer 1

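There is a problem with the code that writes the output. Although it would not produce the error described, it will produce incorrect results.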

int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];

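We can see that all threads in a single block write to the same memory location at the same time, so the result is undefined. To avoid this, I suggest reducing all values within the block in shared memory and performing a single write to global memory. Something like this: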

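__global__ void feedForward(float *input, float *output, float **weight) {

  int weightIndex = threadIdx.x;
  int neuronIndex = blockIdx.x;
  // One shared slot per thread; assumes blockDim.x == NO_OF_WEIGHTS.
  __shared__ float out_reduce[NO_OF_WEIGHTS];

  // Each thread stores its own product, or 0 if it is out of range.
  out_reduce[weightIndex] =
     (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
       weight[neuronIndex][weightIndex] * input[weightIndex]
       : 0.0f;
  __syncthreads();

  // Tree reduction; start at half the array so weightIndex + s stays in
  // bounds. Assumes NO_OF_WEIGHTS is a power of two.
  for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
  {
    if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
    __syncthreads();
  }

  // A single thread per block adds the neuron's sum to global memory.
  if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}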

It turns out I had to rewrite half of your small kernel to add the reduction code...
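A tree reduction of this kind halves the active range at every step, so it relies on the element count being a power of two. The following is a minimal sketch of one way to lift that restriction, not the answerer's code: each thread first accumulates a strided partial sum, and only the fixed-size block is reduced. The names `BLOCK_SIZE`, `feedForwardStrided` and `runStridedDemo` are hypothetical, and the weights are assumed to be stored as one flat row-major `float` array rather than a `float **`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 128;  // hypothetical; must be a power of two

// One block per neuron. Launch with exactly BLOCK_SIZE threads per block.
__global__ void feedForwardStrided(const float *input, float *output,
                                   const float *weight, int nWeights) {
  __shared__ float partial[BLOCK_SIZE];
  int neuron = blockIdx.x;

  // Each thread sums every BLOCK_SIZE-th product, so nWeights can be any size.
  float sum = 0.0f;
  for (int w = threadIdx.x; w < nWeights; w += blockDim.x)
    sum += weight[neuron * nWeights + w] * input[w];
  partial[threadIdx.x] = sum;
  __syncthreads();

  // Tree reduction over the power-of-two block.
  for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }

  // Assign rather than accumulate, so output needs no prior initialization.
  if (threadIdx.x == 0) output[neuron] = partial[0];
}

// Host helper: returns true if every CUDA call succeeded.
bool runStridedDemo(const std::vector<float> &h_weight,
                    const std::vector<float> &h_input, int nNeurons,
                    std::vector<float> &h_output) {
  int nWeights = (int)h_input.size();
  assert(h_weight.size() == (size_t)nNeurons * nWeights);
  h_output.assign(nNeurons, 0.0f);

  float *d_w = nullptr, *d_in = nullptr, *d_out = nullptr;
  size_t wBytes = h_weight.size() * sizeof(float);
  size_t inBytes = h_input.size() * sizeof(float);
  size_t outBytes = nNeurons * sizeof(float);

  bool ok = cudaMalloc((void **)&d_w, wBytes) == cudaSuccess &&
            cudaMalloc((void **)&d_in, inBytes) == cudaSuccess &&
            cudaMalloc((void **)&d_out, outBytes) == cudaSuccess &&
            cudaMemcpy(d_w, h_weight.data(), wBytes,
                       cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(d_in, h_input.data(), inBytes,
                       cudaMemcpyHostToDevice) == cudaSuccess;
  if (ok) {
    feedForwardStrided<<<nNeurons, BLOCK_SIZE>>>(d_in, d_out, d_w, nWeights);
    ok = cudaGetLastError() == cudaSuccess &&
         cudaDeviceSynchronize() == cudaSuccess &&
         cudaMemcpy(h_output.data(), d_out, outBytes,
                    cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(d_w);
  cudaFree(d_in);
  cudaFree(d_out);
  return ok;
}
```

Calling `runStridedDemo` with, say, 3 neurons and 100 weights exercises the non-power-of-two case. The trade-off is that the block size is fixed at compile time instead of following the number of weights.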

Answer 2

I built a very simple MLP network using CUDA. If you are interested, you can find my code here: https://github.com/PirosB3/CudaNeuralNetworks/ If you have any questions, shoot!

Daniel

Answer 3

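You are using cudaMallocPitch, but you don't show how the variables are initialized; I'd be willing to bet that this is where your error comes from. cudaMallocPitch is quite tricky: the 3rd parameter should be in bytes, whereas the 4th parameter is not. I.e.: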

int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

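Is your variable input_size in bytes? If not, you may be allocating too little memory (i.e. you think you are requesting 64 elements, but you are getting 64 bytes instead), so you will be accessing out-of-bounds memory in the kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault.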
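To make the byte-versus-element distinction concrete, here is a minimal sketch of a pitched allocation and a kernel that reads it. It is an illustration under assumptions, not a diagnosis of the original program: `N_NEURONS`, `N_WEIGHTS`, `feedForwardPitched` and `runPitchedDemo` are hypothetical names. It also relies on the fact that cudaMallocPitch returns a single pointer to one padded buffer rather than an array of row pointers, so each row is located by stepping `pitch` bytes from the base address instead of indexing through a `float **`. The error checks straight after the launch are there because kernel launches are asynchronous, so a kernel fault is otherwise reported by a later call such as cudaMemcpy.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this sketch.
constexpr int N_NEURONS = 4;
constexpr int N_WEIGHTS = 8;

// One thread per neuron; each thread walks its own row of the pitched matrix.
__global__ void feedForwardPitched(const float *input, float *output,
                                   const float *weight, size_t pitch) {
  int neuron = blockIdx.x * blockDim.x + threadIdx.x;
  if (neuron >= N_NEURONS) return;

  // Rows are `pitch` bytes apart, so do the row arithmetic on a byte pointer.
  const float *row = (const float *)((const char *)weight + neuron * pitch);
  float sum = 0.0f;
  for (int w = 0; w < N_WEIGHTS; ++w) sum += row[w] * input[w];
  output[neuron] = sum;
}

// Host helper: returns true if every CUDA call succeeded.
bool runPitchedDemo(const std::vector<float> &h_weight,
                    const std::vector<float> &h_input,
                    std::vector<float> &h_output) {
  assert(h_weight.size() == (size_t)N_NEURONS * N_WEIGHTS);
  assert(h_input.size() == (size_t)N_WEIGHTS);
  h_output.assign(N_NEURONS, 0.0f);

  float *d_w = nullptr, *d_in = nullptr, *d_out = nullptr;
  size_t pitch = 0;
  size_t rowBytes = N_WEIGHTS * sizeof(float);  // width in BYTES
  size_t outBytes = N_NEURONS * sizeof(float);

  bool ok = cudaMallocPitch((void **)&d_w, &pitch, rowBytes,
                            N_NEURONS) == cudaSuccess &&  // height in ROWS
            cudaMalloc((void **)&d_in, rowBytes) == cudaSuccess &&
            cudaMalloc((void **)&d_out, outBytes) == cudaSuccess &&
            // Copy a tightly packed host matrix into the padded device rows.
            cudaMemcpy2D(d_w, pitch, h_weight.data(), rowBytes, rowBytes,
                         N_NEURONS, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(d_in, h_input.data(), rowBytes,
                       cudaMemcpyHostToDevice) == cudaSuccess;
  if (ok) {
    feedForwardPitched<<<1, N_NEURONS>>>(d_in, d_out, d_w, pitch);
    // Check the launch and the execution before touching the results.
    ok = cudaGetLastError() == cudaSuccess &&
         cudaDeviceSynchronize() == cudaSuccess &&
         cudaMemcpy(h_output.data(), d_out, outBytes,
                    cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(d_w);
  cudaFree(d_in);
  cudaFree(d_out);
  return ok;
}
```

Here `runPitchedDemo` takes a row-major host matrix of `N_NEURONS * N_WEIGHTS` floats and an input vector of `N_WEIGHTS` floats. The kernel uses one thread per neuron purely to keep the indexing easy to read; it can be combined with a per-block reduction if more parallelism is needed.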

Implementing a neural network with CUDA
