Question

My goal is to write a custom reduction kernel that returns the argmax of each row as well as the difference between the max and the submax (the second-largest value). I am new to CUDA and am working in cupy. As a first step, I tried writing my own max kernel. Sometimes it works, but for large matrices it crashes.

import cupy as cp
import numpy as np

maxval2d = cp.RawKernel(r'''
extern "C" __global__
#define THREADS_PER_BLOCK (32*32)
void my_maxval2d(unsigned int cols, int* src, int* dst) {
    __shared__ int block_data[THREADS_PER_BLOCK];
    unsigned int row = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int col = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int threadId = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int i = row * cols + col;
    block_data[threadId] = src[i];
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int stride = blockDim.x/2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            int& a = block_data[threadId];
            const int& b = block_data[threadId + stride];
            if(b > a) {
                a = b;
            }
        }
        __syncthreads();
    }

    // write result for this block to global memory
    if (threadIdx.x == 0) {
        unsigned int left_col = row * cols + blockIdx.x;
        dst[left_col] = block_data[blockDim.x * threadIdx.y];
    }
}
''', 'my_maxval2d')

cols = 32*32
rows = 32
cp.random.seed(1)
src = cp.random.random((rows, cols))
src = (src*900 + 100).astype(cp.int32)  # make integers from 100-999
dst = cp.zeros((rows, cols))
dst = dst.astype(cp.int32)
print('baseline:', src.max(axis=1)[0])

threads = 32
remaining = cols
counter = 0
while remaining > 1:
    block_dim = (remaining//threads, rows)
    thread_dim = (threads, rows)
    print(f'loop {counter}, remaining: {remaining}, block_dim: {block_dim}, thread_dim: {thread_dim}')
    maxval2d(block_dim, thread_dim, (cols, src, dst))
    remaining //= threads
    src, dst = dst, src
    counter += 1

print('custom:', dst[0,0])

The basic outline of the kernel comes from the CUDA Webinar slides. I know this code may produce wrong results for matrices whose dimensions are not powers of 32, but for my (32, 1024) matrix I expected the following result:

baseline: 996
loop 0, remaining: 1024, block_dim: (32, 32), thread_dim: (32, 32)
loop 1, remaining: 32, block_dim: (1, 32), thread_dim: (32, 32)
custom: 996

Indeed, when I set cols = 32 and print(dst[0,0]), I get:

baseline: 994
loop 0, remaining: 32, block_dim: (1, 32), thread_dim: (32, 32)
custom: 994

But with the (32, 1024) matrix, I get:

---------------------------------------------------------------------------
CUDARuntimeError                          Traceback (most recent call last)
<ipython-input-17-858a0ab67cd5> in <module>()
     58     src, dst = dst, src
     59     counter += 1
---> 60 print('custom:', src[0,0])

cupy/core/core.pyx in cupy.core.core.ndarray.__str__()

cupy/core/core.pyx in cupy.core.core.ndarray.get()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPointer.copy_to_host()

cupy/cuda/runtime.pyx in cupy.cuda.runtime.memcpy()

cupy/cuda/runtime.pyx in cupy.cuda.runtime.check_status()

CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered

My intuition says that the kernel is going out of bounds somewhere, but I can't tell where. How can I fix this code to get the expected result?

Answer 1

As I was writing this, I realized the mistake. If

total = (block_dim[0]*block_dim[1])*(thread_dim[0]*thread_dim[1])

then total should be less than or equal to src.size. But I have 32 blocks along the y axis and 32 threads along the y axis, which produces the out-of-bounds error: the first launch starts 1024 * 1024 = 1,048,576 threads for a matrix of only 32 * 1024 = 32,768 elements. If either block_dim[1] or thread_dim[1] is set to 1, it works.
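The check described in the answer can be worked through on the host without touching the GPU. The sketch below is only an illustration: launch_report is a hypothetical helper, not part of cupy, and it assumes the kernel's indexing from the question (row = blockDim.y * blockIdx.y + threadIdx.y, and i = row * cols + col). It compares the number of threads launched with the number of elements in the array and computes the largest flat index any thread would read.

```python
from math import prod


def launch_report(block_dim, thread_dim, rows, cols):
    """Compare a launch configuration with the size of a (rows, cols) array.

    Hypothetical helper for illustration only; it mirrors the index
    arithmetic of the kernel in the question and launches nothing.
    """
    total = prod(block_dim) * prod(thread_dim)  # threads in the whole grid
    size = rows * cols                          # elements in src
    # Largest row and column that any thread computes.
    max_row = block_dim[1] * thread_dim[1] - 1
    max_col = block_dim[0] * thread_dim[0] - 1
    max_index = max_row * cols + max_col        # i = row * cols + col
    return {
        "total": total,
        "size": size,
        "max_index": max_index,
        "in_bounds": max_index < size,
    }


# Failing launch from the question: 32 blocks and 32 threads along y.
bad = launch_report((32, 32), (32, 32), rows=32, cols=1024)
# One thread along y per block, so blockIdx.y alone selects the row.
good = launch_report((32, 32), (32, 1), rows=32, cols=1024)

print(bad)
print(good)
```

With thread_dim = (32, 1) the total thread count equals src.size, which matches the answer's observation that setting one of the y dimensions to 1 makes the launch safe. A bounds guard inside the kernel would be another option, but that goes beyond what the answer describes.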

Why does my RawKernel reducer result in cudaErrorIllegalAddress?
