Question

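I am new to CUDA programming. I am trying out RGB to greyscale conversion, but I can't figure out how to choose the block size and grid size. I came across this piece of code and it executes properly, but I can't understand how gridSize is chosen. I am using a Tegra TK1 GPU, which has: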

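1 MP, 192 CUDA cores / MP.
Maximum number of threads per block = 1024.
Maximum number of resident warps per MP = 64.
Maximum dimension sizes of a thread block = (1024, 1024, 64).
Maximum dimension sizes of a grid = (2147483647, 65535, 65535).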

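My questions are:

How do I determine the block size and grid size?
If I change the block size from (16,16,1) to (32,32,1), the time taken is longer. Why is that?

Could you also link any good papers or books related to this? Thanks in advance.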

Here is the code:


Edit: The code I was using before the code mentioned above, to map a 2D array to a grid of blocks in CUDA, is:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Column
    int j = blockIdx.y * blockDim.y + threadIdx.y; // Row

    // Threads that fall outside the image do nothing.
    if (i >= numCols || j >= numRows) return;

    int idx = j * numCols + i; // Row-major pixel index

    // Weighted sum of the R, G and B channels; alpha (.w) is ignored.
    float channelSum = .299f * rgbaImage[idx].x
                     + .587f * rgbaImage[idx].y
                     + .114f * rgbaImage[idx].z;
    greyImage[idx] = static_cast<unsigned char>(channelSum);
}

void your_rgba_to_greyscale(const uchar4* const h_rgbaImage, uchar4* const d_rgbaImage, unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    const dim3 blockSize(16, 16, 1);
    // Round up so the grid covers the whole image in both dimensions.
    const dim3 gridSize((numCols + (blockSize.x - 1)) / blockSize.x,
                        (numRows + (blockSize.y - 1)) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}

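I understand the mistake in this code. The mistake is that if numRows and numCols are greater than 1024, it shows an error, because the maximum number of threads per block is 1024. So I can use at most 1024 * 1024 pixels; if the image has more pixels, I cannot use it. I now get output from the first code (the topmost one), but I cannot understand the logic behind it.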
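For reference, the gridSize expression in the host function above is a rounding-up integer division: it picks the smallest number of blocks per dimension that still covers every pixel. The following is only a host-side C++ sketch of that arithmetic. The names Dim2, ceilDiv and gridFor are hypothetical helpers introduced for illustration, not part of CUDA.

```cpp
// Hypothetical stand-in for the x/y components of CUDA's dim3.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up: smallest q such that q * d >= n (d > 0).
constexpr unsigned ceilDiv(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Number of blocks needed per dimension to cover a numCols x numRows image.
// Same formula as the gridSize expression in the host function above.
Dim2 gridFor(unsigned numCols, unsigned numRows, Dim2 block) {
    return Dim2{ceilDiv(numCols, block.x), ceilDiv(numRows, block.y)};
}
```

Taking a 1920 x 1080 image as an assumed example, gridFor(1920, 1080, {16, 16}) gives a 120 x 68 grid and gridFor(1920, 1080, {32, 32}) gives 60 x 34. Since 68 * 16 = 1088, the last row of blocks contains threads beyond row 1079, and those are the threads that return at the bounds check in the kernel. The 1024 threads-per-block limit constrains blockSize.x * blockSize.y, not the image dimensions.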

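Answer 1

In the technical specifications for CUDA devices with compute capability 3.2, such as the Tegra TK1, we can see some limiting factors that relate to the performance results you describe. For example:

Maximum number of resident threads per multiprocessor: 2048

Maximum number of threads per block: 1024

Maximum number of resident blocks per multiprocessor: 16

Maximum number of resident warps per multiprocessor: 64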

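Let us (me) assume that there is no limiting factor other than the maximum number of threads (the kernel uses no shared memory, and I assume the number of registers per thread will be less than 63).

Then, with blocks of 16 x 16 threads, i.e. 256 threads or 8 warps, you have at most 8 concurrent blocks per SM (limited by the maximum number of concurrent warps per SM). If you change the block size to 32 x 32 (1024 threads or 32 warps), the maximum number of concurrent blocks drops to 2. This is probably the main reason the second configuration takes longer to execute.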
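The block counts above can be reproduced with a few lines of host-side C++. This is only a sketch under the same assumption as the answer (registers and shared memory are not limiting). The function name residentBlocksPerSM and the constant names are hypothetical; the limits are the compute capability 3.2 values quoted above, plus the warp size of 32 threads.

```cpp
#include <algorithm>

// Per-multiprocessor limits quoted above for compute capability 3.2.
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kMaxBlocksPerSM  = 16;
constexpr int kMaxWarpsPerSM   = 64;
constexpr int kWarpSize        = 32;

// Upper bound on the number of blocks resident on one SM at a time.
// Assumption: registers and shared memory are NOT limiting factors.
int residentBlocksPerSM(int threadsPerBlock) {
    // A partially filled warp still occupies a whole warp slot.
    const int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    const int byWarps   = kMaxWarpsPerSM / warpsPerBlock;
    const int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min({kMaxBlocksPerSM, byWarps, byThreads});
}
```

Calling residentBlocksPerSM(16 * 16) yields 8 and residentBlocksPerSM(32 * 32) yields 2, matching the figures in the paragraph above.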

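Finding the optimal block size is usually a bit tricky and comes down to trial and error. By default, we (I) always start by maximizing occupancy and then try other configurations.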
