GPU / CUDA: reordering device memory

Asked: 2015-12-23 13:23:42

Tags: c++ arrays cuda gpu

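I have a multidimensional array stored in device memory. I want to "permute" / "transpose" it, that is, rearrange its elements according to a new order of the dimensions.

For example, if I have a 2D array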

A = [0, 1, 2
     3, 4, 5]

and I want to swap the order of the dimensions, I get

B = [0, 3
     1, 4
     2, 5]

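In effect, this reordering copies the elements, which are stored in memory in the order [0,1,2,3,4,5], into the new order [0,3,1,4,2,5].

I know how to map indices from A to B. My question is how to perform this mapping efficiently on the device using CUDA.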
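For reference, the index mapping from A to B described above can be sketched on the host as follows. This is only a minimal sketch for a row-major 2D array; `transposeHost` is a hypothetical helper name, not something from an existing library. A version like this can serve as a baseline to check a device implementation against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transposes a row-major rows x cols array on the host.
// Element (i, j) of the input ends up at position (j, i) of the output.
std::vector<float> transposeHost(const std::vector<float>& a,
                                 std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            b[j * rows + i] = a[i * cols + j];  // output is cols x rows
    return b;
}
```

Calling `transposeHost({0, 1, 2, 3, 4, 5}, 2, 3)` yields the order [0,3,1,4,2,5] from the example.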

1 Answer:

Answer 0 (score: 3)

You can take a look at this post: http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

Naive matrix transpose:

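// Tile sizes used in the linked post: each block has TILE_DIM x BLOCK_ROWS
// threads and processes one TILE_DIM x TILE_DIM tile.
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;

// Assumes a square matrix whose side is a multiple of TILE_DIM.
// Reads from idata are coalesced; writes to odata are strided by width.
__global__ void transposeNaive(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  // Each thread copies TILE_DIM / BLOCK_ROWS elements of its tile.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}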
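To see which elements each thread touches, the kernel's indexing can be replayed on the host. The following is a sketch under stated assumptions, not part of the linked post: it assumes `TILE_DIM = 32` and `BLOCK_ROWS = 8`, a launch with `n / TILE_DIM` blocks in each grid dimension and `TILE_DIM x BLOCK_ROWS` threads per block, and a square `n x n` matrix with `n` a multiple of `TILE_DIM`. `emulateTransposeNaive` is a hypothetical name, and the loops over `bx`, `by`, `tx`, `ty` stand in for `blockIdx` and `threadIdx`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int TILE_DIM = 32;   // assumed tile side
const int BLOCK_ROWS = 8;  // assumed number of thread rows per block

// Hypothetical host-side replay of the naive kernel's index arithmetic.
std::vector<float> emulateTransposeNaive(const std::vector<float>& idata, int n)
{
    assert(n > 0 && n % TILE_DIM == 0);
    assert(idata.size() == static_cast<std::size_t>(n) * n);

    std::vector<float> odata(idata.size());
    const int grid = n / TILE_DIM;      // plays the role of gridDim.x and gridDim.y
    const int width = grid * TILE_DIM;  // same expression as in the kernel

    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < BLOCK_ROWS; ++ty)
                for (int tx = 0; tx < TILE_DIM; ++tx) {
                    // Body of one "thread", copied from the kernel.
                    int x = bx * TILE_DIM + tx;
                    int y = by * TILE_DIM + ty;
                    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
                        odata[x * width + (y + j)] = idata[(y + j) * width + x];
                }
    return odata;
}
```

Because `width` is used as the row length of both the input and the output, this indexing only produces a correct transpose for square matrices; a rectangular array such as the 2 x 3 example in the question would need separate input and output widths plus bounds checks.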

Coalesced transpose via shared memory:

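// Stages each tile in shared memory so that both the global read and the
// global write are coalesced. Like transposeNaive, this assumes a square
// matrix whose side is a multiple of TILE_DIM.
__global__ void transposeCoalesced(float *odata, const float *idata)
{
  // The linked post goes on to pad this to [TILE_DIM][TILE_DIM+1]
  // to avoid shared-memory bank conflicts.
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  // Coalesced read: consecutive threadIdx.x read consecutive addresses.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  // Wait until the whole tile has been loaded.
  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  // Coalesced write: the transpose happens in the shared-memory indexing.
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}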
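The kernels above cover the 2D case. Since the question asks about arrays with more than two dimensions, here is a rough host-side sketch of the general index mapping. It is an illustration under assumptions, not code from the linked post: `permuteHost` is a hypothetical name, the layout is assumed to be row-major, and output axis `k` is taken to correspond to input axis `perm[k]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: permutes the axes of a row-major N-D array on the host.
// The output has extent dims[perm[k]] along axis k.
std::vector<float> permuteHost(const std::vector<float>& in,
                               const std::vector<std::size_t>& dims,
                               const std::vector<std::size_t>& perm)
{
    const std::size_t nd = dims.size();
    assert(nd > 0 && perm.size() == nd);

    // Row-major strides of the input array.
    std::vector<std::size_t> inStride(nd, 1);
    for (std::size_t k = nd; k-- > 1; )
        inStride[k - 1] = inStride[k] * dims[k];

    std::size_t total = 1;
    for (std::size_t d : dims) total *= d;
    assert(in.size() == total);

    std::vector<float> out(total);
    std::vector<std::size_t> idx(nd, 0);  // multi-index into the output
    for (std::size_t o = 0; o < total; ++o) {
        // Translate the output multi-index into a linear input offset.
        std::size_t src = 0;
        for (std::size_t k = 0; k < nd; ++k)
            src += idx[k] * inStride[perm[k]];
        out[o] = in[src];

        // Advance the output multi-index in row-major order.
        for (std::size_t k = nd; k-- > 0; ) {
            if (++idx[k] < dims[perm[k]]) break;
            idx[k] = 0;
        }
    }
    return out;
}
```

With `dims = {2, 3}` and `perm = {1, 0}` this reproduces the 2D transpose from the question. On the device, the body of the outer loop could plausibly become the work of one thread, although a direct port would run into the same strided-access problem as the naive kernel, so the tiling idea from the linked post would still be worth applying.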