CUDA - Extracting a layer from a 3D array

Asked: 2018-04-13 13:28:12

Tags: parallel-processing cuda gpu gpgpu pycuda


I have a 3D matrix in which the x-y plane represents an image and the z axis represents the image layers. The problem is that when I try to extract the first (or any other) layer using idz, I don't get the expected result. It looks like arrays in CUDA (via pycuda) are indexed along x, y, or z differently than I expect. I can see this in the result arrays below.

Here is the step-by-step process for this mini example (I'm using generic int values to represent my images, to save uploading the images and the entire code).
First I import the libraries and define the image size and number of layers...

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu

row = 10
column = 10
depth = 5

Then I define the input 3D array and the output 2D array...

#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)

#populate each layer with fixed values
for i in range(depth):
    arrayA[:,:,i] = i + 1

arrayA = arrayA.astype(numpy.uint16)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)

#--==== Output 2D array container ====---
arrayB = numpy.zeros([row, column], dtype = numpy.uint16)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)

Next, I define the CUDA kernel and function in pycuda...

mod = SourceModule("""
    __global__ void getLayer(int *arrayA, int *arrayB)
    {
        int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2) 
        int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
        int idz = 0; //The first layer; this can be set in the range 0-4
        int x_width = (blockDim.x * gridDim.x); 
        int y_width = (blockDim.y * gridDim.y); 

        arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
    }
    """)

func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, block=(row, column, 1), grid=(1,1))
cuda.memcpy_dtoh(arrayA_Answer, arrayA_gpu)  #copy results back to the host before printing
cuda.memcpy_dtoh(arrayB_Answer, arrayB_gpu)

Using standard pycuda commands, I extract the results (not what I expected).
arrayA[:,:,0] = 10x10 matrix filled with 1's (good):

print(arrayA_Answer[:,:,0])
[[1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1]]

arrayB[:,:] = 10x10 matrix filled with the following (bad); expected to equal arrayA[:,:,0]...

print(arrayB_Answer)
[[1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]
 [1 2 3 4 5 1 2 3 4 5]]

1 Answer:

Answer (score: 1)

As discussed here, the numpy 3D storage order pattern is that the "z" (i.e. the 3rd) index is the rapidly varying index as you progress linearly through memory. Your code assumes that the first ("x") index is the rapidly varying one.
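
A minimal host-side sketch of this, using the question's original (10, 10, 5) array and nothing but numpy, shows the layer values cycling as you walk linearly through memory, which is exactly the pattern that appeared in arrayB:

import numpy

row, column, depth = 10, 10, 5

# Rebuild the array from the question: layer i (along axis 2) holds the value i + 1
arrayA = numpy.full((row, column, depth), 0)
for i in range(depth):
    arrayA[:, :, i] = i + 1

# numpy defaults to C (row-major) order, so the last (depth) axis is contiguous
# in memory and the element strides shrink from left to right.
print(arrayA.strides)                 # e.g. (400, 40, 8) with 8-byte ints

# Walking linearly through memory therefore steps through the depth axis first,
# which is what the kernel's flat index idx + (x_width * idy) reads when idz = 0.
print(arrayA.ravel(order='C')[:10])   # [1 2 3 4 5 1 2 3 4 5]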

Since your kernel is already organized for efficient ("coalesced") load/store behavior, you could fix this by reordering the storage of your images/layers/slices in numpy. Here is a worked example:

$ cat t10.py
from __future__ import print_function
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu

row = 5
column = 10
depth = 10

#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)
my_slice=numpy.int32(3)  # choose the layer
#populate each layer with fixed values
for i in range(row):
    arrayA[i,:,:] = i + 1

arrayA = arrayA.astype(numpy.int32)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)

#--==== Output 2D array container ====---
arrayB = numpy.zeros([column, depth], dtype = numpy.int32)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)

mod = SourceModule("""
    __global__ void getLayer(int *arrayA, int *arrayB, int slice)
    {
        int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2)
        int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
        int idz = slice; //The "layer"
        int x_width = (blockDim.x * gridDim.x);
        int y_width = (blockDim.y * gridDim.y);

        arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
    }
    """)

func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, my_slice, block=(depth, column, 1), grid=(1,1))
cuda.memcpy_dtoh(arrayB_Answer,arrayB_gpu)

print(arrayA[my_slice,:,:])

print(arrayB_Answer[:,:])
$ python t10.py
[[4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]]
[[4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]
 [4 4 4 4 4 4 4 4 4 4]]
$

Note that I also changed the use of uint16 to int32, to match the kernel type int.
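
As an optional sanity check (reusing the variable names from the t10.py script above), the slice pulled out by the kernel can be compared on the host against numpy's own view of the chosen layer:

# Append after the cuda.memcpy_dtoh(arrayB_Answer, arrayB_gpu) call in t10.py:
# the 2D result written by the kernel should equal numpy's slice of that layer.
assert numpy.array_equal(arrayB_Answer, arrayA[my_slice, :, :])
print("arrayB_Answer matches arrayA[my_slice, :, :]")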