I'm working with large, non-uniform matrices and I think I'm running into a problem with mismatched elements.
In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of 250000 and 25000 elements respectively. The code also hard-codes sr = 25.
My code tries to multiply tx element-wise against different segments of echo, depending on the specified ranges and the value of sr, and store the result in the array S.
After searching through other people's examples, I found a way of constructing the blocks and grid here that I think works. I'm not familiar with C code, but I've been trying to learn it over the past week. Here is my code:
#!/usr/bin/python
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
(echo, tx) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/block_dim_x).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/block_dim_y).astype(np.int32).item()
kernel_code = """
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
                             int *ranges, int sr)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned long int idx = threads_in_block * block_num + thread_num;
    //aligning the i,j to idx, something is mismatched?
    int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
            ((block_num % gridDim.x) * blockDim.x);
    int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
            ((block_num / gridDim.x) * blockDim.y);
    result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
}
"""
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri, r in enumerate(ranges):
    compare[ri, :] = echo[txidx + r*sr] * tx
print(np.subtract(S, compare))
At the bottom I've added the CPU implementation of what I'm trying to accomplish and subtracted the two results. Only the first and last elements come out as 0+0j; the rest do not. The kernel tries to align i and j with idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why doesn't the result come out all 0+0j, as I intended?
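To see why only the corner elements match, the kernel's index arithmetic can be replayed on the CPU. The sketch below transcribes the kernel's i/j formulas into Python for a tiny hypothetical launch (2x2 threads per block on a 2x2 grid over a 4x4 row-major output; these dimensions are an assumption chosen only for illustration) and checks where result[idx] actually lands relative to the intended S[i][j]:

```python
# Replay the kernel's index arithmetic for a tiny launch:
# 2x2 threads per block on a 2x2 grid, writing a 4x4 row-major output.
block_dim = 2                    # blockDim.x == blockDim.y
grid_dim = 2                     # gridDim.x == gridDim.y
row_len = block_dim * grid_dim   # columns in the output
tib = block_dim * block_dim      # threads_in_block
matches = []
for by in range(grid_dim):
    for bx in range(grid_dim):
        for ty in range(block_dim):
            for tx in range(block_dim):
                block_num = bx + by * grid_dim
                thread_num = tx + ty * block_dim
                idx = tib * block_num + thread_num
                # the kernel's i and j, transcribed into Python
                i = ((idx % (tib * grid_dim)) % block_dim) + \
                    ((block_num % grid_dim) * block_dim)
                j = ((idx - tib * block_num) // block_dim) + \
                    ((block_num // grid_dim) * block_dim)
                if idx == i * row_len + j:  # does result[idx] land on S[i][j]?
                    matches.append(idx)
print(matches)  # → [0, 15]: only the first and last flat indices line up
```

The i and j the kernel computes are valid 2D coordinates, but result[idx] writes at the block-interleaved flat index, which only coincides with the row-major position i*row_len + j at the first and last elements.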
Edit: While trying a small example to get a better grasp of how to index arrays with this block/grid configuration, I stumbled onto something very strange. At first I didn't try to index elements; I just wanted to run a simple test multiplication. It looks like my blocks/grid cover all of ary_in, but the result only doubles the top half of ary_in, while the bottom half returns whatever was left in memory from a previous computation on that half.
If I change blocks_x to 4 so that I cover more space than needed, the doubling works fine. If I then run it with a 4x4 grid but multiply by 3 instead, ary_out correctly comes out as ary_in tripled. When I run it again with a 2x4 grid and doubling, the top half of ary_out is the doubled ary_in, but the bottom half again returns the previous result left in memory, the tripled values. I would understand this if my indexing/block/grid mapping were sending values to the wrong places, but I can't figure out what is wrong.
import numpy as np
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule

ary_in = np.arange(128).reshape((8,16))
print(ary_in)
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned int idx = threads_in_block * block_num + thread_num;
    if (idx < n) {
        // ary_out[idx] = thread_num;
        ary_out[idx] = ary_in[idx] * 2;
    }
}
""")
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
               block=(block_dim_x,block_dim_y,1),
               grid=(blocks_x,blocks_y))
print(ary_out)
Final edit: I figured out the problem. In the edit above, ary_in defaults to int64, which doesn't match the C code's int, initialized as int32. This allocated only half the amount of data the GPU needed for the whole array, so only the top half was moved over and operated on. Adding .astype(np.int32) solved the problem.
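The size mismatch is visible from the byte counts alone. A minimal sketch (using an explicit int64 dtype to model the np.arange default on a 64-bit platform):

```python
import numpy as np

# ary64 models the default np.arange array on a 64-bit platform; the
# kernel's `int *` parameter reads 4-byte int32 elements, so the two
# layouts differ by a factor of two in total bytes.
ary64 = np.arange(128, dtype=np.int64).reshape((8, 16))
ary32 = ary64.astype(np.int32)
print(ary64.nbytes, ary32.nbytes)  # → 1024 512
```

Since drv.In/drv.Out size their transfers from the NumPy array's byte count, the int64 buffer holds twice the bytes the kernel's 128 int32 reads and writes cover.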
This let me work out the indexing order in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to index into the output array when the block dimensions don't divide it evenly (e.g. 16x16), even when using an if (idx < n) guard.
Answer (score: 1)
I found the problem. In the edit above, ary_in defaults to int64, which doesn't match the C code's int, initialized as int32. This allocated only half the amount of data the GPU needed for the whole array, so only the top half was moved over and operated on. Adding .astype(np.int32) solved the problem.
This let me work out the indexing order in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
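As a sanity check on that fix, the same idx to (i, j) mapping can be verified in pure NumPy; the sketch below (using the 8x16 shape from the small example above) confirms that integer division and modulo by the row length recover the row-major coordinates for every flat index:

```python
import numpy as np

row_len = 16
a = np.arange(8 * row_len).reshape((8, row_len))
for idx in range(a.size):
    i = idx // row_len   # C: idx / row_len (integer division)
    j = idx % row_len    # C: idx % row_len
    assert a[i, j] == a.flat[idx]
print("row-major mapping verified for all", a.size, "indices")
```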