I'm working with large, non-uniform matrices and I think I'm running into a problem with mismatched elements.
In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of 250000 and 25000 elements respectively. The code also hard-codes sr = 25.
My code tries to multiply tx element-wise against different segments of echo, depending on the specified ranges and the value of sr, and store the result in the array S.
After searching through other people's examples, I found a way of constructing the blocks and grid here that I think works. I'm not familiar with C code, but I've been trying to learn it over the past week. Here is my code:
#!/usr/bin/python
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
(echo, tx) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/block_dim_x).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/block_dim_y).astype(np.int32).item()
kernel_code = """
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
                             int *ranges, int sr)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned long int idx = threads_in_block * block_num + thread_num;
    //aligning the i,j to idx, something is mismatched?
    int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
            ((block_num % gridDim.x) * blockDim.x);
    int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
            ((block_num / gridDim.x) * blockDim.y);
    result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
}
"""
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri, r in enumerate(ranges):
    compare[ri, :] = echo[txidx + r*sr] * tx
print(np.subtract(S, compare))
At the bottom I've added the CPU implementation of what I'm trying to accomplish and subtracted the two results. Only the first and last elements come out as 0+0j; the rest do not. The kernel tries to align i and j with idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why doesn't the result come out all 0+0j, as I intended?
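To see why only the corner elements match, the kernel's index arithmetic can be replayed on the CPU. The sketch below transcribes the kernel's i/j formulas into Python for a tiny hypothetical launch (2x2 threads per block on a 2x2 grid over a 4x4 row-major output; these dimensions are an assumption chosen only for illustration) and checks where result[idx] actually lands relative to the intended S[i][j]:

```python
# Replay the kernel's index arithmetic for a tiny launch:
# 2x2 threads per block on a 2x2 grid, writing a 4x4 row-major output.
block_dim = 2                    # blockDim.x == blockDim.y
grid_dim = 2                     # gridDim.x == gridDim.y
row_len = block_dim * grid_dim   # columns in the output
tib = block_dim * block_dim      # threads_in_block
matches = []
for by in range(grid_dim):
    for bx in range(grid_dim):
        for ty in range(block_dim):
            for tx in range(block_dim):
                block_num = bx + by * grid_dim
                thread_num = tx + ty * block_dim
                idx = tib * block_num + thread_num
                # the kernel's i and j, transcribed into Python
                i = ((idx % (tib * grid_dim)) % block_dim) + \
                    ((block_num % grid_dim) * block_dim)
                j = ((idx - tib * block_num) // block_dim) + \
                    ((block_num // grid_dim) * block_dim)
                if idx == i * row_len + j:  # does result[idx] land on S[i][j]?
                    matches.append(idx)
print(matches)  # → [0, 15]: only the first and last flat indices line up
```

The i and j the kernel computes are valid 2D coordinates, but result[idx] writes at the block-interleaved flat index, which only coincides with the row-major position i*row_len + j at the first and last elements.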
Edit: While trying a small example to get a better grasp of how to index arrays with this block/grid configuration, I stumbled onto something very strange. At first I didn't try to index elements; I just wanted to run a simple test multiplication. It looks like my blocks/grid cover all of ary_in, but the result only doubles the top half of ary_in, while the bottom half returns whatever was left in memory from a previous computation on that half.
If I change blocks_x to 4 so that I cover more space than needed, the doubling works fine. If I then run it with a 4x4 grid but multiply by 3 instead, ary_out correctly comes out as ary_in tripled. When I run it again with a 2x4 grid and doubling, the top half of ary_out is the doubled ary_in, but the bottom half again returns the previous result left in memory, the tripled values. I would understand this if my indexing/block/grid mapping were sending values to the wrong places, but I can't figure out what is wrong.
import numpy as np
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule

ary_in = np.arange(128).reshape((8,16))
print(ary_in)
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned int idx = threads_in_block * block_num + thread_num;
    if (idx < n) {
        // ary_out[idx] = thread_num;
        ary_out[idx] = ary_in[idx] * 2;
    }
}
""")
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
               block=(block_dim_x,block_dim_y,1),
               grid=(blocks_x,blocks_y))
print(ary_out)
Final edit: I figured out the problem. In the edit above, ary_in defaults to int64, which doesn't match the C code's int, initialized as int32. This allocated only half the amount of data the GPU needed for the whole array, so only the top half was moved over and operated on. Adding .astype(np.int32) solved the problem.
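The size mismatch is visible from the byte counts alone. A minimal sketch (using an explicit int64 dtype to model the np.arange default on a 64-bit platform):

```python
import numpy as np

# ary64 models the default np.arange array on a 64-bit platform; the
# kernel's `int *` parameter reads 4-byte int32 elements, so the two
# layouts differ by a factor of two in total bytes.
ary64 = np.arange(128, dtype=np.int64).reshape((8, 16))
ary32 = ary64.astype(np.int32)
print(ary64.nbytes, ary32.nbytes)  # → 1024 512
```

Since drv.In/drv.Out size their transfers from the NumPy array's byte count, the int64 buffer holds twice the bytes the kernel's 128 int32 reads and writes cover.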
This let me work out the indexing order in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to index into the output array when the block dimensions don't divide it evenly (e.g. 16x16), even when using an if (idx < n) guard.
Answer (score: 1)
I found the problem. In the edit above, ary_in defaults to int64, which doesn't match the C code's int, initialized as int32. This allocated only half the amount of data the GPU needed for the whole array, so only the top half was moved over and operated on. Adding .astype(np.int32) solved the problem.
This let me work out the indexing order in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
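As a sanity check on that fix, the same idx to (i, j) mapping can be verified in pure NumPy; the sketch below (using the 8x16 shape from the small example above) confirms that integer division and modulo by the row length recover the row-major coordinates for every flat index:

```python
import numpy as np

row_len = 16
a = np.arange(8 * row_len).reshape((8, row_len))
for idx in range(a.size):
    i = idx // row_len   # C: idx / row_len (integer division)
    j = idx % row_len    # C: idx % row_len
    assert a[i, j] == a.flat[idx]
print("row-major mapping verified for all", a.size, "indices")
```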