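I'm trying to write a simple program with PyCUDA to test it, and then compare it with my OpenCL implementation. However, I'm having trouble adding two 1D arrays. The problem is that I can't seem to find the correct ID for each element.
My code is very simple: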
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
#Host variables
a = np.array([[1.0, 2,0 , 3.0]], dtype=np.float32)
b = np.array([[4.0, 5,0 , 6.0]], dtype=np.float32)
k = np.float32(2.0)
#Device Variables
a_d = cuda.mem_alloc(a.nbytes)
b_d = cuda.mem_alloc(b.nbytes)
cuda.memcpy_htod(a_d, a)
cuda.memcpy_htod(b_d, b)
s_d = cuda.mem_alloc(a.nbytes)
m_d = cuda.mem_alloc(a.nbytes)
#Device Source
mod = SourceModule("""
__global__ void S(float *s, float *a, float *b)
{
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = by * blockDim.y + ty;
int col = bx * blockDim.x + tx;
int dim = gridDim.x * blockDim.x;
const int id = row * dim + col;
s[id] = a[id] + b[id];
}
__global__ void M(float *m, float *a, float k)
{
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = by * blockDim.y + ty;
int col = bx * blockDim.x + tx;
int dim = gridDim.x * blockDim.x;
const int id = row * dim + col;
m[id] = k * a[id];
}
""")
#Vector addition
func = mod.get_function("S")
func(s_d, a_d, b_d, block=(1,3,1))
s = np.empty_like(a)
cuda.memcpy_dtoh(s, s_d)
#Vector multiplication by constant
func = mod.get_function("M")
func(m_d, a_d, k, block=(1,3,1))
m = np.empty_like(a)
cuda.memcpy_dtoh(m, m_d)
print "Vector Addition"
print "Expected: " + str(a+b)
print "Result: " + str(s) + "\n"
print "Vector Multiplication"
print "Expected: " + str(k*a)
print "Result: " + str(m)
My output is:
Vector Addition
Expected: [[ 5. 7. 0. 9.]]
Result: [[ 5. 7. 0. 6.]]
Vector Multiplication
Expected: [[ 2. 4. 0. 6.]]
Result: [[ 2. 4. 0. 6.]]
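I really don't understand how this indexing works in CUDA. I found some documentation online that gave me a rough idea of how grids, blocks, and threads work, but I still can't get it to work properly. I must be missing something. Any information is very welcome.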
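Answer 0 (score: 1)
Your indexing looks fine, even though it is more elaborate than this small example needs (considering a single dimension would be enough).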
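To illustrate the one-dimensional case, here is a minimal host-side sketch in plain Python. It is not a CUDA kernel; it only emulates the global index that each thread of a 1D launch would compute as `blockIdx.x * blockDim.x + threadIdx.x`. The helper name `global_ids_1d` and the launch sizes are assumptions made up for this illustration.

```python
def global_ids_1d(grid_dim_x, block_dim_x):
    """Emulate on the host the index each thread of a 1D launch would compute."""
    ids = []
    for block_idx_x in range(grid_dim_x):        # plays the role of blockIdx.x
        for thread_idx_x in range(block_dim_x):  # plays the role of threadIdx.x
            ids.append(block_idx_x * block_dim_x + thread_idx_x)
    return ids

# One block of three threads covers elements 0, 1 and 2.
print(global_ids_1d(1, 3))
# Two blocks of four threads cover elements 0 through 7.
print(global_ids_1d(2, 4))
```

In a real kernel, the same expression would typically be paired with a bounds check against the array length, so that surplus threads do not write past the end.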
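The problem is that your arrays a and b each have 4 elements, but your kernel functions only operate on the first 3 elements. That is why the result for the 4th element does not match what you expect.
Did you mean the following?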
a = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
b = np.array([[4.0, 5.0, 6.0]], dtype=np.float32)
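The mismatch can be checked without a GPU. The sketch below assumes only NumPy and replays the question's index arithmetic on the host; the helper `kernel_ids` is hypothetical and is not part of PyCUDA. It shows that `2,0` is parsed as two separate elements, and that a `block=(1,3,1)` launch with the default grid of one block only touches indices 0 to 2.

```python
import numpy as np

# The literal from the question: "2,0" is parsed as two elements, 2 and 0.
a = np.array([[1.0, 2,0 , 3.0]], dtype=np.float32)
print(a.shape)  # four elements in one row, not three

def kernel_ids(grid, block):
    """Replay the question's index arithmetic for every thread of a launch."""
    ids = []
    for by in range(grid[1]):
        for bx in range(grid[0]):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    row = by * block[1] + ty
                    col = bx * block[0] + tx
                    dim = grid[0] * block[0]  # gridDim.x * blockDim.x
                    ids.append(row * dim + col)
    return sorted(ids)

# A grid of a single block is assumed here, since the question passes no grid argument.
written = kernel_ids(grid=(1, 1), block=(1, 3, 1))
untouched = sorted(set(range(a.size)) - set(written))
print(written)    # indices the kernel writes
print(untouched)  # indices left with whatever was already in device memory
```

Since no thread writes index 3, the fourth value in each result is just whatever that memory already held. Either correcting the literals as above or launching one thread per element of the 4-element arrays would make the output consistent.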