I want to select some (not all) items from my input data in CUDA.
My input array d_in has size 53 * 53 (sorry, it is long):
$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
z$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy
yz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwx
xyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw
wxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuv
vwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstu
uvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrst
tuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrs
stuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
rstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnopq
qrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmnop
pqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmno
opqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklmn
nopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijklm
mnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijkl
lmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghijk
klmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghij
jklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefghi
ijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefgh
hijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdefg
ghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcdef
fghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcde
efghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabcd
defghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzabc
cdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyzab
bcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyza
abcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxyz
zabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwxy
yzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvwx
xyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuvw
wxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstuv
vwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrstu
uvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrst
tuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqrs
stuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopqr
rstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnopq
qrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmnop
pqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmno
opqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklmn
nopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijklm
mnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijkl
lmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghijk
klmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghij
jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefghi
ijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefgh
hijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdefg
ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcdef
fghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcde
efghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abcd
defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$abc
cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$ab
bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$a
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz$
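I want to select the last item of each row from the input into the output d_out, so the output size should be 53. Here is my code. It moves the data between preSort and d_in and between temp and d_out, allocates memory for the two pointers, and launches the kernel.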
//variables declared
const int CAPACITY = 53; // must be declared before it is used below
const int ARRAY_BYTES_IN = CAPACITY * sizeof(char);
const int ARRAY_BYTES_ST = CAPACITY * CAPACITY * sizeof(char);
char preSort[CAPACITY * CAPACITY];
char temp[CAPACITY];
void getLast(){
//two pointers
char* d_in;
char* d_out;
//allocate gpu memory
cudaMalloc(&d_in, ARRAY_BYTES_ST);
cudaMalloc(&d_out, ARRAY_BYTES_IN);
//transfer input into gpu
cudaMemcpy(d_in, preSort, ARRAY_BYTES_ST, cudaMemcpyHostToDevice);
int size = CAPACITY*CAPACITY;
int blockSize = 1024;
int numbBlock = (size + blockSize - 1) / blockSize;
//Launch the kernel
DoGetLast<<<numbBlock, blockSize>>>(d_out, d_in);
//Copy back to the host
cudaMemcpy(temp, d_out, ARRAY_BYTES_IN, cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);
}
The GPU kernel:
__global__ void DoGetLast(char* d_out, char* d_in){
int CAP = 53*53;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
char f;
//get the output from the input. It's a 1-D array actually, so pick
//only one character out of every 53 characters from d_in
if(idx % CAP == (CAP - 1)){
f = d_in[idx];
d_out[idx] = f;
}
}
In main I only call getLast() and print the output with a loop. I expect the output to look like:
zyxwvutsrqponmlkjihgfedcbazyxwvutsrqponmlkjihgfedcba$
However, my output is just a single letter, z.
Can anyone tell me what is wrong with my code and help me fix it?
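Answer 0 (score: 2)
Your kernel has a few errors: CAP should be the row length 53 rather than 53*53, reads of d_in need a bounds check, and the output index should be the row number idx / CAP rather than idx. You should modify your kernel like this: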
__global__ void DoGetLast(char* d_out, char* d_in)
{
int CAP = 53; // not 53*53
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx < CAP * CAP) // d_in boundary check
if (idx % CAP == (CAP - 1)) {
char f = d_in[idx];
d_out[idx / CAP] = f; // not d_out[idx]
}
}
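The index arithmetic behind this fix can be checked at compile time. The sketch below assumes the 53 x 53 input is stored row-major, as in the question; the helper name lastOfRow is hypothetical and only serves the illustration.

```cuda
// Row length of the input; the whole array holds CAP * CAP characters.
constexpr int CAP = 53;

// Hypothetical helper: flat index of the last item of a given row
// in a row-major CAP x CAP array.
constexpr int lastOfRow(int row) { return row * CAP + (CAP - 1); }

// The kernel's selection test holds exactly for these indices...
static_assert(lastOfRow(0) % CAP == CAP - 1, "row 0 is selected");
static_assert(lastOfRow(52) % CAP == CAP - 1, "row 52 is selected");
// ...and idx / CAP recovers the row number, which is the output slot.
static_assert(lastOfRow(7) / CAP == 7, "output slot is the row number");
// The largest selected index is still inside d_in.
static_assert(lastOfRow(52) == CAP * CAP - 1, "stays in bounds");
```

With CAP set to 53*53 instead, idx % CAP == CAP - 1 holds only for idx = 2808, and d_out[2808] lies far outside the 53-byte output buffer.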
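If you had checked your code with a proper CUDA error checking method and the cuda-memcheck tool, you would have seen that the kernel has a memory access error, and you might have worked out for yourself what was wrong.
(If I run your code under cuda-memcheck, it reports Invalid __global__ write of size 1 at the line d_out[idx] = f.)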
Also, note that your code has low GPU utilization: of the numbBlock * blockSize threads you launch, only 53 do useful work. The rest (more than 52 * 53 threads) do nothing.
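As a rough illustration of the error checking mentioned in this answer, here is a minimal sketch of a checking macro. The name CUDA_CHECK is an assumption (it is not part of the CUDA runtime); the runtime calls it wraps (cudaGetErrorString, cudaGetLastError, cudaDeviceSynchronize) are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: evaluate a CUDA runtime call and abort with
// a readable message (error string, file, line) if it did not succeed.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Usage sketch, wrapping the calls from the question's getLast():
//   CUDA_CHECK(cudaMalloc(&d_in, ARRAY_BYTES_ST));
//   DoGetLast<<<numbBlock, blockSize>>>(d_out, d_in);
//   CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
```

An out-of-bounds write is not guaranteed to surface as a runtime error, so running the program under cuda-memcheck (or compute-sanitizer, its replacement in newer CUDA toolkits) remains the more reliable way to catch it.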
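Answer 1 (score: 0)
In CUDA it is better to create a number of threads equal to the output size or smaller. In your case you only need 53 working threads, for example one block of 64: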
int size = CAPACITY;
int blockSize = 64;
int numbBlock = (size + blockSize - 1) / blockSize;
Now, on the kernel side, each thread picks the last item of its row, like this:
__global__ void DoGetLast(char* d_out, char* d_in){
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int CAP = 53;
if(idx >= CAP)
return;
d_out[idx] = d_in[idx*CAP + (CAP - 1)]; // last item of row idx; idx*CAP alone is the first item
}
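For completeness, a possible host side for this one-thread-per-row approach is sketched below. The names DoGetLastPerRow and getLastPerRow are hypothetical, and the kernel reads index row * CAP + (CAP - 1), which is the last column of each row in a row-major layout.

```cuda
#include <cuda_runtime.h>

constexpr int CAP = 53;  // row length; the input is CAP x CAP, row-major

// One thread per output element: thread `row` copies the last item of its row.
__global__ void DoGetLastPerRow(char* d_out, const char* d_in) {
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row >= CAP) return;  // the block has 64 threads, only 53 have work
    d_out[row] = d_in[row * CAP + (CAP - 1)];
}

// Hypothetical host wrapper: h_in holds CAP * CAP chars, h_out receives CAP chars.
// Returns false if any CUDA call reports an error.
bool getLastPerRow(const char* h_in, char* h_out) {
    char* d_in = nullptr;
    char* d_out = nullptr;
    bool ok = cudaMalloc(&d_in, CAP * CAP) == cudaSuccess &&
              cudaMalloc(&d_out, CAP) == cudaSuccess &&
              cudaMemcpy(d_in, h_in, CAP * CAP,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blockSize = 64;
        int numbBlock = (CAP + blockSize - 1) / blockSize;  // one block
        DoGetLastPerRow<<<numbBlock, blockSize>>>(d_out, d_in);
        // The blocking cudaMemcpy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_out, d_out, CAP,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);   // cudaFree(nullptr) is a no-op
    cudaFree(d_out);
    return ok;
}
```

Called as getLastPerRow(preSort, temp) with the arrays from the question, this should fill temp with the 53 row-end characters.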