(This might be more of a theoretical parallel optimization problem then a CUDA specific problem per se. I'm very new to Parallel Programming in general so this may just be personal ignorance.)
I have a workload that consists of a 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64 bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in it's respective place on a device array. Ditto if it's invalid. So basically I generate a data point for very value analyzed regardless if it's valid or not. Then once the array is full I copy it to host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the # of threads in the grid.
But Copying Memory to & from the GPU is expensive time wise. That said what I would like to do is copy data over only when necessary; I want to fill up a device array with only valid solutions and then once the array is full then copy it over from the host. But how do you consecutively fill an array up in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the Kernel I initially developed. As you see I am generating 1 byte of data for each value analyzed. Now I really only need each 64 bit number which is valid; if I need be I can make a new kernel. As suggested by some of the commentators I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
//a 64 bit binary value to be evaluated is called a kValue
unsigned long long int kStart, kEnd, kRoot, kSize, curK;
//kRoot is the kValue at the start of device array, this is used is the device array is larger than the total threads in the grid
//kStart is the kValue to start this kernel call on
//kEnd is the last kValue to validate
//kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host
//curK is the current kValue represented as a 64 bit unsigned integer
int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
kStart = kInfo[0];
kEnd = kInfo[1];
kRoot = kInfo[2];
nodes = kInfo[3];
edges = kInfo[4];
kSize = kInfo[5];
curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
if (curK > kEnd) {//check to make sure you don't overshoot the end value
return;
}
kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
for (row = 0; row < nodes; row++) {
rowCount = 0;
kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
for (col = 0; col < nodes; col++) {
if (col > row) {
if (curK & (1 << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space
rowCount++;
}
kBitLocation++;
}
if (col < row) {
if (col > 0) {
kMirrorBitLocation += (nodes - 2) - (col - 1);
}
if (curK & (1 << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set
rowCount++;
}
}
}
if (rowCount != edges) {
//set the ith bit to zero
values[curK - kRoot] = 0;
return;
}
}
//set the ith bit to one
values[curK - kRoot] = 1;
*solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}
答案 0 :(得分:2)
(这个答案假设输出顺序无关紧要,有效值的位置也是如此。)
从概念上讲,您的分析会生成一组有效值。您描述的实现使用此集合的密集表示:每个潜在值一位。然而,您已经表明数据非常稀少(5e-2或1000/10 ^ 9 = 1e-6);而且,跨PCI Express复制数据非常麻烦。
那么,为什么不考虑稀疏表示呢?最简单的一个只是有效值的无序序列。当然,编写需要跨线程进行一些同步 - 甚至可能跨越块。粗略地说,你可以让warp在共享内存中收集它们的有效值;然后在块级同步以收集块的有效值(对于它已分析的输入的给定块);最后使用atomics从所有块中收集数据。哦,也 - 让每个主题分析多个值,这样你就不必进行那么多的同步。
答案 1 :(得分:1)
因此,您希望在从计算返回之前让每个线程分析多个数字(数千或数百万)。因此,如果您分析线程中的一百万个数字,那么您只需要该空间量的%5就可以保存该计算的结果。