I am trying to modify Intel's Bitonic Sorting algorithm, which sorts an array of `cl_int`s, so that it sorts an array of `cl_int2`s (based on the key, i.e. `cl_int2.x`).
Intel's example consists of simple host code and one OpenCL kernel, which is called multiple times during a single sorting operation (multipass).
The kernel loads 4 array items at once as a `cl_int4` and operates on them.
I did not modify the host code algorithm, only the device code. List of changes in the kernel function:
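- changed the type of the first kernel argument from `int4*` to `int8*` (to load four key-value pairs)
- only the `.even` components of the `theArray` elements are used to compare values (`<`)
- the comparison result (`pseudomask`) is created as an `int4`, and based on it `mask` is created as `pseudomask.xxyyzzww` (to capture the values as well)

Although the output of my modified kernel is an array of `cl_int2`s perfectly sorted by the first component (`cl_int2.x`), the values (`cl_int2.y`) are incorrect: the value of one item is repeated for the next 4 or 8 items, then a new value is used and repeated, and so on.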
I am sure it is a trivial mistake, but I am unable to find it.
Diff of the original Intel code and my modified version
The array of `cl_int2`s is sorted perfectly when all the keys (`cl_int2.x`) are unique.
Example input: http://pastebin.com/92qB1csT
Example output: http://pastebin.com/dsU97Npn
(Correctly sorted array: http://pastebin.com/Nb56BuQK)
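To make the change list concrete, here is a small host-side Python model (not OpenCL, and not part of Intel's sample) of the upper-level compare-exchange. It assumes each `int8` is a flat list of eight ints laid out as key, value, key, value, and so on. The helper names `widen_xxyyzzww` and `compare_exchange` and the sample numbers are made up for illustration.

```python
def widen_xxyyzzww(pseudomask):
    """Model of pseudomask.xxyyzzww: repeat each int4 lane for key and value."""
    return [pseudomask[i // 2] for i in range(8)]

def compare_exchange(src_left, src_right):
    """Model of the imin/imax computation for two int8 vectors."""
    # srcLeft.even < srcRight.even: compare keys only.
    # OpenCL vector comparisons yield -1 (all bits set) for true, 0 for false.
    pseudomask = [-1 if l < r else 0
                  for l, r in zip(src_left[0::2], src_right[0::2])]
    mask = widen_xxyyzzww(pseudomask)
    imin = [(l & m) | (r & ~m) for l, r, m in zip(src_left, src_right, mask)]
    imax = [(l & ~m) | (r & m) for l, r, m in zip(src_left, src_right, mask)]
    return imin, imax

# Illustrative data: keys at even positions, values at odd positions.
left = [4, 40, 1, 10, 9, 90, 2, 20]
right = [3, 30, 5, 50, 9, 91, 8, 80]
imin, imax = compare_exchange(left, right)
print(imin)  # [3, 30, 1, 10, 9, 91, 2, 20]
print(imax)  # [4, 40, 5, 50, 9, 90, 8, 80]
```

Equal keys (the two 9s) are harmless in this step: both outputs use the same mask and its complement, so every pair ends up in exactly one of `imin` or `imax`.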
Modified kernel code (commented); the host-side code is unchanged from Intel's sample:
// Copyright (c) 2009-2011 Intel Corporation
// https://software.intel.com/en-us/articles/bitonic-sorting
// Modified to sort int2 key-value array
__kernel void BitonicSort(__global int8* theArray,
                          const uint stage,
                          const uint passOfStage,
                          const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0, 0, -1, -1);
    int4 imask11 = (int4)(0, -1, 0, -1);
    if(stage > 0)
    {
        if(passOfStage > 0) // upper level pass, exchange between two fours
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;
            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww;
            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);
            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left] = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left] = imax;
            }
        }
        else // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyzzww;
            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;
                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);
                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;
                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else // first stage, sort inside one four
    {
        /*
         * To convert this code to int2 sorter, do this:
         * 1. instead of loading int4, load int8 (key,value, key,value, ...)
         * 2. when there is a vector swizzling, replace component index with two consecutive indices:
         *        srcLeft.yxwz -> srcLeft.s23016745
         *    use this rewrite rule:
         *        x  y  z  w
         *        01 23 45 67
         * 3. replace comparison operands with only their keys swizzled:
         *        mask = srcLeft < srcRight; -> pseudomask = srcLeft.even < srcRight.even; mask = pseudomask.xxyyzzww;
         */
        // make bitonic sequence out of 4.
        int4 imask0 = (int4)(0, -1, -1, 0); // -1 in comparison = true (all bits set - two's complement)
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;
        /*
         * This XOR mask flips bits, so that in `mask` are the following
         * results (remember that srcRight is srcLeft with swapped component pairs):
         *
         *     [ left.x<left.y, left.x<left.y, left.w<left.z, left.w<left.z ]
         * or: [ left.x<left.y, left.x<left.y, left.z>left.w, left.z>left.w ]
         */
        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxyyzzww;
        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask); // make sure the numbers are sorted like this:
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
        /*
         * Now the pairs of numbers in `srcLeft` are sorted according to the specified `dir`ection.
         * If dir == true, then
         *     The components `x` and `y` are swapped so that `x` < `y`. Moreover `z` and `w` are swapped so that `z` > `w`. This resembles up-hill: /\
         * else
         *     The components `x` and `y` are swapped so that `x` > `y`. Moreover `z` and `w` are swapped so that `z` < `w`. This resembles down-hill: \/
         *
         * This swapping is achieved by creating `srcLeft`, which is in normal order, and `srcRight`, which has component pairs switched (xyzw -> yxwz).
         * Then the `mask` is created. The mask bits are redundant because it applies to vector component pairs (so in order to implement key-value sorting,
         * I have to increase the length of masks!).
         *
         * The non-ordered component pairs in `srcLeft` are masked out by `mask` while the inverted `mask` is applied to the (pair-wise switched) `srcRight`.
         *
         * This (the previous) first flipping just makes a 4-bitonic sequence.
         */
        /*
         * This second step just sorts the bitonic sequence
         */
        srcRight = srcLeft.s45670123; // inverts the bitonic sequence
        // [ left.a<left.c, left.b<left.d, left.a<left.c, left.b<left.d ]
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10; // imask10 = (noflip, noflip, flip, flip)
        mask = pseudomask.xxyyzzww;
        // even or odd (The output of this thread is sorted monotonic sequence. The monotonicity changes and thus preparing bitonic sequence for the next pass.).
        if((i & 1) ^ dir)
        {
            // this sorts the bitonic sequence, hence splitting it
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;
            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);
            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;
            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}
Answer 0 (score: 3)
I have finally solved the problem!
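The tricky part was the way the original Intel code handles equal values of neighbouring pairs in a loaded 4-tuple: it does not handle them explicitly!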
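The bug was in the very first `stage` (i.e. `stage = 0`) and in the last `passOfStage` (i.e. `passOfStage = 0`) of every other `stage`. These parts of the code swap individual 2-tuples inside one 4-tuple (represented by the `cl_int8` array `theArray`).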
Let's consider this excerpt (for example, it does not work properly when neighbouring pairs within a 4-tuple have equal keys):
imask0 = (int4)(0, -1, -1, 0);
srcLeft = theArray[i]; // int8
srcRight = srcLeft.s23016745;
pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
mask = pseudomask.xxyyzzww;
result = (srcLeft & mask) | (srcRight & ~mask);
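Imagine what happens when we use this (unfixed) code with `srcLeft.even = (int4)(7,7, 5,5)`. The operation `srcLeft.even < srcRight.even` yields `(int4)(0,0,0,0)`. We then XOR this result with `imask0` and get `pseudomask = (int4)(0,-1,-1,0)`, i.e. the imask itself. That is wrong, however: with this mask the first lane is taken from `srcRight` and the second from `srcLeft`, and both hold the same key-value pair, so one pair is duplicated and its neighbour is lost.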
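The failure can be reproduced with a small host-side Python model (a sketch, not the kernel itself). It treats one `int8` as a list of four `(key, value)` tuples; the values 10 to 13 and the helper names `swap_neighbours` and `unfixed_step` are made up for illustration, and only the `(srcLeft & mask) | (srcRight & ~mask)` branch is modeled.

```python
def swap_neighbours(pairs):
    """Model of srcLeft.s23016745: (p0, p1, p2, p3) -> (p1, p0, p3, p2)."""
    return [pairs[1], pairs[0], pairs[3], pairs[2]]

def unfixed_step(pairs, imask):
    """Model of the excerpt above with the unfixed mask."""
    left, right = pairs, swap_neighbours(pairs)
    # (srcLeft.even < srcRight.even) ^ imask, with -1 for true and 0 for false.
    pseudomask = [(-1 if l[0] < r[0] else 0) ^ m
                  for l, r, m in zip(left, right, imask)]
    # pseudomask.xxyyzzww applies one lane to both key and value, so the
    # bitwise select reduces to picking a whole pair per lane.
    result = [l if pm == -1 else r
              for l, r, pm in zip(left, right, pseudomask)]
    return pseudomask, result

IMASK0 = [0, -1, -1, 0]
pairs = [(7, 10), (7, 11), (5, 12), (5, 13)]  # keys 7,7,5,5; values are made up
pseudomask, result = unfixed_step(pairs, IMASK0)
print(pseudomask)  # [0, -1, -1, 0]
print(result)      # [(7, 11), (7, 11), (5, 12), (5, 12)]
```

The values 10 and 13 disappear from the output while 11 and 12 appear twice, which matches the repeated values seen in the question.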
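The value of `pseudomask` has to follow this pattern: `(int4)(a,a, b,b)` (where `a` and `b` are each either `0` or `-1`). This means that the following comparison is sufficient to form the correct `mask`: `quasimask = srcLeft.s06 < srcRight.s06` (`s0` and `s6` are the keys of the first and the last pair, the two lanes that `imask0` does not flip). The correct mask is then created as `mask = quasimask.xxxxyyyy`. The first two `x`s mask the first key-value pair in the first 2-tuple of the 4-tuple (4-tuple = one element of `theArray`). Since we want to bitmask the whole corresponding 2-tuple (designated by `imask0` as a `0`, `-1` pair), we add another `xx`. We bitmask the second 2-tuple in the 4-tuple in the same way, which leaves us with `yyyy`.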
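As a sanity check of this idea, here is a host-side Python sketch of the fixed step, again modeling one `int8` as four `(key, value)` tuples. The function name `fixed_step` and its `lanes` parameter are hypothetical: `lanes=(0, 3)` picks the first and last pair (the lanes `imask0` leaves unflipped), and `(0, 2)` would be the analogous choice for `imask11`.

```python
def swap_neighbours(pairs):
    """Model of srcLeft.s23016745: (p0, p1, p2, p3) -> (p1, p0, p3, p2)."""
    return [pairs[1], pairs[0], pairs[3], pairs[2]]

def fixed_step(pairs, lanes=(0, 3)):
    """Select pairs using one key comparison per 2-tuple."""
    left, right = pairs, swap_neighbours(pairs)
    # quasimask: one comparison result (-1 or 0) per 2-tuple.
    quasimask = [-1 if left[i][0] < right[i][0] else 0 for i in lanes]
    # quasimask.xxxxyyyy: the same decision for both pairs of a 2-tuple.
    mask = [quasimask[0], quasimask[0], quasimask[1], quasimask[1]]
    # Model of (srcLeft & mask) | (srcRight & ~mask), one whole pair per lane.
    return [l if m == -1 else r for l, r, m in zip(left, right, mask)]

print(fixed_step([(7, 10), (7, 11), (5, 12), (5, 13)]))
# [(7, 11), (7, 10), (5, 13), (5, 12)]
print(fixed_step([(3, 30), (1, 10), (2, 20), (4, 40)]))
# [(1, 10), (3, 30), (4, 40), (2, 20)]
```

Because both pairs of a 2-tuple now follow the same decision, each 2-tuple is either kept or swapped as a whole, so no key-value pair can be duplicated or dropped.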
I applied this fix in the fully functional version of the kernel and commented the fixed parts.
Visual example with `imask11`:
srcLeft:                          x   y   z   w
                                  <   <   <   <
srcRight [relative to srcLeft]:   y   x   w   z
^ imask11:                        0  -1   0  -1
------------------------------------------------
(srcLeft<srcRight)^imask11:       x   x   z   z
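The observation from the question, that unique keys sort correctly, is consistent with this analysis. The following Python sketch enumerates small key combinations and lists those for which the unfixed `pseudomask` breaks the `(a,a, b,b)` pattern. The function names and the key range 0 to 2 are arbitrary choices for illustration.

```python
from itertools import product

IMASK0 = [0, -1, -1, 0]
IMASK11 = [0, -1, 0, -1]

def unfixed_pseudomask(keys, imask):
    """Model of (srcLeft.even < srcRight.even) ^ imask, srcRight = srcLeft.s23016745."""
    x, y, z, w = keys
    right = [y, x, w, z]
    return [(-1 if l < r else 0) ^ m for l, r, m in zip(keys, right, imask)]

def broken_inputs(imask, key_values=range(3)):
    """Key 4-tuples whose unfixed pseudomask is not of the form (a, a, b, b)."""
    bad = []
    for keys in product(key_values, repeat=4):
        pm = unfixed_pseudomask(keys, imask)
        if pm[0] != pm[1] or pm[2] != pm[3]:
            bad.append(keys)
    return bad

for name, imask in (("imask0", IMASK0), ("imask11", IMASK11)):
    bad = broken_inputs(imask)
    # Every failing input has equal keys in at least one neighbouring pair.
    assert all(x == y or z == w for x, y, z, w in bad)
    print(name, len(bad), "of", 3 ** 4, "key combinations break the pattern")
```

In this model the pattern breaks exactly when `x == y` or `z == w`, so inputs without equal neighbouring keys never trigger the bug.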