Question

I am trying to modify Intel's Bitonic Sorting algorithm, which sorts an array of cl_int, so that it sorts an array of cl_int2 (based on the key, i.e. cl_int2.x).

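Intel's example consists of a simple host program and a single OpenCL kernel that is called multiple times during one sort operation (multipass). The kernel loads 4 array items at once as a cl_int4 and operates on them.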
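The multipass structure can be pictured as a nested loop on the host. The following is a minimal sketch in plain C, not Intel's actual host code. It assumes that `passOfStage` counts down from `stage` to 0 within each stage, which is consistent with the kernel treating `passOfStage == 0` as the last pass. The names `pass_t`, `list_passes` and `num_stages` are made up for illustration.

```c
#include <stddef.h>

/* One kernel launch: the (stage, passOfStage) pair passed as kernel arguments. */
typedef struct {
    unsigned stage;
    unsigned pass_of_stage;
} pass_t;

/* Writes the launch order for `num_stages` stages into `out` (capacity `cap`)
 * and returns the total number of launches.
 * Assumption: within a stage, passOfStage runs from `stage` down to 0. */
size_t list_passes(unsigned num_stages, pass_t *out, size_t cap)
{
    size_t n = 0;
    for (unsigned stage = 0; stage < num_stages; ++stage) {
        for (int pass = (int)stage; pass >= 0; --pass) {
            if (n < cap) {
                out[n].stage = stage;
                out[n].pass_of_stage = (unsigned)pass;
            }
            ++n;
        }
    }
    return n;
}
```

Under that assumption, three stages produce six launches in the order (0,0), (1,1), (1,0), (2,2), (2,1), (2,0).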

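I did not modify the host-side algorithm, only the device code. List of changes to the kernel function:

changed the type of the kernel's first argument from int4* to int8* (to load four key-value pairs)
used only the .even components of the theArray elements (the keys) for comparisons
created the "<" mask as an int4 (pseudomask) and, based on it, created mask as pseudomask.xxyyzzww (so that the values are captured as well)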
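The three changes listed above can be emulated lane by lane in plain C, with an `int[8]` standing in for an OpenCL `int8` (even lanes are keys, odd lanes are values). This is only a sketch of the compare-exchange in the `passOfStage > 0` branch; the helper names are hypothetical and nothing here is OpenCL API.

```c
#include <assert.h>

/* pseudomask = left.even < right.even: lane j is -1 (all bits set) when the
 * key of pair j in `left` is smaller, 0 otherwise (OpenCL vector semantics). */
static void key_less_mask(const int left[8], const int right[8], int pseudomask[4])
{
    for (int j = 0; j < 4; ++j)
        pseudomask[j] = (left[2 * j] < right[2 * j]) ? -1 : 0;
}

/* mask = pseudomask.xxyyzzww: each key decision covers the key lane and the value lane. */
static void widen_xxyyzzww(const int pseudomask[4], int mask[8])
{
    for (int j = 0; j < 4; ++j)
        mask[2 * j] = mask[2 * j + 1] = pseudomask[j];
}

/* out = (a & mask) | (b & ~mask), lane by lane. */
static void select_lanes(const int a[8], const int b[8], const int mask[8], int out[8])
{
    for (int k = 0; k < 8; ++k)
        out[k] = (a[k] & mask[k]) | (b[k] & ~mask[k]);
}

/* Compare-exchange of two int8 elements by key; values travel with their keys. */
void min_max_by_key(const int left[8], const int right[8], int imin[8], int imax[8])
{
    int pseudomask[4], mask[8];
    assert(imin != imax);                   /* outputs must not alias */
    key_less_mask(left, right, pseudomask);
    widen_xxyyzzww(pseudomask, mask);
    select_lanes(left, right, mask, imin);  /* pair with the smaller key */
    select_lanes(right, left, mask, imax);  /* pair with the larger key  */
}
```

In this branch equal keys are harmless: when a key in `left` equals the key in `right`, the mask lane is 0, so `imin` receives the pair from `right` and `imax` the pair from `left`, and both pairs survive.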

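Although the output of my modified kernel is a cl_int2 array perfectly sorted by the first component (cl_int2.x), the values (cl_int2.y) are incorrect: the value of one item is repeated for the next 4 or 8 items, then a new value is used and repeated, and so on.

I am sure it is a trivial mistake, but I am unable to find it.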
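A symptom like this (keys in order, values duplicated) is easy to miss when only the keys are inspected. A host-side check along the following lines can catch it. This is a sketch in plain C that assumes an ascending sort; `kv_t`, `sorted_by_key` and `same_pairs` are hypothetical helper names, and `kv_t` merely mirrors the layout of `cl_int2`.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { int key; int value; } kv_t;  /* mirrors cl_int2 (x = key, y = value) */

/* Orders pairs by key, then by value, so equal multisets sort identically. */
static int cmp_kv(const void *pa, const void *pb)
{
    const kv_t *a = pa, *b = pb;
    if (a->key != b->key)     return (a->key < b->key) ? -1 : 1;
    if (a->value != b->value) return (a->value < b->value) ? -1 : 1;
    return 0;
}

/* Returns 1 if the keys are in non-decreasing order. */
int sorted_by_key(const kv_t *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (a[i - 1].key > a[i].key) return 0;
    return 1;
}

/* Returns 1 if `out` holds exactly the same (key, value) pairs as `in`. */
int same_pairs(const kv_t *in, const kv_t *out, size_t n)
{
    if (n == 0) return 1;
    kv_t *x = malloc(n * sizeof *x);
    kv_t *y = malloc(n * sizeof *y);
    int ok = 0;
    if (x && y) {
        memcpy(x, in, n * sizeof *x);
        memcpy(y, out, n * sizeof *y);
        qsort(x, n, sizeof *x, cmp_kv);
        qsort(y, n, sizeof *y, cmp_kv);
        ok = memcmp(x, y, n * sizeof *x) == 0;
    }
    free(x);
    free(y);
    return ok;
}
```

An output can pass `sorted_by_key` and still fail `same_pairs`; that combination is the situation described in the question.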

Diff of the original Intel code and my modified version

Edit: the `cl_int2` array is sorted perfectly when every key (`cl_int2.x`) is unique.

Example input: http://pastebin.com/92qB1csT

Example output: http://pastebin.com/dsU97Npn

(Correctly sorted array: http://pastebin.com/Nb56BuQK)

Modified kernel code (commented):

// Copyright (c) 2009-2011 Intel Corporation
// https://software.intel.com/en-us/articles/bitonic-sorting

// Modified to sort int2 key-value array

__kernel void BitonicSort(__global int8* theArray,
                         const uint stage,
                         const uint passOfStage,
                         const uint dir)
{
    size_t i = get_global_id(0);
    int8 srcLeft, srcRight, mask;
    int4 pseudomask;
    int4 imask10 = (int4)(0,  0, -1, -1);
    int4 imask11 = (int4)(0, -1,  0, -1);

    if(stage > 0)
    {
        if(passOfStage > 0)    // upper level pass, exchange between two fours,
        {
            size_t r = 1 << (passOfStage - 1);
            size_t lmask = r - 1;
            size_t left = ((i>>(passOfStage-1)) << passOfStage) + (i & lmask);
            size_t right = left + r;

            srcLeft = theArray[left];
            srcRight = theArray[right];
            pseudomask = srcLeft.even < srcRight.even;
            mask = pseudomask.xxyyzzww;

            int8 imin = (srcLeft & mask) | (srcRight & ~mask);
            int8 imax = (srcLeft & ~mask) | (srcRight & mask);

            if( ((i>>(stage-1)) & 1) ^ dir )
            {
                theArray[left]  = imin;
                theArray[right] = imax;
            }
            else
            {
                theArray[right] = imin;
                theArray[left]  = imax;
            }
        }
        else    // last pass, sort inside one four
        {
            srcLeft = theArray[i];
            srcRight = srcLeft.s45670123;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask10;
            mask = pseudomask.xxyyzzww;

            if(((i >> stage) & 1) ^ dir)
            {
                srcLeft = (srcLeft & mask) | (srcRight & ~mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;

                theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
            }
            else
            {
                srcLeft = (srcLeft & ~mask) | (srcRight & mask);

                srcRight = srcLeft.s23016745;
                pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
                mask = pseudomask.xxyyzzww;

                theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
            }
        }
    }
    else    // first stage, sort inside one four
    {
        /*
         *  To convert this code to int2 sorter, do this:
         *      1. instead of loading int4, load int8 (key,value, key,value, ...)
         *      2. when there is a vector swizzling, replace component index with two consecutive indices:
         *           srcLeft.yxwz  ->  srcLeft.s23016745
         *         use this rewrite rule:
         *           x  y  z  w
         *           01 23 45 67
         *      3. replace comparison operands with only their keys swizzled:
         *           mask = srcLeft < srcRight;    ->    pseudomask = srcLeft.even < srcRight.even; mask = pseudomask.xxyyzzww;
         */

        //  make bitonic sequence out of 4.
        int4 imask0 = (int4)(0, -1, -1,  0); // -1 in comparison = true (all bits set - two's complement)
        srcLeft = theArray[i];
        srcRight = srcLeft.s23016745;

        /*
         * This XOR mask flips bits, so that in `mask` are the following
         * results (remember that srcRight is srcLeft with swapped component pairs):
         *
         *      [ left.x<left.y, left.x<left.y,    left.w<left.z, left.w<left.z  ]
         *  or: [ left.x<left.y, left.x<left.y,    left.z>left.w, left.z>left.w  ]
         */
        pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
        mask = pseudomask.xxyyzzww;

        if( dir )
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);  // make sure the numbers are sorted like this:
        else
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);

        /*
         *  Now the pairs of numbers in `srcLeft` are sorted according to the specified `dir`ection.
         *  If dir == true, then
         *    The components `x` and `y` are swapped so that `x` < `y`. Moreover `z` and `w` are swapped so that `z` > `w`. This resembles up-hill: /\
         *  else
         *    The components `x` and `y` are swapped so that `x` > `y`. Moreover `z` and `w` are swapped so that `z` < `w`. This resembles down-hill: \/
         *
         *  This swapping is achieved by creating `srcLeft`, which is in normal order, and `srcRight`, which has component pairs switched (xyzw -> yxwz).
         *  Then the `mask` is created. The mask bits are redundant because it applies to vector component pairs (so in order to implement key-value sorting,
         *  I have to increase the length of masks!).
         *
         *  The non-ordered component pairs in `srcLeft` are masked out by `mask` while the inverted `mask` is applied to the (pair-wise switched) `srcRight`.
         *
         *  This (the previous) first flipping just makes a 4-bitonic sequence.
         */


        /*
         *  This second step just sorts the bitonic sequence
         */
        srcRight = srcLeft.s45670123; // inverts the bitonic sequence

        // [ left.a<left.c, left.b<left.d,    left.a<left.c, left.b<left.d ]
        pseudomask = (srcLeft.even < srcRight.even) ^ imask10;  // imask10 = (noflip, noflip,  flip, flip)
        mask = pseudomask.xxyyzzww;

        // even or odd (The output of this thread is sorted monotonic sequence. The monotonicity changes and thus preparing bitonic sequence for the next pass.).
        if((i & 1) ^ dir)
        {
            // this sorts the bitonic sequence, hence splitting it
            srcLeft = (srcLeft & mask) | (srcRight & ~mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;

            theArray[i] = (srcLeft & mask) | (srcRight & ~mask);
        }
        else
        {
            srcLeft = (srcLeft & ~mask) | (srcRight & mask);

            srcRight = srcLeft.s23016745;
            pseudomask = (srcLeft.even < srcRight.even) ^ imask11;
            mask = pseudomask.xxyyzzww;

            theArray[i] = (srcLeft & ~mask) | (srcRight & mask);
        }
    }
}

Answer 1

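I finally solved the problem!

The tricky part was the way the original Intel code handles equal values in neighboring pairs of the loaded 4-tuple: it does not handle them explicitly!

The bug was in the first stage (stage = 0) and in the last passOfStage of every other stage (i.e. passOfStage = 0). These parts of the code swap individual 2-tuples inside one 4-tuple (represented by the cl_int8 array theArray).

Let us consider this excerpt (which does not work correctly for equal neighboring 2-tuples within a 4-tuple):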

imask0     = (int4)(0, -1, -1,  0);
srcLeft    = theArray[i];  // int8
srcRight   = srcLeft.s23016745;
pseudomask = (srcLeft.even < srcRight.even) ^ imask0;
mask       = pseudomask.xxyyzzww;
result     = (srcLeft & mask) | (srcRight & ~mask);

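Imagine what happens when we run this (unfixed) code with srcLeft.even = (int4)(7,7, 5,5). The operation srcLeft.even < srcRight.even yields (int4)(0,0,0,0); XORing this result with imask0 then gives pseudomask = (int4)(0,-1,-1,0), i.e. imask0 itself. That, however, is wrong: with this mask, result takes its first key-value pair from srcRight and its second from srcLeft, and both of these are the same original pair, so one pair is duplicated and its neighbor is lost (the same happens in the second half of the 4-tuple).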
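The failure can be reproduced off-device with a lane-by-lane emulation of the excerpt in plain C. This is a sketch, with an `int[8]` standing in for the `int8` and hypothetical function names; the values 100 to 400 are made-up payloads attached to the keys (7,7, 5,5).

```c
#include <assert.h>

/* Emulates `v.s23016745`: swaps neighboring key-value pairs. */
static void swap_neighbour_pairs(const int v[8], int out[8])
{
    static const int idx[8] = {2, 3, 0, 1, 6, 7, 4, 5};
    for (int k = 0; k < 8; ++k)
        out[k] = v[idx[k]];
}

/* Emulates the unfixed excerpt for one int8 element. */
void first_step_unfixed(const int src_left[8], int result[8])
{
    static const int imask0[4] = {0, -1, -1, 0};
    int src_right[8], mask[8];

    assert(src_left != result);  /* input and output must not alias */
    swap_neighbour_pairs(src_left, src_right);
    for (int j = 0; j < 4; ++j) {
        /* pseudomask = (srcLeft.even < srcRight.even) ^ imask0 */
        int pseudomask = ((src_left[2 * j] < src_right[2 * j]) ? -1 : 0) ^ imask0[j];
        mask[2 * j] = mask[2 * j + 1] = pseudomask;  /* .xxyyzzww */
    }
    for (int k = 0; k < 8; ++k)
        result[k] = (src_left[k] & mask[k]) | (src_right[k] & ~mask[k]);
}
```

For the input (7,100, 7,200, 5,300, 5,400) the emulation returns (7,200, 7,200, 5,300, 5,300): the values 100 and 400 are gone, while 200 and 300 appear twice.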

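To avoid this, the value of pseudomask has to follow this pattern: (int4)(a,a, b,b) (where a and b are each either 0 or -1). This means that the following comparison is sufficient to form a correct mask: quasimask = srcLeft.s07 < srcRight.s07. The correct mask is then created as mask = quasimask.xxxxyyyy. The first 2 x's mask the first key-value pair in the first 2-tuple of the 4-tuple (4-tuple = one element of theArray). Since we want to bitmask the whole corresponding 2-tuple (specified by imask0 as the 0, -1 pair), we add another xx. We bitmask the second 2-tuple of the 4-tuple in the same way, which leaves us with yyyy.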
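The idea of taking one decision per pair of neighbors and spreading it over all four lanes can be emulated in plain C as follows. This is a sketch, not the fixed kernel from this answer. It compares lanes 0 and 6, both of which are key lanes in this layout; that lane choice is an assumption of the sketch and differs from the `s07` swizzle quoted above. The function names are hypothetical.

```c
#include <assert.h>

/* Emulates `v.s23016745`: swaps neighboring key-value pairs. */
static void swap_neighbour_pairs(const int v[8], int out[8])
{
    static const int idx[8] = {2, 3, 0, 1, 6, 7, 4, 5};
    for (int k = 0; k < 8; ++k)
        out[k] = v[idx[k]];
}

/* First step with one decision per group of two neighboring pairs. */
void first_step_one_decision(const int src_left[8], int result[8])
{
    int src_right[8];

    assert(src_left != result);  /* input and output must not alias */
    swap_neighbour_pairs(src_left, src_right);

    /* Assumption: lanes 0 and 6 (key lanes) drive the two decisions. */
    int a = (src_left[0] < src_right[0]) ? -1 : 0;  /* key0 < key1: keep first group  */
    int b = (src_left[6] < src_right[6]) ? -1 : 0;  /* key3 < key2: keep second group */

    for (int k = 0; k < 8; ++k) {
        int m = (k < 4) ? a : b;                    /* mask = quasimask.xxxxyyyy */
        result[k] = (src_left[k] & m) | (src_right[k] & ~m);
    }
}
```

Because all four lanes of a group share one mask value, the group is taken either entirely from srcLeft or entirely from srcRight, so no pair can be duplicated. For the tie input (7,100, 7,200, 5,300, 5,400) the result is (7,200, 7,100, 5,400, 5,300), which still contains all four values.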

Visual example of the bit masking using imask11:

srcLeft:                        x  y  z  w
                                <  <  <  <
srcRight [relative to srcLeft]: y  x  w  z
^ imask11:                      0 -1  0 -1
------------------------------------------
(srcLeft<srcRight)^imask11:     x  x  z  z
