Question

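There are plenty of published sorting networks for 4 elements, 6 elements (inputs) ... up to 16 inputs, but I need a 32-input version in order to build a 32x32 shear-sort algorithm (which I plan to use as an OpenCL helper function), which in turn is the basis of a 1024x1024 shear-sort OpenCL algorithm. How can I get my 32-input sorting network?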

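Maybe with some evolutionary algorithm that minimizes the number of swaps, whose result I then use in the OpenCL code?
Are there fixed rules?

Or are they just found by trial and error?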

Input array: 1M elements ----> 1024x1024  2D matrix with inverted odd-rows (shear)

             each row(1024) of matrix  --------> 32 x 32 2D matrix (shear)

                   32 element row ---------> Sorting  (network)

 Each thread computes one row of 1024 elements. So only 1024 threads for 1M element array.
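To make the layout above concrete, here is a minimal serial sketch of shear sort on a small n x n matrix in plain C. The names `sort_strided` and `shear_sort` are hypothetical, and a simple insertion sort stands in for the 32-input network; it only illustrates the phase structure (snake-ordered row sorts alternating with column sorts), not the OpenCL kernel.

```c
#include <stddef.h>

/* Insertion sort over `count` elements spaced `stride` apart.
   A non-zero `descending` sorts from high to low. */
static void sort_strided(int *v, size_t count, size_t stride, int descending)
{
    for (size_t i = 1; i < count; i++) {
        int key = v[i * stride];
        size_t j = i;
        while (j > 0) {
            int prev = v[(j - 1) * stride];
            int out_of_order = descending ? (prev < key) : (prev > key);
            if (!out_of_order)
                break;
            v[j * stride] = prev;
            j--;
        }
        v[j * stride] = key;
    }
}

/* Shear sort of an n x n row-major matrix; the result is left in
   ordinary row-major ascending order. */
void shear_sort(int *m, size_t n)
{
    size_t rounds = 1;                    /* ceil(log2 n) + 1 rounds */
    for (size_t k = 1; k < n; k *= 2)
        rounds++;

    for (size_t p = 0; p < rounds; p++) {
        for (size_t r = 0; r < n; r++)    /* even rows ascending, odd rows descending */
            sort_strided(m + r * n, n, 1, (int)(r & 1));
        for (size_t c = 0; c < n; c++)    /* columns always ascending */
            sort_strided(m + c, n, n, 0);
    }

    for (size_t r = 1; r < n; r += 2)     /* undo the snake order: reverse odd rows */
        for (size_t i = 0, j = n - 1; i < j; i++, j--) {
            int t = m[r * n + i];
            m[r * n + i] = m[r * n + j];
            m[r * n + j] = t;
        }
}
```

In the scheme described here, each row or column sort would be replaced by the next level down (a 32x32 shear sort, then the 32-input network), and the rows of one phase are independent, so they can be handed to separate threads.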

The non-divergent (branch-free) comparison I plan to use in the network is:

     if(a>b)              // where a and b are between 0 and 16M
          swap(a,b)

     becomes

      a0=a; b0=b; // saving

      c = a-b
      d = !(sign bit of c)  (0 for negative,  1 for positive)
      tmp=a*d;      //tmp=a if a>=b  otherwise 0
      a=b*d;        //a=b   if a>=b  otherwise 0
      b=tmp;        //b=tmp if a>=b  otherwise 0

      // a0 is backup of a, b0 is backup of b
      e = (sign bit of c)  (1 for negative,  0 for positive)
      a0=a0*e;      //a0 is kept if a0<b0  otherwise 0
      b0=b0*e;      //b0 is kept if a0<b0  otherwise 0

      aOut=a+a0;      // only a or a0 can be different than zero
      bOut=b+b0;      // only b or b0 can be different than zero
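The same idea can be written compactly in C. This is only a sketch under the stated assumption that both values lie between 0 and 16M, so that the subtraction cannot overflow a 32-bit signed integer; the function name `compare_exchange` is hypothetical.

```c
#include <stdint.h>

/* Branch-free compare-exchange: on return *a <= *b.
   Assumes both values lie in [0, 2^24], so y - x cannot overflow. */
static void compare_exchange(int32_t *a, int32_t *b)
{
    int32_t x = *a, y = *b;
    /* Sign bit of (y - x): 1 when x > y (a swap is needed), else 0. */
    int32_t swap = (int32_t)((uint32_t)(y - x) >> 31);
    int32_t keep = 1 - swap;
    *a = x * keep + y * swap;   /* the smaller value */
    *b = y * keep + x * swap;   /* the larger value */
}
```

Exactly one of `keep` and `swap` is 1, so each output is a single selected input, which matches the "only one term can be non-zero" reasoning in the pseudocode.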

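I'm sure this is not the fastest approach, but I need a quick and easy sort to try out my particle constraint solver, which is screaming for sorting by fixed spatial index (grid). I have 1M particles and I am trying a shear / network / shear sort.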

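To verify the shear sort, I implemented a serial 32-input bitonic sorter on a per-thread basis to build a 32x32 matrix in which every column and row is sorted. Sorting 32x32 = 1024 elements this way takes 9 ms, which is too slow for 32 cores @ 700 MHz.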

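A 1024-element sort takes 9 ms, and sorting the 1M array needs at least 20 iterations on top of the 1024 such sorts. Even if it came down to 90 ms, that would be too slow for just the keys, and there will be many values bound to each key (more than 100).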

I tried bubble sort instead of bitonic and got 10 ms, so the problem must be in the shear-sort implementation?

Answer 1

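The network currently known to sort 32 elements with the fewest compare-exchange units works as follows: sort the first 16 elements with the most efficient sorting network of size 16, do the same for the next 16 elements, then apply the merge step from Batcher's odd-even mergesort. That gives the following pairs of compare-exchange units: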

Sort the first half of the array:

[0,1],[2,3],[4,5],[6,7],[8,9],[10,11],[12,13],[14,15],
[0,2],[4,6],[8,10],[12,14],[1,3],[5,7],[9,11],[13,15],
[0,4],[8,12],[1,5],[9,13],[2,6],[10,14],[3,7],[11,15],
[0,8],[1,9],[2,10],[3,11],[4,12],[5,13],[6,14],[7,15],
[5,10],[6,9],[3,12],[13,14],[7,11],[1,2],[4,8],
[1,4],[7,13],[2,8],[11,14],[5,6],[9,10],
[2,4],[11,13],[3,8],[7,12],
[6,8],[10,12],[3,5],[7,9],
[3,4],[5,6],[7,8],[9,10],[11,12],
[6,7],[8,9]

Sort the second half of the array:

[16,17],[18,19],[20,21],[22,23],[24,25],[26,27],[28,29],[30,31],
[16,18],[20,22],[24,26],[28,30],[17,19],[21,23],[25,27],[29,31],
[16,20],[24,28],[17,21],[25,29],[18,22],[26,30],[19,23],[27,31],
[16,24],[17,25],[18,26],[19,27],[20,28],[21,29],[22,30],[23,31],
[21,26],[22,25],[19,28],[29,30],[23,27],[17,18],[20,24],
[17,20],[23,29],[18,24],[27,30],[21,22],[25,26],
[18,20],[27,29],[19,24],[23,28],
[22,24],[26,28],[19,21],[23,25],
[19,20],[21,22],[23,24],[25,26],[27,28],
[22,23],[24,25]

Odd-even merge the two halves of the array:

[0, 16],
[8, 24],
[8, 16],
[4, 20],
[12, 28],
[12, 20],
[4, 8],
[12, 16],
[20, 24],
[2, 18],
[10, 26],
[10, 18],
[6, 22],
[14, 30],
[14, 22],
[6, 10],
[14, 18],
[22, 26],
[2, 4],
[6, 8],
[10, 12],
[14, 16],
[18, 20],
[22, 24],
[26, 28],
[1, 17],
[9, 25],
[9, 17],
[5, 21],
[13, 29],
[13, 21],
[5, 9],
[13, 17],
[21, 25],
[3, 19],
[11, 27],
[11, 19],
[7, 23],
[15, 31],
[15, 23],
[7, 11],
[15, 19],
[23, 27],
[3, 5],
[7, 9],
[11, 13],
[15, 17],
[19, 21],
[23, 25],
[27, 29],
[1, 2],
[3, 4],
[5, 6],
[7, 8],
[9, 10],
[11, 12],
[13, 14],
[15, 16],
[17, 18],
[19, 20],
[21, 22],
[23, 24],
[25, 26],
[27, 28],
[29, 30]
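Pair lists like the ones above can be stored in a table and run by a small driver. The sketch below also shows one way to check a network with the zero-one principle (a comparator network sorts every input if and only if it sorts every input made of 0s and 1s). The names `Comparator`, `apply_network` and `sorts_all_01` are hypothetical, and the exhaustive check is only practical for small networks such as each 16-input half; 2^32 cases for the full 32-input network would take far longer.

```c
#include <stddef.h>
#include <stdint.h>

/* One compare-exchange unit: afterwards v[lo] <= v[hi]. */
typedef struct { uint8_t lo, hi; } Comparator;

/* Run every comparator of the network in order on v. */
static void apply_network(int *v, const Comparator *net, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        int a = v[net[k].lo], b = v[net[k].hi];
        v[net[k].lo] = a < b ? a : b;
        v[net[k].hi] = a < b ? b : a;
    }
}

/* Zero-one principle check over all 2^n inputs of 0s and 1s.
   Requires n < 32; returns 1 if every input ends up sorted, else 0. */
static int sorts_all_01(const Comparator *net, size_t count, unsigned n)
{
    int v[32];
    for (uint32_t bits = 0; bits < (1u << n); bits++) {
        for (unsigned i = 0; i < n; i++)
            v[i] = (int)((bits >> i) & 1u);
        apply_network(v, net, count);
        for (unsigned i = 1; i < n; i++)
            if (v[i - 1] > v[i])
                return 0;
    }
    return 1;
}
```

For example, the well-known 4-input network `{0,1},{2,3},{0,2},{1,3},{1,2}` passes this check, while the same network without its last comparator does not.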

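I generated the odd-even merge index pairs above with the oddeven_merge algorithm given on the Wikipedia page. I can't guarantee that it will be faster than what you already have, but it at least lowers the number of compare-exchange units from 191 (for a full Batcher odd-even mergesort) to 185. From the research papers I have read on the subject, it seems that no sorting network with fewer than 185 comparators is currently known for 32 elements.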
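For reference, the merge step can be generated with a few lines of C. This is a sketch ported from the recursive oddeven_merge routine on the Wikipedia page for Batcher's odd-even mergesort, where `hi` is an inclusive upper index; the `Pair` type and the output-buffer convention are assumptions made for this example.

```c
#include <stddef.h>

typedef struct { int lo, hi; } Pair;

/* Appends the comparators that merge the two sorted halves of the
   index range [lo, hi] (hi inclusive) with stride r to out[n...].
   Returns the new number of pairs stored in out. */
static size_t oddeven_merge(int lo, int hi, int r, Pair *out, size_t n)
{
    int step = r * 2;
    if (step < hi - lo) {
        n = oddeven_merge(lo, hi, step, out, n);        /* even subsequence */
        n = oddeven_merge(lo + r, hi, step, out, n);    /* odd subsequence */
        for (int i = lo + r; i < hi - r; i += step) {   /* final clean-up layer */
            out[n].lo = i;
            out[n].hi = i + r;
            n++;
        }
    } else {
        out[n].lo = lo;
        out[n].hi = lo + r;
        n++;
    }
    return n;
}
```

Calling `oddeven_merge(0, 31, 1, pairs, 0)` with a buffer of at least 65 entries yields 65 pairs, starting with [0,16], [8,24], [8,16] and ending with [29,30], in the same order as the list above. Together with the 60 comparators for each half this gives 60 + 60 + 65 = 185.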

Answer 2

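I found some valuable information on something called Hasse diagrams. I read up on it, but working out a solution from there would take a while (maybe I never will), so I searched and found the ready-made solution below: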

Network for N = 32, using Batcher's merge-exchange (191 comparators): http://jgamble.ripco.net/cgi-bin/nw.cgi?inputs=32&algorithm=batcher&output=svg

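Using it to sort 32 arrays of 32 elements each (no for loop, just hand-written comparators) and then applying shear sort on 32 cores cut the kernel time from 9 ms to 2 ms, compared with bitonic sort (for loop) + shear sort (for loop and parallel).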

Going from the loop-based bitonic sorter to the unrolled Batcher merge-exchange network is roughly a 4x speedup.

From this:

    void MergeSort32(int * RegisterArray, int dir)
    {
        int n=32;
        for (int s=2; s <= n; s*=2)
        {
            for (int i=0; i < n;)
            {
                merge_up((RegisterArray+i),s,dir);
                merge_down((RegisterArray+i+s),s,dir);
                i += s*2;
            }
        }
    }

To this:

  swapx(arr,0,16,dir);
  swapx(arr,1,17,dir);
  swapx(arr,2,18,dir);
  swapx(arr,3,19,dir);
  swapx(arr,4,20,dir);
  swapx(arr,5,21,dir);
  swapx(arr,6,22,dir);
  swapx(arr,7,23,dir);
  ...
  ... 191 lines of branchless comparator-swappers
  (the branching version crashes for some reason, maybe because of
   the 382 inlined if statements per core;
   the xor swap idiom shouldn't be a problem.)
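The body of swapx is not shown above. As an illustration only, a branch-free version built on an XOR mask could look like the sketch below. The meaning of `dir` (non-zero for ascending, zero for descending) and the `int32_t` element type are assumptions, not taken from the original kernel.

```c
#include <stdint.h>

/* Hypothetical branch-free compare-and-swap on arr[i], arr[j].
   Assumed convention: dir != 0 leaves arr[i] <= arr[j],
   dir == 0 leaves arr[i] >= arr[j]. */
static void swapx(int32_t *arr, int i, int j, int dir)
{
    int32_t a = arr[i], b = arr[j];
    /* 1 when the pair is in the wrong order for the requested direction. */
    int32_t wrong = (a > b) ^ ((a != b) & (dir == 0));
    int32_t mask = -wrong;            /* all ones if a swap is needed, else 0 */
    int32_t t = (a ^ b) & mask;       /* a ^ b when swapping, 0 otherwise */
    arr[i] = a ^ t;
    arr[j] = b ^ t;
}
```

The comparisons usually compile to flag-setting or select instructions rather than jumps, but that depends on the compiler and target, so it is worth checking the generated code.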

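But the OpenCL compile time went up from 0.5 seconds to 7.5 seconds. Adding -cl-opt-disable to the build options makes it compile seemingly forever, until I press Ctrl + Alt + Del. So apparently the default optimizations also speed up the compilation itself.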

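Edit: Shear sort (1M) built from shear sorts (1k each) built from Batcher's networks (32 elements each): 0.341 seconds. HD7870, using 64 threads per workgroup. The output is verified as sorted, but it is much slower than a single-core CPU shell sort (0.050 seconds).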

How are sorting networks hand-crafted?
