Question

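I have been given an assignment to parallelize bubble sort and implement it using CUDA. I don't see how bubble sort can be parallelized. I think it is inherently sequential, because it compares two consecutive elements and swaps them after a conditional branch. Thoughts, anyone?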

Answer 1

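To be honest, I also had a hard time thinking of a way to parallelize bubble sort. My first idea was a hybrid sort: tile the array, bubble sort each tile, and then merge the tiles (if you can get it to work, this would probably still improve performance). However, I searched for "Parallel Bubble Sort" and found this page. If you scroll down, you will find the following parallel bubble sort algorithm: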

For k = 0 to n-1
If k is even then
    for i = 0 to floor(n/2)-1 do in parallel
        If A[2i] > A[2i+1] then
            Exchange A[2i] ↔ A[2i+1]
Else
    for i = 0 to floor((n-1)/2)-1 do in parallel
        If A[2i+1] > A[2i+2] then
            Exchange A[2i+1] ↔ A[2i+2]
Next k

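You could run the outer for loop on the CPU and launch a kernel for each "do in parallel". This looks efficient for large arrays but is probably overkill for small ones. If you are writing a CUDA implementation, large arrays are presumably the target. Since the swaps inside these kernels only involve adjacent pairs of elements, you should be able to tile the work accordingly. I searched for generic, non-GPU-specific parallel bubble sorts, and this is the only one I could find.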
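The structure described above can be sketched on the CPU in plain C++. This is only an illustration under assumptions: the function names `comparePairs` and `oddEvenSort` are made up for this example, and the inner loop stands in for what a kernel would do with one thread per pair. It runs n phases, which is enough to guarantee a sorted result.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One phase of the algorithm. Every iteration touches a disjoint pair,
// so on a GPU each iteration could be handled by its own thread.
// `offset` is 0 for an even phase and 1 for an odd phase.
void comparePairs(std::vector<int>& a, std::size_t offset) {
    for (std::size_t left = offset; left + 1 < a.size(); left += 2) {
        if (a[left] > a[left + 1]) {
            std::swap(a[left], a[left + 1]);
        }
    }
}

// Sequential outer loop (the part that would stay on the CPU):
// alternate even and odd phases, n phases in total.
void oddEvenSort(std::vector<int>& a) {
    for (std::size_t k = 0; k < a.size(); ++k) {
        comparePairs(a, k % 2);
    }
}
```

In a CUDA version, each call to `comparePairs` would become one kernel launch, and the return from that launch would act as the barrier between phases.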

I did find a (very slightly) helpful visualization here, which can be seen below. I would like to discuss this more in the comments.

Edit: I found another parallel version of bubble sort called Cocktail Shaker Sort. Here is the pseudocode:

procedure cocktailShakerSort( A : list of sortable items ) defined as:
  do
    swapped := false
    for each i in 0 to length( A ) - 2 do:
      if A[ i ] > A[ i + 1 ] then // test whether the two elements are in the wrong order
        swap( A[ i ], A[ i + 1 ] ) // let the two elements change places
        swapped := true
      end if
    end for
    if not swapped then
      // we can exit the outer loop here if no swaps occurred.
      break do-while loop
    end if
    swapped := false
    for each i in length( A ) - 2 to 0 do:
      if A[ i ] > A[ i + 1 ] then
        swap( A[ i ], A[ i + 1 ] )
        swapped := true
      end if
    end for
  while swapped // if no elements have been swapped, then the list is sorted
end procedure

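It looks like this one also has two for loops that compare adjacent elements in a bubbly fashion. The two algorithms look somewhat similar: the first one (which I have now learned is called odd-even sort) assumes the array is sorted and lets the for loops set that flag to false, whereas cocktail shaker sort conditionally checks for sortedness after each loop.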

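The odd-even sort code included in this post seems to simply run the loop enough times to guarantee a sorted result, whereas the Wikipedia pseudocode checks. A possible first pass would be to implement this post's algorithm and then optimize it with the check, although with CUDA the check may actually be slower.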
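To make the difference concrete, here is a minimal CPU sketch of the checking variant in plain C++. The names `comparePairsChecked` and `oddEvenSortEarlyExit` are hypothetical. The loop stops once a full even-plus-odd round performs no swap, which means every adjacent pair is already in order.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Runs one phase and reports whether any pair was swapped.
bool comparePairsChecked(std::vector<int>& a, std::size_t offset) {
    bool swapped = false;
    for (std::size_t left = offset; left + 1 < a.size(); left += 2) {
        if (a[left] > a[left + 1]) {
            std::swap(a[left], a[left + 1]);
            swapped = true;
        }
    }
    return swapped;
}

// Repeats even+odd rounds until a whole round makes no swap.
// Returns the number of phases that were actually executed.
std::size_t oddEvenSortEarlyExit(std::vector<int>& a) {
    std::size_t phases = 0;
    bool swapped = true;
    while (swapped) {
        // Both phases must run every round, so evaluate them separately
        // instead of relying on a short-circuiting `||`.
        const bool evenSwapped = comparePairsChecked(a, 0);
        const bool oddSwapped = comparePairsChecked(a, 1);
        swapped = evenSwapped || oddSwapped;
        phases += 2;
    }
    return phases;
}
```

On a GPU the `swapped` flag would have to live in device memory and be copied back to the host after every round, which is one plausible reason the check could end up slower than simply running a fixed number of phases.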

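Whichever variant you pick, the sort will be slow. Here is an odd-even sort reference, FYI, although it does not help much. Its authors conclude that the approach is ineffective for small arrays and really emphasize its failure.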

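Are you looking for specific CUDA code, or is this enough? It sounds like you wanted an overview of the possible options and an understanding of how a CUDA implementation would work.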

Answer 2

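TL;DR

For a complete implementation of a generic parallel bubble sort, see generic-bubble-sort.cu. "Generic" here means that the algorithm can sort elements of any type, as long as you provide a comparator.

At best

Using a number of threads that is linearly proportional to N (e.g. N/2), you can get a parallel bubble sort with a time complexity of O(N), where N is the size of the array you want to sort.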
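The O(N) figure refers to the number of sequential steps, not to the total amount of work. The following C++ sketch counts both, assuming one hypothetical thread per pair so that each phase costs a single step. `Cost` and `oddEvenCost` are names invented for this illustration.

```cpp
#include <cstddef>

struct Cost {
    std::size_t steps;        // sequential phases, i.e. parallel running time
    std::size_t comparisons;  // total compare-and-swap operations (work)
};

// Counts the cost of running N alternating even/odd phases over N elements.
Cost oddEvenCost(std::size_t n) {
    Cost cost{0, 0};
    for (std::size_t phase = 0; phase < n; ++phase) {
        const std::size_t offset = phase % 2;
        // Pairs (offset, offset+1), (offset+2, offset+3), ... that fit in n.
        const std::size_t pairs = (n - offset) / 2;
        cost.steps += 1;
        cost.comparisons += pairs;
    }
    return cost;
}
```

For N = 5 this gives 5 steps and 10 comparisons. In general the work is N(N-1)/2 comparisons, the same order as sequential bubble sort. The parallel version spreads that work over the threads rather than reducing it.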

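Hint

It may not be obvious at first, but if you look closely you will realize that all a sequential bubble sort does is swap pairs of elements that are out of order, one pair at a time!

Since disjoint pairs can be sorted independently of one another, a parallel bubble sort can take advantage of that independence.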

One approach

Suppose we want to sort the following array:

[7][1][3][2][0]

We start from the initial unsorted array and treat every two elements array[i] and array[i+1] as an independent pair. For the first iteration, i will be an even index, so our pairs are:

{ {array[0], array[1]} , {array[2],array[3]}, ...}

Then we swap the two elements of each pair if they are out of order.

  [7][1][3][2][0]     <-- Unsorted array of 5 elements

[7][1]  [3][2]  [0]   <-- A set of independent pairs.

[7][1]  [3][2]  [0] --┑ Sorting first set of pairs
                      |
[1][7]  [2][3]  [0] <-┛ starting from an even idx

After the first iteration, our array looks like this:

  [1][7][2][3][0]     <-- Result after first iteration

We now go for a second iteration, but unlike before, we sort pairs starting from an ODD index. Note that elements without a partner are left out.

{ {array[1], array[2]} , {array[3],array[4]}, ...}

  [1][7][2][3][0]     <-- Result after first iteration

[1]  [7][2]  [3][0] --┑ Sorting second set of pairs
                      |
[1]  [2][7]  [0][3] <-┛ starting from an odd idx

  [1][2][7][0][3]     <-- Result after second iteration

After N EVEN/ODD pair-sorting iterations, our array will be sorted.

[1][2]  [7][0]  [3] --┑
[1][2]  [0][7]  [3]   |
                      |
  [1][2][0][7][3]     |
                      | The whole parallel sorting
[1]  [2][0]  [7][3]   | will converge after N iterations
[1]  [0][2]  [3][7]   | So we keep sorting pairs for 3 more
                      | iterations.
  [1][0][2][3][7]     |
                      |
[1][0]  [2][3]  [7]   |
[0][1]  [2][3]  [7] <-┛

  [0][1][2][3][7]     <-- Sorted array!

Parallel bubble sort using CUDA

A straightforward CUDA implementation of the above approach would go as follows:

Each thread will be responsible for sorting a single pair
You will need N/2 threads
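Thread synchronization is something we need to take care of, because warps do not run in lockstep and every iteration must finish before the next one starts
- Using a single block: if our threads fit in a single block, we simply call __syncthreads() after each iteration, and we can take advantage of shared memory by fitting the whole array there.
- Using multiple blocks: we have to ensure synchronization across all threads of the kernel. We can only perform one iteration per kernel launch, and then launch our kernel N times. The bad news is that, since shared memory only lives as long as a kernel launch, we can only use global memory to hold our array.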
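The multi-block option can be sketched on the CPU to show how pairs map to blocks and threads. This is a plain C++ stand-in, not CUDA: `phaseBlock` and `multiBlockSort` are hypothetical names, the parameters `blockIdx`, `blockDim` and the local `gridDim` only mirror the CUDA built-ins, and the loop over blocks plays the role of one kernel launch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU stand-in for one thread block of a hypothetical per-phase kernel.
// Thread t of block b handles the pair that starts at
// 2 * (b * blockDim + t) + offset.
void phaseBlock(std::vector<int>& a, std::size_t offset,
                std::size_t blockIdx, std::size_t blockDim) {
    for (std::size_t t = 0; t < blockDim; ++t) {
        const std::size_t left = 2 * (blockIdx * blockDim + t) + offset;
        // Threads whose pair falls outside the array do nothing.
        if (left + 1 < a.size() && a[left] > a[left + 1]) {
            std::swap(a[left], a[left + 1]);
        }
    }
}

// One simulated "kernel launch" per phase. Finishing the loop over blocks
// models the grid-wide synchronization that happens between launches.
// blockDim must be greater than zero.
void multiBlockSort(std::vector<int>& a, std::size_t blockDim) {
    const std::size_t pairs = a.size() / 2;
    const std::size_t gridDim = (pairs + blockDim - 1) / blockDim;  // ceiling
    for (std::size_t phase = 0; phase < a.size(); ++phase) {
        for (std::size_t b = 0; b < gridDim; ++b) {
            phaseBlock(a, phase % 2, b, blockDim);
        }
    }
}
```

Because the blocks of one phase touch disjoint pairs, the order in which they run does not matter, which is why a real kernel could execute them concurrently without any synchronization inside a phase.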

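Some code

Here is a simple implementation of the above description, considering only a single block. The whole code is available in this repo.

template<typename T>
__global__ void bubbleSort(T* v, const unsigned int n, ShouldSwap<T> shouldSwap) {
    const unsigned int tIdx = threadIdx.x;

    // N iterations, alternating between even (offset 0) and odd (offset 1) pairs
    for (unsigned int i = 0; i < n; i++) {
        unsigned int offset = i % 2;
        unsigned int leftIndex = 2 * tIdx + offset;
        unsigned int rightIndex = leftIndex + 1;

        // threads whose pair falls outside the array skip the comparison
        if (rightIndex < n) {
            if (shouldSwap(v[leftIndex], v[rightIndex])) {
                swap<T>(&v[leftIndex], &v[rightIndex]);
            }
        }
        // every thread of the block must finish this iteration before the next one
        __syncthreads();
    }
}

If you want to see the implementations of swap and ShouldSwap, here is the code:

`swap`

A device function for swapping elements.

template<typename T>
__host__ __device__ __inline__ void swap(T* a, T* b) {
    T tmp = *a;
    *a = *b;
    *b = tmp;
}

`ShouldSwap`

A C++ functor used as a generic comparator.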
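Since the comparator's source is not reproduced here, the following plain C++ sketch shows one way such a functor could look and how the sorting loop uses it. Everything in it is an assumption for illustration: `SwapIfGreater` and `bubbleSortHost` are hypothetical names, and the real ShouldSwap in the repo may be defined differently.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical comparator: returns true when the pair is out of order
// for an ascending sort.
template <typename T>
struct SwapIfGreater {
    bool operator()(const T& left, const T& right) const { return left > right; }
};

// Host-side version of the same even/odd loop, with the thread index
// replaced by an ordinary loop variable.
template <typename T, typename Compare>
void bubbleSortHost(std::vector<T>& v, Compare shouldSwap) {
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t offset = i % 2;
        for (std::size_t tIdx = 0; 2 * tIdx + offset + 1 < n; ++tIdx) {
            const std::size_t leftIndex = 2 * tIdx + offset;
            if (shouldSwap(v[leftIndex], v[leftIndex + 1])) {
                std::swap(v[leftIndex], v[leftIndex + 1]);
            }
        }
    }
}
```

Calling `bubbleSortHost(values, SwapIfGreater<double>{})` sorts in ascending order. Passing a comparator that returns `left < right` instead would sort in descending order, which is the flexibility the generic version is after.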

Bonus

Be sure to check github.com/master-hpc for more CUDA starter examples.
