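Sorting takes O(n log n) time serially. If we had O(n) processors, we would hope for a linear speedup. O(log n) parallel algorithms exist, but they have very high constants. They also don't apply to commodity hardware, which has nowhere near O(n) processors. With p processors, a reasonable algorithm should take O((n/p) log n) time.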
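To make the target explicit, here is the speedup arithmetic behind that bound. It is a sketch only: the constant c is a hypothetical placeholder, assumed to be the same in the serial and parallel bounds.

```latex
% Assumption: c is a hypothetical constant shared by both running times.
T_1(n) = c\,n\log n, \qquad T_p(n) = c\,\frac{n}{p}\log n
\quad\Longrightarrow\quad
S(p) = \frac{T_1(n)}{T_p(n)} = p, \qquad E(p) = \frac{S(p)}{p} = 1 .
```

Setting p = n gives T_n(n) = c log n, which has the same form as the O(log n) algorithms mentioned above, except that their real constants are much larger.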
Answer 0 (score: 188)
Parallel sorting algorithms on various architectures
Improvements on sample sort
Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations
GPU and hybrid CPU/GPU sources and papers:
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison
Answer 1 (score: 6)
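I have worked with both a Parallel Quicksort algorithm and the PSRS (Parallel Sorting by Regular Sampling) algorithm, which essentially combines parallel quicksorts of local blocks with a merge step.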
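With the Parallel Quicksort algorithm, I have demonstrated near-linear speedup on up to 4 cores (a dual core with hyper-threading), which is expected given the limitations of the algorithm. A pure Parallel Quicksort relies on a shared stack resource, which causes contention among threads and so reduces any performance gain. The advantage of this algorithm is that it sorts in place, which reduces the amount of memory needed. You may want to take this into account when sorting upwards of 100M elements, as you stated.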
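A minimal in-place sketch of the idea, for illustration only. This is not the code from my repository: it swaps the shared stack for the JDK fork/join framework (per-worker deques with work stealing), sorts a plain `int[]`, and the class name and `SEQUENTIAL_CUTOFF` value are hypothetical choices.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch only; names and the cutoff value are assumptions.
final class ParallelQuicksortSketch {
    // Hypothetical cutoff: ranges shorter than this are sorted sequentially.
    private static final int SEQUENTIAL_CUTOFF = 1 << 13;

    // Sorts the array in place using the common fork/join pool.
    static void sort(int[] a) {
        ForkJoinPool.commonPool().invoke(new SortTask(a, 0, a.length - 1));
    }

    private static final class SortTask extends RecursiveAction {
        private final int[] a;
        private final int lo;
        private final int hi; // inclusive bounds

        SortTask(int[] a, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo < SEQUENTIAL_CUTOFF) {
                Arrays.sort(a, lo, hi + 1); // small range: no more forking
                return;
            }
            int split = partition(a, lo, hi);
            // The two halves touch disjoint index ranges, so they can run in parallel.
            invokeAll(new SortTask(a, lo, split), new SortTask(a, split + 1, hi));
        }
    }

    // Hoare partition around the middle element. Returns j such that every
    // element of a[lo..j] is <= every element of a[j+1..hi].
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1;
        int j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) {
                return j;
            }
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        sort(data);
        System.out.println(Arrays.equals(expected, data));
    }
}
```

The only extra memory is the recursion itself, which is the in-place property described above. The first partition pass still runs on a single thread, which is one reason speedup tends to flatten as cores are added.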
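I see you are looking to sort on a system with 8-32 cores. The PSRS algorithm avoids contention on a shared resource, allowing speedup with a larger number of processes. I have demonstrated the algorithm on up to 4 cores, as above, but experimental results from others report near-linear speedup with much larger numbers of cores, 32 and beyond. The disadvantage of the PSRS algorithm is that it is not in-place and requires considerably more memory.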
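A rough sketch of the PSRS phases, again for illustration and not the code from my repository. Assumptions: it works on `int[]`, uses parallel streams for the per-block work, and the class and method names are hypothetical. To stay short, the last phase re-sorts each gathered bucket with `Arrays.sort` instead of doing a proper p-way merge.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

// Illustrative PSRS sketch; names are assumptions, and the final phase is simplified.
final class PsrsSketch {
    // Returns a sorted copy of input, using p logical workers.
    static int[] sort(int[] input, int p) {
        int n = input.length;
        int[] a = input.clone();
        if (p < 2 || n < p * p) { // too small for regular sampling to be useful
            Arrays.sort(a);
            return a;
        }
        // Phase 1: block i covers [start[i], start[i+1]) and is sorted locally.
        int[] start = new int[p + 1];
        for (int i = 0; i <= p; i++) start[i] = (int) ((long) n * i / p);
        IntStream.range(0, p).parallel()
                 .forEach(i -> Arrays.sort(a, start[i], start[i + 1]));

        // Phase 2: take p regularly spaced samples per block, sort the p*p samples,
        // and pick p-1 pivots at regular positions.
        int[] samples = new int[p * p];
        for (int i = 0; i < p; i++) {
            int len = start[i + 1] - start[i];
            for (int k = 0; k < p; k++)
                samples[i * p + k] = a[start[i] + (int) ((long) len * k / p)];
        }
        Arrays.sort(samples);
        int[] pivots = new int[p - 1];
        for (int k = 1; k < p; k++) pivots[k - 1] = samples[k * p + p / 2 - 1];

        // Phase 3: cut[i][j] is the index in block i where bucket j begins.
        int[][] cut = new int[p][p + 1];
        for (int i = 0; i < p; i++) {
            cut[i][0] = start[i];
            cut[i][p] = start[i + 1];
            for (int j = 1; j < p; j++)
                cut[i][j] = upperBound(a, cut[i][j - 1], start[i + 1], pivots[j - 1]);
        }

        // Phase 4: bucket j gathers its slice from every block into a separate
        // output array (this is the extra memory), then orders it.
        int[] outStart = new int[p + 1];
        for (int j = 0; j < p; j++) {
            int size = 0;
            for (int i = 0; i < p; i++) size += cut[i][j + 1] - cut[i][j];
            outStart[j + 1] = outStart[j] + size;
        }
        int[] out = new int[n];
        IntStream.range(0, p).parallel().forEach(j -> {
            int pos = outStart[j];
            for (int i = 0; i < p; i++) {
                int len = cut[i][j + 1] - cut[i][j];
                System.arraycopy(a, cut[i][j], out, pos, len);
                pos += len;
            }
            Arrays.sort(out, outStart[j], outStart[j + 1]); // stand-in for a p-way merge
        });
        return out;
    }

    // First index in [from, to) whose element is greater than key.
    private static int upperBound(int[] a, int from, int to, int key) {
        int lo = from;
        int hi = to;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        System.out.println(Arrays.equals(expected, sort(data, 8)));
    }
}
```

After the pivots are chosen, each worker reads and writes only its own slices, so there is no shared structure to contend for. The cost is the working copy plus the second array of n elements.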
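If you are interested, you can use or peruse my Java code for each of these algorithms. You can find it on github: https://github.com/broadbear/sort. The code is intended as a drop-in replacement for Java's Collections.sort(). If you are looking for a way to perform parallel sorting in a JVM, the code in my repo may help. The API is fully generic for elements that implement Comparable or for which you supply your own Comparator.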
Answer 2 (score: 4)
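Take a look at this paper: A Scalable Parallel Sorting Algorithm Using Exact Splitting. It targets more than 32 cores. However, it describes in detail an algorithm with a running-time complexity of O(n/p * log(n) + p * log(n)**2) that works with arbitrary comparators.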
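To get a feel for the two terms in that bound, here is a rough calculation. It assumes n = 10^8 elements and p = 32 cores (the sizes mentioned in this thread), base-2 logarithms, and it ignores the hidden constants, so treat it as an order-of-magnitude estimate only.

```latex
% Assumptions: n = 10^8, p = 32, \log = \log_2, hidden constants ignored.
\begin{align*}
\log_2 n &\approx 26.6 \\
\frac{n}{p}\,\log_2 n &\approx 3.125\times 10^{6} \times 26.6 \approx 8.3\times 10^{7} \\
p\,\log_2^{2} n &\approx 32 \times 706 \approx 2.3\times 10^{4}
\end{align*}
% The two terms balance when n/p \approx p\,\log_2 n,
% i.e. p \approx \sqrt{n/\log_2 n} \approx 1.9\times 10^{3}.
```

Under these assumptions the p * log(n)**2 term is negligible at 8-32 cores and would only start to matter at a few thousand cores.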
Answer 3 (score: 2)