Which parallel sorting algorithm has the best average case performance?

The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:

Parallel sorting algorithms on various architectures

According to the article, sample sort seems to be best on many parallel architecture types.

Update to address Mark’s concern of age:

Here are more recent articles introducing something more novel (from 2007, which, btw, still get compared with sample sort):

Improvements on sample sort
AA-Sort

The bleeding edge (circa 2010, some only a couple months old):

Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach

Update for 2013:
Here is the bleeding edge circa January, 2013. (Note: A few of the links are to papers at Citeseer and require registration which is free):

University lectures:
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3

Other sources and papers:

A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations

GPU and CPU/GPU hybrid sources and papers:

An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison

Update for 2022: I have not forgotten this answer and like all things computer related, it has not aged well. I will do my best to update and refresh it for current trends as well as the state of the art, at some point prior to the end of this year (2022). If you have interest in this topic and would like to see the update sooner, please either reply to or better yet, upvote the comment I made below this answer, so that I can gauge interest in this topic over others that also are in need of an update.

Leave a Comment