Which parallel sorting algorithm has the best average case performance

algorithmconcurrencysorting

Sorting takes O(n log n) in the serial case. If we have O(n) processors we would hope for a linear speedup. O(log n) parallel algorithms exist but they have a very high constant. They also aren't applicable on commodity hardware which doesn't have anywhere near O(n) processors. With p processors, reasonable algorithms should take O(n/p log n) time.

In the serial case, quick sort has the best runtime complexity on average. A parallel quick sort algorithm is easy to implement (see here and here). However it doesn't perform well since the very first step is to partition the whole collection on a single core. I have found information on many parallel sort algorithms but so far I have not seen anything pointing to a clear winner.

I'm looking to sort lists of 1 million to 100 million elements in a JVM language running on 8 to 32 cores.

Best Answer

The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:

Parallel sorting algorithms on various architectures

According to the article, sample sort seems to be best on many parallel architecture types.

Update to address Mark's concern of age:

Here are more recent articles introducing something more novel (from 2007, which, btw, still get compared with sample sort):

Improvements on sample sort
AA-Sort

The bleeding edge (circa 2010, some only a couple months old):

Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach

Update for 2013: Here is the bleeding edge circa January, 2013. (Note: A few of the links are to papers at Citeseer and require registration which is free):

University lectures:
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3

Other sources and papers:
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations

GPU and CPU/GPU hybrid sources and papers:
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison

Update for 2021: I have not forgotten this answer and like all things computer related, it has not aged well. I will do my best to update and refresh it for current trends as well as the state of the art, at some point prior to the end of this year (2021).

Related Solutions

.net – the best algorithm for overriding GetHashCode

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}

Note that one thing to be aware of is that ideally you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

You can compute the hash code from fields that are not mutable; or

You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

Is there ever a good reason to use Insertion Sort

From http://www.sorting-algorithms.com/insertion-sort:

Although it is one of the elementary sorting algorithms with O(n²) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).

For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.

Best Answer

Related Solutions

.net – the best algorithm for overriding GetHashCode

Is there ever a good reason to use Insertion Sort

Related Topic