Carl Nolan’s ramblings on development
As I mentioned in my previous post, Array.Parallel sort functions demonstrating a Merge Sort using Barrier, I wanted to continue the parallel sort examples with a Quicksort using the Task Parallel Libraries.
F#, as do all functional languages, lend themselves easily to Quicksort implementations. There are many variations of an F# Quicksort, here are a few:
With continuations, to ensure calls are tail recursive:
The fundamental issue with these implementations is that they are not Quick. The reason for their slowness, for large list/array sorts, is the amount of allocations that need to occur when performing the partitioning; the original list/array is not partitioned/sorted in-place. As an example, to sort a list of 5 Million floats takes just under a minute.
One other issue is that these implementations do not lend themselves to being easily parallelized; there is an implicit dependency on the order of the yield operations. Hence to resolve the sort in-place and parallelize problem we need to tackle several issues:
The approach for the Quicksort will be similar to that used for the Merge sort; depending on whether a comparer or projection is specified. If one does not specify a projection the original array is sorted in-place using either a structural comparison or the specified comparer.
If a projection is specified, a separate array is defined for the projected keys. Sorting and comparing is then performed on this keys array. When elements in the array are swapped then both the keys and original array elements are swapped. There is obviously overhead in performing two swaps each time but this is less than calculating the projection for each comparison.
As with the previous sample a full set of sort operations will be demonstrated using the Quicksort; namely:
So once again here is the full code listing:
So a few notes about the code.
You will notice that all the secondary functions are marked as inline. This means that all these functions are integrated into the calling code; needed to ensure the code is performant.
The comparerResult function is the one that compares array elements. As mentioned, depending on whether a projection has been specified, this compares elements from the original array or the keys array. Similarly the swap function swaps elements from the keys array only if a projection has been specified.
The partitioning for this sort takes the approach of using the median of the first, middle, and last elements in the array. The median value is then moved into the first position, low index, and the remainder of the elements are then partitioned based on this first value.
The actual partitioning is performed using a simple for loop. The rational for this rather than something like Seq.fold (where the accumulator is the last index moved) is once again purely performance.
When performing the recursive sort, as most Quicksort implementations do, when the array size drops below a certain threshold a normal array sort is performed; the sort type usually being an Insertion sort.
The last,and probably most important, thing to consider was how best to parallelize the recursive call. Options for this are discussed by Stephen Toub in his whitepaper “Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4”; well worth a read if you haven't already done so.
This code uses the approach that parallelism is invoked only if the array size is greater than a threshold size (as parallel invocations incur cost) and the current degree of parallelism is below a certain threshold (if all cores are busy default to serial invocation). To support this, the following type definitions are used:
Having the CurrentDop defined as static means that if multiple sorts are in flight within the same process the system will not get overloaded with sorting threads.
The alternative approach, again as mentioned by Stephen, is to parallelise to a certain recursive depth after which, when the number of threads is up to the number of cores (log2 of the cores), one defaults to the normal serial behaviour:
As Stephen noted in his paper the issue with this approach are unbalanced partitions affecting the parallelism.
So how does this all perform? Surprisingly, the original Merge sort is a little faster. Using my quad-core laptop with an array of 5 million floats the numbers are (the projection is defined as (sqrt << abs)):
One final metric worth mentioning is that if one creates a comparer from the projection and then performs a sortInPlaceWith then the Quicksort takes about 3 seconds. This is compared with the sortInPlaceBy of about 1 second.
The Quicksort however is faster for smaller arrays (up to 2 Million floats); here is a summary for the sortInPlace operation:
Thus looking at these numbers one may decide to perform a merge sort when the array size exceeds 2 million. In addition at around 50,000 elements you will find that the base sort routines are more performant than a Quicksort. Thus one may define the Parallel.Array extension as follows:
As always hope you find this code useful.