the median of a trillion numbers

the algorithm

partitioning and pivots

not suprisingly it turns out doing a complete sort is a bit wasteful if we're only after the median.
all we want is the element that had the same number of elements on either side.
this type of operation, known as a partition, is actually the basis of that most famous sort of all, quicksort.

the basic method of the quicksort algorithm is to select an element, called a pivot and move it into it's correct sorted position by shuffling elements around it.
all elements less than the pivot are moved to the front of the list and the rest, corresponding to elements equal or greater, are left in place.
after this partitioning the pivot is in it's correct sorted position (though the rest of the list is still a shambles).
the rest can be sorted by recursively applying the same partitioning approach to both the elements before the pivot and the elements after the pivot.

to find just the median though we can use the idea partitioning around the first pivot, without having to do the extra work of the complete sort.

order statistics

first an informal definition: regarding order statistics
the n^th order statistic of a list is the element which is in the n^th postition when the list is sorted
eg the 1^st order statistic of a 9 element list is the minimum
eg the m^th order statistic of a m element list is the maximum
eg the m^th order statistic of a 2m element list is the median

pivot selection

the perfect pivot

let's work through an example
say we have the list

[4 3 1 1 5 9 2 6 5 3 5]

(we'll use an underline to denote the value in the mth order stat position for a 2m element list)

we are after the median, ie the 6th order stat
if we partitioned around the first list element, ie 4, the list would be

[3 1 1 2 3 4 5 9 6 5 5]

(we'll use red values to denote those less than a pivot, green values for those equal or greater than a pivot)

since the 4 has been moved to position 6 we can say 4 is the 6th order statistic of this list.
which, as luck would have it, is exactly what we were looking for!
so this pivot, 4, is the median.

this was much less work than quicksort since we didn't have to sort the sublists [3 1 1 2 3] or [5 9 6 5 5]
but it only worked because 4 was a good choice.
but what if after the parititioning the pivot isn't the 6th order stat?

a pivot to the right of the median postition

consider a case then the pivot ends up "to the right" of the median postition

L1 = [6 3 1 1 5 9 2 4 5 3 7]

if we partitioned around the first element, 6, we have

L1 = [3 1 1 5 5 2 4 3 6 9 7]

this time we have calculated 6 to be the 9th order statistic.
since we were after the 6th order stat so know then that 6 is not the median.
furthermore since the 9th order stat is "to the right" of the 6th order stat we know then that the median must be less than 6.
therefore not only do we know 6 is not the median, it can't be 9 or 7 either.
this is very important since it means we can actually discard these values completely.

L2 = L1 - [6 9 7]

L2 = [3 1 1 5 5 2 4 3]

L1's 6th order stat, it's median, is L2's 6th order stat
(since we discarded values after the middle value we can continue by looking for the same order stat)

take note though that the 6th order stat of L2 is not L2's median.
L2's median would be it's 4th order stat.

a pivot to the left of the median postition

consider a case then the pivot ends up "to the left" of the median postition

L1 = [2 3 1 1 5 9 3 6 5 3 5]

again we are after the median which is the 6th order stat

partitioning around the first element, 2, we have

L1 = [1 1 2 3 5 9 3 6 5 3 5]

in this case our pivot is the 3rd order stat, but we were after the 6th order stat
since the 3rd order stat is "to the left" of the 6th order stat we can additionally say the median isn't 1
discarding this time values less than the pivot

L2 = L1 - [1 1]

L2 = [2 3 5 9 3 6 5 3 5]

L1's 6th order stat, it's median, is L2's 4th order stat
(since we discarded values before the middle value we need to adjust what order stat we are now looking for)

in summary

if we are looking for the kth order stat and we choose a pivot that is it we've found the median.
otherwise we throw some values away, adjust what order stat we're looking for and continue.

corner case

what a surprise, there is a corner case....

say we have list

L1 = [2 3 1 1 5 9 3 6 5 3 5]

and we choose the first value, 2, as a pivot we discard some values (the 1's) and get

L2 = [2 3 5 9 3 6 5 3 5]

now if we continue and choose the first value, 2, as a pivot we end up discarding nothing more and get

L3 = [2 3 5 9 3 6 5 3 5]

and we've gotten nowhere...
so if the pivot ends up in the first place we have to rotate the list (ie move first element to end) and try again.

corner case 2

what a surprise, there's another corner case....
this algorithm doesn't solve that case though when all elements of the list are the same

L1 = [3 3 3 3 3]

no amount of rotation will save us now.
so we need some special handling for this one;
it's simply; if list min = list max then the median is the first element

a few shortcuts

if we ever end up looking for the 1st order stat we can just scan for the minimum
if we ever end up looking for the kth order stat of a k element list we can just scan for the maximum

<< the question index distributing >>

nov 2008