the median of a trillion numbers

<< single process erlang implementation index performance comparisons >>

multiple process erlang implementation

list of lists

in the multi process implementation splits the list of numbers to consider into a sub lists
each sub list can be handled by a seperate erlang process, potentially across different machines
(these lists don't have to be the same size)

single [1,2,3,4,5,6,7,8,9]
multi  [[1,2,3,4],[5,6],[7,8,9]]

recall there a number of operations required in distributed case
includes things like determining total number of elements, determining minimum value, etc
each of these operations can be done against each sub list individually with the results aggregated

eg to determine total number of elements

single -> length( [1,2,3,4,5,6,7,8,9] ) = 9
multi  -> sum ( length([1,2,3,4]), length([5,6]), length([7,8,9]) ) = sum([4,2,3]) = 9

eg to determine minimum value

single -> min( [1,2,3,4,5,6,7,8,9] ) = 1
multi  -> min( min([1,2,3,4]), min([5,6]), min([7,8,9]) ) = min([1,5,7]) = 1

other changes required

rotation

in the single list case we pick the pivot as the first value.
in the multi list case we pick the pivot as the first value of the first list.
recall in the algorithm that rotation is sometimes required.
this is to ensure all potential values for a pivot are explored.
so in the multi list case the rotation needs to operate at two levels; rotate the first list and then rotate the list of lists

before [[1,2,3],[4,5,6],[7,8,9]]
after  [[4,5,6],[7,8,9],[2,3,1]]

empty sublist cleanup

the algorithm needs to reject values less than or greater than a particular value.
in the single list case it's never the case that the list becomes empty though this is possible in the multi list case.

consider the single list case for

[3,1,2,2,2,4,5]
pivot 3, num less than 3 = 4, so 3 is 5rd order stat, discard all over 4
result [3,1,2,2,2]

now consider the multi list case for

[[3,1,2],[2,2],[4,5]]
pivot 3, num less than 3 = 4, so 3 is 5rd order stat, discard all over 4
result [[3,1,2],[2,2],[]]

in this case we can exclude this empty list from further processing

the multi process erlang version splits the work across two modules
a controller orchestrates the process
processing of lists is done by a number of spawned workers
if a worker ends up with no elements is it terminated

try it out with

bash> ./generate_test_data.rb 0 314 1000 10e3 | ./spread_across_files.rb numbers 4
bash> erl -sname bob -noshell -run controller init worker numbers.[0-3]
314

in fact if you have two (or more) machines, you can try it distributed style!

replicate the directory structure you are using (ie code and data) on each machine
make a file called .hosts.erlang according to the format described at the bottom of the net adm man page
copy it to each machine where the code and data is

on all boxes but one run

bash> erl -sname worker -noshell -setcookie 123

on the last box run

bash> erl -sname master -noshell -setcookie 123 -run controller init worker numbers.[0-3]

the code is written such that it spawns a process per file and round robin allocates each process to run on a machine in the cluster.
the code to do this is only fractionally harder than not doing it; this is the joy of erlang!

optimising even MORE

there is another optimisation we can make if we make an assumption about the data.

say we have a list of N ints where for each value e; 1 <= e <= M.

the list can be represented in two ways...
the N elements directly

[e₁, e₂, ... ,e_n]

or a lookup take of the M distinct values and their frequencies

[{1,f₁}, {2,f₂}, ... , {M,f_m}]

the first case uses N ints, the second case uses 2M ints.
both are usable data structures for our algorithm.

in the case that N >>> M we can make huge savings, in both time and space, by using the second.

eg consider a trillion ints with distinct values 1 to a billion
storing the list uses 1e12 ints
storing the lookup when all values are represented uses 2e9 ints
500 times less

this requires some minor mods to the worker module. we'll call it worker_freq

it also requires a different input format, instead of parsing the numbers directly (eg numbers.0)
we can convert to a erlang binary format of the dictionary (eg numbers.0.dict) that can be read by the workers
we use generate_binary_dicts to do this

try it out with

bash> ./generate_test_data.rb 0 314 1000 10e3 | ./spread_across_files.rb numbers 4
bash> erl -noshell -run generate_binary_dicts main numbers.[0-3]
bash> erl -sname bob -noshell -run controller init worker_freq numbers.[0-3].dict
314

<< single process erlang implementation index performance comparisons >>

nov 2008