the median of a trillion numbers

<< multiple process erlang implementation index running on amazon ec2 >>

performance comparisons

lets do some testing on a quad core box with 2gb ram.
we'll generate 10 million numbers into a ramdisk (/dev/shm)

bash> ./generate_test_data.rb 0 31415 1e5 10e6 | ./spread_across_files.rb /dev/shm/numbers 4

first up is the simple ruby impl

bash> time cat /dev/shm/numbers.[0-3] | ./median.rb
2m 38s

this version goes memory nuts and quickly has consumed pretty much all resident memory it can get (just under 2gb)
i wonder if this is usual ruby behaviour? it's not being a good citizen in terms of garbage collection!!!
the process sits at cpu user / system / idle around 25 / 0 / 75 (ie a single core)
once the swapping starts it's reduced to something a bit more erratic like 20 / 5 / 75

next is the multi process erlang impl that doesnt use frequency dictionaries

bash> time erl -sname bob -noshell -run controller init worker /dev/shm/numbers.[0-3]
47s

it sits at about user / system / idle cpu usage of about 70 / 15 / 15
and is using much less ram; about 50mb resident and another 150mb or so for data.

let's have a crack at the frequency dictionary version
it's an unfair comparison to the ruby one though since it uses a different data structure

bash> time erl -noshell -run generate_binary_dicts main /dev/shm/numbers.[0-3]
1m 16s
bash> time erl -sname bob -noshell -run controller init worker_freq /dev/shm/numbers.[0-3].dict
2s

what's going on?
though the processing using the dictionary is wildly fast,
the parsing to generate the dictionary takes longer than the version that parses and processes the list

in fact just testing the parsing time we see quite a variation in something which only differs in the building of the dictionary.
dictionaries in erlang must be slower than i thought!

mat@ubishop:~/dev/median$ erl -noshell -run parse_file test to_list /dev/shm/numbers.1e6
to_list time 8.098419 sec
mat@ubishop:~/dev/median$ erl -noshell -run parse_file test to_dict /dev/shm/numbers.1e6
to_dict time 13.150428 sec

poking around we see it's the io for reading the file from disk in parse.erl

we can try changing from stream based file:io (bolded) ...

to_dict(Filename) ->
    {ok,File} = file:open(Filename,read),
    to_dict(File, dict:new(), next_line(File)).

to_dict(File,Freqs,eof) ->
    file:close(File),
    Freqs;

to_dict(File,Freqs,Line) ->
    Value = line_to_int(Line),
    to_dict(File, dict:update_counter(Value,1,Freqs), next_line(File)).

next_line(File) ->
    io:get_line(File,'').
    
line_to_int(Line) ->
    Line_without_cr = lists:sublist(Line, length(Line)-1),
    list_to_integer(Line_without_cr).

... to a binary parsing (bolded)

to_dict_b(InFilename) ->
    {ok,B} = file:read_file(InFilename),
    to_dict_b(InFilename,B,dict:new()).

to_dict_b(_InFilename, <<>>, Dict) ->
    Dict;
    
to_dict_b(InFilename, Binary, Dict) ->
    { ReducedBinary, Line } = next_line_b(Binary),
    Int = list_to_integer(Line),
    NewDict = dict:update_counter(Int,1,Dict),
    to_dict_b(InFilename, ReducedBinary, NewDict).

next_line_b(Binary) ->
    next_line_b(Binary, []).

next_line_b(<<>>, _Collected) ->
    ignore_last_line_if_didnt_end_in_newline;

next_line_b(<<"\n",Rest/binary>>, Collected) ->  
    { Rest, binary_to_list(list_to_binary(lists:reverse(Collected))) }; % black magic voodoo line

next_line_b(<<C:1/binary,Rest/binary>>, Collected) ->
    next_line_b(Rest, [C|Collected]).

this gives a speed up
from 1m 16s with cpu user / system / idle of 75 / 10 / 15
to 40s with a much better cpu utilisation of user / system / idle 94 / 5 / 1
furthermore the number of context switches is about 20 times less (!!)

but what is the trade off?
apart from the binary voodoo black magic lines required; (in particular binary-to-list-to-binary-list-reverse-to-wtf?!?! )
it is more expensive in memory.
whereas the slower version occupies about 40mb resident memory the binary version is more like 100mb

this binary approach (even though there is a memory cost) doesnt feel right though....
there must be a better way to get a similiar effeciency using standard libraries

<< multiple process erlang implementation index running on amazon ec2 >>

nov 2008