<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>ggplot posixct cheat sheet</title>
      <link>http://matpalm.com/blog/2012/03/18/ggplot_posix_cheat_sheet</link>
      <category><![CDATA[plyr]]></category>
      <category><![CDATA[posixct]]></category>
      <category><![CDATA[ggplot2]]></category>
      <guid>http://matpalm.com/blog/2012/03/18/ggplot_posix_cheat_sheet</guid>
      <description>ggplot posixct cheat sheet</description>
      <content:encoded><![CDATA[<h2>ggplot posixct cheat sheet</h2>
<p>after having to google this stuff three times in the last few months i'm writing it down here so i can just cut and paste next time...</p>
<h3>data with arbitrary date time stamp</h3>
<pre>
> library(ggplot2); library(plyr)  # ggplot() and ddply() are used below
> d = read.delim('data.tsv',header=F,as.is=T,col.names=c('dts_str','freq'))
> # YEAR MONTH DAY HOUR
> head(d,3)
        dts_str  freq
1 2012_01_01_00 18393
2 2012_01_01_01 20536
3 2012_01_01_02 91840
> tail(d,3)
          dts_str   freq
732 2012_01_31_21 103107
733 2012_01_31_22 108921
734 2012_01_31_23  78629
> summary(d$freq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10590   63620   82680   86770  105700  169900 
</pre>

<h3>parse arbitrary strange format to a datetime</h3>
<pre>
> d$dts = as.POSIXct(d$dts_str, format="%Y_%m_%d_%H")

> head(d,3)
        dts_str  freq                 dts
1 2012_01_01_00 18393 2012-01-01 00:00:00
2 2012_01_01_01 20536 2012-01-01 01:00:00
3 2012_01_01_02 91840 2012-01-01 02:00:00

> ggplot(d, aes(dts, freq)) + geom_point()

</pre>

<img src="/blog/imgs/2012/ggplot_posixct/p1.png"/>

<h3>plots by day of week; summary</h3>
<pre>
> d$dow = as.factor(format(d$dts, format="%a"))  # day of week
> head(d,3)
        dts_str  freq                 dts dow
1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun
2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun
3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun
> ggplot(d,aes(dow,freq)) +
 geom_boxplot() +
 geom_smooth(aes(group=1)) +
 scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +  # provide explicit factor ordering
 xlab('day of week') + ylab('freq') + ggtitle('freq by day of week')
</pre>

<img src="/blog/imgs/2012/ggplot_posixct/p2.png"/>

<h3>plots by day of week; totals</h3>
<pre>
> by_dow = ddply(d, "dow", summarize, freq=sum(freq))
> ggplot(by_dow,aes(dow,freq)) + geom_bar(stat='identity') + 
 scale_x_discrete(limits=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) + 
 xlab('day of week') + ylab('freq') + ggtitle('total freq by day of week')
</pre>

<img src="/blog/imgs/2012/ggplot_posixct/p3.png"/>

<h3>plots by hour of day; summary</h3>
<pre>
> d$hr = format(d$dts, format="%H")
> head(d,3)
        dts_str  freq                 dts dow hr
1 2012_01_01_00 18393 2012-01-01 00:00:00 Sun 00
2 2012_01_01_01 20536 2012-01-01 01:00:00 Sun 01
3 2012_01_01_02 91840 2012-01-01 02:00:00 Sun 02
> ggplot(d,aes(hr,freq)) + geom_boxplot() + geom_smooth(aes(group=1)) + 
 xlab('hr of day') + ylab('freq') + ggtitle('freq by hr of day')
</pre>

<img src="/blog/imgs/2012/ggplot_posixct/p4.png"/>

<h3>plots by hour of day; totals</h3>
<pre>
> by_hr = ddply(d, "hr", summarize, freq=sum(freq))
> ggplot(by_hr,aes(hr,freq)) + geom_bar(stat='identity') + 
 xlab('hr of day') + ylab('freq') + ggtitle('total freq by hr of day')
</pre>

<img src="/blog/imgs/2012/ggplot_posixct/p5.png"/>]]></content:encoded>
    </item>
    <item>
      <title>tokenising the visible english text of common crawl</title>
      <link>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text</link>
      <category><![CDATA[common-crawl]]></category>
      <category><![CDATA[nlp]]></category>
      <guid>http://matpalm.com/blog/2011/12/10/common_crawl_visible_text</guid>
      <description>tokenising the visible english text of common crawl</description>
      <content:encoded><![CDATA[<h2>The common crawl dataset</h2>
<p><a href="http://www.commoncrawl.org/">Common crawl</a> is a publicly available 30TB web crawl taken between September 2009 and September 2010. 
As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is 
<a href="https://github.com/matpalm/common-crawl/">on github</a>.</p>
<h2>1. Getting the data</h2>
<p>The first thing was to get the data into a hadoop cluster. 
It's made up of 300,000 gzipped <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">arc files</a>, each about 100MB, stored in S3.
I wrote a dead simple 
<a href="https://github.com/matpalm/common-crawl/blob/master/java/src/cc/SimpleDistCp.java">distributed copy</a> to do this.</p>
<p>Only a few things of note about this job...</p>
<ol>
<li>The data in S3 is marked as <a href="http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html">requester pays</a>
which, even though it's a no-op if you're accessing the data from EC2, needs the "x-amz-request-payer" header to be set.</li>
<li>Pulling from S3 to EC2 is network bound so I ran using the 
<a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html">MultithreadedMapRunner</a> to ensure I could get as much throughput as possible.</li>
<li>The code includes lots of retry logic but also sets 
<a href="http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)">mapred.max.map.failures.percent=100</a> to
allow tasks to fail without killing the entire job (e.g. one S3 object had bad ACLs and couldn't be read; no amount of retries would have helped).</li>
</ol>
<h2>2. Filtering text/html</h2>
<p>The next step was to filter out everything that didn't have a mime type of 'text/html'. This is pretty straightforward since the arc file format specifies the mime type in a header.
I used the <a href="http://nutch.apache.org/apidocs-1.2/org/apache/nutch/tools/arc/ArcInputFormat.html">ArcInputFormat</a> from 
<a href="http://nutch.apache.org">Apache Nutch</a> to actually generate the hadoop map input records.</p>
<p>Across the 3,000,000,000 objects in the crawl there ended up being 2,000 distinct mime types, 700 of them occurring only once, with about 90% of them being nonsense. </p>
<p>The top six mime types were ...</p>
<table class="data">
<tr><td><b>rank</b></td><td><b>mime type</b></td><td><b>freq</b></td><td><b>overall<br/>%</b></td></tr>
<tr><td>1</td><td>text/html</td><td>2,970,000,000</td><td>91%</td></tr>
<tr><td>2</td><td>text/plain</td><td>79,000,000</td><td>2%</td></tr>
<tr><td>3</td><td>text/xml</td><td>52,000,000</td><td>1%</td></tr>
<tr><td>4</td><td>application/pdf</td><td>48,000,000</td><td>1%</td></tr>
<tr><td>5</td><td>application/x-javascript</td><td>26,000,000</td><td>&lt;1%</td></tr>
<tr><td>6</td><td>text/css</td><td>25,000,000</td><td>&lt;1%</td></tr>
</table>

<p>Even though there's probably interesting content in the non text/html object types it seemed that just handling text/html would get me the biggest bang for my buck.</p>
<p>Initially I had some problems with encoding. Though HTTP response headers include an encoding
field that is <i>meant</i> to indicate the encoding of the payload, I found it to be wrong about 30% of the time :( So I just ignored what the header said and
used the <a href="http://tika.apache.org/1.0/api/org/apache/tika/parser/txt/CharsetDetector.html">CharsetDetector</a> 
provided in <a href="http://tika.apache.org/">Apache Tika</a>. CharsetDetector inspects a chunk of bytes, uses heuristics to guess the encoding, then decodes and re-encodes as UTF-8. </p>
<h2>3. Extracting visible text</h2>
<p>Next was to extract the visible text from this raw html. After playing with a few libraries I ended up going with 
<a href="http://code.google.com/p/boilerpipe/">boilerpipe</a>, in particular its 
<a href="http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/KeepEverythingWithMinKWordsExtractor.html">KeepEverythingWithMinKWordsExtractor</a>.
Boilerpipe, roughly, returns a single line per block element of the html.</p>
<h2>4. Filtering for english content</h2>
<p>I then used 
<a href="http://tika.apache.org/1.0/api/org/apache/tika/language/LanguageIdentifier.html">LanguageIdentifier</a>, again a part of Tika, to filter out everything but english text. 
It didn't seem to have any false positives but looking at the top 5 languages something seems amiss... </p>
<table class="data">
<tr><td><b>rank</b></td><td><b>language</b></td><td><b>freq</b></td></tr>
<tr><td>1</td><td>English (en)</td><td>1,600,000,000</td></tr>
<tr><td>2</td><td>Lithuanian (lt)</td><td>270,000,000</td></tr>
<tr><td>3</td><td>Norwegian (no)</td><td>150,000,000</td></tr>
<tr><td>4</td><td>Estonian (et)</td><td>140,000,000</td></tr>
<tr><td>5</td><td>French (fr)</td><td>140,000,000</td></tr>
</table>

<p>I never got around to sampling some of the Lithuanian ones to see what was actually going on but I'm a bit suspicious. I might have actually lost a bit of content in this step...</p>
<h2>5. Tokenising</h2>
<p>The final step was to tokenise the text. I used 
<a href="http://nlp.stanford.edu/software/lex-parser.shtml">the stanford parser</a>; 
in particular I modified their example DocumentPreprocessor to make this simplified 
<a href="https://github.com/matpalm/common-crawl/blob/master/java/src/cc/util/SentenceTokeniser.java">SentenceTokeniser</a>.</p>
<p>This tokeniser was wrapped in a 
<a href="https://github.com/matpalm/common-crawl/blob/master/java/src/cc/TokeniseSentences.java">TokeniseSentences</a>
hadoop job that did some additional sanity checking like ignoring one/two word sentences etc.</p>
<h2>Results</h2>
<p>The final output was 92,000,000,000 sentences (3TB gzipped). Next will be to finish porting 
<a href="https://github.com/matpalm/resemblance/">my near duplicate sketching algorithm</a>
to hadoop to run it across this data.</p>]]></content:encoded>
    </item>
    <item>
      <title>finding phrases with mutual information</title>
      <link>http://matpalm.com/blog/2011/11/15/collocations_3</link>
      <category><![CDATA[nlp]]></category>
      <category><![CDATA[phrase-extraction]]></category>
      <category><![CDATA[collocations]]></category>
      <category><![CDATA[mutual-information]]></category>
      <guid>http://matpalm.com/blog/2011/11/15/collocations_3</guid>
      <description>finding phrases with mutual information</description>
      <content:encoded><![CDATA[<h2>finding phrases with mutual information</h2>
<p>continuing on with my series of <a href="/blog/2011/10/22/collocations_1/">mutual information experiments</a> how might 
we extend the technique to identify sequences longer than just two terms?</p>
<p>one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat 
the entire process. (thanks <a href="http://tdunning.blogspot.com">ted</a> for the idea)</p>
<h2>example</h2>
<p>so say we had the 6 term sentence <tt>i went to new york city</tt></p>
<p>it has 5 bigrams; <tt>('i went', 'went to', 'to new', 'new york', 'york city')</tt></p>
<p>running the mutual information algorithm over this might identify <tt>new york</tt> 
as a bigram of interest. </p>
<p>we can swap the two terms with a single token 
<tt>(new_york)</tt> giving us a new sentence with 5 terms; <tt>i went to '(new_york)' city</tt></p>
<p>this new sentence has 4 bigrams <tt>('i went', 'went to', 'to (new_york)', '(new_york) city')</tt></p>
<p>another run of mutual information might now identify the pair <tt>(new_york) city</tt> so we replace 
it with the token <tt>((new_york)_city)</tt> and just keep repeating.</p>
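<p>the replace step can be sketched in a few lines of python. this is only a minimal sketch, not the code used for the experiment; <tt>merge_bigram</tt> and <tt>bigrams</tt> are hypothetical helper names and it assumes each sentence is already tokenised into a list of strings.</p>
<pre>
```python
def bigrams(tokens):
    """All adjacent token pairs of a sentence."""
    return list(zip(tokens, tokens[1:]))


def merge_bigram(tokens, first, second):
    """Replace each adjacent (first, second) pair with one '(first_second)' token."""
    merged = []
    i = 0
    while i != len(tokens):
        # the slice is empty at the last token, so it can never match there
        if tokens[i] == first and tokens[i + 1:i + 2] == [second]:
            merged.append("(%s_%s)" % (first, second))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "i went to new york city".split()
pass1 = merge_bigram(sentence, "new", "york")
pass2 = merge_bigram(pass1, "(new_york)", "city")

print(len(bigrams(sentence)), len(bigrams(pass1)))  # 5 4
print(" ".join(pass1))  # i went to (new_york) city
print(" ".join(pass2))  # i went to ((new_york)_city)
```
</pre>
<p>since a merged token is just another string, the second pass needs no special handling, which is why the whole process can simply be repeated.</p>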
<h2>data</h2>
<p>let's run this over a small sample of 300,000 sentences taken from the visible text of 
the <a href="http://download.freebase.com/wex/">freebase wiki dump</a> after it's been tokenised by 
<a href="http://nlp.stanford.edu/software/lex-parser.shtml">the stanford parser</a></p>
<p>(to speed things up a little i calculate mutual information and replace the top 10 bigrams in the text before recalculating)</p>
<p>example starting sentences include...</p>
<pre>
A solid and dependable performer Taylor held the record having played in games for the Phillies at second base t...
A surface may also exhibit both specular and diffuse reflection as is the case for example of glossy paint as us...
A variety of names have since been given to the Wandering Jew including Matathias Buttadeus Paul Marrane and Isa...
A.D.A.M. has control over Eggman 's computer and therefore every robot he owns he can also spread to other compu...
Absolute magnitude magazine cover Though this image is subject to copyright its use is covered by the U.S. fair ...
</pre>

<h2>results</h2>
<h3>what phrases do we find?</h3>
<p>after the first iteration we get the bigrams we've seen before...</p>
<pre>
Socorro LINEAR
expr    expr
United  States
Los     Angeles
median  income
</pre>

<hr>

<p>but after the second iteration we get a mix of single term bigrams and immediately 
start seeing some new composite bigrams; in this case the trigram <tt>'per square mile'</tt></p>
<pre>
(expr_expr)     (expr_expr)
capita  income
(t_t)   t
per     (square_mile)
Las     Vegas
</pre>

<p>unfortunately there's lots of noise too. <tt>'expr expr expr expr'</tt> comes from a single sentence, the term 'expr' repeated 450 times, 
that must have been poorly parsed originally. the <tt>'t t t'</tt> case is something similar.</p>
<hr>

<p>by the 16th iteration we get our first 4gram phrase <tt>'U.S. fair use laws'</tt></p>
<pre>
had     been
U.S.    ((fair_use)_laws)
Rotten  Tomatoes
science fiction
(New_York)      City
</pre>

<hr>

<p>and by the 70th iteration we get our first 5gram phrase <tt>'United Nations Security Council Resolution'</tt>.
jujitsu fans out there will be pleased to see some grappling coming in too!</p>
<p>alas more rubbish as well with the align styling tags leaking in.</p>
<pre>
(((United_Nations)_Security)_Council)   Resolution
Submission      (rear_(naked_choke))
Asian   (Pacific_Islander)
(UD_(align_left))       ((align_left)_((align_center)_(Win_(align_left))))
lieutenant      colonel
</pre>

<hr>

<p>it's only two passes later that we get a big continuation of this one 
with <tt>'United Nations Security Council Resolution adopted unanimously'</tt></p>
<pre>
(((((United_Nations)_Security)_Council)_Resolution)_adopted)    unanimously
(United_States) ((align_left)_((align_center)_(Win_(align_left))))
Flying  Corps
Saddam  Hussein
TKO     punches
</pre>

<p>i was a bit suspicious of this one but grabbing the original text we can see
how it makes for an interesting construct in the text...</p>
<pre>
United Nations Security Council Resolution adopted unanimously on August after recalling Resolution the Council ... 
United Nations Security Council Resolution adopted unanimously on March after recalling all previous resolutions...
United Nations Security Council Resolution adopted unanimously on February after noting that the Council had bee...
United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
United Nations Security Council Resolution adopted unanimously on May after a complaint by Senegal against Portu...
United Nations Security Council Resolution adopted unanimously on June after recalling resolutions and the Counc...
United Nations Security Council Resolution adopted unanimously on July after noting the recent entry into force ...
United Nations Security Council Resolution adopted unanimously on May after recalling all resolutions on the sit...
United Nations Security Council Resolution adopted unanimously on January after recalling all previous resolutio...
United Nations Security Council Resolution adopted unanimously on June after hearing representations from Botswa...
United Nations Security Council Resolution adopted unanimously on May after reaffirming Resolution and all subse...
United Nations Security Council Resolution adopted unanimously on August after reaffirming previous resolutions ...
United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
United Nations Security Council Resolution was adopted unanimously on October after recalling resolutions and on...
United Nations Security Council resolution adopted unanimously on March after reaffirming resolutions and on the...
United Nations Security Council Resolution adopted unanimously on June after recalling all previous resolutions ...
United Nations Security Council Resolution adopted unanimously on January after reaffirming Resolution on the si...
United Nations Security Council Resolution adopted unanimously on February after reaffirming resolutions and in ...
</pre>

<p>interesting. i wonder has this come from a template perhaps? maybe just cut n paste? one author with fixed style?</p>
<hr>

<p>even by the end of my run, 950 iterations, (aka last night) there continue to be valid 
short phrases being picked up</p>
<pre>
(County_Kansas) (United_States)
County  Clare
Sunday  night
Rift    Valley
Charlton        Heston
</pre>

<h3>how has the corpus changed?</h3>
<p>during the processing we've been replacing these tokens in the original text. 
so how does it look by this time? well, not a whole lot has changed actually. </p>
<p>the following 3 random examples show how little the text differs
(should have left it running much longer!!)</p>
<pre>
(He_played) for a (short_time) with (Duke_Ellington) for (which_he) is (best_remembered)

His (debut_single) Mi God Mi King topped the Jamaican (singles_chart) and a string of hits
followed including Heel And Toe Monkey And Ape (Ghost_Rider) and Crucifixion although his
best-remembered song is Mini Bus which lamented the demise (of_the) jolly bus and which 
(was_awarded) the title Song Of The Year in (from_the) Jamaica (Broadcasting_Corporation)

However this number is certainly an improvement (from_the) cars it averaged yearly ((during_the)_1980s)
</pre>

<h3>what are the longest phrases identified?</h3>
<p>the top three are noise alas...</p>
<table class="data">
<tr> <td>rank</td> <td>num<br/>underscores</td> <td>phrase</td> </tr>
<tr> <td>1</td> <td>127</td> <td>expr_expr_expr_..... (128 times)</td></tr>
<tr> <td>2</td> <td>95</td>  <td>September_Socorro_LINEAR_September_Socorro_LINEAR_... (32 times)</td></tr>
<tr> <td>3</td> <td>63</td>  <td>t_t_t_... (64 times)</td></tr>
</table>

<p>it's not until the 77th we get something that isn't (arguably) just a repeated 
pattern or noisy parsing</p>
<table class="data">
<tr> <td>rank</td> <td>num<br/>underscores</td> <td>phrase</td> </tr>
<tr> <td>77</td> <td>10</td><td> At_the_census_there_were_people_households_and_families_residing_in </td></tr>
</table>

<p>which was identified as a phrase due to the high frequency of variations of the following...</p>
<table class="data">
<tr> <td>freq</td> <td>original phrase</td> </tr>
<tr><td> 72 </td> <td>As of the census of there were people households and families residing in the city      </td> </tr>
<tr><td>     60 </td> <td>As of the census of there were people households and families residing in the town      </td> </tr>
<tr><td>     46 </td> <td>As of the census of there were people households and families residing in the CDP       </td> </tr>
<tr><td>     32 </td> <td>As of the census of there were people households and families residing in the village   </td> </tr>
<tr><td>     26 </td> <td>As of the census of there were people households and families residing in the township  </td> </tr>
<tr><td>     15 </td> <td>As of the census of there were people households and families residing in the borough   </td> </tr>
<tr><td>      3 </td> <td>At the census there were people households and families residing in the city  </td> </tr>
<tr><td>      2 </td> <td>At the census there were people households and families residing in the village </td> </tr>
<tr><td>      2 </td> <td>As of the census of there were people households and families residing on the base      </td> </tr>
</table>

<p>fascinating stuff!!</p>
<h2>random todo thoughts</h2>
<ul>
<li>use patterns found in this experiment to clean up noise and rerun</li>
<li>work out a way to fold the composition into the scoring</li>
<li>work on larger dataset</li>
<li>approach to dealing with duplicates? don't want to just uniq since they represent something </li>
</ul>]]></content:encoded>
    </item>
    <item>
      <title>collocations in wikipedia, part 2</title>
      <link>http://matpalm.com/blog/2011/11/05/collocations_2</link>
      <category><![CDATA[nlp]]></category>
      <category><![CDATA[phrase-extraction]]></category>
      <category><![CDATA[collocations]]></category>
      <guid>http://matpalm.com/blog/2011/11/05/collocations_2</guid>
      <description>collocations in wikipedia, part 2</description>
      <content:encoded><![CDATA[<h2>problems with low support</h2>
<p>in my <a href="/blog/2011/10/22/collocations_1/">last post</a> we went through mutual information as a way of finding collocations.</p>
<p>the astute reader may have noticed that for the list of top bigrams i only
showed ones that had a frequency above 5,000. </p>
<p>why this cutoff? well it turns out
that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases. </p>
<p>if we just sort by the mutual information score we find that the top 250,000 all have the same score 
and correspond to bigrams that occur only once in the corpus (and whose terms only appear in that bigram). some examples 
include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF" and they are, as often as not it seems, artifacts of parsing quirks.</p>
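<p>why do they all tie? if a bigram occurs once and each of its terms occurs once then P(t1,t2) = 1/#bigrams and P(t1) = P(t2) = 1/#unigrams, so the score collapses to log2(#unigrams^2 / #bigrams) no matter what the terms are. a small sketch of this, where the corpus sizes and the "common" bigram counts are made up purely for illustration:</p>
<pre>
```python
import math


def mutual_info(f12, f1, f2, n_bigrams, n_unigrams):
    """Bigram mutual information from raw counts (maximum likelihood estimates)."""
    p12 = f12 / n_bigrams
    p1 = f1 / n_unigrams
    p2 = f2 / n_unigrams
    return math.log2(p12 / (p1 * p2))


# hypothetical corpus sizes, not the wikipedia numbers
N_UNI, N_BI = 1000000, 950000

once = mutual_info(1, 1, 1, N_BI, N_UNI)              # seen once, terms seen nowhere else
common = mutual_info(5000, 6000, 7000, N_BI, N_UNI)   # a well supported bigram

print(once, common)  # the once-only bigram gets the much higher score
```
</pre>
<p>every once-only bigram lands on exactly the same value, and it's the largest value the score can take for that corpus.</p>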
<p>so how did i decide the minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. 
purely a magic number, not good!</p>
<h2>likelihood ratios</h2>
<p>another approach that doesn't suffer this problem with low frequency bigrams is to use likelihood ratios. 
one such test is the <a href="http://en.wikipedia.org/wiki/G-test">g-test</a> very well described in 
<a href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">this blog post by ted dunning</a></p>
<p>as a concrete implementation i'll just use the 
<a href="https://github.com/apache/mahout/blob/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java">LogLikelihood</a> 
code provided in <a href="http://mahout.apache.org/">mahout</a>. (yay for being able to 
<a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/">use an arbitrary static java function as a pig udf!</a>)</p>
<p>to use this method we need 4 values, k11, k12, k21 and k22, all easily calculated from the counts we gathered for mutual info</p>
<table class="data">
<tr>
 <td>k_value</td>
 <td>description</td>
 <td>calculated from</td></tr>
<tr>
 <td>k11</td>
 <td>t1 & t2</td>
 <td>freq(t1,t2)</td>
</tr>
<tr>
 <td>k12</td>
 <td>t1 & not t2</td>
 <td>freq(t1) - freq(t1,t2)</td>
</tr>
<tr>
 <td>k21</td>
 <td>not t1 & t2</td>
 <td>freq(t2) - freq(t1,t2)</td>
</tr>
<tr>
 <td>k22</td>
 <td>not t1 & not t2</td>
 <td>total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) )</td>
</tr>
</table>

<p>calculating this value for the 1,300,000,000 bigrams of wikipedia 
(<a href="https://github.com/matpalm/collocations/tree/master/llr">code here</a>) 
we get these top 10 bigrams...</p>
<table class="data">
<tr><td>rank</td><td>bigram</td><td>llr</td></tr>
<tr><td>1</td><td>of the</td><td>21,369,480</td></tr>
<tr><td>2</td><td>in the</td><td>    12,669,724</td></tr>
<tr><td>3</td><td>the the</td><td>   10,743,814</td></tr>
<tr><td>4</td><td>United States</td><td> 10,490,802</td></tr>
<tr><td>5</td><td>is a</td><td>     8,948,460</td></tr>
<tr><td>6</td><td>New York</td><td>   6,973,104</td></tr>
<tr><td>7</td><td>such as</td><td>   6,175,861</td></tr>
<tr><td>8</td><td>the of</td><td>   5,821,300</td></tr>
<tr><td>9</td><td>to be</td><td>   5,374,484</td></tr>
<tr><td>10</td><td>has been</td><td>   5,348,722</td></tr>
</table>

<p>this surprised me a bit since these bigrams are not at all what i expected.... especially when you compare the results against just ranking by
raw bigram frequency (which is obviously much easier to calculate)</p>
<table class="data">
<tr><td>rank based<br/>on llr</td><td>rank based<br/>on freq</td><td>bigram</td></tr>
<tr><td>1</td><td>1</td><td>of the</td></tr>
<tr><td>2</td><td>2</td><td>in the</td>   </tr>
<tr><td>3</td><td>20,000</td><td>the the</td>  </tr>
<tr><td>4</td><td>31</td><td>United States</td></tr>
<tr><td>5</td><td>4</td><td>is a</td>  </tr>
<tr><td>6</td><td>48</td><td>New York</td></tr>
<tr><td>7</td><td>27</td><td>such as</td></tr>
<tr><td>8</td><td>18,000</td><td>the of</td></tr>
<tr><td>9</td><td>14</td><td>to be</td>  </tr>
<tr><td>10</td><td>36</td><td>has been</td></tr>
</table>

<h2>do i have a bug?</h2>
<p>at first i thought i must have a bug but manually redoing the top one (of,the) gives the same answer</p>
<p>k11 
 = f(of,the) 
 = 12,290,443</p>
<p>k12 
 = f(of) - f(of,the) 
 = 41,478,115 - 12,290,443 
 = 29,187,672 </p>
<p>k21 
 = f(the) - f(of,the) 
 = 74,807,672 - 12,290,443 
 = 62,517,229</p>
<p>k22 
 = total number bigrams - ( f(of) + f(the) - f(of, the) ) 
 = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) 
 = 1,006,477,763</p>
<p>and
org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763)
 = 2.1369480467885613E7</p>
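<p>the same check can be done without mahout. this is a minimal sketch that assumes the llr is the standard g-test statistic, 2 * sum(observed * ln(observed / expected)); <tt>contingency</tt> and <tt>g_test</tt> are hypothetical names, not mahout's api.</p>
<pre>
```python
import math


def contingency(f12, f1, f2, total_bigrams):
    """k11, k12, k21, k22 from the bigram count, the two term counts and the total."""
    k11 = f12
    k12 = f1 - f12
    k21 = f2 - f12
    k22 = total_bigrams - (f1 + f2 - f12)
    return k11, k12, k21, k22


def g_test(k11, k12, k21, k22):
    """G statistic of a 2x2 table: 2 * sum(observed * ln(observed / expected))."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            # expected count is row * col / n, so observed / expected = k * n / (row * col)
            total += k * math.log(k * n / (row * col))
    return 2.0 * total


# counts for (of, the) taken from the working above
k = contingency(12290443, 41478115, 74807672, 1110473107)
llr = g_test(*k)

print(k)    # (12290443, 29187672, 62517229, 1006477763)
print(llr)  # should agree with the mahout value above, up to floating point noise
```
</pre>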
<p>hmmm.. i wonder what i'm missing? to be continued i guess!</p>
<h2>weighting by frequency</h2>
<p>coming back to mutual information a variation is to simply weight by the bigram frequency.</p>
<p>so instead of sorting on <tt>mutual info</tt> we can sort on <tt>log(bigram freq) * mutual info</tt>. note: 
we can't weight
using the bigram frequency directly because its values are on such a different scale from the mutual info. rather than just
normalising we can reduce the bigram frequency using a logarithm which feels "fair" since these frequencies follow a power law anyways.</p>
<p>ordering by this new metric gives the following results which seem ok. the main thing is i didn't have to
specify an arbitrary cut off frequency!</p>
<table class="data">
<tr> <td>bigram</td> <td>bigram freq</td> <td>mutual info</td> <td>log(freq) *<br/> mutual info</td> </tr>
<tr> <td>fn org</td> <td>45050</td> <td>14.699 </td> <td>  157.514</td> </tr>
<tr> <td>Buenos Aires</td> <td>20682</td> <td>15.808   </td> <td>157.092</td> </tr>
<tr> <td>Socorro LINEAR</td> <td>97365</td> <td>13.576   </td> <td>155.943</td> </tr>
<tr> <td>gastropod mollusk</td> <td>19342</td> <td>15.687   </td> <td>154.835</td> </tr>
<tr> <td>Hong Kong</td> <td>67738</td> <td>13.827   </td> <td>153.804</td> </tr>
<tr> <td>Los Angeles</td> <td>    134801</td> <td>12.883    </td> <td>152.172</td> </tr>
<tr> <td>Tel Aviv</td> <td>9144 </td> <td>         16.667   </td> <td>152.018</td> </tr>
<tr> <td>Burkina Faso</td> <td>5407 </td> <td>         17.649   </td> <td>151.705</td> </tr>
<tr> <td>Kuala Lumpur</td> <td>6450 </td> <td>         17.233   </td> <td>151.173</td> </tr>
<tr> <td>Notre Dame</td> <td>13546  </td> <td>         15.873   </td> <td>151.021</td> </tr>
</table>
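<p>the last column can be reproduced from the other two. a small sketch, assuming the log is the natural log (that's inferred from the table values, it isn't stated anywhere); <tt>weighted_score</tt> is a hypothetical name.</p>
<pre>
```python
import math


def weighted_score(bigram_freq, mutual_info):
    # natural log assumed; inferred from the table, not stated in the text
    return math.log(bigram_freq) * mutual_info


# (bigram, bigram freq, mutual info) copied from three rows of the table
rows = [
    ("fn org", 45050, 14.699),
    ("Buenos Aires", 20682, 15.808),
    ("Burkina Faso", 5407, 17.649),
]

by_mi = sorted(rows, key=lambda r: r[2], reverse=True)
by_weighted = sorted(rows, key=lambda r: weighted_score(r[1], r[2]), reverse=True)

print([r[0] for r in by_mi])        # ['Burkina Faso', 'Buenos Aires', 'fn org']
print([r[0] for r in by_weighted])  # ['fn org', 'Buenos Aires', 'Burkina Faso']
```
</pre>
<p>the recomputed scores differ from the table in the second decimal place, presumably because the mutual info column is rounded.</p>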

<p>at least good enough to provide data for the next experiment...</p>]]></content:encoded>
    </item>
    <item>
      <title>collocations in wikipedia, part 1</title>
      <link>http://matpalm.com/blog/2011/10/22/collocations_1</link>
      <category><![CDATA[nlp]]></category>
      <category><![CDATA[phrase-extraction]]></category>
      <category><![CDATA[collocations]]></category>
      <guid>http://matpalm.com/blog/2011/10/22/collocations_1</guid>
      <description>collocations in wikipedia, part 1</description>
      <content:encoded><![CDATA[<h2>introduction</h2>
<p><a href="http://en.wikipedia.org/wiki/Collocation">collocations</a> are combinations of terms that occur together more frequently than
you'd expect by chance. </p>
<p>they can include </p>
<ul>
<li>proper noun phrases like 'Darth Vader'</li>
<li>stock/colloquial phrases like 'flora and fauna' or 'old as the hills'</li>
<li>common adjectives/noun pairs (notice how 'strong coffee' sounds ok but 'powerful coffee' doesn't?)</li>
</ul>
<p>let's go through a couple of 
techniques for finding collocations taken from the exceptional nlp text 
<a href="http://www.amazon.com/gp/product/0262133601?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0262133601">"foundations of statistical natural language processing"</a> by manning and schutze.</p>
<h2>mutual information</h2>
<p>the first technique we'll try is <a href="http://en.wikipedia.org/wiki/Mutual_information">mutual information</a>. it's a way
of scoring terms based on how often they appear together vs how often they appear separately. </p>
<p>the intuition is that if two (or three) terms
appear together a lot, but hardly ever appear without each other, they probably can be treated as a phrase.</p>
<p>a common definition for bigram mutual information is </p>
<p>\( MutualInformation(t1,t2) = log _{2} \frac{P(t1,t2)}{P(t1).P(t2)} \)</p>
<p>and a definition for trigram mutual information i found in the paper <a href="http://acl.ldc.upenn.edu/P/P94/P94-1033.pdf">A Corpus-based Approach to Automatic Compound Extraction</a> is </p>
<p>\( MutualInformation(t1,t2,t3) = log _{2} \frac{P(t1,t2,t3)}{P(t1).P(t2).P(t3)+P(t1).P(t2,t3)+P(t1,t2).P(t3)} \)</p>
<p>given a corpus we can use simple maximum likelihood estimates to calculate these probabilities ...</p>
<p>\( P(t1,t2,t3) = \frac{freq(t1,t2,t3)}{\#trigrams} \)
and \( P(t1,t2) = \frac{freq(t1,t2)}{\#bigrams} \)
and \( P(t1) = \frac{freq(t1)}{\#unigrams} \)</p>
<p>so! we need a corpus! let's ...</p>
<ol>
<li>grab a <a href="http://download.freebase.com/wex/">freebase wikipedia dump</a></li>
<li>pass it through the <a href="http://nlp.stanford.edu/software/lex-parser.shtml">stanford nlp parser</a> to 
extract all the sentences </li>
<li>build the frequency tables for our ngrams</li>
</ol>
<p>( see <a href="https://github.com/matpalm/collocations">my project on github</a> for the gory details to reproduce )</p>
<p>working with the 2011-10-15 freebase dump we start with 5,700,000 articles.
from this the stanford parser extracts 55,000,000 sentences.</p>
<p>from these sentences we can extract some ngrams ...</p>
<table class="data">
<tr><td>&nbsp;</td>   <td>#total</td>        <td>#distinct</td>  </tr> 
<tr><td>unigrams</td> <td>1,386,868,488</td> <td>8,295,593</td>   </tr> 
<tr><td>bigrams</td>  <td>1,331,695,519</td> <td>99,340,352</td>  </tr> 
<tr><td>trigrams</td> <td>1,276,522,552</td> <td>381,541,510</td> </tr> 
</table>

<p>the top 5 of each being...</p>
<table><tr>

<td>
<table class="data">
<tr><td>unigram</td>   <td>freq</td>  </tr> 
<tr><td>the</td> <td>74,528,781</td> </tr> 
<tr><td>,</td> <td>70,605,655</td> </tr> 
<tr><td>.</td> <td>54,902,186</td> </tr> 
<tr><td>of</td> <td>41,340,440</td> </tr> 
<tr><td>and</td> <td>34,962,970</td> </tr> 
</table>
</td>

<td>
<table class="data">
<tr><td>bigram</td>   <td>freq</td>  </tr> 
<tr><td>of the</td> <td>12,184,383</td> </tr> 
<tr><td>in the</td> <td>8,042,527</td> </tr> 
<tr><td>, and</td> <td>7,223,201</td> </tr> 
<tr><td>, the</td> <td>4,756,776</td> </tr> 
<tr><td>to the</td> <td>4,077,474</td> </tr> 
</table>
</td>

<td>
<table class="data">
<tr><td>trigram</td>   <td>freq</td>  </tr> 
<tr><td>| | |</td> <td>805,094</td> </tr> 
<tr><td>, and the</td> <td>767,814</td> </tr> 
<tr><td>one of the</td> <td>617,374</td> </tr> 
<tr><td>-RRB- is a</td> <td>562,709</td> </tr> 
<tr><td>| - |</td> <td>516,652</td> </tr> 
</table>
</td>

</tr></table>
<p>(note: -RRB- is the tokenised right parenthesis) </p>
<p>and, as always, the devil's in the detail when it comes to tokenisation... you always have to make lots of decisions; if 
we're after word pairs/triples should we just remove single characters such as '-' or '|' ? for this experiment i decided to leave them
in as they act as a convenient separator.</p>
<p>overall the freebase data is clean enough for some hacking. had to remove some stray html markup left in from the original wikimedia
parse (so the stanford parser wouldn't implode) but other than that we can get away with ignoring anomalies such as the trigram '| - |'
(hoorah for statistical methods!)</p>
<h3>bigram mutual information</h3>
<p>calculating the mutual information for all the bigrams with a frequency over 5,000 gives the following top ranked ones</p>
<pre>
rank             bigram  freq   m_info
1          Burkina Faso  5417 17.88616
2       Rotten Tomatoes  5695 17.50873
3          Kuala Lumpur  6441 17.47578
4              Tel Aviv  9106 16.90873
5           Baton Rouge  5587 16.85029
6        Figure Skating  5518 16.44119
7             Lok Sabha  7429 16.43407
8            Notre Dame 13516 16.11460
9          Buenos Aires 20595 16.05346
10    gastropod mollusk 19335 15.92581
11           Costa Rica 11014 15.85664
12         Barack Obama  9742 15.84432
13           vice versa  5205 15.66973
14              hip hop 15727 15.63575
15        Uttar Pradesh  7833 15.63525
16   main-belt asteroid 10551 15.62005
17 Theological Seminary  6131 15.61613
18         Saudi Arabia 14887 15.59454
19                sq mi  8492 15.58054
20            São Paulo 13832 15.53181
</pre>

<p>these are pretty much all proper nouns and, though they are all great finds, they're not really the adjective/noun
phrases i was particularly interested in. i guess it's not too surprising since we've done nothing in terms of POS tagging yet.</p>
<p>see <a href="https://github.com/matpalm/collocations/blob/master/bigram_mutual_info.top1k.tsv">here</a> for the top 1,000</p>
<p>a plot of term frequency vs mutual information score shows an expected huge density of low frequency / low mutual information bigrams.
the low frequency / high mutual info ones (in the top left) are the ones in the table above and the high frequency / low 
mutual info ones (in the bottom right) correspond to boring language constructs such as "of the", "to the" or ", and".</p>
<img src="/blog/imgs/2011/10/bigram_mutual_info.png"/>

<h3>trigram mutual information</h3>
<p>what about trigrams? here are the top 20 with a support of over 1,000</p>
<pre>
rank                   trigram freq   m_info
1                Abdu ` l-Bahá 1011 19.06866
2    Dravida Munnetra Kazhagam 1043 18.98674
3              Ab urbe condita 1059 18.58179
4                Dar es Salaam 1130 18.09764
5              Kitts and Nevis 1095 18.02320
6             Procter & Gamble 1255 17.96789
7          Antigua and Barbuda 1290 17.90375
8          agnostic or atheist 1068 17.84620
9                Vasco da Gama 1401 17.77709
10                Ku Klux Klan 1944 17.77443
11              Ways and Means 1070 17.51264
12             Croix de Guerre 1196 17.46765
13        Jehovah 's Witnesses 2235 17.46177
14                  SV = Saves 1980 17.24957
15               Venue | Crowd 1518 17.24024
16             summa cum laude 1363 17.22880
17        Teenage Mutant Ninja 1003 17.17236
18             Osama bin Laden 1734 17.16104
19             magna cum laude 1566 17.13729
20         Names -LRB- US-ACAN 2813 16.91815
</pre>

<p>again mostly proper nouns apart from some oddities such as "SV = Saves" (which must be from some type of sports glossary since i also see later "Pts = Points" &amp; "SO = Strikeouts" )</p>
<p>see <a href="https://github.com/matpalm/collocations/blob/master/trigram_mutual_info.top1k.tsv">here</a> for the top 1,000</p>
<p>a plot of frequency vs mutual info is similar to the bigram case.</p>
<img src="/blog/imgs/2011/10/trigram_mutual_info.png"/>

<p>and the top 10 non capitalised trigrams are curious...</p>
<pre>
agnostic or atheist
summa cum laude
magna cum laude
unmarried opposite-sex partnerships
flora and fauna
non-institutionalized group quarters
mollusk or micromollusk
italics indicate fastest
air-breathing land snail
</pre>

<p>air-breathing land snails #ftw !</p>
<h2>bigrams at a distance</h2>
<p>a variation of the standard bigrams approach is to allow tokens to be treated as bigrams as long as they
have no more than 2 tokens between them.</p>
<p>eg 'the cat in the hat' 
which would usually just have bigrams ['the cat','cat in', 'in the', 'the hat'] 
instead is defined by the bigrams ['the cat', 'the in', 'the the', 'cat in', 'cat the', 'cat hat', 'in the', 'in hat', 'the hat']</p>
<p>this results in roughly 2.4 times the bigrams (3,154,200,111 instead of 1,331,695,519) so it's a bit more processing
but it allows tokens to influence each other at a short distance</p>
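<p>a small python sketch of this "bigrams at a distance" extraction. the function name and the <code>max_gap</code> parameter are my own labels for illustration; it reproduces the 'the cat in the hat' example above.</p>

```python
def bigrams_at_distance(tokens, max_gap=2):
    """pair each token with the tokens after it, allowing up to max_gap tokens in between."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 2 + max_gap]:
            pairs.append(left + " " + right)
    return pairs

tokens = "the cat in the hat".split()
pairs = bigrams_at_distance(tokens)
standard = bigrams_at_distance(tokens, max_gap=0)
```

<p>with <code>max_gap=0</code> this falls back to the usual adjacent bigrams.</p>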
<p>calculating the mutual information for these bigrams gives a slightly different set</p>
<pre>
rank                    bigram  freq   m_info
1                    expr expr 20888 17.04807
2                    ifeq ifeq  6507 16.18608
3                 Burkina Faso  5473 16.14546
4              Rotten Tomatoes  5705 15.75572
5                 Kuala Lumpur  6457 15.72382
6                SO Strikeouts  5788 15.56487
7        Masovian east-central  8452 15.41213
8                    Earned SO  5651 15.40456
9                  Wins Losses  7984 15.23901
10                    Tel Aviv  9137 15.15810
11                 Baton Rouge  5599 15.09785
12            Dungeons Dragons  5509 14.84334
13             Trinidad Tobago  6241 14.77053
14              Figure Skating  5528 14.68826
15                   Lok Sabha  7435 14.67970
16     background-color E9E9E9  8490 14.65283
17              Haleakala NEAT  5328 14.51430
18             Kitt Spacewatch 17854 14.43547
</pre>

<p>so more noise from the original freebase parse eg (expr, expr) or (background-color, E9E9E9)</p>
<p>interesting to see it picks up what would otherwise be a trigram with a middle 'and' eg (Dungeons, Dragons) and
(Trinidad, Tobago) </p>
<h2>summary</h2>
<p>so we've found lots of proper nouns! these can be very useful if you're doing feature extraction for a classifier that doesn't like
dependent features; a tokenisation of ['Barack Obama', 'went', 'to', 'Kuala Lumpur'] is often better than 
['Barack', 'Obama', 'went', 'to', 'Kuala', 'Lumpur']</p>
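<p>one way to apply the found bigrams during tokenisation is a greedy left to right merge. this is only a sketch; the function name is made up and the collocation set would in practice come from the top ranked bigrams.</p>

```python
def merge_collocations(tokens, collocations):
    """greedily join adjacent token pairs that appear in the collocations set."""
    merged = []
    i = 0
    while i != len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in collocations:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

collocations = {("Barack", "Obama"), ("Kuala", "Lumpur")}
tokens = ['Barack', 'Obama', 'went', 'to', 'Kuala', 'Lumpur']
merged = merge_collocations(tokens, collocations)
```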
<p>coming up next, the mean/sd distance method...</p>]]></content:encoded>
    </item>
    <item>
      <title>an exercise in handling mislabelled training data</title>
      <link>http://matpalm.com/blog/2011/10/03/mislabelled-training-data</link>
      <category><![CDATA[]]></category>
      <category><![CDATA[training]]></category>
      <category><![CDATA[vowpal wabbit]]></category>
      <guid>http://matpalm.com/blog/2011/10/03/mislabelled-training-data</guid>
      <description>an exercise in handling mislabelled training data</description>
      <content:encoded><![CDATA[<h2>intro</h2>
<p>as part of my <a href="https://github.com/matpalm/diy_twitter_client">diy twitter client project</a> 
i've been using the <a href="https://dev.twitter.com/docs/streaming-api/methods">twitter sample streams</a> as a source
of unlabelled data for some <a href="http://en.wikipedia.org/wiki/Mutual_information">mutual information</a> analysis. 
these streams are a great source of random tweets but 
include a lot of non english content. extracting the english tweets would be pretty straightforward if the ['user']['lang'] 
field of a tweet was 100% representative of the tweet's language but a lot of the time it's not; can we use
these values at least as a starting point?</p>
<p>one way to see how consistent the relationship between user_lang and the tweet language is would be to</p>
<ol>
<li>train a classifier for predicting the tweet's language assuming the user_lang field is correct</li>
<li>have the classifier reclassify the same tweets and see which ones stand out as being misclassified</li>
</ol>
<p>yes, yes, i realise that testing against the same data you've trained against is a big no no but i'm curious...</p>
<h2>method</h2>
<p>let's start with 100,000 tweets taken from the <a href="https://dev.twitter.com/docs/streaming-api/methods">sample stream</a>. 
we'll use <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki">vowpal wabbit</a> for the classifier and extract features as follows...</p>
<ol>
<li>lower case the tweet text.</li>
<li>remove hashtags, user mentions and urls.</li>
<li>split the text into character unigrams, bigrams and trigrams.</li>
</ol>
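<p>the three steps above could look something like this in python. the regular expressions are my own rough guess at what counts as a hashtag, mention or url, not the ones used in the project.</p>

```python
import re

# rough patterns (an assumption): hashtags, user mentions and http(s) urls
STRIP = re.compile(r"(#\w+|@\w+|https?://\S+)")

def char_ngrams(text, max_n=3):
    """character unigrams, then bigrams, then trigrams."""
    return [text[i:i + n] for n in range(1, max_n + 1) for i in range(len(text) - n + 1)]

def features(tweet_text):
    text = tweet_text.lower()
    text = STRIP.sub(" ", text)
    text = " ".join(text.split())  # collapse the whitespace left behind
    return char_ngrams(text)

example = features("Hi @bob #yo http://t.co/x")
```

<p>note that vowpal wabbit's input format treats spaces, ':' and '|' as special, so any ngram containing them would need escaping before being written out; how the original code handles that isn't shown here.</p>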
<p>we treat tweets marked as user_lang=en as the +ve case for the classifier (class 1) and all other tweets as the -ve case (class 0).</p>
<p>the standard output for predictions from vowpal is a value from 0.0 (not english) to 1.0 (english) but we'll use the raw prediction values instead; 
the magnitude of these in some way describes the model's confidence in the decision.</p>
<h2>results</h2>
<p>when we reclassify the tweets the model does pretty well (not surprisingly given we're testing against the same data we trained against)</p>
<p>some examples include..
<pre><b>
tweet text      watching a episode of law &amp; order this sad awww
marked english? 1.0 ( yes )
raw prediction  0.998317 ( model agrees it's english )
tweet text      こけむしは『高杉晋助、沖田総悟、永倉新八、神威、白石蔵ノ介』に誘われています。誰を選ぶ？
marked english? 0.0 ( marked as ja )
raw prediction  -1.06215 ( model agrees, <em>definitely</em> not english )
</b></pre></p>
<p>it's not getting 100% (what classifier ever can?) and in part that's because the labelling is "incorrect" at times. </p>
<p>we can use 
<a href="http://osmot.cs.cornell.edu/kddcup/software.html">perf</a> to check the accuracy of the model (though it's not really checking "accuracy", more like 
checking "agreement") and doing this we can see the classifier is correct roughly 82% of the time.</p>
<div class="pygments_murphy"><pre>&gt; accuracy
 [1] 0.82496 0.81260 0.82690 0.82654 0.82454 0.82354 0.82078 0.82023 0.82120 0.81990
&gt; summary(accuracy)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.8126  0.8204  0.8224  0.8221  0.8249  0.8269
</pre></div>

<p><small>(see <a href="https://github.com/matpalm/mislabelled-training-data/blob/master/evaluate.sh">evaluate.sh</a> for the code to reproduce this)</small></p>
<h2>analysis</h2>
<p>so it generally does well but the most interesting cases are when the model <b>doesn't</b> agree with the label. </p>
<pre><b>
tweet text      [천국이 RT이벤트]2011 대한민국 소비자신뢰 대표브랜드 대상수상! 알바천국이 여러분의 사랑에힘입어
marked english? 1.0 ( hmmm, not sure this is in english :/ )
raw prediction  -0.52598 ( ie model thinks it's not english )
"disagreement"  1.52598
</b></pre>

<p>this is great! the model has correctly identified this instance is mislabelled. however sometimes the model disagrees and is wrong...</p>
<pre><b>
tweet text      поняла …что она совсем не нужна ему.
marked english? 0.0 ( fair enough, looks russian to me.. )
raw prediction  5.528163 ( model strongly thinks it's english )
"disagreement"  5.528163
</b></pre>

<p>a bit of poking through the tweets shows that there are enough russian tweets marked as english for it to be learnt as english...</p>
<h2>"correcting" the labels</h2>
<p>we can score each tweet based on how much the model disagrees (mean square error of "disagreement" across the multiple runs) and we see, at least for the top 200, that
the model was right the vast majority of the time (ie the language of the tweet isn't the user_lang).</p>
<p>what we can do then is trust the model and change the user_lang as required for the top, say, 100 and reiterate.</p>
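<p>a sketch of that scoring and relabelling step. i'm assuming here that "disagreement" is the absolute difference between the label and the raw prediction (this reproduces the two values shown above) and that "changing the user_lang" means flipping the binary label; the real code is in the github project and may differ.</p>

```python
def disagreement(label, raw_prediction):
    """distance between the label (0.0 or 1.0) and the raw prediction."""
    return abs(label - raw_prediction)

def mean_square_disagreement(label, raw_predictions):
    """mean of the squared disagreement across several runs."""
    return sum(disagreement(label, p) ** 2 for p in raw_predictions) / len(raw_predictions)

def relabel_top(labels, predictions_per_run, top_n):
    """flip the labels of the top_n tweets the model disagrees with most."""
    scores = {tweet_id: mean_square_disagreement(labels[tweet_id], preds)
              for tweet_id, preds in predictions_per_run.items()}
    worst = sorted(scores, key=scores.get, reverse=True)[:top_n]
    updated = dict(labels)
    for tweet_id in worst:
        updated[tweet_id] = 1.0 - labels[tweet_id]
    return updated

# hypothetical tweet ids, labels and raw predictions from two runs
labels = {"a": 1.0, "b": 0.0, "c": 1.0}
runs = {"a": [0.9, 1.0], "b": [0.1, -0.2], "c": [-0.5, -0.6]}
updated = relabel_top(labels, runs, top_n=1)
```

<p>under this definition a tweet marked non english with a strongly negative raw prediction would also score as a disagreement, so the original presumably treats that case differently.</p>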
<p>if we do this overall iteration 10 times we see a gradual improvement in the model.</p>
<img src="/blog/imgs/2011/10/acc_vs_runs.png" />

<p>comparing the first run (r01) to the last (r10) the mean has risen a little bit from 0.8245 to 0.8298 and a 
t-test thinks this change is significant (p-value = 0.002097 &lt; 0.05); though it's not really that huge an improvement</p>
<h2>but an even simpler solution</h2>
<p>the iterative solution was novel but it turns out there's a much better solution; make a first pass on the data and if you see 
one of the common non english characters и, の, ل or น just mark the tweet as non english.</p>
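<p>in python that first pass is only a couple of lines (the names are made up for illustration):</p>

```python
# the four marker characters mentioned above
NON_ENGLISH_CHARS = set("иのلน")

def looks_non_english(tweet_text):
    """true if the tweet contains any of the marker characters."""
    return not NON_ENGLISH_CHARS.isdisjoint(tweet_text)
```

<p>it only catches tweets that happen to contain one of those four characters, so it's a cheap cleanup of the labels rather than a language detector.</p>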
<p>if we do this we get an immediate improvement</p>
<img src="/blog/imgs/2011/10/before_and_update_lang_set.png" />

<div class="pygments_murphy"><pre>&gt; summary(updated)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8144  0.8308  0.8339  0.8326  0.8359  0.8382 
</pre></div>

<p>and you don't need a t-test to see this change is significant :)</p>
<h2>tl;dr</h2>
<ol>
<li>a supervised classifier can be used in an iterative sense to do unsupervised work</li>
<li>but never forget a simple solution can often be the best!</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>do all first links on wikipedia lead to philosophy?</title>
      <link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy</link>
      <category><![CDATA[graph]]></category>
      <category><![CDATA[wikipedia]]></category>
      <guid>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy</guid>
      <description>do all first links on wikipedia lead to philosophy?</description>
      <content:encoded><![CDATA[<hr>

<p>(update: like all interesting things it turns out <a href="http://en.wikipedia.org/wiki/User:Ilmari_Karonen/First_link">someone else had already done this</a> :D)</p>
<h2>questions</h2>
<p>a <a href="http://xkcd.com/903/">recent</a> xkcd posed the idea...</p>
<p><i>wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at Philosophy.</i></p>
<p>this raises a number of questions</p>
<ol>
<li>Q: though i wouldn't be surprised if it's true for <em>most</em> articles it can't be true for <em>all</em> articles. can it?</li>
<li>Q: what's the distribution of distances (measured in "number of clicks away") from 'Philosophy'?</li>
<li>Q: by this same measure what's the furthest article from 'Philosophy'?</li>
<li>Q: are there any other articles that are more common than 'Philosophy'? </li>
<li>Q: what are the common paths to 'Philosophy'?</li>
</ol>
<p>there's only one way to find out!</p>
<ol>
<li>grab a wikipedia dump</li>
<li>build the graph of 'article' to 'first link to next article' (not in parentheses or italics)</li>
<li>do breadth first search backwards from 'Philosophy' and see what things look like</li>
</ol>
<hr>

<h2>getting and processing the data</h2>
<p>for my first attempt i tried to use the <a href="http://wiki.freebase.com/wiki/WEX">freebase wikipedia dump</a>. my thought was it'd be easier
to deal with a preparsed dataset but it didn't work out. </p>
<p>two big problems....</p>
<ol>
<li>lots of information has been lost in the preparsing (eg. it was sometimes hard to determine if the first links were from the main body of text or from a sidebar )</li>
<li>some pages weren't parsed properly at all and were just blank; included ones like <a href="http://en.wikipedia.org/wiki/Greeks">Greeks</a>
which ended up being pretty important.</li>
</ol>
<p>instead i went for a <a href="http://download.wikimedia.org/enwiki/20110722/">raw wikimedia dump</a>, in particular the enwiki-20110722-pages-articles.xml.bz2 version.
it's 7gb compressed &amp; 30gb uncompressed.</p>
<p>for preprocessing there were a number of steps</p>
<ol>
<li>split the dataset into pages that represent redirects and the actual articles themselves</li>
<li>dereference all the redirects (to avoid redirects that redirect to other redirects)</li>
<li>parse all the articles; the crux of this is done with <a href="http://code.pediapress.com/wiki/wiki/mwlib">mwlib</a> 
and <a href="https://github.com/matpalm/wikipediaPhilosophy/blob/master/article_parser.py">article_parser.py</a>; to make a big list of edges of 'from' nodes (the article) and 'to' nodes (the first applicable link on the article page)</li>
<li>dereference the edges to make sure all redirects have been followed</li>
</ol>
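<p>the dereferencing steps boil down to following each redirect until it lands on a real article. a minimal sketch, assuming the redirects have been loaded into a dict of title to target (the function names are illustrative, not the ones in the project):</p>

```python
def dereference(title, redirects):
    """follow redirects until reaching a title that is not itself a redirect."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)  # guards against redirects that loop back on themselves
        title = redirects[title]
    return title

def dereference_edges(edges, redirects):
    """rewrite the 'to' side of every edge so it points at a real article."""
    return [(src, dereference(dst, redirects)) for src, dst in edges]

# toy data: "r1" redirects to "r2" which redirects to the article "target"
redirects = {"r1": "r2", "r2": "target"}
edges = dereference_edges([("some article", "r1"), ("other article", "target")], redirects)
```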
<p>some general statements before we go further</p>
<ol>
<li>wikipedia is under heavy edit churn. i've been doing this project in 15-30 minutes chunks for a few weeks and it's amazing
 how often i'd compare the parsing to live wikipedia and find out a page had already subtly changed. god knows what it looks like currently.</li>
<li>i wrote all the code for this in python as i'm trying to move away from ruby to get better data related library support. everything in fact <em>except</em> for
the breadth first search which i did in java. the full graph as a dict was <em>insanely</em> slow to access, i must be doing something wrong.
for the full details see 
<a href="http://www.github.com/matpalm/wikipediaPhilosophy">the code on this project</a>. git cloning the project and executing the README
as a shell script may [1] do something close to all the steps from start to finish. <small>[1] or it might not</small></li>
</ol>
<p>the end result of the parsing is a list of 3.6e6 edges of the form 'article' -&gt; 'first link to next article' (after following redirects).</p>
<p>all the 'article's are unique but there are only 500e3 distinct 'next article's which is already very interesting; it means less than 15% of articles 
on wikipedia are represented by one of these first links; this graph is very "bushy" (ie lots of leaf nodes).</p>
<p>to calculate the distance from 'Philosophy' for all articles it's a straightforward 
<a href="http://en.wikipedia.org/wiki/Breadth_first_search">breadth first search</a> and
because this search doesn't <a href="http://en.wikipedia.org/wiki/Graph_cycle">cycle</a> back to 'Philosophy' again it ends
up building a <a href="http://en.wikipedia.org/wiki/Tree_(graph_theory)">tree</a>.</p>
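<p>a python sketch of that backwards search, assuming the edges fit in a dict of article to first link (the version i actually ran was in java, this is just to show the idea):</p>

```python
from collections import defaultdict, deque

def distances_to(target, first_link):
    """first_link maps an article to the article its first link points at.
    returns the number of clicks to target for every article that can reach it."""
    incoming = defaultdict(list)
    for article, nxt in first_link.items():
        incoming[nxt].append(article)
    distance = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for article in incoming[current]:
            if article not in distance:  # also stops us walking back into the target
                distance[article] = distance[current] + 1
                queue.append(article)
    return distance

# a toy graph using a few of the articles mentioned in this post
first_link = {
    "Quantity": "Property (philosophy)",
    "Property (philosophy)": "Modern philosophy",
    "Modern philosophy": "Philosophy",
    "Philosophy": "Reason",
    "Sand fence": "Snow fence",
    "Snow fence": "Sand fence",
}
distance = distances_to("Philosophy", first_link)
```

<p>articles that never show up in the result (like the two fences here) are the ones that don't lead to 'Philosophy'.</p>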
<hr>

<h2>the results</h2>
<p>with this tree we can start answering some of our original questions ...</p>
<hr>

<h3>Q: though i wouldn't be surprised if it's true for <em>most</em> articles it can't be true for <em>all</em> articles. can it?</h3>
<p>seems it's not true for all articles; 3.5e6 articles lead to 'Philosophy' but 100e3 don't.</p>
<p>these 100e3 fall into two types</p>
<p>1) 50e3 of them end up in cycles. this is a remarkably low count given 3.5e6 make it to 'Philosophy'.</p>
<p>the vast majority of the cycles are of length 2; eg <strong>Waste management -&gt; Waste collection -&gt; Waste management</strong></p>
<p>( my favorite that i stumbled across is <strong>Sand fence -&gt; Snow fence -&gt; Sand fence</strong><br/>
the first sentence of Snow fence being "A snow fence is a structure, similar to a sand fence ..."<br/>
the first sentence of Sand fence being "A sand fence is a structure similar to a snow fence ..." )</p>
<p>2) the other 50e3 are dead ends; all sorts of examples for this, mainly around pages that were never written or have been deleted.</p>
<p>eg <strong>Windsurfing -&gt; Surface water sports -&gt; Discing</strong> (which has been deleted)</p>
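<p>telling the two types apart is a matter of walking the first links forward and seeing how the walk ends. a sketch, with a made up function name and the same dict of article to first link as the only input:</p>

```python
def follow_first_links(start, first_link, target="Philosophy"):
    """walk first links from start and report how the walk ends."""
    seen = set()
    current = start
    while current != target:
        if current in seen:
            return "cycle"
        if current not in first_link:
            return "dead end"
        seen.add(current)
        current = first_link[current]
    return "reaches target"

# toy graph built from the examples above
first_link = {
    "Sand fence": "Snow fence",
    "Snow fence": "Sand fence",
    "Windsurfing": "Surface water sports",
    "Surface water sports": "Discing",
    "Modern philosophy": "Philosophy",
}
```

<p>here 'Sand fence' ends in a cycle, 'Windsurfing' hits a dead end at 'Discing' and 'Modern philosophy' reaches the target.</p>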
<hr>

<h3>Q: what's the distribution of distances of articles from 'Philosophy'?</h3>
<p>the bulk of the articles are between 10 and 30 clicks away...</p>
<img src="http://matpalm.com/wikipediaPhilosophy/num_articles__number_clicks__philosophy.png"/>

<p>i've trimmed this graph at 70 clicks away since there's a long tail of one single path that is 1001 articles long.</p>
<p><strong>List of state leaders in 1977 -&gt; List of state leaders in 1976 -&gt; List of state leaders in 1975 -&gt;
.... -&gt; List of state leaders in 1001 -&gt; List of state leaders in 1000 -&gt; Fatimid Caliphate -&gt; Arab people
-&gt; Panethnicity -&gt; Ethnic group -&gt; Social group -&gt; Social sciences -&gt; List of academic disciplines 
-&gt; Academia -&gt; Community -&gt; Living -&gt; Life -&gt; Physical body -&gt; Physics -&gt; Natural science -&gt; Science 
-&gt; Knowledge -&gt; Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; Quantity -&gt; Property (philosophy) 
-&gt; Modern philosophy -&gt; Philosophy</strong></p>
<p>seems a bit of a "meta article" outlier we can ignore.</p>
<p>( there's an interesting dip at a distance of 19 too; wonder what's going on there? )</p>
<hr>

<h3>Q: what's the furthest article from 'Philosophy'?</h3>
<p>'Violet &amp; Daisy' is the longest chain i found that didn't include "meta" pages with some kind of sequence number in it. it's 36 articles from 'Philosophy'.</p>
<p><strong>Violet &amp; Daisy -&gt; Saoirse Ronan -&gt; BAFTA Award for Best Actress in a Supporting Role -&gt; British Academy Film Awards -&gt; 
 British Academy of Film and Television Arts -&gt; David Lean -&gt; Order of the British Empire -&gt; Chivalric order -&gt; Knight -&gt; 
 Warrior -&gt; Combat -&gt; Violence -&gt; Psychological manipulation -&gt; Social influence -&gt; Conformity -&gt; Unconscious mind -&gt; 
 Germans -&gt; Germanic peoples -&gt; Proto-Indo-Europeans -&gt; Proto-Indo-European language -&gt; Linguistic reconstruction -&gt; 
 Internal reconstruction -&gt; Language -&gt; Human -&gt; Extant taxon -&gt; Biology -&gt; Natural science -&gt; Science -&gt; Knowledge -&gt; 
 Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; Quantity -&gt; Property (philosophy) -&gt; Modern philosophy -&gt; Philosophy</strong></p>
<hr>

<h3>Q: are there any other articles that are "more common" than 'Philosophy'?</h3>
<p>with 95+% of articles clicking through to 'Philosophy' it's not possible for there to be another unconnected graph with an article more represented than
'Philosophy'. </p>
<p>but if we <em>continue</em> to click through past 'Philosophy' we see we're in a short cycle of 12 articles...</p>
<p><strong>Philosophy -&gt; Reason -&gt; Natural science -&gt; Science -&gt; Knowledge -&gt; Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; 
 Quantity -&gt; Property (philosophy) -&gt; Modern philosophy -&gt; Philosophy</strong></p>
<p>so really <em>any</em> of these are reasonable candidates and are equally good as 'Philosophy' itself for this game.</p>
<hr>

<h3>Q: what are the common paths into 'Philosophy'?</h3>
<p>as mentioned the breadth first search builds a tree of articles with 'Philosophy' at its root.</p>
<p>one metric we can assign to each article in this tree is the number of descendant articles it has.<br/>
'Philosophy', as the root, has all articles as descendants so its number is 3.5e6 and its rank is 1.<br/>
the next ranked by number of descendants is 'Modern philosophy' with 3.4e6 descendants; 
( ie of the 3.5e6 articles that eventually led to 'Philosophy' only 100e3 of them <em>didn't</em> click through 'Modern philosophy').</p>
<p>by ranking articles by this metric we can observe the core structure of the tree.</p>
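<p>counting descendants can be done without recursion by visiting the tree breadth first and then adding the counts up in reverse order. a sketch under the same assumption as before (edges held in a dict of article to first link; the names and the toy articles are illustrative):</p>

```python
from collections import defaultdict

def descendant_counts(first_link, root):
    """number of descendants of every article in the tree rooted at root."""
    children = defaultdict(list)
    for article, nxt in first_link.items():
        if article != root:  # ignore the root's own first link so we get a tree
            children[nxt].append(article)
    # breadth first order from the root; the list grows while we walk it
    order = [root]
    for current in order:
        order.extend(children[current])
    counts = {article: 0 for article in order}
    # leaves first: each article hands its count (plus itself) to its parent
    for article in reversed(order):
        if article != root:
            counts[first_link[article]] += counts[article] + 1
    return counts

# toy tree; "some other article" is a hypothetical extra branch
first_link = {
    "Philosophy": "Reason",
    "Modern philosophy": "Philosophy",
    "Property (philosophy)": "Modern philosophy",
    "Quantity": "Property (philosophy)",
    "some other article": "Philosophy",
}
counts = descendant_counts(first_link, "Philosophy")
ranked = sorted(counts, key=counts.get, reverse=True)
```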
<p><hr>
in fact for the top 10 ranked articles it's hardly a tree, just the chain ...</p>
<p><a href="http://matpalm.com/wikipediaPhilosophy/top10.png"><img src="http://matpalm.com/wikipediaPhilosophy/top10.png" width="100%"/></a></p>
<p><small>(width of the edge is proportional to the number of descendants)</small></p>
<p>it turns out that 3e6 articles (85% of the lot) get to 'Philosophy' through 'Science'.</p>
<p><hr>
in fact it's not until we consider up to the 20th ranked item, 'Biology', before it actually becomes a tree structure ...</p>
<p><a href="http://matpalm.com/wikipediaPhilosophy/top20.png"><img src="http://matpalm.com/wikipediaPhilosophy/top20.png" width="100%"/></a></p>
<p><small>(click for a bigger version)</small></p>
<p><hr>
when we consider the top 200 things start to look a bit more interesting ...</p>
<script src="http://zoom.it/adTw.js?width=auto&height=500px"></script>

<p><hr>
and by the top 1000 things are starting to lose an obvious core structure ...</p>
<script src="http://zoom.it/QyGA.js?width=auto&height=500px"></script>

<p>( though dot's a pretty poor layout engine for this one, i should redo this one )</p>
<h2>conclusions</h2>
<p>so i managed to answer the main questions i had, but it's a fun dataset so there's lots more to do yet!</p>
<p>todos include </p>
<ol>
<li>a better layout for the top 1000 or so</li>
<li>redo with a more recent wiki dump to see what's changed</li>
<li>what happened at a depth of 19 articles?</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>dimensionality reduction using random projections.</title>
      <link>http://matpalm.com/blog/2011/05/10/random-projections/</link>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/2011/05/10/random-projections/</guid>
      <description>dimensionality reduction using random projections.</description>
      <content:encoded><![CDATA[<p>previously i've discussed <a href="http://matpalm.com/lsa_via_svd/intro.html">dimensionality reduction using SVD and PCA</a> but another interesting technique is using a random projection.</p>
<p>in a random projection we project A (a NxM matrix) to A' (a NxO, O &lt; M) by the transform AP=A' where P is a MxO matrix with random values.<br/>
( well not totally random, each column must have unit length (ie the squares of the entries in each column must add to 1) )</p>
<p>though the results of this reduction are generally not as good as the results from SVD or PCA it has two huge benefits</p>
<ul>
<li>can be done <em>without</em> needing to hold P in memory (since its entries can be generated multiple times using a seeded RNG)</li>
<li>and more importantly it's <em>really</em> fast </li>
</ul>
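<p>a pure python sketch of the idea, including the trick of regenerating each column of P from a seed instead of storing it. drawing the entries from a gaussian and the way the per column seed is derived are my assumptions for illustration; any scheme that gives the same column for the same seed would do.</p>

```python
import math
import random

def projection_column(seed, j, m):
    """regenerate column j of P (length m) from the seed, scaled to unit length."""
    rng = random.Random(seed * 1000003 + j)
    column = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in column))
    return [x / norm for x in column]

def project(rows, out_dims, seed=42):
    """compute A' = AP one output column at a time, never holding all of P."""
    m = len(rows[0])
    projected = [[0.0] * out_dims for _ in rows]
    for j in range(out_dims):
        column = projection_column(seed, j, m)
        for i, row in enumerate(rows):
            projected[i][j] = sum(a * p for a, p in zip(row, column))
    return projected

# two 10d points, one at each cluster centre used later in this post
rows = [[2.0] * 10, [8.0] * 10]
reduced = project(rows, out_dims=2)
```

<p>only one column of P exists at any time, so memory use is independent of the number of output dimensions.</p>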
<p>how good is it compared to PCA i wonder? let's have a play around...</p>
<p>consider the 2d dataset of two clear clusters of points around (2,2) and (8,8)</p>
<img src="/blog/imgs/2011/05/2d_data.png"/>

<p>the following shows 5 random projections to 1d compared to a 1d PCA reduction (done using <a href="http://www.cs.waikato.ac.nz/ml/weka/">weka</a>) </p>
<img src="/blog/imgs/2011/05/2d_to_1d_projections.png"/>

<p>there was a clear separation between the 2 classes in 2d and it's retained in 3 out of the 5 projections even though we're using <em>half</em> the space going from 2d to 1d.<br/>
( the 2nd random projection actually looks very similar to PCA )</p>
<p>what about some higher dimensions?</p>
<p>let's generate the same two clusters but in 10d; ie around the points (2,2,2,2,2,2,2,2,2,2) and (8,8,8,8,8,8,8,8,8,8)</p>
<p>though we can't plot this easily there are 2 useful visualisations of the data</p>
<p>a scatterplot, which is pretty uninteresting actually...</p>
<img src="/blog/imgs/2011/05/10d_scatterplot.png"/>

<p>or a 2d tour through the 10d space</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/QrJQDSFTg-k?hl=en&fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/QrJQDSFTg-k?hl=en&fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></p>
<p>(check out <a href="http://vimeo.com/12292239">my screencast on ggobi</a> if you're after a better idea of what this tour represents)</p>
<p>let's compare 5 random projections to PCA when projecting from 10d to 1d<br/>
this time not so good... only one projection gets the clusters separated, and only just.</p>
<img src="/blog/imgs/2011/05/10d_to_1d_projections.png"/>

<p>what about projecting to 2d? </p>
<table>
<tr><td>PCA</td><td>run1</td><td>run2</td><td>run3</td><td>run4</td><td>run5</td></tr>
<tr><td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_PCA.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run1.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run2.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run3.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run4.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run5.png"/></td></tr>
</table>

<p>PCA is by far the cleanest with the x-axis representing the clusters and the y-axis representing the in cluster variance. <br/>
it clearly shows how the dataset can be projected to 1d, we only needed the first (x-axis) principal component.</p>
<p>the projections aren't too bad, all but 1 of them have the 2 clusters linearly separable.</p>
<p>so in summary i think it's pretty good if you need to do something super fast. in these experiments i was using a pretty contrived dataset but was trying to be quite aggressive in going from
10d to 1d. </p>
<p>i wonder what, if any, difference there would be with sparse data?</p>
<p>&lt; random evening hackery /&gt;</p>]]></content:encoded>
    </item>
    <item>
      <title>pseudocounts and the good-turing estimation (part1)</title>
      <link>http://matpalm.com/blog/2011/04/03/pseudocounts-part-1/</link>
      <category><![CDATA[pseudocounts]]></category>
      <category><![CDATA[statistics]]></category>
      <guid>http://matpalm.com/blog/2011/04/03/pseudocounts-part-1/</guid>
      <description>pseudocounts and the good-turing estimation (part1)</description>
      <content:encoded><![CDATA[<h2>beer</h2>
<p>say we are running the bar at a sold out <a href="http://en.wikipedia.org/wiki/Bad_religion">bad religion</a> concert. the bar serves beer, scotch and water and we decide to record orders over the night so that we can know how much to order for tomorrow's gig...</p>
<table class="data">
<tr><td>drink</td><td>#sales</td></tr> 
<tr><td>beer</td><td>1000</td></tr> 
<tr><td>scotch</td><td>300</td></tr> 
<tr><td>water</td><td>200</td></tr> 
</table>

<p>using these numbers we can predict a number of things..</p>
<p>what is the chance the next person will order a beer?<br/>
it's a pretty simple probability; 1000 beers / 1500 total drinks = 0.67 or 67%</p>
<p>what is the chance the next person will order a water?<br/>
also straightforward; 200 waters / 1500 total drinks = 0.13 or 13%</p>
<h2>t-shirts</h2>
<p>now say we run the t-shirt stand at the same concert....</p>
<p>instead of only selling 3 items (like at the bar) we sell 20 different types of t-shirts. once again we record orders over the night...</p>
<table class="data">
<tr><td>t-shirt</td><td>#sales</td><td>t-shirt</td><td>#sales</td></tr> 
<tr><td>br tour</td><td>15</td><td>pennywise</td><td>3</td></tr>
<tr><td>br logo1</td><td>15</td><td>strung out</td><td>3</td></tr>
<tr><td>br album art 1</td><td>10</td><td>propagandhi</td><td>3</td></tr>
<tr><td>br album art 3</td><td>10</td><td>bouncing souls</td><td>1</td></tr>
<tr><td>nofx logo1</td><td>5</td><td>the vandals</td><td>1</td></tr>
<tr><td>nofx logo2</td><td>5</td><td>dead kennedys</td><td>1</td></tr>
<tr><td>lagwagon</td><td>4</td><td>misfits</td><td>1</td></tr>
<tr><td>frenzal rhomb</td><td>4</td><td>the offspring</td><td>0</td></tr>
<tr><td>rancid</td><td>4</td><td>the ramones</td><td>0</td></tr>
<tr><td>descendants</td><td>3</td><td>mxpx</td><td>0</td></tr>
</table>

<p>we can ask similar questions again regarding the chance of people buying a particular t-shirt</p>
<p>what's the chance the next person to buy a t-shirt wants the official tour t-shirt?<br/>
15 tour t-shirts sold / 88 sold in total = 0.170 or 17.0%</p>
<p>what's the chance the next person to buy a t-shirt wants a descendants t-shirt?<br/>
3 descendants t-shirts sold / 88 sold in total = 0.034 or 3.4%</p>
<p>what's the chance the next person to buy a t-shirt wants an offspring t-shirt?<br/>
0 offspring t-shirts sold / 88 sold in total = 0 or 0%</p>
<p>if you're like me then the last one "feels" wrong. even though we've not seen a single purchase of that t-shirt
it seems a bit rough to say there is <em>no</em> chance of someone buying one. 
this illustrates one of the problems of dealing with
<a href="http://en.wikipedia.org/wiki/Prior_probability">prior probabilities</a></p>
<p>any system using a product of probabilities, such as the modeling of "independent" events in naive bayes, suffers badly from these zero probabilities. i've discussed the problems a few times before in previous experiments such as 
(<a href="../rss.feed/p3/">this one on naive bayes</a> and <a href="../semi_supervised_naive_bayes/semi_supervised_bayes.html">this one on semi supervised bayes</a>)
and the approach i've always used is the simple 
<a href="http://en.wikipedia.org/wiki/Rule_of_succession">rule of succession</a> where we avoid
the zero problem by adding one to the frequency of each event.</p>
<p>for reference here are the probabilities per t-shirt without adjustment...</p>
<div class="pygments_murphy"><pre>R&gt; sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
 [1] 15 15 10 10  5  5  4  4  4  3  3  3  3  1  1  1  1  0  0  0
R&gt; simple_probs = sales / sum(sales)
 [1] 0.17045455 0.17045455 0.11363636 0.11363636 0.05681818 0.05681818
 [7] 0.04545455 0.04545455 0.04545455 0.03409091 0.03409091 0.03409091
[13] 0.03409091 0.01136364 0.01136364 0.01136364 0.01136364 0.00000000
[19] 0.00000000 0.00000000
</pre></div>

<p>... and here are the values for the succession rule case</p>
<div class="pygments_murphy"><pre>R&gt; sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
 [1] 15 15 10 10  5  5  4  4  4  3  3  3  3  1  1  1  1  0  0  0
R&gt; sales_plus_one = sales + 1
 [1] 16 16 11 11  6  6  5  5  5  4  4  4  4  2  2  2  2  1  1  1
R&gt; smooth_probs = sales_plus_one / sum(sales_plus_one)
 [1] 0.14814815 0.14814815 0.10185185 0.10185185 0.05555556 0.05555556
 [7] 0.04629630 0.04629630 0.04629630 0.03703704 0.03703704 0.03703704
[13] 0.03703704 0.01851852 0.01851852 0.01851852 0.01851852 0.00925926
[19] 0.00925926 0.00925926
</pre></div>

<p>no more zeros! yay! but, unfortunately, at the cost of the accuracy of the other values.</p>
<p>it's always worked for me in the past (well at least better than having the zeros) but it's always felt wrong too.  but finally the other day i found another approach, that seems a lot more statistically sound.</p>
<p>it's called <a href="http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation">good-turing estimation</a> 
and it was developed as part of <a href="http://en.wikipedia.org/wiki/Alan_Turing">turing's</a> work at bletchley 
park (so it must be awesome). a decent implementation is explained in 
<a href="http://www.grsampson.net/AGtf1.html">this paper</a> by geoffrey sampson (it's somewhat more 
complex than adding 1)</p>
<p>it works by using the frequency of frequencies (how many items were seen once, how many twice, etc) to redistribute the probabilities, including a
special allocation to be shared among items that have never been seen before. </p>
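<p>to make "frequency of frequencies" concrete, here's a minimal python sketch that counts how many t-shirt designs sold exactly r times. the <code>sales</code> list is copied from the table above; the variable names are just ones i picked for illustration.</p>

```python
from collections import Counter

# sales per t-shirt design, copied from the t-shirt table
sales = [15, 15, 10, 10, 5, 5, 4, 4, 4, 3, 3, 3, 3, 1, 1, 1, 1, 0, 0, 0]

# frequency of frequencies: how many designs sold exactly r times
freq_of_freq = Counter(sales)

for r in sorted(freq_of_freq, reverse=True):
    print(r, freq_of_freq[r])
```

<p>this only builds the input to good-turing; the smoothing and renormalising steps described in sampson's paper are not covered here.</p>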
<p>the following table shows the frequencies, the original probability, the probability adjusted using the rule of
succession and the probabilities as redistributed using good turing estimation.</p>
<table class="data">
<tr>
 <td>freq</td>
 <td>freq of freq</td>
 <td>original</br>prob</td>
 <td>succession</br>prob</td>
 <td>good turing</br>prob</td>
 <td>description</td>
</tr>
<tr><td>15</td><td>2 </td><td>0.170 </td><td>0.148 </td><td>0.160 </td><td>2 t-shirts sold 15 times</td></tr>
<tr><td>10</td><td>2 </td><td>0.113 </td><td>0.101 </td><td>0.107 </td><td>2 t-shirts sold 10 times</td></tr>
<tr><td>5 </td><td>2 </td><td>0.056 </td><td>0.055 </td><td>0.054 </td><td>2 t-shirts sold 5 times</td></tr>
<tr><td>4 </td><td>3 </td><td>0.045 </td><td>0.046 </td><td>0.043 </td><td>3 t-shirts sold 4 times</td></tr>
<tr><td>3 </td><td>4 </td><td>0.034 </td><td>0.037 </td><td>0.033 </td><td>4 t-shirts sold 3 times</td></tr>
<tr><td>1 </td><td>4 </td><td>0.011 </td><td>0.018 </td><td>0.011 </td><td>4 t-shirts sold once</td></tr>
<tr><td>0 </td><td>3 </td><td>0.000 </td><td>0.009 </td><td>0.015 </td><td>3 t-shirts didn't sell</td></tr>
</table>

<p>and here's a graph of the same thing.</p>
<img src="/blog/imgs/tshirts.png"/>

<p>some observations...</p>
<ul>
<li>the rule of succession is really just smoothing; it drags the higher probabilities down in response to pulling the lower probabilities up</li>
<li>the good turing estimation is closer to the real value of the high frequency cases</li>
<li>the good turing estimate for the zero case is quite a bit higher than the rule of succession estimate</li>
<li>and most interesting of all, the good turing estimate for the freq 0 is higher than the estimate for freq 1.</li>
</ul>
<p>the last point in particular i think is really interesting. the good turing algorithm actually gives a total estimate for the zero probability cases (in this example it gave 0.045) and it's up to the user to distribute it among the actual examples (in this example
there were 3 cases so i just divided 0.045 by 3).</p>
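<p>a rough sketch of where that 0.045 can come from, assuming it is the usual good-turing estimate for the unseen mass: the number of items seen exactly once divided by the total number of observations. the names below are made up for this example, and the adjustment of the non-zero frequencies is left out entirely.</p>

```python
# sales per t-shirt design, copied from the t-shirt table
sales = [15, 15, 10, 10, 5, 5, 4, 4, 4, 3, 3, 3, 3, 1, 1, 1, 1, 0, 0, 0]

total_sales = sum(sales)      # 88 t-shirts sold
sold_once = sales.count(1)    # 4 designs sold exactly once
unseen = sales.count(0)       # 3 designs never sold

# assumed estimate of the total probability of all unseen designs
p_unseen_total = sold_once / total_sales

# split evenly across the unseen designs, as done in the table
p_each_unseen = p_unseen_total / unseen

print(round(p_unseen_total, 3), round(p_each_unseen, 3))  # 0.045 0.015
```

<p>splitting evenly is just the simplest choice; nothing in the estimate itself says the unseen mass has to be shared that way.</p>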
<p>if there had been 4 types of t-shirts that hadn't sold the estimate for each of them would have been 0.011, like the 4 t-shirts that sold once.</p>
<p>if there had only been 1 type of t-shirt that hadn't had any sales we'd have to allocate the entire 0.045 to it. in effect the algorithm says it expects that t-shirt to be more likely to sell than the 4 types of t-shirts that had 3 sales each (the 0.033 probability case). </p>
<p>an interesting result, not sure what intuition to take away from it.... </p>
<p>now this is all good, but i actually don't run the bar at a bad religion concert (or the t-shirt stand) 
and i'm actually interested in how this applies to text processing, especially in the area of classification.</p>
<p>so my question is <em>"is the extra computation required for the good-turing calculation worth it?"</em></p>
<p>results coming in part2. work in progress code on <a href="https://github.com/matpalm/pseudocounts">github</a></p>]]></content:encoded>
    </item>
    <item>
      <title>visualising the consistent hash</title>
      <link>http://matpalm.com/blog/2010/09/26/consistent_hash/</link>
      <category><![CDATA[algorithms]]></category>
      <guid>http://matpalm.com/consistent_hash/</guid>
      <description>visualising the consistent hash</description>
      <content:encoded><![CDATA[<p><style type="text/css">
    body {background-color:#000000; color:#cceedd}
    .r {color:#ff0000;}
    .y {color:#ccff00;}
    .c {color:#00ff77;}
    .b {color:#0077ff;}
    .p {color:#cc00ff;}
  </style></p>
<h2>the resource allocation problem</h2>
<p>consider the problem of allocating N resources across M servers (N &gt;&gt; M)</p>
<h2>modulo hash</h2>
<p>a common approach is the straightforward modulo hash...</p>
<p>if we have 4 servers; <pre>servers = [server0, server1, server2, server3]</pre> we can allocate a resource to a server by simply</p>
<ol>
<li>hashing the resource <pre>hash(resource) = 54</pre></li>
<li>reducing modulo 4 <pre>54 % 4 = 2</pre></li>
<li>allocating to that numbered server <pre>servers[2] = server2</pre></li>
</ol>
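<p>the three steps above can be sketched in a few lines of python. <code>hash_resource</code> uses crc32 purely as a stand-in (the original hash function isn't specified here) and the function names are hypothetical.</p>

```python
import zlib

servers = ["server0", "server1", "server2", "server3"]

def hash_resource(resource):
    # stand-in hash for this sketch: crc32 reduced to the 0..99 range
    return zlib.crc32(resource.encode("utf-8")) % 100

def allocate_modulo(hash_value, servers):
    # reduce modulo the number of servers and pick that numbered server
    return servers[hash_value % len(servers)]

print(allocate_modulo(54, servers))  # server2
```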
<p>we can visualise how this scheme maps resources to servers by allocating a colour to each server;
<span class="r">server0 </span> <span class="y">server1 </span> <span class="c">server2 </span> <span class="b">server3 </span></br>
and, assuming we are hashing to a value between 0 and 99, draw the following chart ...</p>
<p><img src="http://matpalm.com/consistent_hash/mod_4.png"></br>
... where the colour of the <i>n</i><sup>th</sup> column represents which server a resource hashing to <i>n</i> would be allocated to.</p>
<p>this hashing scheme is nice for a couple of reasons</p>
<ol>
<li>it's very simple</li>
<li>it allocates resources evenly across the servers (assuming you have a good hashing function)</li>
</ol>
<p>however it has one big drawback; what happens when you change the number of servers?</br>
say for example that due to extra load we have to add another server; <span class="p">server4</span></p>
<p>switching from modulo 4 to modulo 5 means that a resource that used to hash to server2 ...
<pre>54 % 4 = 2</pre>
now hashes to server4 ...
<pre>54 % 5 = 4</pre></p>
<p>in fact if we compare the difference in the hashing we get the following ...</p>
<img src="http://matpalm.com/consistent_hash/mod_4_45_diff_5.png">

<p>... where the top bar represents the allocation with 4 servers</br>
the bottom bar represents the allocation with 5 servers,</br>
with white areas between representing cases of a resource changing which server it was allocated to.</p>
<p>this is pretty bad in terms of reallocation; a whopping <i>80%</i> of the resources have changed which server they are assigned to.</p>
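<p>the 80% figure is easy to check by brute force over the 0 to 99 hash space. a small sketch (the helper name <code>moved_fraction</code> is made up for this example):</p>

```python
def moved_fraction(before, after, hash_space=100):
    # fraction of hash values whose server index differs between two schemes
    moved = sum(1 for h in range(hash_space) if before(h) != after(h))
    return moved / hash_space

print(moved_fraction(lambda h: h % 4, lambda h: h % 5))  # 0.8
```

<p>a hash value only keeps its server when <code>h % 4</code> equals <code>h % 5</code>, which happens for just 4 values in every run of 20.</p>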
<h2>divisor hash</h2>
<p>how about instead of modulo arithmetic we try integer division instead?</p>
<p>considering 4 servers again we allocate a resource by</p>
<ol>
<li>hashing the resource as before <pre>hash(resource) = 54</pre></li>
<li>dividing by 25 (25=100/4; ie hash max / number of servers) and rounding down <pre>54 / (100/4) = 2</pre></li>
<li>allocating to that numbered server <pre>servers[2] = server2</pre></li>
</ol>
<p>as before we can visualise how this scheme maps resources to servers by again allocating a colour to each server;
<span class="r">server0 </span> <span class="y">server1 </span> <span class="c">server2 </span> <span class="b">server3 </span></br>
and, assuming we are hashing to a value between 0 and 99, draw the following chart ...</p>
<img src="http://matpalm.com/consistent_hash/div_4.png"/>

<p>again if we get a 5th server <span class="p">server4</span> we can see how the resources are reallocated ...</p>
<img src="http://matpalm.com/consistent_hash/div_4_45diff_5.png">

<p>this time we only have 50% reallocation, instead of 80%, so that's an improvement.</br>
we also continue to spread the resources evenly across the servers which is great.</br></p>
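<p>here's a minimal sketch of the divisor scheme along with a brute force check of the 50% figure. it assumes the hash space (100) divides evenly by the number of servers, which holds for both 4 and 5; the names are hypothetical.</p>

```python
HASH_SPACE = 100

def allocate_divisor(hash_value, num_servers):
    # integer division by the width of each server's slice of the hash space
    return hash_value // (HASH_SPACE // num_servers)

# count hash values that land on a different server index with 5 servers
moved = sum(1 for h in range(HASH_SPACE)
            if allocate_divisor(h, 4) != allocate_divisor(h, 5))

print(allocate_divisor(54, 4), moved / HASH_SPACE)  # 2 0.5
```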
<p>but of course, we can do better!</p>
<h2>consistent hash</h2>
<p>in a consistent hash we associate ranges of the hash space to servers by hashing the servers themselves.</p>
<p>starting with 4 servers we can hash them (by name, eg 'server0') into the range 0 to 90106 (the size of the hash space, 90107, is a smallish prime) giving ...</br>
<span class="r">server0 =&gt; 67981, </span> <span class="y">server1 =&gt; 24530, </span> <span class="c">server2 =&gt; 71186, </span> <span class="b">server3 =&gt; 27735</span></p>
<p>... which can be converted into the ranges ...</br>
  <span class="y">server1 =&gt; (0, 24530), </span>  <span class="b">server3 =&gt; (24531, 27735), </span>  <span class="r">server0 =&gt; (27736, 67981), </span>
  <span class="c">server2 =&gt; (67982, 71186), </span>  <span class="y">server1 =&gt; (71187, 90106)</span></p>
<p>visually represented as ...</p>
<img src="http://matpalm.com/consistent_hash/ch_4_1slots.png">

<p>allocation of a resource to a server is now simply done by hashing the resource and seeing which range it falls into.</p>
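<p>one way to do the range lookup is a binary search over the sorted server points. this is only a sketch using python's <code>bisect</code> module with the server hash values quoted above; it isn't necessarily how the code that generated these images does it.</p>

```python
import bisect

# server hash points quoted above, sorted by position in the hash space
ring = sorted([(67981, "server0"), (24530, "server1"),
               (71186, "server2"), (27735, "server3")])
points = [point for point, _ in ring]

def lookup(resource_hash):
    # first server point at or after the hash; anything past the last
    # point wraps around to the first server on the ring
    index = bisect.bisect_left(points, resource_hash) % len(ring)
    return ring[index][1]

print(lookup(54), lookup(25000), lookup(70000), lookup(80000))
# server1 server3 server2 server1
```

<p>the modulo on the index is what gives server1 its second range at the top of the hash space.</p>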
<p>adding a 5th server is done by hashing the new server; eg <span class="p">server4 =&gt; 74391</span> and adjusting the ranges.</p>
<img src="http://matpalm.com/consistent_hash/ch_45_1slot.png">

<p>we can see how this scheme ensures that as many resources as possible retain their original server allocation.</p>
<p>however there's a pretty obvious problem; whereas the previous methods divided the hash space evenly this method is way off.</p>
<p>we'd like the ratios to be 0.25 for the 4 server case
and 0.20 for the 5 server case; but instead they are</br>
  <span class="r">server0 =&gt; 0.44, </span>
  <span class="y">server1 =&gt; 0.48, </span>
  <span class="c">server2 =&gt; 0.04, </span>
  <span class="b">server3 =&gt; 0.04</span> and </br>
  <span class="r">server0 =&gt; 0.44, </span>
  <span class="y">server1 =&gt; 0.44, </span>
  <span class="c">server2 =&gt; 0.04, </span>
  <span class="b">server3 =&gt; 0.04, </span>
  <span class="p">server4 =&gt; 0.04</span> </br></p>
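<p>these ratios can be recomputed from the server hash values given earlier. a small sketch, assuming a hash space of 90107 slots (0 to 90106) and that each server owns the range ending at its own point; the function name is hypothetical.</p>

```python
HASH_SPACE = 90107

def shares(server_points):
    # server_points maps server name to its hash point; each server owns the
    # range that ends at its point and starts just after the previous point
    ordered = sorted((point, name) for name, point in server_points.items())
    result = {}
    previous = ordered[-1][0] - HASH_SPACE  # wrap around from the last point
    for point, name in ordered:
        result[name] = (point - previous) / HASH_SPACE
        previous = point
    return result

four = {"server0": 67981, "server1": 24530, "server2": 71186, "server3": 27735}
five = dict(four, server4=74391)

for name, share in sorted(shares(five).items()):
    print(name, round(share, 3))
```

<p>comparing <code>shares(four)</code> with <code>shares(five)</code> also shows that only server1 gives anything up; the 3205 slots taken by server4 are roughly 3.6% of the hash space, compared with 80% and 50% for the earlier schemes.</p>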
<p>luckily there's a pretty simple fix; simply hash each server multiple times!</p>
<p>if we hash each server 5 times, using 5 different hash functions, we get the following allocations</p>
<img src="http://matpalm.com/consistent_hash/ch_45_5slots.png">

<p>which are this time much closer to being even; </br>
  <span class="r">server0 =&gt; 0.20, </span>
  <span class="y">server1 =&gt; 0.26, </span>
  <span class="c">server2 =&gt; 0.26, </span>
  <span class="b">server3 =&gt; 0.28 </span> and </br>
  <span class="r">server0 =&gt; 0.17, </span>
  <span class="y">server1 =&gt; 0.19, </span>
  <span class="c">server2 =&gt; 0.21, </span>
  <span class="b">server3 =&gt; 0.24, </span>
  <span class="p">server4 =&gt; 0.18</span> </br></p>
<p>and the more times we hash the closer we get to an even allocation.</br>
yay!</br>
we get the best of both worlds; an even allocation and the minimum amount of reallocation as the number of servers changes.</br></p>
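<p>a rough sketch of hashing each server multiple times. as an assumption for this example, "5 different hash functions" is approximated by a single md5 hash salted with a slot number, so the shares it prints will not match the figures above.</p>

```python
import hashlib

HASH_SPACE = 90107

def point_for(name, slot):
    # assumption: md5 of "name#slot" stands in for the slot-th hash function
    digest = hashlib.md5(f"{name}#{slot}".encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE

def build_ring(server_names, slots=5):
    # sorted list of (point, server name) pairs, with `slots` points per server
    return sorted((point_for(name, slot), name)
                  for name in server_names for slot in range(slots))

def shares(ring):
    # each point owns the range from just after the previous point up to itself
    result = {}
    previous = ring[-1][0] - HASH_SPACE  # wrap around from the last point
    for point, name in ring:
        result[name] = result.get(name, 0) + (point - previous) / HASH_SPACE
        previous = point
    return result

ring = build_ring(["server0", "server1", "server2", "server3"])
for name, share in sorted(shares(ring).items()):
    print(name, round(share, 2))
```

<p>bumping <code>slots</code> up is the knob for trading a bigger ring against a more even spread.</p>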
<p>there's one final trick that can be done with a consistent hash.</br>
turns out we don't <i>have</i> to give the same number of slots to each server</p>
<p>starting with an even allocation ...</p>
<img src="http://matpalm.com/consistent_hash/ch_5_5slots.png">

<p>we might decide to give <span class="p">server4</span> twice the number of slots that the others have ...</p>
<img src="http://matpalm.com/consistent_hash/ch_5_5slots_server4x2.png"/>

<p>this results in an uneven allocation of ...</br>
<span class="r">server0 =&gt; 0.16, </span>
<span class="y">server1 =&gt; 0.13, </span>
<span class="c">server2 =&gt; 0.17, </span>
<span class="b">server3 =&gt; 0.20, </span>
<span class="p">server4 =&gt; 0.34</span></br></p>
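<p>giving one server more slots is only a small change to how the ring is built. a minimal sketch, again assuming md5 salted with a slot number as a stand-in for multiple hash functions (so the resulting allocation will differ from the numbers above):</p>

```python
import hashlib

HASH_SPACE = 90107

def build_weighted_ring(slots_per_server):
    # slots_per_server maps server name to how many points it gets on the ring
    ring = []
    for name, slots in slots_per_server.items():
        for slot in range(slots):
            digest = hashlib.md5(f"{name}#{slot}".encode("utf-8")).hexdigest()
            ring.append((int(digest, 16) % HASH_SPACE, name))
    return sorted(ring)

slots = {"server0": 5, "server1": 5, "server2": 5, "server3": 5, "server4": 10}
ring = build_weighted_ring(slots)

print(len(ring), sum(1 for _, name in ring if name == "server4"))  # 30 10
```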
<p>why would we want to have a non even allocation?</br>
a couple of reasons i could think of are..</p>
<ol>
<li>a server with twice the grunt could handle twice the load so should get twice the slots</li>
<li>it's an interesting way to handle a/b testing; introduce a new server by slowly 'dialing' up its slots</li>
</ol>
<p>interesting stuff!</p>
<p>all the code used to generate the images for this page is available on <a href="http://github.com/matpalm/consistent_hash">github</a></p>
<p>26th september 2010</p>]]></content:encoded>
    </item>
  </channel>
</rss>

