<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>visualising the consistent hash</title>
      <link>http://matpalm.com/blog/2010/09/26/consistent_hash/</link>
      <category><![CDATA[algorithms]]></category>
      <guid>http://matpalm.com/consistent_hash/</guid>
      <description>visualising the consistent hash</description>
      <content:encoded><![CDATA[<p><style type="text/css">
    body {background-color:#000000; color:#cceedd}
    .r {color:#ff0000;}
    .y {color:#ccff00;}
    .c {color:#00ff77;}
    .b {color:#0077ff;}
    .p {color:#cc00ff;}
  </style></p>
<h2>the resource allocation problem</h2>
<p>consider the problem of allocating N resources across M servers (N &gt;&gt; M)</p>
<h2>modulo hash</h2>
<p>a common approach is the straightforward modulo hash...</p>
<p>if we have 4 servers; <pre>servers = [server0, server1, server2, server3]</pre> we can allocate a resource to a server by simply</p>
<ol>
<li>hashing the resource <pre>hash(resource) = 54</pre></li>
<li>reducing modulo 4 <pre>54 % 4 = 2</pre></li>
<li>allocating to that numbered server <pre>servers[2] = server2</pre></li>
</ol>
<p>we can visualise how this scheme maps resources to servers by allocating a colour to each server;
<span class="r">server0 </span> <span class="y">server1 </span> <span class="c">server2 </span> <span class="b">server3 </span></br>
and, assuming we are hashing to a value between 0 and 99, draw the following chart ...</p>
<p><img src="http://matpalm.com/consistent_hash/mod_4.png"></br>
... where the colour of the <i>n</i><sup>th</sup> column represents which server a resource hashing to <i>n</i> would be allocated to.</p>
<p>this hashing scheme is nice for a couple of reasons</p>
<ol>
<li>it's very simple</li>
<li>it allocates resources evenly across the servers (assuming you have a good hashing function)</li>
</ol>
<p>however it has one big drawback; what happens when you change the number of servers?</br>
say for example that due to extra load we have to add another server; <span class="p">server4</span></p>
<p>switching from modulo 4 to modulo 5 means that a resource that used to hash to server2 ...
<pre>54 % 4 = 2</pre>
now hashes to server4 ...
<pre>54 % 5 = 4</pre></p>
<p>in fact if we compare the difference in the hashing we get the following ...</p>
<img src="http://matpalm.com/consistent_hash/mod_4_45_diff_5.png">

<p>... where the top bar represents the allocation with 4 servers</br>
the bottom bar represents the allocation with 5 servers,</br>
with white areas between representing cases of a resource changing which server it was allocated to.</p>
<p>this is pretty bad in terms of reallocation; a whopping <i>80%</i> of the resources have changed which server they are assigned to.</p>
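<p>the 80% figure is easy to check by brute force. a minimal sketch, assuming the same 0 to 99 hash space as the charts; the names are just for illustration.</p>
<pre>
```python
HASH_MAX = 100  # assumed size of the hash space, as in the charts

def modulo_server(h, num_servers):
    # index of the server a hash value is allocated to
    return h % num_servers

# count the hash values whose server changes when going from 4 to 5 servers
moved = sum(1 for h in range(HASH_MAX)
            if modulo_server(h, 4) != modulo_server(h, 5))
print(moved / HASH_MAX)  # 0.8
```
</pre>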
<h2>divisor hash</h2>
<p>how about instead of modulo arithmetic we try divisor instead?</p>
<p>considering 4 servers again we allocate a resource by</p>
<ol>
<li>hashing the resource as before <pre>hash(resource) = 54</pre></li>
<li>reducing divisor 25 (25=100/4; ie hash max / number servers) <pre>54 / (100/4) = 2</pre></li>
<li>allocating to that numbered server <pre>servers[2] = server2</pre></li>
</ol>
<p>as before we can visualise how this scheme maps resources to servers by again allocating a colour to each server;
<span class="r">server0 </span> <span class="y">server1 </span> <span class="c">server2 </span> <span class="b">server3 </span></br>
and, assuming we are hashing to a value between 0 and 99, draw the following chart ...</p>
<img src="http://matpalm.com/consistent_hash/div_4.png"/>

<p>again if we get a 5th server <span class="p">server4</span> we can see how the resources are reallocated ...</p>
<img src="http://matpalm.com/consistent_hash/div_4_45diff_5.png">

<p>this time we only get 50% reallocation, instead of 80%, so that's an improvement.</br>
we also continue to spread the resources evenly across the servers which is great.</br></p>
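<p>the same brute force check works for the divisor scheme. again just a sketch; it assumes the hash space divides evenly by the number of servers, which holds for 4 and 5 servers over 0 to 99.</p>
<pre>
```python
HASH_MAX = 100  # assumed size of the hash space, as in the charts

def divisor_server(h, num_servers):
    # integer division by the width of each server's slice
    return h // (HASH_MAX // num_servers)

# count the hash values whose server changes when going from 4 to 5 servers
moved = sum(1 for h in range(HASH_MAX)
            if divisor_server(h, 4) != divisor_server(h, 5))
print(moved / HASH_MAX)  # 0.5
```
</pre>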
<p>but of course, we can do better!</p>
<h2>consistent hash</h2>
<p>in a consistent hash we associate ranges of the hash space to servers by hashing the servers themselves.</p>
<p>starting with 4 servers we can hash them (by name, eg 'server0') into the range 0 to 90107 (a smallish prime) giving ...</br>
<span class="r">server0 =&gt; 67981, </span> <span class="y">server1 =&gt; 24530, </span> <span class="c">server2 =&gt; 71186, </span> <span class="b">server3 =&gt; 27735</span></p>
<p>... which can be converted into the ranges ...</br>
  <span class="y">server1 =&gt; (0, 24530), </span>  <span class="b">server3 =&gt; (24531, 27735), </span>  <span class="r">server0 =&gt; (27736, 67981), </span>
  <span class="c">server2 =&gt; (67982, 71186), </span>  <span class="y">server1 =&gt; (71187, 90106)</span></p>
<p>visually represented as ...</p>
<img src="http://matpalm.com/consistent_hash/ch_4_1slots.png">

<p>allocation of a resource to a server is now simply done by hashing the resource and seeing which range it falls into.</p>
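<p>one way to do that lookup is a binary search over the sorted server hashes. this is a sketch only; it uses the server hash values quoted above, and build_ring and lookup are illustrative names, not taken from the code on github.</p>
<pre>
```python
import bisect

positions = {"server0": 67981, "server1": 24530,
             "server2": 71186, "server3": 27735}

def build_ring(positions):
    # sorted list of (hash, server name) pairs
    return sorted((pos, name) for name, pos in positions.items())

def lookup(ring, h):
    # find the first server hash at or after h;
    # anything past the last server hash wraps around to the first one
    points = [pos for pos, _ in ring]
    i = bisect.bisect_left(points, h)
    return ring[i % len(ring)][1]

ring4 = build_ring(positions)
ring5 = build_ring(dict(positions, server4=74391))
print(lookup(ring4, 72000), lookup(ring5, 72000))  # server1 server4
```
</pre>
<p>a resource hashing to 72000 sits in the wrap-around range of server1 until server4 turns up at 74391 and takes it over.</p>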
<p>adding a 5th server is done by hashing the new server; eg <span class="p">server4 =&gt; 74391</span> and adjusting the ranges.</p>
<img src="http://matpalm.com/consistent_hash/ch_45_1slot.png">

<p>we can see how this scheme ensures that as many resources as possible retain their original server allocation.</p>
<p>however there's a pretty obvious problem; whereas the previous methods divided the hash space evenly this method is way off.</p>
<p>we'd like the ratios to be 0.25 for the 4 server case
and 0.20 for the 5 server case; but instead they are</br>
  <span class="r">server0 =&gt; 0.44, </span>
  <span class="y">server1 =&gt; 0.48, </span>
  <span class="c">server2 =&gt; 0.04, </span>
  <span class="b">server3 =&gt; 0.04</span> and </br>
  <span class="r">server0 =&gt; 0.44, </span>
  <span class="y">server1 =&gt; 0.44, </span>
  <span class="c">server2 =&gt; 0.04, </span>
  <span class="b">server3 =&gt; 0.04, </span>
  <span class="p">server4 =&gt; 0.04</span> </br></p>
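<p>these ratios can be checked by measuring how much of the ring each server owns. a rough sketch using the server hash values quoted earlier; the shares function is made up for illustration and the results should land within rounding of the figures above.</p>
<pre>
```python
RING_SIZE = 90107  # hash values run from 0 to 90106

def shares(positions):
    # fraction of the ring owned by each server
    ring = sorted((pos, name) for name, pos in positions.items())
    totals = {name: 0 for name in positions}
    # start from the last server hash, seen from one lap back,
    # so the first server also picks up the wrap-around range
    prev = ring[-1][0] - RING_SIZE
    for pos, name in ring:
        totals[name] += pos - prev
        prev = pos
    return {name: total / RING_SIZE for name, total in totals.items()}

four = {"server0": 67981, "server1": 24530,
        "server2": 71186, "server3": 27735}
five = dict(four, server4=74391)
print(shares(four))
print(shares(five))
```
</pre>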
<p>luckily there's a pretty simple fix; simply hash each server multiple times!</p>
<p>if we hash each server 5 times, using 5 different hash functions, we get the following allocations</p>
<img src="http://matpalm.com/consistent_hash/ch_45_5slots.png">

<p>which are this time much closer to being even; </br>
  <span class="r">server0 =&gt; 0.20, </span>
  <span class="y">server1 =&gt; 0.26, </span>
  <span class="c">server2 =&gt; 0.26, </span>
  <span class="b">server3 =&gt; 0.28 </span> and </br>
  <span class="r">server0 =&gt; 0.17, </span>
  <span class="y">server1 =&gt; 0.19, </span>
  <span class="c">server2 =&gt; 0.21, </span>
  <span class="b">server3 =&gt; 0.24, </span>
  <span class="p">server4 =&gt; 0.18</span> </br></p>
<p>and the more times we hash the closer we get to an even allocation.</br>
yay!</br>
we get the best of both worlds; an even allocation and the minimum amount of reallocation as the number of servers changes.</br></p>
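<p>a sketch of the multiple slots idea. as a stand-in for 5 different hash functions it assumes one md5 hash salted with the slot number, so the slot positions won't match the charts. it does show the key property though; every resource that moves when server4 is added moves <i>to</i> server4.</p>
<pre>
```python
import bisect
import hashlib

RING_SIZE = 90107

def point(label):
    # assumption: md5 of a salted label stands in for a family of hash functions
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE

def build_ring(servers, slots=5):
    # each server appears on the ring once per slot
    return sorted((point(f"{name}#{i}"), name)
                  for name in servers for i in range(slots))

def lookup(ring, h):
    points = [pos for pos, _ in ring]
    return ring[bisect.bisect_left(points, h) % len(ring)][1]

four = build_ring(["server0", "server1", "server2", "server3"])
five = build_ring(["server0", "server1", "server2", "server3", "server4"])

# every hash value whose server changed between the two rings
moved = [h for h in range(RING_SIZE) if lookup(four, h) != lookup(five, h)]
print(len(moved) / RING_SIZE)
print(all(lookup(five, h) == "server4" for h in moved))
```
</pre>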
<p>there's one final trick that can be done with a consistent hash.</br>
turns out we don't <i>have</i> to give the same number of slots to each server</p>
<p>starting with an even allocation ...</p>
<img src="http://matpalm.com/consistent_hash/ch_5_5slots.png">

<p>we might decide to give <span class="p">server4</span> twice the number of slots that the others have ...</p>
<img src="http://matpalm.com/consistent_hash/ch_5_5slots_server4x2.png"/>

<p>this results in an uneven allocation of ...</br>
<span class="r">server0 =&gt; 0.16, </span>
<span class="y">server1 =&gt; 0.13, </span>
<span class="c">server2 =&gt; 0.17, </span>
<span class="b">server3 =&gt; 0.20, </span>
<span class="p">server4 =&gt; 0.34</span></br></p>
<p>why would we want to have a non even allocation?</br>
a couple of reasons i could think of are..</p>
<ol>
<li>a server with twice the grunt could handle twice the load so should get twice the slots</li>
<li>it's an interesting way to handle a/b testing; introduce a new server by slowly 'dialing' up its slots</li>
</ol>
<p>interesting stuff!</p>
<p>all the code used to generate the images for this page is available on <a href="http://github.com/matpalm/consistent_hash">github</a></p>
<p>26th september 2010</p>]]></content:encoded>
    </item>
    <item>
      <title>how many terms in a trend?</title>
      <link>http://matpalm.com/blog/2010/05/11/how-many-terms-in-a-trend/</link>
      <category><![CDATA[e15]]></category>
      <category><![CDATA[trending]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[puzzled]]></category>
      <guid>http://matpalm.com/blog/?p=627</guid>
      <description>how many terms in a trend?</description>
      <content:encoded><![CDATA[<p>i've been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i'm not sure how to solve. the question revolves around discovering multi-term trends.</p>
<p>a sensible place to start when looking for trends is to consider single terms but what if we ended up with three equally trending terms 'happy', 'new' and 'year'? it's pretty obvious that the actual trend is 'happy new year' but what is the best way to express this as a single trend in an algorithmic sense?</p>
<p>one approach i've been playing with is to collect unigrams, bigrams and trigrams (1,2,3 term 'phrases') and consider the cases where the terms overlap. basically if 'happy new year' is trending then, in some sense, we can ignore trends for 'happy new', 'new year', 'happy', 'new' and 'year'. but does this result in too many false positives? would we miss 'happy' as a trend if lots of people were chirpy about the change of year (as they usually are, on new years eve)?</p>
<p>rather than outright ignore we could somehow reduce the weighting by removing the double counting.</p>
<p>eg if we had 3 trends;  (free beer,11), (free,12) &amp; (beer,25)
we can take 11 (from the 2gram) off both 1grams to give  (free beer,11), (free,1) &amp; (beer,14)
showing that 'beer', outside of the phrase 'free beer', is perhaps a trend in itself (as it should be)</p>
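<p>for the simple bigram case the adjustment could look something like this. a sketch only; it just handles 2grams against 1grams and remove_double_counting is a made up name.</p>
<pre>
```python
def remove_double_counting(counts):
    # take each 2gram's count off the 1grams it is made of
    adjusted = dict(counts)
    for phrase, count in counts.items():
        terms = phrase.split()
        if len(terms) == 2:
            for term in terms:
                if term in adjusted:
                    adjusted[term] -= count
    return adjusted

trends = {"free beer": 11, "free": 12, "beer": 25}
print(remove_double_counting(trends))
# {'free beer': 11, 'free': 1, 'beer': 14}
```
</pre>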
<p>this feels like it might work but would be non trivial (read: fun) to implement</p>
<p>another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the 'trends' found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.</p>
<p>for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg 'RT @bob omg i just found a peanut' or 'omg i just found a peanut; via @bob lucky him!!'</p>
<p>if we're considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as 'just found', 'a peanut', 'omg i', 'found a', 'i just' and in these cases it'd be great to be able to just stitch them together into the common retweeted element 'omg i just found a peanut'. </p>
<p>we could 'solve' this problem by not just considering 1,2 and 3 grams but considering <em>all</em> possible n-grams for each tweet and employing the technique we spoke of above of reducing the counts. it'd almost be feasible, since tweets are never that long, but feels uber clumsy and i'd hate to see the complexity of that algorithm ;)</p>
<p>this seems more like a stitching problem of some kind;  eg if we have 4 grams 'omg i just found', 'i just found a', 'just found a peanut' perhaps we can identify the non trivial overlap and stitch them together (?)</p>
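<p>a greedy version of that stitching might look like the following. it's a sketch under some big assumptions; all the n-grams are the same length (at least 2), they really do come from one phrase, and an overlap of n-1 words is enough to call it a match.</p>
<pre>
```python
def stitch(ngrams):
    """greedily chain n-grams that overlap by n-1 words"""
    remaining = [g.split() for g in ngrams]
    chain = remaining.pop(0)
    extended = True
    while extended:
        extended = False
        for g in remaining:
            k = len(g) - 1  # size of the overlap we insist on
            if chain[-k:] == g[:k]:
                chain = chain + g[k:]    # g extends the chain on the right
            elif g[-k:] == chain[:k]:
                chain = g[:-k] + chain   # g extends the chain on the left
            else:
                continue
            remaining.remove(g)
            extended = True
            break
    return " ".join(chain)

print(stitch(["i just found a", "just found a peanut", "omg i just found"]))
# omg i just found a peanut
```
</pre>
<p>with bigrams the overlap is a single word, which is exactly the trivial kind of overlap that would produce false matches on real data.</p>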
<p>not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(</p>]]></content:encoded>
    </item>
    <item>
      <title>e10.4 communities in social graphs</title>
      <link>http://matpalm.com/blog/2009/10/06/e10-4-communities-in-social-graphs/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[social network]]></category>
      <category><![CDATA[betweenness]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=83</guid>
      <description>e10.4 communities in social graphs</description>
      <content:encoded><![CDATA[<p>social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.</p>
<p>these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure <a href="http://en.wikipedia.org/wiki/Betweenness#Betweenness_centrality">betweenness</a>.</p>
<p>the <a href="http://en.wikipedia.org/wiki/Girvan-Newman_algorithm">girvan-newman algorithm</a> can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.</p>
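<p>to make this concrete, here is a rough python sketch of one round of that decomposition on a tiny undirected graph. it assumes an unweighted graph with no self loops, stored as a dict of node to set of neighbours. connected components are found with a plain breadth first search rather than anything cleverer, and all the function names are made up.</p>
<pre>
```python
from collections import deque

def edge_betweenness(adj):
    """edge betweenness via a breadth first search from every node.
    each pair of nodes is counted from both ends, so scores are doubled,
    which doesn't change which edge ranks highest."""
    score = {}
    for s in adj:
        dist, paths = {s: 0}, {s: 1}
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    paths[w] = 0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]  # another shortest path into w
                    preds[w].append(v)
        # walk back from the furthest nodes sharing credit along shortest paths
        credit = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                credit[v] += share
    return score

def components(adj):
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        result.append(comp)
    return result

def split_once(adj):
    """remove the highest betweenness edge(s) until the graph falls apart"""
    adj = {v: set(ws) for v, ws in adj.items()}
    while len(components(adj)) == 1:
        score = edge_betweenness(adj)
        if not score:
            break  # no edges left to remove
        top = max(score.values())
        for edge, value in score.items():
            if value == top:
                v, w = tuple(edge)
                adj[v].discard(w)
                adj[w].discard(v)
    return components(adj)

# two triangles joined by the single edge c-d
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
         "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
print(sorted(sorted(c) for c in split_once(graph)))
# [['a', 'b', 'c'], ['d', 'e', 'f']]
```
</pre>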
<p>the algorithm is basically</p>
<ol>
<li>calculate the betweenness of each edge (using an all shortest paths algorithm)</li>
<li>remove the edge(s) with the highest betweenness</li>
<li>check for connected components (using <a href="http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm">tarjan's</a> algorithm)</li>
<li>repeat for graph or subgraphs if graph was split</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>e10.3 twitter crawl progress</title>
      <link>http://matpalm.com/blog/2009/09/29/e10-3-twitter-crawl-progress/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[hadoop]]></category>
      <guid>http://matpalm.com/blog/?p=70</guid>
      <description>e10.3 twitter crawl progress</description>
      <content:encoded><![CDATA[<p>since the twitter api is rate limited it's quite slow to crawl twitter and after most of a week i've still only managed to get info on 8,000 users. i probably should subscribe to get a 20,000 an hr limit instead of the 150 i'm on now. i'll just let it chug along in the background of my pvr.</p>
<p>while the crawl has been going on i've been trying some things on the data to decide what to do with it.</p>
<p>i've managed to write a version of pagerank using <a href="http://hadoop.apache.org/pig/">pig</a> which has been very interesting. (for those who haven't seen it before pig is a query language that sits on top of hadoop's mapreduce). my initial feel for pig is that it's pretty awesome. it was <em>much</em> quicker to write this script than to write the <a href="http://matpalm.com/sip/">statistically improbable phrases</a>. in fact i'm reinspired to have another crack at the sip stuff using pig. my final result wasn't great for the performance of hadoop and after some <a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/200909.mbox/%3C93d501de0909141814vaa8c9c0wc5a47ee05baae7de@mail.gmail.com%3E">great feedback on the hadoop mailing list</a> i've got a number of other things to try including writing my joins in pig.</p>
<p>anyways, here's my pagerank in pig</p>
<!--more-->

<p>done once</p>
<table class="pygments_murphytable"><tr><td class="code"><div class="pygments_murphy"><pre>edges = load &#39;edges&#39; as (from:chararray, to:chararray);
nodes = group edges by from;
node_contribs = foreach nodes generate group, 1.0 / (double)SIZE(edges) as contrib;
store node_contribs into &#39;node_contribs&#39;;
zero_contribs = foreach nodes generate group, (double)0 as contrib;
store zero_contribs into &#39;zero_contribs&#39;;
</pre></div>
</td></tr></table>

<p>done until convergence</p>
<table class="pygments_murphytable"><tr><td class="code"><div class="pygments_murphy"><pre>page_rank = load &#39;$input&#39; as (node:chararray, rank:float);
node_contribs = load &#39;node_contribs&#39; as (node:chararray, contrib:double);
nodes_page_rank = join node_contribs by node, page_rank by node;
contribs = foreach nodes_page_rank {
  generate node_contribs::node, (double)node_contribs::contrib*(double)page_rank::rank as contrib;
}
edges = load &#39;edges&#39; as (from:chararray, to:chararray);
joined_divy_groups = join edges by from, contribs by node_contribs::node;
page_rank_contributions = foreach joined_divy_groups generate edges::to, contribs::contrib;
zero_contribs = load &#39;zero_contribs&#39; as (node:chararray, contrib:double);
page_rank_contributions_with_zero = union page_rank_contributions, zero_contribs;
group_page_ranks = group page_rank_contributions_with_zero by edges::to;
next_page_rank = foreach group_page_ranks {
  generate group, 0.15+(0.85*SUM(page_rank_contributions_with_zero.contribs::contrib));
}
store next_page_rank into &#39;$output&#39;;
</pre></div>
</td></tr></table>
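<p>for anyone who doesn't read pig, one iteration of the second script does roughly the following. this is a python sketch of the idea, not a line by line translation that's been checked against pig; page_rank_step is a made up name and the 0.15 and 0.85 constants are the ones from the script.</p>
<pre>
```python
def page_rank_step(edges, rank):
    """one iteration; rank is a dict of node to current rank"""
    # group the edges by source node, like 'group edges by from'
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, []).append(dst)
    # start every source node at zero so it still gets a rank
    # even when nothing links to it (the job of zero_contribs)
    incoming = {src: 0.0 for src in targets}
    for src, dsts in targets.items():
        # each node splits its rank evenly over its outgoing edges
        contrib = rank.get(src, 0.0) / len(dsts)
        for dst in dsts:
            incoming[dst] = incoming.get(dst, 0.0) + contrib
    return {node: 0.15 + 0.85 * total for node, total in incoming.items()}

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
rank = {"a": 1.0, "b": 1.0, "c": 1.0}
for _ in range(20):  # 'until convergence' replaced by a fixed count here
    rank = page_rank_step(edges, rank)
print(rank)
```
</pre>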

<p>as for all my projects code is on <a href="http://github.com/matpalm/tgraph">github</a></p>]]></content:encoded>
    </item>
    <item>
      <title>e10.1 crawling twitter</title>
      <link>http://matpalm.com/blog/2009/09/19/e10-1-crawling-twitter/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=55</guid>
      <description>e10.1 crawling twitter</description>
      <content:encoded><![CDATA[<p>our first goal is to get some data and <a href="http://apiwiki.twitter.com/">the twitter api</a> makes getting the data trivial. i'm focused mainly on the <a href="http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-friends%C2%A0ids">friends</a> stuff but because it only gives user ids i'll also get the <a href="http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-users%C2%A0show">user info</a> so i can put names to ids.</p>
<p>a <a href="http://en.wikipedia.org/wiki/Depth-first_search">depth first crawl</a> makes no sense for this one experiment, we're unlikely to get the entire graph and are more interested in following edges "close" to me. instead we'll use a <a href="http://en.wikipedia.org/wiki/Breadth-first_search">breadth first search</a>.</p>
<p>since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth first we'll introduce a cost component to elements on the frontier to help decide what to grab next. this is especially important for a graph like twitter where the outdegree of a node is often in the hundreds. it turns the crawl into something that is not strictly breadth first but it works out.</p>
<p>to explain the cost component consider the expected connectivity of nodes in the twitter friend graph. most nodes have an outdegree of the order 20-200. occasionally you see much larger (in the 1000's) or much smaller (under 10).</p>
<p>we might, naively perhaps, say that having a large outdegree means the person is a bit less strict with her following criteria and that some of them are not really that important to her. if this is the case we should focus a little more  on getting nodes with smaller outdegree.</p>
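<p>a cost ordered frontier like this is easy to sketch with a priority queue. in the sketch below each hop adds 1 + log10(1 + outdegree of the node being expanded) to the cost, fetch_friends is a hypothetical stand-in for the rate limited api call, and the graph is a toy one.</p>
<pre>
```python
import heapq
import math

def step_cost(outdegree):
    # small penalty for expanding from a node that follows lots of people
    return 1 + math.log10(1 + outdegree)

def crawl(start, fetch_friends, max_fetches):
    """fetch the cheapest node on the frontier first"""
    frontier = [(0.0, start)]
    fetched, done = [], set()
    while frontier and len(fetched) != max_fetches:
        cost, node = heapq.heappop(frontier)
        if node in done:
            continue  # already fetched via a cheaper path
        done.add(node)
        fetched.append(node)
        friends = fetch_friends(node)
        friend_cost = cost + step_cost(len(friends))
        for friend in friends:
            if friend not in done:
                heapq.heappush(frontier, (friend_cost, friend))
    return fetched

# toy graph; 'hub' follows 200 people, 'picky' follows just one
graph = {"me": ["hub", "picky"],
         "hub": [f"h{i}" for i in range(200)],
         "picky": ["p1"]}
print(crawl("me", lambda node: graph.get(node, []), 4))
# ['me', 'hub', 'picky', 'p1']
```
</pre>
<p>a plain breadth first crawl would have grabbed one of hub's 200 friends as the fourth fetch; here picky's single friend is cheaper so it comes first.</p>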
<p>the formula i've come up with is to not consider the depth but instead add 1 + Log10(1+the outdegree of the previous node). in this way we penalise large outdegrees, but not by a huge amount. we always add 1 to counter the cases where there are no edges leaving a node.</p>]]></content:encoded>
    </item>
    <item>
      <title>e10.0 introducing tgraph</title>
      <link>http://matpalm.com/blog/2009/09/19/e10-0-introducing-tgraph/</link>
      <category><![CDATA[big data]]></category>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <category><![CDATA[algorithms]]></category>
      <guid>http://matpalm.com/blog/?p=47</guid>
      <description>e10.0 introducing tgraph</description>
      <content:encoded><![CDATA[<p>so <a href="http://matpalm.com/sip/">e9 sip</a> is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, <a title="pagerank" href="http://en.wikipedia.org/wiki/PageRank">pagerank</a>. a well understood algorithm like page rank will be a  great chance to try <a href="http://hadoop.apache.org/pig/">pig</a>, the query language that sits on top of hadoop mapreduce.</p>
<p>so we need a graph to work on. my first thoughts were using one of the <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596">wikipedia linkage dumps</a> but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.</p>
<p>this will also be a chance to try to document a project via a blog. <a href="http://www.skorks.com/">skorks</a>' incessant blog rambling has convinced me to give it a go.</p>]]></content:encoded>
    </item>
    <item>
      <title>bin packing</title>
      <link>http://matpalm.com/blog/2008/12/14/bin-packing/</link>
      <category><![CDATA[algorithms]]></category>
      <guid>http://matpalm.com/blog/?p=18</guid>
      <description>bin packing</description>
      <content:encoded><![CDATA[<p>how to decide what next to backup onto a dvd?</p>
<p>when is brute force good enough? will a random walk get a good enough result faster?</p>
<p><a title="burnit" href="http://www.matpalm.com/burn.it">matpalm.com/burn.it</a></p>]]></content:encoded>
    </item>
    <item>
      <title>the median of a trillion numbers</title>
      <link>http://matpalm.com/blog/2008/11/15/the-median-of-a-trillion-numbers/</link>
      <category><![CDATA[erlang]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[ec2]]></category>
      <guid>http://matpalm.com/blog/?p=16</guid>
      <description>the median of a trillion numbers</description>
      <content:encoded><![CDATA[<p>i got asked in an interview once “how would you find the median of a trillion numbers across a thousand machines?”</p>
<p>the question has haunted me, until now.</p>
<p>here’s my ruby and erlang implementation with a bit of running amazon ec2 thrown in for good measure….. <a href="http://www.matpalm.com/median/">matpalm.com/median/</a></p>
<p>grab the code from <a href="http://github.com/matpalm/median">github</a></p>]]></content:encoded>
    </item>
    <item>
      <title>fastmap and the jaccard distance</title>
      <link>http://matpalm.com/blog/2008/10/31/fastmap-and-the-jaccard-distance/</link>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[deduplication]]></category>
      <category><![CDATA[c++]]></category>
      <guid>http://matpalm.com/blog/?p=14</guid>
      <description>fastmap and the jaccard distance</description>
      <content:encoded><![CDATA[<p>given a set of pairwise distances how do you determine what points correspond to those distances?</p>
<p><a href="http://www.matpalm.com/resemblance/jaccard_distance/">my latest experiment</a> considers this problem in relation to jaccard distances, a resemblance measure similar to jaccard coefficients used in <a href="http://www.matpalm.com/resemblance/jaccard_coeff/">a previous experiment</a></p>
<p>by using the <a href="http://www.kyriakides.net/CBCL/references/Faloutsos/p163-faloutsos.pdf">fastmap</a> algorithm we get points from distances and once you have points you have visualisation!</p>]]></content:encoded>
    </item>
    <item>
      <title>shingling and the jaccard index</title>
      <link>http://matpalm.com/blog/2008/10/06/shingling-and-the-jaccard-index/</link>
      <category><![CDATA[ruby]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[deduplication]]></category>
      <category><![CDATA[c++]]></category>
      <guid>http://matpalm.com/blog/?p=10</guid>
      <description>shingling and the jaccard index</description>
      <content:encoded><![CDATA[<p>on the weekend i did another experiment using shingling and the jaccard index to try to determine if two sets of data were “duplicates”</p>
<p>it works quite well and includes a ruby and c++ version with low level bit operations.</p>
<p>project page is <a href="http://www.matpalm.com/resemblance/">www.matpalm.com/resemblance</a></p>
<p>code at <a href="http://github.com/matpalm/resemblance">github.com/matpalm/resemblance</a></p>]]></content:encoded>
    </item>
  </channel>
</rss>

