<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>trending topics in tweets about cheese; part2</title>
      <link>http://matpalm.com/blog/2010/05/01/trending-topics-in-tweets-about-cheese-part2/</link>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[trending]]></category>
      <category><![CDATA[e15]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=559</guid>
      <description>trending topics in tweets about cheese; part2</description>
      <content:encoded><![CDATA[<p>prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.</p>
<p>the main approach will be</p>
<ol>
<li>maintain a relation with one record per token</li>
<li>fold 1 hour's worth of new data at a time into the model</li>
<li>check the entries for the latest hour for any trends</li>
</ol>
<p>the <a href="http://github.com/matpalm/trending/blob/master/pig/trending.pig">full version is on github</a>. read on for a line by line walkthrough!</p>
<p>the ruby impl used the simplest approach possible for calculating mean and standard deviation; just keep a record of 
all the values seen so far and recalculate for each new value.</p>
<p>for our pig version we'll take a fixed space approach. rather than keep <em>all</em> the values for
each time series it turns out we can get away with storing just 3...</p>
<ol>
<li>num_occurences: the number of values</li>
<li>mean: the current mean of all values</li>
<li>mean_sqrs: the current mean of the squares of all values</li>
</ol>
<p>the idea is that the mean<sub>n+1</sub> = ( n * mean<sub>n</sub> + new value ) / ( n + 1 )</p>
<p>and that the standard deviation<sub>n+1</sub> can be calculated from n, the mean<sub>n</sub> and the mean of the squares<sub>n</sub> as we'll see below</p>
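<p>a minimal sketch of that fixed space bookkeeping, in python rather than pig. the names <code>fold</code> and <code>std_dev</code> are made up for illustration; the last line just checks that the three stored numbers give the same answer as keeping every value around. note this is the population standard deviation (divide by n), which is what the pig script further down computes.</p>

```python
import math
import statistics

def fold(n, mean, mean_sqrs, value):
    """fold one new value into (n, mean, mean_sqrs) using constant space."""
    new_n = n + 1
    new_mean = (n * mean + value) / new_n
    new_mean_sqrs = (n * mean_sqrs + value * value) / new_n
    return new_n, new_mean, new_mean_sqrs

def std_dev(mean, mean_sqrs):
    """population standard deviation: sqrt(E[x^2] - E[x]^2)."""
    # max() guards against a tiny negative caused by float rounding
    return math.sqrt(max(mean_sqrs - mean * mean, 0.0))

# frequencies for token 'a' from the example
values = [1, 2, 1, 2, 1, 1]
state = (0, 0.0, 0.0)
for v in values:
    state = fold(*state, v)

n, mean, mean_sqrs = state
print(n, round(mean, 6), round(mean_sqrs, 6))
# running version vs. the keep-everything version
print(round(std_dev(mean, mean_sqrs), 6), round(statistics.pstdev(values), 6))
```

<p>the trade off is that only three numbers per token ever need to be stored, no matter how many hours have been folded in.</p>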
<p>let's say we've already run it 6 times and we're now folding in the 7th chunk of per-hour data</p>
<p>the data up to now shows the following<br/>
token 'a' has been seen in all 6 chunks with frequencies [1,2,1,2,1,1]; μ=1.33 ρ=0.51<br/>
token 'b' has been seen in 3 chunks with frequencies [1,2,2]; μ=1.66 ρ=0.57<br/>
token 'c' has been seen in 3 chunks with frequencies [3,4,2]; μ=3 ρ=1</p>
<p>the first thing is to load the existing version of the model, in this case stored in the file 'data/model/006'.
it contains everything we need to check each token for trending</p>
<div class="pygments_murphy"><pre>&gt; model = load &#39;data/model/006&#39; as (token:chararray, num_occurences:int, mean:float, mean_sqrs:float);
&gt; describe model;
model: {token: chararray, 
        num_occurences: int, 
        mean: float, 
        mean_sqrs: float}
&gt; dump model;
(a,6,1.333333F,2.0F)
(b,3,1.666666F,3.0F)
(c,3,3.0F,9.666666F)
</pre></div>

<p>this tells us we've seen the token 'a' in 6 previous chunks, that on average we saw it 1.3 times per chunk, and that the mean_sqrs (for the standard deviation
calculation) is 2</p>
<p>( as a reminder of how we're using these values to calculate a trending score see <a href="/blog/2010/04/27/trending-topics-in-tweets-about-cheese-part1/">part 1</a> )</p>
<p>next we load the new hour's worth of data, in this case contained in 'data/chunks/006'</p>
<div class="pygments_murphy"><pre>&gt; next_chunk = load &#39;data/chunks/006&#39;;
&gt; dump next_chunk;
(a b a a)
(d b d a d)
</pre></div>

<p>from the text we want to get the frequency of the tokens and we do this using <a href="https://github.com/matpalm/trending/blob/master/pig/tokenizer.py">tokenizer.py</a> which utilises the uber awesome <a href="http://www.nltk.org/">NLTK</a></p>
<div class="pygments_murphy"><pre>&gt; define tokenizer `python tokenizer.py` cache(&#39;data/tokenizer.py#tokenizer.py&#39;);
&gt; tokens = stream next_chunk through tokenizer as (token:chararray);
&gt; describe tokens;   
tokens: {token: chararray}
&gt; dump tokens;
(a)
(b)
(a)
(a)
(d)
(b)
(d)
(a)
(d)
</pre></div>

<p>calculating the frequencies of the tokens is a simple two step process of first grouping by the key...</p>
<div class="pygments_murphy"><pre>&gt; tokens_grouped = group tokens by token PARALLEL 1;
&gt; describe tokens_grouped;
tokens_grouped: {group: chararray,
                 tokens: {token: chararray}}
&gt; dump tokens_grouped;
(a,{(a),(a),(a),(a)})
(b,{(b),(b)})
(d,{(d),(d),(d)})
</pre></div>

<p>...and then generating the key, frequency pairs</p>
<div class="pygments_murphy"><pre>&gt; chunk = foreach tokens_grouped generate group as token, SIZE(tokens) as freq;
&gt; dump chunk;
(a,4L)
(b,2L)
(d,3L)
</pre></div>
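<p>for comparison, the same group-then-count can be sketched in a few lines of python. this is only an illustration; plain whitespace splitting is assumed here as a stand in for what tokenizer.py and NLTK actually do.</p>

```python
from collections import Counter

# the two lines of the example chunk
next_chunk = ["a b a a", "d b d a d"]

# assumption: whitespace splitting stands in for tokenizer.py / NLTK
tokens = [token for line in next_chunk for token in line.split()]

# group-by-token followed by SIZE() collapses into a single Counter
chunk = Counter(tokens)

for token, freq in sorted(chunk.items()):
    print(token, freq)
```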

<p>next we join the model with this latest chunk</p>
<div class="pygments_murphy"><pre>&gt; cogrouped = cogroup model by token, chunk by token;
&gt; describe cogrouped;
cogrouped: {group: chararray,
            model: {token: chararray,
                    num_occurences: int,
                    mean: float,
                    mean_sqrs: float},
            chunk: {token: chararray,
                    freq: long}}
&gt; dump cogrouped;
(a,{(a,6,1.333333F,2.0F)},{(a,4L)})
(b,{(b,3,1.666666F,3.0F)},{(b,2L)})
(c,{(c,3,3.0F,9.666666F)},{})
(d,{},{(d,3L)})
</pre></div>

<p>and doing this allows us to break the data into three distinct relations...</p>
<ol>
<li>entries where token was just in the model; these continue to the next iteration untouched as there is nothing to update</li>
<li>entries where token was just in the chunk; these are being seen for the first time and contribute new model entries</li>
<li>entries where token was in both; these need a trending check and will require the chunk being folded into the model </li>
</ol>
<div class="pygments_murphy"><pre>&gt; split cogrouped into
        just_model_grped if IsEmpty(chunk),
        just_chunk_grped if IsEmpty(model),
        in_both_grped    if not IsEmpty(chunk) and not IsEmpty(model);
&gt; dump just_model_grped;
(c,{(c,3,3.0F,9.666666F)},{})
&gt; dump just_chunk_grped;
(d,{},{(d,3L)})
&gt; dump in_both_grped;
(a,{(a,6,1.333333F,2.0F)},{(a,4L)})
(b,{(b,3,1.666666F,3.0F)},{(b,2L)})
</pre></div>
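<p>the three way split is really just set arithmetic on the token keys. a small python sketch of the same idea, using dicts as a hypothetical stand in for the two pig relations:</p>

```python
# model rows: token -> (num_occurences, mean, mean_sqrs)
model = {
    "a": (6, 1.333333, 2.0),
    "b": (3, 1.666666, 3.0),
    "c": (3, 3.0, 9.666666),
}
# chunk rows: token -> freq
chunk = {"a": 4, "b": 2, "d": 3}

# set operations on the keys play the role of cogroup + split
just_model = {t: model[t] for t in model.keys() - chunk.keys()}
just_chunk = {t: chunk[t] for t in chunk.keys() - model.keys()}
in_both = {t: (model[t], chunk[t]) for t in model.keys() & chunk.keys()}

print(sorted(just_model), sorted(just_chunk), sorted(in_both))
```

<p>every token lands in exactly one of the three dicts, which is the property the pig split relies on.</p>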

<p>each of these can be processed in turn.</p>
<p>firstly entries where the token was only in the model (ie not in the chunk) pass to the next generation of the model untouched</p>
<div class="pygments_murphy"><pre>&gt; model_n1__just_model = foreach just_model_grped generate flatten(model);
&gt; dump model_n1__just_model;
(c,3,3.0F,9.666666F)
</pre></div>

<p>secondly entries where the token was only in the chunk (ie not in the model) contribute new model entries for the next generation</p>
<div class="pygments_murphy"><pre>&gt; just_chunk_entries = foreach just_chunk_grped generate flatten(chunk);
&gt; model_n1__just_chunk = foreach just_chunk_entries generate token, 1, freq, freq*freq;
&gt; dump model_n1__just_chunk;
(d,1,3L,9L)
</pre></div>

<p>finally, and most interestingly, when the token was in both the model and the chunk we need to...</p>
<p>flatten the data out a bit</p>
<div class="pygments_murphy"><pre>&gt; describe in_both_grped;
in_both_grped: {group: chararray,
                model: {token: chararray,
                        num_occurences: int,
                        mean: float,
                        mean_sqrs: float},
                chunk: {token: chararray,
                        freq: long}}
&gt; in_both_flat = foreach in_both_grped generate flatten(model), flatten(chunk);
&gt; describe in_both_flat;
in_both_flat: {model::token: chararray,
               model::num_occurences: int,
               model::mean: float,
               model::mean_sqrs: float,
               chunk::token: chararray,
               chunk::freq: long}
&gt; dump in_both_flat;
(a,6,1.333333F,2.0F,a,4L)
(b,3,1.666666F,3.0F,b,2L)
</pre></div>

<p>do a trending check (note the comparison of freq of iter:n is done against mean/sd of iter:n-1)</p>
<div class="pygments_murphy"><pre>&gt; trending = foreach in_both_flat {
               sd_lhs = num_occurences * mean_sqrs;
               sd_rhs = num_occurences * (mean*mean);
               sd = sqrt( (sd_lhs-sd_rhs) / num_occurences ); 
               fraction_of_sd_over_mean = ( sd==0 ? 0 : (freq-mean)/sd );
               generate model::token as token, fraction_of_sd_over_mean as trending_score; 
 }
&gt; describe trending;
trending: {token: chararray,
           trending_score: double}
&gt; dump trending;
(a,5.656845419750436)
(b,0.7071049267408686)
</pre></div>
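<p>to sanity check the arithmetic, here is the same calculation as a python sketch. the function name <code>trending_score</code> is invented for this example, and the means are entered as exact fractions rather than pig's single precision floats, so the last few digits differ slightly from the dump above.</p>

```python
import math

def trending_score(num_occurences, mean, mean_sqrs, freq):
    """how many standard deviations freq sits above the previous mean."""
    sd_lhs = num_occurences * mean_sqrs
    sd_rhs = num_occurences * (mean * mean)
    # max() guards against a slightly negative variance from float rounding
    sd = math.sqrt(max(sd_lhs - sd_rhs, 0.0) / num_occurences)
    return 0.0 if sd == 0 else (freq - mean) / sd

# rows of in_both_flat: (token, num_occurences, mean, mean_sqrs, freq)
in_both_flat = [
    ("a", 6, 4 / 3, 2.0, 4),
    ("b", 3, 5 / 3, 3.0, 2),
]
scores = {row[0]: trending_score(*row[1:]) for row in in_both_flat}
print(scores)
```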

<p>this result tells us that token 'a' is well over what was expected and is seriously trending<br/>
with a frequency of 4 in this hour's chunk it's 5.6 standard deviations over its mean frequency (μ=1.33)</p>
<p>token 'b' isn't really trending<br/>
with a frequency of 2 in this hour's chunk it's not even one (0.7) standard deviation over its mean frequency (μ=1.66)</p>
<p>( note the script divides by n when calculating the standard deviation, giving 0.47 for both 'a' and 'b'; the ρ values of 0.51 and 0.57 quoted earlier divide by n-1, which is why they don't quite reproduce these scores )</p>
<p>at this stage we can do whatever we want with the trending scores, perhaps save the top 10 off.</p>
<div class="pygments_murphy"><pre>trending_sorted = order trending by trending_score desc PARALLEL 1;
top_trending = limit trending_sorted 10 PARALLEL 1;
store top_trending into &#39;data/trending/006&#39;;
</pre></div>

<p>after the trending check we need to fold the chunk into the existing model</p>
<div class="pygments_murphy"><pre>&gt; model_n1__folded = foreach in_both_flat {
                       new_total = (mean * num_occurences) + freq;
                       new_total_sqrs = (mean_sqrs * num_occurences) + (freq*freq);
                       num_occurences = num_occurences + 1;
                       mean = new_total / num_occurences;
                       mean_sqrs = new_total_sqrs / num_occurences;
                       generate model::token, num_occurences, mean, mean_sqrs;
};
&gt; dump model_n1__folded;
(a,7,1.7142855F,4.0F)
(b,4,1.7499995F,3.25F)
</pre></div>

<p>and finally combine with the previous parts we had broken out before to make the new generation of the model!</p>
<div class="pygments_murphy"><pre>&gt; model_n1 = union model_n1__just_model, 
                   model_n1__just_chunk, 
                   model_n1__folded;
&gt; store model_n1 into &#39;data/model/007&#39;;
</pre></div>]]></content:encoded>
    </item>
    <item>
      <title>a pig screencast</title>
      <link>http://matpalm.com/blog/2010/01/17/a-pig-screencast/</link>
      <category><![CDATA[screencast]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=248</guid>
      <description>a pig screencast</description>
      <content:encoded><![CDATA[<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="300" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://vimeo.com/moogaloop.swf?clip_id=8789251&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed type="application/x-shockwave-flash" width="400" height="300" src="http://vimeo.com/moogaloop.swf?clip_id=8789251&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p><a href="http://vimeo.com/8789251">pig demo</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>based on a talk i gave at work recently</p>]]></content:encoded>
    </item>
    <item>
      <title>e11.2 aggregating tweets by time of day</title>
      <link>http://matpalm.com/blog/2009/10/24/e11-2-aggregating-tweets-by-time-of-day/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=144</guid>
      <description>e11.2 aggregating tweets by time of day</description>
      <content:encoded><![CDATA[<p>for v3 lets aggregate by time of the day, should make for an interesting animation</p>
<p>browsing the data there are lots of other lat longs in data, not just iPhone: and ÜT: there are also one tagged with Coppó:, Pre:, etc perhaps should just try to take anything that looks like a lat long.</p>
<p>furthermore lets switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00,</p>
<p>i've been streaming all my tweets ( <a href="http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/">as previously discussed</a> ) and been storing them in a directory json_stream</p>
<p>here are the steps...</p>
<h2>1. extract locations</h2>
<p>use a streaming script to take a tweet in json form and emit the tweet time and location string</p>
<pre>export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_locations.rb">./extract_locations.rb</a> -reducer /bin/cat \
 -input json_stream -output locations</pre>

<p>sample output (4.7e6 tuples) { time, location string }</p>
<pre>Wed Oct 14 22:01:41 +0000 2009    iPhone: -23.492420,-46.846916
Wed Oct 14 22:01:41 +0000 2009    Ottawa
Wed Oct 14 22:01:41 +0000 2009    DA HOOD
Wed Oct 14 22:01:42 +0000 2009    Earth</pre>

<h2>2. pluck lat longs from locations</h2>
<p>make another pass and extract possible lat lons from the location strings</p>
<pre>hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_lat_longs_from_locations.rb">./extract_lat_longs_from_locations.rb</a> -reducer /bin/cat \
 -input locations -output lat_lons</pre>

<p>sample output (reduces down to 320e3 data points) { time, lat, lon }</p>
<pre>Wed Oct 14 22:01:41 +0000 2009    -23.49242    -46.846916
Wed Oct 14 22:05:25 +0000 2009    35.670086    139.740766
Wed Oct 14 22:11:35 +0000 2009    41.37731257    -74.68153942
Wed Oct 14 22:15:18 +0000 2009    51.503212    5.478329</pre>

<h2>3. bucket data into timeslices and points for a map</h2>
<p>we need to project the times into 10min slots; ie 00:05 will be slot 0, 00:12 will be slot 1.</p>
<p>we also need to project the lat lons to x and y coords (0-&gt;1) using a simple <a href="http://en.wikipedia.org/wiki/Mercator_projection">mercator</a> projection</p>
<pre>hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/lat_long_to_merc_and_bucket.rb">./lat_long_to_merc_and_bucket.rb</a> -reducer /bin/cat \
 -cmdenv BUCKET_SIZE=0.005 \
 -input lat_lons -output x_y_points</pre>

<p>sample output { timeslice, normalised x position, normalised y position }</p>
<pre>122     0.48    0.205
122     0.295   0.26
122     0.29    0.26
123     0.265   0.265</pre>
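<p>the linked ruby script does the real work; what follows is only a rough python sketch of the two projections, under a few assumptions: slots are counted from midnight, the mercator formula is the usual normalised one with y measured from the top, and bucketing snaps down to a multiple of BUCKET_SIZE. the function names are invented for the example and it makes no attempt to reproduce the exact sample output above.</p>

```python
import math

BUCKET_SIZE = 0.005  # same value as the -cmdenv setting above

def time_slot(hour, minute, slot_minutes=10):
    """index of the 10 minute slot within the day; 00:05 is 0, 00:12 is 1."""
    return (hour * 60 + minute) // slot_minutes

def mercator(lat, lon):
    """project degrees to normalised (x, y), with y measured from the top."""
    x = (lon + 180.0) / 360.0
    lat_rad = math.radians(lat)
    merc = math.log(math.tan(math.pi / 4 + lat_rad / 2))
    y = 0.5 - merc / (2 * math.pi)
    return x, y

def bucket(value, size=BUCKET_SIZE):
    """snap a value down to the start of its bucket."""
    return round(math.floor(value / size) * size, 6)

# one of the lat lon rows from the step 2 sample output
x, y = mercator(35.670086, 139.740766)
print(time_slot(22, 5), bucket(x), bucket(y))
```

<p>with 10 minute slots there are 144 per day, one per frame of the animation. latitudes very close to the poles blow up under this projection, so a real script would need to clamp or drop them.</p>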

<p>as a slight digression before we move onto aggregating per timeslice here's a pic of all 320e3 tweets on a heatmap.</p>
<p>some interesting noise on the greenwich meridian, must be incorrectly identified lat lons during the <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_lat_longs_from_locations.rb">./extract_lat_longs_from_locations.rb</a> step.</p>
<h3>log10 tweet location (click for a hires version)</h3>
<p><a href="http://matpalm.com/rtw_tweet/v3/hi_res_320e3_log.jpg"><img class="size-full wp-image-149" title="lo_res_320e3_log" src="/blog/imgs/2009/10/lo_res_320e3_log.jpg" alt="log10 tweet location, click for a hires version" width="640" height="496" /></a></p>
<h2>4. aggregate (x,y) pairs per timeslice</h2>
<p>next we aggregate, per timeslice, the frequency of each x,y point.
we'll do this with a pig script, <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/aggregate_per_timeslice.pig">aggregate_per_timeslice.pig</a></p>
<pre>
-- aggregating per timeslice
pts = load 'x_y_points/part-00000' as (timeslice:int, x:float, y:float);
pts2 = group pts by (timeslice,x,y);
pts3 = foreach pts2 generate $0, COUNT($1) ;
pts4 = foreach pts3 generate $0.$0, $0.$1, $0.$2, $1 as freq;
pts5 = order pts4 by timeslice;
store pts5 into 'aggregated_freqs';</pre>

<p>results in the tuples in 'aggregated_freqs' { timeslice, normalised x position, normalised y position, frequency }</p>
<pre>0    0.0    0.32    1
0    0.06    0.325    9
0    0.065    0.33    1
0    0.08    0.17    2
0    0.155    0.225    8</pre>

<p>we need to normalise each frequency value for drawing on the map and would have liked to do this in pig too, but it turns out there isn't a log function in v0.3 of pig (??)</p>
<p>will have to do scaling when generating the images. isn't such a big deal since the dataset is quite small at this stage but was trying to use this whole thing as an excuse to learn pig :(</p>
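<p>the scaling itself is only a couple of lines once you're outside pig. a rough python sketch, assuming each frequency is mapped to a 0 to 1 intensity by dividing its log10 by the log10 of the largest frequency; heat_maps.rb may well do this differently and <code>log_scale</code> is a name invented here.</p>

```python
import math

def log_scale(freq, max_freq):
    """map a frequency in 1..max_freq to an intensity in 0..1 on a log10 scale."""
    if max_freq <= 1:
        # every cell has the same count, so there is nothing to spread out
        return 1.0
    return math.log10(freq) / math.log10(max_freq)

# frequency column from the aggregated_freqs sample above
freqs = [1, 9, 1, 2, 8]
max_freq = max(freqs)
intensities = [round(log_scale(f, max_freq), 3) for f in freqs]
print(intensities)
```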
<h2>5. take aggregated_freqs and make 144 heat map images</h2>
<p>use a simple script to read through the aggregated_freqs and generate a heat map for each frame</p>
<pre><a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/heat_maps.rb">heat_maps.rb</a> aggregated_freqs 0.005 frames</pre>

<h2>6. convert to animation</h2>
<p>next bundle stills into an animation and upload to youtube</p>
<pre>mencoder "mf://frames/*" -mf fps=25 -o rtw_tweet_v3.avi -ovc x264 -x264encopts bitrate=750</pre>

<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="344" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://www.youtube.com/v/cSnGI33CwP0&amp;hl=en&amp;fs=1&amp;" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="425" height="344" src="http://www.youtube.com/v/cSnGI33CwP0&amp;hl=en&amp;fs=1&amp;" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<h2>7. conclusions</h2>
<ol>
<li>didn't really end up using hadoop's power that much; streaming jobs that use just cat as a reducer are just a parallel way of doing 1:1 string mapping</li>
<li>aggregation was really easy in pig but lack of Log function is annoying; could have written a <a href="http://wiki.apache.org/pig/UDFManual">UDF</a>, and there probably already is one but i couldn't find it</li>
<li>this visualisation came out pretty lame; funny to see how the really swish visualisations rely far more on pretty colours and smooth lines than the data itself. there are a bundle of things i could do with this one but it's time to move on to something else.</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>e11.1 from bash scripts to hadoop</title>
      <link>http://matpalm.com/blog/2009/10/18/e11-1-from-bash-scripts-to-hadoop/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[maps]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=100</guid>
      <description>e11.1 from bash scripts to hadoop</description>
      <content:encoded><![CDATA[<p>let's rewrite <a href="http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/">v1</a> using hadoop tooling, code is on <a href="http://github.com/matpalm/rtw_tweet/tree/master/v2/">github</a></p>
<p>we'll run hadoop in non distributed <a href="http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html#Local">standalone mode</a>. in this mode everything runs in a single jvm so it's nice and simple to dev against.</p>
<h3>step 1: extract the locations strings from the json stream</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>bzcat sample.bz2 | ./extract_locations.pl &gt; locations
</pre></div>

<p>using the awesome <a href="http://hadoop.apache.org/common/docs/current/streaming">hadoop streaming</a> interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.</p>
<p>for the mapper we'll use exactly the same script as before, extract_locations.pl, and since there is no reduce component to this job we use an "identity" script, ie cat, as the reduce phase.</p>
<div class="pygments_murphy"><pre>mkdir json_stream
bzcat sample.bz2 | gzip - &gt; json_stream/input.gz
# hadoop supports gzip out of the box but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations
</pre></div>
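<p>any program that reads records on stdin and writes records on stdout can be a streaming mapper. the real one here is extract_locations.pl; the python below is only a hypothetical equivalent, assuming each input line is one json tweet with the location string under <code>user.location</code>.</p>

```python
import json
import sys

def extract_location(line):
    """return the user's location string from one json tweet, or None."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None  # skip malformed lines rather than fail the whole job
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        return None
    location = user.get("location")
    if isinstance(location, str) and location.strip():
        return location.strip()
    return None

def run_mapper(lines, out=sys.stdout):
    """streaming contract: one record in per line, one record out per line."""
    for line in lines:
        location = extract_location(line)
        if location:
            out.write(location + "\n")

# a tiny in-memory sample; as a real mapper this would be run_mapper(sys.stdin)
sample = ['{"user": {"location": "Ottawa"}}', "not json", '{"delete": {}}']
run_mapper(sample)
```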

<p>this gives us the locations in a single file locations/part-00000</p>
<h3>step 2: extract iphone and ut lat longs strings</h3>
<p>the second step is another text munging problem where we extract just the lat longs for the iPhone and UT tagged locations</p>
<p>ie for strings of the form</p>
<div class="pygments_murphy"><pre>iPhone: 21.320328,-157.877579
\u00dcT: 41.727877,-91.626323
</pre></div>

<p>we want to extract</p>
<div class="pygments_murphy"><pre>21.320328 -157.877579
41.727877 -91.626323
</pre></div>
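<p>the munging boils down to one regular expression. a small python sketch of the idea (the real script is the ruby one linked from github; <code>extract_lat_long</code> and its <code>tag</code> argument are made up for illustration):</p>

```python
import re

def extract_lat_long(location, tag):
    """return 'lat lon' if location looks like 'TAG: lat,lon', else None."""
    pattern = re.escape(tag) + r":\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)"
    match = re.match(pattern, location.strip())
    if match is None:
        return None
    return match.group(1) + " " + match.group(2)

print(extract_lat_long("iPhone: 21.320328,-157.877579", "iPhone"))
print(extract_lat_long("\u00dcT: 41.727877,-91.626323", "\u00dcT"))
print(extract_lat_long("Ottawa", "iPhone"))
```

<p>anything that doesn't match the tag and a pair of decimals is simply dropped, which is how free text locations like "Ottawa" fall out.</p>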

<p>since this is just text manipulation we'll use streaming again</p>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations | ./extract_lat_longs_from_locations.rb iphone &gt; locations.iphone
cat locations | ./extract_lat_longs_from_locations.rb ut &gt; locations.ut
</pre></div>

<p>for hadoop streaming it's</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb iphone&#39; -reducer /bin/cat \
  -input locations -output locations.iphone
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb ut&#39; -reducer /bin/cat \
  -input locations -output locations.ut
</pre></div>

<h3>step 3: convert from lat long to mercator coordinates and aggregate into buckets for the heat map</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations.{ut,iphone} | ./lat_long_to_merc.rb | ./bucket.rb | sort | uniq -c
</pre></div>

<p>this converts the three tuples { lat, long }</p>
<div class="pygments_murphy"><pre>35.670086 139.740766
-23.492420 -46.846916
35.657570 139.744858
</pre></div>

<p>into two tuples { frequency, left-offset, top-offset }</p>
<div class="pygments_murphy"><pre>1 0.36 0.45
2 0.88 0.28
</pre></div>

<p>the first two parts, converting to mercator (lat_long_to_merc.rb) and the bucketing (bucket.rb), i'll combine into one script.</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./lat_long_to_merc_and_bucket.rb -reducer /bin/cat \
  -input locations.iphone -input locations.ut -output x_y_points
</pre></div>

<p>but the use of sort and uniq to aggregate the data is represented by the shuffle and reduce stages of hadoop.</p>
<p>we could use the aggregate functionality of the streaming interface but i'm trying to learn more pig so we'll use that instead. <a href="http://hadoop.apache.org/pig/">pig</a> is a platform that translates scripts written in its query language, pig latin, into map reduce jobs. my main motivation for using it has been that it's great at doing joins, something i've found to be a <a href="http://matpalm.com/sip/take2_term_frequency.html#hadoop+part+2">big pain</a> to represent in plain map reduce jobs.</p>
<p>( note we didn't do the conversion to mercator and bucketing in pig, the arithmetic operations provided are a bit lacking. )</p>
<p>enter a pig shell running in standalone (ie non hadoop distributed) mode</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local
</pre></div>

<p>load the points</p>
<div class="pygments_murphy"><pre>grunt&gt; pts = load &#39;x_y_points/part-00000&#39; as (x:float, y:float);
grunt&gt; describe pts;
pts: {x: float,y: float}
grunt&gt; dump pts
(0.06F,0.32F)
(0.15F,0.27F)
(0.16F,0.27F)
...
</pre></div>

<p>group them together</p>
<div class="pygments_murphy"><pre>grunt&gt; buckets = group pts by (x,y);
grunt&gt; describe buckets;
buckets: {group: (x: float,y: float),pts: {x: float,y: float}}
grunt&gt; dump buckets;
((0.06F,0.32F),{(0.06F,0.32F)})
((0.15F,0.27F),{(0.15F,0.27F)})
((0.16F,0.27F),{(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F)})
...
</pre></div>

<p>from the groups emit the size of each bucket, this corresponds to the frequency</p>
<div class="pygments_murphy"><pre>grunt&gt; freq = foreach buckets { generate group, SIZE(pts) as size; }
grunt&gt; describe freq;
freq: {group: (x: float,y: float),size: long}
grunt&gt; dump freq
((0.06F,0.32F),1L)
((0.15F,0.27F),1L)
((0.16F,0.27F),4L)
...
</pre></div>

<p>and based on the sizes we can evaluate the min and max frequencies which we'll use in the colour coding of the heat map</p>
<div class="pygments_murphy"><pre>grunt&gt; freqs = group freq all;
grunt&gt; describe freqs;
freqs: {group: chararray,freq: {group: (x: float,y: float),size: long}}
grunt&gt; dump freqs;
(all,{((0.06F,0.32F),1L),((0.15F,0.27F),1L), ... })
grunt&gt; store freq into &#39;freqs&#39;;
grunt&gt; min_max = foreach freqs { generate MAX(freq.size) as max, MIN(freq.size) as min; };
grunt&gt; describe min_max;
min_max: {max: long,min: long}
grunt&gt; dump min_max;
(7L,1L)
grunt&gt; store min_max into &#39;min_max&#39;;
bash&gt; cat freqs
(0.06,0.32)   1
(0.15,0.27)   1
(0.16,0.27)   4
</pre></div>

<p>these can all be run as one command</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local -f freqs.pig
</pre></div>

<p>we just need our final conversion to a javascript snippet to jam into a map page</p>
<div class="pygments_murphy"><pre>bash&gt; cat freqs | ./as_draw_square.rb 1 7
</pre></div>

<p>win!</p>
<p>to make things a little different let's use a bigger sample of 475e3 tweets from oct 13 07:00 to 20:00. this results in 10e3 iphone locations (7e3 unique) and 22e3 ut locations (15e3 unique)</p>
<p>lat longs are bucketed into only 478 pixels for the map</p>
<p>here's one plot with the raw numbers; highest freq is 9e3 in jakarta</p>
<h4>raw frequencies</h4>
<img class="size-full wp-image-111" title="raw frequencies" src="/blog/imgs/2009/10/raw1.jpg" alt="raw frequencies" width="682" height="529" />

<p>scaling down by log 10 gives a smoother map</p>
<h4>log10 frequencies</h4>
<img class="size-full wp-image-113" title="log10 frequencies" src="/blog/imgs/2009/10/log10.jpg" alt="log10 frequencies" width="682" height="529" />

<p>and here is a comparison of iphone vs ut. without knowing what ut is i can see it's not big in northern europe or japan but it's popular in indonesia.</p>
<h4>iphones</h4>
<img class="size-medium wp-image-120" title="iphones" src="/blog/imgs/2009/10/iphones-300x232.jpg" alt="iphones" width="300" height="232" />

<h4>ut</h4>
<img class="size-medium wp-image-119" title="ut" src="/blog/imgs/2009/10/ut-300x232.jpg" alt="ut" width="300" height="232" />

<p>next steps, animating based on the hour of the day</p>]]></content:encoded>
    </item>
    <item>
      <title>e10.0 introducing tgraph</title>
      <link>http://matpalm.com/blog/2009/09/19/e10-0-introducing-tgraph/</link>
      <category><![CDATA[big data]]></category>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <category><![CDATA[algorithms]]></category>
      <guid>http://matpalm.com/blog/?p=47</guid>
      <description>e10.0 introducing tgraph</description>
      <content:encoded><![CDATA[<p>so <a href="http://matpalm.com/sip/">e9 sip</a> is on hold for a bit while i kick off e10 tgraph. was looking for another problem to try hadoop with and came across a classic graph one, <a title="pagerank" href="http://en.wikipedia.org/wiki/PageRank">pagerank</a>. a well understood algorithm like page rank will be a  great chance to try <a href="http://hadoop.apache.org/pig/">pig</a>, the query language that sits on top of hadoop mapreduce.</p>
<p>so we need a graph to work on. my first thoughts were using one of the <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596">wikipedia linkage dumps</a> but it feels a bit sterile. instead it's a good excuse to do a little crawl of the following graph of twitter.</p>
<p>this will also be a chance to try to document a project via a blog. <a href="http://www.skorks.com/">skorks</a>' incessant blog rambling has convinced me to give it a go.</p>]]></content:encoded>
    </item>
  </channel>
</rss>

