<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>e11.1 from bash scripts to hadoop</title>
      <link>http://matpalm.com/blog/2009/10/18/e11-1-from-bash-scripts-to-hadoop/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[maps]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=100</guid>
      <description>e11.1 from bash scripts to hadoop</description>
      <content:encoded><![CDATA[<p>let's rewrite <a href="http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/">v1</a> using hadoop tooling, code is on <a href="http://github.com/matpalm/rtw_tweet/tree/master/v2/">github</a></p>
<p>we'll run hadoop in non distributed <a href="http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html#Local">standalone mode</a>. in this mode everything runs in a single jvm so it's nice and simple to dev against.</p>
<h3>step 1: extract the locations strings from the json stream</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>bzcat sample.bz2 | ./extract_locations.pl &gt; locations
</pre></div>

<p>using the awesome <a href="http://hadoop.apache.org/common/docs/current/streaming">hadoop streaming</a> interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.</p>
<p>for the mapper we'll use exactly the same script as before, extract_locations.pl. since there is no reduce component to this job we use an "identity" script, ie cat, as the reduce phase.</p>
<div class="pygments_murphy"><pre>mkdir json_stream
bzcat sample.bz2 | gzip - &gt; json_stream/input.gz
# hadoop supports gzip out of the box but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations
</pre></div>

<p>this gives us the locations in a single file locations/part-00000</p>
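<p>as a rough idea of what a streaming mapper looks like, here's a minimal python sketch of the same job. it is not the actual extract_locations.pl; it assumes each input line is one json tweet with the location under user.location, and the function names are made up for illustration.</p>
<div class="pygments_murphy"><pre>
```python
import json

def extract_location(line):
    """return the user location from one json tweet, or None if there isn't one"""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None  # skip lines that are not valid json
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        return None  # eg delete notices carry no user
    location = user.get("location")
    return location if location else None

def run_mapper(lines):
    """yield one location for every input line that has one"""
    for line in lines:
        location = extract_location(line)
        if location is not None:
            yield location

# hypothetical sample input: one tweet with a location, one delete notice
sample = [
    '{"user": {"location": "iPhone: 21.320328,-157.877579"}}',
    '{"delete": {"status": {"id": 1}}}',
]
for location in run_mapper(sample):
    print(location)
```
</pre></div>
<p>in a real streaming job you'd pass sys.stdin to run_mapper and print each result, one per line.</p>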
<h3>step 2: extract iphone and ut lat long strings</h3>
<p>the second step is another text munging problem where we extract just the lat longs for the iPhone and UT tagged locations</p>
<p>ie for strings of the form</p>
<div class="pygments_murphy"><pre>iPhone: 21.320328,-157.877579
\u00dcT: 41.727877,-91.626323
</pre></div>

<p>we want to extract</p>
<div class="pygments_murphy"><pre>21.320328 -157.877579
41.727877 -91.626323
</pre></div>
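<p>the extraction itself is a small regex job. here's a hedged python sketch of the idea (the real script is extract_lat_longs_from_locations.rb on github, which may well do it differently). the names NUMBER, TAGS, PATTERNS and extract_lat_long are made up for illustration, and accepting three spellings of the ut tag is an assumption.</p>
<div class="pygments_murphy"><pre>
```python
import re

# a latitude or longitude such as -157.877579
NUMBER = r"(-?\d+(?:\.\d+)?)"

# the ut tag may show up as a literal json escape (backslash u00dc),
# as a real u-umlaut, or as a plain U; all three are accepted here
TAGS = {
    "iphone": r"iPhone",
    "ut": r"(?:\\u00dc|\u00dc|U)T",
}

# one anchored pattern per tag, eg "iPhone: 21.320328,-157.877579"
PATTERNS = {
    name: re.compile(r"^\s*" + tag + r":\s*" + NUMBER + r",\s*" + NUMBER + r"\s*$", re.IGNORECASE)
    for name, tag in TAGS.items()
}

def extract_lat_long(location, tag):
    """return 'lat long' for a location carrying the given tag, else None"""
    match = PATTERNS[tag].match(location)
    if match is None:
        return None
    return match.group(1) + " " + match.group(2)

print(extract_lat_long("iPhone: 21.320328,-157.877579", "iphone"))  # 21.320328 -157.877579
print(extract_lat_long("\\u00dcT: 41.727877,-91.626323", "ut"))      # 41.727877 -91.626323
```
</pre></div>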

<p>since this is just text manipulation we'll use streaming again</p>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations | ./extract_lat_longs_from_locations.rb iphone &gt; locations.iphone
cat locations | ./extract_lat_longs_from_locations.rb ut &gt; locations.ut
</pre></div>

<p>for hadoop streaming it's</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb iphone&#39; -reducer /bin/cat \
  -input locations -output locations.iphone
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb ut&#39; -reducer /bin/cat \
  -input locations -output locations.ut
</pre></div>

<h3>step 3: convert from lat long to mercator coordinates and aggregate into buckets for the heat map</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations.{ut,iphone} | ./lat_long_to_merc.rb | ./bucket.rb | sort | uniq -c
</pre></div>

<p>this converts the three tuples { lat, long }</p>
<div class="pygments_murphy"><pre>35.670086 139.740766
-23.492420 -46.846916
35.657570 139.744858
</pre></div>

<p>into two tuples { frequency, left-offset, top-offset }</p>
<div class="pygments_murphy"><pre>1 0.36 0.45
2 0.88 0.28
</pre></div>

<p>the first two parts, converting to mercator (lat_long_to_merc.rb) and the bucketing (bucket.rb), i'll combine into one script.</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./lat_long_to_merc_and_bucket.rb -reducer /bin/cat \
  -input locations.iphone -input locations.ut -output x_y_points
</pre></div>
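<p>for the curious, here's a rough python sketch of what a combined convert-and-bucket step could look like. it is not the ruby script from github. it assumes the map spans all longitudes and latitudes from -85 to 85 (max_lat) and that buckets are 0.01 wide; the left offsets then agree with the example above but the top offsets will not, since they depend on the extent of the map image actually used.</p>
<div class="pygments_murphy"><pre>
```python
import math

def mercator_y(lat_degrees):
    """mercator y (in radians) for a latitude in degrees"""
    return math.asinh(math.tan(math.radians(lat_degrees)))

def lat_long_to_offsets(lat, lon, max_lat=85.0):
    """map a lat long to (left, top) offsets, each a fraction of the map size"""
    left = (lon + 180.0) / 360.0
    # assumed extent: the map covers -max_lat to max_lat, top edge at max_lat
    top = (mercator_y(max_lat) - mercator_y(lat)) / (2 * mercator_y(max_lat))
    return left, top

def bucket(offset, buckets_per_axis=100):
    """snap an offset down to the edge of its bucket, eg 0.888 to 0.88"""
    return math.floor(offset * buckets_per_axis) / buckets_per_axis

left, top = lat_long_to_offsets(35.670086, 139.740766)
print("%.2f %.2f" % (bucket(left), bucket(top)))
```
</pre></div>
<p>flooring rather than rounding keeps every point inside the bucket whose left and top edge it is reported under.</p>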

<p>the sort and uniq that aggregate the data, though, correspond to the shuffle and reduce stages of hadoop.</p>
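<p>to make that correspondence concrete, here's a small python sketch of a reducer that does the job of uniq -c. it relies on its input already being sorted, which is what the shuffle guarantees per key; count_sorted is a made up name and the explicit sorted() call just stands in for the shuffle.</p>
<div class="pygments_murphy"><pre>
```python
from itertools import groupby

def count_sorted(lines):
    """emulate uniq -c: count runs of identical lines in already sorted input"""
    stripped = (line.rstrip("\n") for line in lines)
    for key, group in groupby(stripped):
        yield sum(1 for _ in group), key

# mapper output for the three example points; sorted() plays the shuffle
points = ["0.88 0.28", "0.36 0.45", "0.88 0.28"]
for count, key in count_sorted(sorted(points)):
    print(count, key)
```
</pre></div>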
<p>we could use the aggregate functionality of the streaming interface but i'm trying to learn more pig so we'll use that instead. <a href="http://hadoop.apache.org/pig/">pig</a> compiles scripts written in its query language, pig latin, into map reduce jobs. my main motivation for using it has been that it's great at doing joins, something i've found to be a <a href="http://matpalm.com/sip/take2_term_frequency.html#hadoop+part+2">big pain</a> to represent in plain map reduce jobs.</p>
<p>( note we didn't do the conversion to mercator and bucketing in pig, the arithmetic operations provided are a bit lacking. )</p>
<p>enter a pig shell running in standalone (ie non hadoop distributed) mode</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local
</pre></div>

<p>load the points</p>
<div class="pygments_murphy"><pre>grunt&gt; pts = load &#39;x_y_points/part-00000&#39; as (x:float, y:float);
grunt&gt; describe pts;
pts: {x: float,y: float}
grunt&gt; dump pts;
(0.06F,0.32F)
(0.15F,0.27F)
(0.16F,0.27F)
...
</pre></div>

<p>group them together</p>
<div class="pygments_murphy"><pre>grunt&gt; buckets = group pts by (x,y);
grunt&gt; describe buckets;
buckets: {group: (x: float,y: float),pts: {x: float,y: float}}
grunt&gt; dump buckets;
((0.06F,0.32F),{(0.06F,0.32F)})
((0.15F,0.27F),{(0.15F,0.27F)})
((0.16F,0.27F),{(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F)})
...
</pre></div>

<p>from the groups emit the size of each bucket, this corresponds to the frequency</p>
<div class="pygments_murphy"><pre>grunt&gt; freq = foreach buckets { generate group, SIZE(pts) as size; };
grunt&gt; describe freq;
freq: {group: (x: float,y: float),size: long}
grunt&gt; dump freq;
((0.06F,0.32F),1L)
((0.15F,0.27F),1L)
((0.16F,0.27F),4L)
...
</pre></div>

<p>and based on the sizes we can evaluate the min and max frequencies which we'll use in the colour coding of the heat map</p>
<div class="pygments_murphy"><pre>grunt&gt; freqs = group freq all;
grunt&gt; describe freqs;
freqs: {group: chararray,freq: {group: (x: float,y: float),size: long}}
grunt&gt; dump freqs;
(all,{((0.06F,0.32F),1L),((0.15F,0.27F),1L), ... })
grunt&gt; store freq into &#39;freqs&#39;;

grunt&gt; min_max = foreach freqs { generate MAX(freq.size) as max, MIN(freq.size) as min; };
grunt&gt; describe min_max;
min_max: {max: long,min: long}
grunt&gt; dump min_max;
(7L,1L)
grunt&gt; store min_max into &#39;min_max&#39;;

bash&gt; cat freqs
(0.06,0.32)   1
(0.15,0.27)   1
(0.16,0.27)   4
</pre></div>

<p>these can all be run as one command</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local -f freqs.pig
</pre></div>

<p>we just need our final conversion to a javascript snippet to jam into a map page</p>
<div class="pygments_murphy"><pre>bash&gt; cat freqs | ./as_draw_square.rb 1 7
</pre></div>

<p>win!</p>
<p>to make things a little different let's use a bigger sample of 475e3 tweets from oct 13 07:00 to 20:00. this results in 10e3 iphone locations (7e3 unique) and 22e3 ut locations (15e3 unique)</p>
<p>lat longs are bucketed into only 478 pixels for the map</p>
<p>here's one plot with the raw numbers; highest freq is 9e3 in jakarta</p>
<h4>raw frequencies</h4>
<img class="size-full wp-image-111" title="raw frequencies" src="/blog/imgs/2009/10/raw1.jpg" alt="raw frequencies" width="682" height="529" />

<p>scaling down by log 10 gives a smoother map</p>
<h4>log10 frequencies</h4>
<img class="size-full wp-image-113" title="log10 frequencies" src="/blog/imgs/2009/10/log10.jpg" alt="log10 frequencies" width="682" height="529" />
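<p>how the frequencies get turned into colours isn't shown here, so take this as a sketch under assumptions: one plausible way to do the log10 scaling is to map each frequency onto a 0 to 1 intensity between the min and max found earlier. log_scale is a hypothetical helper, not part of as_draw_square.rb.</p>
<div class="pygments_murphy"><pre>
```python
import math

def log_scale(freq, min_freq, max_freq):
    """scale a frequency to the range 0..1 along a log10 axis"""
    low = math.log10(min_freq)
    high = math.log10(max_freq)
    if high == low:
        return 1.0  # every bucket has the same frequency
    return (math.log10(freq) - low) / (high - low)

# with min 1 and max 7, as in the small sample above
for freq in (1, 4, 7):
    print(freq, round(log_scale(freq, 1, 7), 2))
```
</pre></div>
<p>on a linear scale one very busy bucket pushes everything else towards zero; on the log scale the quieter buckets stay visible.</p>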

<p>and here is a comparison of iphone vs ut. without knowing what ut is i can see it's not big in northern europe or japan but it's popular in indonesia.</p>
<h4>iphones</h4>
<img class="size-medium wp-image-120" title="iphones" src="/blog/imgs/2009/10/iphones-300x232.jpg" alt="iphones" width="300" height="232" />

<h4>ut</h4>
<img class="size-medium wp-image-119" title="ut" src="/blog/imgs/2009/10/ut-300x232.jpg" alt="ut" width="300" height="232" />

<p>next steps, animating based on the hour of the day</p>]]></content:encoded>
    </item>
    <item>
      <title>e11.0 tweets around the world</title>
      <link>http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[maps]]></category>
      <category><![CDATA[twitter]]></category>
      <guid>http://matpalm.com/blog/?p=88</guid>
      <description>e11.0 tweets around the world</description>
      <content:encoded><![CDATA[<p>was discussing the <a href="http://apiwiki.twitter.com/Streaming-API-Documentation">streaming twitter api</a> with <a href="http://github.com/srbartlett">steve</a> and though i knew about the private firehose i didn't know there was a lighter weight public gardenhose interface!</p>
<p>since discovering this my pvr has basically been running
<pre>curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json |
   gzip -9 - &gt; sample.json.gz</pre>
but what am i going to do with all this data?</p>
<p>while poking around i noticed there was a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let's do some work extracting lat longs and displaying them as heat map points on a map.</p>
<p>all the code is on <a href="http://github.com/matpalm/rtw_tweet/tree/master/v1/">github</a></p>
<p>as a test then let's take a <a href="http://github.com/matpalm/rtw_tweet/tree/master/v1/">sample.bz2</a> of 1,300 tweets between Oct 14 22:01:41 and 22:03:24</p>
<p>from this let's just extract the location part of the tweet
<pre>bzcat sample.bz2 | ./extract_locations.pl &gt; locations</pre>
of these 1,300 there are 30 examples of iphone lat longs (eg iPhone: -23.492420,-46.846916)
<pre>cat locations | ./extract_lat_longs_from_locations.rb iphone &gt; locations.iphone</pre>
and 36 examples of ut lat longs (eg ÜT: 51.503212,5.478329)
<pre>cat locations | ./extract_lat_longs_from_locations.rb ut &gt; locations.ut</pre>
on a side note, does anyone have any idea what ÜT is ? a phone type, maybe a carrier?</p>
<p>we need to convert these lat/longs to x/y points so we can plot onto a map and we'll use the standard <a href="http://en.wikipedia.org/wiki/Mercator_projection">mercator projection</a> to do this
<pre>cat locations.{ut,iphone} | ./lat_long_to_merc.rb &gt; x_y_points</pre>
for the heat map we want to aggregate into buckets so the pixels are nice and big. finally we'll output some simple javascript we can cut and paste into some map html
<pre>cat x_y_points | ./bucket.rb | sort | uniq -c | ./as_draw_square.rb</pre>
the final result is <a href="http://www.matpalm.com/rtw_tweet/v1/map.html">this map</a> !</p>
<p>a good start. next to do the same over a much larger sample using hadoop streaming and pig and then work towards an animation by aggregating on time slices</p>]]></content:encoded>
    </item>
  </channel>
</rss>

