<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>e11.3 at what time does the world tweet?</title>
      <link>http://matpalm.com/blog/2009/10/28/e11-3-at-what-time-does-the-world-tweet/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[r]]></category>
      <guid>http://matpalm.com/blog/?p=186</guid>
      <description>e11.3 at what time does the world tweet?</description>
      <content:encoded><![CDATA[<p>consider the graph below which shows the proportion of tweets per 10 min slot of the day (GMT0)</p>
<p>it compares 4.7e6 tweets with any location vs 320e3 tweets with identifiable lat lons</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-200" title="timeslices_freq.comparison" src="/blog/imgs/2009/10/timeslices_freq.comparison2.jpg" alt="timeslices_freq.comparison" width="750" height="480" /></p>
<p>some interesting observations with unanswered questions...</p>
<ol>
    <li>the ebb and flow is not just a result of the time of day for high twitter traffic areas. the reduction between 06:00 and 10:00 comes close to zero, which can't reflect real usage; there is never a worldwide time when internet traffic hits zero. does twitter turn down its gardenhose for capacity reasons?</li>
    <li>the number of tweets with lat lons is correlated with the number without EXCEPT past 17:00 where the lat lon cases drop drastically. have a couple of ideas banging around my head why this is the case but nothing concrete. any ideas?</li>
</ol>
<p>speaking of correlation here's a scatterplot of tweets with lat lons vs without. the uncorrelated period past 17:00 shows up as a quite obvious cluster.</p>
<img class="aligncenter size-full wp-image-190" title="timeslices_freq.scatter" src="/blog/imgs/2009/10/timeslices_freq.scatter.jpg" alt="timeslices_freq.scatter" width="400" height="480" />

<p><a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/timeslices_freq.graphs.r">and here is the R code for these graphs</a></p>]]></content:encoded>
    </item>
    <item>
      <title>e11.2 aggregating tweets by time of day</title>
      <link>http://matpalm.com/blog/2009/10/24/e11-2-aggregating-tweets-by-time-of-day/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=144</guid>
      <description>e11.2 aggregating tweets by time of day</description>
      <content:encoded><![CDATA[<p>for v3 let's aggregate by time of the day, should make for an interesting animation</p>
<p>browsing the data there are lots of other lat longs in there, not just the iPhone: and ÜT: ones; there are also ones tagged with Coppó:, Pre:, etc. perhaps we should just try to take anything that looks like a lat long.</p>
<p>furthermore let's switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00.</p>
<p>i've been streaming all my tweets ( <a href="http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/">as previously discussed</a> ) and storing them in a directory json_stream</p>
<p>here are the steps...</p>
<h2>1. extract locations</h2>
<p>use a streaming script to take a tweet in json form and emit the tweet time and location string</p>
<pre>export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_locations.rb">./extract_locations.rb</a> -reducer /bin/cat \
 -input json_stream -output locations</pre>

<p>sample output (4.7e6 tuples) { time, location string }</p>
<pre>Wed Oct 14 22:01:41 +0000 2009    iPhone: -23.492420,-46.846916
Wed Oct 14 22:01:41 +0000 2009    Ottawa
Wed Oct 14 22:01:41 +0000 2009    DA HOOD
Wed Oct 14 22:01:42 +0000 2009    Earth</pre>
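<p>a minimal sketch of what the mapper in this step could look like, in ruby. it is not the author's extract_locations.rb; the helper name time_and_location is made up for illustration, and it assumes each input line is one tweet in twitter's json format with created_at and user.location fields.</p>

```ruby
require 'json'

# hypothetical helper, not the author's extract_locations.rb
# takes one tweet as a json string and returns "time\tlocation",
# or nil when the line is not valid json or carries no location
def time_and_location(json_line)
  tweet = JSON.parse(json_line)
  return nil unless tweet.is_a?(Hash)
  user = tweet['user']
  time = tweet['created_at']
  location = user.is_a?(Hash) ? user['location'].to_s.strip : ''
  return nil if time.nil? || location.empty?
  "#{time}\t#{location}"
rescue JSON::ParserError
  nil
end
```

<p>as a streaming mapper it would be called once per line of stdin, printing every non nil result; lines such as delete notices have no created_at and are simply dropped.</p>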

<h2>2. pluck lat longs from locations</h2>
<p>make another pass and extract possible lat lons from the location strings</p>
<pre>hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_lat_longs_from_locations.rb">./extract_lat_longs_from_locations.rb</a> -reducer /bin/cat \
 -input locations -output lat_lons</pre>

<p>sample output (reduces down to 320e3 data points) { time, lat, lon }</p>
<pre>Wed Oct 14 22:01:41 +0000 2009    -23.49242    -46.846916
Wed Oct 14 22:05:25 +0000 2009    35.670086    139.740766
Wed Oct 14 22:11:35 +0000 2009    41.37731257    -74.68153942
Wed Oct 14 22:15:18 +0000 2009    51.503212    5.478329</pre>
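<p>taking "anything that looks like a lat long" could be sketched roughly as below. this is an assumption about the approach, not the contents of extract_lat_longs_from_locations.rb; the regex, the range checks and the name lat_lon_from are all illustrative.</p>

```ruby
# hypothetical sketch, not the author's extract_lat_longs_from_locations.rb
# matches two signed decimals separated by a comma, eg "-23.492420,-46.846916"
LAT_LON = /(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)/

# returns [lat, lon] as floats, or nil when nothing plausible is found
def lat_lon_from(location)
  match = LAT_LON.match(location)
  return nil if match.nil?
  lat = match[1].to_f
  lon = match[2].to_f
  # reject pairs that cannot be real coordinates
  return nil unless lat.between?(-90.0, 90.0)
  return nil unless lon.between?(-180.0, 180.0)
  [lat, lon]
end
```

<p>a loose pattern like this will also pick up number pairs that were never meant as coordinates, so some noise in the result is to be expected.</p>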

<h2>3. bucket data into timeslices and points for a map</h2>
<p>we need to project the times into 10min slots; ie 00:05 will be slot 0, 00:12 will be slot 1.</p>
<p>we also need to project the lat lons to x and y coords (0-&gt;1) using a simple <a href="http://en.wikipedia.org/wiki/Mercator_projection">mercator</a> projection</p>
<pre>hadoop jar $HADOOP_STREAMING_JAR \
 -mapper <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/lat_long_to_merc_and_bucket.rb">./lat_long_to_merc_and_bucket.rb</a> -reducer /bin/cat \
 -cmdenv BUCKET_SIZE=0.005 \
 -input lat_lons -output x_y_points</pre>

<p>sample output { timeslice, normalised x position, normalised y position }</p>
<pre>122     0.48    0.205
122     0.295   0.26
122     0.29    0.26
123     0.265   0.265</pre>
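<p>the three pieces of this step (timeslice, projection, bucketing) can be sketched as below. the helper names are hypothetical and lat_long_to_merc_and_bucket.rb may well do it differently; in particular the y normalisation here (a square map cut off near 85 degrees north and south) is an assumption and is not expected to reproduce the sample output above.</p>

```ruby
require 'time'

# hypothetical helpers; the author's lat_long_to_merc_and_bucket.rb may differ

# 10 minute slot of the day (0..143) for a twitter style timestamp, in utc
def timeslice(created_at)
  t = Time.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').utc
  t.hour * 6 + t.min / 10
end

# mercator projection normalised so that x runs 0 to 1 left to right and
# y runs 0 to 1 top to bottom for latitudes within roughly 85 degrees
def mercator(lat, lon)
  x = (lon + 180.0) / 360.0
  lat_rad = lat * Math::PI / 180.0
  y = 0.5 - Math.log(Math.tan(Math::PI / 4.0 + lat_rad / 2.0)) / (2.0 * Math::PI)
  [x, y]
end

# snap a 0..1 coordinate down to the start of its bucket
def bucket(value, bucket_size)
  ((value / bucket_size).floor * bucket_size).round(6)
end
```

<p>with these, 00:05 lands in slot 0 and 00:12 in slot 1 as described, and a bucket size of 0.005 gives a 200 by 200 grid. latitudes at the poles blow up in this projection, so real input would need clamping or filtering first.</p>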

<p>as a slight digression before we move onto aggregating per timeslice here's a pic of all 320e3 tweets on a heatmap.</p>
<p>some interesting noise on the greenwich meridian, must be incorrectly identified lat lons during the <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/extract_lat_longs_from_locations.rb">./extract_lat_longs_from_locations.rb</a> step.</p>
<h3>log10 tweet location (click for a hires version)</h3>
<p><a href="http://matpalm.com/rtw_tweet/v3/hi_res_320e3_log.jpg"><img class="size-full wp-image-149" title="lo_res_320e3_log" src="/blog/imgs/2009/10/lo_res_320e3_log.jpg" alt="log10 tweet location, click for a hires version" width="640" height="496" /></a></p>
<h2>4. aggregate (x,y) pairs per timeslice</h2>
<p>next we aggregate, per timeslice, the frequency of each x,y point.
we'll do this with a pig script, <a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/aggregate_per_timeslice.pig">aggregate_per_timeslice.pig</a></p>
<pre>
-- aggregating per timeslice
pts = load 'x_y_points/part-00000' as (timeslice:int, x:float, y:float);
pts2 = group pts by (timeslice,x,y);
pts3 = foreach pts2 generate $0, COUNT($1) ;
pts4 = foreach pts3 generate $0.$0, $0.$1, $0.$2, $1 as freq;
pts5 = order pts4 by timeslice;
store pts5 into 'aggregated_freqs';</pre>

<p>results in the tuples in 'aggregated_freqs' { timeslice, normalised x position, normalised y position, frequency }</p>
<pre>0    0.0    0.32    1
0    0.06    0.325    9
0    0.065    0.33    1
0    0.08    0.17    2
0    0.155    0.225    8</pre>
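<p>to make the semantics of the group / COUNT / order steps concrete, here is a rough ruby equivalent. it is for illustration only (the name aggregate_per_timeslice is made up) and runs in memory, which is fine for a sketch but is exactly what pig saves you from on larger data.</p>

```ruby
# hypothetical ruby equivalent of the pig script above, for illustration only
# rows are [timeslice, x, y] triples as produced by step 3
def aggregate_per_timeslice(rows)
  counts = Hash.new(0)
  rows.each { |row| counts[row] += 1 }
  # flatten to [timeslice, x, y, freq]; sorting the arrays orders by
  # timeslice first, then x and y (pig only orders by timeslice)
  counts.map { |(slice, x, y), freq| [slice, x, y, freq] }.sort
end
```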

<p>we need to normalise each frequency value for drawing on the map and would have liked to do this in pig too, but it turns out there isn't a log function in v0.3 of pig (??)</p>
<p>will have to do scaling when generating the images. isn't such a big deal since the dataset is quite small at this stage but was trying to use this whole thing as an excuse to learn pig :(</p>
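<p>one way the scaling at image generation time could look, as a sketch. this is not taken from heat_maps.rb; the name log_intensity and the choice of adding 1 before taking the log (so a frequency of 0 maps to 0.0) are assumptions.</p>

```ruby
# hypothetical scaling helper, not taken from heat_maps.rb
# maps a frequency onto 0.0..1.0 on a log10 scale so that a few very
# busy cells do not wash out everything else
# max_freq is assumed to be at least 1
def log_intensity(freq, max_freq)
  Math.log10(freq + 1) / Math.log10(max_freq + 1)
end
```

<p>the result can then be fed straight into whatever colour ramp the heat map uses.</p>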
<h2>5. take aggregated_freqs and make 144 heat map images</h2>
<p>use a simple script to read through the aggregated_freqs and generate a heat map for each frame</p>
<pre><a href="http://github.com/matpalm/rtw_tweet/blob/master/v3/heat_maps.rb">heat_maps.rb</a> aggregated_freqs 0.005 frames</pre>

<h2>6. convert to animation</h2>
<p>next bundle stills into an animation and upload to youtube</p>
<pre>mencoder "mf://frames/*" -mf fps=25 -o rtw_tweet_v3.avi -ovc x264 -x264encopts bitrate=750</pre>

<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="344" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://www.youtube.com/v/cSnGI33CwP0&amp;hl=en&amp;fs=1&amp;" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="425" height="344" src="http://www.youtube.com/v/cSnGI33CwP0&amp;hl=en&amp;fs=1&amp;" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<h2>7. conclusions</h2>
<ol>
<li>didn't really end up using hadoop's power that much; streaming jobs that use just cat as a reducer are just a parallel way of doing 1:1 string mapping</li>
<li>aggregation was really easy in pig but lack of Log function is annoying; could have written a <a href="http://wiki.apache.org/pig/UDFManual">UDF</a>, and there probably already is one but i couldn't find it</li>
<li>this visualisation came out pretty lame; funny to see how the really swish visualisations rely far more on pretty colours and smooth lines than the data itself. there are a bundle of things i could do with this one but it's time to move on to something else.</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>e11.1 from bash scripts to hadoop</title>
      <link>http://matpalm.com/blog/2009/10/18/e11-1-from-bash-scripts-to-hadoop/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[maps]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[hadoop]]></category>
      <category><![CDATA[pig]]></category>
      <guid>http://matpalm.com/blog/?p=100</guid>
      <description>e11.1 from bash scripts to hadoop</description>
      <content:encoded><![CDATA[<p>let's rewrite <a href="http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/">v1</a> using hadoop tooling, code is on <a href="http://github.com/matpalm/rtw_tweet/tree/master/v2/">github</a></p>
<p>we'll run hadoop in non distributed <a href="http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html#Local">standalone mode</a>. in this mode everything runs in a single jvm so it's nice and simple to dev against.</p>
<h3>step 1: extract the locations strings from the json stream</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>bzcat sample.bz2 | ./extract_locations.pl &gt; locations
</pre></div>

<p>using the awesome <a href="http://hadoop.apache.org/common/docs/current/streaming">hadoop streaming</a> interface it's not too different. this interface allows you to specify any app as the mapper or reducer. the main difference is that it works on directories not just files.</p>
<p>for the mapper we'll use exactly the same script as before, extract_locations.pl, and since there is no reduce component to this job we use an "identity" script, ie cat, as the reduce phase.</p>
<div class="pygments_murphy"><pre>mkdir json_stream
bzcat sample.bz2 | gzip - &gt; json_stream/input.gz
# hadoop supports gzip out of the box but not bzip2 :(
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./extract_locations.pl -reducer /bin/cat \
  -input json_stream -output locations
</pre></div>

<p>this gives us the locations in a single file locations/part-00000</p>
<h3>step 2: extract iphone and ut lat longs strings</h3>
<p>the second step is another text munging problem where we extract just the lat longs for the iPhone and UT tagged locations</p>
<p>ie for strings of the form</p>
<div class="pygments_murphy"><pre>iPhone: 21.320328,-157.877579
\u00dcT: 41.727877,-91.626323
</pre></div>

<p>we want to extract</p>
<div class="pygments_murphy"><pre>21.320328 -157.877579
41.727877 -91.626323
</pre></div>
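<p>a minimal sketch of this tag specific extraction, assuming the location strings still carry the json escaped \u00dc as shown above. the PREFIXES table and the name tagged_lat_long are hypothetical; the real extract_lat_longs_from_locations.rb may differ.</p>

```ruby
# hypothetical sketch; the real extract_lat_longs_from_locations.rb may differ
# prefixes as they appear in the location strings; single quotes keep the
# backslash escape literal, matching the raw json text
PREFIXES = {
  iphone: 'iPhone:',
  ut: '\u00dcT:'
}.freeze

# returns "lat long" for a location with the given tag, otherwise nil
def tagged_lat_long(location, tag)
  prefix = PREFIXES.fetch(tag)
  return nil unless location.start_with?(prefix)
  parts = location.delete_prefix(prefix).strip.split(',')
  return nil unless parts.size == 2
  # Float raises ArgumentError on anything that is not a number
  parts.map { |part| Float(part.strip) }.join(' ')
rescue ArgumentError
  nil
end
```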

<p>since this is just text manipulation we'll use streaming again</p>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations | ./extract_lat_longs_from_locations.rb iphone &gt; locations.iphone
cat locations | ./extract_lat_longs_from_locations.rb ut &gt; locations.ut
</pre></div>

<p>for hadoop streaming it's</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb iphone&#39; -reducer /bin/cat \
  -input locations -output locations.iphone
hadoop jar $HADOOP_STREAMING_JAR \
  -mapper &#39;./extract_lat_longs_from_locations.rb ut&#39; -reducer /bin/cat \
  -input locations -output locations.ut
</pre></div>

<h3>step 3: convert from lat long to mercator coordinates and aggregate into buckets for the heat map</h3>
<p>in v1 it was</p>
<div class="pygments_murphy"><pre>cat locations.{ut,iphone} | ./lat_long_to_merc.rb | ./bucket.rb | sort | uniq -c
</pre></div>

<p>this converts the three tuples { lat, long }</p>
<div class="pygments_murphy"><pre>35.670086 139.740766
-23.492420 -46.846916
35.657570 139.744858
</pre></div>

<p>into two tuples { frequency, left-offset, top-offset }</p>
<div class="pygments_murphy"><pre>1 0.36 0.45
2 0.88 0.28
</pre></div>

<p>the first two parts, converting to mercator (lat_long_to_merc.rb) and the bucketing (bucket.rb), i'll combine into one script.</p>
<div class="pygments_murphy"><pre>hadoop jar $HADOOP_STREAMING_JAR \
  -mapper ./lat_long_to_merc_and_bucket.rb -reducer /bin/cat \
  -input locations.iphone -input locations.ut -output x_y_points
</pre></div>

<p>the last part, the sort and uniq used to aggregate the data, corresponds to the shuffle and reduce stages of hadoop.</p>
<p>we could use the aggregate functionality of the streaming interface but i'm trying to learn more pig so we'll use that instead. <a href="http://hadoop.apache.org/pig/">pig</a> is a platform that translates scripts written in its pig latin query language into map reduce jobs. my main motivation for using it has been that it's great at doing joins, something i've found to be a <a href="http://matpalm.com/sip/take2_term_frequency.html#hadoop+part+2">big pain</a> to represent in plain map reduce jobs.</p>
<p>( note we didn't do the conversion to mercator and bucketing in pig, the arithmetic operations provided are a bit lacking. )</p>
<p>enter a pig shell running in standalone (ie non hadoop distributed) mode</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local
</pre></div>

<p>load the points</p>
<div class="pygments_murphy"><pre>grunt&gt; pts = load &#39;x_y_points/part-00000&#39; as (x:float, y:float);
grunt&gt; describe pts;
pts: {x: float,y: float}
grunt&gt; dump pts
(0.06F,0.32F)
(0.15F,0.27F)
(0.16F,0.27F)
...
</pre></div>

<p>group them together</p>
<div class="pygments_murphy"><pre>grunt&gt; buckets = group pts by (x,y);
grunt&gt; describe buckets;
buckets: {group: (x: float,y: float),pts: {x: float,y: float}}
grunt&gt; dump buckets;
((0.06F,0.32F),{(0.06F,0.32F)})
((0.15F,0.27F),{(0.15F,0.27F)})
((0.16F,0.27F),{(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F),(0.16F,0.27F)})
...
</pre></div>

<p>from the groups emit the size of each bucket, this corresponds to the frequency</p>
<div class="pygments_murphy"><pre>grunt&gt; freq = foreach buckets { generate group, SIZE(pts) as size; }
grunt&gt; describe freq;
freq: {group: (x: float,y: float),size: long}
grunt&gt; dump freq
((0.06F,0.32F),1L)
((0.15F,0.27F),1L)
((0.16F,0.27F),4L)
...
</pre></div>

<p>and based on the sizes we can evaluate the min and max frequencies which we'll use in the colour coding of the heat map</p>
<div class="pygments_murphy"><pre>grunt&gt; freqs = group freq all;
grunt&gt; describe freqs;
freqs: {group: chararray,freq: {group: (x: float,y: float),size: long}}
grunt&gt; dump freqs;
(all,{((0.06F,0.32F),1L),((0.15F,0.27F),1L), ... })
grunt&gt; store freq into &#39;freqs&#39;;
grunt&gt; min_max = foreach freqs { generate MAX(freq.size) as max, MIN(freq.size) as min; };
grunt&gt; describe min_max;
min_max: {max: long,min: long}
grunt&gt; dump min_max;
(7L,1L)
grunt&gt; store min_max into &#39;min_max&#39;;
bash&gt; cat freqs
(0.06,0.32)   1
(0.15,0.27)   1
(0.16,0.27)   4
</pre></div>

<p>these can all be run as one command</p>
<div class="pygments_murphy"><pre>bash&gt; pig -x local -f freqs.pig
</pre></div>

<p>we just need our final conversion to a javascript snippet to jam into a map page</p>
<div class="pygments_murphy"><pre>bash&gt; cat freqs | ./as_draw_square.rb 1 7
</pre></div>

<p>win!</p>
<p>to make things a little different let's use a bigger sample of 475e3 tweets from oct 13 07:00 to 20:00. this results in 10e3 iphone locations (7e3 unique) and 22e3 ut locations (15e3 unique)</p>
<p>lat longs are bucketed into only 478 pixels for the map</p>
<p>here's one plot with the raw numbers; highest freq is 9e3 in jakarta</p>
<h4>raw frequencies</h4>
<img class="size-full wp-image-111" title="raw frequencies" src="/blog/imgs/2009/10/raw1.jpg" alt="raw frequencies" width="682" height="529" />

<p>scaling down by log 10 gives a smoother map</p>
<h4>log10 frequencies</h4>
<img class="size-full wp-image-113" title="log10 frequencies" src="/blog/imgs/2009/10/log10.jpg" alt="log10 frequencies" width="682" height="529" />

<p>and here is a comparison of iphone vs ut. without knowing what ut is i can see it's not big in northern europe or japan but it's popular in indonesia.</p>
<h4>iphones</h4>
<img class="size-medium wp-image-120" title="iphones" src="/blog/imgs/2009/10/iphones-300x232.jpg" alt="iphones" width="300" height="232" />

<h4>ut</h4>
<img class="size-medium wp-image-119" title="ut" src="/blog/imgs/2009/10/ut-300x232.jpg" alt="ut" width="300" height="232" />

<p>next steps, animating based on the hour of the day</p>]]></content:encoded>
    </item>
    <item>
      <title>e11.0 tweets around the world</title>
      <link>http://matpalm.com/blog/2009/10/16/e11-0-tweets-around-the-world/</link>
      <category><![CDATA[e11]]></category>
      <category><![CDATA[maps]]></category>
      <category><![CDATA[twitter]]></category>
      <guid>http://matpalm.com/blog/?p=88</guid>
      <description>e11.0 tweets around the world</description>
      <content:encoded><![CDATA[<p>was discussing the <a href="http://apiwiki.twitter.com/Streaming-API-Documentation">streaming twitter api</a> with <a href="http://github.com/srbartlett">steve</a> and though i knew about the private firehose i didn't know there was a lighter weight public gardenhose interface!</p>
<p>since discovering this my pvr has basically been running
<pre>curl -u mat_kelcey:XXX http://stream.twitter.com/1/statuses/sample.json |
   gzip -9 - &gt; sample.json.gz</pre>
but what am i going to do with all this data?</p>
<p>while poking around i noticed there was a fair number of iPhone: and ÜT: lat long tagged locations (eg iPhone: 35.670086,139.740766) so as a first hack let's do some work extracting lat longs and displaying them as heat map points on a map.</p>
<p>all the code is on <a href="http://github.com/matpalm/rtw_tweet/tree/master/v1/">github</a></p>
<p>as a test then let's take a <a href="http://github.com/matpalm/rtw_tweet/tree/master/v1/">sample.bz2</a> of 1,300 tweets between Oct 14 22:01:41 and 22:03:24</p>
<p>from this let's just extract the location part of the tweet
<pre>bzcat sample.bz2 | ./extract_locations.pl &gt; locations</pre>
of these 1,300 there are 30 examples of iphone lat longs (eg iPhone: -23.492420,-46.846916)
<pre>cat locations | ./extract_lat_longs_from_locations.rb iphone &gt; locations.iphone</pre>
and 36 examples of ut lat longs (eg UT: 51.503212,5.478329)
<pre>cat locations | ./extract_lat_longs_from_locations.rb ut &gt; locations.ut</pre>
on a side note, does anyone have any idea what ÜT is ? a phone type, maybe a carrier?</p>
<p>we need to convert these lat/longs to x/y points so we can plot onto a map and we'll use the standard <a href="http://en.wikipedia.org/wiki/Mercator_projection">mercator projection</a> to do this
<pre>cat locations.{ut,iphone} | ./lat_long_to_merc.rb &gt; x_y_points</pre>
for the heat map we want to aggregate into buckets so the pixels are nice and big. finally we'll output some simple javascript we can cut and paste into some map html
<pre>cat x_y_points | ./bucket.rb | sort | uniq -c | ./as_draw_square.rb</pre>
the final result is <a href="http://www.matpalm.com/rtw_tweet/v1/map.html">this map</a> !</p>
<p>a good start. next to do the same over a much larger sample using hadoop streaming and pig and then work towards an animation by aggregating on time slices</p>]]></content:encoded>
    </item>
  </channel>
</rss>

