brain of mat kelcey
e11.2 aggregating tweets by time of day
October 24, 2009 at 01:02 PM | categories: e11, twitter, hadoop, pig | View Commentsfor v3 lets aggregate by time of the day, should make for an interesting animation
browsing the data there are lots of other lat longs in data, not just iPhone: and ÜT: there are also one tagged with Coppó:, Pre:, etc perhaps should just try to take anything that looks like a lat long.
furthermore lets switch to a bigger dataset again, 4.7e6 tweets from Oct 13 07:00 thru Oct 19 17:00,
i've been streaming all my tweets ( as previously discussed ) and been storing them in a directory json_stream
here are the steps...
1. extract locations
use a streaming script to take a tweet in json form and emit the tweet time and location string
export HADOOP_STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar hadoop jar $HADOOP_STREAMING_JAR \ -mapper ./extract_locations.rb -reducer /bin/cat \ -input json_stream -output locations
sample output (4.7e6 tuples) { time, location string }
Wed Oct 14 22:01:41 +0000 2009 iPhone: -23.492420,-46.846916 Wed Oct 14 22:01:41 +0000 2009 Ottawa Wed Oct 14 22:01:41 +0000 2009 DA HOOD Wed Oct 14 22:01:42 +0000 2009 Earth
2. pluck lat longs from locations
make another pass and extract possible lat lons from the location strings
hadoop jar $HADOOP_STREAMING_JAR \ -mapper ./extract_lat_longs_from_locations.rb -reducer /bin/cat \ -input locations -output lat_lons
sample output (reduces down to 320e3 data points) { time, lat, lon }
Wed Oct 14 22:01:41 +0000 2009 -23.49242 -46.846916 Wed Oct 14 22:05:25 +0000 2009 35.670086 139.740766 Wed Oct 14 22:11:35 +0000 2009 41.37731257 -74.68153942 Wed Oct 14 22:15:18 +0000 2009 51.503212 5.478329
3. bucket data into timeslices and points for a map
we need to project the times into 10min slots; ie 00:05 will be slot 0, 00:12 will be slot 1.
also use to project the lat lons to x and y coords (0->1) using a simple mercator projection
hadoop jar $HADOOP_STREAMING_JAR \ -mapper ./lat_long_to_merc_and_bucket.rb -reducer /bin/cat \ -cmdenv BUCKET_SIZE=0.005 \ -input lat_lons -output x_y_points
sample output { timeslice, normalised x position, normalised y position }
122 0.48 0.205 122 0.295 0.26 122 0.29 0.26 123 0.265 0.265
as a slight digression before we move onto aggregating per timeslice here's a pic of all 320e3 tweets on a heatmap.
some interesting noise on the greenwich meridian, must be incorrectly identified lat lons during the ./extract_lat_longs_from_locations.rb step.
log10 tweet location (click for a hires version)
4. aggregate (x,y) pairs per timeslice
next we aggreate, per timeslice, the frequency of points each x,y point. we'll do this with a pig script, aggregate_per_timeslice.pig
# aggregating per timeslice pts = load 'x_y_points/part-00000' as (timeslice:int, x:float, y:float); pts2 = group pts by (timeslice,x,y); pts3 = foreach pts2 generate $0, COUNT($1) ; pts4 = foreach pts3 generate $0.$0, $0.$1, $0.$2, $1 as freq; pts5 = order pts4 by timeslice; store pts5 into 'aggregated_freqs';
results in the tuples in 'aggregated_freqs' { timeslice, normalised x position, normalised y position, frequency }
0 0.0 0.32 1 0 0.06 0.325 9 0 0.065 0.33 1 0 0.08 0.17 2 0 0.155 0.225 8
we need to normalise each frequency value for drawing on the map and would have like to have done this in pig also but turns out there isn't a log function in v0.3 of pig (??)
will have to do scaling when generating the images. isn't such a big deal since the dataset is quite small at this stage but was trying to use this whole thing as an excuse to learn pig :(
5. take aggregated_freqs and make 144 heat map images
use a simple script to read through the aggregated_freqs and generate a heap map for each frame
heat_maps.rb aggregated_freqs 0.005 frames
6. convert to animation
next bundle stills into an animation and upload to youtube
mencoder mencoder "mf://frames/*" -mf fps=25 -o rtw_tweet_v3.avi -ovc x264 -x264encopts bitrate=750
7. conclusions
- didn't really end up using hadoop's power that much; streaming jobs that use just cat as a reducer as just a parallel way of doing 1:1 string mapping
- aggregation was really easy in pig but lack of Log function is annoying; could have written a UDF, and there probably already is one but i couldn't find it
- this visualisation came out pretty lame; funny to see how the really swish visualisations rely far more on pretty colours and smooth lines than the data itself. there are a bundle of things i could do with this one but it's time to move on to something else.
