brain of mat kelcey


tokenising the visible english text of common crawl

December 10, 2011 at 04:00 PM | categories: common-crawl, nlp

Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github. The first thing was to get the data into a hadoop cluster. It's made up of 300,000 100MB gzipped arc files stored in S3. I wrote a dead simple distributed copy to do this. Only a few things of note about this job... The data in S3 is marked as requester pays which, even though it's a no-op if you're...
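To give a rough feel for what "extract and tokenise the visible text" means, here is a minimal sketch using only the Python standard library. It is not the code from the github repo, and every name in it (`VisibleTextExtractor`, `HIDDEN_TAGS`, `visible_tokens`) is made up for illustration. It assumes that "visible" just means text outside tags like script and style, and that a token is a lowercased run of letters or digits, optionally with an internal apostrophe.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text outside of tags whose contents a browser does not render."""

    # Assumption: this set of tags is a hypothetical choice, not taken from the post.
    HIDDEN_TAGS = {"script", "style", "head", "title", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []       # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.chunks.append(data)


def visible_tokens(html):
    """Return lowercased word tokens from the visible text of an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    text = " ".join(parser.chunks).lower()
    # Assumption: a token is letters/digits, allowing internal apostrophes.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text)


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{color:red}</style></head>"
        "<body><p>It's a <b>small</b> test.</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(visible_tokens(page))
```

Joining the chunks with a space keeps words in neighbouring elements apart, at the cost of splitting a word that has markup in the middle of it. Real crawl data is much messier than this (broken markup, mixed encodings, non-English pages), so treat it as a toy.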