brain of mat kelcey
tokenising the visible english text of common crawl
December 10, 2011 at 04:00 PM | categories: common-crawl, nlp
Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github. The first thing was to get the data into a Hadoop cluster. It's made up of 300,000 100MB gzipped ARC files stored in S3. I wrote a dead simple distributed copy to do this. Only a few things of note about this job... The data in S3 is marked as requester pays which, even though it's a no-op if you're...
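To make the distributed copy idea concrete, here is a minimal sketch, not the code from the github repo. It assumes the work is split by dealing a manifest of S3 keys out to workers round-robin, and that each worker would pass `RequestPayer="requester"` on its GET (the parameter boto3's `get_object` accepts for requester-pays buckets). The bucket name, key names and the helpers `shard_manifest` and `request_args` are hypothetical, made up for illustration.

```python
from typing import Dict, Iterable, List


def shard_manifest(keys: Iterable[str], num_workers: int) -> List[List[str]]:
    """Deal keys out round-robin so shard sizes differ by at most one."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    for i, key in enumerate(keys):
        shards[i % num_workers].append(key)
    return shards


def request_args(bucket: str, key: str) -> Dict[str, str]:
    """Build keyword arguments for a requester-pays GET.

    Assumption: a worker would call something like
    boto3.client("s3").get_object(**request_args(bucket, key)).
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


if __name__ == "__main__":
    # Hypothetical key names; the real manifest would list the ARC files.
    keys = [f"hypothetical/prefix/{i}.arc.gz" for i in range(10)]
    for worker, shard in enumerate(shard_manifest(keys, 3)):
        print(worker, len(shard), request_args("hypothetical-bucket", shard[0]))
```

Round-robin is only one way to split the work. Since the files are all roughly the same size, it should keep the workers about evenly loaded without any coordination between them.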
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment