brain of mat kelcey
tokenising the visible english text of common crawl
December 10, 2011 at 04:00 PM | categories: common-crawl, nlp
Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github. The first thing was to get the data into a hadoop cluster. It's made up of 300,000 100MB gzipped arc files stored in S3. I wrote a dead simple distributed copy to do this. Only a few things of note about this job... The data in S3 is marked as requester pays which, even though it's a no-op if you're...
finding phrases with mutual information
November 15, 2011 at 11:00 PM | categories: nlp, phrase-extraction, collocations, mutual-information
continuing on with my series of mutual information experiments, how might we extend the technique to identify sequences longer than just two terms? one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea) so say we had the 6 term sentence 'i went to new york city'. it has 5 bigrams; ('i went', 'went to', 'to new', 'new york', 'york city'). running the mutual information algorithm over this might identify 'new york' as a bigram of interest. we can swap the two terms with a single token...
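the replace-and-repeat step can be sketched roughly as below. this is a minimal illustration, not the code from the post: the helper names `bigrams` and `merge_bigram` and the `_` joiner are assumptions made up for the example, and the bigram of interest is hard-coded rather than chosen by a mutual information score.

```python
def bigrams(tokens):
    """Return the adjacent token pairs of a sentence, in order."""
    return list(zip(tokens, tokens[1:]))


def merge_bigram(tokens, pair, joiner="_"):
    """Replace each occurrence of `pair` with one joined token (hypothetical helper)."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            # collapse the two terms into a single token and skip past both
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "i went to new york city".split()
first_pass = bigrams(sentence)                  # the 5 bigrams of the 6 term sentence
merged = merge_bigram(sentence, ("new", "york"))
second_pass = bigrams(merged)                   # now includes ('new_york', 'city')
```

running the scoring again over `second_pass` gives the joined token a chance to pair with its neighbour, which is how a three term phrase could surface on the next iteration.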
collocations in wikipedia, part 2
November 05, 2011 at 05:00 PM | categories: nlp, phrase-extraction, collocations
in my last post we went through mutual information as a way of finding collocations. the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000. why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases. if we just sort by the mutual information score we find that the top 250,000 all have the same score and correspond to bigrams that occur only once in the corpus (and whose terms only appear...
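a rough sketch of why once-only bigrams can tie at the top, assuming the score is pointwise mutual information with simple count-based probability estimates, and assuming a single corpus size N is used for both the unigram and bigram probabilities (the symbols c and N are introduced here for illustration):

```latex
I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
        = \log_2 \frac{c(x, y)/N}{\bigl(c(x)/N\bigr)\bigl(c(y)/N\bigr)}
        = \log_2 \frac{N \, c(x, y)}{c(x)\, c(y)}
```

in the case where c(x, y) = c(x) = c(y) = 1 this collapses to log2 N for every such bigram, so they all get the same score. since c(x, y) can never exceed c(x) or c(y), log2 N is also the largest value the score can take under these assumptions, which is why these cases would crowd out everything else when sorting by score alone.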
collocations in wikipedia, part 1
October 19, 2011 at 08:00 PM | categories: nlp, phrase-extraction, collocations
collocations are combinations of terms that occur together more frequently than you'd expect by chance. they can include proper noun phrases like 'Darth Vader', stock/colloquial phrases like 'flora and fauna' or 'old as the hills', and common adjective/noun pairs (notice how 'strong coffee' sounds ok but 'powerful coffee' doesn't?). let's go through a couple of techniques for finding collocations taken from the exceptional nlp text "foundations of statistical natural language processing" by manning and schutze. the first technique we'll try is mutual information; it's a way of scoring terms based on how often they appear together vs how often they appear separately. the intuition is that if...
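to make the together-vs-separately idea concrete, here is a minimal sketch of scoring bigrams with pointwise mutual information. it assumes simple count-based probability estimates over a toy token list; the function name `pmi_scores` and the example sentence are made up for illustration, and the post's own implementation may well differ.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent pair by log2( P(x, y) / (P(x) * P(y)) )."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), c_xy in bigram_counts.items():
        p_xy = c_xy / n_bigrams              # how often the pair appears together
        p_x = unigram_counts[x] / n_unigrams  # how often each term appears at all
        p_y = unigram_counts[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# toy corpus, purely illustrative
tokens = ("darth vader met the emperor and the emperor met "
          "darth vader near the ship").split()
scores = pmi_scores(tokens)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

in this toy corpus 'darth vader' scores higher than 'the emperor' because 'the' also turns up next to other terms, but note that pairs seen only once can score just as high, so some minimum frequency cutoff is usually wanted before trusting the ranking.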
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment