
brain of mat kelcey


finding phrases with mutual information

November 15, 2011 at 11:00 PM | categories: nlp, phrase-extraction, collocations, mutual-information | View Comments

continuing on with my series of mutual information experiments, how might we extend the technique to identify sequences longer than just two terms? one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea) so say we had the 6 term sentence 'i went to new york city'. it has 5 bigrams: ('i went', 'went to', 'to new', 'new york', 'york city'). running the mutual information algorithm over this might identify 'new york' as a bigram of interest. we can swap the two terms with a single token...
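a minimal sketch of the merge step described above, under some assumptions: the helper names `bigrams` and `merge_bigram` and the underscore-joined token `new_york` are illustrative choices, not taken from the post, and the scoring pass that picks the bigram is left out.

```python
def bigrams(tokens):
    """Return the adjacent term pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def merge_bigram(tokens, pair, joiner="_"):
    """Replace each occurrence of `pair` with one joined token.

    The joiner is an assumption; any marker that cannot clash
    with a real term would do.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # skip both terms of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "i went to new york city".split()
print(bigrams(tokens))          # the 5 bigrams of the 6 term sentence

merged = merge_bigram(tokens, ("new", "york"))
print(merged)                   # 'new' and 'york' are now one token
print(bigrams(merged))          # a second pass can now pair 'new_york' with 'city'
```

rerunning the bigram scoring over `merged` is what lets a three term phrase surface on the next pass.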

collocations in wikipedia, part 2

November 05, 2011 at 05:00 PM | categories: nlp, phrase-extraction, collocations | View Comments

in my last post we went through mutual information as a way of finding collocations. the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000. why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases. if we sort purely by the mutual information score we find that the top 250,000 all have the same score and correspond to bigrams that occur only once in the corpus (and whose terms only appear...
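a small sketch of the low support problem and a frequency cutoff, assuming the score is pointwise mutual information, log2(P(x,y) / (P(x) P(y))). the corpus is a made-up toy, `ranked_bigrams` and `min_count` are hypothetical names, and the cutoff of 5 is scaled down to suit the toy data.

```python
import math
from collections import Counter


def ranked_bigrams(tokens, min_count=1):
    """Score bigrams by pointwise mutual information, highest first.

    Bigrams seen fewer than `min_count` times are dropped.
    """
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (x, y), count in pairs.items():
        if count < min_count:
            continue
        p_xy = count / (n - 1)  # n - 1 adjacent pairs in n tokens
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scored.append((math.log2(p_xy / (p_x * p_y)), count, (x, y)))
    return sorted(scored, reverse=True)


# toy corpus: a repeated sentence plus one pair of terms seen exactly once
tokens = ("we like new york and we like strong coffee and " * 20).split()
tokens += ["zyzzyva", "quokka"]

unfiltered = ranked_bigrams(tokens)
filtered = ranked_bigrams(tokens, min_count=5)

print(unfiltered[0])  # the once-only pair takes the top spot
print(filtered[0])    # with the cutoff, a repeated phrase leads instead
```

a pair whose terms each appear once, and only together, gets close to the largest score the corpus allows, which is why sorting on the raw score fills the top of the list with such pairs.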

collocations in wikipedia, part 1

October 19, 2011 at 08:00 PM | categories: nlp, phrase-extraction, collocations | View Comments

collocations are combinations of terms that occur together more frequently than you'd expect by chance. they can include proper noun phrases like 'Darth Vader'; stock/colloquial phrases like 'flora and fauna' or 'old as the hills'; and common adjective/noun pairs (notice how 'strong coffee' sounds ok but 'powerful coffee' doesn't?). let's go through a couple of techniques for finding collocations taken from the exceptional nlp text "foundations of statistical natural language processing" by manning and schutze. the first technique we'll try is mutual information; it's a way of scoring terms based on how often they appear together vs how often they appear separately. the intuition is that if...
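a rough sketch of "together vs separately" as a score, assuming the pointwise form log2(P(x,y) / (P(x) P(y))). the function name `pmi` and all the counts are made up for illustration, and a single `total` is used for both term and bigram probabilities as a simplification.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a bigram from raw counts."""
    p_xy = count_xy / total  # how often the pair appears together
    p_x = count_x / total    # how often each term appears at all
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# two terms that nearly always appear as a pair: strongly positive score
together = pmi(count_xy=50, count_x=60, count_y=70, total=100_000)

# two terms that co-occur exactly as often as chance predicts: score near zero
independent = pmi(count_xy=1, count_x=1000, count_y=100, total=100_000)

print(together, independent)
```

a score near zero means the pair shows up about as often as its terms' individual frequencies predict, while a large positive score marks a candidate collocation.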

old projects...