
brain of mat kelcey


an exercise in handling mislabelled training data

October 03, 2011 at 08:00 PM | categories: training, vowpal wabbit

as part of my diy twitter client project i've been using the twitter sample streams as a source of unlabelled data for some mutual information analysis. these streams are a great source of random tweets but include a lot of non-english content. extracting the english tweets would be pretty straightforward if the ['user']['lang'] field of a tweet was 100% representative of the tweet's language, but a lot of the time it's not; can we use these values at least as a starting point?

one way to see how consistent user_lang is with the tweet's actual language is to train a classifier...
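to make the "starting point" idea concrete, here's a rough sketch of turning tweets into vowpal wabbit training lines, using ['user']['lang'] as a noisy label. it assumes each tweet is already decoded json with a 'text' field; the function name tweet_to_vw_line and the whitespace tokenisation are made up for illustration and are not necessarily what the full post does.

```python
import json


def tweet_to_vw_line(tweet):
    """Turn one decoded tweet into a vowpal wabbit example line.

    The label is +1 when the user's profile language is 'en', else -1.
    This label is noisy: it describes the user, not the tweet.
    (Hypothetical helper, assumes the tweet has a 'text' field.)
    """
    label = 1 if tweet["user"]["lang"] == "en" else -1
    tokens = []
    for token in tweet["text"].lower().split():
        # ':' and '|' have special meaning in vw's input format, so drop them
        token = token.replace(":", "").replace("|", "")
        if token:
            tokens.append(token)
    return "{} | {}".format(label, " ".join(tokens))


if __name__ == "__main__":
    sample = '{"text": "Hello world: this is a test", "user": {"lang": "en"}}'
    print(tweet_to_vw_line(json.loads(sample)))
```

a classifier trained on lines like these can then be compared against the user_lang labels; the tweets where the two disagree are the likely mislabelled ones.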
