brain of mat kelcey
an exercise in handling mislabelled training data
October 03, 2011 at 08:00 PM | categories: , training, vowpal wabbit | View Comments
as part of my diy twitter client project i've been using the twitter sample streams as a sourceof unlabelled data for some mutual information analysis. these streams are a great source of random tweets but include a lot of non english content. extracting the english tweets would be pretty straight forward if the ['user']['lang'] field of a tweet was 100% representative of the tweet's language but a lot of the times it's not; can we usethese values at least as a starting point?one approach to seeing how consistent the relationship between user_lang and the tweet language is totrain a classifier...
old projects...
- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment