<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>pseudocounts and the good-turing estimation (part1)</title>
      <link>http://matpalm.com/blog/2011/04/03/pseudocounts-part-1/</link>
      <category><![CDATA[pseudocounts]]></category>
      <category><![CDATA[statistics]]></category>
      <guid>http://matpalm.com/blog/2011/04/03/pseudocounts-part-1/</guid>
      <description>pseudocounts and the good-turing estimation (part1)</description>
      <content:encoded><![CDATA[<h2>beer</h2>
<p>say we are running the bar at a sold-out <a href="http://en.wikipedia.org/wiki/Bad_religion">bad religion</a> concert. the bar serves beer, scotch and water and we decide to record orders over the night so that we know how much to order for tomorrow's gig...</p>
<table class="data">
<tr><td>drink</td><td>#sales</td></tr> 
<tr><td>beer</td><td>1000</td></tr> 
<tr><td>scotch</td><td>300</td></tr> 
<tr><td>water</td><td>200</td></tr> 
</table>

<p>using these numbers we can predict a number of things..</p>
<p>what is the chance the next person will order a beer?<br/>
it's a pretty simple probability; 1000 beers / 1500 total drinks = 0.67 or 67%</p>
<p>what is the chance the next person will order a water?<br/>
also straightforward; 200 waters / 1500 total drinks = 0.13 or 13%</p>
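<p>as a rough sketch, here is the same arithmetic in R (<code>bar_sales</code> and <code>bar_probs</code> are just illustrative names)...</p>
<div class="pygments_murphy"><pre># drinks sold at the bar, taken from the table above
bar_sales = c(beer=1000, scotch=300, water=200)
# estimate each probability as count / total
bar_probs = bar_sales / sum(bar_sales)
# rounds to 0.67 for beer, 0.20 for scotch and 0.13 for water
print(round(bar_probs, 2))
</pre></div>
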
<h2>t-shirts</h2>
<p>now say we run the t-shirt stand at the same concert....</p>
<p>instead of only selling 3 items (like at the bar) we sell 20 different types of t-shirts. once again we record orders over the night...</p>
<table class="data">
<tr><td>t-shirt</td><td>#sales</td><td>t-shirt</td><td>#sales</td></tr> 
<tr><td>br tour</td><td>15</td><td>pennywise</td><td>3</td></tr>
<tr><td>br logo1</td><td>15</td><td>strung out</td><td>3</td></tr>
<tr><td>br album art 1</td><td>10</td><td>propagandhi</td><td>3</td></tr>
<tr><td>br album art 3</td><td>10</td><td>bouncing souls</td><td>1</td></tr>
<tr><td>nofx logo1</td><td>5</td><td>the vandals</td><td>1</td></tr>
<tr><td>nofx logo2</td><td>5</td><td>dead kennedys</td><td>1</td></tr>
<tr><td>lagwagon</td><td>4</td><td>misfits</td><td>1</td></tr>
<tr><td>frenzal rhomb</td><td>4</td><td>the offspring</td><td>0</td></tr>
<tr><td>rancid</td><td>4</td><td>the ramones</td><td>0</td></tr>
<tr><td>descendants</td><td>3</td><td>mxpx</td><td>0</td></tr>
</table>

<p>we can ask similar questions again regarding the chance of people buying a particular t-shirt</p>
<p>what's the chance the next person to buy a t-shirt wants the official tour t-shirt?<br/>
15 tour t-shirts sold / 88 sold in total = 0.170 or 17.0%</p>
<p>what's the chance the next person to buy a t-shirt wants a descendants t-shirt?<br/>
3 descendants t-shirts sold / 88 sold in total = 0.034 or 3.4%</p>
<p>what's the chance the next person to buy a t-shirt wants an offspring t-shirt?<br/>
0 offspring t-shirts sold / 88 sold in total = 0 or 0%</p>
<p>if you're like me then the last one "feels" wrong. even though we haven't seen a single purchase of that t-shirt
it seems a bit rough to say there is <em>no</em> chance of someone buying one. 
this illustrates one of the problems of dealing with
<a href="http://en.wikipedia.org/wiki/Prior_probability">prior probabilities</a></p>
<p>any system using a product of probabilities, such as the modeling of "independent" events in naive bayes, suffers badly from these zero probabilities. i've discussed the problem a few times before in previous experiments 
(such as <a href="../rss.feed/p3/">this one on naive bayes</a> and <a href="../semi_supervised_naive_bayes/semi_supervised_bayes.html">this one on semi supervised bayes</a>)
and the approach i've always used is the simple 
<a href="http://en.wikipedia.org/wiki/Rule_of_succession">rule of succession</a> where we avoid
the zero problem by adding one to the frequency of each event.</p>
<p>for reference here are the probabilities per t-shirt without adjustment...</p>
<div class="pygments_murphy"><pre>R&gt; sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
 [1] 15 15 10 10  5  5  4  4  4  3  3  3  3  1  1  1  1  0  0  0
R&gt; simple_probs = sales / sum(sales)
 [1] 0.17045455 0.17045455 0.11363636 0.11363636 0.05681818 0.05681818
 [7] 0.04545455 0.04545455 0.04545455 0.03409091 0.03409091 0.03409091
[13] 0.03409091 0.01136364 0.01136364 0.01136364 0.01136364 0.00000000
[19] 0.00000000 0.00000000
</pre></div>

<p>... and here are the values for the succession rule case</p>
<div class="pygments_murphy"><pre>R&gt; sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
 [1] 15 15 10 10  5  5  4  4  4  3  3  3  3  1  1  1  1  0  0  0
R&gt; sales_plus_one = sales + 1
 [1] 16 16 11 11  6  6  5  5  5  4  4  4  4  2  2  2  2  1  1  1
R&gt; smooth_probs = sales_plus_one / sum(sales_plus_one)
 [1] 0.14814815 0.14814815 0.10185185 0.10185185 0.05555556 0.05555556
 [7] 0.04629630 0.04629630 0.04629630 0.03703704 0.03703704 0.03703704
[13] 0.03703704 0.01851852 0.01851852 0.01851852 0.01851852 0.00925926
[19] 0.00925926 0.00925926
</pre></div>

<p>no more zeros! yay! but, unfortunately, at the cost of the accuracy of the other values.</p>
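<p>to see why the zeros hurt so much once probabilities get multiplied together, here's a minimal sketch. the two item "basket" is a made up example (it's not from the sales data) and treating the two purchases as independent is an assumption, in the spirit of naive bayes.</p>
<div class="pygments_murphy"><pre>sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
simple_probs = sales / sum(sales)
smooth_probs = (sales + 1) / sum(sales + 1)
# hypothetical basket: a tour t-shirt (position 1) and an offspring t-shirt (position 18)
basket = c(1, 18)
# unadjusted: the single zero wipes out the whole product
print(prod(simple_probs[basket]))
# rule of succession: small but no longer zero
print(prod(smooth_probs[basket]))
</pre></div>
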
<p>it's always worked for me in the past (well, at least better than having the zeros) but it's always felt wrong too. the other day, though, i finally found another approach that seems a lot more statistically sound.</p>
<p>it's called <a href="http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation">good-turing estimation</a> 
and it was developed as part of <a href="http://en.wikipedia.org/wiki/Alan_Turing">turing's</a> work at bletchley 
park (so it must be awesome). a decent implementation is explained in 
<a href="http://www.grsampson.net/AGtf1.html">this paper</a> by geoffrey sampson (it's somewhat more 
complex than adding 1)</p>
<p>it works from the frequency of frequencies and redistributes the probabilities to include a
special allocation that we should spread over items that have never been seen before. </p>
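<p>the "frequency of frequencies" is just a count of how many items share each sales count. a quick sketch using the t-shirt sales vector (<code>freq_of_freq</code> is an illustrative name, not something from the paper)...</p>
<div class="pygments_murphy"><pre>sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
# names are the sales counts, values are the number of t-shirts with that count
freq_of_freq = table(sales)
print(freq_of_freq)
</pre></div>
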
<p>the following table shows the frequencies, the original probability, the probability adjusted using the rule of
succession and the probabilities as redistributed using good turing estimation.</p>
<table class="data">
<tr>
 <td>freq</td>
 <td>freq of freq</td>
 <td>original<br/>prob</td>
 <td>succession<br/>prob</td>
 <td>good turing<br/>prob</td>
 <td>description</td>
</tr>
<tr><td>15</td><td>2 </td><td>0.170 </td><td>0.148 </td><td>0.160 </td><td>2 t-shirts sold 15 times</td></tr>
<tr><td>10</td><td>2 </td><td>0.113 </td><td>0.101 </td><td>0.107 </td><td>2 t-shirts sold 10 times</td></tr>
<tr><td>5 </td><td>2 </td><td>0.056 </td><td>0.055 </td><td>0.054 </td><td>2 t-shirts sold 5 times</td></tr>
<tr><td>4 </td><td>3 </td><td>0.045 </td><td>0.046 </td><td>0.043 </td><td>3 t-shirts sold 4 times</td></tr>
<tr><td>3 </td><td>4 </td><td>0.034 </td><td>0.037 </td><td>0.033 </td><td>4 t-shirts sold 3 times</td></tr>
<tr><td>1 </td><td>4 </td><td>0.011 </td><td>0.018 </td><td>0.011 </td><td>4 t-shirts sold once</td></tr>
<tr><td>0 </td><td>3 </td><td>0.000 </td><td>0.009 </td><td>0.015 </td><td>3 t-shirts didn't sell</td></tr>
</table>

<p>and here's a graph of the same thing.</p>
<img src="/blog/imgs/tshirts.png"/>

<p>some observations...</p>
<ul>
<li>the rule of succession is just smoothing really and drags the higher probabilities down in response to pulling the lower probabilities up</li>
<li>the good turing estimation is closer to the original (unadjusted) value for the high frequency cases</li>
<li>the good turing estimate for the zero case is quite a bit higher than the rule of succession estimate</li>
<li>and most interesting of all, the good turing estimate for the freq 0 is higher than the estimate for freq 1.</li>
</ul>
<p>the last point in particular i think is really interesting. the good turing algorithm actually gives a total estimate for the zero probability cases (in this example it gave 0.045) and it's up to the user to distribute it among the actual examples (in this example
there were 3 cases so i just divided 0.045 by 3).</p>
<p>if there had been 4 types of t-shirts that hadn't sold the estimate for each of them would have been 0.011, like the 4 t-shirts that sold once.</p>
<p>if there had only been 1 type of t-shirt that hadn't had any sales we'd have to allocate the entire 0.045 to it. in effect the algorithm says it expects that t-shirt to be more likely to sell than the 4 types of t-shirts that had 3 sales each (the 0.033 probability case). </p>
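<p>the 0.045 is consistent with the usual good turing estimate of the total unseen mass: the number of items seen exactly once divided by the total number of observations (here 4 / 88). assuming that is where the number comes from, a rough sketch of sharing it out looks like this (<code>n1</code> and <code>p0</code> are illustrative names, and this skips the smoothing the full method applies to the other counts).</p>
<div class="pygments_murphy"><pre>sales = rep(c(15,10,5,4,3,1,0), c(2,2,2,3,4,4,3))
# number of t-shirt types that sold exactly once
n1 = sum(sales == 1)
# total probability set aside for t-shirts with no sales
p0 = n1 / sum(sales)
# per t-shirt share if 3, 4 or 1 types had gone unsold: 0.015, 0.011 and 0.045
print(round(p0 / c(3, 4, 1), 3))
</pre></div>
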
<p>an interesting result, not sure what intuition to take away from it.... </p>
<p>now this is all good, but i don't actually run the bar at a bad religion concert (or the t-shirt stand) 
and i'm really interested in how this applies to text processing, especially in the area of classification.</p>
<p>so my question is <em>"is the extra computation required for the good-turing calculation worth it?"</em></p>
<p>results coming in part2. work in progress code on <a href="https://github.com/matpalm/pseudocounts">github</a></p>]]></content:encoded>
    </item>
    <item>
      <title>e12.1 statistical synonyms</title>
      <link>http://matpalm.com/blog/2010/01/23/e12-1-statistical-synonyms/</link>
      <category><![CDATA[e12]]></category>
      <category><![CDATA[statistics]]></category>
      <guid>http://matpalm.com/blog/?p=250</guid>
      <description>e12.1 statistical synonyms</description>
      <content:encoded><![CDATA[<p>i've had an idea brewing in my head for a while now, seeded by <a href="http://www.youtube.com/watch?v=nU8DcBF-qo4">a great talk by peter norvig</a> about statistical approaches to finding patterns in data.</p>
<p>one thing he alludes to is the generation of synonyms based on n-gram models.</p>
<p>the basic intuition is this; if a corpus contains occurrences of the phrases 'a x b' and 'a y b' then to some degree x and y are synonymous.</p>
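<p>one crude way to make this concrete is to count how many 'a ? b' contexts two words share. the sketch below does that over a made up toy corpus; the shared context count is just one possible measure (it's not something from the talk) and all the names are illustrative.</p>
<div class="pygments_murphy"><pre># toy corpus, made up for illustration
corpus = c("the big dog barked",
           "the large dog barked",
           "the big cat slept",
           "the large cat slept",
           "the old dog slept")

# split each sentence into tokens and collect every trigram 'a x b'
trigrams = do.call(rbind, lapply(strsplit(corpus, " "), function(tokens) {
  n = length(tokens)
  data.frame(a = tokens[1:(n-2)], x = tokens[2:(n-1)], b = tokens[3:n],
             stringsAsFactors = FALSE)
}))

# the context of a middle word is the pair of words around it
trigrams$context = paste(trigrams$a, "?", trigrams$b)

# word by context matrix, 1 if the word was seen in that context
counts = unclass(table(trigrams$x, trigrams$context))
seen = (counts != 0) * 1

# word by word matrix, each cell is the number of contexts the two words share
shared = seen %*% t(seen)
print(shared["big", "large"])
print(shared["big", "old"])
print(shared["dog", "cat"])
</pre></div>
<p>here 'big' and 'large' share both of their contexts, 'big' and 'old' share only one, and 'dog' and 'cat' share none.</p>
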
<p>the question becomes how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, 'a x b', 'a y b', 'a ? b' in the corpus? what else can we take into account?</p>]]></content:encoded>
    </item>
    <item>
      <title>simple statistics with R</title>
      <link>http://matpalm.com/blog/2009/10/03/simple-statistics-with-r/</link>
      <category><![CDATA[statistics]]></category>
      <category><![CDATA[r]]></category>
      <category><![CDATA[language]]></category>
      <guid>http://matpalm.com/blog/?p=77</guid>
      <description>simple statistics with R</description>
      <content:encoded><![CDATA[<p>i'm learning a new statistics language called R and it's pretty cool.</p>
<p>make a vector ...</p>
<table class="pygments_murphytable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2</pre></div></td><td class="code"><div class="pygments_murphy"><pre><span class="o">&gt;</span> c<span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">8</span><span class="p">)</span>
 <span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">3</span> <span class="m">1</span> <span class="m">4</span> <span class="m">1</span> <span class="m">5</span> <span class="m">9</span> <span class="m">2</span> <span class="m">6</span> <span class="m">5</span> <span class="m">3</span> <span class="m">5</span> <span class="m">8</span>
</pre></div>
</td></tr></table>

<p>turn it into a frequency table ...</p>
<table class="pygments_murphytable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3</pre></div></td><td class="code"><div class="pygments_murphy"><pre><span class="o">&gt;</span> table<span class="p">(</span>c<span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">8</span><span class="p">))</span>
<span class="m">1</span> <span class="m">2</span> <span class="m">3</span> <span class="m">4</span> <span class="m">5</span> <span class="m">6</span> <span class="m">8</span> <span class="m">9</span>
<span class="m">2</span> <span class="m">1</span> <span class="m">2</span> <span class="m">1</span> <span class="m">3</span> <span class="m">1</span> <span class="m">1</span> <span class="m">1</span>
</pre></div>
</td></tr></table>

<p>sort by frequency ...</p>
<table class="pygments_murphytable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3</pre></div></td><td class="code"><div class="pygments_murphy"><pre><span class="o">&gt;</span> sort<span class="p">(</span>table<span class="p">(</span>c<span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">8</span><span class="p">)))</span>
<span class="m">2</span> <span class="m">4</span> <span class="m">6</span> <span class="m">8</span> <span class="m">9</span> <span class="m">1</span> <span class="m">3</span> <span class="m">5</span>
<span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="m">2</span> <span class="m">2</span> <span class="m">3</span>
</pre></div>
</td></tr></table>

<p>and plot!</p>
<table class="pygments_murphytable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="pygments_murphy"><pre><span class="o">&gt;</span> barplot<span class="p">(</span>sort<span class="p">(</span>table<span class="p">(</span>c<span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">8</span><span class="p">))))</span>
</pre></div>
</td></tr></table>

<img title="Rplot" src="/blog/imgs/2009/10/Rplot.png" alt="Rplot" width="480" height="480" />

<p>so simple!</p>]]></content:encoded>
    </item>
    <item>
      <title>do a degree via youtube</title>
      <link>http://matpalm.com/blog/2009/10/01/do-a-degree-via-youtube/</link>
      <category><![CDATA[lectures]]></category>
      <category><![CDATA[statistics]]></category>
      <category><![CDATA[stanford]]></category>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/?p=75</guid>
      <description>do a degree via youtube</description>
      <content:encoded><![CDATA[<p>i'm amazed by how much great content is on youtube, how could you NOT learn something!?</p>
<p><a href="http://www.youtube.com/view_play_list?p=993FF1801B5AAB4D&amp;search_query=statistical+aspects+of+data+mining+stats+202">13 x 1hr Statistical Aspects of Data Mining (Stats 202)</a></p>
<p><a href="http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599&amp;search_query=machine+learning">20 x 1hr Machine Learning</a></p>]]></content:encoded>
    </item>
  </channel>
</rss>

