<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>dimensionality reduction using random projections.</title>
      <link>http://matpalm.com/blog/2011/05/10/random-projections/</link>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/2011/05/10/random-projections/</guid>
      <description>dimensionality reduction using random projections.</description>
      <content:encoded><![CDATA[<p>previously i've discussed <a href="http://matpalm.com/lsa_via_svd/intro.html">dimensionality reduction using SVD and PCA</a> but another interesting technique is using a random projection.</p>
<p>in a random projection we project A (a NxM matrix) to A' (a NxO matrix, O &lt; M) by the transform AP=A' where P is a MxO matrix with random values.<br/>
( well not totally random, each column must have unit length (ie the squares of the entries in each column must add to 1) )</p>
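<p>a minimal sketch of the transform AP=A' in plain python. it assumes the entries of P are drawn from a gaussian before each column is scaled to unit length (the distribution is an assumption, any random draw could be swapped in) and the function names are made up for illustration.</p>
<pre>
```python
import math
import random


def random_projection_matrix(m, o, seed=0):
    """Build an m x o matrix of gaussian values with each column scaled to unit length."""
    rng = random.Random(seed)
    # draw the raw entries one column at a time
    columns = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(o)]
    unit_columns = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        unit_columns.append([v / norm for v in col])
    # transpose so the result is indexed as p[row][column]
    return [[unit_columns[j][i] for j in range(o)] for i in range(m)]


def project(a, p):
    """Compute A' = AP for an n x m matrix a and an m x o matrix p."""
    o = len(p[0])
    return [[sum(row[i] * p[i][j] for i in range(len(row))) for j in range(o)]
            for row in a]


a = [[2.0, 2.0, 2.0], [8.0, 8.0, 8.0]]           # N=2 points in M=3 dimensions
p = random_projection_matrix(m=3, o=2, seed=42)  # M=3 down to O=2
a_prime = project(a, p)                          # N=2 points in O=2 dimensions
```
</pre>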
<p>though the results of this reduction are generally not as good as the results from SVD or PCA it has two huge benefits</p>
<ul>
<li>can be done <em>without</em> needing to hold P in memory (since its entries can be regenerated whenever they're needed using a seeded RNG)</li>
<li>and more importantly it's <em>really</em> fast </li>
</ul>
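<p>the "never hold P in memory" idea might look something like the sketch below. each column of P is regenerated from the seed whenever it's needed: once to measure its length and once more for the dot product. the per-column seeding scheme and the gaussian entries are assumptions for illustration only.</p>
<pre>
```python
import math
import random


def column_entries(seed, j, m):
    """Regenerate the m raw entries of column j of P from the seed alone."""
    rng = random.Random(seed * 1000003 + j)  # hypothetical per-column seeding scheme
    for _ in range(m):
        yield rng.gauss(0.0, 1.0)


def project_row(row, o, seed=0):
    """Project one row of A down to o dimensions without ever storing P."""
    m = len(row)
    out = []
    for j in range(o):
        # first pass over the column: its length, needed for normalising
        norm = math.sqrt(sum(v * v for v in column_entries(seed, j, m)))
        # second pass: the same entries again, this time for the dot product
        dot = sum(x * v for x, v in zip(row, column_entries(seed, j, m)))
        out.append(dot / norm)
    return out


row = [2.0] * 10                           # one 10d point
reduced = project_row(row, o=2, seed=42)   # the same point in 2d
```
</pre>
<p>the trade is memory for time; every column is generated twice per row, which is only worth it when P is too big to keep around.</p>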
<p>how good is it compared to PCA i wonder? let's have a play around...</p>
<p>consider the 2d dataset of two clear clusters of points around (2,2) and (8,8)</p>
<img src="/blog/imgs/2011/05/2d_data.png"/>

<p>the following shows 5 random projections to 1d compared to a 1d PCA reduction (done using <a href="http://www.cs.waikato.ac.nz/ml/weka/">weka</a>) </p>
<img src="/blog/imgs/2011/05/2d_to_1d_projections.png"/>

<p>there was a clear separation between the 2 classes in 2d and it's retained in 3 out of the 5 projections even though we're using <em>half</em> the space going from 2d to 1d.<br/>
( the 2nd random projection actually looks very similar to PCA )</p>
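<p>a rough way to rerun this kind of experiment without plots: generate the two clusters, project them onto 5 random unit directions and count how often the clusters stay apart on the line. the cluster spread, the number of points and the "no overlap" test are all assumptions (the original data isn't given), so the count won't necessarily match the 3 out of 5 above.</p>
<pre>
```python
import math
import random


def make_clusters(dims, n_per_cluster, rng, spread=1.0):
    """Gaussian blobs around (2,...,2) and (8,...,8); spread is an assumed value."""
    low = [[rng.gauss(2.0, spread) for _ in range(dims)] for _ in range(n_per_cluster)]
    high = [[rng.gauss(8.0, spread) for _ in range(dims)] for _ in range(n_per_cluster)]
    return low, high


def random_unit_vector(dims, rng):
    """A random direction, ie a single unit length column of P."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_1d(points, direction):
    """Dot each point with the direction to get its 1d coordinate."""
    return [sum(x * d for x, d in zip(p, direction)) for p in points]


def separated(low_1d, high_1d):
    """True when the two projected clusters do not overlap on the line."""
    return min(high_1d) > max(low_1d) or min(low_1d) > max(high_1d)


rng = random.Random(0)
low, high = make_clusters(dims=2, n_per_cluster=50, rng=rng)
kept = 0
for run in range(5):
    direction = random_unit_vector(2, rng)
    if separated(project_1d(low, direction), project_1d(high, direction)):
        kept += 1
print("projections that kept the clusters apart:", kept, "of 5")
```
</pre>
<p>changing dims to 10 gives the higher dimensional version of the same experiment.</p>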
<p>what about some higher dimensions?</p>
<p>let's generate the same two clusters but in 10d; ie around the points (2,2,2,2,2,2,2,2,2,2) and (8,8,8,8,8,8,8,8,8,8)</p>
<p>though we can't plot this easily there are 2 useful visualisations of the data</p>
<p>a scatterplot, which is pretty uninteresting actually...</p>
<img src="/blog/imgs/2011/05/10d_scatterplot.png"/>

<p>or a 2d tour through the 10d space</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/QrJQDSFTg-k?hl=en&fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/QrJQDSFTg-k?hl=en&fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></p>
<p>(check out <a href="http://vimeo.com/12292239">my screencast on ggobi</a> if you're after a better idea of what this tour represents)</p>
<p>let's compare 5 random projections to PCA when projecting from 10d to 1d<br/>
this time not so good... only one projection gets the clusters separated, and only just.</p>
<img src="/blog/imgs/2011/05/10d_to_1d_projections.png"/>

<p>what about projecting to 2d? </p>
<table>
<tr><td>PCA</td><td>run1</td><td>run2</td><td>run3</td><td>run4</td><td>run5</td></tr>
<tr><td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_PCA.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run1.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run2.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run3.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run4.png"/></td>
    <td><img width="190" src="/blog/imgs/2011/05/10d_to_2d_run5.png"/></td></tr>
</table>

<p>PCA is by far the cleanest with the x-axis representing the clusters and the y-axis representing the in cluster variance. <br/>
it clearly shows how the dataset can be projected to 1d, we only needed the first (x-axis) principal component.</p>
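<p>for comparison, here's a small sketch of what "only needing the first principal component" means. it estimates that component with power iteration on the covariance matrix. this is a stand-in for illustration, not how weka computes PCA, and the toy points are made up.</p>
<pre>
```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the leading eigenvector of the covariance matrix by power iteration."""
    n = len(points)
    dims = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dims)]
    centred = [[p[i] - mean[i] for i in range(dims)] for p in points]
    # covariance matrix (dims x dims)
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # arbitrary start vector; it must not be orthogonal to the answer
    v = [float(i + 1) for i in range(dims)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reduce_to_1d(points):
    """Coordinate of each point along the first principal component."""
    mean, v = first_principal_component(points)
    return [sum((p[i] - mean[i]) * v[i] for i in range(len(v))) for p in points]


# toy stand-in for the two clusters around (2,2) and (8,8)
points = [[2.1, 1.9], [1.9, 2.1], [2.0, 2.0],
          [8.1, 7.9], [7.9, 8.1], [8.0, 8.0]]
coords = reduce_to_1d(points)
```
</pre>
<p>for these points the component comes out along the diagonal joining the two clusters, so the 1d coordinates keep them well apart.</p>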
<p>the projections aren't too bad, all but 1 of them have the 2 clusters linearly separable.</p>
<p>so in summary i think it's pretty good if you need to do something super fast. in these experiments i was using a pretty contrived dataset but was trying to be quite aggressive in going from
10d to 1d. </p>
<p>i wonder what, if any, difference there would be with sparse data?</p>
<p>&lt; random evening hackery /&gt;</p>]]></content:encoded>
    </item>
    <item>
      <title>my list of cool machine learning books</title>
      <link>http://matpalm.com/blog/2010/08/06/my-list-of-cool-machine-learning-books/</link>
      <category><![CDATA[books]]></category>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/?p=746</guid>
      <description>my list of cool machine learning books</description>
      <content:encoded><![CDATA[<p>for the last month or so i've had my head down and have been focusing more on theory (ie reading) than on practice (ie coding)</p>
<p>so rather than write no blog post here's mats-list-of-cool-machine-learning-books in the order i think you should consider reading them...</p>
<!--more-->
<h2>1) <a href="http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325">"programming collective intelligence"</a><img src="http://www.assoc-amazon.com/e/ir?t=matpalmcom0e-20&l=as2&o=1&a=0596529325" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> by toby segaran</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325"><img class="alignnone size-full wp-image-689" title="pci" src="/blog/imgs/2010/08/pci.jpg" alt="" width="200" height="215" /></a></td>
<td>if you know nothing about machine learning and haven't done maths since high school then this is the book for you.

it's a fantastically accessible introduction to the field. includes almost no theory and explains algorithms using actual python implementations.</td>
</tr>
</tbody>
</table>
<h2>2) <a href="http://www.amazon.com/gp/product/0120884070?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0120884070">"data mining"</a><img src="http://www.assoc-amazon.com/e/ir?t=matpalmcom0e-20&l=as2&o=1&a=0120884070" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> by witten and frank</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0120884070?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0120884070"><img class="alignnone size-full wp-image-684" src="/blog/imgs/2010/08/dm1.jpg" alt="" width="200" height="215"></a></td>
<td>this book covers quite a bit more than programming c.i. while still being extremely practical (ie very few formulas).

about a fifth of the book is dedicated to weka, a machine learning workbench which was written by the authors. apart from the weka section this book has no code. i made <a href="http://vimeo.com/13051595">a little screencast on weka</a> awhile back if you're after a summary.</td>
</tr>
</tbody>
</table>
<h2>3) <a href="http://www.amazon.com/gp/product/0321321367?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0321321367">"introduction to data mining"</a><img src="http://www.assoc-amazon.com/e/ir?t=matpalmcom0e-20&l=as2&o=1&a=0321321367" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />  by tan, steinbach and kumar</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0321321367?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0321321367"><img class="alignnone size-full wp-image-687" src="/blog/imgs/2010/08/itdm.jpg" alt="" width="200" height="215" /></a></td>
<td>covers almost the same material as the witten/frank text but delves a little bit deeper and with more rigour. includes no code (none of the books do from now on) with algorithms described by formula.

has a number of appendices on linear algebra, probability, statistics etc so that you can read up if you're a bit rusty or new to the fields (the witten/frank text lacks these).

some people might argue having both of these books is a waste since they cover so much of the same ground but i've always found multiple explanations from different authors to be a great way to help understand a topic. i read the witten/frank text first and am glad i did but if i could only keep one i'd keep this one.</td>
</tr>
</tbody>
</table>
<h2>intermission</h2>
at this point you've probably got enough mental firepower to handle some of the uni level machine learning course notes that are floating about online.

if you're keen to get a better foundation of the maths side of things it'd be worth working through <a href="http://www.youtube.com/watch?v=UzxYlbK2c7E">andrew ng's lecture series on machine learning.</a> (20 hours of a second year stanford course on machine learning)

i also found <a href="http://www.cs.cmu.edu/~awm/">andrew moore's lecture slides</a> really great. (they do though require a reasonable understanding of the basics)
<h2>4) <a href="http://www.amazon.com/gp/product/0262133601?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0262133601">"foundations of statistical natural language processing"</a> by manning and schutze</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0262133601?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0262133601"><img class="alignnone size-full wp-image-686" title="fosnlp" src="/blog/imgs/2010/08/fosnlp.jpg" alt="" width="200" height="215" /></a></td>
<td>not a machine learning book as such but great for learning to deal with one of the most common types of data around; text. since most of machine learning theory is about maths (ie numbers) this is awesome in helping to understand how to deal with text in a mathematical context.</td>
</tr>
</tbody>
</table>
<h2>5) <a href="http://www.amazon.com/gp/product/026201243X?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=026201243X">"introduction to machine learning"</a> by ethem alpaydin</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/026201243X?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=026201243X"><img class="alignnone size-full wp-image-686" src="/blog/imgs/2010/08/itml.jpg" alt="" width="200" height="215" /></a></td>
<td>covers generally the same sort of topics as the data mining books but with much more rigour and theory (derivations, proofs, etc). i think this is a good thing though since understanding how things work at a low level gives you the ability to tweak and modify as required.

loads more formulas but again with appendices that introduce the basics in enough detail to get by.</td>
</tr>
</tbody>
</table>
<h2>6) <a href="http://www.amazon.com/gp/product/1441923225?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1441923225">"all of statistics"</a> by larry wasserman</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/1441923225?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1441923225"><img class="alignnone size-full wp-image-686" src="/blog/imgs/2010/08/aos.jpg" alt="" /></a></td>
<td>by this stage you'll probably have an appreciation of how important statistics is for this domain and it might be worth focusing on it for a bit.

personally i found this book to be a great read and though i've only read certain sections in depth i'm looking forward to when i get a chance to work through it cover to cover</td>
</tr>
</tbody>
</table>
<h2>7) "the elements of statistical learning" by hastie, tibshirani and friedman.</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><img class="alignnone size-full wp-image-686" src="/blog/imgs/2010/08/eosl.jpg" alt="" /></td>
<td>with a bit more stats under your belt you might have a chance of getting through this one; the most complex of the lot.

this book is absolutely beautifully presented and now that it's <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">FREE to download</a> you've got no reason not to have a crack at it.

a remarkable piece of work and one i've yet to get through fully cover to cover, it's quite hardcore and right on the border of my level of understanding ( which makes it perfect for me :P )</td>
</tr>
</tbody>
</table>

<h2>ps. books i haven't read that are in the mail</h2>

<h2><a href="http://www.amazon.com/gp/product/0070428077?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0070428077">"machine learning"</a> by tom mitchell</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0070428077?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0070428077"><img class="alignnone size-full wp-image-686" src="/blog/imgs/2010/08/ml.jpg" alt="" /></a></td>
<td>have been wanting to read this one for awhile, i'm a big fan of <a href="http://www.cs.cmu.edu/~tom/">tom mitchell</a>, but couldn't justify the cost

however just found out the other day the paperback is a third of the price of the hardback i was looking at!! the book's in the mail</td>
</tr>
</tbody>

</table>
<h2><a href="http://www.amazon.com/gp/product/0387310738?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0387310738">"pattern recognition and machine learning"</a> by chris bishop</h2>
<table>
<tbody style="vertical-align: top;">
<tr>
<td><a href="http://www.amazon.com/gp/product/0387310738?ie=UTF8&tag=matpalmcom0e-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0387310738"><img class="alignnone size-full wp-image-686" src="/blog/imgs/2010/08/prml.jpg" alt="" /></a></td>
<td>all of a sudden seemed like everyone was reading this but me so it was time to jump on the bandwagon</td>
</tr>
</tbody>
</table>]]></content:encoded>
    </item>
    <item>
      <title>brutally short intro to weka</title>
      <link>http://matpalm.com/blog/2010/07/03/brutally-short-intro-to-weka/</link>
      <category><![CDATA[weka]]></category>
      <category><![CDATA[brutally short intro]]></category>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/?p=677</guid>
      <description>brutally short intro to weka</description>
<content:encoded><![CDATA[<p>weka is a java based machine learning workbench that i've found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets; two of which address the basics of text classification</p>
<p><object width="400" height="300"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=13051595&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"></embed></object></p><p><a href="http://vimeo.com/13051595">brutally short intro to weka</a> from <a href="http://vimeo.com/user2935988">Mat Kelcey</a> on <a href="http://vimeo.com">Vimeo</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>an intro to semi supervised document classification</title>
      <link>http://matpalm.com/blog/2010/01/31/an-intro-to-semi-supervised-document-classification/</link>
      <category><![CDATA[semi supervised]]></category>
      <category><![CDATA[naive bayes]]></category>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/?p=275</guid>
      <description>an intro to semi supervised document classification</description>
      <content:encoded><![CDATA[<p>here's a great <a href="http://videolectures.net/mlas06_mitchell_sla/">lecture</a> from <a href="http://www.cs.cmu.edu/~tom/">tom mitchell</a> about document classification using a semi supervised version of naive bayes.</p>
<p>semi supervised algorithms only require some of the training examples to be labeled and are able to make use of any unlabelled ones, very common when we have a huge corpus.</p>
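<p>to make the idea concrete, here's a toy self-training loop around a multinomial naive bayes classifier. it's a simplified stand-in (hard label guesses, add-one smoothing, a fixed number of rounds) and not necessarily the scheme from the lecture; the function names and the tiny spam/ham corpus are made up for illustration.</p>
<pre>
```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs is a list of (tokens, label) pairs; returns a naive bayes model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, tokens):
    """Most likely label under multinomial naive bayes with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def self_train(labelled, unlabelled, rounds=5):
    """Guess labels for the unlabelled docs, retrain on everything, repeat."""
    model = train(labelled)
    for _ in range(rounds):
        guessed = [(tokens, classify(model, tokens)) for tokens in unlabelled]
        model = train(labelled + guessed)
    return model


labelled = [("cheap pills buy now".split(), "spam"),
            ("meeting agenda for monday".split(), "ham")]
unlabelled = ["buy cheap watches now".split(),
              "agenda for the monday standup".split()]
model = self_train(labelled, unlabelled)
print(classify(model, ["watches"]))
```
</pre>
<p>"watches" never appears in a labelled example, so the labelled data alone can't tell the classes apart on it. it does turn up in an unlabelled doc full of spammy words, and once that doc gets a guessed label the word starts counting as evidence for spam.</p>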
<p>i've got an experiment brewing to test this out by porting some <a href="http://matpalm.com/rss.feed/p3/">previous naive bayes work</a> i did to use this semi supervised scheme and will publish it when it's done.</p>
<p>cool stuff!!</p>]]></content:encoded>
    </item>
    <item>
      <title>do a degree via youtube</title>
      <link>http://matpalm.com/blog/2009/10/01/do-a-degree-via-youtube/</link>
      <category><![CDATA[lectures]]></category>
      <category><![CDATA[statistics]]></category>
      <category><![CDATA[stanford]]></category>
      <category><![CDATA[machine learning]]></category>
      <guid>http://matpalm.com/blog/?p=75</guid>
      <description>do a degree via youtube</description>
      <content:encoded><![CDATA[<p>i'm amazed by how much great content is on youtube, how could you NOT learn something!?</p>
<p><a href="http://www.youtube.com/view_play_list?p=993FF1801B5AAB4D&amp;search_query=statistical+aspects+of+data+mining+stats+202">13 x 1hr Statistical Aspects of Data Mining (Stats 202)</a></p>
<p><a href="http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599&amp;search_query=machine+learning">20 x 1hr Machine Learning</a></p>]]></content:encoded>
    </item>
  </channel>
</rss>

