<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>do all first links on wikipedia lead to philosophy?</title>
      <link>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy</link>
      <category><![CDATA[graph]]></category>
      <category><![CDATA[wikipedia]]></category>
      <guid>http://matpalm.com/blog/2011/08/13/wikipedia-philosophy</guid>
      <description>do all first links on wikipedia lead to philosophy?</description>
      <content:encoded><![CDATA[<hr>

<p>(update: like all interesting things it turns out <a href="http://en.wikipedia.org/wiki/User:Ilmari_Karonen/First_link">someone else had already done this</a> :D)</p>
<h2>questions</h2>
<p>a <a href="http://xkcd.com/903/">recent</a> xkcd posed the idea...</p>
<p><i>wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at Philosophy.</i></p>
<p>this raises a number of questions</p>
<ol>
<li>Q: though i wouldn't be surprised if it's true for <em>most</em> articles it can't be true for <em>all</em> articles. can it?</li>
<li>Q: what's the distribution of distances (measured in "number of clicks away") from 'Philosophy'?</li>
<li>Q: by this same measure what's the furthest article from 'Philosophy'?</li>
<li>Q: are there any other articles that are more common than 'Philosophy'? </li>
<li>Q: what are the common paths to 'Philosophy'?</li>
</ol>
<p>there's only one way to find out!</p>
<ol>
<li>grab a wikipedia dump</li>
<li>build the graph of 'article' to 'first link to next article' (not in parentheses or italics)</li>
<li>do breadth first search backwards from 'Philosophy' and see what things look like</li>
</ol>
<hr>

<h2>getting and processing the data</h2>
<p>for my first attempt i tried to use the <a href="http://wiki.freebase.com/wiki/WEX">freebase wikipedia dump</a>. my thought was it'd be easier
to deal with a preparsed dataset but it didn't work out.</p>
<p>two big problems....</p>
<ol>
<li>lots of information has been lost in the preparsing (eg. it was sometimes hard to determine if the first links were from the main body of text or from a sidebar )</li>
<li>some pages weren't parsed properly at all and were just blank, including ones like <a href="http://en.wikipedia.org/wiki/Greeks">Greeks</a>
which ended up being pretty important.</li>
</ol>
<p>instead i went for a <a href="http://download.wikimedia.org/enwiki/20110722/">raw wikimedia dump</a>, in particular the enwiki-20110722-pages-articles.xml.bz2 version.
it's 7gb compressed &amp; 30gb uncompressed.</p>
<p>for preprocessing there were a number of steps</p>
<ol>
<li>split the dataset into pages that represent redirects and the actual articles themselves</li>
<li>dereference all the redirects (to avoid redirects that redirect to other redirects)</li>
<li>parse all the articles; the crux of this is done with <a href="http://code.pediapress.com/wiki/wiki/mwlib">mwlib</a> 
and <a href="https://github.com/matpalm/wikipediaPhilosophy/blob/master/article_parser.py">article_parser.py</a>; to make a big list of edges of 'from' nodes (the article) and 'to' nodes (the first applicable link on the article page)</li>
<li>dereference the edges to make sure all redirects have been followed</li>
</ol>
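<p>as a rough illustration of the redirect dereferencing steps, here's a minimal python sketch. it is not the project's actual code: the <code>resolve</code> helper, the toy titles and the assumption that all redirects fit in one dict are made up for the example.</p>

```python
def resolve(title, redirects):
    """follow redirects from title until a real article (or a redirect loop) is reached."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title

# hypothetical toy data: redirect page mapped to the page it points at
redirects = {
    "Philosophical": "Philosophy (discipline)",
    "Philosophy (discipline)": "Philosophy",
}

# hypothetical raw edges: (article, first link exactly as written on the page)
raw_edges = [("Modern philosophy", "Philosophical")]

# dereference the 'to' side of every edge so it names a real article
edges = [(src, resolve(dst, redirects)) for src, dst in raw_edges]
```

<p>the <code>seen</code> set is only there so a redirect loop can't hang the loop; chains of redirects collapse to their final target.</p>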
<p>some general statements before we go further</p>
<ol>
<li>wikipedia is under heavy edit churn. i've been doing this project in 15-30 minute chunks for a few weeks and it's amazing
 how often i'd compare the parsing to live wikipedia and find out a page had already subtly changed. god knows what it looks like currently.</li>
<li>i wrote all the code for this in python as i'm trying to move away from ruby to get better data related library support. everything in fact <em>except</em> for
the breadth first search which i did in java. the full graph as a python dict was <em>insanely</em> slow to access; i must be doing something wrong.
for the full details see 
<a href="http://www.github.com/matpalm/wikipediaPhilosophy">the code on this project</a>. git cloning the project and executing the README
as a shell script may [1] do something close to all the steps from start to finish. <small>[1] or it might not</small></li>
</ol>
<p>the end result of the parsing is a list of 3.6e6 edges of the form 'article' -&gt; 'first link to next article' (after following redirects).</p>
<p>all the 'article's are unique but there are only 500e3 distinct 'next article's which is already very interesting; it means less than 15% of articles 
on wikipedia are represented by one of these first links; this graph is very "bushy" (ie lots of leaf nodes).</p>
<p>to calculate the distance from 'Philosophy' for all articles it's a straightforward 
<a href="http://en.wikipedia.org/wiki/Breadth_first_search">breadth first search</a> and
because this search doesn't <a href="http://en.wikipedia.org/wiki/Graph_cycle">cycle</a> back to 'Philosophy' again it ends
up building a <a href="http://en.wikipedia.org/wiki/Tree_(graph_theory)">tree</a>.</p>
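<p>a minimal python sketch of that backwards breadth first search, assuming the edges fit in a dict called <code>first_link</code> (a made-up name, as is <code>distances_to</code>). the real search in the project was written in java; this is only an illustration on toy data.</p>

```python
from collections import defaultdict, deque

def distances_to(target, first_link):
    """first_link maps each article to the article its first link points at."""
    # invert the edges so we can walk backwards from the target
    incoming = defaultdict(list)
    for article, next_article in first_link.items():
        incoming[next_article].append(article)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for article in incoming[current]:
            if article not in distance:  # never revisit, so the result is a tree
                distance[article] = distance[current] + 1
                queue.append(article)
    return distance

# hypothetical toy graph
first_link = {
    "Modern philosophy": "Philosophy",
    "Quantity": "Modern philosophy",
    "Philosophy": "Reason",
    "Sand fence": "Snow fence",
    "Snow fence": "Sand fence",
}
distance = distances_to("Philosophy", first_link)
```

<p>articles that never show up in <code>distance</code> (like the two fence articles here) are the ones that don't lead to the target.</p>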
<hr>

<h2>the results</h2>
<p>with this tree we can start answering some of our original questions ...</p>
<hr>

<h3>Q: though i wouldn't be surprised if it's true for <em>most</em> articles it can't be true for <em>all</em> articles. can it?</h3>
<p>seems it's not true for all articles; 3.5e6 articles lead to 'Philosophy' but 100e3 don't.</p>
<p>these 100e3 fall into two types</p>
<p>1) 50e3 of them end up in cycles. this is a remarkably low count given 3.5e6 make it to 'Philosophy'.</p>
<p>the vast majority of the cycles are of length 2; eg <strong>Waste management -&gt; Waste collection -&gt; Waste management</strong></p>
<p>( my favorite that i stumbled across is <strong>Sand fence -&gt; Snow fence -&gt; Sand fence</strong><br/>
the first sentence of Snow fence being "A snow fence is a structure, similar to a sand fence ..."<br/>
the first sentence of Sand fence being "A sand fence is a structure similar to a snow fence ..." )</p>
<p>2) the other 50e3 are dead ends; all sorts of examples for this, mainly around pages that were never written or have been deleted.</p>
<p>eg <strong>Windsurfing -&gt; Surface water sports -&gt; Discing</strong> (which has been deleted)</p>
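<p>one way to tell the two types apart is to just play the game from an article and see how the walk ends. a small python sketch, assuming a <code>first_link</code> dict of article to next article; the function name and the toy data are invented for the example, and this per-article walk is not necessarily how the counts above were produced.</p>

```python
def classify(article, first_link, target="Philosophy"):
    """follow first links from article and report how the walk ends."""
    seen = set()
    while article != target:
        if article in seen:
            return "cycle"      # we came back to somewhere we've already been
        if article not in first_link:
            return "dead end"   # page has no usable first link (eg never written or deleted)
        seen.add(article)
        article = first_link[article]
    return "reaches target"

# hypothetical toy graph built from the examples in the text
first_link = {
    "Sand fence": "Snow fence",
    "Snow fence": "Sand fence",
    "Windsurfing": "Surface water sports",
    "Surface water sports": "Discing",
    "Modern philosophy": "Philosophy",
}
results = {a: classify(a, first_link) for a in first_link}
```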
<hr>

<h3>Q: what's the distribution of distances of articles from 'Philosophy'?</h3>
<p>the bulk of the articles are between 10 and 30 clicks away...</p>
<img src="http://matpalm.com/wikipediaPhilosophy/num_articles__number_clicks__philosophy.png"/>

<p>i've trimmed this graph at 70 clicks away since there's a long tail of one single path that is 1001 articles long.</p>
<p><strong>List of state leaders in 1977 -&gt; List of state leaders in 1976 -&gt; List of state leaders in 1975 -&gt;
.... -&gt; List of state leaders in 1001 -&gt; List of state leaders in 1000 -&gt; Fatimid Caliphate -&gt; Arab people
-&gt; Panethnicity -&gt; Ethnic group -&gt; Social group -&gt; Social sciences -&gt; List of academic disciplines 
-&gt; Academia -&gt; Community -&gt; Living -&gt; Life -&gt; Physical body -&gt; Physics -&gt; Natural science -&gt; Science 
-&gt; Knowledge -&gt; Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; Quantity -&gt; Property (philosophy) 
-&gt; Modern philosophy -&gt; Philosophy</strong></p>
<p>seems a bit of a "meta article" outlier we can ignore.</p>
<p>( there's an interesting dip at a distance of 19 too; wonder what's going on there? )</p>
<hr>

<h3>Q: what's the furthest article from 'Philosophy'?</h3>
<p>'Violet &amp; Daisy' is the longest chain i found that didn't include "meta" pages with some kind of sequence number in it. it's 36 articles from 'Philosophy'.</p>
<p><strong>Violet &amp; Daisy -&gt; Saoirse Ronan -&gt; BAFTA Award for Best Actress in a Supporting Role -&gt; British Academy Film Awards -&gt; 
 British Academy of Film and Television Arts -&gt; David Lean -&gt; Order of the British Empire -&gt; Chivalric order -&gt; Knight -&gt; 
 Warrior -&gt; Combat -&gt; Violence -&gt; Psychological manipulation -&gt; Social influence -&gt; Conformity -&gt; Unconscious mind -&gt; 
 Germans -&gt; Germanic peoples -&gt; Proto-Indo-Europeans -&gt; Proto-Indo-European language -&gt; Linguistic reconstruction -&gt; 
 Internal reconstruction -&gt; Language -&gt; Human -&gt; Extant taxon -&gt; Biology -&gt; Natural science -&gt; Science -&gt; Knowledge -&gt; 
 Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; Quantity -&gt; Property (philosophy) -&gt; Modern philosophy -&gt; Philosophy</strong></p>
<hr>

<h3>Q: are there any other articles that are "more common" than 'Philosophy'?</h3>
<p>with 95+% of articles clicking through to 'Philosophy' it's not possible for there to be another unconnected graph with an article more represented than
'Philosophy'. </p>
<p>but if we <em>continue</em> to click through past 'Philosophy' we see we're in a short cycle of 12 articles...</p>
<p><strong>Philosophy -&gt; Reason -&gt; Natural science -&gt; Science -&gt; Knowledge -&gt; Fact -&gt; Information -&gt; Sequence -&gt; Mathematics -&gt; 
 Quantity -&gt; Property (philosophy) -&gt; Modern philosophy -&gt; Philosophy</strong></p>
<p>so really <em>any</em> of these are reasonable candidates and are equally good as 'Philosophy' itself for this game.</p>
<hr>

<h3>Q: what are the common paths into 'Philosophy'?</h3>
<p>as mentioned the breadth first search builds a tree of articles with 'Philosophy' at its root.</p>
<p>one metric we can assign to each article in this tree is the number of descendant articles it has.<br/>
'Philosophy', as the root, has all articles as descendants so its number is 3.5e6 and its rank is 1.<br/>
the next ranked by number of descendants is 'Modern philosophy' with 3.4e6 descendants 
(ie of the 3.5e6 articles that eventually led to 'Philosophy' only 100e3 of them <em>didn't</em> click through 'Modern philosophy').</p>
<p>by ranking articles by this metric we can observe the core structure of the tree.</p>
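<p>a hedged python sketch of the descendant count, assuming we already have a <code>first_link</code> dict (article to next article) and a <code>distance</code> dict (clicks from the root) for every article in the tree. both names, the helper and the toy data (including 'Logic') are hypothetical.</p>

```python
def descendant_counts(first_link, distance):
    """count, for every article in the tree, how many articles sit below it."""
    counts = {article: 0 for article in distance}
    # visit the deepest articles first so each count is final before it is passed up
    for article in sorted(distance, key=distance.get, reverse=True):
        if distance[article] == 0:
            continue  # the root has no parent inside the tree
        parent = first_link[article]
        counts[parent] += counts[article] + 1
    return counts

# hypothetical toy tree
first_link = {
    "Modern philosophy": "Philosophy",
    "Logic": "Philosophy",
    "Quantity": "Modern philosophy",
    "Mathematics": "Quantity",
}
distance = {
    "Philosophy": 0,
    "Modern philosophy": 1,
    "Logic": 1,
    "Quantity": 2,
    "Mathematics": 3,
}
counts = descendant_counts(first_link, distance)
ranked = sorted(counts, key=counts.get, reverse=True)
```

<p>sorting by the count gives the ranking; in the toy tree the root comes first and 'Modern philosophy' second.</p>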
<p><hr>
in fact for the top 10 ranked articles it's hardly a tree, just the chain ...</p>
<p><a href="http://matpalm.com/wikipediaPhilosophy/top10.png"><img src="http://matpalm.com/wikipediaPhilosophy/top10.png" width="100%"/></a></p>
<p><small>(width of the edge is proportional to the number of descendants)</small></p>
<p>it turns out that 3e6 articles (85% of the lot) get to 'Philosophy' through 'Science'.</p>
<p><hr>
in fact it's not until we consider up to the 20th ranked item, 'Biology', that it actually becomes a tree structure ...</p>
<p><a href="http://matpalm.com/wikipediaPhilosophy/top20.png"><img src="http://matpalm.com/wikipediaPhilosophy/top20.png" width="100%"/></a></p>
<p><small>(click for a bigger version)</small></p>
<p><hr>
when we consider the top 200 things start to look a bit more interesting ...</p>
<script src="http://zoom.it/adTw.js?width=auto&height=500px"></script>

<p><hr>
and by the top 1000 things are starting to lose an obvious core structure ...</p>
<script src="http://zoom.it/QyGA.js?width=auto&height=500px"></script>

<p>( though dot's a pretty poor layout engine for this one, i should redo this one )</p>
<h2>conclusions</h2>
<p>so i managed to answer the main questions i had, but it's a fun dataset so there's lots more to do yet!</p>
<p>todos include </p>
<ol>
<li>a better layout for the top 1000 or so</li>
<li>redo with a more recent wiki dump to see what's changed</li>
<li>what happened at a depth of 19 articles?</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>e10.6 community detection for my twitter network</title>
      <link>http://matpalm.com/blog/2010/04/04/375/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[betweenness]]></category>
      <category><![CDATA[social network]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=375</guid>
      <description>e10.6 community detection for my twitter network</description>
      <content:encoded><![CDATA[<p>last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.</p>
<p>first i built a friend graph for 100 people 'around' me (taken from a <a href="http://matpalm.com/blog/2009/09/29/e10-3-twitter-crawl-progress/">crawl</a> i did last year). by 'friend' i mean that if alice follows bob then bob also follows alice.</p>
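<p>to make the 'friend' definition concrete, here's a tiny python sketch that keeps only mutual follows. the <code>follows</code> dict, the helper name and the users are invented for illustration; the actual code lives in the tgraph repo and may look quite different.</p>

```python
def friend_edges(follows):
    """follows maps a user to the set of users they follow; keep only mutual follows."""
    edges = set()
    for user, followed in follows.items():
        for other in followed:
            if other != user and user in follows.get(other, set()):
                edges.add(frozenset((user, other)))  # undirected edge
    return edges

# hypothetical toy data
follows = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": set(),
}
friends = friend_edges(follows)
```

<p>alice follows carol but carol doesn't follow back, so only the alice/bob edge survives.</p>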
<p>here's the graph. some things to note though: it was an unfinished crawl (can a crawl of twitter EVER be finished?) and was done in october last year, so it's a bit out of date.</p>
<p><a href="/blog/imgs/2010/04/friends.jpg"><img class="aligncenter size-large wp-image-377" title="friends" src="/blog/imgs/2010/04/friends-1024x204.jpg" alt="friends" width="1024" height="204" /></a><!--more--></p>
<p>and here is the dendrogram decomposition</p>
<p><a href="/blog/imgs/2010/04/dendrogram.vert_.600.jpg"><img class="aligncenter size-full wp-image-391" title="dendrogram.vert.600" src="/blog/imgs/2010/04/dendrogram.vert_.600.jpg" alt="dendrogram.vert.600" width="600" height="1500" /></a>some interesting clusterings come out..</p>
<p>right at the bottom we have a small clique (ie everyone following everyone else) of people i've known from when i was in <em>sydney</em></p>
<p><a href="/blog/imgs/2010/04/sydney.nokia_.jpg"><img class="aligncenter size-full wp-image-387" title="sydney.nokia" src="/blog/imgs/2010/04/sydney.nokia_.jpg" alt="sydney.nokia" width="185" height="98" /></a></p>
<p>this small group connects to the group i'm in; <a href="http://twitter.com/tinybuddha">tinybuddha</a> down to <a href="http://twitter.com/evanbottcher">evanbottcher</a>; which roughly describes the group of people i've met in <em>melbourne</em>.</p>
<p>the order of the single breakaways in the melbourne group is pretty arbitrary. i get quite different ordering if i run the decomposition multiple times due to the random tie breaking involved. i could either run the decomposition multiple times and work out some kind of averaging or choose another more granular way of deciding how to break ties.</p>
<p>the next connector after <em>sydney</em> and <em>melbourne</em> are unified is <a href="http://twitter.com/deanemorrow">deanemorrow</a>, a coworker when i was at <a href="http://twitter.com/distra">distra</a>. this one sticks out for me as being the biggest flaw in the clustering since it would have made more sense to have him placed near distra at the bottom.</p>
<p>another interesting clique is near me..</p>
<p><a href="/blog/imgs/2010/04/twers.jpg"><img class="aligncenter size-full wp-image-393" title="twers" src="/blog/imgs/2010/04/twers.jpg" alt="twers" width="115" height="123" /></a>it has four thoughtworkers; <a href="http://twitter.com/markryall">mark</a>, <a href="http://twitter.com/grillp">gill</a>, <a href="http://twitter.com/debbiecheong">debs</a> and <a href="http://twitter.com/evanbottcher">evan</a> and one sensiser; <a href="http://twitter.com/kornys">korny</a>. did korny perhaps work for thoughtworks in a previous life ;)</p>
<p>another interesting note is there exists a path from me to <a href="http://twitter.com/norvig">peter norvig</a> (who is too busy for twitter it seems) but only because of the huge connector nodes that exist in twitter. an example in this case is <a href="http://twitter.com/tuaw">TUAW</a> who follow 30,000+ people and have even more followers. these nodes cause a bit of noise in the system since they are slightly false representations of what a 'friend' means in my mind. not sure how to take these numbers into account...</p>
<p>things to do...</p>
<ul>
<li>the biggest oversimplification in this system is how i break ties for deciding which edge to cut out next if multiple exist with the same betweenness. currently it chooses the one that would make the most even sized break (based on smallest standard deviation of the connected components). though this is good for breaking a group into even sizes it's bad since it favours breaking a single element off a large group. this is what has caused the 'laddering' we see in the melbourne group.</li>
<li>the shortest path algorithm used to calculate edge betweenness is stochastic and if multiple shortest paths exist only one of them is chosen. it'd be better if all were considered with a weighting scheme.</li>
<li>it might be better to consider vertex betweenness instead of edge betweenness since one person could exist in multiple groups. if i started down this path though i think i'd rather just rewrite the lot using something like  the <a href="http://en.wikipedia.org/wiki/Clique_percolation_method">clique percolation method</a></li>
</ul>
<p><a href="http://github.com/matpalm/tgraph">all the code is on github</a></p>]]></content:encoded>
    </item>
    <item>
      <title>e10.5 revisiting community detection</title>
      <link>http://matpalm.com/blog/2010/03/30/e10-5-revisiting-community-detection/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[betweenness]]></category>
      <category><![CDATA[social network]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=357</guid>
      <description>e10.5 revisiting community detection</description>
      <content:encoded><![CDATA[<p>i've decided to switch back to <a href="http://matpalm.com/blog/2009/10/06/e10-4-communities-in-social-graphs/">some previous work</a> i did on community detection in (social) graphs</p>
<p>the <a href="http://github.com/matpalm/tgraph/tree/master/girvan_newman">last chunk of code</a> i wrote which tried to deal with weighted directed graphs was terribly, terribly, broken but it seems that simplifying to undirected graphs is giving me much saner results. yay!</p>
<p>here's an example of my work in progress generated from <a href="http://github.com/matpalm/tgraph/tree/master/girvan_newman_2">the new version of the code</a></p>
<p>consider the graph</p>
<img class="aligncenter size-medium wp-image-358" title="p97" src="/blog/imgs/2010/03/p97-214x300.png" alt="p97" width="214" height="300" />

<p>and its corresponding decomposition</p>
<img class="aligncenter size-full wp-image-360" title="p97.dendrogram" src="/blog/imgs/2010/03/p97.dendrogram.jpg" alt="p97.dendrogram" width="400" height="400" />

<p>the results are reasonable; the initial breaking of clusters [1,2,3,4,5,6] and [7,8,9,10,11,12] is the most obvious but some of the others are not as intuitive</p>
<p>[1,2,5] and [7,8,10] remain as unbreakable <a href="http://en.wikipedia.org/wiki/Clique_(graph_theory)">cliques</a> though it's arbitrary that 11 was broken off from [7,8,10] instead of 10 (arbitrary but an artifact related to my shortest path calculation for the edge betweenness)</p>
<p>the idea of identifying the edge to remove using <a href="http://en.wikipedia.org/wiki/Betweenness#Betweenness_centrality">edge betweenness</a> works well but it is often the case there are many edges with the same maximal betweenness and you have to choose only one. i think my current implementation of picking one is a bit naive and i'm not sure if i should move to a stochastic / <a href="http://en.wikipedia.org/wiki/Monte_Carlo_method">monte carlo style approach</a> or focus more on <a href="http://en.wikipedia.org/wiki/Community_structure#Modularity_maximization">modularity maximisation</a></p>]]></content:encoded>
    </item>
    <item>
      <title>e10.4 communities in social graphs</title>
      <link>http://matpalm.com/blog/2009/10/06/e10-4-communities-in-social-graphs/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[social network]]></category>
      <category><![CDATA[betweenness]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=83</guid>
      <description>e10.4 communities in social graphs</description>
      <content:encoded><![CDATA[<p>social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.</p>
<p>these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure <a href="http://en.wikipedia.org/wiki/Betweenness#Betweenness_centrality">betweenness</a>.</p>
<p>the <a href="http://en.wikipedia.org/wiki/Girvan-Newman_algorithm">girvan-newman algorithm</a> can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.</p>
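<p>a compact python sketch of one round of that successive removal, for a small undirected, unweighted graph with no self loops. it is an illustration only, not the tgraph code: betweenness here is a brandes style count over <em>all</em> shortest paths, components are found with a plain depth first walk rather than tarjan's algorithm, and ties are broken arbitrarily. all function names and the toy graph are made up.</p>

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """score each undirected edge by the shortest paths that run through it."""
    score = defaultdict(float)
    for source in adj:
        dist = {source: 0}
        paths = defaultdict(float)   # number of shortest paths from source
        paths[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = defaultdict(float)
        for w in reversed(order):    # furthest nodes first
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    return score

def components(adj):
    """connected components via a simple depth first walk."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in group:
                group.add(v)
                stack.extend(adj[v] - group)
        seen |= group
        result.append(group)
    return result

def split_once(adj):
    """remove highest betweenness edges until the graph falls into more pieces."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break                                # no edges left to cut
        a, b = max(score, key=score.get)         # ties broken arbitrarily
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)

# hypothetical toy graph: two triangles joined by the single edge 3-4
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
pieces = split_once(graph)
```

<p>calling <code>split_once</code> again on each piece gives the hierarchical decomposition.</p>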
<p>the algorithm is basically</p>
<ol>
<li>calculate the betweenness of each edge (using an all shortest paths algorithm)</li>
<li>remove the edge(s) with the highest betweenness</li>
<li>check for connected components (using <a href="http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm">tarjan's</a> algorithm)</li>
<li>repeat for graph or subgraphs if graph was split</li>
</ol>]]></content:encoded>
    </item>
    <item>
      <title>e10.2 tgraph crawl order example</title>
      <link>http://matpalm.com/blog/2009/09/21/e10-2-tgraph-crawl-order-example/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=65</guid>
      <description>e10.2 tgraph crawl order example</description>
      <content:encoded><![CDATA[<p>let's consider an example of the crawl order for tgraph...</p>
<p>we seed our frontier with 'a' and bootstrap cost of 0.</p>
<p>fetching the info for 'a' shows 2 outedges to 'b' and 'c'; from our cost formula these both have cost 0 + 1 + Log10(2+1) = 1.5</p>
<p>our frontier becomes [ {b,1.5}, {c,1.5} ]</p>
<p>next is 'b', which has an outdegree of 3; these nodes, b1 -&gt; b3, all have a cost of 1.5 + 1 + Log10(3+1) = 3.1</p>
<p>our frontier becomes [ {c,1.5}, {b1,3.1}, {b2,3.1}, {b3,3.1} ]</p>
<p>next is 'c' with an outdegree of 15. these 15 nodes, c1 -&gt; c15, have cost 1.5 + 1 + Log10(16) = 3.7</p>
<p>our frontier is now [ {b1,3.1}, {b2,3.1}, {b3,3.1}, {c1,3.7} ... {c15,3.7} ]</p>
<p>we would then continue with the 'b' nodes before the 'c' ones</p>
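<p>the walkthrough above can be sketched in python with a priority queue as the frontier. this is a toy illustration, not tgraph itself: <code>crawl_order</code> and the <code>outedges</code> dict are hypothetical stand-ins for the real fetches, and ties on cost are broken by node name.</p>

```python
import heapq
from math import log10

def crawl_order(seed, outedges, limit):
    """return the first `limit` nodes visited, always taking the cheapest frontier entry."""
    frontier = [(0.0, seed)]   # (cost, node), seeded with a bootstrap cost of 0
    seen = {seed}
    order = []
    for _ in range(limit):
        if not frontier:
            break
        cost, node = heapq.heappop(frontier)
        order.append(node)
        children = outedges.get(node, [])
        # cost of each child: parent cost + 1 + Log10(1 + outdegree of the parent)
        child_cost = cost + 1 + log10(1 + len(children))
        for child in children:
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (child_cost, child))
    return order

# the example graph from the text: a has 2 outedges, b has 3, c has 15
outedges = {
    "a": ["b", "c"],
    "b": ["b1", "b2", "b3"],
    "c": ["c%d" % i for i in range(1, 16)],
}
order = crawl_order("a", outedges, 7)
```

<p>the 'b' children all come out of the frontier before any of the 'c' children because their parent had the smaller outdegree.</p>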
<p>note that this cost system is more than just an ordering; it allows an element at depth n+1 to be checked before an element at depth n, if the cost of the latter is high enough.</p>]]></content:encoded>
    </item>
    <item>
      <title>e10.1 crawling twitter</title>
      <link>http://matpalm.com/blog/2009/09/19/e10-1-crawling-twitter/</link>
      <category><![CDATA[e10]]></category>
      <category><![CDATA[twitter]]></category>
      <category><![CDATA[algorithms]]></category>
      <category><![CDATA[graph]]></category>
      <guid>http://matpalm.com/blog/?p=55</guid>
      <description>e10.1 crawling twitter</description>
      <content:encoded><![CDATA[<p>our first goal is to get some data and <a href="http://apiwiki.twitter.com/">the twitter api</a> makes getting the data trivial. i'm focused mainly on the <a href="http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-friends%C2%A0ids">friends</a> stuff but because it only gives user ids i'll also get the <a href="http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-users%C2%A0show">user info</a> so i can put names to ids.</p>
<p>a <a href="http://en.wikipedia.org/wiki/Depth-first_search">depth first crawl</a> makes no sense for this one experiment, we're unlikely to get the entire graph and are more interested in following edges "close" to me. instead we'll use a <a href="http://en.wikipedia.org/wiki/Breadth-first_search">breadth first search</a>.</p>
<p>since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth first we'll introduce a cost component to elements on the frontier to help decide what to grab next. this is especially important for a graph like twitter where the outdegree of a node is often in the hundreds. it turns the crawl into something that is not strictly breadth first but it works out.</p>
<p>to explain the cost component consider the expected connectivity of nodes in the twitter friend graph. most nodes have an outdegree of the order 20-200. occasionally you see much larger (in the 1000's) or much smaller (under 10).</p>
<p>we might, naively perhaps, say that having a large outdegree means the person is a bit less strict with her following criteria and that some of them are not really that important to her. if this is the case we should focus a little more  on getting nodes with smaller outdegree.</p>
<p>the formula i've come up with is to not consider the depth but instead add 1 + Log10(1+the outdegree of the previous node). in this way we penalise large outdegrees, but not by a huge amount. we always add 1 to counter the cases where there are no edges leaving a node.</p>]]></content:encoded>
    </item>
  </channel>
</rss>

