<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     >
  <channel>
    <title>brain of mat kelcey</title>
    <link>http://matpalm.com/blog</link>
    <description>thoughts from a data scientist wannabe</description>
    <generator>Blogofile</generator>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>cool bash stuff; mkfifo</title>
      <link>http://matpalm.com/blog/2010/04/15/cool-bash-stuff-mkfifo/</link>
      <category><![CDATA[unix]]></category>
      <category><![CDATA[bash]]></category>
      <guid>http://matpalm.com/blog/?p=426</guid>
      <description>cool bash stuff; mkfifo</description>
      <content:encoded><![CDATA[<p><a href="http://www.gnu.org/software/coreutils/manual/html_node/mkfifo-invocation.html">mkfifo</a> is one of those shell commands provided as part of <a href="http://www.gnu.org/software/coreutils/manual/html_node/">coreutils</a> that not many people seem to know about.</p>
<p>here's a (semi contrived) example, close to something i did the other day, to show how awesome it is</p>
<p>say you have a number of largish presorted files, <em>run-00</em> to <em>run-03</em>, and you want to find the most frequent lines. you could do something like the following...</p>
<pre>sort -m run-* | uniq -c | sort -nr | head</pre>

<p>however, as you'll know from <a href="http://matpalm.com/blog/2009/06/28/how-using-compressed-data-can-make-you-app-faster/">previous posts</a>, i just loooove keeping all my data compressed on disk, so instead i've got <em>run-00.gz</em> to <em>run-03.gz</em></p>
<p>to avoid uncompressing the files to disk i'd have to do something like this...</p>
<pre>zcat run*gz | sort | uniq -c | sort -nr | head</pre>

<p>but this pains me since it results in completely resorting the stream. i know the input files are sorted so i'd <strong>much</strong> prefer doing a <em>sort -m</em> to a full <em>sort</em></p>
<p>so how can i combine zcat, which writes to a pipe, with sort -m, which wants its multiple inputs as separate files rather than a single STDIN?</p>
<p>well, mkfifo of course!  it's a way of making a file that acts like a pipe ( a named pipe )</p>
<pre>ls | sort</pre>

<p>is sort-of, roughly, equivalent to</p>
<pre>
mkfifo bob
ls > bob &
sort < bob
rm bob
</pre>

<p>( you have to background the <em>ls</em> since opening the named pipe for writing blocks until something opens it for reading )</p>
<p>apart from being a cool way to get pipes working between totally separate processes on a box, this provides a solution for our original problem</p>
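<p>here's a minimal sketch of that blocking behaviour that can be pasted into a shell as is. the scratch directory from mktemp and the three sample lines are made up for illustration; only mkfifo, the backgrounded writer and the sort reader come from the example above.</p>

```shell
# scratch dir so the demo cleans up after itself (hypothetical names)
tmp=$(mktemp -d)
mkfifo "$tmp/bob"

# the writer goes to the background; its redirect blocks
# until something opens the pipe for reading
printf 'c\na\nb\n' > "$tmp/bob" &

# the reader opens the other end, which unblocks the writer
sorted=$(sort < "$tmp/bob")
wait    # reap the background writer

echo "$sorted"
rm -r "$tmp"
```

<p>no data is stored in the fifo itself; the bytes go straight from the writer to the reader.</p>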
<pre>mkfifo p0 p1 p2 p3
zcat run-00.gz > p0 &
zcat run-01.gz > p1 &
zcat run-02.gz > p2 &
zcat run-03.gz > p3 &
sort -m p* | uniq -c | sort -nr | head
rm p[0123]</pre>

<p>and all four zcats can burn cpu in parallel while <em>sort -m</em> avoids the need to resort.</p>
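<p>the same idea can be sketched as a loop for when the number of runs isn't fixed. this is just one way to write it, not what i actually ran: the scratch directory, the two tiny sample files and the <em>.pipe</em> suffix are all made up so the snippet is self contained.</p>

```shell
tmp=$(mktemp -d)

# two tiny presorted inputs, stand-ins for run-00.gz .. run-03.gz
printf 'a\nb\nb\n' | gzip > "$tmp/run-00.gz"
printf 'b\nc\n'    | gzip > "$tmp/run-01.gz"

# one named pipe per input, each fed by its own background zcat;
# the pipe names are collected in the positional parameters
set --
for f in "$tmp"/run-*.gz; do
  p="${f%.gz}.pipe"
  mkfifo "$p"
  zcat "$f" > "$p" &
  set -- "$@" "$p"
done

# merge the already sorted streams, count, and keep the most frequent line
top=$(sort -m "$@" | uniq -c | sort -nr | head -n 1)
wait    # reap the background zcats

echo "$top"
rm -r "$tmp"
```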
<p>yay!</p>]]></content:encoded>
    </item>
    <item>
      <title>xargs parallel execution</title>
      <link>http://matpalm.com/blog/2009/11/06/xargs-parallel-execution/</link>
      <category><![CDATA[unix]]></category>
      <category><![CDATA[bash]]></category>
      <guid>http://matpalm.com/blog/?p=217</guid>
      <description>xargs parallel execution</description>
      <content:encoded><![CDATA[<p>just recently discovered xargs has a parallelise option!</p>
<p>i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over</p>
<p>one option is
<pre>zcat sample*gz | ./script.rb &gt; output</pre>
but this will process the files sequentially on a single core.</p>
<p>to get some parallel action going i could generate a temp script that produces
<pre>zcat sample.01.gz | ./script.rb &gt; sample.01.out &amp;
zcat sample.02.gz | ./script.rb &gt; sample.02.out &amp;
...
zcat sample.20.gz | ./script.rb &gt; sample.20.out &amp;</pre>
and run that, but this will have all 20 running at the same time and produce contention</p>
<p>(though with only 20 files this might not be a problem)</p>
<p>instead i can make a temp script, parse.sh
<pre>zcat "$1" | ./script.rb &gt; "$1.out"</pre>
and run
<pre>find sample*gz | xargs -n1 -P4 sh parse.sh
cat *out &gt; output</pre>
what is this xargs command doing?
<ul>
    <li>-n1 passes one arg at a time to the run command (instead of the xargs default of passing as many args as will fit)</li>
    <li>-P4 says have at most 4 commands running at the same time</li>
</ul>
the result: 100% on all cores (and only because the disk can keep up)</p>
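<p>a small self contained sketch of the same pattern, for trying it out without the real data. assumptions: three tiny made up sample files instead of twenty, <em>tr</em> standing in for <em>./script.rb</em>, and an inline <em>sh -c</em> in place of the parse.sh temp script. the -print0 / -0 pair is an extra that keeps odd file names intact.</p>

```shell
tmp=$(mktemp -d)

# three tiny inputs standing in for sample.01.gz .. sample.20.gz
for i in 01 02 03; do
  printf 'line\n' | gzip > "$tmp/sample.$i.gz"
done

# -n1: one file per command, -P4: at most four commands at once
# tr stands in for ./script.rb; the trailing "sh" fills $0,
# so each file name lands in $1
find "$tmp" -name 'sample.*.gz' -print0 |
  xargs -0 -n1 -P4 sh -c 'zcat "$1" | tr a-z A-Z > "$1.out"' sh

# xargs only returns once every command has finished,
# so it is safe to collect the outputs here
result=$(cat "$tmp"/*out)

echo "$result"
rm -r "$tmp"
```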
<p>awesome!</p>]]></content:encoded>
    </item>
  </channel>
</rss>

