<?php //
      //     YOU THINK YOU KNOW--    YOU DON'T!    LISTEN UP!
      //
      // I RENAME THIS to makeindex.php_txt, edit it,
      // then create the actual .html file like this:
      //
      //     touch datefile; php makeindex.php_txt >index.html
      //     ^^^^^^^^^^^^^^^
      //
      // or just  make  ...ceteris paribus (everything necessary to
      // make the same thing happen being inserted in the makefile).
?>
<HTML>
<HEAD>
<TITLE>Clumpy Heapsort</TITLE>
</HEAD>
<BODY>

<table width="100%" border="0">
  <tr>
    <td width="129" align="center">
      <a href="#graph">
        <img
        src="clumpy_thumb.png" 
        width="109" border="0">

      </a></td>
    <td valign="top"> 
<p>
<h2>Clumpy Heapsort</h2>
<font size="-1">version 0.2</font>
    </td>
  </tr>
</table>

<p>

What if you laid out a heapsort heap differently to improve locality
of reference, so that as the sort walks down the heap, it
accesses sets of locations that are close together rather than
scattered, in the hope of running faster by making more of its
accesses in cached memory?
<p>

The original Williams/Floyd heap (what's called "Vanilla" here) looks
like this:<pre>
                                 1
               2                                     3
      4                  5                6                 7
  8        9       10         11     12        13       14      15
...
</pre>

Here the left and right children of node n are nodes 2n and 2n+1.  At
each level of the tree, the distance between children and parents doubles.
Suppose instead we arrange a tree like this:<pre>
                                   1
                                 2   3
                    4        7           10          13
                  5   6    8   9       11   12     14   15
                 / \_________________
               16                     31
            17   18                 32   33
      19    22     25    28       33
     20 21 23 24 26  27 29 30    34 35 ...
</pre>
This builds "clumps" of three, then clumps of 15 out of clumps of 3,
then clumps of 15 * 17 = 255, then 65535, etc.  Every other step down
the tree is a distance of 1 or 2; three out of four steps are &lt; 10,
7/8 of steps are &lt;= 240, and so forth.  So this layout has
better locality of reference, although the relation between
the parent and child node addresses is less simple than in the Vanilla heap.
<p>

Another difference with a heap arranged this way is that it can become
more unbalanced than a Vanilla heap, for which the lowest nodes are
different by at most one level of depth.  With the Clumpy tree,
once a tree of, e.g., 15 elements and a depth of 4 is built, a depth 4
subtree is constructed next.  In the diagram above, you can see that a
tree of 29 nodes would have a depth of 8 in seven of its leaves (20,
21, 23... 29) and 4 in the other seven (6, 8, 9 ... 15).  Similar imbalance happens
just after trees of 255 and 65535 nodes become filled.
<p>

I tested the effect of Clumpy and other arrangements by writing a C
program that heapsorts arrays of random 32-bit numbers, with different
formulas plugged in (with #ifdefs) to set the memory layout.
A bash script runs the test over the
different variations of the program, and a Python program takes the
results and produces the graph below, using the <a
href="http://www.reportlab.org/rl_toolkit.html">ReportLab pdfgen</a>
library.  I should stress that this is a test with just one record size
(four bytes), one computer
(a 700MHz PPC G3), one cache configuration (256K of level-2) and
one configuration of memory (768 MB).
<p>

<a id="graph">
<h3>Rough Analysis</h3>

<a href="graph.pdf"><img src="graph.png">
<p>
<font size="-1">(Click for the pdf.)</font></a>
<p>

As has been pointed out by others who have worked on "cache-friendly,"
"cache- obvlivious," and disk-based heapsorts, improving the locality
of reference of a heapsort might improve efficiency by at most O(lg
n), while cache lines, pages sizes and access times grow with the
amount of memory accessed, and so although
this test shows some modified heapsorts running up to twice as fast as
Vanilla heapsort, this approach will never be a big win.
<p>

The horizontal axis of the graph is the log of the number of elements
of the set to be sorted.  The theoretical time complexity of heapsort,
assuming all memory accesses take the same time, is O( n lg n ).  The
vertical axis shows the linear <i>ratio</i> of the actual time taken to the
theoretical time.  So, for example, the green Vanilla curve runs
nearly flat for two and a half orders of magnitude of n and then starts
getting slower.
<p>

There are two important points on the x axis.  At 65536 entries (which
for 32-bit numbers means 256K bytes) two
things happen.  First, the level-2 cache of this particular computer
fills up.  Second, the Clumpy arrangement hits one
of the places (65535 nodes) where the tree balances perfectly.
<p>

What happens roughly for all the variations, is that some top part of
the heap-- around 16 levels-- tends to stay within level-2 cache
because access frequency is higher toward the root of the tree.
Then, as the problem size gets larger, paths down the tree spend more
and more of their time accessing the slower main memory.  Since the
total depth of the tree is O( lg n ), the fraction of a path in slower
memory is O(  ((lg n) - 16) / lg n ).  At first this grows like
(lg n)-16, which explains the nearly linear second half of the Vanilla
curve (with the horizontal axis a log).
Eventually, with many gigabytes of memory, the slowdown would level
out.  But main memory is many times slower than
level-2 cache.  The clearest measure
of that is that Vanilla slows down by a factor of five (compared to
O(n lg n)) with hardly a sign of flattening out.
<p>

The
second important point is the right edge of the graph.  At 2^27 entries
or about 512MB, the real memory is almost full.  Doubling the problem
size makes the program start swapping virtual memory pages, and the
performance goes down so badly I haven't had computer time to run the
test.  Sorts like the <a
href="http://www.diku.dk/forskning/performance-engineering/Jesper/heaplab/heapsurvey_html/node23.html">"external
heapsort"</a>, which merge sorted blocks in a heap arrangement, make
much better use of slow-to-access memory.
<p>

The code to calculate the Clumpy parent-child address relationship
involves five cases, and multiple multiplication, division and modulo
operations.  So Clumpy starts out about 100 times slower than the Vanilla
heapsort.  It hits one sweet spot at 255 entries, and another at
65535, and then slowly catches up with and overtakes Vanilla heapsort.
<p>

A few other arrangements tested are shown in the graph:
<p>

<b>Pages of 1022 aligned</b> groups the array into pages of 1024
entries, of which 1023 are used on the first page and 1022 on every
subsequent page.  The first page is a Vanilla heap, and every
subsequent page is a heap with the top node removed, so that the first
two nodes are sibling children (at locations 2 and 3 within the page)
of a parent node on an earlier page.  The reason is that heapsort always
accesses both children of every node on a path down the tree, and if
pages started with a single node instead of a pair, pairs of pages
would need to be fetched instead of single pages.  The word "aligned"
refers to this arrangement starting with two siblings, and all sibling
pairs on even boundaries.
<p>

However, memory areas the size of a page aren't really relevant to
level-2 cache.  The main advantage of this method over Clumpy is that
it is simpler to calculate the parent-child address relationship.
<p>

<b>Pages of 510 aligned</b> This is similar to Pages of 1022, with pages
of half the size and one less level of depth.
<p>

<b>Clumpy 510a</b>  This is like Pages of 510 aligned, but within the
pages there is a modified clumpy arrangement:<pre>
                               (1)
                             2     3
                  4 5                         6 7
    8 9          14 15
10 11  12 13  16 17  18 19 ...
</pre>
The (1) node is only used on the first page of the array.
This is supposed to improve Clumpy's locality by always putting
siblings together and on an even boundary.  This method has
calculation complexity between that of Clumpy and Pages of 510.
Clumpy 510a has a sweet spot at 511, and a lesser one at 130,816,
which matches a slight sweet spot for Pages of 510.
<p>

I haven't tested the paired-sibling arrangement in a non-paged Clumpy.
<p>

Why do Clumpy and Clumpy 510a finally flatten out more than Pages of
1022 and Pages of 510?  Probably even while chewing into uncached
memory, the Clumpies do multiple accesses to
the same cache lines on the way down a path, more often than the Pages
methods do.  Also, Clumpy
is coming out of one of its unbalanced- tree phases, which are worse
than the imbalances of the three paged methods, so it's not so much getting better as getting less worse.
<p>

It's not clear whether any of these arrangements take advantage of the
level-1 cache in a way that isn't spoiled by the additional code
complexity inside the inner loop.  The child- address calculation could
be unrolled to remove the case statements and reuse the results of some
calculations between levels of the tree...but I haven't tried that.
<p>

<h3>The Code</h3>

<?php
// The names here are as in the SRC directory,
// not as translated by the subst_url stuff below.
// In the listing in index.html they look like the original names,
//     but the URLs are modified if and as in subst_url.
$cmt = array(
"Makefile" => "compile code, create index.html,
	         run tests with 'make test', then 'make graph.pdf'",

"an_eqn.c" => "solves some linear equations to get constants in clumpy code",
"solveqn.c" => "linear equation solver",
"cache_friend.c" => "source for #ifdef'ed variations of heapsort",

"cache_friends_test" => "test loop run by Makefile",
"cache_friends_plot.py" => "generates PDF graph from test ouput files,
                  easiest to run with 'make graph.pdf'",
"clumpy_sequence.py" => "generate the clumpy index sequence for <a href='https://oeis.org/A082008'>OEIS A082008</a>.",
"graph.pdf" => "PDF version of the graph above",
"graph.png" => "PNG graph image shown above"
);

$subst_url = array(
"Makefile" => "Makefile.txt",
"cache_friends_test" => "cache_friends_test.txt",
"cache_friends_plot.py" =>  "cache_friends_plot.py_txt",
"clumpy_sequence.py" => "clumpy_sequence.py_txt"
);

echo "<table>\n";
exec( "/bin/ls -ld Makefile an_eqn.c  solveqn.c cache_friend.c \
cache_friends_test cache_friends_plot.py graph.pdf graph.png \
clumpy_sequence.py", $lines );
foreach( $lines as $line ) {
   echo "<tr>\n";
   $parts=explode( " ", $line );
   foreach( $parts as $path ) {
   } ; // Empty loop leaves path as last part
   $pieces=explode( "/", $path );
   foreach( $pieces as $file ) {
   } ; // Empty loop leaves file as last part
   $size=ceil( filesize($file) / 1024 );
   echo '<td valign="top">';
   if( isset( $subst_url[$file] ) )  $url = $subst_url[ $file ];
   else                              $url = $file;
   if( $line[0]=="d" ) {
      echo "<code><a href=\"$url/index.html\"\n>$file/</a></code>";
   } else {
      echo "<code><a href=\"$url\"\n>$file</a> (${size}K)</code>";
   }
   echo "</td>\n";
   if( isset( $cmt[$file] ) ) {
     echo "<td><font size=\"-1\">$cmt[$file]</font></td>\n";
   }
   echo "</tr>\n\n";
}
echo "</table>\n";
?>
<p>

<font size="-1">
This page was made with a
<a href="makeindex.php_txt">php script</a>.
Last change <?php date_default_timezone_set('America/New_York');
echo date("F d, Y  H:i:s T.", filemtime( "datefile" )) ?><br>

--<a href="mailto:sw_at_tiac_remove-this_dot_net">Steve Witham</a>
Up to my <a href="http://www.mac-guyver.com/switham">home page</a>.


</BODY>
</HTML>
