<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Preshing on Programming</title>
	<atom:link href="http://preshing.com/feed" rel="self" type="application/rss+xml" />
	<link>http://preshing.com</link>
	<description></description>
	<lastBuildDate>Thu, 09 May 2013 17:27:16 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5</generator>
		<item>
		<title>Introducing Mintomic: A Small, Portable Lock-Free API</title>
		<link>http://preshing.com/20130505/introducing-mintomic-a-small-portable-lock-free-api</link>
		<comments>http://preshing.com/20130505/introducing-mintomic-a-small-portable-lock-free-api#comments</comments>
		<pubDate>Sun, 05 May 2013 15:39:12 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=5259</guid>
		<description><![CDATA[Today, I&#8217;m releasing an open source library called Mintomic. Mintomic is an API for low-level lock-free programming in C and C++. It runs on a variety of platforms including Windows, Linux, MacOS, iOS and Xbox 360. Mintomic&#8217;s goals are to &#8230; <a href="http://preshing.com/20130505/introducing-mintomic-a-small-portable-lock-free-api">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Today, I&#8217;m releasing an open source library called Mintomic. Mintomic is an API for low-level <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming">lock-free programming</a> in C and C++. It runs on a variety of platforms including Windows, Linux, MacOS, iOS and Xbox 360. Mintomic&#8217;s goals are to be efficient, straightforward, and (mostly) compatible with older compilers.</p>
<table width="100%" style="font-size: 12px;">
<tr>
<td width="50%"><center><a href="http://mintomic.github.io/"><img src="http://preshing.com/wp-content/uploads/2013/05/mintomic-documentation.png" alt="" width="54" height="49" class="alignnone size-full wp-image-5371" /><br />View the documentation</a></center></td>
<td width="50%"><center><a href="https://github.com/mintomic/mintomic"><img src="http://preshing.com/wp-content/uploads/2013/05/mintomic-github.png" alt="" width="115" height="35" class="alignnone size-full wp-image-5361" /><br />View on Github</a></center></td>
</tr>
</table>
<p>Mintomic (short for &#8220;minimal atomic&#8221;) draws a lot of inspiration from the C/C++11 atomic library standards, with an important exception: In Mintomic, all atomic operations are &#8220;relaxed&#8221;. The only way to enforce memory ordering is with explicit fences. Here&#8217;s an example taken from <a href="http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu">my post about weak hardware ordering</a>, rewritten using Mintomic:</p>
<pre>
// Define a shared atomic variable:
<span class="highlight">mint_atomic32_t</span> flag;

void IncrementSharedValue10000000Times(TimeWaster&#038; tw)
{
    int count = 0;
    while (count &lt; 10000000)
    {
        tw.wasteRandomCycles();

        // Atomic read-modify-write operation:
        if (<span class="highlight">mint_compare_exchange_strong_32_relaxed</span>(&#038;flag, 0, 1) == 0)
        {
            <span class="highlight">mint_thread_fence_acquire</span>();    // Acquire fence
            g_sharedValue++;
            <span class="highlight">mint_thread_fence_release</span>();    // Release fence

            // Atomic store:
            <span class="highlight">mint_store_32_relaxed</span>(&#038;flag, 0);
            count++;
        }
    }
}
</pre>
<p><span id="more-5259"></span>I started Mintomic out of the desire for a better way to share examples of lock-free programming on this blog. So why not just use C++11 atomics instead?</p>
<ul>
<li><strong>Availability.</strong> Some readers are likely stuck using Visual Studio 2010, GCC 4.3, or an older compiler in which C++11 atomics are not available.</li>
<li><strong>Efficiency.</strong> The C++11 atomic library standard does not guarantee an efficient implementation, only a correct one. Technically, <a href="http://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free">it does not even guarantee</a> your code will be lock-free. Mintomic, on the other hand, always aims to generate optimal lock-free machine code on the platforms it supports.</li>
<li><strong>Lower learning curve.</strong> Mintomic asks you to keep fewer concepts in your head than C++11 atomics. I actually hope Mintomic will serve as a stepping stone towards learning C++11 atomics, since it maps to a subset of the same functionality and uses a similar naming convention.</li>
</ul>
<p>In particular, because all of Mintomic&#8217;s atomic operations are relaxed, there are no <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming#sequential-consistency">sequentially consistent</a> data types. This is different from C++11&#8217;s atomic library standard, where every atomic operation is sequentially consistent by default. To remind you of the difference, all atomic functions in Mintomic have the <code>_relaxed</code> suffix on their names.</p>
<p>Mintomic comes with a test suite which you can build and run yourself. The only requirement is <a href="http://www.cmake.org/">CMake</a>. This test suite helps ensure that Mintomic was implemented correctly on each platform, while bringing the library to life with a working set of examples. Here&#8217;s what its output looks like as a 64-bit Windows application:</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/05/testsuite_win64.png" alt="" width="553" height="193" class="aligncenter size-full wp-image-5351" /></p>
<p>And here it is on an iPhone 4S:</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/05/testsuite_iphone4s.png" alt="" width="464" height="231" class="aligncenter size-full wp-image-5355" /></p>
<p>Every test case with a <code>_fail</code> suffix on its name contains an intentional bug. These tests are allowed to fail, and in general, designed to do so, in the same spirit as <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act">previous</a> <a href="http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu">blog posts</a> written here. You&#8217;ll notice that on 64-bit Windows, more of these test cases pass (10/11) than on the iPhone (6/11). The point is to show how incorrect lock-free code may succeed on certain platforms out of luck, depending on things like <a href="http://preshing.com/20120625/memory-ordering-at-compile-time">compiler ordering</a>, machine word size and <a href="http://preshing.com/20120930/weak-vs-strong-memory-models">hardware memory model</a>.</p>
<p>To support this test suite, Mintomic also comes with <a href="http://mintomic.github.io/mintthreads/">MintThreads</a>, a portable C module to create and manipulate threads and semaphores, and <a href="http://mintomic.github.io/mintpack/">MintPack</a>, a collection of useful data structures in C++.</p>
<p>I&#8217;ll admit, the existence of a library like Mintomic is a bit funny, for several reasons. First, Mintomic is squarely focused on <em>low-level</em> lock-free programming, providing only relaxed atomics and standalone memory fences &#8212; two things which some C++ experts claim you should avoid using. Second, lock-free programming is definitely not something every programmer will need to do, and in a real project, it typically accounts for a tiny fraction of the codebase. And finally, some programmers believe that lock-free programming can always be encapsulated using patterns and techniques such as the actor model, goroutines, or functional programming.</p>
<p>Meanwhile, in the games industry at least, we use Mintomic-like primitives to achieve real-world performance gains on a regular basis. Not out of hubris, but because we have deadlines and performance targets, and there are (rare) occasions when it genuinely makes a difference. I&#8217;m certain that in the long term, low-level lock-free programming will continue to play a role in games, audio synthesis, financial trading software &#8212; anywhere that parallel, high-contention tasks must be optimized for latency. Therefore, it&#8217;s worth trying to stop messing up, which is why I keep blogging about it.</p>
<p>I think Mintomic turned out decent, though there are still things to improve. Feedback and suggestions are welcome! In the meantime, stay tuned for further examples using Mintomic.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20130505/introducing-mintomic-a-small-portable-lock-free-api/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>View Your Filesystem History Using Python</title>
		<link>http://preshing.com/20130115/view-your-filesystem-history-using-python</link>
		<comments>http://preshing.com/20130115/view-your-filesystem-history-using-python#comments</comments>
		<pubDate>Tue, 15 Jan 2013 05:56:49 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=5210</guid>
		<description><![CDATA[Sometimes, it&#8217;s useful to look back on your filesystem history. For example, after installing some new software, you might want to know which files have changed on your hard drive. Or, if you&#8217;re a programmer getting started on a new &#8230; <a href="http://preshing.com/20130115/view-your-filesystem-history-using-python">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Sometimes, it&#8217;s useful to look back on your filesystem history.</p>
<p>For example, after installing some new software, you might want to know which files have changed on your hard drive. Or, if you&#8217;re a programmer getting started on a new project, you may need to follow a complex and unfamiliar build process. A list of recently modified files can reveal a lot about how that build process works.</p>
<p>Here&#8217;s a <a href="https://gist.github.com/4536459">short Python script</a> to create such a list. It lists the contents of a folder recursively, sorted by modification time.</p>
<p><a href="https://gist.github.com/4536459"><img src="http://preshing.com/wp-content/uploads/2013/01/list_modifications.png" width="170" height="32" class="aligncenter size-full wp-image-5223" /></a></p>
<p><span id="more-5210"></span></p>
<p>As a simple example, I ran it after setting up a fresh copy of my <a href='https://github.com/preshing/RandomSequence'>random number sequence</a> project. Here&#8217;s the output (with some lines deleted to save space):</p>
<pre>2013-01-14 21:44:29       5564 .\build\Testing\Temporary\LastTest.log
2013-01-14 21:44:29         29 .\build\Testing\Temporary\CTestCostData.txt
------------------------------
2013-01-14 21:28:38         91 .\build\Win32\Release\ALL_BUILD\ALL_BUILD.lastbuildstate
2013-01-14 21:28:38       1560 .\build\Win32\Release\ALL_BUILD\custombuild.command.1.tlog
2013-01-14 21:28:38       6386 .\build\Win32\Release\ALL_BUILD\custombuild.read.1.tlog
2013-01-14 21:28:38        674 .\build\Win32\Release\ALL_BUILD\custombuild.write.1.tlog
2013-01-14 21:28:38         51 .\build\CMakeFiles\generate.stamp
2013-01-14 21:28:37         91 .\build\RandomSequence.dir\Release\RandomSequence.lastbuildstate
2013-01-14 21:28:37        678 .\build\RandomSequence.dir\Release\mt.command.1.tlog
2013-01-14 21:28:37        818 .\build\RandomSequence.dir\Release\mt.read.1.tlog
2013-01-14 21:28:37        446 .\build\RandomSequence.dir\Release\mt.write.1.tlog
2013-01-14 21:28:37       7680 .\build\Release\RandomSequence.exe
...
------------------------------
2013-01-14 21:28:21         86 .\build\CMakeFiles\cmake.check_cache
2013-01-14 21:28:21      12856 .\build\CMakeCache.txt
2013-01-14 21:28:21       3712 .\build\RandomSequence.sln
2013-01-14 21:28:21        270 .\build\CMakeFiles\TargetDirectories.txt
2013-01-14 21:28:21        391 .\build\CTestTestfile.cmake
2013-01-14 21:28:21       1586 .\build\cmake_install.cmake
2013-01-14 21:28:21       4204 .\build\CMakeFiles\generate.stamp.depend
2013-01-14 21:28:21      25207 .\build\ZERO_CHECK.vcxproj
2013-01-14 21:28:21        832 .\build\ZERO_CHECK.vcxproj.filters
...
------------------------------
2013-01-14 21:27:40        959 .\randomsequence.h
2013-01-14 21:27:40        416 .\.git\index
2013-01-14 21:27:40       1255 .\main.cpp
2013-01-14 21:27:40        714 .\README.md
2013-01-14 21:27:40        246 .\CMakeLists.txt
2013-01-14 21:27:40         12 .\.gitignore
2013-01-14 21:27:40        336 .\.git\config
2013-01-14 21:27:40        201 .\.git\logs\refs\heads\master
2013-01-14 21:27:40        201 .\.git\logs\HEAD
...</pre>
<p>The horizontal dashes separate modifications that are 10 or more seconds apart, which helps organize the files visually into groups. In reverse order, you can see the groups of files created by <code>git clone</code>, project files generated by <code>cmake</code>, the build output from <code>cmake --build</code>, and a couple of files written by <code>ctest</code>.</p>
<p>I&#8217;ve used this kind of script to help make sense of the filesystem on Ubuntu, and to figure out where files were written on MacOS X using the App Store.</p>
<h2 id='commandline_options'>Command-Line Options</h2>
<p>Running with no options or with <code>--help</code> displays the following help message:</p>
<pre>Usage: list_modifications.py [options] path [path2 ...]

Options:
  -h, --help    show this help message and exit
  -g SECS       set threshold for grouping files
  -f EXC_FILES  exclude files matching a wildcard pattern
  -d EXC_DIRS   exclude directories matching a wildcard pattern</pre>
<p>You can filter the output using <code>-f</code> and <code>-d</code>. For example:</p>
<pre>list_modifications.py -d obj* -f *.log -f *.bin -g 30 .git build\CMakeFiles</pre>
<p>The above command lists the contents of the <code>.git</code> and <code>build\CMakeFiles</code> folders, excluding the <code>objects</code> subfolder and any files ending in <code>.log</code> or <code>.bin</code>. It also groups files modified within 30 seconds of each other, instead of the default 10.</p>
<h2 id='a_quick_look_at_the_code'>A Quick Look at the Code</h2>
<p>This script is a pretty good example of the kind of problem Python can solve quickly using very little code. Here&#8217;s a quick run-through.</p>
<pre>parser = <span class="highlight">optparse</span>.OptionParser(usage='Usage: %prog [options] path [path2 ...]')
parser.add_option('-g', action='store', type='long', dest='secs', default=10,
                  help='set threshold for grouping files')
parser.add_option('-f', action=<span class="highlight">'append'</span>, type='string', dest='exc_files', default=<span class="highlight">[]</span>,
                  help='exclude files matching a wildcard pattern')
parser.add_option('-d', action='append', type='string', dest='exc_dirs', default=[],
                  help='exclude directories matching a wildcard pattern')
options, <span class="highlight">roots</span> = parser.parse_args()</pre>
<p>This block of code takes care of all command-line option parsing using the built-in <code>optparse</code> module. <code>optparse</code> is deprecated as of Python 2.7, but it&#8217;s handy and has been available since Python 2.3. The <code>--help</code> option is handled automatically.</p>
<p>The <code>-f</code> option uses the <code>'append'</code> action with a default of <code>[]</code>, which means the user can specify <code>-f</code> multiple times, creating a list. In the previous example, we end up with <code>options.exc_files</code> set to <code>['*.log', '*.bin']</code>. Any leftover positional arguments are assigned to <code>roots</code> as another list; in the previous example, <code>roots</code> becomes <code>['.git', 'build\\CMakeFiles']</code>.</p>
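<p>To make the behaviour of the <code>'append'</code> action concrete, here is a minimal, self-contained sketch that feeds the example command line to an equivalent parser. The <code>build_parser</code> and <code>parse</code> helpers are illustrative names that do not appear in the script, and <code>type='int'</code> stands in for <code>'long'</code> on the assumption that the sketch should also run on Python 3.</p>
<pre>import optparse


def build_parser():
    """Build a fresh parser so the list defaults are not shared between runs."""
    parser = optparse.OptionParser(usage='Usage: %prog [options] path [path2 ...]')
    parser.add_option('-g', action='store', type='int', dest='secs', default=10,
                      help='set threshold for grouping files')
    parser.add_option('-f', action='append', type='string', dest='exc_files', default=[],
                      help='exclude files matching a wildcard pattern')
    parser.add_option('-d', action='append', type='string', dest='exc_dirs', default=[],
                      help='exclude directories matching a wildcard pattern')
    return parser


def parse(argv):
    """Parse an explicit argument list instead of sys.argv."""
    return build_parser().parse_args(argv)


options, roots = parse(['-d', 'obj*', '-f', '*.log', '-f', '*.bin',
                        '-g', '30', '.git', 'build\\CMakeFiles'])
print(options.exc_dirs)   # each -d adds one pattern to this list
print(options.exc_files)  # each -f adds one pattern to this list
print(options.secs)
print(roots)              # leftover positional arguments</pre>
<p>Building the parser inside a function is a deliberate choice in this sketch: the <code>[]</code> defaults are ordinary list objects, so reusing one parser for several parses would keep appending to the same lists.</p>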
<pre>def iterFiles(options, roots):
    """" A generator to enumerate the contents of directories recursively. """
    for root in roots:
        for dirpath, dirnames, filenames in <span class="highlight">os.walk</span>(root):
            name = os.path.split(dirpath)[1]
            if <span class="highlight">any(fnmatch.fnmatch(name, w) for w in options.exc_dirs)</span>:
                <span class="highlight">del dirnames[:]</span>  # Don't recurse here
                continue
            for fn in filenames:
                if any(fnmatch.fnmatch(fn, w) for w in options.exc_files):
                    continue
                path = os.path.join(dirpath, fn)
                mtime = os.path.getmtime(path)
                size = os.path.getsize(path)
                <span class="highlight">yield</span> mtime, size, path</pre>
<p><code>iterFiles</code> looks like a function definition, but the presence of the <code>yield</code> statement in the body means it actually defines a <a href='http://docs.python.org/2/tutorial/classes.html#generators'>generator</a>. As such, calling <code>iterFiles()</code> does not actually execute the function. It returns an iterator, which you can then use in a <code>for</code> loop, as we&#8217;ll see later.</p>
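<p>The deferred execution described above is easy to observe in isolation. The following sketch uses a made-up generator, <code>countdown</code>, which records in a list when its body starts running; none of these names come from the script.</p>
<pre>def countdown(n, log):
    """A generator: the body does not start running until iteration begins."""
    log.append('body started')
    while n > 0:
        yield n
        n -= 1


log = []
it = countdown(3, log)   # creates the generator object; no body code has run yet
print(log)
print(list(it))          # iterating drives the body to completion
print(log)</pre>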
<p><code>iterFiles</code> uses the <a href='http://docs.python.org/2/library/os.html#os.walk'><code>os.walk</code></a> generator, which lets us modify the contents of <code>dirnames</code> in-place during iteration. In particular, we clear the contents of the list using <code>del dirnames[:]</code> to avoid descending into certain subdirectories.</p>
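<p>As a rough illustration of that pruning trick, the sketch below builds a small throwaway directory tree and walks it with and without an exclusion pattern. It assumes Python 3, and <code>list_files</code>, <code>make_tree</code> and the file names are invented for the example.</p>
<pre>import fnmatch
import os
import tempfile


def list_files(root, exc_dirs):
    """Return relative file paths under root, skipping excluded directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        name = os.path.split(dirpath)[1]
        if any(fnmatch.fnmatch(name, w) for w in exc_dirs):
            del dirnames[:]   # os.walk consults this same list, so nothing below is visited
            continue
        for fn in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, fn), root))
    return sorted(found)


def make_tree(root):
    """Create keep/a.txt, objects/b.txt and objects/deep/c.txt under root."""
    for rel in ('keep/a.txt', 'objects/b.txt', 'objects/deep/c.txt'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w') as f:
            f.write('x')


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    print(list_files(root, []))         # every file in the tree
    print(list_files(root, ['obj*']))   # objects and everything beneath it are skipped</pre>
<p>Note that the pruning only works because <code>os.walk</code> visits directories top-down by default; assigning a new list to <code>dirnames</code> instead of modifying it in place would have no effect.</p>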
<p>In the above code, the argument passed to <code>any</code>, namely <code>fnmatch.fnmatch(name, w) for w in options.exc_dirs</code>, is known as a <a href="http://www.python.org/dev/peps/pep-0289/">generator expression</a>. It&#8217;s a lot like a <a href='http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions'>list comprehension</a>, except that it produces its items one at a time instead of building a list, and we&#8217;re allowed to omit the extra parentheses since it is the only argument to the function. In this case, the <code>any</code> function will return <code>True</code> if <code>fnmatch.fnmatch(name, w)</code> returns <code>True</code> for any pattern in <code>options.exc_dirs</code>.</p>
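<p>For comparison, here is a small sketch that writes the same test both ways. The helper names <code>is_excluded</code> and <code>is_excluded_with_list</code> are made up for this example; both return the same answer, and the difference is only in how much work is done before <code>any</code> can decide.</p>
<pre>import fnmatch


def is_excluded(name, patterns):
    """True if name matches at least one wildcard pattern."""
    # Generator expression: results are produced one at a time, and any()
    # stops asking for more as soon as it sees a true value.
    return any(fnmatch.fnmatch(name, w) for w in patterns)


def is_excluded_with_list(name, patterns):
    """Same result, but the whole list of booleans is built before any() runs."""
    return any([fnmatch.fnmatch(name, w) for w in patterns])


print(is_excluded('objects', ['*.tmp', 'obj*']))
print(is_excluded('src', ['*.tmp', 'obj*']))</pre>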
<pre>ptime = 0
for mtime, size, path in <span class="highlight">sorted</span>(iterFiles(options, roots), reverse=True):
    if ptime - mtime >= options.secs:
        print(<span class="highlight">'-' * 30</span>)
    timeStr = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime))
    print('%s %10d %s' % (timeStr, size, path))
    ptime = mtime</pre>
<p>Here, we feed the <code>iterFiles</code> generator to <code>sorted</code>, resulting in a sorted list of 3-tuples. Tuples compare item by item, so the list is ordered by the first item in each tuple, the modification time, and <code>reverse=True</code> puts the most recently modified files first, which is exactly what we want. We loop through, writing one line of formatted output for each tuple. Since Python lets us multiply a string by an integer, <code>'-' * 30</code> is used as a shortcut for drawing horizontal lines.</p>
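<p>To try the grouping logic without touching the filesystem, the loop can be wrapped in a function and fed hand-written tuples. This is only a sketch: <code>format_listing</code> is a hypothetical helper, and the timestamps, sizes and file names are made up.</p>
<pre>import time


def format_listing(entries, secs=10):
    """Return output lines for (mtime, size, path) tuples, newest first.

    A dashed line is added wherever two consecutive entries are at least
    secs seconds apart, mirroring the loop above.
    """
    lines = []
    ptime = 0
    for mtime, size, path in sorted(entries, reverse=True):
        if ptime - mtime >= secs:
            lines.append('-' * 30)
        time_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime))
        lines.append('%s %10d %s' % (time_str, size, path))
        ptime = mtime
    return lines


# Two files written one second apart, and one written 60 seconds earlier.
sample = [(1000000000, 12, 'old.txt'),
          (1000000060, 34, 'new_a.txt'),
          (1000000061, 56, 'new_b.txt')]
for line in format_listing(sample):
    print(line)</pre>
<p>Because <code>ptime</code> starts at 0, the first comparison is always negative, so no separator is printed above the newest file.</p>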
<p>That&#8217;s all there is to it! Hopefully, some readers have managed to pick up a few nuggets of Pythonic goodness along the way.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20130115/view-your-filesystem-history-using-python/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>This Hash Table Is Faster Than a Judy Array</title>
		<link>http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array</link>
		<comments>http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array#comments</comments>
		<pubDate>Mon, 07 Jan 2013 10:50:22 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4915</guid>
		<description><![CDATA[In game development, we use associative maps for many things. Dynamic loading, object reflection, rendering, shader management. A lot of them fit the following usage pattern: The keys and values have simple integer or pointer types. The keys are effectively &#8230; <a href="http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In game development, we use <a href="http://en.wikipedia.org/wiki/Associative_array">associative maps</a> for many things. Dynamic loading, object reflection, rendering, shader management.</p>
<p>A lot of them fit the following usage pattern:</p>
<ul>
<li>The keys and values have simple <strong>integer</strong> or <strong>pointer</strong> types.</li>
<li>The keys are effectively <strong>random</strong>.</li>
<li>The basic operation is <strong>lookup-or-insert:</strong> Look for an existing item first, and add one if not found. Usually, the item is found.</li>
<li>Deletes happen infrequently, and when they do, they tend to happen in bulk.</li>
</ul>
<p><img src="http://preshing.com/wp-content/uploads/2013/01/integer_map.png" alt="" width="226" height="120" class="aligncenter size-full wp-image-4981" /></p>
<p>Occasionally, one of these maps is identified as a performance bottleneck. It raises the question: What&#8217;s the best data structure to use in this scenario?</p>
<p><span id="more-4915"></span>I wrote a small C++ application to compare two top candidates, a Judy array and a finely-tuned hash table, using various experiments. All the code is available <a href="https://github.com/preshing/CompareIntegerMaps">on GitHub</a>.</p>
<p><a href="https://github.com/preshing/CompareIntegerMaps"><img src="http://preshing.com/wp-content/uploads/2013/01/github-integermap.png" alt="" width="223" height="34" class="aligncenter size-full wp-image-5009" /></a></p>
<p>A <a href="http://judy.sourceforge.net/">Judy array</a> &#8212; specifically, the <code>JudyL</code> variant &#8212; is an efficient mapping of integer keys to integer values. It is optimized to avoid CPU cache misses as often as possible. Its memory consumption scales smoothly with number of entries, even when the keys are sparsely distributed. It&#8217;s a real feat of engineering &#8212; hats off to Doug Baskins.</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/01/hashtable.png" alt="" width="37" height="90" class="alignright size-full wp-image-5169" />A hash table is a relatively well-known data structure. For this post, I&#8217;ve written a custom hash table based on <a href="http://en.wikipedia.org/wiki/Open_addressing">open addressing with linear probing</a>. This particular hash table will dynamically resize itself as the population grows.</p>
<p>Now, there are already several existing shootouts between Judy arrays and hash tables on the web. Some of them are good, but in this post, I wanted to target a very specific usage pattern. Also, I wrote the single most efficient integer hash table I could, with the explicit goal of taking Judy down &#8212; no smorgasbord of lame hash tables here. Finally, I took considerable pains to ensure that the benchmarking method produced fair, precise and detailed results, as outlined in the <a href="https://github.com/preshing/CompareIntegerMaps/blob/master/README.md" rel="nofollow">README</a>. Feedback is welcome.</p>
<h2>Adding Items to the Map</h2>
<p>In this experiment, more than 10000000 <a href="http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers">unique 32-bit random keys</a> are inserted into each map, and the insertion times are measured. The timings were taken on my aging Core 2 Duo E6300.</p>
<p>Each point on the graph represents the time to insert the <em>N</em>th element into a map containing <em>N</em> &#8211; 1 elements, averaged around small neighborhoods of <em>N</em>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/01/insert.png" alt="" width="516" height="215" class="aligncenter size-full wp-image-4956" /></p>
<p>First, you&#8217;ll notice all the spikes in the hash table insertion times. These spikes are due to dynamic resizing of the hash table, which comes at a cost. Once the table becomes 75% full &#8212; for example, when the <strong>12288</strong>th item is inserted into a hash table of size <strong>16384</strong> &#8212; we dynamically allocate a new hash table with double the size, copy all existing items to new hash slots, and delete the old table. As you&#8217;d expect, this operation is slow. For example, where most items take between 30 and 40 nanoseconds to insert, the 12288th item takes 0.4 <em>milliseconds</em>.</p>
<p>That&#8217;s expensive, but remember, resizing happens infrequently. In practice, these one-time costs are amortized against thousands of other insert operations which run very quickly. Another way to look at it is that the <em>area</em> under the red curve is much smaller than the area under the blue curve, meaning that the <em>total</em> CPU time spent inserting into the hash table is much less, even with the presence of spikes. And in cases where it really matters, you could always pre-allocate the hash table to a large enough size, eliminating those spikes completely.</p>
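<p>The resizing policy is easier to follow in code. What follows is a deliberately simplified Python sketch of an open-addressing table with linear probing that doubles once it is 75% full (75% of 16384 is 12288). It is not the C++ hash table benchmarked in this post: the <code>IntMap</code> class, its method names and the placeholder hash <code>key % size</code> are all assumptions made for illustration.</p>
<pre>class IntMap:
    """Open-addressing hash table with linear probing for integer keys."""

    def __init__(self, size=8):
        self.keys = [None] * size      # None marks an empty slot
        self.values = [None] * size
        self.population = 0

    def _slot(self, key):
        """Index of the slot holding key, or of the empty slot where it belongs."""
        size = len(self.keys)
        i = key % size                 # placeholder hash, adequate for random keys
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % size         # linear probing: try the next slot
        return i

    def lookup_or_insert(self, key, default=0):
        """Return the value stored for key, inserting default if it is absent."""
        i = self._slot(key)
        if self.keys[i] is None:
            self.keys[i] = key
            self.values[i] = default
            self.population += 1
            if self.population * 4 >= len(self.keys) * 3:
                self._grow()           # the table has just become 75% full
                i = self._slot(key)
        return self.values[i]

    def _grow(self):
        """Allocate a table twice as large and re-insert every item."""
        old_keys, old_values = self.keys, self.values
        self.keys = [None] * (len(old_keys) * 2)
        self.values = [None] * (len(old_values) * 2)
        for key, value in zip(old_keys, old_values):
            if key is not None:
                i = self._slot(key)
                self.keys[i] = key
                self.values[i] = value


table = IntMap(size=16384)
for n in range(12287):
    table.lookup_or_insert(n * 7919)
print(len(table.keys))                 # still the original size
table.lookup_or_insert(12287 * 7919)   # the 12288th item reaches the 75% threshold
print(len(table.keys))                 # the table has doubled</pre>
<p>Passing a large enough <code>size</code> up front corresponds to the pre-allocation mentioned above: <code>_grow</code> is then never called, so no single insert pays the copying cost.</p>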
<p>Judy makes heavy use of the memory allocator, so to show it in its best light, I hardwired the Judy array to <a href="http://g.oswego.edu/dl/html/malloc.html">DLMalloc</a>. DLMalloc is a great allocator, and helps avoid the possibility of hidden performance pitfalls on Windows, such as <a href="http://preshing.com/20110723/finding-bottlenecks-by-random-breaking">DLL runtime overhead</a> and (God forbid) the <a href="http://preshing.com/20110717/the-windows-heap-is-slow-when-launched-from-the-debugger">debug system heap</a>.</p>
<p>It should be acknowledged that Judy insertion times are faster in cases when the keys are <em>not</em> random; however, that&#8217;s not what we&#8217;re testing in this experiment.</p>
<h2>Finding Existing Items in the Map</h2>
<p>Usually, the lookup-or-insert operation finds that the item already exists in the map. That makes lookup times even more significant than insert times.</p>
<p>In this next experiment, each map is filled to various populations. At each population, thousands of random lookups are performed, and the average time per lookup is plotted. Here, the lookup function is actually the same as the insert function used in the previous experiment &#8212; the only difference is that this time, we call it using keys which are known to already exist in the map.</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/01/lookup.png" alt="" width="516" height="215" class="aligncenter size-full wp-image-4916" /></p>
<p>At all populations, hash table lookups are twice as fast as Judy array lookups, give or take. I suspect part of the reason is that hash tables cause even fewer cache misses than Judy. With a good hash function, most lookups should be found in the first memory address checked; if not, linear probing ensures they are likely to be found in the same cache line. Notice that when the population reaches <strong>98304</strong> items, the hash table stops fitting entirely within the CPU&#8217;s 2 MB L2 cache, which is when the lookup times begin to climb.</p>
<p>Because the keys are random, the choice of hash function doesn&#8217;t matter too much for this experiment. Nonetheless, I borrowed MurmurHash3&#8217;s <a href="http://code.google.com/p/smhasher/wiki/MurmurHash3">integer finalizer</a> as the hash function. It consists of a few XORs, multiplies and shifts, and seems to distribute the hashes just as effectively when the keys are not random.</p>
<pre>
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
</pre>
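<p>For readers who want to experiment with the finalizer outside of C++, here is a rough Python equivalent. The snippet above relies on unsigned 32-bit arithmetic wrapping around on its own; Python integers never overflow, so this sketch reduces each product modulo 2<sup>32</sup> explicitly. The function name <code>fmix32</code> is simply a label chosen for the example.</p>
<pre>def fmix32(h):
    """MurmurHash3-style 32-bit finalizer with explicit wrap-around."""
    h %= 2 ** 32                       # treat the input as an unsigned 32-bit value
    h ^= h >> 16
    h = (h * 0x85ebca6b) % 2 ** 32
    h ^= h >> 13
    h = (h * 0xc2b2ae35) % 2 ** 32
    h ^= h >> 16
    return h


# Print the hashes of a few sequential keys.
for key in range(4):
    print(key, hex(fmix32(key)))</pre>
<p>Each step is reversible (an XOR with a right shift, or a multiplication by an odd constant modulo 2<sup>32</sup>), so distinct 32-bit inputs always produce distinct outputs.</p>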
<p>I chose not to benchmark the delete operation, because under the given usage pattern, deletes tend to be one-time operations which happen in bulk, such as when unloading a section of a game world. Still, one cool thing about hash tables with linear probing is that the <code>Delete</code> method needs only to shuffle existing entries around. This hash table also offers a <code>Compact</code> method, to optionally reclaim memory. Finally, you could write an even faster delete function if you require that <code>Compact</code> is always called at the end of each bulk delete.</p>
<h2>Polluting the CPU Cache Between Operations</h2>
<p>The above experiments are purely micro-benchmarks, which makes them somewhat unrealistic. A real application would run plenty of other code in between operations on the map. This would continuously evict parts of the map from the CPU cache, which would have a direct impact on the map&#8217;s performance.</p>
<p>With that in mind, I added some artificial cache pollution to both experiments, and ran them again. This time, random chunks of memory were manipulated between each operation. Using this strategy, I collected two extra datasets for each data structure, and plotted them below. The thin lines have an average of <strong>1000</strong> bytes of cache pollution between operations, while the thick lines have an average of <strong>10000</strong> bytes.</p>
<p><center><img src="http://preshing.com/wp-content/uploads/2013/01/insert-cache-stomp.png" alt="" width="266" height="169" class="alignnone size-full wp-image-5100" /> &nbsp;&nbsp;&nbsp; <img src="http://preshing.com/wp-content/uploads/2013/01/lookup-cache-stomp.png" alt="" width="266" height="169" class="alignnone size-full wp-image-5101" /></center></p>
<p>As you can see, cache pollution makes all of the timings slower, especially at larger populations. Regardless, the hash table still runs roughly twice as fast as the Judy array at nearly all populations and pollution levels.</p>
<p>Obviously, the most useful performance measurements would be taken in a real application using a consistently reproducible test case. For such measurements, an <code><a href="http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis">APIProfiler</a></code>-like profiling technique comes in handy.</p>
<h2>Memory Consumption</h2>
<p>Judy is known for its memory-efficiency. So, how does its memory consumption compare to that of the hash table? To find the answer, I called <code>dlmalloc_stats</code>, a function which includes the overhead of the memory allocations themselves, an intrinsic cost which is all-too-easily overlooked. This graph plots the total bytes consumed by each map, divided by the total number of items in the map.</p>
<p><img src="http://preshing.com/wp-content/uploads/2013/01/memory.png" alt="" width="495" height="215" class="aligncenter size-full wp-image-5202" /></p>
<p>You can see that the hash table&#8217;s memory consumption follows a sawtooth pattern, doubling in size each time the 75% threshold is reached. Judy&#8217;s memory consumption, on the other hand, scales much more smoothly, and tends to hover around the lower end of the hash table&#8217;s amount. Still, there are some points where the two lines cross, which means that the hash table actually takes less memory at certain populations. Beyond 1000 items, though, those cases are generally the exception.</p>
<p>Judy also has the interesting ability to consume even less memory when the keys are not distributed randomly. In particular, when all the keys are clustered around a small range, its memory consumption tends towards that of a plain array. That&#8217;s impressive, but again, not the usage pattern we&#8217;re testing in this experiment.</p>
<h2>Conclusion</h2>
<p>For the given usage pattern, a Judy array is not the fastest associative map. But for general use, Judy arrays tend to be among the fastest associative maps while also being among the most memory-efficient, which is what made them a contender here in the first place. As a bonus, Judy arrays are always sorted. That means you can iterate sequentially through the map at any moment, something not possible using a hash table &#8212; though this property generally does not bring any benefit in the usage pattern studied here. Judy is also LGPL-licensed and patented, which may make it unsuitable for certain kinds of development.</p>
<p>The hash table tested here was specifically written for the job. In general, you can&#8217;t just grab any module with &#8220;hash table&#8221; in the name, slap it in place and expect the same performance. The addressing strategy, choice of hash function, handling of free/deleted slots, and reliance on inlining, templates and indirection each has the potential to bog down the CPU in various ways.</p>
<p>Another fun, useful fact about this type of hash table is that it lends itself surprisingly well to <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming">lock-free programming</a>. But that&#8217;s a subject for another post.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>How to Generate a Sequence of Unique Random Integers</title>
		<link>http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers</link>
		<comments>http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers#comments</comments>
		<pubDate>Mon, 24 Dec 2012 12:31:40 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4833</guid>
		<description><![CDATA[Suppose we wish to generate a sequence of 10000000 random 32-bit integers with no repeats. How can we do it? I faced this problem recently, and considered several options before finally implementing a custom, non-repeating pseudo-random number generator which runs &#8230; <a href="http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Suppose we wish to generate a sequence of <strong>10000000</strong> random 32-bit integers with <em>no repeats</em>. How can we do it?</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=422253117%2C+3056114362%2C+1677071617%2C+478652086%2C+2970049140%2C+...&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='422253117, 3056114362, 1677071617, 478652086, 2970049140, ...' title='422253117, 3056114362, 1677071617, 478652086, 2970049140, ...' class='latex' /></center></p>
<p>I faced this problem recently, and considered several options before finally implementing a custom, non-repeating pseudo-random number generator which runs in O(1) time, requires just 8 bytes of storage, and has pretty good distribution. I thought I&#8217;d share the details here.</p>
<h2>Approaches Considered</h2>
<p>There are already several well-known pseudo-random number generators (PRNGs) such as the Mersenne Twister, an excellent PRNG which distributes integers uniformly across the entire 32-bit range. Unfortunately, calling this PRNG 10000000 times does not tend to generate a sequence of 10000000 unique values. According to <a href="http://preshing.com/20110504/hash-collision-probabilities">Hash Collision Probabilities</a>, the probability of all 10000000 random numbers being unique is just:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=1.970768+%5Ctimes+10%5E%7B-10112%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='1.970768 &#92;times 10^{-10112}' title='1.970768 &#92;times 10^{-10112}' class='latex' /></center></p>
<p><span id="more-4833"></span>That&#8217;s astronomically unlikely. In fact, the <a href="http://math.stackexchange.com/questions/5775/how-many-bins-do-random-numbers-fill">expected number</a> of unique values in such sequences is only about 9988367. You can try it for yourself using Python:</p>
<pre>
>>> import random
>>> len(set([random.randint(0, 0xffffffff) for i in xrange(10000000)]))
9988432
</pre>
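<p>The expected count quoted above can be reproduced with the standard formula for the expected number of occupied bins, N &times; (1 - (1 - 1/N)<sup>k</sup>), where N is the size of the range and k is the number of draws. The following is a minimal sketch; the function name <code>expectedUniqueValues</code> is made up for this illustration.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cmath>

// Expected number of distinct values after `draws` uniform picks from
// `rangeSize` equally likely values: N * (1 - (1 - 1/N)^k).
double expectedUniqueValues(double rangeSize, double draws)
{
    assert(rangeSize >= 1.0 && draws >= 0.0);
    // log1p/expm1 keep precision when 1/N is tiny, as it is for N = 2^32.
    double logMissProbability = draws * std::log1p(-1.0 / rangeSize);
    return -rangeSize * std::expm1(logMissProbability);
}
</pre>
<p>Calling <code>expectedUniqueValues(4294967296.0, 10000000.0)</code> gives a value within one of 9988367, in line with the figure above.</p>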
<p>One obvious refinement is to reject random numbers which are already in the sequence, and continue iterating until we&#8217;ve reached 10000000 elements. To check whether a specific value is already in the sequence, we could search linearly, or we could keep a sorted copy of the sequence and use a binary search. We could even track the presence of each value explicitly, using a giant 512 MB bitfield or a sparse bitfield such as a <a href="http://judy.sourceforge.net/doc/Judy1_3x.htm">Judy1 array</a>.</p>
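<p>As a rough sketch of the rejection idea, here is one possible version which tracks previously seen values in a hash set rather than a bitfield or Judy1 array. The function name <code>uniqueByRejection</code> and the choice of <code>std::unordered_set</code> are assumptions made for illustration only.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Draw `count` distinct 32-bit values by rejecting any value already seen.
std::vector<uint32_t> uniqueByRejection(std::size_t count, uint32_t seed)
{
    assert(count <= 4294967296ull);        // more than 2^32 unique values cannot exist
    std::mt19937 generator(seed);          // 32-bit Mersenne Twister
    std::unordered_set<uint32_t> seen;     // extra memory grows with count
    std::vector<uint32_t> sequence;
    sequence.reserve(count);
    while (sequence.size() < count)
    {
        uint32_t candidate = static_cast<uint32_t>(generator());
        if (seen.insert(candidate).second) // insert() reports whether the value was new
            sequence.push_back(candidate);
    }
    return sequence;
}
</pre>
<p>The cost is visible right away: every value in the sequence is stored twice, once in the output and once in the set.</p>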
<p>Another refinement: Instead of generating an arbitrary 32-bit integer for each element and hoping it&#8217;s unique, we could generate a random index in the range [0, N) where N is the number of remaining unused values. The index would tell us which free slot to take next. We could probably locate each free slot in logarithmic time by implementing a <a href="http://en.wikipedia.org/wiki/Trie">trie</a> suited for this purpose.</p>
<p>Brainstorming some more, an approach based on the <a href="http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher-Yates Shuffle</a> is also quite tempting. Using this approach, we could begin with an array containing all possible 32-bit integers, and shuffle the first 10000000 values out of the array to obtain our sequence. That would require 16 GB of memory. The footprint could be reduced by representing the array as a sparse associative map, such as a <a href="http://judy.sourceforge.net/doc/JudyL_3x.htm">JudyL array</a>, storing only those <em>x</em> where A[<em>x</em>] &ne; <em>x</em>. Or, instead of starting with an array of all possible 32-bit integers, we could start with an initial sequence of any 10000000 sorted integers. In an attempt to span the available range of 32-bit values, we could even model the initial sequence as a <a href="http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process">Poisson process</a>.</p>
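<p>To make the sparse variant concrete, here is a minimal sketch of a partial Fisher-Yates shuffle over a virtual array A[<em>x</em>] = <em>x</em>. It uses <code>std::unordered_map</code> as a stand-in for a JudyL array, and the name <code>sparseShuffle</code> is hypothetical.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Partial Fisher-Yates shuffle over the virtual array A[x] = x, x in [0, 2^32).
// Only slots which have been overwritten are stored in `moved`.
std::vector<uint32_t> sparseShuffle(uint32_t count, uint32_t seed)
{
    std::mt19937 generator(seed);
    std::unordered_map<uint32_t, uint32_t> moved;  // x -> A[x] for touched slots
    std::vector<uint32_t> sequence;
    sequence.reserve(count);
    for (uint32_t i = 0; i < count; ++i)
    {
        // Pick j uniformly from [i, 2^32 - 1].
        std::uniform_int_distribution<uint32_t> pick(i, UINT32_MAX);
        uint32_t j = pick(generator);
        // Read A[j] and A[i], defaulting to the identity mapping.
        auto itJ = moved.find(j);
        uint32_t valueJ = (itJ != moved.end()) ? itJ->second : j;
        auto itI = moved.find(i);
        uint32_t valueI = (itI != moved.end()) ? itI->second : i;
        // Swap: old A[j] becomes the output, and slot j receives old A[i].
        moved[j] = valueI;
        moved.erase(i);  // slot i is never read again
        sequence.push_back(valueJ);
    }
    assert(sequence.size() == count);
    return sequence;
}
</pre>
<p>The map never holds more than <code>count</code> entries, so the footprint scales with the length of the sequence instead of the 16 GB needed for the full array.</p>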
<p>All of the above approaches either run in non-linear time, or require large amounts of storage. Several of them would be workable for a sequence of just 10000000 integers, but it got me thinking whether a more efficient approach, which scales up to any sequence length, is possible.</p>
<h2>A Non-Repeating Pseudo-Random Number Generator</h2>
<p>The ideal PRNG for this problem is one which would generate a unique, random integer the first 2<sup>32</sup> times we call it, then repeat the same sequence the next 2<sup>32</sup> times it is called, ad infinitum. In other words, a repeating cycle of 2<sup>32</sup> values. That way, we could begin the PRNG at any point in the cycle, always having the guarantee that the next 2<sup>32</sup> values are repeat-free.</p>
<p>One way to implement such a PRNG is to define a one-to-one function on the integers &#8212; a function which maps each 32-bit integer to another, uniquely. Let&#8217;s call such a function a <strong>permutation</strong>. If we come up with a good permutation, all we need is to call it with increasing inputs { 0, 1, 2, 3, &#8230; }. We could even begin the input sequence at any value.</p>
<p>For some reason, I remembered from first-year Finite Mathematics that when <em>p</em> is a prime number, <img src='http://s0.wp.com/latex.php?latex=x%5E2%5C%2C%5Cbmod%5C%2Cp&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='x^2&#92;,&#92;bmod&#92;,p' title='x^2&#92;,&#92;bmod&#92;,p' class='latex' /> has some interesting properties. Numbers produced this way are called <a href="http://en.wikipedia.org/wiki/Quadratic_residue">quadratic residues</a>, and we can compute them in C using the expression <code>x * x % p</code>. In particular, the quadratic residue of <em>x</em> is unique as long as <img src='http://s0.wp.com/latex.php?latex=2x+%3C+p&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='2x &lt; p' title='2x &lt; p' class='latex' />. For example, when <em>p</em> = 11, the quadratic residues of 0, 1, 2, 3, 4, 5 are all unique:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Cbegin%7Barray%7D%7Bl%7D++0%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+0+%5C%5C++1%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+1+%5C%5C++2%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+4+%5C%5C++3%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+9+%5C%5C++4%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+5+%5C%5C++5%5E2%5C%2C%5Cbmod%5C%2C11+%5Cequiv+3+%5C%5C++%5Cend%7Barray%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;begin{array}{l}  0^2&#92;,&#92;bmod&#92;,11 &#92;equiv 0 &#92;&#92;  1^2&#92;,&#92;bmod&#92;,11 &#92;equiv 1 &#92;&#92;  2^2&#92;,&#92;bmod&#92;,11 &#92;equiv 4 &#92;&#92;  3^2&#92;,&#92;bmod&#92;,11 &#92;equiv 9 &#92;&#92;  4^2&#92;,&#92;bmod&#92;,11 &#92;equiv 5 &#92;&#92;  5^2&#92;,&#92;bmod&#92;,11 &#92;equiv 3 &#92;&#92;  &#92;end{array}' title='&#92;begin{array}{l}  0^2&#92;,&#92;bmod&#92;,11 &#92;equiv 0 &#92;&#92;  1^2&#92;,&#92;bmod&#92;,11 &#92;equiv 1 &#92;&#92;  2^2&#92;,&#92;bmod&#92;,11 &#92;equiv 4 &#92;&#92;  3^2&#92;,&#92;bmod&#92;,11 &#92;equiv 9 &#92;&#92;  4^2&#92;,&#92;bmod&#92;,11 &#92;equiv 5 &#92;&#92;  5^2&#92;,&#92;bmod&#92;,11 &#92;equiv 3 &#92;&#92;  &#92;end{array}' class='latex' /></center></p>
<p><img src="http://preshing.com/wp-content/uploads/2012/12/partial-scramble.png" alt="" width="269" height="89" class="aligncenter size-full wp-image-4840" /></p>
<p>As luck would have it, it also happens that for the remaining integers, the expression <code>p - x * x % p</code> fits perfectly into the remaining slots. This only works for primes <em>p</em> which satisfy <img src='http://s0.wp.com/latex.php?latex=p+%5Cequiv+3%5C%2C%5Cbmod%5C%2C4&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' title='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' class='latex' />.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/12/full-scramble.png" alt="" width="269" height="89" class="aligncenter size-full wp-image-4841" /></p>
<p>This gives us a one-to-one permutation on the integers less than <em>p</em>, where <em>p</em> can be any prime satisfying <img src='http://s0.wp.com/latex.php?latex=p+%5Cequiv+3%5C%2C%5Cbmod%5C%2C4&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' title='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' class='latex' />. Seems like a nice tool for building our custom PRNG.</p>
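<p>The claim is easy to check by brute force for small primes. The sketch below is only an illustration: the helper names <code>permuteSmall</code> and <code>isPermutation</code> are invented here, and it assumes <em>p</em> is small enough that <code>x * x</code> does not overflow.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <vector>

// Quadratic-residue permutation on [0, p) for a prime p with p % 4 == 3.
// Small p only: x * x must not overflow unsigned int.
unsigned int permuteSmall(unsigned int x, unsigned int p)
{
    assert(p % 4 == 3 && x < p);
    unsigned int residue = x * x % p;
    return (x <= p / 2) ? residue : p - residue;
}

// Returns true if permuteSmall hits every value in [0, p) exactly once.
bool isPermutation(unsigned int p)
{
    std::vector<bool> hit(p, false);
    for (unsigned int x = 0; x < p; ++x)
    {
        unsigned int y = permuteSmall(x, p);
        if (y >= p || hit[y])
            return false;
        hit[y] = true;
    }
    return true;
}
</pre>
<p>For <em>p</em> = 11, the inputs 0 through 10 map to 0, 1, 4, 9, 5, 3, 8, 6, 2, 7, 10, so <code>isPermutation(11)</code> returns true.</p>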
<p>In the case of our custom PRNG, we want a permutation which works on the entire range of 32-bit integers. However, 2<sup>32</sup> is not a prime number. The closest prime number less than 2<sup>32</sup> is 4294967291, which happens to satisfy <img src='http://s0.wp.com/latex.php?latex=p+%5Cequiv+3%5C%2C%5Cbmod%5C%2C4&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' title='p &#92;equiv 3&#92;,&#92;bmod&#92;,4' class='latex' />. As a compromise, we can write a C++ function which permutes all integers below this prime, and simply maps the 5 remaining integers to themselves.</p>
<pre lang="cpp" cssfile="none">
unsigned int permuteQPR(unsigned int x)
{
    static const unsigned int prime = 4294967291u;
    if (x >= prime)
        return x;  // The 5 integers out of range are mapped to themselves.
    unsigned int residue = ((unsigned long long) x * x) % prime;
    return (x <= prime / 2) ? residue : prime - residue;
}
</pre>
<p>This function, on its own, is not the world's best permutation -- it tends to cluster output values for certain ranges of input -- but it is one-to-one. As such, we can combine it with other one-to-one functions, such as addition and XOR, to achieve a much better permutation. I found the following expression works reasonably well. The <code>intermediateOffset</code> variable acts as a seed, putting a variety of different sequences at our disposal.</p>
<pre lang="cpp" cssfile="none">
permuteQPR((permuteQPR(x) + intermediateOffset) ^ 0x5bf03635);
</pre>
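<p>To show how the expression turns into a generator, here is a minimal, self-contained sketch. The class name <code>UniqueSequence</code>, the function name <code>permuteQPR32</code> and the constructor parameters are hypothetical; this is not the class published on GitHub, just one way to wrap the expression.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstdint>

// Same permutation as permuteQPR above, restated so this sketch stands alone.
uint32_t permuteQPR32(uint32_t x)
{
    const uint32_t prime = 4294967291u;
    if (x >= prime)
        return x;  // The 5 integers out of range are mapped to themselves.
    uint32_t residue = static_cast<uint32_t>((static_cast<uint64_t>(x) * x) % prime);
    assert(residue < prime);
    return (x <= prime / 2) ? residue : prime - residue;
}

// Hypothetical wrapper: walks an index through the combined permutation.
class UniqueSequence
{
public:
    UniqueSequence(uint32_t startIndex, uint32_t intermediateOffset)
        : m_index(startIndex), m_intermediateOffset(intermediateOffset) {}

    uint32_t next()
    {
        // Unsigned addition wraps modulo 2^32, which is itself one-to-one.
        return permuteQPR32((permuteQPR32(m_index++) + m_intermediateOffset) ^ 0x5bf03635u);
    }

private:
    uint32_t m_index;
    uint32_t m_intermediateOffset;
};
</pre>
<p>Because every step is one-to-one, any 2<sup>32</sup> consecutive calls to <code>next()</code> return distinct values, whatever the starting index and offset.</p>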
<p>On GitHub, I've <a href="https://github.com/preshing/RandomSequence/blob/master/randomsequence.h">posted a C++ class</a> which implements a pseudo-random number generator based on this expression.</p>
<pre>
0xc2ab6929
0xa0e2502c
0x95b7c9eb
0x2fb14c01
0x5d983e09
0xba14e8c1
0x90994968
0x07058db1
0x061acc1f
0x969860fe
0x92f5a666
0xfa1b08af
0x18a3354f
...
</pre>
<p>I've also posted a <a href="https://github.com/preshing/RandomSequence">working project</a> to verify that this PRNG really does output a cycle of 2<sup>32</sup> unique integers.</p>
<p><a href="https://github.com/preshing/RandomSequence"><img src="http://preshing.com/wp-content/uploads/2012/12/github-randomsequence.png" alt="" width="196" height="34" class="aligncenter size-full wp-image-5031" /></a></p>
<p>So, how does the <em>randomness</em> of this generator stack up? I'm not a PRNG expert, so I put my trust in <a href="http://www.iro.umontreal.ca/~simardr/testu01/tu01.html">TestU01</a>, a library for testing the quality of PRNGs, published by the University of Montreal. <a href="https://gist.github.com/4367443">Here's some test code</a> to put our newly conceived PRNG through its paces. It passes all 15 tests in TestU01's SmallCrush test suite, which I guess is pretty decent. It also passes <a href="/files/Crush.txt">140/144</a> tests in the more stringent Crush suite.</p>
<pre>
========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        ursu_CreateRSU
 Number of statistics:  15
 Total CPU time:   00:00:49.95

 All tests were passed
</pre>
<p>Perhaps this approach for generating a sequence of unique random numbers is already known, or perhaps it shares attributes with existing PRNGs. If so, I'd be interested to find out. If you wanted, you could probably adapt it to work on ranges of integers other than 2<sup>32</sup> as well. Surfing around, I noticed that OpenBSD implements <a href="http://static.usenix.org/event/usenix99/full_papers/deraadt/deraadt_html/node17.html">another non-repeating PRNG</a>, though I'm not sure whether their implementation is cyclical or covers the entire number space.</p>
<p>Incidentally, this PRNG is used in my next post, <a href="http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array">This Hash Table Is Faster Than a Judy Array</a>.</p>
<p>Do you know any other way to solve this problem?</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Arithmetic Encoding Using Fixed-Point Math</title>
		<link>http://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math</link>
		<comments>http://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math#comments</comments>
		<pubDate>Mon, 05 Nov 2012 12:13:20 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4647</guid>
		<description><![CDATA[My previous post acted as an introduction to arithmetic coding, using the 1MB Sorting Problem as a case study. I ended that post with the question of how to work with fractional values having millions of significant binary digits. The &#8230; <a href="http://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>My <a href="http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem">previous post</a> acted as an introduction to arithmetic coding, using the <a href="http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram">1MB Sorting Problem</a> as a case study. I ended that post with the question of how to work with fractional values having millions of significant binary digits.</p>
<p>The answer is that we don&#8217;t have to. Let&#8217;s work with 32-bit <a href="http://en.wikipedia.org/wiki/Fixed-point_arithmetic">fixed-point math</a> instead, and see how far that gets us.</p>
<p>A 32-bit fixed-point number can represent fractions in the interval [0, 1) using 32 bits of precision. In other words, it can encode the first 32 bits of any binary fraction. To define such a fixed-point number, we simply take a regular 32-bit unsigned integer, and imagine it as the top of a fraction over 2<sup>32</sup>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/fixed-point.png" alt="" title="" width="429" height="163" class="aligncenter size-full wp-image-4532" /></p>
<p>As you can see, <code>0x0288df0d</code> represents a fixed-point number approximately equal to D, which in the previous post, we estimated as the probability of encountering a delta value of <strong>0</strong> in one of our sorted sequences. It's actually not such a bad approximation, either: The error is within 0.00000023%.</p>
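<p>Converting between this representation and an ordinary floating-point fraction takes one multiplication or division by 2<sup>32</sup>. The helpers below are a small sketch; the names <code>fixedToDouble</code> and <code>doubleToFixed</code> are assumptions made for illustration and do not appear in the source code listing.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstdint>

typedef uint32_t u32;

// Interpret a 32-bit unsigned integer as the numerator of a fraction over 2^32.
double fixedToDouble(u32 fixed)
{
    return fixed / 4294967296.0;
}

// Keep the first 32 fractional bits of a value in [0, 1), truncating the rest.
u32 doubleToFixed(double fraction)
{
    assert(fraction >= 0.0 && fraction < 1.0);
    return static_cast<u32>(fraction * 4294967296.0);
}
</pre>
<p>For example, <code>fixedToDouble(0x0288df0d)</code> is 42524429 / 2<sup>32</sup>, or roughly 0.0099, and converting that result back with <code>doubleToFixed</code> returns the same 32 bits.</p>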
<p><span id="more-4647"></span>If you recall <a href="http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem">the way we partitioned</a> the real number line in the previous post, <code>0x0288df0d</code> would represent the boundary between the first and second partitions. In a similar way, we can compute the boundaries around all of the first 63 partitions</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/first-63.png" alt="" title="" width="620" height="37" class="aligncenter size-full wp-image-4759" /></p>
<p>and compile them into a lookup table:</p>
<pre lang="cpp" cssfile="none">
static const u32 LUTsize = 64;
const u32 LUT[LUTsize] = {
    0x00000000, 0x0288df0d, 0x050b5170, 0x07876772,
    0x09fd3131, 0x0c6cbea6, 0x0ed61f9d, 0x113963bd,
    0x13969a84, 0x15edd348, 0x183f1d3b, 0x1a8a8766,
    0x1cd020ad, 0x1f0ff7cc, 0x214a1b5e, 0x237e99d4,
    0x25ad817f, 0x27d6e088, 0x29fac4f7, 0x2c193cad,
    0x2e32556d, 0x30461cd1, 0x3254a056, 0x345ded53,
    0x366210ff, 0x3861186f, 0x3a5b1096, 0x3c500649,
    0x3e40063a, 0x402b1cfa, 0x421156fd, 0x43f2c095,
    0x45cf65f7, 0x47a75337, 0x497a944b, 0x4b49350b,
    0x4d134131, 0x4ed8c45a, 0x5099ca03, 0x52565d8e,
    0x540e8a41, 0x55c25b43, 0x5771dba1, 0x591d1649,
    0x5ac41611, 0x5c66e5b1, 0x5e058fc6, 0x5fa01ed4,
    0x61369d41, 0x62c9155d, 0x6457915a, 0x65e21b51,
    0x6768bd44, 0x68eb8119, 0x6a6a709d, 0x6be59584,
    0x6d5cf96c, 0x6ed0a5d9, 0x7040a435, 0x71acfdd4,
    0x7315bbf3, 0x747ae7b7, 0x75dc8a2c, 0x773aac4a
};
</pre>
<p>You'll recognize this lookup table as the same one found in the <a href="https://gist.github.com/3952090#L24">source code listing</a>.</p>
<h2>The Encoder</h2>
<p>We begin the encoding process by setting up a <code>writeInterval</code> structure having two fixed-point members: <code>lo</code> and <code>range</code>. Initially, <code>lo</code> = <code>0</code> and <code>range</code> = <code>0xffffffff</code>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/lo-range.png" alt="" title="" width="427" height="101" class="aligncenter size-full wp-image-4556" /></p>
<p>This represents an interval along the real number line. Obviously, <code>range</code> is slightly less than 1.0, so we aren't quite taking advantage of the full number line, but it's very close, and it fits inside 32 bits.</p>
<p>Next, we're going to subdivide this interval. Subdividing is a straightforward matter of <a href="http://en.wikipedia.org/wiki/Linear_interpolation">linear interpolation</a>. Given a fixed-point value <code>x</code>, we locate the corresponding point within our interval by performing a <em>lerp</em>:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Ctextrm%7Blerp%7D%28x%29+%3D+%5Ctextrm%7Blo%7D+%2B+%5Ctextrm%7Brange%7D+%5Ctimes+x&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;textrm{lerp}(x) = &#92;textrm{lo} + &#92;textrm{range} &#92;times x' title='&#92;textrm{lerp}(x) = &#92;textrm{lo} + &#92;textrm{range} &#92;times x' class='latex' /></center></p>
<p>To implement this formula using 32-bit fixed-point math, we must cast to 64 bits before multiplying, then right-shift the result by 32.</p>
<pre>
u32 lerp(u64 x)        { return lo + (u32) ((range * x) >> 32); }
</pre>
<p>Now, suppose we wish to encode an initial delta value of <strong>29</strong>. First, we need to find the boundaries of the partition which corresponds to 29. To find them, we consult <code>LUT[29]</code> and <code>LUT[30]</code> in our lookup table, <em>lerp</em> those values within the <code>writeInterval</code>, and assign the results to <code>A</code> and <code>B</code>. In this case, we end up with <code>A</code> = <code>0x402b1cf9</code> and <code>B</code> = <code>0x421156fc</code>. Since this is the very first partition along the number line, we immediately know that our final encoded binary fraction will lie somewhere within the interval [A, B).</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/a-b.png" alt="" title="" width="492" height="105" class="aligncenter size-full wp-image-4563" /></p>
<p>You'll notice that <code>A</code> and <code>B</code> share the same first 6 bits. We can go ahead and <strong>write those 6 bits</strong> to the output stream, because they'll never change, no matter what the final encoded binary fraction ends up being. So let's write them, then shift both <code>A</code> and <code>B</code> to the left by 6 bits each.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/a-b-encoded.png" alt="" title="" width="575" height="105" class="aligncenter size-full wp-image-4564" /></p>
<p>We can now think of <code>A</code> and <code>B</code> as <em>trailing digits</em> on the encoded output. At this point, we've written the binary digits we're sure about, and <code>A</code> and <code>B</code> represent the digits we're not sure about yet. We need to continue subdividing to drill deeper and gain more certainty about subsequent digits.</p>
<p>Before handling the <em>next</em> delta value in the input sequence, we set up a new <code>writeInterval</code>, setting <code>lo</code> to <code>A</code>, and <code>range</code> to <code>B</code> - <code>A</code>. We then process the next delta value in the same way we did the first, interpolating new <code>A</code> and <code>B</code> values within the new <code>writeInterval</code>. Once again, we output all of the most significant bits which <code>A</code> and <code>B</code> agree on, shifting those values to compensate. We repeat these steps until the entire input sequence is exhausted, at which point we encode a dummy value somewhere within the final interval (I chose 32), then flush the contents of <code>A</code> to the bit stream.</p>
<p>It turns out that shifting the values of <code>A</code> and <code>B</code> to the left is <strong>critical</strong>: By shifting them, we maximize the resulting value of <code>range</code>, so there's always enough numerical precision to subdivide further. This trick lets us get away with using 32-bit fixed point math at every step during the encoding process.</p>
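<p>The first encoding step described above can be traced in a few lines. This is a simplified sketch, not the encoder from the source code listing: the <code>Interval</code> struct and the helpers <code>sharedLeadingBits</code> and <code>shiftedInterval</code> are hypothetical names, and nothing here writes to a real bit stream.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

struct Interval
{
    u32 lo;
    u32 range;
    // Same lerp as above: widen to 64 bits, multiply, keep the top 32 bits.
    u32 lerp(u64 x) const { return lo + (u32) ((range * x) >> 32); }
};

// Number of leading bits on which a and b agree (0 to 32).
int sharedLeadingBits(u32 a, u32 b)
{
    u32 diff = a ^ b;
    int count = 0;
    for (u32 mask = 0x80000000u; mask != 0 && (diff & mask) == 0; mask >>= 1)
        ++count;
    return count;
}

// After writing `shared` bits, shift them out of A and B and form the next interval.
Interval shiftedInterval(u32 a, u32 b, int shared)
{
    assert(shared >= 0 && shared < 32);
    Interval next = { a << shared, (b << shared) - (a << shared) };
    return next;
}
</pre>
<p>Starting from <code>lo</code> = <code>0</code> and <code>range</code> = <code>0xffffffff</code>, lerping <code>LUT[29]</code> and <code>LUT[30]</code> yields <code>0x402b1cf9</code> and <code>0x421156fc</code>, <code>sharedLeadingBits</code> reports 6, and the shifted interval ends up with a <code>range</code> of <code>0x798e80c0</code>, which leaves plenty of room for the next subdivision.</p>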
<h2>Breaking Up Large Deltas</h2>
<p>The lookup table we defined, <code>LUT</code>, contains only 64 entries. Obviously, we can't use it to look up partition boundaries for delta values greater than 62. We could, perhaps, extend it to hold all 100000001 possible boundary values, but if we did, our fixed-point representation would quickly run out of numerical precision. Check the partitioned number line again: the partition for delta value <strong>10000</strong> is already very close to 1.0. In fact, it would require more than 143 fractional binary digits to represent the boundaries of this partition; way more than 32. And the partitions to the right of that would require even more.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/number-line.png" alt="" title="" width="620" height="37" class="aligncenter size-full wp-image-4484" /></p>
<p>We need a better way. If you do the math, the partition boundaries are actually described by the following function:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Ctextrm%7BLUT%7D%28x%29+%3D+1+-+%281-D%29%5Ex&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;textrm{LUT}(x) = 1 - (1-D)^x' title='&#92;textrm{LUT}(x) = 1 - (1-D)^x' class='latex' /></center></p>
<p>And lucky for us, this function satisfies a useful relation:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Ctextrm%7BLUT%7D%28x+%2B+63%29+%3D+%5Ctextrm%7BLUT%7D%2863%29+%2B+%5Cbig%281+-+%5Ctextrm%7BLUT%7D%2863%29%5Cbig%29+%5Ctimes+%5Ctextrm%7BLUT%7D%28x%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;textrm{LUT}(x + 63) = &#92;textrm{LUT}(63) + &#92;big(1 - &#92;textrm{LUT}(63)&#92;big) &#92;times &#92;textrm{LUT}(x)' title='&#92;textrm{LUT}(x + 63) = &#92;textrm{LUT}(63) + &#92;big(1 - &#92;textrm{LUT}(63)&#92;big) &#92;times &#92;textrm{LUT}(x)' class='latex' /></center></p>
<p>This relation is basically just another <em>lerp</em> like the one we saw earlier. It tells us that if we want to encode any delta value that is greater than or equal to 63, we can simply reduce the delta value by 63 and encode it within the subinterval [LUT(63), 1) instead.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/after-63.png" alt="" title="" width="620" height="37" class="aligncenter size-full wp-image-4758" /></p>
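<p>The relation follows directly from the definition: writing q = 1 - D, we have LUT(63) + (1 - LUT(63)) &times; LUT(x) = 1 - q<sup>63</sup> + q<sup>63</sup> &times; (1 - q<sup>x</sup>) = 1 - q<sup>x + 63</sup>. Here is a small floating-point sketch to check it numerically, along with the precision estimate mentioned earlier. The function names are invented for this example, and D is assumed to be the value encoded by <code>0x0288df0d</code>.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cmath>

// Partition boundary LUT(x) = 1 - (1 - D)^x, evaluated in floating point.
double boundary(double D, double x)
{
    assert(D > 0.0 && D < 1.0);
    return 1.0 - std::pow(1.0 - D, x);
}

// Right-hand side of the relation: LUT(63) + (1 - LUT(63)) * LUT(x).
double boundaryViaReduction(double D, double x)
{
    double lut63 = boundary(D, 63.0);
    return lut63 + (1.0 - lut63) * boundary(D, x);
}

// Fractional binary digits needed to tell LUT(x) apart from 1.0,
// that is, -log2((1 - D)^x).
double bitsToResolve(double D, double x)
{
    return -x * std::log2(1.0 - D);
}
</pre>
<p>With D = <code>0x0288df0d</code> / 2<sup>32</sup>, the two sides agree to within floating-point rounding, and <code>bitsToResolve(D, 10000)</code> lands between 143 and 144, matching the estimate above.</p>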
<p>If the delta value is very large, we simply repeat the reduction step until we have a delta value less than 63. You'll notice this technique bears some similarity to the Golomb encoding strategy I <a href="http://preshing.com/20121026/1mb-sorting-explained#encoding-each">described earlier</a>, except that it doesn't encode bits explicitly; only implicitly, as the result of continuous subdivision of the <code>writeInterval</code>.</p>
<pre lang="cpp" cssfile="none">
    void pushDelta(u32 delta)
    {
        while (delta >= LUTsize - 1)
        {
            encode(LUTsize - 1); // Use the [LUT(63), 1) subinterval
            delta -= LUTsize - 1;
        }
        encode(delta);
    }
</pre>
<h2>The Decoder</h2>
<p>As you'd expect, the decoder reverses the arithmetic encoding process, also using 32-bit fixed-point math. It begins by setting up a <code>readInterval</code> structure with <code>lo</code> = <code>0</code> and <code>range</code> = <code>0xffffffff</code>, then reads the first 32 binary digits of the encoded bit stream into a fixed-point value <code>readSeq</code>.</p>
<p>That's all the information the decoder needs to correctly determine the first delta value. It finds this delta value by performing a <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">binary search</a> through the lookup table, lerping every <code>LUT</code> entry through the <code>readInterval</code> until it identifies the partition containing <code>readSeq</code>. If <code>readSeq</code> is greater than every lerped entry, it means the delta value was greater than or equal to 63, in which case the actual delta value is reconstructed by iterating the decoding step.</p>
<p>Either way, once the decoder identifies the correct partition, it knows exactly how many bits the encoder pushed to the bitstream, and it sets up a new <code>readInterval</code> structure which exactly matches the one used by the encoder for the next delta value.</p>
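<p>The partition lookup might be sketched as follows. This is a simplified illustration rather than the decoder from the source code listing: <code>ReadInterval</code> and <code>findPartition</code> are hypothetical names, and the sketch assumes the interval does not wrap around the 32-bit range, a case the real decoder has to handle.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

static const u32 LUTsize = 64;
static const u32 LUT[LUTsize] = {
    0x00000000, 0x0288df0d, 0x050b5170, 0x07876772,
    0x09fd3131, 0x0c6cbea6, 0x0ed61f9d, 0x113963bd,
    0x13969a84, 0x15edd348, 0x183f1d3b, 0x1a8a8766,
    0x1cd020ad, 0x1f0ff7cc, 0x214a1b5e, 0x237e99d4,
    0x25ad817f, 0x27d6e088, 0x29fac4f7, 0x2c193cad,
    0x2e32556d, 0x30461cd1, 0x3254a056, 0x345ded53,
    0x366210ff, 0x3861186f, 0x3a5b1096, 0x3c500649,
    0x3e40063a, 0x402b1cfa, 0x421156fd, 0x43f2c095,
    0x45cf65f7, 0x47a75337, 0x497a944b, 0x4b49350b,
    0x4d134131, 0x4ed8c45a, 0x5099ca03, 0x52565d8e,
    0x540e8a41, 0x55c25b43, 0x5771dba1, 0x591d1649,
    0x5ac41611, 0x5c66e5b1, 0x5e058fc6, 0x5fa01ed4,
    0x61369d41, 0x62c9155d, 0x6457915a, 0x65e21b51,
    0x6768bd44, 0x68eb8119, 0x6a6a709d, 0x6be59584,
    0x6d5cf96c, 0x6ed0a5d9, 0x7040a435, 0x71acfdd4,
    0x7315bbf3, 0x747ae7b7, 0x75dc8a2c, 0x773aac4a
};

struct ReadInterval
{
    u32 lo;
    u32 range;
    u32 lerp(u64 x) const { return lo + (u32) ((range * x) >> 32); }
};

// Returns the largest index i with lerp(LUT[i]) <= readSeq.
// A result of LUTsize - 1 means the delta is at least 63: reduce and decode again.
// Assumes no wraparound, so lerped boundaries never decrease and readSeq >= lo.
u32 findPartition(const ReadInterval& interval, u32 readSeq)
{
    assert(readSeq >= interval.lo);
    u32 low = 0;
    u32 high = LUTsize - 1;
    while (low < high)
    {
        u32 mid = (low + high + 1) / 2;  // round up so the loop always makes progress
        if (interval.lerp(LUT[mid]) <= readSeq)
            low = mid;
        else
            high = mid - 1;
    }
    return low;
}
</pre>
<p>With the initial interval (<code>lo</code> = <code>0</code>, <code>range</code> = <code>0xffffffff</code>), a <code>readSeq</code> of <code>0x41000000</code> falls between the lerped values of <code>LUT[29]</code> and <code>LUT[30]</code>, so <code>findPartition</code> returns 29.</p>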
<h2>Carrying the One</h2>
<p>After implementing all of the above, one problematic case remains. Once in a while, when several nested partitions align perfectly, we may end up with a partition like this:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/small-range.png" alt="" title="" width="426" height="102" class="aligncenter size-full wp-image-4764" /></p>
<p>The encoder has not yet shifted the upper bits of <code>A</code> and <code>B</code> to the output, since they don't match. Unfortunately, if we subtract <code>A</code> from <code>B</code> to determine the range of the next <code>writeInterval</code>, it will be too small to divide uniquely into further partitions. We've run out of numerical precision. The whole algorithm breaks.</p>
<p>To solve this problem, we must allow some bit shifting even in cases where the leading binary digits don't match, such as in the above case. It ends up boiling down to a simple rule: As long as the second bit of <code>A</code> is <strong>1</strong> and the second bit of <code>B</code> is <strong>0</strong>, keep shifting <code>A</code> to the output.</p>
<p>In doing so, we end up having to account for some funky scenarios: <code>A</code> becomes <em>greater</em> than <code>B</code>; we have to perform some lerps which wrap around the range of an integer; and there are cases where we need to "carry a one" back into previous bits we've already written to the bit stream. I won't drill into all the details here, because it would be too boring, and I assume the handful of readers who are interested enough will study the <a href="https://gist.github.com/3952090">source code</a> anyway. In short, once everything is taken care of, <code>range</code> never dips below <code>0x40000000</code>, providing plenty of precision.</p>
<h2>Evidence of Memory Efficiency</h2>
<p>I modified the source code so that it <a href="https://gist.github.com/4015039">outputs some statistics</a> to <code>stderr</code>, and implemented a Python function to <a href="https://gist.github.com/4015054">validate the program</a> using whatever input sequence you provide. Here's the output from a few trial runs:</p>
<pre>
$ python -i validate.py
>>> from random import *
>>> validate([randint(0, 99999999) for i in xrange(1000000)])
Result OK after 34.537 secs
minRange=0x40000012 maxCircularBytes=<span class="highlight">1011728</span>
>>> validate([99999999] * 1000000)
Result OK after 25.832 secs
minRange=0x40000231 maxCircularBytes=<span class="highlight">1011732</span>
>>> validate([62 * i for i in xrange(999999)] + [99999999])
Result OK after 15.247 secs
minRange=0x40000631 maxCircularBytes=<span class="highlight">1011728</span>
>>> validate([int(gauss(99999999, 1000000)) % 100000000 for i in xrange(1000000)])
Result OK after 29.216 secs
minRange=0x40000006 maxCircularBytes=<span class="highlight">1011728</span>
</pre>
<p>The highlighted number tells us the maximum number of bytes in use by the circular buffer during each trial run. No matter what I tried, I couldn't make it consume more than <strong>1011732</strong> bytes. That's well within the limit of 1013000 bytes reserved for the buffer.</p>
<p>It's also remarkably close to the <a href="http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem#how-many">fundamental limit</a> of <strong>1011717</strong> bytes which we derived in the last post, especially considering a <a href="https://gist.github.com/3952090#L263">few bytes are wasted</a> at the end of the sequence when flushing the <code>BitWriter</code>.</p>
<h2>Pseudo-Mathematical Proof of Efficiency</h2>
<p>It should be possible to construct a mathematical proof giving an upper bound on the amount of memory needed to encode a sorted sequence using this method. I'm totally going to handwave through this part, but if you accept that the encoding process works the same even if you "break up large deltas" using a lookup table of just 2 elements, 0 and D, it follows that it takes</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=-log_2+D+%5Capprox+6.65821147+%5Ctextrm%7B+bits%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='-log_2 D &#92;approx 6.65821147 &#92;textrm{ bits}' title='-log_2 D &#92;approx 6.65821147 &#92;textrm{ bits}' class='latex' /></center></p>
<p>to output a single value in the sorted sequence, which happens 1000000 times, and</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=-log_2+%281+-+D%29+%5Capprox+.01435529+%5Ctextrm%7B+bits%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='-log_2 (1 - D) &#92;approx .01435529 &#92;textrm{ bits}' title='-log_2 (1 - D) &#92;approx .01435529 &#92;textrm{ bits}' class='latex' /></center></p>
<p>to increment the value by 1, which happens at most 99999999 times. Therefore, in pure theory, it shouldn't take more than 6.65821147 &times; 1000000 + .01435529 &times; 99999999 &approx; <strong>8093740</strong> bits to encode any sequence, which is pretty close to the fundamental limit we derived earlier. The proof would also have to add some small epsilon value to account for the loss of precision that happens at each step using fixed-point math, but I haven't figured that out, and I'm not really interested in going that far.</p>
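<p>The arithmetic behind that bound is simple enough to check in code. The sketch below is illustrative only; <code>worstCaseBits</code> is a made-up name, and D is assumed to be the value encoded by <code>0x0288df0d</code>, so the last digits may differ slightly from the figures quoted above.</p>
<pre lang="cpp" cssfile="none">
#include <cassert>
#include <cmath>

// Upper bound, in bits, for encoding `count` values whose increments total at most
// `maxTotal`, using the two-entry lookup table { 0, D } described above.
double worstCaseBits(double D, double count, double maxTotal)
{
    assert(D > 0.0 && D < 1.0);
    double bitsPerValue = -std::log2(D);            // cost of outputting one value
    double bitsPerIncrement = -std::log2(1.0 - D);  // cost of incrementing by 1
    return bitsPerValue * count + bitsPerIncrement * maxTotal;
}
</pre>
<p>Evaluating <code>worstCaseBits(0x0288df0d / 4294967296.0, 1000000.0, 99999999.0)</code> gives a result within a bit or two of 8093740, ignoring the fixed-point rounding losses mentioned above.</p>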
<p>In wrapping up this subject, I learned something interesting while reading up on arithmetic coding. <a href="http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents">Dozens of US patents</a> related to the technique have been granted in recent years. One side effect from these patents is that, apparently, most JPEG images in use today take up to 25% more space than they would without the existence of such patents. It's rather funny (and a bit sad) when you consider that the intended goal of the patent system is to advance science and technology.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Arithmetic Coding and the 1MB Sorting Problem</title>
		<link>http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem</link>
		<comments>http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem#comments</comments>
		<pubDate>Mon, 05 Nov 2012 12:10:05 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4478</guid>
		<description><![CDATA[It&#8217;s been two weeks since the 1MB Sorting Problem was originally featured on Reddit, and in case you don&#8217;t think this artificial problem has been thoroughly stomped into the ground yet, here&#8217;s a continuation of last post&#8217;s explanation of the &#8230; <a href="http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>It&#8217;s been two weeks since the 1MB Sorting Problem was originally <a href="http://www.reddit.com/r/programming/comments/11uali/sort_1_million_8digit_decimal_numbers_in_1mb_of/">featured on Reddit</a>, and in case you don&#8217;t think this artificial problem has been thoroughly stomped into the ground yet, here&#8217;s a continuation of <a href="http://preshing.com/20121026/1mb-sorting-explained">last post&#8217;s explanation</a> of the working C++ program which solves it.</p>
<p>In that post, I gave a high-level outline of the approach, and showed an encoding scheme &mdash; which I&#8217;ve <a href="http://preshing.com/20121026/1mb-sorting-explained#comment-27436">since learned</a> is Golomb coding &mdash; which comes close to meeting the memory requirements, but doesn&#8217;t quite fit. Arithmetic coding, on the other hand, does fit. It&#8217;s interesting, because this problem, as it was phrased, almost seems designed to force you into arithmetic coding (though <a href="http://nick.cleaton.net/ramsortsol.html">Nick Cleaton&#8217;s solution</a> manages to avoid it).</p>
<p>I had read about arithmetic coding before, but I never had any reason to sit down and implement it. It always struck me as kind of mystical: How the heck do you encode information in a <em>fraction</em> of a bit, anyway? The 1MB Sorting Problem turned out to be a great excuse to learn how arithmetic coding works.</p>
<h2 id="how-many">How Many Sorted Sequences Even Exist?</h2>
<p>It&#8217;s important to note that the whole reason why we are able to represent a <strong>sorted</strong> sequence of one million 8-digit numbers in less than 1 MB of memory is because mathematically, there simply aren&#8217;t that many different sorted sequences which exist.</p>
<p><span id="more-4478"></span>To see this, consider the following method of encoding a sorted sequence as a row of boxes, read from left to right.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/boxes.png" alt="" title="" width="468" height="20" class="aligncenter size-full wp-image-4480" /></p>
<p>An empty box tells us to increment the current value (initially 0), and a box with a dot inside it tells us to return the current value as part of the sorted sequence. For example, the above row of boxes corresponds to the sorted sequence <strong>{ 3, 7, 7, 10, 15, 16, &#8230; }</strong>.</p>
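<p>The box notation can be sketched in a few lines of Python. This is only an illustration: the function names are invented, and the characters <code>_</code> (empty box) and <code>o</code> (box with a dot) are arbitrary stand-ins for the drawing above.</p>
<pre>
def to_boxes(sorted_values):
    """Encode a sorted sequence as a row of boxes."""
    boxes = []
    current = 0
    for value in sorted_values:
        boxes.append('_' * (value - current))   # increment up to the value
        boxes.append('o')                        # output the current value
        current = value
    return ''.join(boxes)

def from_boxes(boxes):
    """Read the boxes from left to right to recover the sequence."""
    values = []
    current = 0
    for box in boxes:
        if box == 'o':
            values.append(current)
        else:
            current += 1
    return values

row = to_boxes([3, 7, 7, 10, 15, 16])
print(row)
print(from_boxes(row))
</pre>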
<p>Now, to encode one million 8-digit numbers, we&#8217;d need exactly 1000000 boxes containing dots, plus a maximum of 99999999 empty boxes. In fact, when there are exactly 99999999 + 1000000 boxes in total, we can encode <em>every possible sorted sequence</em> using a unique distribution of dots. The question then becomes: How many different ways are there to distribute those dots? This is a straightforward <a href="http://mathworld.wolfram.com/Combination.html">number of combinations</a> problem:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%7B%7B99999999+%2B+1000000%7D+%5Cchoose+1000000%7D+%5Capprox+2%5E%7B8093729.481...%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='{{99999999 + 1000000} &#92;choose 1000000} &#92;approx 2^{8093729.481...}' title='{{99999999 + 1000000} &#92;choose 1000000} &#92;approx 2^{8093729.481...}' class='latex' /></center></p>
<p>That&#8217;s a lot of combinations. Now, think of the contents of memory as one giant label which represents exactly <em>one</em> of those combinations. The exponent on the 2, above, gives us a lower limit for how many bits of memory would be required to come up with a unique label for every possible combination. In this case, it can&#8217;t be done using fewer than 8093730 bits, or <strong>1011717</strong> bytes.</p>
<p>That&#8217;s the fundamental limit. No encoding scheme can ever do better than that; it would be like trying to uniquely label every state in the USA using fewer than 6 bits. On the bright side, 1011717 bytes is comfortably less than our 1048576 byte limit, which is encouraging.</p>
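<p>The exponent can be reproduced without ever building the enormous integer. The Python sketch below uses the log-gamma function for that; the helper name <code>log2_binomial</code> is invented for this example, and the result is only as accurate as double-precision floating point allows.</p>
<pre>
import math

def log2_binomial(n, k):
    """log2 of "n choose k", using ln(n!) = lgamma(n + 1)."""
    ln_value = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_value / math.log(2)

exponent = log2_binomial(99999999 + 1000000, 1000000)
min_bits = math.ceil(exponent)
min_bytes = math.ceil(min_bits / 8)
print(exponent, min_bits, min_bytes)
</pre>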
<h2>The Probability of Encountering Each Delta Value</h2>
<p>In the <a href="http://preshing.com/20121026/1mb-sorting-explained">last post</a>, we saw the potential of encoding <strong>delta values</strong> &mdash; the differences between numbers in a sorted sequence. Thinking in terms of the above rows of boxes, let&#8217;s take a look at the probability of encountering each delta value.</p>
<p>Since we know that there are 99999999 empty boxes, and 1000000 boxes containing dots, the probability of any particular box containing a dot is just:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=D+%3D+%5Cfrac%7B1000000%7D%7B99999999+%2B+1000000%7D+%5Capprox+.00990099019703...&#038;bg=ffffff&#038;fg=000&#038;s=1' alt='D = &#92;frac{1000000}{99999999 + 1000000} &#92;approx .00990099019703...' title='D = &#92;frac{1000000}{99999999 + 1000000} &#92;approx .00990099019703...' class='latex' /></center></p>
<p>For simplicity, let&#8217;s now imagine an infinite row of boxes, with dots occurring at the same frequency as this. The probability of encountering a delta value of <strong>0</strong> is, then, the same as the probability of a box containing a dot, which is just D.</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=P%280%29+%3D+D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='P(0) = D' title='P(0) = D' class='latex' /></center></p>
<p>How about a delta value of 1? Well, the probability of the first box being empty is (1 &#8211; D), while the probability of the second box containing a dot is still just D. Since each outcome is an <a href="http://en.wikipedia.org/wiki/Independence_%28probability_theory%29">independent event</a>, we can multiply those probabilities together. The probability of encountering a delta value of <strong>1</strong>, then, is</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Cbegin%7Barray%7D%7Brl%7D++P%281%29+%26%3D+%281-D%29+%5Ctimes+D+%5C%5C++%26%5Capprox+.00980296059015...++%5Cend%7Barray%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;begin{array}{rl}  P(1) &amp;= (1-D) &#92;times D &#92;&#92;  &amp;&#92;approx .00980296059015...  &#92;end{array}' title='&#92;begin{array}{rl}  P(1) &amp;= (1-D) &#92;times D &#92;&#92;  &amp;&#92;approx .00980296059015...  &#92;end{array}' class='latex' /></center></p>
<p>And in general, the probability of encountering a delta value of N is</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=P%28N%29+%3D+%281-D%29%5EN+%5Ctimes+D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='P(N) = (1-D)^N &#92;times D' title='P(N) = (1-D)^N &#92;times D' class='latex' /></center></p>
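<p>Here is a quick numerical check of that formula in Python. The function name is invented for the example; the constants come straight from the definition of D above.</p>
<pre>
D = 1000000 / (99999999 + 1000000)

def delta_probability(n):
    """P(N) = (1 - D)^N * D: N empty boxes followed by one dot."""
    return (1 - D) ** n * D

print(delta_probability(0))
print(delta_probability(1))

# Summed over a long enough range, the probabilities approach 1.
print(sum(delta_probability(n) for n in range(5000)))
</pre>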
<p>Now, let&#8217;s draw the real number line in the interval [0, 1), and let&#8217;s subdivide it into partitions according to the probabilities of each delta value. They begin quite small &mdash; you can see the first three partitions for delta values <strong>0</strong>, <strong>1</strong> and <strong>2</strong> squished all the way to the left &mdash; and they get infinitesimally smaller as we proceed to the right, as larger delta values are exponentially less likely to occur.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/number-line.png" alt="" title="" width="620" height="37" class="aligncenter size-full wp-image-4484" /></p>
<p>If you were to throw a dart at this number line, the likelihood of hitting each partition is about the same as the likelihood of encountering each delta value in one of our sorted sequences.</p>
<p>That&#8217;s exactly the kind of information that&#8217;s useful for arithmetic coding.</p>
<h2>The Idea Behind Arithmetic Encoding</h2>
<p><a href="http://en.wikipedia.org/wiki/Arithmetic_coding">Arithmetic encoding</a> is able to encode a sequence of elements &mdash; in this case, delta values &mdash; by progressively subdividing the real number line into finer and finer partitions. At each step, the relative width of each partition is determined by the probability of encountering each element.</p>
<p>Suppose the first delta value in the sequence is <strong>27</strong>. We begin by locating the corresponding partition in the original number line, and <strong>zooming</strong> into it. This gives us a new interval of the real number line to work with &mdash; in this case, roughly from .236 to .243 &mdash; which we can then subdivide further. Let&#8217;s use the same proportions we used for the first element.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/select-27.png" alt="" title="" width="603" height="116" class="aligncenter size-full wp-image-4501" /></p>
<p>Suppose the next element in the sequence is <strong>39</strong>. Again, we locate the corresponding partition and zoom in, subdividing the interval into even finer partitions.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/select-39.png" alt="" title="" width="599" height="121" class="aligncenter size-full wp-image-4503" /></p>
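<p>The two zoom steps can be reproduced with ordinary floating-point numbers. This Python sketch is only meant to show the idea: the function name <code>zoom</code> is invented, and plain floats run out of precision after a few dozen steps, so it is nowhere near a real encoder.</p>
<pre>
D = 1000000 / (99999999 + 1000000)

def zoom(low, high, delta):
    """Narrow [low, high) to the partition belonging to delta."""
    width = high - low
    # Deltas 0 .. delta-1 together occupy 1 - (1 - D)**delta of the interval.
    new_low = low + width * (1 - (1 - D) ** delta)
    new_high = low + width * (1 - (1 - D) ** (delta + 1))
    return new_low, new_high

low, high = 0.0, 1.0
for delta in (27, 39):
    low, high = zoom(low, high, delta)
    print(delta, low, high)
</pre>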
<p>In this way, the interval gets progressively smaller and smaller. We repeat these steps one million times: once for each element in the sequence. After that, all we need to store is a single real value which lies somewhere within the final partition. This value will unambiguously identify the entire one-million-element sequence.</p>
<p>As you can imagine, to represent this value, it&#8217;s going to take a lot of precision. Hundreds of thousands of times more precision than you&#8217;ll find in any single- or even double-precision floating-point value. What we need is a way to represent a fractional value having <strong>millions</strong> of significant digits. And in arithmetic coding, that&#8217;s exactly what the final encoded bit stream is. It&#8217;s one giant <a href="http://floating-point-gui.de/formats/binary/">binary fraction</a> having millions of binary digits, pinpointing a specific value somewhere within the interval [0, 1) with laser precision.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/11/encoded-stream.png" alt="" title="" width="611" height="122" class="aligncenter size-full wp-image-4692" /></p>
<p>That&#8217;s great, you might be thinking, but how the heck do we even work with numbers that precise?</p>
<p>This post has already become quite long, so I’ll answer that question in a separate post. You don&#8217;t even have to wait, because it&#8217;s already published: See <a href="http://preshing.com/20121105/arithmetic-encoding-using-fixed-point-math">Arithmetic Encoding Using Fixed-Point Math</a> for the thrilling conclusion!</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>1MB Sorting Explained</title>
		<link>http://preshing.com/20121026/1mb-sorting-explained</link>
		<comments>http://preshing.com/20121026/1mb-sorting-explained#comments</comments>
		<pubDate>Fri, 26 Oct 2012 12:55:52 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4348</guid>
		<description><![CDATA[In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn&#8217;t explain how, leaving it as a kind &#8230; <a href="http://preshing.com/20121026/1mb-sorting-explained">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In my previous post, I shared some source code to <a href="http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram">sort one million 8-digit numbers in 1MB of RAM</a> as an answer to <a href="http://stackoverflow.com/questions/12748246/sorting-1-million-8-digit-numbers-in-1mb-of-ram/13067807#13067807">this Stack Overflow question</a>. The program works, but I didn&#8217;t explain how, leaving it as a kind of puzzle for the reader.</p>
<p><a href="https://gist.github.com/3952090"><img src="http://preshing.com/wp-content/uploads/2012/10/gist.png" alt="" title="" width="274" height="215" class="aligncenter size-full wp-image-4323" /></a></p>
<p>I had promised to explain it in a followup post, and in the meantime, there&#8217;s been a flurry of discussion in the comments and <a href="http://www.reddit.com/r/programming/comments/122b3b/heres_some_working_code_to_sort_one_million/">on Reddit</a>. In particular, commenter <a href="http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram#comment-26447">Ben Wilhelm</a> (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I&#8217;ll write up another explanation as promised.</p>
<h2>Data Structures</h2>
<p>In this implementation, there is a staging area and a circular buffer. The staging area is just big enough to hold 8000 plain 32-bit integers. The circular buffer always holds a <em>sorted</em> sequence of numbers, using a compact representation which we&#8217;ll get to shortly.</p>
<p><span id="more-4348"></span><img src="http://preshing.com/wp-content/uploads/2012/10/data-structures2.png" alt="" title="" width="426" height="87" class="aligncenter size-full wp-image-4827" /></p>
<p>We read numbers into the staging area until it&#8217;s full. Once it&#8217;s full, we sort it in-place, then merge it with the sorted contents of the circular buffer. The merge is basically the same merge operation performed in <a href="http://en.wikipedia.org/wiki/Merge_sort">Mergesort</a>, but we conserve memory using a trick: The new sequence is written to the buffer immediately following the previous one, and the previous sequence is erased while it&#8217;s read. This being a circular buffer, we wrap around to the beginning of the buffer once we&#8217;ve reached the end.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/circular-trick.png" alt="" title="" width="357" height="59" class="aligncenter size-full wp-image-4448" /></p>
<p>These steps are repeated until all of the input is read and the final staging area is merged. At that point, we have the entire sorted sequence stored in our circular buffer.</p>
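<p>To make the merge step concrete, here is a simplified Python sketch. It is not the C++ implementation: a <code>deque</code> of plain integers stands in for the compactly encoded circular buffer, and the function name is invented. What it does preserve is the trick of consuming the old sequence from the front while the new one is appended at the back.</p>
<pre>
from collections import deque

def merge_stage_into_buffer(buffer, stage):
    """Merge the unsorted staging list into the sorted buffer."""
    stage.sort()                    # sort the staging area in place
    old_remaining = len(buffer)     # unread items of the previous sequence
    i = 0                           # next unread item in the staging area
    while old_remaining or i != len(stage):
        take_old = old_remaining and (i == len(stage) or stage[i] >= buffer[0])
        if take_old:
            # Read (and erase) the front of the old sequence,
            # then write it at the end of the new one.
            buffer.append(buffer.popleft())
            old_remaining -= 1
        else:
            buffer.append(stage[i])
            i += 1
    stage.clear()                   # the staging area is free to be refilled

buffer = deque([1, 5, 9])
stage = [7, 2, 5]
merge_stage_into_buffer(buffer, stage)
print(list(buffer))
</pre>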
<p>The staging area is really just an optimization. Theoretically, we don&#8217;t need it &mdash; we <em>could</em> merge each input number directly into the circular buffer as it comes in. The only problem is that as the sorted sequence grows, the merge operation becomes very slow, mainly due to the decoding and re-encoding process that&#8217;s involved. In fact, if you increase the staging area to hold 20000 integers instead of just 8000, the program runs twice as fast. That&#8217;s why I made the staging area as large as possible in the available memory.</p>
<h2>A Sequence of Deltas</h2>
<p>Next, we need a way to fit the sorted number sequence into a circular buffer significantly less than 1MB (1048576 bytes) in size. Many people on Stack Overflow and Reddit came up with this idea, which is indeed where I got it: Instead of storing the actual numbers, we store a <strong>delta sequence</strong> &mdash; the differences from one number to the next. Because the sequence is already sorted, these delta values tend to be quite small. In fact, given a sequence of <em>N</em> 8-digit numbers, the average delta value is just</p>
<p style="text-align: center;"><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B100000000%7D%7BN%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='&#92;frac{100000000}{N}' title='&#92;frac{100000000}{N}' class='latex' /></p>
<p>By the time we&#8217;ve reached a million numbers, the average delta shrinks to just <strong>100</strong>. Deltas around this size can be represented using 7 bits, since 2<sup>7</sup> = 128, so intuitively, you can begin to see the potential to save memory. It&#8217;s also OK to use a lot more bits to represent larger delta values, since those appear far less frequently. The bigger the delta, the sooner we start to reach the end of the list!</p>
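<p>As a small illustration, this Python sketch converts a sorted sequence to deltas and back. The function names are invented, and measuring the first delta from 0 is an assumption made for the example.</p>
<pre>
from itertools import accumulate

def to_deltas(sorted_values):
    """Differences from one number to the next, starting from 0."""
    deltas = []
    previous = 0
    for value in sorted_values:
        deltas.append(value - previous)
        previous = value
    return deltas

def from_deltas(deltas):
    """A running sum of the deltas restores the original sequence."""
    return list(accumulate(deltas))

values = [3, 7, 7, 10, 15, 16]
deltas = to_deltas(values)
print(deltas)
print(from_deltas(deltas))

# Average delta for N = 1000000 eight-digit numbers.
print(100000000 / 1000000)
</pre>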
<h2 id="encoding-each">Encoding Each Delta</h2>
<p>All that&#8217;s left to explain is how we actually encode each delta value. Since each one will use a variable number of bits, it helps to look at the circular buffer as a bit stream, which is what <code><a href="https://gist.github.com/3952090#L65">BitReader</a></code> and <code><a href="https://gist.github.com/3952090#L178">BitWriter</a></code> are for.</p>
<p>There are many ways to encode them, and it&#8217;s tough to find a strategy which respects our extremely limited memory budget in every case. For example, <a href="http://www.reddit.com/r/programming/comments/11uali/sort_1_million_8digit_decimal_numbers_in_1mb_of/c6pnl8c">this comment</a> describes an encoding strategy which takes 1000000 bytes in the case where every delta is less than 128, but up to 1564452 bytes when other delta values are mixed in.</p>
<p>It turns out that the optimal encoding strategy is <a href="http://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic coding</a>. But before jumping into that, let&#8217;s first take a look at an <strong>alternate encoding strategy</strong> which requires a bit more memory than we&#8217;re allowed, but is much easier to understand. This alternate coding also has a few similarities to the arithmetic coding implementation I wrote, so it&#8217;ll serve as a pretty good warm-up to that one.</p>
<p>In this alternate strategy, we decode the next delta value as follows. Define an integer variable <code>accumulator</code>, initialize it to 0, then look at the incoming bit stream:</p>
<ul>
<li>If the next bit is <strong>1</strong>, add 64 to <code>accumulator</code> and repeat.</li>
<li>If the next bit is <strong>0</strong>, read a 6-bit integer from the bit stream, add it to <code>accumulator</code> and return <code>accumulator</code> as the delta.</li>
</ul>
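<p>As a rough illustration of those two rules, here is a Python sketch of the decoder together with a matching encoder. It is not taken from the C++ program: the function names are invented, bits are modelled as a plain list of 0s and 1s rather than a <code>BitReader</code>, and storing the 6-bit integer most significant bit first is an assumption.</p>
<pre>
def encode_delta(delta):
    """Bits for one delta: a 1 per full 64, then a 0, then 6 bits."""
    bits = [1] * (delta // 64)
    bits.append(0)
    bits.extend(int(b) for b in format(delta % 64, '06b'))
    return bits

def decode_delta(bit_iter):
    """Consume one delta from an iterator of bits."""
    accumulator = 0
    while next(bit_iter) == 1:      # each leading 1 adds 64
        accumulator += 64
    remainder = 0
    for _ in range(6):              # 6-bit integer, most significant bit first
        remainder = remainder * 2 + next(bit_iter)
    return accumulator + remainder

stream = encode_delta(100) + encode_delta(0) + encode_delta(5)
reader = iter(stream)
print([decode_delta(reader) for _ in range(3)])
</pre>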
<p>How much memory does <em>this</em> strategy need to represent an entire number sequence? Well, each delta value will require at least one <strong>0</strong> followed by 6 bits, and we&#8217;re going to have a million of those. On top of that, each leading <strong>1</strong> causes our sequence value to increase by 64, which obviously can&#8217;t happen more than <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B100000000%7D%7B64%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{100000000}{64}' title='&#92;frac{100000000}{64}' class='latex' /> times. Using this strategy, every possible sequence should fit within</p>
<p style="text-align: center;"><img src='http://s0.wp.com/latex.php?latex=7+%5Ctimes+1000000+%2B+%5Cdisplaystyle%5Cfrac%7B100000000%7D%7B64%7D+%3D+8562500+%5Ctext%7B+bits%7D+%5Capprox+%5Cbold%7B1070312.5%7D+%5Ctext%7B+bytes%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='7 &#92;times 1000000 + &#92;displaystyle&#92;frac{100000000}{64} = 8562500 &#92;text{ bits} &#92;approx &#92;bold{1070312.5} &#92;text{ bytes}' title='7 &#92;times 1000000 + &#92;displaystyle&#92;frac{100000000}{64} = 8562500 &#92;text{ bits} &#92;approx &#92;bold{1070312.5} &#92;text{ bytes}' class='latex' /></p>
<p>Indeed, a sequence where all deltas are zero except for one 99999999 requires 8562499 bits, just one bit short of that bound. That&#8217;s more than 21KB over budget, so we have to do better.</p>
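<p>The bound is easy to check numerically. In this Python sketch, the constant and function names are invented for illustration; <code>encoded_bits</code> simply counts the bits the alternate strategy spends on one delta. The pathological sequence comes within a single bit of the bound.</p>
<pre>
VALUES = 1000000          # numbers in the sequence
RANGE = 100000000         # count of possible 8-digit values
BUDGET_BYTES = 1048576    # 1MB

def encoded_bits(delta):
    """One leading 1 per full 64, then a 0, then a 6-bit remainder."""
    return delta // 64 + 1 + 6

bound_bits = 7 * VALUES + RANGE / 64
print(bound_bits, bound_bits / 8, bound_bits / 8 - BUDGET_BYTES)

# All deltas zero except for a single delta of 99999999.
worst_case_bits = (VALUES - 1) * encoded_bits(0) + encoded_bits(99999999)
print(worst_case_bits)
</pre>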
<p>Effectively, the above strategy follows a path determined by the bit stream to arrive at the next delta value. Here&#8217;s a graph to illustrate several possible bit paths:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/huffman-tree.png" alt="" title="" width="433" height="373" class="aligncenter size-full wp-image-4376" /></p>
<p>For those who don&#8217;t already recognize it, this turns out to be a <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman encoding</a> tree. Huffman coding is really just a special case of arithmetic coding. (And as commenter Stergios Stergiou points out, this particular style of Huffman coding is known as <a href="http://en.wikipedia.org/wiki/Golomb_coding">Golomb coding</a>.)</p>
<p>When arithmetic coding is used, the entire sequence always fits safely in fewer than 1013000 bytes. This strategy proved kind of challenging to implement, so it took the most time to get right. In the <a href="http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem">next two posts</a>, I&#8217;ll describe how it was implemented using all that fancy bit magic you see in <code>Decoder::decode</code> and <code>Encoder::encode</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121026/1mb-sorting-explained/feed</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Here&#8217;s Some Working Code to Sort One Million 8-Digit Numbers in 1MB of RAM</title>
		<link>http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram</link>
		<comments>http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram#comments</comments>
		<pubDate>Thu, 25 Oct 2012 11:49:46 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4271</guid>
		<description><![CDATA[Earlier this week, this Stack Overflow question was featured on Reddit under /r/programming. Apparently, it was once a Google interview question &#8212; or at least, a variation of one: You have a computer with 1M of RAM and no other &#8230; <a href="http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Earlier this week, <a href="http://stackoverflow.com/questions/12748246/sorting-1-million-8-digit-numbers-in-1mb-of-ram">this Stack Overflow</a> question was featured on Reddit under <a href="http://www.reddit.com/r/programming/">/r/programming</a>. Apparently, it was once a Google interview question &mdash; or at least, a variation of one:</p>
<blockquote><p>You have a computer with 1M of RAM and no other local storage. You must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection. The list of numbers may contain duplicates, which you may not discard. Your code will be placed in ROM, so you need not subtract the size of your code from the 1M. You have been given code to drive the Ethernet port and handle TCP/IP connections, and it requires 2k for its state data, including a 1k buffer via which your code will read and write data.</p></blockquote>
<p>The challenge here, of course, is that you can&#8217;t fit all the data in memory as raw integers. There are 100 million possible 8-digit decimal numbers, so even if you pack each number into 27 bits (2<sup>27</sup> = ~134 million), that would take <strong>3375000</strong> bytes of storage. The hypothetical machine available to you has only <strong>1048576</strong> bytes of storage. At first glance, it seems to defy the laws of mathematics.</p>
<p>The question received enough attention that a lot of people proposed solutions on Stack Overflow and in the <a href="http://www.reddit.com/r/programming/comments/11uali/sort_1_million_8digit_decimal_numbers_in_1mb_of/">Reddit discussion</a>, but so far, nothing that runs. There&#8217;s some code for a solution which doesn&#8217;t work, many proposals without code, and some hilarious out-of-the-box answers. A lot of answers come close, but won&#8217;t work on every possible input.</p>
<p><span id="more-4271"></span>In thinking about this problem, it occurred to me that I&#8217;ve already written a blog post containing a clue which leads to the ideal solution. I won&#8217;t give away which post, but after that, I couldn&#8217;t resist cobbling together a working implementation in C++.</p>
<p>The <a href="https://gist.github.com/3952090">source code listing</a> is just 339 lines with very few comments. I thought I&#8217;d post it here to see if anybody can figure out which algorithm was used:</p>
<p><a href="https://gist.github.com/3952090"><img src="http://preshing.com/wp-content/uploads/2012/10/gist.png" alt="" title="" width="274" height="215" class="aligncenter size-full wp-image-4323" /></a></p>
<p>Proof that the memory constraints are satisfied:</p>
<pre lang="cpp" cssfile="none">
typedef unsigned int u32;

namespace WorkArea
{
    static const u32 circularSize = 253250;
    u32 circular[circularSize] = { 0 };         // consumes 1013000 bytes

    static const u32 stageSize = 8000;
    u32 stage[stageSize];                       // consumes 32000 bytes

    ...
</pre>
<p>Together, these two arrays take <strong>1045000</strong> bytes of storage. Tack on the hypothetical 2KB overhead for the TCP connection, and you&#8217;re left with a healthy margin of 1048576 &#8211; 1045000 &#8211; 2&times;1024 = <strong>1528</strong> bytes for remaining variables and stack space.</p>
<p>It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of <code>sort1mb.exe</code>. See if you can find any inputs which break it!</p>
<pre>
from subprocess import Popen, PIPE
import random

# One million random 8-digit values; duplicates are allowed.
sequence = [random.randint(0, 99999999) for _ in range(1000000)]

# Text mode lets us write and read str objects under Python 3.
sorter = Popen('sort1mb.exe', stdin=PIPE, stdout=PIPE, universal_newlines=True)
for value in sequence:
    sorter.stdin.write('%08d\n' % value)
sorter.stdin.close()

result = [int(line) for line in sorter.stdout]
print('OK!' if result == sorted(sequence) else 'Error!')
</pre>
<p>I submitted <a href="http://stackoverflow.com/a/13067807/710717">this answer to the Stack Overflow question</a>, but it&#8217;s pretty late to the party. <strong>Update:</strong> You&#8217;ll find a detailed explanation in my next post, <a href="http://preshing.com/20121026/1mb-sorting-explained">1MB Sorting Explained</a>.</p>
<p><a href="http://nick.cleaton.net/ramsortsol.html">Nick Cleaton</a> originally proposed the currently accepted solution. His approach is different, and not in any runnable form, but the idea also looks valid.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121025/heres-some-working-code-to-sort-one-million-8-digit-numbers-in-1mb-of-ram/feed</wfw:commentRss>
		<slash:comments>27</slash:comments>
		</item>
		<item>
		<title>This Is Why They Call It a Weakly-Ordered CPU</title>
		<link>http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu</link>
		<comments>http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu#comments</comments>
		<pubDate>Fri, 19 Oct 2012 10:34:03 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=4065</guid>
		<description><![CDATA[On this blog, I&#8217;ve been rambling on about lock-free programming subjects such as acquire and release semantics and weakly-ordered CPUs. I&#8217;ve tried to make these subjects approachable and understandable, but at the end of the day, talk is cheap! Nothing &#8230; <a href="http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>On this blog, I&#8217;ve been rambling on about <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming">lock-free programming</a> subjects such as <a href="http://preshing.com/20120913/acquire-and-release-semantics">acquire and release semantics</a> and <a href="http://preshing.com/20120930/weak-vs-strong-memory-models">weakly-ordered CPUs</a>. I&#8217;ve tried to make these subjects approachable and understandable, but at the end of the day, talk is cheap! Nothing drives the point home better than a concrete example.</p>
<p>If there&#8217;s one thing that characterizes a weakly-ordered CPU, it&#8217;s that one CPU core can see values change in shared memory in a different order than another core wrote them. That&#8217;s what I&#8217;d like to demonstrate in this post using pure C++11.</p>
<p>For normal applications, the x86/64 processor families from Intel and AMD do not have this characteristic. So we can forget about demonstrating this phenomenon on pretty much every modern desktop or notebook computer in the world. What we really need is a weakly-ordered multicore device. Fortunately, I happen to have one right here in my pocket:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/iphone-4s.jpg" alt="" title="" width="310" height="207" class="aligncenter size-full wp-image-4067" /></p>
<p><span id="more-4065"></span>The iPhone 4S fits the bill. It runs on a <strong>dual-core ARM-based</strong> processor, and the ARM architecture is, in fact, weakly-ordered.</p>
<h2>The Experiment</h2>
<p>Our experiment will consist of a single integer, <code>sharedValue</code>, protected by a mutex. We&#8217;ll spawn two threads, and each thread will run until it has incremented <code>sharedValue</code> 10000000 times.</p>
<p>We won&#8217;t let our threads block waiting on the mutex. Instead, each thread will loop repeatedly doing busy work (i.e. just wasting CPU time) and attempting to lock the mutex at random moments. If the lock succeeds, the thread will increment <code>sharedValue</code>, then unlock. If the lock fails, it will just go back to doing busy work. Here&#8217;s some pseudocode:</p>
<pre>
count = 0
while count < 10000000:
    doRandomAmountOfBusyWork()
    if tryLockMutex():
        // The lock succeeded
        sharedValue++
        unlockMutex()
        count++
    endif
endwhile
</pre>
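<p>For readers who want to run that loop, here is a Python rendering of the pseudocode. It only illustrates the structure of the experiment: it uses the standard <code>threading.Lock</code> as the mutex, so it always produces the correct total and cannot show any memory reordering. The helper names and the reduced iteration count are choices made for this sketch.</p>
<pre>
import random
import threading

ITERATIONS = 10000      # the experiment uses 10000000; reduced to run quickly

state = {'shared_value': 0}
mutex = threading.Lock()

def do_random_amount_of_busy_work():
    for _ in range(random.randint(0, 10)):
        pass

def increment_shared_value():
    count = 0
    while count != ITERATIONS:
        do_random_amount_of_busy_work()
        if mutex.acquire(blocking=False):   # try-lock: never waits
            # The lock succeeded
            state['shared_value'] += 1
            mutex.release()
            count += 1

def run_experiment():
    state['shared_value'] = 0
    threads = [threading.Thread(target=increment_shared_value) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state['shared_value']

print(run_experiment())
</pre>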
<p>With each thread running on a separate CPU core, the timeline should look something like this. Each red section represents a successful lock and increment, while the dark blue ticks represent lock attempts which failed because the other thread was already holding the mutex.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/experiment-timeline.png" alt="" title="" width="471" height="141" class="aligncenter size-full wp-image-4066" /></p>
<p>It bears repeating that <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">a mutex is just a concept</a>, and there are <a href="http://preshing.com/20120226/roll-your-own-lightweight-mutex">many</a> <a href="http://preshing.com/20120305/implementing-a-recursive-mutex">ways</a> to implement one. We could use the implementation provided by <code>std::mutex</code>, and of course, everything will function correctly. But then I'd have nothing to show you. Instead, let's implement a custom mutex &mdash; then let's break it to demonstrate the consequences of <a href="http://preshing.com/20120930/weak-vs-strong-memory-models">weak hardware ordering</a>. Intuitively, the potential for memory reordering will be highest at those moments when there is a "close shave" between threads &mdash; for example, at the moment circled in the above diagram, when one thread acquires the lock <em>just</em> as the other thread releases it.</p>
<p>The latest version of Xcode has terrific support for C++11 threads and atomic types, so let's use those. All C++11 identifiers are defined in the <code>std</code> namespace, so let's assume <code>using namespace std;</code> was placed somewhere earlier in the code.</p>
<h2>A Ridiculously Simple Mutex</h2>
<p>Our mutex will consist of a single integer <code>flag</code>, where 1 indicates that the mutex is held, and 0 means it isn't. To ensure mutual exclusivity, a thread can only set <code>flag</code> to 1 if the previous value was 0, and it must do so atomically. To achieve this, we'll define <code>flag</code> as a C++11 atomic type, <code>atomic&lt;int&gt;</code>, and use a <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming#atomic-rmw">read-modify-write</a> operation:</p>
<pre lang="cpp" cssfile="none">
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_acquire))
{
    // The lock succeeded
}
</pre>
<p>The <code>memory_order_acquire</code> argument used above is considered an <em>ordering constraint</em>. We're placing acquire semantics on the operation, to help guarantee that we receive the latest shared values from the previous thread which held the lock.</p>
<p>To release the lock, we perform the following:</p>
<pre lang="cpp" cssfile="none">
flag.store(0, memory_order_release);
</pre>
<p>This sets <code>flag</code> back to 0 using the <code>memory_order_release</code> ordering constraint, which applies release semantics. <a href="http://preshing.com/20120913/acquire-and-release-semantics">Acquire and release semantics</a> must be used as a pair to ensure that shared values propagate completely from one thread to the next.</p>
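<p>Put together, the whole mutex fits in a handful of lines. What follows is a minimal sketch for illustration only: the <code>SimpleMutex</code> class and its <code>tryLock</code> and <code>unlock</code> method names are hypothetical, and the <code>std::</code> prefixes are spelled out so the snippet stands alone.</p>
<pre lang="cpp" cssfile="none">
#include <atomic>

// Hypothetical wrapper around the single-integer flag described above.
class SimpleMutex
{
public:
    SimpleMutex() : flag(0) {}

    // Returns true only if this call changed flag from 0 to 1,
    // meaning the calling thread now holds the lock.
    bool tryLock()
    {
        int expected = 0;
        return flag.compare_exchange_strong(expected, 1, std::memory_order_acquire);
    }

    // Must only be called by the thread that currently holds the lock.
    void unlock()
    {
        flag.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> flag;      // 1 = held, 0 = free
};
</pre>
<p>A failed <code>tryLock</code> leaves the mutex untouched, so the caller is free to go back to whatever it was doing and try again later.</p>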
<h2>If We Don't Use Acquire and Release Semantics...</h2>
<p>Now, let's write the experiment in C++11, but instead of specifying the correct ordering constraints, let's put <code>memory_order_relaxed</code> in both places. This means no particular memory ordering will be enforced by the C++11 compiler, and <a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations">any kind of reordering</a> is permitted.</p>
<pre lang="cpp" cssfile="none">
void IncrementSharedValue10000000Times(RandomDelay&#038; randomDelay)
{
    int count = 0;
    while (count < 10000000)
    {
        randomDelay.doBusyWork();
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, memory_order_relaxed))
        {
            // Lock was successful
            sharedValue++;
            flag.store(0, memory_order_relaxed);
            count++;
        }
    }
}
</pre>
<p>At this point, it's informative to look at the resulting ARM assembly code generated by the compiler, in Release, using the Disassembly view in Xcode:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/disasm-no-barriers.png" alt="" title="" width="507" height="249" class="aligncenter size-full wp-image-4071" /></p>
<p>If you aren't very familiar with assembly language, don't worry. All we want to know is whether the compiler has reordered any operations on shared variables. This would include the two operations on <code>flag</code>, and the increment of <code>sharedValue</code> in between. Above, I've annotated the corresponding sections of assembly code. As you can see, we got lucky: The compiler chose <em>not</em> to reorder those operations, even though the <code>memory_order_relaxed</code> argument means that, in all fairness, it could have.</p>
<p>I've put together a sample application which repeats this experiment indefinitely, printing the final value of <code>sharedValue</code> at the end of each trial run. It's <a href="https://github.com/preshing/AcquireRelease">available on GitHub</a> if you'd like to view the source code or run it yourself.</p>
<p>Here's the iPhone, hard at work, running the experiment:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/iphone-running.jpg" alt="" title="" width="263" height="121" class="aligncenter size-full wp-image-4068" /></p>
<p>And here's the output from the Output panel in Xcode:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/output-no-barriers.png" alt="" title="" width="308" height="123" class="aligncenter size-full wp-image-4069" /></p>
<p>Check it out! The final value of <code>sharedValue</code> is consistently less than 20000000, even though both threads perform exactly 10000000 increments, and the order of assembly instructions exactly matches the order of operations on shared variables as specified in C++.</p>
<p>As you might have guessed, these results are entirely due to memory reordering <strong>on the CPU</strong>. To point out just one possible reordering &mdash; and there are several &mdash; the memory interaction of <code>str.w r0, [r11]</code> (the store to <code>sharedValue</code>) could be reordered with that of <code>str r5, [r6]</code> (the store of 0 to <code>flag</code>). In other words, the mutex could be effectively unlocked before we're finished with it! As a result, the other thread would be free to wipe out the change made by this one, resulting in a mismatched <code>sharedValue</code> count at the end of the experiment, just as we're seeing here.</p>
<h2>Using Acquire and Release Semantics Correctly</h2>
<p>Fixing our sample application, of course, means putting the correct C++11 memory ordering constraints back in place:</p>
<pre>
void IncrementSharedValue10000000Times(RandomDelay&#038; randomDelay)
{
    int count = 0;
    while (count < 10000000)
    {
        randomDelay.doBusyWork();
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, <span class="highlight">memory_order_acquire</span>))
        {
            // Lock was successful
            sharedValue++;
            flag.store(0, <span class="highlight">memory_order_release</span>);
            count++;
        }
    }
}
</pre>
<p>As a result, the compiler now inserts a couple of <code>dmb ish</code> instructions, which act as memory barriers in the ARMv7 instruction set. I'm not an ARM expert &mdash; comments are welcome &mdash; but it's safe to assume this instruction, much like <code>lwsync</code> on PowerPC, provides all the <a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations">memory barrier types</a> needed for acquire semantics on <code>compare_exchange_strong</code>, and release semantics on <code>store</code>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/disasm-with-barriers.png" alt="" title="" width="507" height="275" class="aligncenter size-full wp-image-4072" /></p>
<p>This time, our little home-grown mutex really does protect <code>sharedValue</code>, ensuring all modifications are passed safely from one thread to the next each time the mutex is locked.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/output-with-barriers.png" alt="" title="" width="308" height="123" class="aligncenter size-full wp-image-4070" /></p>
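<p>For reference, a single trial of the corrected experiment can be sketched roughly as follows. This is not the code from the sample application, just a stripped-down approximation: <code>doBusyWork</code> is a hypothetical stand-in for <code>RandomDelay</code>, and <code>incrementSharedValue</code> and <code>runTrial</code> are names made up for this sketch. The iteration count is a parameter so you can run shorter trials.</p>
<pre lang="cpp" cssfile="none">
#include <atomic>
#include <thread>

std::atomic<int> flag(0);   // 1 = mutex held, 0 = free
int sharedValue = 0;        // plain int, protected only by flag

// Hypothetical stand-in for RandomDelay: spins for a small, varying number
// of iterations so that lock attempts land at irregular moments.
unsigned doBusyWork(unsigned seed)
{
    seed = seed * 1664525u + 1013904223u;       // one step of a simple LCG
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < (seed >> 28); i++) // a handful of iterations
        sink = sink + i;
    return seed;
}

void incrementSharedValue(int iterations, unsigned seed)
{
    int count = 0;
    while (count < iterations)
    {
        seed = doBusyWork(seed);
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, std::memory_order_acquire))
        {
            // Lock was successful
            sharedValue++;
            flag.store(0, std::memory_order_release);
            count++;
        }
    }
}

// Runs two threads to completion and returns the final sharedValue.
int runTrial(int iterations)
{
    flag.store(0);
    sharedValue = 0;
    std::thread t1(incrementSharedValue, iterations, 1u);
    std::thread t2(incrementSharedValue, iterations, 2u);
    t1.join();
    t2.join();
    return sharedValue;
}
</pre>
<p>With acquire and release in place, <code>runTrial(n)</code> returns exactly <code>2 * n</code> on any conforming C++11 implementation, whatever the underlying hardware.</p>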
<p>If you still don't grasp intuitively what's going on in this experiment, I'd suggest a review of my <a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations">source control analogy</a> post. In terms of that analogy, you can imagine two workstations each having local copies of <code>sharedValue</code> and <code>flag</code>, with some effort required to keep them in sync. Personally, I find visualizing it this way very helpful.</p>
<p>I'd just like to reiterate that the memory reordering we saw here can only be observed on a <strong>multicore</strong> or multiprocessor device. If you take the same compiled application and run it on an iPhone 3GS or first-generation iPad, which use the same ARMv7 architecture but have only a single CPU core, you won't see any mismatch in the final count of <code>sharedValue</code>.</p>
<h2>Interesting Notes</h2>
<p>You can build and run <a href="https://github.com/preshing/AcquireRelease">this sample application</a> on any Windows, MacOS or Linux machine with an x86/64 CPU, but unless your compiler performs reordering on specific instructions, you won't witness any memory reordering at runtime, even on a multicore system! Indeed, when I tested it using Visual Studio 2012, no memory reordering occurred. That's because x86/64 processors are what is usually considered <a href="http://preshing.com/20120930/weak-vs-strong-memory-models#strong">strongly-ordered</a>: When one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/10/no-barrier-on-intel.png" alt="" title="" width="407" height="149" class="aligncenter size-full wp-image-4078" /></p>
<p>This goes to show how easy it is to use C++11 atomics incorrectly without knowing it, simply because it appears to work correctly on a specific processor and toolchain.</p>
<p>Incidentally, the release candidate of Visual Studio 2012 generates rather poor x86 machine code for this sample. It's nowhere near as efficient as the ARM code generated by Xcode. Meanwhile, performance is the main reason to use lock-free programming on multicore in the first place! It's enough to turn me off using C++11 atomics on Windows for the time being. [<strong>Update Feb. 2013</strong>: As mentioned in the <a href="http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu#comment-61574">comments</a>, the latest version of VS2012 Professional now generates much better machine code.]</p>
<p>This post is a followup to an earlier post where I demonstrated <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act">StoreLoad reordering</a> on x86/64. In my experience, however, the need for <code>#StoreLoad</code> does not come up quite as often in practice as the ordering constraints demonstrated here.</p>
<p>Finally, I'm not the first person to demonstrate weak hardware ordering in practice, though I might be the first to demonstrate it using C++11. There are earlier posts by <a href="http://wanderingcoder.net/2011/04/01/arm-memory-ordering/">Pierre Lebeaupin</a> and <a href="http://ridiculousfish.com/blog/posts/barrier.html">ridiculousfish</a> which use different experiments to demonstrate the same phenomenon.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/feed</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>Weak vs. Strong Memory Models</title>
		<link>http://preshing.com/20120930/weak-vs-strong-memory-models</link>
		<comments>http://preshing.com/20120930/weak-vs-strong-memory-models#comments</comments>
		<pubDate>Sun, 30 Sep 2012 22:11:19 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=3951</guid>
		<description><![CDATA[There are many types of memory reordering, and not all types of reordering occur equally often. It all depends on the processor you&#8217;re targeting and/or the toolchain you&#8217;re using for development. A memory model tells you, for a given processor or &#8230; <a href="http://preshing.com/20120930/weak-vs-strong-memory-models">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>There are many types of memory reordering, and not all types of reordering occur equally often. It all depends on the processor you&#8217;re targeting and/or the toolchain you&#8217;re using for development.</p>
<p>A <strong>memory model</strong> tells you, for a given processor or toolchain, exactly what types of memory reordering to expect at runtime relative to a given source code listing. Keep in mind that the effects of memory reordering can only be observed when <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming">lock-free programming</a> techniques are used.</p>
<p>After studying memory models for a while &#8212; mostly by reading various online sources and verifying through experimentation &#8212; I&#8217;ve gone ahead and organized them into the following four categories. Below, each memory model makes all the guarantees of the ones to the left, plus some additional ones. I&#8217;ve drawn a clear line between weak memory models and strong ones, to capture the way most people appear to use these terms. Read on for my justification for doing so.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/09/weak-strong-table.png" alt="" title="" width="592" height="312" class="aligncenter size-full wp-image-3952" /></p>
<p><span id="more-3951"></span>Each physical device pictured above represents a <strong>hardware</strong> memory model. A hardware memory model tells you what kind of memory ordering to expect at runtime relative to an <em>assembly</em> (or machine) code listing.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/09/hardware-matters.png" alt="" title="" width="532" height="212" class="aligncenter size-full wp-image-4011" /></p>
<p>Every processor family has different habits when it comes to memory reordering, and those habits can only be observed in multicore or multiprocessor configurations. Given that <a href="http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance">multicore is now mainstream</a>, it&#8217;s worth having some familiarity with them.</p>
<p>There are <strong>software</strong> memory models as well. Technically, once you&#8217;ve written (and debugged) portable lock-free code in C11, C++11 or Java, only the software memory model is supposed to matter. Nonetheless, a general understanding of hardware memory models may come in handy. It can help you explain unexpected behavior while debugging and, perhaps just as importantly, appreciate how incorrect code may function correctly on a specific processor and toolchain by sheer luck.</p>
<h2>Weak Memory Models</h2>
<p><a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations"><img src="http://preshing.com/wp-content/uploads/2012/09/analogy-small.png" alt="" title="" width="157" height="106" class="alignright size-full wp-image-3872" /></a>In the weakest memory model, it&#8217;s possible to experience <a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations">all four types of memory reordering</a> I described using a source control analogy in a previous post. Any load or store operation can effectively be reordered with any other load or store operation, as long as it would never modify the behavior of a single, isolated thread. In reality, the reordering may be due to either <a href="http://preshing.com/20120625/memory-ordering-at-compile-time">compiler reordering</a> of instructions, or memory reordering on the processor itself.</p>
<p>When a processor has a weak hardware memory model, we tend to say it&#8217;s <em>weakly-ordered</em> or that it has <em>weak ordering</em>. We may also say it has a <em>relaxed</em> memory model. The venerable <strong>DEC Alpha</strong> is everybody&#8217;s <a href="http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#2277">favorite example</a> of a weakly-ordered processor. There&#8217;s really no mainstream processor with weaker ordering.</p>
<p>The C11 and C++11 programming languages expose a weak software memory model which was in many ways influenced by the Alpha. When using low-level atomic operations in these languages, it doesn&#8217;t matter if you&#8217;re actually targeting a strong processor family such as x86/64. As I demonstrated previously, you must still specify the <a href="http://preshing.com/20120913/acquire-and-release-semantics">correct memory ordering constraints</a>, if only to prevent compiler reordering.</p>
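<p>As a rough illustration of what specifying the correct constraints looks like, here is a minimal message-passing sketch. The names <code>payload</code>, <code>ready</code>, <code>producer</code>, <code>consumer</code> and <code>runMessagePassing</code> are all hypothetical. The release store and acquire load are required by the C++11 model regardless of the target, even where the processor itself would already preserve the order.</p>
<pre lang="cpp" cssfile="none">
#include <atomic>
#include <thread>

int payload = 0;                // ordinary, non-atomic shared data
std::atomic<int> ready(0);      // guard variable: 1 once payload is valid

void producer()
{
    payload = 42;                                   // 1. write the data
    ready.store(1, std::memory_order_release);      // 2. publish it
}

int consumer()
{
    // Spin until the producer has published.
    while (ready.load(std::memory_order_acquire) == 0)
        std::this_thread::yield();
    return payload;     // the acquire load makes the write of 42 visible
}

// Runs the producer on a second thread and returns what the consumer saw.
int runMessagePassing()
{
    payload = 0;
    ready.store(0);
    std::thread t(producer);
    int result = consumer();
    t.join();
    return result;
}
</pre>
<p>If both operations on <code>ready</code> were changed to <code>memory_order_relaxed</code>, the compiler would be free to move the write to <code>payload</code> after the store to <code>ready</code>, and the program would contain a data race.</p>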
<h3>Weak With Data Dependency Ordering</h3>
<p>Though the Alpha has become less relevant with time, we still have several modern CPU families which carry on in the same tradition of weak hardware ordering:</p>
<ul>
<li><strong>ARM</strong>, which is currently found in hundreds of millions of smartphones and tablets, and is increasingly popular in multicore configurations.</li>
<li><strong>PowerPC</strong>, which the Xbox 360 in particular has already delivered to 70 million living rooms in a multicore configuration.</li>
<li><strong>Itanium</strong>, which Microsoft no longer supports in Windows, but which is still supported in Linux and found in HP servers.</li>
</ul>
<p>These families have memory models which are, in various ways, almost as weak as the Alpha&#8217;s, except for one common detail of particular interest to programmers: they maintain <a href="http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#305">data dependency ordering</a>. What does that mean? It means that if you write <code>A->B</code> in C/C++, you are always guaranteed to load a value of <code>B</code> which is at least as new as the value of <code>A</code>. The Alpha doesn&#8217;t guarantee that. I won&#8217;t dwell on data dependency ordering too much here, except to mention that the <a href="http://lwn.net/Articles/262464/">Linux RCU mechanism</a> relies on it heavily.</p>
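<p>In C++11 terms, data dependency ordering is what <code>memory_order_consume</code> was meant to expose. The following is only a sketch of the <code>A->B</code> pattern under that assumption; the <code>Node</code> type and the <code>publisher</code>, <code>reader</code> and <code>runPointerPublication</code> functions are hypothetical names. Current compilers generally treat <code>memory_order_consume</code> as <code>memory_order_acquire</code>, so the generated code may be stronger than strictly necessary.</p>
<pre lang="cpp" cssfile="none">
#include <atomic>
#include <thread>

struct Node
{
    int B;
};

std::atomic<Node*> A(nullptr);      // shared pointer, initially null

void publisher()
{
    Node* n = new Node;
    n->B = 123;                                 // initialize the object first
    A.store(n, std::memory_order_release);      // then publish the pointer
}

int reader()
{
    Node* p = A.load(std::memory_order_consume);
    while (p == nullptr)
    {
        std::this_thread::yield();
        p = A.load(std::memory_order_consume);
    }
    return p->B;    // this load depends on the value loaded from A
}

// Publishes a node on a second thread and returns the B that the reader saw.
int runPointerPublication()
{
    A.store(nullptr);
    std::thread t(publisher);
    int result = reader();
    t.join();
    delete A.load();
    return result;
}
</pre>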
<h2 id="strong">Strong Memory Models</h2>
<p>Let&#8217;s look at hardware memory models first. What, exactly, is the difference between a strong one and a weak one? There is actually <a href="http://herbsutter.com/2012/08/02/strong-and-weak-hardware-memory-models/#comment-5903">a little disagreement</a> over this question, but my feeling is that in 80% of the cases, most people mean the same thing. Therefore, I&#8217;d like to propose the following definition:</p>
<blockquote><p>
A <strong>strong hardware memory model</strong> is one in which every machine instruction comes implicitly with <a href="http://preshing.com/20120913/acquire-and-release-semantics" title="Acquire and Release Semantics">acquire and release semantics</a>. As a result, when one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.
</p></blockquote>
<p>It&#8217;s not too hard to visualize. Just imagine a refinement of the <a href="http://preshing.com/20120710/memory-barriers-are-like-source-control-operations">source control analogy</a> where all modifications are committed to shared memory in-order (no StoreStore reordering), pulled from shared memory in-order (no LoadLoad reordering), and instructions are always executed in-order (no LoadStore reordering). StoreLoad reordering, however, <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act">still remains possible</a>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/09/strong-hardware.png" alt="" title="" width="283" height="71" class="aligncenter size-full wp-image-3988" /></p>
<p>Under the above definition, the <strong>x86/64</strong> family of processors is <em>usually</em> strongly-ordered. There are certain cases in which some of x86/64&#8217;s <a href="http://preshing.com/20120913/acquire-and-release-semantics#comment-20810">strong ordering guarantees are lost</a>, but for the most part, as application programmers, we can ignore those cases. It&#8217;s true that an x86/64 processor can <a href="http://en.wikipedia.org/wiki/Out-of-order_execution">execute instructions out-of-order</a>, but that&#8217;s a hardware implementation detail; what matters is that it still keeps its <em>memory interactions</em> in-order, so in a multicore environment, we can still consider it strongly-ordered. Historically, there has also been a little confusion due to <a href="http://jakob.engbloms.se/archives/1435">evolving specs</a>.</p>
<p>Apparently <strong>SPARC</strong> processors, when running in <strong>TSO</strong> mode, are another example of strong hardware ordering. TSO stands for &#8220;total store order&#8221;, which, in a subtle way, is different from the definition I gave above. It means that there is always a single, global order of writes to shared memory from all cores. The x86/64 has this property too: See Volume 3, sections 8.2.3.6-8 of <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">Intel&#8217;s x86/64 Architecture Specification</a> for some examples. From what I can tell, the TSO property isn&#8217;t usually of direct interest to low-level lock-free programmers, but it is a step towards sequential consistency.</p>
<h3>Sequential Consistency</h3>
<p>In a <a href="http://preshing.com/20120612/an-introduction-to-lock-free-programming#sequential-consistency">sequentially consistent</a> memory model, there is no memory reordering. It&#8217;s as if the entire program execution is reduced to a sequential interleaving of instructions from each thread. In particular, the result r1 = r2 = 0 from <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act">Memory Reordering Caught in the Act</a> becomes impossible.</p>
<p>These days, you won&#8217;t easily find a modern multicore device which guarantees sequential consistency at the hardware level. However, it seems at least one sequentially consistent, dual-processor machine existed back in 1989: The 386-based <a href="http://vogons.zetafleet.com/viewtopic.php?t=23842#178666">Compaq SystemPro</a>. According to Intel&#8217;s docs, the 386 wasn&#8217;t advanced enough to perform any memory reordering at runtime.</p>
<p><a href="http://www.amazon.com/gp/product/0123973376/ref=as_li_ss_tl?ie=UTF8&#038;tag=preshonprogr-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0123973376"><img src="http://preshing.com/wp-content/uploads/2012/06/art-of-multiprocessor.png" alt="" title="" width="107" height="132" class="alignright size-full wp-image-3435" /></a>In any case, sequential consistency only really becomes interesting as a <strong>software</strong> memory model, when working in higher-level programming languages. In Java 5 and higher, you can declare shared variables as <code>volatile</code>. In C++11, you can use the default ordering constraint, <code>memory_order_seq_cst</code>, when performing operations on atomic library types. If you do those things, the toolchain will restrict compiler reordering and emit CPU-specific instructions which act as the appropriate memory barrier types. In this way, a sequentially consistent memory model can be &#8220;emulated&#8221; even on weakly-ordered multicore devices. If you read Herlihy &#038; Shavit&#8217;s <a href="http://www.amazon.com/gp/product/0123973376/ref=as_li_ss_tl?ie=UTF8&#038;tag=preshonprogr-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0123973376">The Art of Multiprocessor Programming</a>, be aware that most of their examples assume a sequentially consistent software memory model.</p>
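<p>To make that concrete, here is a small sketch of the r1 = r2 = 0 experiment rewritten with C++11 atomics and the default ordering. It is a simplification, not the original test harness: the function names are hypothetical, and it spawns fresh threads for every trial instead of synchronizing long-lived ones.</p>
<pre lang="cpp" cssfile="none">
#include <atomic>
#include <thread>

std::atomic<int> X(0), Y(0);
int r1 = 0, r2 = 0;     // each written by one thread, read after join()

void thread1()
{
    X.store(1);         // defaults to memory_order_seq_cst
    r1 = Y.load();      // also memory_order_seq_cst
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

// Returns how many trials ended with both r1 and r2 equal to 0.
int countBothZero(int trials)
{
    int bothZero = 0;
    for (int i = 0; i < trials; i++)
    {
        X.store(0);
        Y.store(0);
        std::thread t1(thread1);
        std::thread t2(thread2);
        t1.join();
        t2.join();
        if (r1 + r2 == 0)   // both are 0 or 1, so the sum is 0 only if both are 0
            bothZero++;
    }
    return bothZero;
}
</pre>
<p>Because all four operations take part in a single total order, at least one of the loads must come after the other thread&#8217;s store, so <code>countBothZero</code> returns 0 no matter how many trials are run. Weaken the operations to <code>memory_order_relaxed</code> and that guarantee disappears.</p>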
<h2>Further Details</h2>
<p>There are many other subtle details filling out the spectrum of memory models, but in my experience, they haven&#8217;t proved quite as interesting when writing lock-free code at the application level. There are things like control dependencies, causal consistency, and different memory types. Still, most discussions come back to the four main categories I&#8217;ve outlined here.</p>
<p>If you really want to nitpick the fine details of processor memory models, and you enjoy eating formal logic for breakfast, you can check out the <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">admirably detailed work</a> done at the University of Cambridge. Paul McKenney has written an <a href="http://lwn.net/Articles/470681/">accessible overview</a> of some of their work and its associated tools.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120930/weak-vs-strong-memory-models/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
