<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Preshing on Programming</title>
	<atom:link href="http://preshing.com/feed" rel="self" type="application/rss+xml" />
	<link>http://preshing.com</link>
	<description></description>
	<lastBuildDate>Thu, 17 May 2012 22:35:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Memory Reordering Caught in the Act</title>
		<link>http://preshing.com/20120515/memory-reordering-caught-in-the-act</link>
		<comments>http://preshing.com/20120515/memory-reordering-caught-in-the-act#comments</comments>
		<pubDate>Tue, 15 May 2012 10:41:21 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=3026</guid>
		<description><![CDATA[When writing lock-free code in C or C++, one must often take special care to enforce correct memory ordering. Otherwise, surprising things can happen. Intel lists several such surprises in Volume 3, Section 8.2.3 of their x86/64 Architecture Specification. Here&#8217;s &#8230; <a href="http://preshing.com/20120515/memory-reordering-caught-in-the-act">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When writing lock-free code in C or C++, one must often take special care to enforce correct memory ordering. Otherwise, surprising things can happen.</p>
<p>Intel lists several such surprises in Volume 3, Section 8.2.3 of their <a href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html">x86/64 Architecture Specification</a>. Here&#8217;s one of the simplest examples. Suppose you have two integers <code>X</code> and <code>Y</code> somewhere in memory, both initially 0. Two processors, running in parallel, execute the following machine code:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/05/marked-example2.png" alt="" title="" width="479" height="56" class="aligncenter size-full wp-image-3230" /></p>
<p>Don&#8217;t be thrown off by the use of assembly language in this example. It&#8217;s really the best way to illustrate CPU ordering. Each processor stores 1 into one of the integer variables, then loads the other integer into a register. (r1 and r2 are just placeholder names for actual x86 registers, such as <code>eax</code>.)</p>
<p><span id="more-3026"></span>Now, no matter which processor writes 1 to memory first, it&#8217;s natural to expect the <em>other</em> processor to read that value back, which means we should end up with either r1 = 1, r2 = 1, or perhaps both. But according to Intel&#8217;s specification, that won&#8217;t necessarily be the case. The specification says it&#8217;s legal for both r1 and r2 to equal 0 at the end of this example &#8212; a counterintuitive result, to say the least!</p>
<p>One way to understand this is that Intel x86/64 processors, like most processor families, are allowed to <strong>reorder</strong> the memory interactions of machine instructions according to certain rules, as long as it never changes the execution of a single-threaded program. In particular, each processor is allowed to delay the effect of a store past any load from a different location. As a result, it might end up as though the instructions had executed in this order:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/05/reordered.png" alt="" title="" width="264" height="87" class="aligncenter size-full wp-image-3083" /></p>
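<p>To see why the r1 = 0, r2 = 0 outcome requires reordering, here is a small single-threaded C++ sketch. It is not part of the sample program. It models the four memory operations as steps, tries every interleaving that keeps each processor&#8217;s store ahead of its own load, and collects the resulting (r1, r2) pairs. The names <code>Op</code>, <code>execute</code> and <code>inOrderOutcomes</code> are made up for this illustration.</p>

```cpp
#include <algorithm>
#include <set>
#include <utility>

// The four memory operations from the example (hypothetical names).
enum Op { STORE_X, LOAD_Y, STORE_Y, LOAD_X };

// Execute the four operations in the given global order, starting from
// X = Y = 0, and return the resulting (r1, r2).
std::pair<int, int> execute(const Op order[4])
{
    int X = 0, Y = 0, r1 = 0, r2 = 0;
    for (int i = 0; i < 4; i++)
    {
        switch (order[i])
        {
        case STORE_X: X = 1;  break;   // Processor 1 stores 1 to X
        case LOAD_Y:  r1 = Y; break;   // Processor 1 loads Y into r1
        case STORE_Y: Y = 1;  break;   // Processor 2 stores 1 to Y
        case LOAD_X:  r2 = X; break;   // Processor 2 loads X into r2
        }
    }
    return std::make_pair(r1, r2);
}

// Collect the outcomes of every interleaving in which each processor's
// store comes before its own load, i.e. with no reordering at all.
std::set<std::pair<int, int> > inOrderOutcomes()
{
    std::set<std::pair<int, int> > outcomes;
    Op order[4] = { STORE_X, LOAD_Y, STORE_Y, LOAD_X };   // Sorted by enum value
    do
    {
        const Op *storeX = std::find(order, order + 4, STORE_X);
        const Op *loadY  = std::find(order, order + 4, LOAD_Y);
        const Op *storeY = std::find(order, order + 4, STORE_Y);
        const Op *loadX  = std::find(order, order + 4, LOAD_X);
        if (storeX < loadY && storeY < loadX)             // Program order preserved
            outcomes.insert(execute(order));
    } while (std::next_permutation(order, order + 4));
    return outcomes;
}
```

<p>In this model, the in-order interleavings only ever produce (0, 1), (1, 0) and (1, 1). Passing the sequence <code>LOAD_Y, LOAD_X, STORE_X, STORE_Y</code> to <code>execute</code>, where both stores have been delayed past the loads, yields (0, 0).</p>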
<h2>Let&#8217;s Make It Happen</h2>
<p>It&#8217;s all well and good to be told this kind of thing <em>might</em> happen, but there&#8217;s nothing like seeing it with your own eyes. That&#8217;s why I&#8217;ve written a small sample program to show this type of reordering <em>actually happening</em>. You can download the source code <a href="http://preshing.com/files/ordering.zip">here</a>.</p>
<p>The sample comes in both a Win32 version and a POSIX version. It spawns two worker threads which repeat the above transaction indefinitely, while the main thread synchronizes their work and checks each result.</p>
<p>Here&#8217;s the source code for the first worker thread. <code>X</code>, <code>Y</code>, <code>r1</code> and <code>r2</code> are all globals, and POSIX semaphores are used to co-ordinate the beginning and end of each loop.</p>
<div class="cpp"><pre class="de1">sem_t beginSema1<span class="sy4">;</span>
sem_t endSema<span class="sy4">;</span>
&nbsp;
<span class="kw4">int</span> X, Y<span class="sy4">;</span>
<span class="kw4">int</span> r1, r2<span class="sy4">;</span>
&nbsp;
<span class="kw4">void</span> <span class="sy2">*</span>thread1Func<span class="br0">&#40;</span><span class="kw4">void</span> <span class="sy2">*</span>param<span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    MersenneTwister random<span class="br0">&#40;</span><span class="nu0">1</span><span class="br0">&#41;</span><span class="sy4">;</span>                <span class="co1">// Initialize random number generator</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="sy4">;;</span><span class="br0">&#41;</span>                                  <span class="co1">// Loop indefinitely</span>
    <span class="br0">&#123;</span>
        sem_wait<span class="br0">&#40;</span><span class="sy3">&amp;</span>beginSema1<span class="br0">&#41;</span><span class="sy4">;</span>                <span class="co1">// Wait for signal from main thread</span>
        <span class="kw1">while</span> <span class="br0">&#40;</span>random.<span class="me1">integer</span><span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="sy2">%</span> <span class="nu0">8</span> <span class="sy3">!</span><span class="sy1">=</span> <span class="nu0">0</span><span class="br0">&#41;</span> <span class="br0">&#123;</span><span class="br0">&#125;</span>  <span class="co1">// Add a short, random delay</span>
&nbsp;
        <span class="co1">// ----- THE TRANSACTION! -----</span>
        X <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span>
        asm <span class="kw4">volatile</span><span class="br0">&#40;</span><span class="st0">&quot;&quot;</span> <span class="sy4">:::</span> <span class="st0">&quot;memory&quot;</span><span class="br0">&#41;</span><span class="sy4">;</span>        <span class="co1">// Prevent compiler reordering</span>
        r1 <span class="sy1">=</span> Y<span class="sy4">;</span>
&nbsp;
        sem_post<span class="br0">&#40;</span><span class="sy3">&amp;</span>endSema<span class="br0">&#41;</span><span class="sy4">;</span>                   <span class="co1">// Notify transaction complete</span>
    <span class="br0">&#125;</span>
    <span class="kw1">return</span> <span class="kw2">NULL</span><span class="sy4">;</span>  <span class="co1">// Never returns</span>
<span class="br0">&#125;</span><span class="sy4">;</span></pre></div>
<p>A short, random delay is added before each transaction in order to stagger the timing of the thread. Remember, there are two worker threads, and we&#8217;re trying to get their instructions to overlap. The random delay is achieved using the same <code>MersenneTwister</code> implementation I&#8217;ve used in previous posts, such as when <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">measuring lock contention</a> and when <a href="http://preshing.com/20120305/implementing-a-recursive-mutex" title="Implementing a Recursive Mutex">validating that the recursive Benaphore worked</a>.</p>
<p>Don&#8217;t be spooked by the presence of the <code>asm volatile</code> line in the above code listing. This is just a directive <a href="http://en.wikipedia.org/wiki/Memory_ordering#Compiler_memory_barrier">telling the GCC compiler not to rearrange the store and the load</a> when generating machine code, just in case it starts to get any funny ideas during optimization. We can verify this by checking the assembly code listing, as seen below. As expected, the store and the load occur in the desired order. The instruction after that writes the resulting register <code>eax</code> back to the global variable <code>r1</code>.</p>
<pre>
$ gcc -O2 -c -S -masm=intel ordering.cpp
$ cat ordering.s
	...
	<span class="highlight">mov	DWORD PTR _X, 1</span>
	<span class="highlight">mov	eax, DWORD PTR _Y</span>
	mov	DWORD PTR _r1, eax
	...
</pre>
<p>The main thread source code is shown below. It performs all the administrative work. After initialization, it loops indefinitely, resetting <code>X</code> and <code>Y</code> back to 0 before kicking off the worker threads on each iteration.</p>
<p>Pay particular attention to the way all writes to shared memory occur before <code>sem_post</code>, and all reads from shared memory occur after <code>sem_wait</code>. The same rules are followed in the worker threads when communicating with the main thread. Semaphores give us <strong>acquire</strong> and <strong>release semantics</strong> on every platform. That means we are guaranteed that the initial values of <code>X = 0</code> and <code>Y = 0</code> will propagate completely to the worker threads, and that the resulting values of <code>r1</code> and <code>r2</code> will propagate fully back here. In other words, the semaphores prevent memory reordering issues in the framework, allowing us to focus entirely on the experiment itself!</p>
<div class="cpp"><pre class="de1"><span class="kw4">int</span> main<span class="br0">&#40;</span><span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    <span class="co1">// Initialize the semaphores</span>
    sem_init<span class="br0">&#40;</span><span class="sy3">&amp;</span>beginSema1, <span class="nu0">0</span>, <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
    sem_init<span class="br0">&#40;</span><span class="sy3">&amp;</span>beginSema2, <span class="nu0">0</span>, <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
    sem_init<span class="br0">&#40;</span><span class="sy3">&amp;</span>endSema, <span class="nu0">0</span>, <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    <span class="co1">// Spawn the threads</span>
    pthread_t thread1, thread2<span class="sy4">;</span>
    pthread_create<span class="br0">&#40;</span><span class="sy3">&amp;</span>thread1, <span class="kw2">NULL</span>, thread1Func, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
    pthread_create<span class="br0">&#40;</span><span class="sy3">&amp;</span>thread2, <span class="kw2">NULL</span>, thread2Func, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    <span class="co1">// Repeat the experiment ad infinitum</span>
    <span class="kw4">int</span> detected <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> iterations <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span> <span class="sy4">;</span> iterations<span class="sy2">++</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        <span class="co1">// Reset X and Y</span>
        X <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        Y <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        <span class="co1">// Signal both threads</span>
        sem_post<span class="br0">&#40;</span><span class="sy3">&amp;</span>beginSema1<span class="br0">&#41;</span><span class="sy4">;</span>
        sem_post<span class="br0">&#40;</span><span class="sy3">&amp;</span>beginSema2<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="co1">// Wait for both threads</span>
        sem_wait<span class="br0">&#40;</span><span class="sy3">&amp;</span>endSema<span class="br0">&#41;</span><span class="sy4">;</span>
        sem_wait<span class="br0">&#40;</span><span class="sy3">&amp;</span>endSema<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="co1">// Check if there was a simultaneous reorder</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>r1 <span class="sy1">==</span> <span class="nu0">0</span> <span class="sy3">&amp;&amp;</span> r2 <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
        <span class="br0">&#123;</span>
            detected<span class="sy2">++</span><span class="sy4">;</span>
            <span class="kw3">printf</span><span class="br0">&#40;</span><span class="st0">&quot;%d reorders detected after %d iterations<span class="es1">\n</span>&quot;</span>, detected, iterations<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
    <span class="br0">&#125;</span>
    <span class="kw1">return</span> <span class="nu0">0</span><span class="sy4">;</span>  <span class="co1">// Never returns</span>
<span class="br0">&#125;</span></pre></div>
<p>Finally, the moment of truth. Here&#8217;s some sample output while running in <a href="http://www.cygwin.com/">Cygwin</a> on an Intel Xeon W3520.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/05/cygwin-output.png" alt="" title="" width="490" height="292" class="aligncenter size-full wp-image-3037" /></p>
<p>And there you have it! During this run, a memory reordering was detected approximately once every <strong>6600</strong> iterations. When I tested in Ubuntu on a Core 2 Duo E6300, the occurrences were even more rare. One begins to appreciate how subtle timing bugs can creep undetected into lock-free code.</p>
<p>Now, suppose you wanted to eliminate those reorderings. There are at least two ways to do it. One way is to set thread affinities so that both worker threads run exclusively on the same CPU core. There&#8217;s no portable way to set affinities with Pthreads, but on Linux it can be accomplished as follows:</p>
<div class="cpp"><pre class="de1">    cpu_set_t cpus<span class="sy4">;</span>
    CPU_ZERO<span class="br0">&#40;</span><span class="sy3">&amp;</span>cpus<span class="br0">&#41;</span><span class="sy4">;</span>
    CPU_SET<span class="br0">&#40;</span><span class="nu0">0</span>, <span class="sy3">&amp;</span>cpus<span class="br0">&#41;</span><span class="sy4">;</span>
    pthread_setaffinity_np<span class="br0">&#40;</span>thread1, <span class="kw3">sizeof</span><span class="br0">&#40;</span>cpu_set_t<span class="br0">&#41;</span>, <span class="sy3">&amp;</span>cpus<span class="br0">&#41;</span><span class="sy4">;</span>
    pthread_setaffinity_np<span class="br0">&#40;</span>thread2, <span class="kw3">sizeof</span><span class="br0">&#40;</span>cpu_set_t<span class="br0">&#41;</span>, <span class="sy3">&amp;</span>cpus<span class="br0">&#41;</span><span class="sy4">;</span></pre></div>
<p>After this change, the reordering disappears. That&#8217;s because a single processor never sees its own operations out of order, even when threads are pre-empted and rescheduled at arbitrary times. Of course, by locking both threads to a single core, we&#8217;ve left the other cores unused.</p>
<p>On a related note, I compiled and ran this sample on PlayStation 3, and no memory reordering was detected. This suggests that the <a href="http://en.wikipedia.org/wiki/Cell_(microprocessor)#Power_Processor_Element_.28PPE.29">two hardware threads</a> inside the PPU effectively act as a single processor, with very fine-grained hardware scheduling.</p>
<h2>Preventing It With a StoreLoad Barrier</h2>
<p>Another way to prevent memory reordering in this sample is to introduce a CPU barrier between the two instructions. Here, we&#8217;d like to prevent the reordering of a store followed by a load. In common barrier parlance, we need a <strong>StoreLoad</strong> barrier.</p>
<p>On x86/64 processors, there is no specific instruction which acts <em>only</em> as a StoreLoad barrier, but there are several instructions which do that and more. The <code>mfence</code> instruction is a full memory barrier, which prevents memory reordering of any kind. In GCC, it can be implemented as follows:</p>
<pre>
    for (;;)                                  // Loop indefinitely
    {
        sem_wait(&#038;beginSema1);                // Wait for signal from main thread
        while (random.integer() % 8 != 0) {}  // Add a short, random delay

        // ----- THE TRANSACTION! -----
        X = 1;
        <span class="highlight">asm volatile("mfence" ::: "memory");</span>  // Prevent CPU reordering
        r1 = Y;

        sem_post(&#038;endSema);                   // Notify transaction complete
    }
</pre>
<p>Again, you can verify its presence by looking at the assembly code listing.</p>
<pre>
	...
	mov	DWORD PTR _X, 1
	<span class="highlight">mfence</span>
	mov	eax, DWORD PTR _Y
	mov	DWORD PTR _r1, eax
	...
</pre>
<p>With this modification, the memory reordering disappears, and we&#8217;ve still allowed both threads to run on separate CPU cores.</p>
<h2>Similar Instructions and Different Platforms</h2>
<p>Interestingly, <code>mfence</code> isn&#8217;t the only instruction which acts as a full memory barrier on x86/64. On these processors, any locked instruction, such as <code>xchg</code>, also acts as a full memory barrier &#8212; provided you don&#8217;t use SSE instructions or write-combined memory, which this sample doesn&#8217;t. In fact, the Microsoft C++ compiler generates <code>xchg</code> when you use the <code><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684208(v=vs.85).aspx">MemoryBarrier</a></code> intrinsic, at least in Visual Studio 2008.</p>
<p>The <code>mfence</code> instruction is specific to x86/64. If you want to make the code more portable, you could wrap this instruction in a preprocessor macro. The Linux kernel has wrapped it in a macro named <code>smp_mb</code>, along with related macros such as <code>smp_rmb</code> and <code>smp_wmb</code>, and provided <a href="http://lxr.free-electrons.com/ident?i=smp_mb">alternate implementations on different architectures</a>. For example, on PowerPC, <code>smp_mb</code> is implemented as <code>sync</code>.</p>
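<p>As a rough sketch of that idea, and not the Linux kernel&#8217;s actual code, a wrapper might look like the following. The macro name <code>FULL_MEMORY_BARRIER</code> and the helper <code>storeThenLoad</code> are hypothetical. The generic fallback relies on GCC&#8217;s <code>__sync_synchronize</code> builtin, which issues a full barrier for the target architecture.</p>

```cpp
#if defined(_MSC_VER)
    #include <windows.h>
    #define FULL_MEMORY_BARRIER() MemoryBarrier()
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // x86/64 with GCC-style inline assembly: emit mfence directly
    #define FULL_MEMORY_BARRIER() asm volatile("mfence" ::: "memory")
#elif defined(__GNUC__)
    // Other architectures: let the compiler pick the barrier instruction
    #define FULL_MEMORY_BARRIER() __sync_synchronize()
#else
    #error "No full memory barrier defined for this compiler"
#endif

// Store 1 to *storeTo, then load *loadFrom, with a full barrier in between
// so that neither the compiler nor the CPU can move the load ahead of the store.
inline int storeThenLoad(volatile int *storeTo, const volatile int *loadFrom)
{
    *storeTo = 1;
    FULL_MEMORY_BARRIER();
    return *loadFrom;
}
```

<p>With a helper like this, the transaction in the first worker thread would become <code>r1 = storeThenLoad(&amp;X, &amp;Y);</code>.</p>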
<p>All these different CPU families, each having unique instructions to enforce memory ordering, with each compiler exposing them through different intrinsics, and each cross-platform project implementing its own portability layer&#8230; none of this helps simplify lock-free programming! This is partially why the <a href="http://www.open-std.org/JTC1/sc22/wg21/docs/papers/2007/n2427.html">C++11 atomic library</a> was recently introduced. It&#8217;s an attempt to standardize things, and make it easier to write portable lock-free code.</p>
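<p>For instance, here is a minimal sketch, assuming a C++11 compiler, of how the same transaction could be written with <code>std::atomic</code>. It is not part of the downloadable sample, and <code>countReorders</code> is a name invented here. Loads and stores on <code>std::atomic</code> default to sequentially consistent ordering, which rules out the r1 = 0, r2 = 0 outcome without any platform-specific barrier in the source.</p>

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X(0), Y(0);
int r1, r2;

// Repeat the two-thread transaction and count how often both loads see 0.
int countReorders(int iterations)
{
    int detected = 0;
    for (int i = 0; i < iterations; i++)
    {
        X.store(0);
        Y.store(0);
        // store() and load() default to std::memory_order_seq_cst
        std::thread thread1([] { X.store(1); r1 = Y.load(); });
        std::thread thread2([] { Y.store(1); r2 = X.load(); });
        thread1.join();   // join() makes r1 and r2 visible to this thread
        thread2.join();
        if (r1 == 0 && r2 == 0)
            detected++;
    }
    return detected;
}
```

<p>Spawning a fresh pair of threads on every iteration is much slower than the semaphore scheme used earlier, and makes the two transactions less likely to overlap, so treat this as an illustration of the atomic operations and not as a replacement for the experiment.</p>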
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120515/memory-reordering-caught-in-the-act/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>How to Remove Camera Shake from iPhone 4S Videos</title>
		<link>http://preshing.com/20120415/how-to-remove-camera-shake-from-iphone-4s-videos</link>
		<comments>http://preshing.com/20120415/how-to-remove-camera-shake-from-iphone-4s-videos#comments</comments>
		<pubDate>Mon, 16 Apr 2012 00:57:23 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2886</guid>
		<description><![CDATA[Cell phone videos are notoriously shaky. It&#8217;s always been difficult to get a steady picture. So when Apple introduced a video stabilization feature on the iPhone 4S, I was really interested. I knew that the state of the art in &#8230; <a href="http://preshing.com/20120415/how-to-remove-camera-shake-from-iphone-4s-videos">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Cell phone videos are notoriously shaky. It&#8217;s always been difficult to get a steady picture. So when Apple introduced a <a href="http://www.apple.com/iphone/built-in-apps/hd-video-recording.html">video stabilization</a> feature on the iPhone 4S, I was really interested. I knew that the state of the art in digital video stabilization was capable of some <a href="http://www.youtube.com/watch?v=GaPNf2Rk4qQ">amazing results</a>. Finally, a few weeks back, I upgraded my phone to an iPhone 4S and took it on a two-week trip to Costa Rica.</p>
<p>By the end of the trip, I had shot hundreds of photos and 80 video clips &#8212; about 20 minutes of video in total. The iPhone 4S has a great camera, but I quickly learned that its built-in video stabilization feature is, well, not state of the art. It can perform some modest stabilization in <a href="http://www.tuaw.com/2011/10/14/iphone-4s-video-image-stabilization-in-action/">cases where you attempt to hold the camera still and point it in a single direction</a>. But other times, it doesn&#8217;t seem to help at all.</p>
<p>That&#8217;s OK, because with a little patience, and a free utility called <a href="http://www.guthspot.se/video/deshaker.htm">Deshaker</a> by Gunnar Thalin, you can remove the camera shake on your PC when you get home. For example, here&#8217;s some footage I shot of a coati at the top of Cerro Chato. The first part of the video shows the original, shaky iPhone 4S video &#8212; I blame the tasty Costa Rican coffee &#8212; and the second part shows the result after running it through Deshaker. While the result may not be flawless, I find it to be a significant improvement:</p>
<p><span id="more-2886"></span></p>
<p><iframe src="http://player.vimeo.com/video/40415062" width="620" height="349" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p>Deshaker is a plugin for <a href="http://www.virtualdub.org/">VirtualDub</a>, a popular video editor by Avery Lee. The <a href="http://www.guthspot.se/video/deshaker.htm">Deshaker documentation</a> contains instructions on how to use it within VirtualDub, and there are additional guides available online. However, I had a lot of videos to stabilize, and following such guides would have been very time-consuming, as each video requires several manual steps. Not to mention I had trouble convincing VirtualDub to open the <code>.mov</code> files saved by the iPhone.</p>
<p>Therefore, I decided to automate the process by writing a Python script, which you can download <a href="http://preshing.com/files/stabilize.py">here</a>. This script makes use of an assortment of free tools for Windows: <a href="http://ffmpeg.org/">ffmpeg</a>, <a href="http://avisynth.org/mediawiki/Main_Page">AviSynth</a>, Deshaker and a few others. These tools are definitely for power users &#8212; ffmpeg even calls itself <a href="http://ffmpeg.org/download.html">a very experimental and developer-driven project</a>. They&#8217;re great tools, but I must admit: Every time I use them, I feel a little bit of dread. I know I&#8217;m about to lose a big chunk of my weekend googling for information, trying different combinations, experimenting with settings and working around bugs. And once I&#8217;ve finished, I usually forget everything I learned.</p>
<p>So this time, I decided to document the process. In this post, I&#8217;ll show you how to install the software needed to run this script yourself. I&#8217;ll also describe each step performed by the script, and explain why I chose to implement it the way I did. The script is currently customized for iPhone 4S videos, but I hope to provide enough information here so that you can adapt it to work with other video sources as well.</p>
<h2>Required Software</h2>
<p>You&#8217;ll need to install all of the following free software, if you haven&#8217;t already. It all works on any version of Windows (XP or higher).</p>
<h3>Install ffmpeg</h3>
<p><a href="http://ffmpeg.org/">ffmpeg</a> is an open-source, cross-platform command-line tool which knows how to encode and decode a ton of audio/video formats. We&#8217;ll use it to encode our stabilized video as <code>.mp4</code>. I chose MP4 as the target format because Windows, Mac and Ubuntu all seem to recognize it without too much fuss.</p>
<p>I installed the 32-bit Static build of version <code>git-41a097a</code> (April 3) from <a href="http://ffmpeg.zeranoe.com/builds/">this page</a>. The build is distributed as a 7-zip file, so you&#8217;ll need to install the <a href="http://www.7-zip.org/download.html">7-zip extractor</a> to open it. Installation is a simple matter of extracting the complete contents to a directory somewhere. I extracted it to <code>c:\util\ffmpeg</code>.</p>
<p>A small warning: In the past, I&#8217;ve found that the command-line arguments of ffmpeg may change from one version to the next. Therefore, if you search for examples of ffmpeg command lines, they may not work with the version you have. Hopefully the arguments won&#8217;t change too much in future versions, so that this guide will remain intact.</p>
<h3>Install AVISynth</h3>
<p><a href="http://avisynth.org/mediawiki/Main_Page">AVISynth</a> is a frameserver, designed to provide audio/video input to other tools. It&#8217;s based on a neat trick: Once you install AVISynth, Windows will believe that any file with the extension <code>.avs</code> is a video file. In reality, an <code>.avs</code> file is just a text file you write in a custom scripting language, telling AVISynth how to render the video. We&#8217;ll use AVISynth to open our <code>.mov</code> files, rotate them, pad and trim extra frames, resize them, and most importantly &#8212; to run Deshaker.</p>
<p>AVISynth comes with a regular Windows installer. I installed the 32-bit version 2.5.8 from <a href="http://sourceforge.net/projects/avisynth2/files/AviSynth%202.5/">here</a>. The 32-bit version is required if you want to use 32-bit plugins, which we do. When you install it, pay attention to the exact installation path, because you&#8217;ll need to remember it later.</p>
<h3>Install QuickTime</h3>
<p>Install QuickTime from Apple. Odds are, you already have it, but in case you don&#8217;t, here&#8217;s the <a href="http://www.apple.com/quicktime/download/">download link</a>. It&#8217;s required for QTSource to work.</p>
<h3>Install QTSource</h3>
<p><a href="http://forum.doom9.org/showthread.php?t=104293">QTSource</a> is a plugin for AVISynth. It gives AVISynth the ability to open <code>.mov</code> files, but it requires you to have QuickTime installed. There are alternative ways to open <code>.mov</code> files in AVISynth, but I couldn&#8217;t get any of them to work reliably with the iPhone 4S videos on my computer. So QTSource it is.</p>
<p>I used QTSource version 0.1.4, which you can currently download from the author&#8217;s <a href="http://tateu.net/software/">download page</a> (<a href="http://tateu.net/software/dl.php?f=QTSource">direct link</a>). Simply open the zip file and extract <code>QTSource.dll</code> to your AVISynth plugins folder. This is the only file needed. In my case, I extracted it to <code>C:\Program Files (x86)\AviSynth 2.5\plugins</code>.</p>
<h3>Install Deshaker</h3>
<p>As I mentioned, Deshaker is a plugin for VirtualDub &#8212; but you don&#8217;t actually need VirtualDub to use it. It turns out that AVISynth can use any plugin written for VirtualDub. I downloaded Deshaker 3.0 from the <a href="http://www.guthspot.se/video/deshaker.htm">author&#8217;s page</a> (<a href="http://www.guthspot.se/video/files/Deshaker30.zip">direct link</a>). It&#8217;s a zip file, so simply open it and extract <code>Deshaker.vdf</code> to the folder of your choice. I happen to have VirtualDub installed in <code>C:\Util\VirtualDub</code>, so I&#8217;ve extracted it to <code>C:\Util\VirtualDub\plugins</code>. But any folder will do.</p>
<h3>Install MediaInfo</h3>
<p><a href="http://mediainfo.sourceforge.net/en">MediaInfo</a> is a small utility to extract metadata from video files. We&#8217;ll use it to determine the rotation of each <code>.mov</code> file. It&#8217;s especially important to know the rotation because of the rolling shutter, as I&#8217;ll explain later. We want the CLI (Command Line Interface) version of MediaInfo, so that Python can interact with it. I installed the 32-bit version 0.7.56 from <a href="http://mediainfo.sourceforge.net/en/Download/Windows">this page</a>. It&#8217;s distributed as a zip file &#8212; I extracted mine to <code>C:\Util\MediaInfo</code>.</p>
<h3>Install Python</h3>
<p>Of course, you need <a href="http://www.python.org/">Python</a>. I developed the script in Python 2.7, but it might work in earlier versions. I prefer installing <a href="http://www.activestate.com/activepython/downloads">ActiveState Python for Windows</a>, because it comes with PythonWin, but the <a href="http://www.python.org/download/releases/2.7.3/">regular installer</a> will work too.</p>
<h2>How to Run It</h2>
<p>First, copy the iPhone videos to a folder on your PC. Then save the Python script somewhere; for example, in the same folder as the videos to convert. Open the Python script in a text editor and modify the hardcoded path names to correctly reflect the installation paths on your machine:</p>
<div class="python"><pre class="de1">MEDIAINFO_FOLDER <span class="sy0">=</span> r<span class="st0">'C:<span class="es0">\U</span>til<span class="es0">\M</span>ediaInfo'</span>
DESHAKER_FOLDER <span class="sy0">=</span> r<span class="st0">'C:<span class="es0">\U</span>til<span class="es0">\V</span>irtualDub<span class="es0">\P</span>lugins'</span>
FFMPEG_FOLDER <span class="sy0">=</span> r<span class="st0">'C:<span class="es0">\U</span>til<span class="es0">\f</span>fmpeg<span class="es0">\b</span>in'</span></pre></div>
<p>Next, open a command prompt, navigate to the folder containing the videos you want to convert, and run any command similar to the following. If the Python script is in a different folder, make sure to specify its full path. You can also specify the name of a single video file, or any wildcard pattern accepted by Python&#8217;s <a href="http://docs.python.org/library/glob.html">glob</a> module, such as <code>*.mov</code>.</p>
<pre>python stabilize.py *.mov</pre>
<p>While the script is running, the Command Prompt title bar will change to reflect encoding progress. I recommend resizing the window to be 100 columns wide so that ffmpeg&#8217;s progress messages fit completely on one line. If they extend to two lines, your Command Prompt window will fill up with spam. To change the width, right-click on the window and choose Properties.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/04/command-prompt.jpg" alt="" title="" width="478" height="226" class="aligncenter size-full wp-image-2979" /></p>
<p>The script saves the processed video back to the same folder as the <code>.mov</code> files, with the extension <code>_stabilized.mp4</code>. During the conversion process, it also creates a bunch of temporary files in the current working directory, which are deleted afterwards.</p>
<p>Finally, be warned that the conversion process is very slow. On my underpowered Core 2 Duo E6300, every minute of video takes about an hour to process.</p>
<h2>What it Does</h2>
<p>Here are all the steps performed by the script.</p>
<h3>Determining the Video Rotation</h3>
<p>The iPhone 4S knows which way you are holding the phone while you shoot the video, but it handles it in a funny way: It always records the video from the point of view of a non-rotated phone, and stores some metadata in the file to remember which way the phone was held. On Windows, the QuickTime video player recognizes this metadata, and will play the video back with the correct rotation; but Media Player Classic, an alternative player installed by the <a href="http://www.codecguide.com/download_kl.htm">K-Lite Codec Pack</a>, does not.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/04/rotation.jpg" alt="" title="" width="418" height="272" class="aligncenter size-full wp-image-2951" /></p>
<p>By the way, I recommend against filming vertical videos. They&#8217;re terrible for sharing online. In any case, you can view the rotation metadata by running MediaInfo on the video:</p>
<pre>c:\util\mediainfo\mediainfo img_0128.mov</pre>
<p>If the phone was rotated during recording, it will be listed in the output as either 90, 180 or 270.</p>
<pre>
...
Bit rate                                 : 24.0 Mbps
Width                                    : 1 920 pixels
Height                                   : 1 080 pixels
Display aspect ratio                     : 16:9
<span class="highlight">Rotation                                 : 90°         </span>
Frame rate mode                          : Variable
Frame rate                               : 29.904 fps
...
</pre>
<p>The Python script automatically runs MediaInfo and extracts the rotation info from the output.</p>
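<p>The actual script is written in Python and isn&#8217;t reproduced here, but the extraction step is easy to picture. Here&#8217;s a minimal C++ sketch of the idea, assuming the MediaInfo output has already been captured into a string. The function name <code>parseRotation</code> is hypothetical, and returning 0 when no <code>Rotation</code> line is present is my own convention.</p>

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: returns the rotation in degrees reported by MediaInfo,
// or 0 when the output contains no "Rotation" line.
int parseRotation(const std::string& mediaInfoOutput)
{
    std::istringstream lines(mediaInfoOutput);
    std::string line;
    while (std::getline(lines, line))
    {
        // Field names start at the beginning of the line, followed by padding and a colon.
        if (line.compare(0, 8, "Rotation") != 0)
            continue;
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        // strtol skips leading spaces and stops at the first non-digit (the degree sign).
        return static_cast<int>(std::strtol(line.c_str() + colon + 1, nullptr, 10));
    }
    return 0;
}
```

<p>Passing the sample output shown above to <code>parseRotation</code> would yield 90.</p>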
<h3>Generate a Log File Containing Camera Movement Information</h3>
<p>The Deshaker plugin works in two passes. In the first pass, it determines camera movement information on a frame-by-frame basis. In the second pass, it corrects each video frame to produce a smoother video. Normally, you would perform both passes entirely within VirtualDub, but in our case, the Python script automates both passes using temporary AVISynth scripts.</p>
<p>The first pass is taken care of by <code>pass1.avs</code>, a temporary AVISynth script with contents similar to the following:</p>
<div class="javascript"><pre class="de1">LoadVirtualDubPlugin<span class="br0">&#40;</span><span class="st0">&quot;C:<span class="es0">\U</span>til<span class="es0">\V</span>irtualDub<span class="es0">\P</span>lugins<span class="es0">\D</span>eshaker.vdf&quot;</span><span class="sy0">,</span> <span class="st0">&quot;Deshaker&quot;</span><span class="br0">&#41;</span>
QTInput<span class="br0">&#40;</span><span class="st0">&quot;D:<span class="es0">\P</span>hotos<span class="es0">\C</span>osta Rica 2012<span class="es0">\V</span>ideos<span class="es0">\i</span>mg_0128.mov&quot;</span><span class="sy0">,</span> color<span class="sy0">=</span><span class="nu0">1</span><span class="br0">&#41;</span>
TurnLeft<span class="br0">&#40;</span><span class="br0">&#41;</span>
Deshaker<span class="br0">&#40;</span><span class="st0">&quot;18|1|30|4|1|0|1|0|640|480|1|1|650|650|1000|650|4|0|6|2|8|30|300|4|
D:<span class="es0">\P</span>hotos<span class="es0">\C</span>osta Rica 2012<span class="es0">\V</span>ideos<span class="es0">\i</span>mg_0128_deshaker.log|0|0|0|0|0|0|0|0|0|0|0|
0|0|1|12|12|10|5|1|1|10|10|0|0|1|0|1|0|0|10|1000|1|88|1|1|20|400|90|20|1&quot;</span><span class="br0">&#41;</span></pre></div>
<ul>
<li><code><a href="http://avisynth.org/mediawiki/LoadVirtualdubPlugin">LoadVirtualDubPlugin()</a></code> is required to make the Deshaker plugin work.</li>
<li><code>QTInput()</code> is the function which opens the .mov file using the QTSource plugin. The <code>color=1</code> option tells the plugin to return video frames in the RGB32 format. This is the only format which the Deshaker plugin accepts. (We could have converted the pixel format using a separate AVISynth function, <code><a href="http://avisynth.org/mediawiki/ConvertToRGB">ConvertToRGB32</a></code>, but the <code>color=1</code> option is faster.)</li>
<li>Since <code>QTInput()</code> uses QuickTime, which knows about the rotation metadata mentioned above, it will automatically rotate the video correctly according to the way the phone was held. We don&#8217;t actually want this; we want to process the video from the non-rotated camera&#8217;s point of view, for reasons explained below. That&#8217;s what the <code><a href="http://avisynth.org/mediawiki/TurnLeft">TurnLeft()</a></code> call is for in this case. The exact function may differ depending on the rotation.</li>
<li>Finally, we pass each video frame to <code>Deshaker()</code>, which will determine the camera movement information for each frame. In this case, the information gets saved to the temporary log file <code>D:\Photos\Costa Rica 2012\Videos\img_0128_deshaker.log</code>.</li>
</ul>
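<p>Since the exact turn function depends on the rotation, the choice could be sketched like this. Only the 90-degree case (<code>TurnLeft()</code>) comes from the script above; the 180 and 270 cases are my assumptions by symmetry, and <code>undoRotationCall</code> is a hypothetical name.</p>

```cpp
#include <string>

// Hypothetical helper: the AVISynth call that undoes the rotation QuickTime applied.
// Only the 90 -> TurnLeft() case appears in the article's script; the rest are assumptions.
std::string undoRotationCall(int rotation)
{
    switch (rotation)
    {
    case 90:  return "TurnLeft()";
    case 180: return "Turn180()";    // assumption
    case 270: return "TurnRight()";  // assumption
    default:  return "";             // no rotation metadata: nothing to undo
    }
}
```

<p>An empty result means the script can omit the turn step entirely.</p>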
<p>To actually run the video through the AVISynth script, and generate a log file, we use the following command line. I won&#8217;t describe what all the options do, but you can always look them up in the <a href="http://ffmpeg.org/ffmpeg.html">ffmpeg documentation</a>.</p>
<pre>C:\Util\ffmpeg\bin\ffmpeg.exe -y -i pass1.avs -vcodec copy temp.avi</pre>
<p>In the AVISynth script, you&#8217;ll notice a huge option string passed to <code>Deshaker()</code>. According to the author, Deshaker accepts its arguments as a long string because there is a limit to the number of separate arguments that can be passed from VirtualDub. There is a complete reference in the <a href="http://www.guthspot.se/video/deshaker.htm">documentation</a>. I&#8217;ve customized a few options in particular, which are highlighted below. Personally, I prefer these values over the defaults; your mileage may vary.</p>
<pre>
"18|<span class="highlight">1</span>|30|4|1|0|1|0|640|480|<span class="highlight">1|1</span>|<span class="highlight">650|650|1000|650</span>|4|<span class="highlight">0</span>|<span class="highlight">6</span>|2|8|30|300|4|
D:\Photos\Costa Rica 2012\Videos\img_0128_deshaker.log|0|0|0|0|0|0|0|0|0|0|0|
0|0|1|<span class="highlight">12|12|10|5</span>|<span class="highlight">1|1</span>|<span class="highlight">10|10</span>|0|0|<span class="highlight">1</span>|0|1|0|0|10|1000|1|88|1|1|20|<span class="highlight">400</span>|<span class="highlight">90</span>|20|1"
</pre>
<table class="grid">
<thead>
<tr>
<th>Option&nbsp;#</th>
<th>Description</th>
<th>Value</th>
</tr>
</thead>
<tr>
<td>2</td>
<td>Pass number</td>
<td>Set to <code><span class="highlight">1</span></code> on the first pass, <code>2</code> on the second.</td>
</tr>
<tr>
<td>11&nbsp;-&nbsp;12</td>
<td>Scale and Use pixels</td>
<td><code><span class="highlight">1|1</span></code> tells Deshaker to downsize each video frame to half-resolution and use every pixel to perform motion estimation. We could also tell it to work with full-resolution video frames, but it would run even more slowly.</td>
</tr>
<tr>
<td>13&nbsp;-&nbsp;16</td>
<td>Motion smoothness</td>
<td><code><span class="highlight">650|650|1000|650</span></code> The amount of smoothing to perform along each of four axes of motion (horizontal, vertical, rotate and zoom). I&#8217;ve lowered horizontal and vertical from their defaults of 1000. If you leave these values too high, and your video contains a lot of sudden pans, Deshaker tends to zoom the camera too much in the processed video.</td>
</tr>
<tr>
<td>18</td>
<td>Video output</td>
<td><code><span class="highlight">0</span></code> tells Deshaker to output an empty 8×8-pixel video during the first pass. This empty video is written to temp.avi in the ffmpeg command line above, and deleted later by the Python script. If we left this option at its default, <code>1</code>, Deshaker would output video frames with motion vectors superimposed, as seen below. This is what you normally see during the first pass when running Deshaker from VirtualDub.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/04/motion-vectors.jpg" alt="" title="" width="221" height="120" class="aligncenter size-full wp-image-2956" /></td>
</tr>
<tr>
<td>19</td>
<td>Edge compensation</td>
<td><code><span class="highlight">6</span></code> tells Deshaker to zoom the video frame as much as necessary to avoid blank space on the side of the frame. Option 63 tells it to dial it back a little, but overall, I prefer to minimize the amount of blank space.</td>
</tr>
<tr>
<td>40&nbsp;-&nbsp;43</td>
<td>Max. correction limits</td>
<td><code><span class="highlight">12|12|10|5</span></code> sets the maximum correction limits. I increased the rotation threshold, but lowered the other values. Don&#8217;t set these too low, or larger shakes will remain in the processed video.</td>
</tr>
<tr>
<td>44&nbsp;-&nbsp;45</td>
<td>Fill in borders</td>
<td><code><span class="highlight">1|1</span></code>  Use previous and future frames to fill in borders.</td>
</tr>
<tr>
<td>46&nbsp;-&nbsp;47</td>
<td>Previous and Future frames</td>
<td><code><span class="highlight">10|10</span></code> Allow Deshaker to use the previous 10 and future 10 frames to fill in borders. I lowered these from the default value of 30, as it runs much faster without making a noticeable quality difference.</td>
</tr>
<tr>
<td>50</td>
<td>Camera has a rolling shutter</td>
<td><code><span class="highlight">1</span></code> Yes. The iPhone has a rolling shutter which has a tendency to distort the video when the camera is panned rapidly. An example is shown in the video below. Deshaker is able to reduce the distortion effect when this option is enabled, but only if the video is oriented from the point of view of a non-rotated phone.</p>
<p><span style="text-align:center; display: block;"><a href="http://preshing.com/20120415/how-to-remove-camera-shake-from-iphone-4s-videos"><img src="http://img.youtube.com/vi/OVtihUIkqBM/2.jpg" alt="" /></a></span>
</td>
</tr>
<tr>
<td>62</td>
<td>Adaptive zoom smoothness</td>
<td><code><span class="highlight">400</span></code> I lowered this setting from the default of <code>5000</code> to allow Deshaker&#8217;s adaptive zoom to kick in and wear off more quickly.</td>
</tr>
<tr>
<td>63</td>
<td>Adaptive zoom amount</td>
<td>By setting this to <code><span class="highlight">90</span></code>, we allow a maximum 10% of the borders to appear during adaptive zoom.  </td>
</tr>
</table>
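<p>The option numbers in this table are 1-based positions in the pipe-delimited string. Here&#8217;s a rough sketch of how a single option could be overridden programmatically, for example to switch the pass number. The helper names <code>splitOptions</code> and <code>withOption</code> are hypothetical, and the sketch assumes no option value contains a <code>|</code> character.</p>

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a Deshaker option string on '|'.
// Note: std::getline drops a trailing empty field, which the strings above never have.
std::vector<std::string> splitOptions(const std::string& options)
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(options);
    while (std::getline(stream, field, '|'))
        fields.push_back(field);
    return fields;
}

// Return a copy of `options` with the 1-based option `number` replaced by `value`.
// The input is returned unchanged if `number` is out of range.
std::string withOption(const std::string& options, std::size_t number, const std::string& value)
{
    std::vector<std::string> fields = splitOptions(options);
    if (number == 0 || number > fields.size())
        return options;
    fields[number - 1] = value;

    std::string joined;
    for (std::size_t i = 0; i < fields.size(); ++i)
    {
        if (i > 0)
            joined += '|';
        joined += fields[i];
    }
    return joined;
}
```

<p>With this, <code>withOption(options, 2, "2")</code> would produce the second-pass string from the first-pass one.</p>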
<h3>Correct Each Frame and Encode a New Video</h3>
<p>The second pass in the process of running Deshaker is taken care of by <code>pass2.avs</code>, a temporary AVISynth script with contents similar to the following:</p>
<div class="javascript"><pre class="de1">LoadVirtualDubPlugin<span class="br0">&#40;</span><span class="st0">&quot;C:<span class="es0">\U</span>til<span class="es0">\V</span>irtualDub<span class="es0">\P</span>lugins<span class="es0">\D</span>eshaker.vdf&quot;</span><span class="sy0">,</span> <span class="st0">&quot;Deshaker&quot;</span><span class="br0">&#41;</span>
QTInput<span class="br0">&#40;</span><span class="st0">&quot;D:<span class="es0">\P</span>hotos<span class="es0">\C</span>osta Rica 2012<span class="es0">\V</span>ideos<span class="es0">\i</span>mg_0128.mov&quot;</span><span class="sy0">,</span> color<span class="sy0">=</span><span class="nu0">1</span><span class="br0">&#41;</span>
clip <span class="sy0">=</span> TurnLeft<span class="br0">&#40;</span><span class="br0">&#41;</span>
clip <span class="sy0">+</span> BlankClip<span class="br0">&#40;</span>clip<span class="sy0">,</span> <span class="nu0">10</span><span class="br0">&#41;</span>
Deshaker<span class="br0">&#40;</span><span class="st0">&quot;18|2|30|4|1|0|1|0|640|480|1|1|650|650|1000|650|4|0|6|2|8|30|300|4|
D:<span class="es0">\P</span>hotos<span class="es0">\C</span>osta Rica 2012<span class="es0">\V</span>ideos<span class="es0">\i</span>mg_0128_deshaker.log|0|0|0|0|0|0|0|0|0|0|0|
0|0|1|12|12|10|5|1|1|10|10|0|0|1|0|1|0|0|10|1000|1|88|1|1|20|400|90|20|1&quot;</span><span class="br0">&#41;</span>
Trim<span class="br0">&#40;</span><span class="nu0">0</span><span class="sy0">,</span> FrameCount <span class="sy0">-</span> <span class="nu0">3</span><span class="br0">&#41;</span>
Width <span class="sy0">&gt;</span> Height <span class="sy0">?</span> Lanczos4Resize<span class="br0">&#40;</span><span class="nu0">960</span><span class="sy0">,</span> <span class="nu0">540</span><span class="br0">&#41;</span> <span class="sy0">:</span> Lanczos4Resize<span class="br0">&#40;</span><span class="nu0">540</span><span class="sy0">,</span> <span class="nu0">960</span><span class="br0">&#41;</span>
TurnRight<span class="br0">&#40;</span><span class="br0">&#41;</span></pre></div>
<p>The Deshaker options string is identical to the previous one, except the second option changes to <code>2</code> to indicate this is the second pass. This time, Deshaker will output the corrected video frames, but we need to perform a few extra steps in the script:</p>
<ul>
<li>Deshaker outputs 10 bogus frames at the beginning of the clip &#8212; the same number we specified for options #46 &#8211; 47 (Previous and Future frames) in the options string. We&#8217;ll trim those later, from the ffmpeg command line. It also eats the final 10 frames. To avoid losing real frames, we append 10 blank frames at the end of the input video, using the expression <code>clip + BlankClip(clip, 10)</code>.
</li>
<li>For some reason, Deshaker also adds two garbage frames at the end of the video. Those are removed using the <code><a href="http://avisynth.org/mediawiki/Trim">Trim()</a></code> function.
</li>
<li>I like to resize the video down to 960&#215;540 using <code><a href="http://avisynth.org/mediawiki/Resize">Lanczos4Resize()</a></code>. This makes the file size smaller and the processing time shorter, while still retaining a lot of detail. Besides, 1920&#215;1080 is way too big and slow to view on the average desktop.
</li>
<li>Finally, we rotate the video back to its correct orientation.
</li>
</ul>
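<p>The frame bookkeeping described in the list above can be checked with a bit of arithmetic. This sketch assumes Deshaker&#8217;s output is exactly two frames longer than its input (the bogus leading frames replace the eaten trailing ones, plus two garbage frames), which is my reading of the behavior described here rather than something documented. The names are hypothetical.</p>

```cpp
// Frame counts at each stage of pass 2, following the description above.
// `lookahead` is the value of options #46 - 47 (10 in this article).
struct Pass2Frames
{
    int deshakerInput;   // real frames plus appended blank frames
    int deshakerOutput;  // bogus leading frames + surviving frames + 2 garbage frames
    int afterTrim;       // after Trim(0, FrameCount - 3), which keeps FrameCount - 2 frames
    int afterSeek;       // after ffmpeg discards the bogus leading frames
};

Pass2Frames countPass2Frames(int realFrames, int lookahead)
{
    Pass2Frames f;
    f.deshakerInput = realFrames + lookahead;      // clip + BlankClip(clip, lookahead)
    f.deshakerOutput = f.deshakerInput + 2;        // assumption: input length plus two garbage frames
    f.afterTrim = f.deshakerOutput - 2;            // Trim's end index is inclusive
    f.afterSeek = f.afterTrim - lookahead;         // bogus frames dropped on the ffmpeg side
    return f;
}
```

<p>Under these assumptions, the final frame count equals the number of real frames in the original clip, which is the whole point of appending the blank frames.</p>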
<p>We run the video through the above AVISynth script using the following unwieldy command line:</p>
<pre>
c:\util\ffmpeg\ffmpeg.exe -y -i pass2.avs -itsoffset 0.33333 -i img_0128.mov -map 0 -map 1:1
-pix_fmt yuv420p -vcodec libx264 -preset veryslow -crf 15 -x264opts frameref=15:fast_pskip=0
-acodec copy -ss 0.33333 img_0128_inprogress.mp4
</pre>
<p>The idea behind this command line is to grab the processed video from the AVISynth script, grab the audio from the original iPhone 4S video, splice them together while compensating for audio drift, and save the result as <code>img_0128_inprogress.mp4</code> using the best compression. I chose to let ffmpeg handle the audio because when I tried to let AVISynth handle it (by adding <code>audio=1</code> to <code>QTInput</code>), ffmpeg reported errors. You can look up each command-line argument in the <a href="http://ffmpeg.org/ffmpeg.html">ffmpeg documentation</a>, but here&#8217;s a detailed breakdown:</p>
<ul>
<li><code>-i pass2.avs</code> and <code>-i img_0128.mov</code> define the two input files for ffmpeg.
</li>
<li><code>-map 0 -map 1:1</code> tells ffmpeg to take the video from the first input file and the audio from the second input file.
</li>
<li><code>-itsoffset 0.33333</code> tells ffmpeg to delay the audio signal of <code>img_0128.mov</code> so that it matches the video frames coming from <code>pass2.avs</code>. Remember, Deshaker added 10 bogus frames to the beginning of the video, and since iPhone videos run at roughly 30 frames/sec, 10 frames equals about 0.33333 seconds.
</li>
<li><code>-ss 0.33333</code> tells it to discard the first 10 frames when writing the output, while still keeping the audio correctly aligned. This is the only way I found to correctly discard the bogus video frames. (I tried using the Trim command within AVISynth instead, but it doesn&#8217;t work: It only ends up trimming the input, and Deshaker still adds bogus frames at the start.)
</li>
<li><code>-pix_fmt yuv420p</code> converts the video frames to a color space compatible with H.264. The command line will still work without this option, but you&#8217;ll get a warning if you remove it.
</li>
<li><code>-vcodec libx264 -preset veryslow -crf 15 -x264opts frameref=15:fast_pskip=0</code> tells ffmpeg to use the highest quality H.264 compression. I basically just copied these settings from <a href="https://wiki.archlinux.org/index.php/FFmpeg#Single-pass_x264_.28very_high-quality.29">here</a>. I didn&#8217;t bother with 2-pass H.264 compression because I didn&#8217;t want to run Deshaker twice. These settings give excellent video quality, though the resulting bitrate tends to be very high and could probably be optimized. If anyone wants to share some tips here, I&#8217;m all ears.
</li>
<li><code>-acodec copy</code> tells ffmpeg to copy the compressed AAC audio data directly from the input to the output file without converting it. This way, we save processing time without degrading audio quality.
</li>
</ul>
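<p>The value passed to <code>-itsoffset</code> and <code>-ss</code> is simply the number of bogus frames divided by the frame rate. Here&#8217;s a small sketch of that calculation; the function name <code>frameOffsetSeconds</code> is hypothetical, and five decimal places is just the precision used in the command line above.</p>

```cpp
#include <cstdio>
#include <string>

// Duration of `frames` frames at `fps` frames per second,
// formatted with five decimals for use on the ffmpeg command line.
std::string frameOffsetSeconds(int frames, double fps)
{
    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "%.5f", frames / fps);
    return buffer;
}
```

<p>If the number of previous and future frames in the Deshaker options changes, the offset should change with it, so it makes sense to derive both from the same value.</p>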
<p>And that&#8217;s it! I&#8217;m not saying this is the best approach, but it worked well for me. If I were willing to spend more time on it, I would investigate using <a href="http://avisynth.org.ru/mvtools/mvtools2.html">MVTools2</a> to estimate the camera motion and generate the Deshaker log file in the first pass, as it seems to run faster based on past experience.</p>
<p>If you have any success using this script to remove camera shake from your own videos, I&#8217;d be interested to hear about it!</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120415/how-to-remove-camera-shake-from-iphone-4s-videos/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Implementing a Recursive Mutex</title>
		<link>http://preshing.com/20120305/implementing-a-recursive-mutex</link>
		<comments>http://preshing.com/20120305/implementing-a-recursive-mutex#comments</comments>
		<pubDate>Mon, 05 Mar 2012 11:53:55 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2786</guid>
		<description><![CDATA[When optimizing code for multiple CPU cores, sometimes you need to write a new synchronization primitive. I don&#8217;t mean to encourage it, but it does happen. And if you&#8217;re going to do it, you might as well start by looking &#8230; <a href="http://preshing.com/20120305/implementing-a-recursive-mutex">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When optimizing code for multiple CPU cores, sometimes you need to write a new synchronization primitive. I don&#8217;t mean to encourage it, but it does happen. And if you&#8217;re going to do it, you might as well start by looking at a few examples. This won&#8217;t save you from shooting yourself in the foot, but it may help reduce the number of times, so you can walk away with a few toes remaining.</p>
<p>In my previous post, I showed <a href="http://preshing.com/20120226/roll-your-own-lightweight-mutex">how to implement a synchronization primitive known as the Benaphore</a> in C++ on Win32. The Benaphore is not lock-free (being a lock itself), but it does serve as a simple yet instructive example of writing a synchronization primitive in user space. It also offers very low overhead when there&#8217;s no lock contention.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/03/recursive-mutex.png" alt="" title="" width="75" height="97" class="alignright size-full wp-image-2853" />One limitation of the implementation I showed was that it was non-recursive. This means that if the same thread attempts to obtain the same lock twice, it will deadlock. In this post, I&#8217;ll show how to extend the implementation to support recursive locking.</p>
<p><span id="more-2786"></span>Recursive locking is useful when you have a module which calls itself through its own public interface. For example, in a memory manager, you might encounter the following code:</p>
<div class="cpp"><pre class="de1"><span class="kw4">void</span><span class="sy2">*</span> MemoryManager<span class="sy4">::</span><span class="me2">Realloc</span><span class="br0">&#40;</span><span class="kw4">void</span><span class="sy2">*</span> ptr, <span class="kw4">size_t</span> size<span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    AUTO_LOCK_MACRO<span class="br0">&#40;</span>m_lock<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    <span class="kw1">if</span> <span class="br0">&#40;</span>ptr <span class="sy1">==</span> <span class="kw2">NULL</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        <span class="kw1">return</span> Alloc<span class="br0">&#40;</span>size<span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
    <span class="kw1">else</span> <span class="kw1">if</span> <span class="br0">&#40;</span>size <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        Free<span class="br0">&#40;</span>ptr<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="kw1">return</span> <span class="kw2">NULL</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
    <span class="kw1">else</span>
        ...
<span class="br0">&#125;</span>
&nbsp;
<span class="kw4">void</span><span class="sy2">*</span> MemoryManager<span class="sy4">::</span><span class="me2">Alloc</span><span class="br0">&#40;</span><span class="kw4">size_t</span> size<span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    AUTO_LOCK_MACRO<span class="br0">&#40;</span>m_lock<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    ...
<span class="br0">&#125;</span></pre></div>
<p><code>AUTO_LOCK_MACRO</code> is, of course, one of those funky C++ macros which obtains the lock and automatically unlocks it when we exit the function scope.</p>
<p>As you can see, if we pass <code>NULL</code> to <code>Realloc</code>, the lock will be obtained once by the <code>Realloc</code> function, and a second time (recursively) when <code>Alloc</code> is called. Obviously, it would be very easy to modify this particular example to avoid the recursive lock, but in large, multithreaded projects, you&#8217;re likely to find other examples.</p>
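<p>For readers who haven&#8217;t seen one, a macro like <code>AUTO_LOCK_MACRO</code> is typically a thin wrapper around a scoped-lock class. The following is only a sketch of one possible definition, not the macro used in any particular codebase. It assumes a C++11 compiler (for <code>decltype</code>) and a lock type exposing <code>Lock()</code> and <code>Unlock()</code>; the names <code>ScopedLock</code> and <code>AUTO_LOCK_CONCAT</code> are hypothetical.</p>

```cpp
// Hypothetical scoped-lock helper: locks in the constructor, unlocks in the destructor.
template <typename LockType>
class ScopedLock
{
public:
    explicit ScopedLock(LockType& lock) : m_lock(lock) { m_lock.Lock(); }
    ~ScopedLock() { m_lock.Unlock(); }

    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    LockType& m_lock;
};

// Pasting __LINE__ into the variable name lets the macro appear
// more than once in the same function without a name clash.
#define AUTO_LOCK_CONCAT2(a, b) a##b
#define AUTO_LOCK_CONCAT(a, b) AUTO_LOCK_CONCAT2(a, b)
#define AUTO_LOCK_MACRO(lock) \
    ScopedLock<decltype(lock)> AUTO_LOCK_CONCAT(autoLock_, __LINE__)(lock)
```

<p>Because the unlock happens in a destructor, every <code>return</code> path out of the function releases the lock, including the early returns in <code>Realloc</code>.</p>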
<p>We can extend our Win32 implementation of the Benaphore to support recursive locking as follows. I&#8217;ve added two new members to the class: <code>m_owner</code>, which stores the thread ID (TID) of the current owner, and <code>m_recursion</code>, which stores the recursion count.</p>
<p>Expert readers will note that this code does <em>not</em> use the new <a href="http://www.open-std.org/JTC1/sc22/wg21/docs/papers/2007/n2427.html">C++11 atomic library standard</a>. As such, it&#8217;s destined to go out of style in the long run. However, this is the style we&#8217;ve been using in the game industry since the mid-2000s. It will compile using any Microsoft compiler, and all Windows-specific calls have equivalents on other platforms.</p>
<div class="cpp"><pre class="de1"><span class="co1">// Define this to {} in a retail build:</span>
<span class="co2">#define LIGHT_ASSERT(x) { if (!(x)) DebugBreak(); }</span>
&nbsp;
<span class="kw2">class</span> RecursiveBenaphore
<span class="br0">&#123;</span>
<span class="kw2">private</span><span class="sy4">:</span>
    LONG m_counter<span class="sy4">;</span>
    DWORD m_owner<span class="sy4">;</span>
    DWORD m_recursion<span class="sy4">;</span>
    HANDLE m_semaphore<span class="sy4">;</span>
&nbsp;
<span class="kw2">public</span><span class="sy4">:</span>
    RecursiveBenaphore<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        m_counter <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        m_owner <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>            <span class="co1">// an invalid thread ID</span>
        m_recursion <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        m_semaphore <span class="sy1">=</span> CreateSemaphore<span class="br0">&#40;</span><span class="kw2">NULL</span>, <span class="nu0">0</span>, <span class="nu0">1</span>, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    ~RecursiveBenaphore<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        CloseHandle<span class="br0">&#40;</span>m_semaphore<span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="kw4">void</span> Lock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        DWORD tid <span class="sy1">=</span> GetCurrentThreadId<span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>_InterlockedIncrement<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter<span class="br0">&#41;</span> <span class="sy1">&gt;</span> <span class="nu0">1</span><span class="br0">&#41;</span> <span class="co1">// x86/64 guarantees acquire semantics</span>
        <span class="br0">&#123;</span>
            <span class="kw1">if</span> <span class="br0">&#40;</span>tid <span class="sy3">!</span><span class="sy1">=</span> m_owner<span class="br0">&#41;</span>
                WaitForSingleObject<span class="br0">&#40;</span>m_semaphore, INFINITE<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
        <span class="co1">//--- We are now inside the Lock ---</span>
        m_owner <span class="sy1">=</span> tid<span class="sy4">;</span>
        m_recursion<span class="sy2">++</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="kw4">void</span> Unlock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        DWORD tid <span class="sy1">=</span> GetCurrentThreadId<span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>
        LIGHT_ASSERT<span class="br0">&#40;</span>tid <span class="sy1">==</span> m_owner<span class="br0">&#41;</span><span class="sy4">;</span>
        DWORD recur <span class="sy1">=</span> <span class="sy2">--</span>m_recursion<span class="sy4">;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>recur <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
            m_owner <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        LONG result <span class="sy1">=</span> _InterlockedDecrement<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter<span class="br0">&#41;</span><span class="sy4">;</span> <span class="co1">// x86/64 guarantees release semantics</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>result <span class="sy1">&gt;</span> <span class="nu0">0</span><span class="br0">&#41;</span>
        <span class="br0">&#123;</span>
            <span class="kw1">if</span> <span class="br0">&#40;</span>recur <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
                ReleaseSemaphore<span class="br0">&#40;</span>m_semaphore, <span class="nu0">1</span>, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
        <span class="co1">//--- We are now outside the Lock ---</span>
    <span class="br0">&#125;</span>
<span class="br0">&#125;</span><span class="sy4">;</span></pre></div>
<p>As in <a href="http://preshing.com/20120226/roll-your-own-lightweight-mutex">the original Benaphore</a>, the first thread to call <code>Lock</code> will take ownership without making any expensive kernel calls. It also performs some bookkeeping: It sets <code>m_owner</code> to its own TID, and <code>m_recursion</code> becomes 1. If the same thread calls <code>Lock</code> again, it will increment both <code>m_counter</code> and <code>m_recursion</code>.</p>
<p>Correspondingly, when the same thread calls <code>Unlock</code>, it will decrement both <code>m_counter</code> and <code>m_recursion</code>, but it will only call <code>ReleaseSemaphore</code> once <code>m_recursion</code> is decremented back down to 0. If <code>m_recursion</code> remains greater than 0, it means that the current thread is still holding the lock in an outer scope, so it&#8217;s not yet safe to relinquish ownership to other threads.</p>
<p>Now, if you scour the Internet, you&#8217;ll find it&#8217;s full of broken lock-free code and synchronization primitives. So why should you believe the code here is any different? For one thing, it&#8217;s been <strong>stress tested</strong>. In my opinion, that&#8217;s the most valuable thing. I&#8217;ve written a small test application which spawns various numbers of threads, each hammering on this lock at random times and with random recursion depths. It updates some shared data within each lock and performs various consistency checks. You can download the source code <a href="http://preshing.com/wp-content/uploads/2012/03/RecursiveBenaphoreTest.zip">here</a>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/03/stress-test.png" alt="" title="" width="527" height="270" class="aligncenter size-full wp-image-2833" /></p>
<p>And for good measure, here&#8217;s a <code>TryLock</code> method.</p>
<div class="cpp"><pre class="de1">    <span class="kw4">bool</span> TryLock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        DWORD tid <span class="sy1">=</span> GetCurrentThreadId<span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>m_owner <span class="sy1">==</span> tid<span class="br0">&#41;</span>
        <span class="br0">&#123;</span>
            <span class="co1">// Already inside the lock</span>
            _InterlockedIncrement<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
        <span class="kw1">else</span>
        <span class="br0">&#123;</span>
            LONG result <span class="sy1">=</span> _InterlockedCompareExchange<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter, <span class="nu0">1</span>, <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
            <span class="kw1">if</span> <span class="br0">&#40;</span>result <span class="sy3">!</span><span class="sy1">=</span> <span class="nu0">0</span><span class="br0">&#41;</span>
                <span class="kw1">return</span> <span class="kw2">false</span><span class="sy4">;</span>
            <span class="co1">//--- We are now inside the Lock ---</span>
            m_owner <span class="sy1">=</span> tid<span class="sy4">;</span>
        <span class="br0">&#125;</span>
        m_recursion<span class="sy2">++</span><span class="sy4">;</span>
        <span class="kw1">return</span> <span class="kw2">true</span><span class="sy4">;</span>
    <span class="br0">&#125;</span></pre></div>
<p>For those interested in the fine details, here are a few in particular:</p>
<ul>
<li>
In <code>RecursiveBenaphore::Unlock</code>, it&#8217;s important to set <code>m_owner</code> back to 0 before calling <code>_InterlockedDecrement</code>. Otherwise, data corruption is possible. For example, suppose there are two threads with TIDs 123 and 456. Thread 123 has just completed a call to <code>Unlock</code>, leaving <code>m_owner</code> set to 123. The following could happen next:</p>
<ol>
<li>Both threads simultaneously enter <code>RecursiveBenaphore::Lock</code>.</li>
<li>Thread 456 performs <code>_InterlockedIncrement</code>, gets 1 as the result, and therefore skips the <code>WaitForSingleObject</code>.</li>
<li>Thread 123 performs <code>_InterlockedIncrement</code> and gets 2 as the result.</li>
<li>Thread 123 checks and sees that <code>tid == m_owner</code>, because thread 456 hasn&#8217;t changed it yet. Therefore, it also skips over <code>WaitForSingleObject</code>.</li>
</ol>
<p>Shortly thereafter, both threads will return from <code>Lock</code>, each believing it owns the lock. The data protected by the lock will likely become corrupted. Indeed, if you download the test application and delete this part of <code>RecursiveBenaphore::Unlock</code>, it will fail pretty quickly.<br />&nbsp;
</li>
<li>
Also in <code>RecursiveBenaphore::Unlock</code>, the value of <code>m_recursion</code> is copied to a local variable exactly once, and used locally from that point on. We would not, for example, want to re-read <code>m_recursion</code> after <code>_InterlockedDecrement</code>. At that point, another thread could have changed it.<br />&nbsp;
</li>
<li>
You may notice that <code>m_recursion</code> is modified without using any atomic operations. That&#8217;s because between the call to <code>_InterlockedIncrement</code> in <code>Lock</code> and <code>_InterlockedDecrement</code> in <code>Unlock</code>, the thread owning the lock has exclusive access to both <code>m_owner</code> and <code>m_recursion</code>, with all the necessary acquire and release semantics. Using atomics on <code>m_recursion</code> would be unnecessary and wasteful.
</li>
</ul>
<p>How is the last point guaranteed? It&#8217;s guaranteed by the semaphore in the slow case, and the atomic instructions in the uncontended case. On x86 and x64, the <code>_InterlockedIncrement</code> call generates a <code>lock xadd</code> instruction, which acts as a full memory barrier, guaranteeing both acquire and release semantics. This full-barrier behavior is a property of x86/64 and should not be assumed elsewhere. If you ported this code to a dual-core iOS device, like the iPad 2, it wouldn&#8217;t be enough to call <code><a href="https://developer.apple.com/library/mac/#documentation/DriversKernelHardware/Reference/libkern_ref/OSAtomic_h/index.html#//apple_ref/c/func/OSAtomic.h/OSAtomicIncrement32">OSAtomicIncrement32</a></code> in place of <code>_InterlockedIncrement</code>. You&#8217;d have to call <code><a href="https://developer.apple.com/library/mac/#documentation/DriversKernelHardware/Reference/libkern_ref/OSAtomic_h/index.html#//apple_ref/c/func/OSAtomic.h/OSAtomicIncrement32Barrier">OSAtomicIncrement32Barrier</a></code> to get similar guarantees. Even on Xbox 360, which shares the Win32 API but runs on PowerPC, the correct function to call is actually <code><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms683618(v=vs.85).aspx">InterlockedIncrementAcquire</a></code>.</p>
<p>I may have lost a few readers by this paragraph. Hopefully, it&#8217;s an <em>exciting</em> kind of lost. I&#8217;ll talk more about memory access semantics in the next post.</p>
<p>For those who haven&#8217;t delved into writing synchronization primitives, perhaps the <code>RecursiveBenaphore</code> has offered a glimpse into how delicate such code can be. Every small detail is there for a reason: ordering is critical, and hidden guarantees are at play.</p>
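<p>For readers without a Windows toolchain, the details above can be collected into a portable sketch. This is not the code from the test application. It assumes C++11 <code>std::atomic</code> in place of the <code>_Interlocked</code> intrinsics and a small mutex-and-condition-variable semaphore in place of the Win32 one. The names <code>SimpleSemaphore</code>, <code>PortableRecursiveBenaphore</code> and <code>CountRecursively</code> are made up for illustration. The increment asks for acquire semantics and the decrement asks for release semantics explicitly, instead of relying on the full barrier that x86/64 happens to provide.</p>

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the Win32 semaphore; an assumption of this sketch.
class SimpleSemaphore
{
    std::mutex m_mutex;
    std::condition_variable m_cond;
    int m_count = 0;

public:
    void Wait()
    {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_cond.wait(guard, [this] { return m_count > 0; });
        --m_count;
    }

    void Signal()
    {
        {
            std::lock_guard<std::mutex> guard(m_mutex);
            ++m_count;
        }
        m_cond.notify_one();
    }
};

class PortableRecursiveBenaphore
{
    std::atomic<int> m_counter{0};
    std::atomic<std::thread::id> m_owner{std::thread::id()};  // default id means "no owner"
    int m_recursion = 0;  // plain int: only touched while holding the lock
    SimpleSemaphore m_semaphore;

public:
    void Lock()
    {
        const std::thread::id tid = std::this_thread::get_id();
        // fetch_add returns the OLD value, so "> 0" here corresponds to
        // "_InterlockedIncrement(...) > 1" in the Win32 code.
        if (m_counter.fetch_add(1, std::memory_order_acquire) > 0)
        {
            if (tid != m_owner.load(std::memory_order_relaxed))
                m_semaphore.Wait();
        }
        //--- We are now inside the lock ---
        m_owner.store(tid, std::memory_order_relaxed);
        m_recursion++;
    }

    void Unlock()
    {
        assert(m_owner.load(std::memory_order_relaxed) == std::this_thread::get_id());
        const int recursion = --m_recursion;  // copied once, used locally from here on
        if (recursion == 0)
            m_owner.store(std::thread::id(), std::memory_order_relaxed);  // clear BEFORE the decrement
        if (m_counter.fetch_sub(1, std::memory_order_release) > 1)
        {
            if (recursion == 0)
                m_semaphore.Signal();
        }
        //--- We are now outside the lock ---
    }
};

// Hypothetical stress helper: each thread takes the lock twice (recursively)
// before bumping a plain int that has no protection of its own.
int CountRecursively(int threadCount, int iterations)
{
    PortableRecursiveBenaphore lock;
    int total = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; t++)
    {
        threads.emplace_back([&lock, &total, iterations] {
            for (int i = 0; i < iterations; i++)
            {
                lock.Lock();
                lock.Lock();
                total++;
                lock.Unlock();
                lock.Unlock();
            }
        });
    }
    for (std::thread& th : threads)
        th.join();
    return total;
}
```

<p>With a correct lock, <code>CountRecursively(4, 5000)</code> returns 20000. Moving the store that clears <code>m_owner</code> below the <code>fetch_sub</code> would reopen the window described in the first bullet point.</p>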
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120305/implementing-a-recursive-mutex/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Roll Your Own Lightweight Mutex</title>
		<link>http://preshing.com/20120226/roll-your-own-lightweight-mutex</link>
		<comments>http://preshing.com/20120226/roll-your-own-lightweight-mutex#comments</comments>
		<pubDate>Sun, 26 Feb 2012 16:56:41 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2749</guid>
		<description><![CDATA[In an earlier post, I pointed out the importance of using a lightweight mutex. I also mentioned it was possible to write your own, provided you can live with certain limitations. Why would you do such a thing? Well, in &#8230; <a href="http://preshing.com/20120226/roll-your-own-lightweight-mutex">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In an earlier post, I pointed out the <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">importance of using a lightweight mutex</a>. I also mentioned it was possible to write your own, provided you can live with certain limitations.</p>
<p>Why would you do such a thing? Well, in the past, some platforms (like BeOS) didn&#8217;t provide a lightweight mutex in the native API. Today, that&#8217;s not really a concern. I&#8217;m mainly showing this because it&#8217;s an interesting look at implementing synchronization primitives in general. As a bonus, it just so happens this implementation shaves almost 50% off the overhead of the Windows Critical Section in the uncontended case.</p>
<p>For the record, there are numerous ways to write your own mutex &#8212; or lock &#8212; entirely in user space, each with its own tradeoffs:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Spinlock">Spin locks</a>. These employ a busy-wait strategy which has the potential to waste CPU time, and in the worst case, can lead to livelock when competing threads run on the same core. Still, some programmers have found measurable speed improvements switching to spin locks in certain cases.</li>
<li><a href="http://en.wikipedia.org/wiki/Peterson%27s_algorithm">Peterson&#8217;s algorithm</a> is like a spinlock for two threads. A neat trick, but seems useless on today&#8217;s platforms. I find it noteworthy because Bartosz Milewski used this algorithm as a case study while <a href="http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/">discussing the finer points of the x86 memory model</a>.</li>
<li>Charles Bloom has a <a href="http://cbloomrants.blogspot.com/2011/07/07-15-11-review-of-many-mutex.html">long writeup describing various mutex implementations</a>. Excellent information, but possibly Greek to anyone unfamiliar with C++11&#8217;s atomics library and <a href="http://www.1024cores.net/home/relacy-race-detector">Relacy</a>&#8217;s ($) notation.</li>
</ul>
<p><span id="more-2749"></span>Some of those implementations are pretty advanced. Here&#8217;s a relatively simple technique, using a semaphore and some atomic operations. I came up with it while writing my <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">post about lock contention</a>, but soon afterwards learned it was already in use as far back as 1996, when some engineers referred to it as the <a href="http://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26">Benaphore</a>. Here&#8217;s a C++ implementation for Win32:</p>
<div class="cpp"><pre class="de1"><span class="co2">#include &lt;windows.h&gt;</span>
<span class="co2">#include &lt;intrin.h&gt;</span>
&nbsp;
<span class="kw2">class</span> Benaphore
<span class="br0">&#123;</span>
<span class="kw2">private</span><span class="sy4">:</span>
    LONG m_counter<span class="sy4">;</span>
    HANDLE m_semaphore<span class="sy4">;</span>
&nbsp;
<span class="kw2">public</span><span class="sy4">:</span>
    Benaphore<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        m_counter <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
        m_semaphore <span class="sy1">=</span> CreateSemaphore<span class="br0">&#40;</span><span class="kw2">NULL</span>, <span class="nu0">0</span>, <span class="nu0">1</span>, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    ~Benaphore<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        CloseHandle<span class="br0">&#40;</span>m_semaphore<span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="kw4">void</span> Lock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>_InterlockedIncrement<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter<span class="br0">&#41;</span> <span class="sy1">&gt;</span> <span class="nu0">1</span><span class="br0">&#41;</span>
        <span class="br0">&#123;</span>
            WaitForSingleObject<span class="br0">&#40;</span>m_semaphore, INFINITE<span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="kw4">void</span> Unlock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>_InterlockedDecrement<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter<span class="br0">&#41;</span> <span class="sy1">&gt;</span> <span class="nu0">0</span><span class="br0">&#41;</span>
        <span class="br0">&#123;</span>
            ReleaseSemaphore<span class="br0">&#40;</span>m_semaphore, <span class="nu0">1</span>, <span class="kw2">NULL</span><span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="br0">&#125;</span>
    <span class="br0">&#125;</span>
<span class="br0">&#125;</span><span class="sy4">;</span></pre></div>
<p>This implementation also serves as a convenient introduction to atomics, which are at the heart of many lock-free algorithms.</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/atomic-turnstile.png" alt="" title="" width="161" height="135" class="alignright size-full wp-image-2761" /><code><a href="http://msdn.microsoft.com/en-us/library/2ddez55b(v=vs.71).aspx">_InterlockedIncrement</a></code> is an <strong>atomic operation</strong> on Win32. If multiple threads attempt an atomic operation at the same time, on the same piece of data, they will all line up in a row and execute one-at-a-time. This makes it possible to reason about what happens, and ensure correctness. It even works on multicore and multiprocessor systems. (For more information about atomics, check out <a href="http://jfdube.wordpress.com/2011/11/30/understanding-atomic-operations/">this post</a> by JF Dub&eacute;.)</p>
<p>Every modern processor supports atomic operations, though the APIs may differ in the meaning of the return values. On Win32, <code>_InterlockedIncrement</code> adds 1 to the specified integer and returns the <em>new</em> value. Since <code>m_counter</code> is initialized to 0, the first thread to call <code>Lock</code> will receive a return value of 1 from <code>_InterlockedIncrement</code>. As such, it skips over the <code>WaitForSingleObject</code> call and returns immediately. The lock now belongs to this thread, and life is peachy.</p>
<p>If another thread calls <code>Lock</code> while the first thread is still holding it, it will receive a return value of 2 from <code>_InterlockedIncrement</code>. This is a clue that the lock is already busy. At this point, it&#8217;s not safe to continue, so we bite the bullet and jump into one of those expensive kernel calls: <code><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx">WaitForSingleObject</a></code>. This performs a decrement on the semaphore. We specified an initial count of 0 in <code>CreateSemaphore</code>, so the thread is now forced to wait until someone else comes along and increments this semaphore before it can proceed.</p>
<p>Next, suppose the first thread calls <code>Unlock</code>. The return value of <code>_InterlockedDecrement</code> will be 1. This is a clue that another thread is waiting for the lock, and that we should increment the semaphore using <code><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms685071(v=vs.85).aspx">ReleaseSemaphore</a></code>. The second thread is then able to continue, and it effectively obtains ownership of the lock.</p>
<p>Even if the timing is very tight, and the first thread calls <code>ReleaseSemaphore</code> <em>before</em> the second calls <code>WaitForSingleObject</code>, everything will function normally. And if you add a third, fourth or any number of other threads into the picture, that&#8217;s fine too. For good measure, you can even add a <code>TryLock</code> function to the implementation:</p>
<div class="cpp"><pre class="de1">    <span class="kw4">bool</span> TryLock<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LONG result <span class="sy1">=</span> _InterlockedCompareExchange<span class="br0">&#40;</span><span class="sy3">&amp;</span>m_counter, <span class="nu0">1</span>, <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
        <span class="kw1">return</span> <span class="br0">&#40;</span>result <span class="sy3">!</span><span class="sy1">=</span> <span class="nu0">0</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span></pre></div>
<h2>Performance and Caveats</h2>
<p>You might have noticed the underscore in front of <code>_InterlockedIncrement</code>. This is the <a href="http://en.wikipedia.org/wiki/Intrinsic_function">compiler intrinsic</a> version of <code><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms683614(v=vs.85).aspx">InterlockedIncrement</a></code>. It outputs a <code>lock xadd</code> instruction in place. And since <code>Lock</code> is defined right inside the class definition of <code>Benaphore</code>, the compiler treats it as an inline function. A call to <code>Benaphore::Lock</code> compiles down to 10 instructions using default Release settings, and there are no function calls in the uncontended case. Here&#8217;s the disassembly:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/benaphore-disasm.png" alt="" title="" width="576" height="179" class="aligncenter size-full wp-image-2754" /></p>
<p>In the uncontended case, the Benaphore even outperforms a Critical Section on Win32. I timed a pair of uncontended lock/unlock operations, just as I <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">previously did for the Mutex and Critical Section</a>, and found the following timings:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/benaphore-timing.png" alt="" title="" width="508" height="107" class="aligncenter size-full wp-image-2769" /></p>
<p>If you have an application which is hitting a lock millions of times per second on Win32, this Benaphore implementation just might boost your overall performance by a couple of percent. If you forgo the intrinsics and just use the regular, non-intrinsic versions of the atomics, an indirect call into kernel32.dll is involved, and the Benaphore loses some of its performance edge: it takes <strong>49.8 ns</strong> on my Core 2 Duo.</p>
<p>Furthermore, with some (fairly heavy) code modifications, you could even share this Benaphore between processes &#8212; something Critical Section isn&#8217;t capable of. You&#8217;d have to put <code>m_counter</code> in <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx">shared memory</a>, and use a named semaphore.</p>
<p>There are some limitations to be aware of. First, this implementation is non-recursive. If the same thread attempts to lock the same Benaphore twice, it will deadlock. In my next post, I&#8217;ll <a href="http://preshing.com/20120305/implementing-a-recursive-mutex">extend the implementation to allow recursion</a>.</p>
<p>Finally, an even more subtle caveat: If you port this code directly to other platforms, such as MacOS X and Linux, then depending on how you use it, your code may become susceptible to <a href="http://en.wikipedia.org/wiki/Priority_inversion">priority inversion</a>. MacOS X and Linux avoid priority inversion by performing <a href="http://en.wikipedia.org/wiki/Priority_inheritance">priority inheritance</a> when you take a POSIX lock. If you use a Benaphore, you&#8217;ll bypass this OS mechanism and it won&#8217;t be able to help you. Things are different on Windows, though: Windows fights priority inversion by <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684831(v=vs.85).aspx">randomly boosting the priority of starving threads</a>, which behaves the same regardless of your choice of synchronization primitive.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120226/roll-your-own-lightweight-mutex/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>A Look Back at Single-Threaded CPU Performance</title>
		<link>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance</link>
		<comments>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance#comments</comments>
		<pubDate>Wed, 08 Feb 2012 11:28:47 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2558</guid>
		<description><![CDATA[Throughout the 80&#8217;s and 90&#8217;s, CPUs were able to run virtually any kind of software twice as fast every 18-20 months. The rate of change was incredible. Your 486SX-16 was almost obsolete by the time you got it through the &#8230; <a href="http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Throughout the 80&#8217;s and 90&#8217;s, CPUs were able to run virtually any kind of software twice as fast every 18-20 months. The rate of change was incredible. Your <a href="http://www.x86-guide.com/en/cpu/Intel-486SX-16-PGA-cpu-no24.html">486SX-16</a> was almost obsolete by the time you got it through the door. But eventually, at some point in the mid-2000&#8217;s, progress slowed down considerably for single-threaded software, which was most software.</p>
<p>Perhaps the turning point came in May 2004, when Intel <a href="http://www.eetimes.com/electronics-news/4048847/Intel-cancels-Tejas-moves-to-dual-core-designs">canceled its latest single-core development effort</a> to focus on multicore designs. Later that year, Herb Sutter wrote his now-famous article, <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">The Free Lunch Is Over</a>. Not all software will run remarkably faster year-over-year anymore, he warned us. Concurrent software would continue its meteoric rise, but single-threaded software was about to get left in the dust.</p>
<p>So, what&#8217;s happened since 2004? Clearly, multicore computing has become mainstream. Everybody acknowledges that single-threaded CPU performance no longer increases as quickly as it previously did &#8212; but at what rate is it <em>actually</em> increasing?</p>
<p>It’s tough to find an answer. Bill Dally of nVidia threw out a few numbers in a recent <a href="http://mediasite.colostate.edu/Mediasite/SilverlightPlayer/Default.aspx?peid=22c9d4e9c8cf474a8f887157581c458a1d#">presentation</a>: He had predicted 19% per year, but says it&#8217;s turned out closer to 5%. Last year, Chuck Moore of AMD <a href="http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf">presented</a> this graph, suggesting that single-threaded CPU performance recently started going backwards:</p>
<p><a href="http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf"><img src="http://preshing.com/wp-content/uploads/2012/01/dally-slide.png" alt="" title="" width="361" height="224" class="aligncenter size-full wp-image-2559" /></a></p>
<p><span id="more-2558"></span>These figures aren&#8217;t really consistent, and both struck me as a little low. Moreover, I couldn&#8217;t find another source to corroborate them. So I decided to crunch the numbers myself. I turned to <a href="http://www.spec.org/">SPEC</a>, an industry-standard benchmark that&#8217;s been going strong since 1989. It&#8217;s the same benchmark used to plot a few data points on the above graph.</p>
<p>SPEC licenses their benchmarking software to various companies, collects results back from those licensees, and makes those results available on their website. One of their benchmark series, <a href="http://en.wikipedia.org/wiki/SPECint">SPECint</a>, was designed to measure the single-threaded integer performance of a machine. That sounds perfect, except for one catch: many licensees use <a href="http://en.wikipedia.org/wiki/Automatic_parallelization">automatic parallelization</a>. I took some pains to remove those results from the dataset. I&#8217;ll share the method at the end of this post, and you can let me know if you think it&#8217;s valid.</p>
<p>I fetched SPEC&#8217;s data on Feb. 7, grouped the results by CPU brand, and generated the following graph. It consists of 5052 test results from 715 different CPU models, all gathered over the last 17 years:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/integer-perf.png" alt="" title="" width="556" height="454" class="aligncenter size-full wp-image-2624" /></p>
<p>Each test result is plotted according to its hardware availability date, and the vertical axis uses a <a href="http://en.wikipedia.org/wiki/Logarithmic_scale">logarithmic scale</a>. The graph incorporates results from three different benchmark suites (CPU95, CPU2000 and CPU2006), but I&#8217;ve <a href="http://www.spec.org/fairuse.html#NormalizedHistoricalComparisons">normalized the results</a> in order to see historic trends.</p>
<p>The red line is meant to represent <strong>mainstream</strong> CPU performance. I drew it manually, using the less-than-scientific method of eyeballing the points for Pentium, PowerPC, Athlon and Core. If you&#8217;re willing to trust this line, it seems that in the eight years since January 2004, mainstream performance has increased by a factor of about <strong>4.6x</strong>, which works out to 21% per year. Compare that to the <strong>28x</strong> increase between 1996 and 2004! Things have really slowed down.</p>
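<p>The per-year figure is just a compound growth rate: an overall factor <em>f</em> sustained over <em>n</em> years corresponds to <em>f</em><sup>1/<em>n</em></sup> &minus; 1 per year. Here is a small sketch of that arithmetic. The function name <code>AnnualGrowthRate</code> is made up for illustration and is not part of the scripts used for this post.</p>

```cpp
#include <cassert>
#include <cmath>

// Compound annual growth rate implied by an overall speedup factor
// sustained over a number of years, e.g. 4.6x over 8 years.
double AnnualGrowthRate(double factor, double years)
{
    assert(factor > 0.0 && years > 0.0);
    return std::pow(factor, 1.0 / years) - 1.0;
}
```

<p>Plugging in the numbers above, <code>AnnualGrowthRate(4.6, 8.0)</code> comes out at roughly 0.21, while <code>AnnualGrowthRate(28.0, 8.0)</code> for the 1996 to 2004 period comes out at roughly 0.52.</p>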
<p>Here are a few machines located along the red line in the graph:</p>
<table class="grid">
<tr>
<th>Hardware<br />Availability</th>
<th>Adjusted<br />Result</th>
<th>CPU Model</th>
<th>Clock<br />Rate</th>
<th>CPU Cache</th>
</tr>
<tr>
<td>Feb 2004</td>
<td>8.1</td>
<td><a href="http://www.spec.org/cpu2000/results/res2004q1/cpu2000-20040126-02769.html">Intel Pentium 4</a></td>
<td>3200 MHz</td>
<td>28KB L1, 1MB L2</td>
</tr>
<tr>
<td>Jun 2005</td>
<td>10.5</td>
<td><a href="http://www.spec.org/cpu2000/results/res2005q2/cpu2000-20050613-04262.html">AMD Athlon 64 FX-57</a></td>
<td>2800 MHz</td>
<td>128KB L1, 1MB L2</td>
</tr>
<tr>
<td>Jul 2006</td>
<td>11.4</td>
<td><a href="http://www.spec.org/cpu2000/results/res2006q3/cpu2000-20060904-07202.html">Intel Core 2 Duo E6300</a></td>
<td>1867 MHz</td>
<td>64KB L1, 2MB L2</td>
</tr>
<tr>
<td>Jul 2007</td>
<td>13.3</td>
<td><a href="http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070806-01732.html">Intel Core 2 Duo T7700</a></td>
<td>2400 MHz</td>
<td>64KB L1, 4MB L2</td>
</tr>
<tr>
<td>Sep 2008</td>
<td>17.9</td>
<td><a href="http://www.spec.org/cpu2006/results/res2008q3/cpu2006-20080902-05222.html">Intel Core 2 Duo T9600</a></td>
<td>2800 MHz</td>
<td>64KB L1, 6MB L2</td>
</tr>
<tr>
<td>May 2009</td>
<td>21.8</td>
<td><a href="http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090608-07726.html">Intel Core 2 Duo E7600</a></td>
<td>3066 MHz</td>
<td>64KB L1, 3MB L2</td>
</tr>
<tr>
<td>Jul 2010</td>
<td>24.3</td>
<td><a href="http://www.spec.org/cpu2006/results/res2010q3/cpu2006-20100812-12853.html">Intel Core i3-540</a></td>
<td>3067 MHz</td>
<td>64KB L1, 256KB L2, 4MB L3</td>
</tr>
<tr>
<td>Jun 2011</td>
<td>31.7</td>
<td><a href="http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111010-18687.html">Intel Pentium G850</a></td>
<td>2900 MHz</td>
<td>64KB L1, 256KB L2, 3MB L3</td>
</tr>
</table>
<p>As you can see, Intel deserves credit for squeezing out the most single-threaded performance since 2004. If you remove all Intel CPUs from the data, a different picture emerges:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/no-intel.png" alt="" title="" width="280" height="190" class="aligncenter size-full wp-image-2636" /></p>
<p>This is not too surprising, as AMD is <a href="http://blogs.amd.com/play/2011/10/13/our-take-on-amd-fx/">pretty open</a> about their stance on single-threaded performance. <a href="http://en.wikipedia.org/wiki/Bulldozer_(processor)">Bulldozer</a>, their latest microarchitecture, is meant to shine in multithreaded workloads.</p>
<p>So far we&#8217;ve only looked at integer performance. SPEC also publishes <a href="http://en.wikipedia.org/wiki/SPECfp">SPECfp</a>, an equivalent benchmark for floating-point performance. Floating-point performance has always been important for heavy-duty computation such as scientific simulation or 3D rendering. Here are the results, which I&#8217;ve also adjusted to eliminate autoparallelization:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/float-point-perf.png" alt="" title="" width="556" height="454" class="aligncenter size-full wp-image-2623" /></p>
<p>Prior to 2004, it climbed even faster than integer performance, at 64% per year: a doubling period of 73 weeks. After that, it leveled off at the same 21% per year.</p>
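<p>The doubling period follows from the annual rate <em>r</em> as ln 2 / ln(1 + <em>r</em>) years. A minimal sketch of the conversion, assuming 52 weeks per year; <code>DoublingPeriodWeeks</code> is a hypothetical helper name:</p>

```cpp
#include <cassert>
#include <cmath>

// Time needed to double performance at a given annual growth rate,
// expressed in weeks (assuming 52 weeks per year).
double DoublingPeriodWeeks(double annualRate)
{
    assert(annualRate > 0.0);
    const double years = std::log(2.0) / std::log(1.0 + annualRate);
    return years * 52.0;
}
```

<p><code>DoublingPeriodWeeks(0.64)</code> gives about 73 weeks, matching the figure above. At 21% per year, the same formula gives a doubling period of roughly three and a half years.</p>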
<p>Up until 2002, we see a huge difference in floating-point performance between mainstream and workstation CPUs. The Alpha, SPARC and MIPS all ran up to 8x faster. Of course, you had to pay $10000 or more to get your hands on such a workstation. This is an interesting reminder that CPUs are, in fact, things created by businesses to make money! They don&#8217;t become faster entirely by technological forces. They become faster by economic forces.</p>
<p>Which brings us back to the present day. For reasons which others understand better than me, involving <a href="http://en.wikipedia.org/wiki/Thermal_design_power">thermal design power</a> and <a href="http://en.wikipedia.org/wiki/Instruction_level_parallelism">ILP</a>, it&#8217;s now more cost-effective for manufacturers to pack additional cores onto a die than to push the single-threaded performance envelope much further.</p>
<p>Given the significance of this shift away from single-threaded performance, I was surprised to not find more information about the actual trajectory of performance since 2004. At the same time, I can&#8217;t guarantee that the data I&#8217;ve presented perfectly reflects single-threaded CPU performance. I think my conclusions are fair, but any feedback or criticism about the approach is more than welcome.</p>
<h2>How These Graphs Were Generated</h2>
<p>All Python scripts are <a href="https://github.com/preshing/analyze-spec-benchmarks">available on github</a>. These scripts will download, analyze and adjust SPEC&#8217;s data, and render the graphs. If you&#8217;d like to run them yourself, see the README file for exact instructions.</p>
<p>As already mentioned, recent compilers like <a href="http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/">Intel C++</a> and <a href="http://www-01.ibm.com/software/awdtools/xlcpp/aix/features/?S_CMP=rnav">IBM XL</a> feature <a href="http://en.wikipedia.org/wiki/Automatic_parallelization">automatic parallelization</a>, and it greatly skews the results towards certain benchmarks. For example, check out the performance of <code>462.libquantum</code> in <a href="http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20111219-19210.html">this</a> result! SPEC permits the use of autoparallelization as long as it&#8217;s clearly indicated. Unfortunately, this compiler feature is so widely enabled, I couldn&#8217;t simply exclude all such results. If I had done so, I would be left with zero results for Intel&#8217;s Core i3, i5 and i7 processor families.</p>
<p>The compromise I chose was to identify the top six benchmarks which seem to benefit from automatic parallelization, disqualify those benchmarks from the test suite, and take the geometric mean of the remaining ones. This approach assumes that automatic parallelization does not work on every benchmark. For the list of disqualified benchmarks, and the algorithm which identifies them, check the github files.</p>
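<p>The adjustment itself is easy to sketch. The real scripts are written in Python and live in the github repository. The C++ below is only an illustration of the idea, with a made-up function name and made-up inputs: skip the disqualified benchmarks, then take the geometric mean of whatever is left.</p>

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <string>

// Geometric mean of per-benchmark ratios, ignoring any benchmark whose
// name appears in the disqualified set.
double AdjustedGeometricMean(const std::map<std::string, double>& ratios,
                             const std::set<std::string>& disqualified)
{
    double logSum = 0.0;
    int count = 0;
    for (const auto& entry : ratios)
    {
        if (disqualified.count(entry.first) != 0)
            continue;  // benchmark excluded from the adjusted score
        assert(entry.second > 0.0);
        logSum += std::log(entry.second);
        count++;
    }
    assert(count > 0);  // at least one benchmark must remain
    // exp(mean of logs) equals the n-th root of the product.
    return std::exp(logSum / count);
}
```

<p>For example, with invented ratios of 2, 8 and 1000 for three benchmarks, disqualifying the third one leaves a geometric mean of 4. Leaving it in would pull the score up to about 25, which is the kind of skew the adjustment is meant to remove.</p>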
<p>In the end, you&#8217;ll find that even if you leave the disqualified benchmarks in the results, it doesn&#8217;t significantly change the conclusions in this post. It shifts most of the CPU2006 results upwards &#8212; up to 25% &#8212; which simultaneously shifts the conversion ratios from CPU95 and CPU2000 upwards, keeping everything roughly in line.</p>
<p>In the future, it would be interesting for licensees to submit more results without automatic parallelization. That would help us more easily observe the performance trend of single CPU cores.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A C++ Profiling Module for Multithreaded APIs</title>
		<link>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis</link>
		<comments>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis#comments</comments>
		<pubDate>Sat, 03 Dec 2011 23:14:59 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2378</guid>
		<description><![CDATA[In my post about lock contention, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling &#8230; <a href="http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In my post about <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">lock contention</a>, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling module, which I&#8217;ll share here.</p>
<p>A profiling module is different from conventional profilers like <a href="http://blogs.msdn.com/b/pigscanfly/archive/2008/03/02/using-the-windows-sample-profiler-with-xperf.aspx">xperf</a> or <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">VTune</a> in that no third-party tools are required. You just drop the module into any C++ application, and the process collects and reports performance data by itself.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/12/api-profiler.png" alt="" title="" width="157" height="127" class="alignright size-full wp-image-2545" />This particular profiling module is meant to act on one or more <em>target modules</em> in the application. A target module can be anything which exposes a well-defined <a href="http://en.wikipedia.org/wiki/Application_programming_interface">API</a>, such as a memory allocator. To make it work, you must insert a macro named <code>API_PROFILER</code> into every public function exposed by that API. Below, I&#8217;ve added it to <code>dlmalloc</code>, one of the functions in the <a href="http://g.oswego.edu/dl/html/malloc.html">Doug Lea Malloc</a> API. The same macro should be added to <code>dlrealloc</code>, <code>dlfree</code>, and other public functions as well.</p>
<pre>
DEFINE_API_PROFILER(dlmalloc);

void* dlmalloc(size_t bytes)
{
    <span class="highlight">API_PROFILER(dlmalloc);</span>

#if USE_LOCKS
    ensure_initialization();
#endif

    if (!PREACTION(gm))
    {
        void* mem;
        size_t nb;
        if (bytes &lt;= MAX_SMALL_REQUEST)
        {
            ...
</pre>
<p><span id="more-2378"></span>The macro takes a single argument, which is just an identifier for the target module being profiled. For this to be a valid identifier, you must place exactly one <code>DEFINE_API_PROFILER</code> macro at global scope, as seen above. You can also insert <code>DECLARE_API_PROFILER</code> anywhere at global scope, perhaps in a header file, in the same way that you'd forward declare a global variable or function.</p>
<p>When the application runs, each thread will automatically log performance statistics once per second, including the thread identifier (TID), time spent inside the target module, and the number of calls. Here, we see performance statistics across six different threads:</p>
<pre>
TID 0x13bc time spent in "dlmalloc": 7/1001 ms 0.7% 6481x
TID 0x1244 time spent in "dlmalloc": 6/1000 ms 0.6% 6166x
TID 0x198 time spent in "dlmalloc": 0/3072 ms 0.0% 2x
TID 0x11d0 time spent in "dlmalloc": 0/1113 ms 0.0% 6x
TID 0x12a4 time spent in "dlmalloc": 0/1000 ms 0.0% 20x
TID 0xc14 time spent in "dlmalloc": 4/1011 ms 0.4% 3243x
</pre>
<p>To identify each thread, simply break in the debugger and look for the TID in the Threads view.</p>
<p>Most of the profiling module is implemented in a single header file, as follows. For simplicity, I've only provided the Windows version, but you could easily port the code to other platforms.</p>
<div class="cpp"><pre class="de1"><span class="co2">#define ENABLE_API_PROFILER 1     // Comment out this line to disable the profiler</span>
&nbsp;
<span class="co2">#if ENABLE_API_PROFILER</span>
&nbsp;
<span class="co1">//------------------------------------------------------------------</span>
<span class="co1">// A class for local variables created on the stack by the API_PROFILER macro:</span>
<span class="co1">//------------------------------------------------------------------</span>
<span class="kw2">class</span> APIProfiler
<span class="br0">&#123;</span>
<span class="kw2">public</span><span class="sy4">:</span>
    <span class="co1">//------------------------------------------------------------------</span>
    <span class="co1">// A structure for each thread to store information about an API:</span>
    <span class="co1">//------------------------------------------------------------------</span>
    <span class="kw4">struct</span> ThreadInfo
    <span class="br0">&#123;</span>
        INT64 lastReportTime<span class="sy4">;</span>
        INT64 accumulator<span class="sy4">;</span>   <span class="co1">// total time spent in target module since the last report</span>
        INT64 hitCount<span class="sy4">;</span>      <span class="co1">// number of times the target module was called since last report</span>
        <span class="kw4">const</span> <span class="kw4">char</span> <span class="sy2">*</span>name<span class="sy4">;</span>    <span class="co1">// the name of the target module</span>
    <span class="br0">&#125;</span><span class="sy4">;</span>
&nbsp;
<span class="kw2">private</span><span class="sy4">:</span>
    INT64 m_start<span class="sy4">;</span>
    ThreadInfo <span class="sy2">*</span>m_threadInfo<span class="sy4">;</span>
&nbsp;
    <span class="kw4">static</span> <span class="kw4">float</span> s_ooFrequency<span class="sy4">;</span>      <span class="co1">// 1.0 divided by QueryPerformanceFrequency()</span>
    <span class="kw4">static</span> INT64 s_reportInterval<span class="sy4">;</span>   <span class="co1">// length of time between reports</span>
    <span class="kw4">void</span> Flush<span class="br0">&#40;</span>INT64 end<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
<span class="kw2">public</span><span class="sy4">:</span>
    __forceinline APIProfiler<span class="br0">&#40;</span>ThreadInfo <span class="sy2">*</span>threadInfo<span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER start<span class="sy4">;</span>
        QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>start<span class="br0">&#41;</span><span class="sy4">;</span>
        m_start <span class="sy1">=</span> start.<span class="me1">QuadPart</span><span class="sy4">;</span>
        m_threadInfo <span class="sy1">=</span> threadInfo<span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    __forceinline ~APIProfiler<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER end<span class="sy4">;</span>
        QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy2">+</span><span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> m_start<span class="br0">&#41;</span><span class="sy4">;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount<span class="sy2">++</span><span class="sy4">;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">&gt;</span> s_reportInterval<span class="br0">&#41;</span>
            Flush<span class="br0">&#40;</span>end.<span class="me1">QuadPart</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
<span class="br0">&#125;</span><span class="sy4">;</span>
&nbsp;
<span class="co1">//----------------------</span>
<span class="co1">// Profiler is enabled</span>
<span class="co1">//----------------------</span>
<span class="co2">#define DECLARE_API_PROFILER(name) \
    extern __declspec(thread) APIProfiler::ThreadInfo __APIProfiler_##name;</span>
&nbsp;
<span class="co2">#define DEFINE_API_PROFILER(name) \
    __declspec(thread) APIProfiler::ThreadInfo __APIProfiler_##name = { 0, 0, 0, #name };</span>
&nbsp;
<span class="co2">#define TOKENPASTE2(x, y) x ## y</span>
<span class="co2">#define TOKENPASTE(x, y) TOKENPASTE2(x, y)</span>
<span class="co2">#define API_PROFILER(name) \
    APIProfiler TOKENPASTE(__APIProfiler_##name, __LINE__)(&amp;__APIProfiler_##name)</span>
&nbsp;
<span class="co2">#else</span>
&nbsp;
<span class="co1">//----------------------</span>
<span class="co1">// Profiler is disabled</span>
<span class="co1">//----------------------</span>
<span class="co2">#define DECLARE_API_PROFILER(name)</span>
<span class="co2">#define DEFINE_API_PROFILER(name)</span>
<span class="co2">#define API_PROFILER(name)</span>
&nbsp;
<span class="co2">#endif</span></pre></div>
<p>The <code>DEFINE_API_PROFILER</code> macro defines a thread-local variable using the <code><a href="http://msdn.microsoft.com/en-us/library/9w1sdazb%28v=vs.80%29.aspx">__declspec(thread)</a></code> modifier. This gives each thread its own private data, independent of other threads, so the whole system works in a multithreaded environment with little performance penalty. In GCC, the equivalent storage class modifier would be <code><a href="http://gcc.gnu.org/onlinedocs/gcc-3.3.1/gcc/Thread-Local.html">__thread</a></code>. The overhead for such storage is low, but on Windows, there's one catch: <a href="http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx">you can't use it across DLLs</a>.</p>
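<p>For readers outside Visual C++ and GCC, standard C++11 offers the <code>thread_local</code> keyword for the same purpose. The following is only a minimal sketch of the idea, not part of the profiling module: <code>ThreadStats</code>, <code>g_stats</code> and <code>countCallsOnNewThread</code> are hypothetical names made up for illustration.</p>

```cpp
#include <thread>

// Hypothetical per-thread statistics, loosely modeled on ThreadInfo above.
struct ThreadStats
{
    long long hitCount;
};

// Each thread that touches g_stats gets its own zero-initialized copy.
thread_local ThreadStats g_stats = { 0 };

// Bumps a private counter `calls` times on a freshly created thread and
// returns the value that thread saw. No locking is needed for g_stats,
// because no other thread can reach the new thread's copy.
long long countCallsOnNewThread(int calls)
{
    long long seen = 0;
    std::thread worker([&seen, calls] {
        for (int i = 0; i < calls; i++)
            g_stats.hitCount++;
        seen = g_stats.hitCount;
    });
    worker.join();      // join() makes the write to `seen` visible here
    return seen;
}
```

<p>Every new thread starts from its own zeroed copy, and the calling thread's copy is never touched, which is the property the profiler relies on.</p>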
<p>The <code>API_PROFILER</code> macro creates a C++ object on the stack, taking advantage of the constructor to signal the beginning and the destructor to signal the end of the section being measured. The macro uses a <a href="http://stackoverflow.com/a/1597129">token-pasting trick</a>, using the current line number, to create unique local variable names.</p>
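<p>The same constructor/destructor pattern and token-pasting trick can be tried in isolation. The sketch below is a portable stand-in that assumes <code>std::chrono::steady_clock</code> in place of <code>QueryPerformanceCounter</code>; the names <code>ScopeStats</code>, <code>ScopedTimer</code>, <code>SCOPED_TIMER</code> and <code>sumTo</code> are hypothetical and do not appear in the module.</p>

```cpp
#include <chrono>

// Hypothetical accumulator, standing in for APIProfiler::ThreadInfo.
struct ScopeStats
{
    long long totalNs;
    long long hitCount;
};

// Starts timing in the constructor and records the result in the destructor.
class ScopedTimer
{
    ScopeStats *m_stats;
    std::chrono::steady_clock::time_point m_start;

public:
    explicit ScopedTimer(ScopeStats *stats)
        : m_stats(stats), m_start(std::chrono::steady_clock::now())
    {
    }

    ~ScopedTimer()
    {
        auto elapsed = std::chrono::steady_clock::now() - m_start;
        m_stats->totalNs +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        m_stats->hitCount++;
    }
};

// Two-level expansion so that __LINE__ becomes a number before pasting.
#define PASTE2(x, y) x ## y
#define PASTE(x, y) PASTE2(x, y)
#define SCOPED_TIMER(stats) ScopedTimer PASTE(scopedTimer_, __LINE__)(&stats)

// Example function: each marker lives until its enclosing scope ends.
int sumTo(int n, ScopeStats &outer, ScopeStats &inner)
{
    SCOPED_TIMER(outer);
    int total = 0;
    for (int i = 1; i <= n; i++)
    {
        SCOPED_TIMER(inner);    // different line, so a different variable name
        total += i;
    }
    return total;
}
```

<p>After <code>sumTo(10, outer, inner)</code> returns, <code>outer.hitCount</code> is 1 and <code>inner.hitCount</code> is 10, because each marker's destructor runs exactly once when its scope closes.</p>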
<p>It's important not to call this macro recursively. In other words, don't insert <code>API_PROFILER</code> anywhere that might be called within the scope of another <code>API_PROFILER</code> marker, using the same identifier. If you do, you'll end up counting the time spent inside the target module twice! If absolutely necessary, you could modify the profiling module to circumvent this limitation, at the cost of a little extra overhead.</p>
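<p>One possible way to lift that restriction, sketched here under my own assumptions rather than taken from the module, is to keep a nesting depth alongside the statistics and let only the outermost marker read the clock. <code>NestedStats</code>, <code>NestedTimer</code> and <code>factorial</code> are hypothetical names; in real use the statistics record would be thread-local, as in the header above.</p>

```cpp
#include <chrono>

// Hypothetical per-thread record with an added nesting depth.
struct NestedStats
{
    long long totalNs;
    long long hitCount;
    int depth;          // number of markers currently active on this thread
};

class NestedTimer
{
    NestedStats *m_stats;
    std::chrono::steady_clock::time_point m_start;

public:
    explicit NestedTimer(NestedStats *stats) : m_stats(stats)
    {
        // Only the outermost marker reads the clock.
        if (m_stats->depth++ == 0)
            m_start = std::chrono::steady_clock::now();
    }

    ~NestedTimer()
    {
        // Only the outermost marker records time and counts a hit.
        if (--m_stats->depth == 0)
        {
            auto elapsed = std::chrono::steady_clock::now() - m_start;
            m_stats->totalNs +=
                std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
            m_stats->hitCount++;
        }
    }
};

// A recursive function that places a marker at every level.
int factorial(int n, NestedStats *stats)
{
    NestedTimer timer(stats);
    return n <= 1 ? 1 : n * factorial(n - 1, stats);
}
```

<p>With this variation, a call such as <code>factorial(5, &stats)</code> registers a single hit and a single time span, at the cost of an extra increment, decrement and branch per marker.</p>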
<p>The destructor sometimes calls a function named <code>Flush</code>. It's a heavier function, so we define it in a separate <code>.cpp</code> file, and make sure each thread calls it only about once per second:</p>
<div class="cpp"><pre class="de1"><span class="co2">#if ENABLE_API_PROFILER</span>
&nbsp;
<span class="kw4">static</span> <span class="kw4">const</span> <span class="kw4">float</span> APIProfiler_ReportIntervalSecs <span class="sy1">=</span> <span class="nu17">1.0f</span><span class="sy4">;</span>
&nbsp;
<span class="kw4">float</span> APIProfiler<span class="sy4">::</span><span class="me2">s_ooFrequency</span> <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
INT64 APIProfiler<span class="sy4">::</span><span class="me2">s_reportInterval</span> <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
&nbsp;
<span class="co1">//------------------------------------------------------------------</span>
<span class="co1">// Flush is called at the rate determined by APIProfiler_ReportIntervalSecs</span>
<span class="co1">//------------------------------------------------------------------</span>
<span class="kw4">void</span> APIProfiler<span class="sy4">::</span><span class="me2">Flush</span><span class="br0">&#40;</span>INT64 end<span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    <span class="co1">// Auto-initialize globals based on timer frequency:</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>s_reportInterval <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER freq<span class="sy4">;</span>
        QueryPerformanceFrequency<span class="br0">&#40;</span><span class="sy3">&amp;</span>freq<span class="br0">&#41;</span><span class="sy4">;</span>
        s_ooFrequency <span class="sy1">=</span> <span class="nu17">1.0f</span> <span class="sy2">/</span> freq.<span class="me1">QuadPart</span><span class="sy4">;</span>
        MemoryBarrier<span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>
        s_reportInterval <span class="sy1">=</span> <span class="br0">&#40;</span>INT64<span class="br0">&#41;</span> <span class="br0">&#40;</span>freq.<span class="me1">QuadPart</span> <span class="sy2">*</span> APIProfiler_ReportIntervalSecs<span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="co1">// Avoid garbage timing on first call by initializing a new interval:</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">=</span> m_start<span class="sy4">;</span>
        <span class="kw1">return</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="co1">// Enough time has elapsed. Print statistics to console:</span>
    <span class="kw4">float</span> interval <span class="sy1">=</span> <span class="br0">&#40;</span>end <span class="sy2">-</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime<span class="br0">&#41;</span> <span class="sy2">*</span> s_ooFrequency<span class="sy4">;</span>
    <span class="kw4">float</span> measured <span class="sy1">=</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy2">*</span> s_ooFrequency<span class="sy4">;</span>
    <span class="kw3">printf</span><span class="br0">&#40;</span><span class="st0">&quot;TID 0x%x time spent in <span class="es1">\&quot;</span>%s<span class="es1">\&quot;</span>: %.0f/%.0f ms %.1f%% %dx<span class="es1">\n</span>&quot;</span>,
        GetCurrentThreadId<span class="br0">&#40;</span><span class="br0">&#41;</span>,
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>name,
        measured <span class="sy2">*</span> <span class="nu0">1000</span>,
        interval <span class="sy2">*</span> <span class="nu0">1000</span>,
        <span class="nu0">100</span>.<span class="me1">f</span> <span class="sy2">*</span> measured <span class="sy2">/</span> interval,
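        <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount<span class="br0">&#41;</span><span class="sy4">;</span>   <span class="co1">// cast: %d expects an int, hitCount is an INT64</span>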
&nbsp;
    <span class="co1">// Reset statistics and begin next timing interval:</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">=</span> end<span class="sy4">;</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
<span class="br0">&#125;</span>
&nbsp;
<span class="co2">#endif</span></pre></div>
<p>In the above code, <code>printf</code> is used for logging, but you could easily replace it with calls to <code>sprintf</code> and <code>OutputDebugString</code>, or anything else. The nice thing about logging to a console is that it works even when there is no graphical display, such as during the loading screen of a game, or when the application is starting up. Those are moments when you might be particularly interested in profiling a specific API.</p>
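<p>To make the logging destination swappable, one approach is to format the line into a buffer first and hand it to a caller-supplied function. This is a hedged sketch, not the module's code: <code>LogSink</code>, <code>formatReport</code> and <code>report</code> are hypothetical names, and the parameters are plain values rather than the module's <code>ThreadInfo</code>.</p>

```cpp
#include <cstdio>
#include <string>

// Hypothetical sink type: any function that accepts a finished log line.
typedef void (*LogSink)(const char *line);

// Formats one report in the same layout as the printf call above.
// All parameters are plain values so the function is easy to test.
std::string formatReport(unsigned threadId, const char *name,
                         float measuredSecs, float intervalSecs, int hitCount)
{
    char buffer[256];
    std::snprintf(buffer, sizeof(buffer),
                  "TID 0x%x time spent in \"%s\": %.0f/%.0f ms %.1f%% %dx\n",
                  threadId, name,
                  measuredSecs * 1000, intervalSecs * 1000,
                  100.f * measuredSecs / intervalSecs,
                  hitCount);
    return buffer;
}

// Sends the formatted line wherever the caller wants it to go.
void report(LogSink sink, unsigned threadId, const char *name,
            float measuredSecs, float intervalSecs, int hitCount)
{
    sink(formatReport(threadId, name, measuredSecs, intervalSecs, hitCount).c_str());
}
```

<p>On Windows, the sink could simply forward the line to <code>OutputDebugStringA</code>; elsewhere it might write to <code>stderr</code> or a file.</p>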
<p>Another convenient thing about this profiling module is that no explicit initialization is required. The very first time the macro is hit, its destructor will call <code>Flush</code>, because <code>lastReportTime</code> and <code>s_reportInterval</code> both start at zero. The first thread to enter <code>Flush</code> will see that <code>s_reportInterval</code> is not yet initialized, and will initialize it. It doesn't matter if two threads end up trying to initialize the globals at the same time; they will both write the same result.</p>
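<p>If you port this lazy initialization to standard C++, note that the language formally treats unsynchronized concurrent writes to a plain variable as a data race, even when every thread writes the same value. A minimal sketch of one alternative, assuming <code>std::atomic</code> and <code>std::chrono::steady_clock</code> instead of the Windows calls, could look like this; <code>g_secondsPerTick</code>, <code>g_reportIntervalTicks</code> and <code>reportIntervalTicks</code> are hypothetical names.</p>

```cpp
#include <atomic>
#include <chrono>

// Hypothetical globals mirroring s_ooFrequency and s_reportInterval.
static std::atomic<double> g_secondsPerTick(0.0);
static std::atomic<long long> g_reportIntervalTicks(0);

// Returns the length of a one-second report interval in clock ticks,
// computing it on first use. Several threads may run the initialization
// at once; they all store identical values.
long long reportIntervalTicks()
{
    long long interval = g_reportIntervalTicks.load(std::memory_order_acquire);
    if (interval == 0)
    {
        typedef std::chrono::steady_clock::period Period;
        g_secondsPerTick.store(double(Period::num) / double(Period::den),
                               std::memory_order_relaxed);
        // The release store publishes g_secondsPerTick before the interval
        // becomes visible as nonzero, like MemoryBarrier() in Flush.
        interval = Period::den / Period::num;
        g_reportIntervalTicks.store(interval, std::memory_order_release);
    }
    return interval;
}
```

<p>The structure is the same as in <code>Flush</code>: the interval doubles as the "initialized" flag, and it is written last.</p>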
<p>I measured the overhead introduced by the <code>API_PROFILER</code> macro on two processors: <strong>99 ns</strong> on a 1.86 GHz Core 2 Duo, and <strong>30.8 ns</strong> on a 2.66 GHz Xeon. That's just a little slower than an <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">uncontended Windows Critical Section</a>, making this a pretty good technique for fine-grained profiling. You could reduce the overhead further by calling <code><a href="http://msdn.microsoft.com/en-us/library/twchhe95%28v=vs.80%29.aspx">__rdtsc</a></code> instead of <code>QueryPerformanceCounter</code>, but the resulting numbers would be <a href="http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx">less reliable on multicore systems</a>, so I chose not to mess with that.</p>
<p>Built-in profiling modules are nothing new &mdash; Jeff Everett describes another in-game profiler in <a href="http://www.amazon.com/Game-Programming-Gems-CD-Vol/dp/1584500549">Game Programming Gems 2</a>. Hopefully, I've at least presented a few twists on the idea. I'd be interested to hear about any twists of your own. As far as I know, no third-party profiler is capable of profiling a multithreaded API as easily &#038; accurately as the method I've described here &mdash; whether it's <a href="http://valgrind.org/">Valgrind</a>, <a href="http://blogs.msdn.com/b/pigscanfly/archive/2008/03/02/using-the-windows-sample-profiler-with-xperf.aspx">xperf</a>, <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">VTune</a>, <a href="http://developer.apple.com/technologies/tools/">Shark</a>, <a href="http://msdn.microsoft.com/en-us/library/ee417062%28v=VS.85%29.aspx">PIX</a>, <a href="http://www.snsys.com/ps3/prodg.asp#tuner">Tuner</a>, <a href="http://msdn.microsoft.com/en-us/magazine/cc337887.aspx">Visual Studio Profiler</a>, or any other. Readers, correct me if I'm wrong!</p>
<p>Such profilers can, on the other hand, show you when a particular module becomes heavy &mdash; the module's internal functions will appear near the top of <a href="http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29#Statistical_profilers">PC sampling</a> summaries, for example. Sometimes, even <a href="http://preshing.com/20110723/finding-bottlenecks-by-random-breaking">random breaking</a> offers a similar clue. At that point, you might be compelled to use a built-in profiling module like this one, to drill deeper and to measure the impact of subsequent code changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Always Use a Lightweight Mutex</title>
		<link>http://preshing.com/20111124/always-use-a-lightweight-mutex</link>
		<comments>http://preshing.com/20111124/always-use-a-lightweight-mutex#comments</comments>
		<pubDate>Thu, 24 Nov 2011 14:34:15 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2248</guid>
		<description><![CDATA[In multithreaded programming, we often speak of locks (also known as mutexes). But a lock is only a concept. To actually use that concept, you need an implementation. As it turns out, there are many ways to implement a lock, &#8230; <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In multithreaded programming, we often speak of <a href="http://en.wikipedia.org/wiki/Lock_(computer_science)">locks</a> (also known as mutexes). But a lock is only a concept. To actually <em>use</em> that concept, you need an implementation. As it turns out, there are many ways to implement a lock, and those implementations vary wildly in performance.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/lightweight-mutex.png" alt="" title="" width="120" height="92" class="alignleft size-full wp-image-2542" />The Windows SDK provides two lock implementations for C/C++: the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684266%28v=vs.85%29.aspx">Mutex</a> and the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms682530%28v=vs.85%29.aspx">Critical Section</a>. (As Ned Batchelder <a href="http://nedbatchelder.com/blog/200304/mutexes_and_critical_sections.html">points out</a>, <em>Critical Section</em> is probably not the best name to give to the lock itself, but we&#8217;ll forgive that here.)</p>
<p>The Windows Critical Section is what we call a <strong>lightweight mutex</strong>. It&#8217;s optimized for the case when there are no other threads competing for the lock. To demonstrate using a simple example, here&#8217;s a single thread which locks and unlocks a Windows Mutex exactly one million times.</p>
<pre>
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
CloseHandle(mutex);
</pre>
<p><span id="more-2248"></span>Here's the same experiment using a Windows Critical Section.</p>
<pre>
CRITICAL_SECTION critSec;
InitializeCriticalSection(&#038;critSec);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&#038;critSec);
    LeaveCriticalSection(&#038;critSec);
}
DeleteCriticalSection(&#038;critSec);
</pre>
<p>If you insert some timing code around the loop, and divide the result by one million, you'll find the average time required for a pair of lock/unlock operations in both cases. I did that, and ran the experiment on two different processors. The results:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/mutex-vs-critical-section.png" alt="" title="" width="508" height="80" class="aligncenter size-full wp-image-2322" /></p>
<p>The Critical Section is <strong>25 times</strong> faster. As <a href="http://blogs.msdn.com/b/larryosterman/archive/2005/08/24/455741.aspx">Larry Osterman explains</a>, the Windows Mutex enters the kernel every time you use it, while the Critical Section does not. The tradeoff is that you can't share a Critical Section between processes. But who cares? Most of the time, you just want to protect some data within a single process. (It is actually possible to share a lightweight mutex between processes - just not using a Critical Section.)</p>
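<p>The measurement itself (time the loop, then divide by the iteration count) is easy to reproduce on any platform. Here is a rough portable sketch that assumes <code>std::mutex</code> and <code>std::chrono::steady_clock</code> as stand-ins; it does not use the Windows Mutex or Critical Section, so its numbers will not match the ones above, and <code>averageLockUnlockNs</code> is a hypothetical name.</p>

```cpp
#include <chrono>
#include <mutex>

// Returns the average cost, in nanoseconds, of one uncontended
// lock/unlock pair on a std::mutex over `iterations` repetitions.
double averageLockUnlockNs(int iterations)
{
    std::mutex mutex;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        mutex.lock();
        mutex.unlock();
    }
    auto end = std::chrono::steady_clock::now();

    // Convert the elapsed time to fractional nanoseconds, then average.
    std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / iterations;
}
```

<p>Calling <code>averageLockUnlockNs(1000000)</code> mirrors the one-million-iteration experiment; the result depends on the standard library's mutex implementation and the machine.</p>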
<p>Now, suppose you have a thread which acquires a Critical Section 100000 times per second, and there are no other threads competing for the lock. Based on the above figures, you can expect to pay between 0.2% and 0.6% in lock overhead. Not too bad! At lower frequencies, the overhead becomes negligible. I'm ignoring the hidden cost of synchronizing the processor's cache, which is something I'll write about in a future post, but it doesn't make a big difference.</p>
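<p>The arithmetic behind that estimate is a single multiplication. In the sketch below, the 20 ns and 60 ns inputs mentioned afterwards are my own back-calculation from the 0.2% and 0.6% figures, not measurements quoted from the table, and <code>lockOverheadPercent</code> is a hypothetical helper.</p>

```cpp
// Percentage of each second spent on lock overhead, given the cost of one
// lock/unlock pair in nanoseconds and the number of pairs per second.
double lockOverheadPercent(double nsPerLockUnlock, double locksPerSecond)
{
    double busySeconds = nsPerLockUnlock * 1e-9 * locksPerSecond;
    return busySeconds * 100.0;
}
```

<p>At 100000 acquisitions per second, a 20 ns pair costs about 0.2% and a 60 ns pair about 0.6%. At 15000 per second, the same 60 ns pair drops to roughly 0.09%.</p>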
<h2>Other Platforms</h2>
<p>In MacOS 10.6.6, a lock implementation is provided using the <a href="http://en.wikipedia.org/wiki/POSIX_Threads">POSIX Threads</a> API. It's a lightweight mutex which doesn't enter the kernel unless there's contention. A pair of uncontended calls to <code>pthread_mutex_lock</code> and <code>pthread_mutex_unlock</code> takes about <strong>92 ns</strong> on my 1.86 GHz Core 2 Duo. Interestingly, it detects when there's only one thread running, and in that case switches to a trivial codepath taking only 38 ns.</p>
<p>MacOS also offers <code><a href="http://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSLock_Class/Reference/Reference.html">NSLock</a></code>, an Objective-C class, but this is really just a wrapper around the aforementioned POSIX mutex. Because each operation must wind its way through <code>objc_msgSend</code>, the overhead is a little higher: <strong>155 ns</strong> on my Core 2 Duo, or 98 ns if there's only a single thread.</p>
<p>Naturally, Ubuntu 11.10 provides a lock implementation using the POSIX Threads API as well. It's another lightweight mutex, based on a Linux-specific construct known as a <a href="http://en.wikipedia.org/wiki/Futex">futex</a>. A pair of <code>pthread_mutex_lock</code>/<code>pthread_mutex_unlock</code> calls takes about <strong>66 ns</strong> on my Core 2 Duo. You can even share this implementation between processes, but I didn't test that.</p>
<p>Even the Playstation 3 SDK offers a choice between a lightweight mutex and a heavy one. Back in 2007, early in the development of a Playstation 3 game I worked on, we were using the heavy mutex. Switching to the lightweight mutex made the game start <strong>17</strong> seconds faster! For me, that's when the difference really hit home.</p>
<p>In my previous post, I <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">argued against the misconception that locks are slow</a> and provided some data to support the argument. At this point, it should be clear that if you aren't using a lightweight mutex, the entire argument goes out the window. I'm fairly sure that the existence of heavy lock implementations has only added to this misconception over the years.</p>
<p>Some of you old-timers may point out ancient platforms where a heavy lock was the only implementation available, or when a <a href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29">semaphore</a> had to be used for the job. But it seems all modern platforms offer a lightweight mutex. And even if they didn't, you could write your own lightweight mutex at the application level, even sharing it between processes, provided you're willing to live with certain caveats. You'll find one example in my followup post, <a href="http://preshing.com/20120226/roll-your-own-lightweight-mutex">Roll Your Own Lightweight Mutex</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111124/always-use-a-lightweight-mutex/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Locks Aren&#8217;t Slow; Lock Contention Is</title>
		<link>http://preshing.com/20111118/locks-arent-slow-lock-contention-is</link>
		<comments>http://preshing.com/20111118/locks-arent-slow-lock-contention-is#comments</comments>
		<pubDate>Fri, 18 Nov 2011 13:46:35 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2159</guid>
		<description><![CDATA[Locks (also known as mutexes) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon wrote, &#8220;Most people have the misconception that locks are slow.&#8221; 25 years later, this misconception still seems to &#8230; <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Lock_(computer_science)">Locks</a> (also known as <strong>mutexes</strong>) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon <a href="http://groups.google.com/group/net.micro.mac/msg/752d18de371bd65c?dmode=source">wrote</a>, &#8220;Most people have the misconception that locks are slow.&#8221; 25 years later, this misconception still seems to <a href="http://www.cs.washington.edu/education/courses/cse451/03wi/section/prodcons.htm">pop up</a> once in a while.</p>
<p>It&#8217;s true that locking is slow on some platforms, or when the lock is highly contended. And when you&#8217;re developing a multithreaded application, it&#8217;s very common to find a huge performance bottleneck caused by a single lock. But that doesn&#8217;t mean all locks are slow. As I&#8217;ll show in this post, sometimes a locking strategy achieves excellent performance.</p>
<p>Perhaps the most easily-overlooked source of this misconception: Not all programmers may be aware of the difference between a lightweight mutex and a &#8220;kernel mutex&#8221;. I&#8217;ll talk about that in my next post, <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">Always Use a Lightweight Mutex</a>. For now, let&#8217;s just say that if you&#8217;re programming in C/C++ on Windows, the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms682530.aspx">Critical Section</a> object is the one you want.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/lock-competition-thumbnail.png" alt="" title="" width="154" height="95" class="alignright size-full wp-image-2539" />Other times, the conclusion that locks are slow is supported by a benchmark. For example, <a href="http://ridiculousfish.com/blog/posts/barrier.html">this post</a> measures the performance of a lock under heavy conditions: each thread must hold the lock to do any work (high contention), and the lock is held for an extremely short interval of time (high frequency). It&#8217;s a good read, but in a real application, you generally want to avoid using locks in that way. To put things in context, I&#8217;ve devised a benchmark which includes both best-case and worst-case usage scenarios for locks.</p>
<p><span id="more-2159"></span>Locks may be frowned upon for other reasons. There&#8217;s a whole other family of techniques out there known as lock-free (or <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx">lockless</a>) programming. Lock-free programming is extremely challenging, but delivers huge performance gains in a lot of real-world scenarios. I know programmers who spent days, even weeks fine-tuning a lock-free algorithm, subjecting it to a battery of tests, only to discover hidden timing bugs several months later. The combination of danger and reward can be very enticing to a certain kind of programmer &#8212; and this includes me, as you&#8217;ll see in future posts! With lock-free techniques beckoning us to use them, locks can begin to feel boring, slow and busted.</p>
<p>But don&#8217;t disregard locks yet. One good example of a place where locks perform admirably, in real software, is when protecting the memory allocator. <a href="http://g.oswego.edu/dl/html/malloc.html">Doug Lea&#8217;s Malloc</a> is a popular memory allocator in video game development, but it&#8217;s single threaded, so we need to protect it using a lock. During gameplay, it&#8217;s not uncommon to see multiple threads hammering the memory allocator, say around 15000 times per second. While loading, this figure can climb to 100000 times per second or more. It&#8217;s not a big problem, though. As you&#8217;ll see, locks handle the workload like a champ.</p>
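<p>As a rough illustration of what protecting an allocator with a lock means in code, here is a minimal portable sketch. It assumes <code>std::malloc</code> and <code>std::free</code> as stand-ins for a single-threaded allocator, and <code>std::mutex</code> in place of a Critical Section; <code>LockedAllocator</code> and its members are hypothetical names.</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical wrapper: std::malloc and std::free stand in for a
// single-threaded allocator such as dlmalloc.
class LockedAllocator
{
    std::mutex m_mutex;
    std::size_t m_liveBlocks = 0;   // shared state guarded by m_mutex

public:
    void *allocate(std::size_t bytes)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        void *mem = std::malloc(bytes);
        if (mem)
            m_liveBlocks++;
        return mem;
    }

    void release(void *mem)
    {
        if (!mem)
            return;
        std::lock_guard<std::mutex> guard(m_mutex);
        std::free(mem);
        m_liveBlocks--;
    }

    std::size_t liveBlocks()
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_liveBlocks;
    }
};
```

<p>Each call holds the lock only for the duration of one allocator operation, which keeps the hold time short relative to the work done between calls.</p>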
<h2>Lock Contention Benchmark</h2>
<p>In this test, we spawn a thread which generates random numbers, using a custom <a href="http://en.wikipedia.org/wiki/Mersenne_twister">Mersenne Twister</a> implementation. Every once in a while, it acquires and releases a lock. The lengths of time between acquiring and releasing the lock are random, but they tend towards average values which we decide ahead of time. For example, suppose we want to acquire the lock 15000 times per second, and keep it held 50% of the time. Here&#8217;s what part of the timeline would look like. Red means the lock is held, grey means it&#8217;s released:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/single-thread-timeline.png" alt="" title="" width="466" height="16" class="aligncenter size-full wp-image-2241" /></p>
<p>This is essentially a Poisson process. If we know the average amount of time to generate a single random number &#8212; <strong>6.349 ns</strong> on a 2.66 GHz quad-core Xeon &#8212; we can measure time in <em>work units</em>, rather than seconds. We can then use the technique described in my previous post, <a href="http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process">How to Generate Random Timings for a Poisson Process</a>, to decide how many work units to perform between acquiring and releasing the lock. Here&#8217;s the implementation in C++. I&#8217;ve left out a few details, but if you like, you can download the complete source code <a href="http://preshing.com/files/LockCompetition.zip">here</a>.</p>
<div class="cpp"><pre class="de1">QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>start<span class="br0">&#41;</span><span class="sy4">;</span>
<span class="kw1">for</span> <span class="br0">&#40;</span><span class="sy4">;;</span><span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    <span class="co1">// Do some work without holding the lock</span>
    workunits <span class="sy1">=</span> <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> <span class="br0">&#40;</span>random.<span class="me1">poissonInterval</span><span class="br0">&#40;</span>averageUnlockedCount<span class="br0">&#41;</span> <span class="sy2">+</span> <span class="nu17">0.5f</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span> i <span class="sy1">&lt;</span> workunits<span class="sy4">;</span> i<span class="sy2">++</span><span class="br0">&#41;</span>
        random.<span class="me1">integer</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>       <span class="co1">// Do one work unit</span>
    workDone <span class="sy2">+</span><span class="sy1">=</span> workunits<span class="sy4">;</span>
&nbsp;
    QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
    elapsedTime <span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> start.<span class="me1">QuadPart</span><span class="br0">&#41;</span> <span class="sy2">*</span> ooFreq<span class="sy4">;</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>elapsedTime <span class="sy1">&gt;=</span> timeLimit<span class="br0">&#41;</span>
        <span class="kw1">break</span><span class="sy4">;</span>
&nbsp;
    <span class="co1">// Do some work while holding the lock</span>
    EnterCriticalSection<span class="br0">&#40;</span><span class="sy3">&amp;</span>criticalSection<span class="br0">&#41;</span><span class="sy4">;</span>
    workunits <span class="sy1">=</span> <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> <span class="br0">&#40;</span>random.<span class="me1">poissonInterval</span><span class="br0">&#40;</span>averageLockedCount<span class="br0">&#41;</span> <span class="sy2">+</span> <span class="nu17">0.5f</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span> i <span class="sy1">&lt;</span> workunits<span class="sy4">;</span> i<span class="sy2">++</span><span class="br0">&#41;</span>
        random.<span class="me1">integer</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>       <span class="co1">// Do one work unit</span>
    workDone <span class="sy2">+</span><span class="sy1">=</span> workunits<span class="sy4">;</span>
    LeaveCriticalSection<span class="br0">&#40;</span><span class="sy3">&amp;</span>criticalSection<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
    elapsedTime <span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> start.<span class="me1">QuadPart</span><span class="br0">&#41;</span> <span class="sy2">*</span> ooFreq<span class="sy4">;</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>elapsedTime <span class="sy1">&gt;=</span> timeLimit<span class="br0">&#41;</span>
        <span class="kw1">break</span><span class="sy4">;</span>
<span class="br0">&#125;</span></pre></div>
<p>Now suppose we launch two such threads, each running on a different core. Each thread will hold the lock during 50% <em>of the time when it can perform work</em>, but if one thread tries to acquire the lock while the other thread is holding it, it will be forced to wait. This is known as <strong>lock contention</strong>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/double-thread-timeline.png" alt="" title="" width="473" height="40" class="aligncenter size-full wp-image-2239" /></p>
<p>In my opinion, this is a pretty good simulation of the way a lock might be used in a real application. When we run the above scenario, we find that each thread spends roughly 25% of its time waiting, and 75% of its time doing actual work. Together, both threads achieve a net performance of <strong>1.5x</strong> compared to the single-threaded case.</p>
<p>I ran several variations of the test on a 2.66 GHz quad-core Xeon, from 1 thread, 2 threads, all the way up to 4 threads, each running on its own core. I also varied the duration of the lock, from the trivial case where the lock is never held, all the way up to the maximum where each thread must hold the lock for 100% of its workload. In all cases, the lock frequency remained constant: threads acquired the lock 15000 times for each second of work performed.</p>
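<p>To make those settings a little more concrete, here&#8217;s a rough Python sketch of how a lock frequency and a lock duration could be translated into average work-unit counts. It assumes the 6.349 ns per work unit measured earlier; the helper name and its parameters are hypothetical and are not taken from the downloadable source.</p>

```python
NS_PER_WORK_UNIT = 6.349  # average cost of one random number, as measured earlier


def work_unit_counts(locks_per_second, lock_fraction, ns_per_unit=NS_PER_WORK_UNIT):
    """Return (average_unlocked_count, average_locked_count) in work units.

    Hypothetical helper: locks_per_second is counted per second of work
    performed, and lock_fraction is the share of that work done under the lock.
    """
    units_per_cycle = 1e9 / locks_per_second / ns_per_unit
    locked = units_per_cycle * lock_fraction
    return units_per_cycle - locked, locked


unlocked, locked = work_unit_counts(15000, 0.5)
print(round(unlocked), round(locked))
```

<p>With 15000 locks per second and a 50% lock duration, each average comes out at roughly 5250 work units.</p>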
<p><img src="http://preshing.com/wp-content/uploads/2011/11/thread-parallelism.png" alt="" title="" width="440" height="274" class="aligncenter size-full wp-image-2236" /></p>
<p>The results were interesting. For short lock durations, up to say 10%, the system achieved very high parallelism. Not perfect parallelism, but close. Locks are fast!</p>
<p>To put the results in perspective, I analyzed the memory allocator lock in a multithreaded game engine. During gameplay, with 15000 locks per second coming from 3 threads, the lock duration was in the neighborhood of just <strong>2%</strong>. That&#8217;s well within the comfort zone on the left side of the diagram.</p>
<p>These results also show that once the lock duration passes 90%, there&#8217;s no point using multiple threads anymore. A single thread performs better. Most surprising is the way the performance of 4 threads drops off a cliff around the 60% mark! This looked like an anomaly, so I re-ran the tests several additional times, even trying a different testing order. The same behavior happened consistently. My best hypothesis is that the experiment hits some kind of snag in the Windows scheduler, but I didn&#8217;t investigate further.</p>
<h2>Lock Frequency Benchmark</h2>
<p>Even a lightweight mutex has overhead. As my <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">next post</a> shows, a pair of lock/unlock operations on a Windows Critical Section takes about <strong>23.5 ns</strong> on the CPU used in these tests. Therefore, 15000 locks per second is low enough that lock overhead does not significantly impact the results. But what happens as we turn up the dial on lock frequency?</p>
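<p>As a back-of-the-envelope check of that claim, the following Python sketch multiplies the lock frequency by the cost of one lock/unlock pair. It assumes the uncontended 23.5 ns figure applies to every acquisition and ignores cache effects and contention; the function name is made up for illustration.</p>

```python
LOCK_PAIR_NS = 23.5  # cost of one uncontended lock/unlock pair, in nanoseconds


def lock_overhead_fraction(locks_per_second, pair_ns=LOCK_PAIR_NS):
    """Fraction of each second of work spent purely on lock/unlock calls."""
    return locks_per_second * pair_ns * 1e-9


fraction = lock_overhead_fraction(15000)
print(fraction * 100.0, "percent of each second")
```

<p>At 15000 locks per second this works out to about 0.035% of the time, which is why the overhead can be neglected at that frequency.</p>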
<p>The algorithm offers very fine control over the amount of work performed between one lock and the next, so I performed a new batch of tests using smaller amounts: from a very fine-grained 10 ns between locks, all the way up to 31 &mu;s, which corresponds to roughly 32000 acquires per second. Each test used exactly two threads:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/two-thread-granularities.png" alt="" title="" width="473" height="185" class="aligncenter size-full wp-image-2283" /></p>
<p>As you might expect, for very high lock frequencies, the overhead of the lock itself begins to dwarf the actual work being done. Several benchmarks you&#8217;ll find online, including the one linked earlier, fall into the bottom-right corner of this chart. At such frequencies, you&#8217;re talking about some seriously short lock times &#8212; on the scale of a few CPU instructions. The good news is that, when the work between locks is that simple, a lock-free implementation is more likely to be feasible.</p>
<p>At the same time, the results show that locking up to 320000 times per second (3.1 &mu;s between successive locks) is not unreasonable. In game development, the memory allocator may flirt with this frequency during load times. You can still achieve more than 1.5x parallelism if the lock duration is short.</p>
<p>We&#8217;ve now seen a wide spectrum of lock performance: cases where it performs great, and cases where the application slows to a crawl. I&#8217;ve argued that the lock around the memory allocator in a game engine will often achieve excellent performance. Given this example from the real world, it cannot be said that <em>all</em> locks are slow. Admittedly, it&#8217;s very easy to abuse locks, but one shouldn&#8217;t live in too much fear &#8212; any resulting bottlenecks will show up during careful profiling. When you consider how reliable locks are, and the relative ease of understanding them (compared to lock-free techniques), locks are actually pretty awesome sometimes.</p>
<p>The goal of this post was to give locks a little respect where deserved &#8212; corrections are welcome. I also realize that locks are used in a wide variety of industries and applications, and it may not always be so easy to strike a good balance in lock performance. If you&#8217;ve found that to be the case in your own experience, I would love to hear from you in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111118/locks-arent-slow-lock-contention-is/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How to Generate Random Timings for a Poisson Process</title>
		<link>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process</link>
		<comments>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process#comments</comments>
		<pubDate>Fri, 07 Oct 2011 06:01:09 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1948</guid>
		<description><![CDATA[What&#8217;s a Poisson process, and how is it useful? Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process. For &#8230; <a href="http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What&#8217;s a Poisson process, and how is it useful?</p>
<p>Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process.</p>
<p>For example, the <a href="http://earthquake.usgs.gov/earthquakes/eqarchives/year/eqstats.php">USGS</a> estimates that each year, there are approximately 13000 earthquakes of magnitude 4+ around the world. Those earthquakes are scattered randomly throughout the year, but there are more or less 13000 per year. That&#8217;s one example of a Poisson process. The <a href="http://en.wikipedia.org/wiki/Poisson_process#Examples">Wikipedia page</a> lists several others.</p>
<p>In statistics, there are a bunch of functions and equations to help model a Poisson process. I&#8217;ll present one of those functions in this post, and demonstrate its use in writing a simulation. </p>
<h2>The Exponential Distribution</h2>
<p>If 13000 such earthquakes happen every year, it means that, on average, one earthquake happens every 40 minutes. So, let&#8217;s define a variable &lambda; = <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' /> and call it the <em>rate parameter</em>. The rate parameter &lambda; is a measure of frequency: the average rate of events (in this case, earthquakes) per unit of time (in this case, minutes).</p>
<p><span id="more-1948"></span>Knowing this, we can ask questions like, what is the probability that an earthquake will happen within the next minute? What&#8217;s the probability within the next 10 minutes? There&#8217;s a well-known function to answer such questions. It&#8217;s called the <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> for the <a href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>, and it looks like this:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=F%28x%29+%3D+1+-+e%5E%7B-%5Clambda+x%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='F(x) = 1 - e^{-&#92;lambda x}' title='F(x) = 1 - e^{-&#92;lambda x}' class='latex' /></center></p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/exponential-curve.png" alt="" title="" width="425" height="227" class="aligncenter size-full wp-image-2104" /></p>
<p>Basically, the more time passes, the more likely it is that, somewhere in the world, an earthquake will occur. The word &#8220;exponential&#8221;, in this context, actually refers to <a href="http://en.wikipedia.org/wiki/Exponential_decay">exponential decay</a>. As time passes, the probability of having <em>no</em> earthquake decays towards zero &#8212; and correspondingly, the probability of having at least one earthquake increases towards one.</p>
<p>Plugging in a few values, we find that:</p>
<ul>
<li>The probability of having an earthquake within the next minute is <img src='http://s0.wp.com/latex.php?latex=F%281%29+%5Capprox+0.0247&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(1) &#92;approx 0.0247' title='F(1) &#92;approx 0.0247' class='latex' />. This value is pretty close to <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' />, our prescribed earthquake frequency, but it&#8217;s not equal.</li>
<li>The probability of having an earthquake within the next 10 minutes is <img src='http://s0.wp.com/latex.php?latex=F%2810%29+%5Capprox+0.221&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(10) &#92;approx 0.221' title='F(10) &#92;approx 0.221' class='latex' />.</li>
</ul>
<p>In particular, note that after 40 minutes, the prescribed average time between earthquakes, the probability is only <img src='http://s0.wp.com/latex.php?latex=F%2840%29+%5Capprox+0.632&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(40) &#92;approx 0.632' title='F(40) &#92;approx 0.632' class='latex' />. So, given any 40-minute interval of time, it&#8217;s pretty likely that we&#8217;ll have an earthquake within that time interval, but it won&#8217;t always happen.</p>
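<p>These values are easy to reproduce. Here&#8217;s a small Python sketch that evaluates the cumulative distribution function for the three intervals discussed above, along with the average gap implied by 13000 earthquakes per year. The function name is an illustrative choice.</p>

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60
RATE = 1 / 40.0  # the rate parameter, in earthquakes per minute


def exponential_cdf(x, rate=RATE):
    """Probability that at least one event occurs within x units of time."""
    return 1.0 - math.exp(-rate * x)


print(MINUTES_PER_YEAR / 13000.0)   # average minutes between earthquakes
for minutes in (1, 10, 40):
    print(minutes, round(exponential_cdf(minutes), 4))
```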
<h2>Writing a Simulation</h2>
<p>Now, suppose we want to simulate the occurrence of earthquakes in a game engine, or some other kind of program. First, we need to figure out when each earthquake should begin.</p>
<p>One approach is to loop, and after each interval of X minutes, sample a random floating-point value between 0 and 1. If this number is less than <img src='http://s0.wp.com/latex.php?latex=F%28X%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(X)' title='F(X)' class='latex' />, then start an earthquake! X could even be a fractional value, so you could sample several times per minute, or even several times per second. This approach will probably work just fine, as long as your random number generator is uniform and offers enough numerical precision. However, if you intend to sample 60 times per second, with &lambda; = <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' />, you&#8217;ll need at least 18 bits of precision from the random number generator, which the Standard C Runtime Library doesn&#8217;t always offer.</p>
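<p>Here&#8217;s a minimal Python sketch of that fixed-step approach, together with the precision estimate. It is written for Python 3, and the function names are made up for illustration. The bit count is simply the number of bits needed before a uniform random value can resolve a probability as small as F(X) at 60 samples per second.</p>

```python
import math
import random


def event_probability(step_minutes, rate):
    """F(X): chance of at least one event during one step of the loop."""
    return -math.expm1(-rate * step_minutes)


def count_events(total_minutes, step_minutes, rate, rng):
    """Fixed-step sampling: draw one uniform number per step, compare it to F(X)."""
    threshold = event_probability(step_minutes, rate)
    steps = int(round(total_minutes / step_minutes))
    return sum(1 for _ in range(steps) if threshold > rng.random())


rate = 1 / 40.0
per_frame = event_probability(1 / 3600.0, rate)     # 60 samples per second
bits_needed = math.ceil(math.log2(1 / per_frame))   # bits needed to resolve F(X)
print(per_frame, bits_needed)
print(count_events(40 * 1000, 1.0, rate, random.Random(1)))
```

<p>With one-minute steps, the count over 40000 simulated minutes lands a little under 1000 on average, because at most one earthquake can start per step.</p>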
<p>Another approach is to sidestep the whole sampling strategy, and simply write a function to determine the exact amount of time until the next earthquake. This function should return random numbers, but not the uniform kind of random number produced by most generators. We want to generate random numbers in a way that follows our exponential distribution.</p>
<p>Donald Knuth describes a way to generate such values in section 3.4.1 (D) of <a href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming</a>. Simply choose a random point on the y-axis between 0 and 1, distributed uniformly, and locate the corresponding time value on the x-axis. For example, if we choose the point 0.2 from the top of the graph, the time until our next earthquake would be 64.38 minutes.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/inverse-lookup.png" alt="" title="" width="288" height="140" class="aligncenter size-full wp-image-2103" /></p>
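<p>Given that the inverse of the exponential function is ln, it&#8217;s pretty easy to write this analytically, where U is the random value between 0 and 1. Solving U = F(x) for x gives x = -ln(1 - U) / &lambda;, and because 1 - U is itself uniformly distributed between 0 and 1, it can be replaced by U:</p>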
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Cmathrm%7BnextTime%7D+%3D+%5Cdfrac%7B-%5Cln+U%7D%7B%5Clambda%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='&#92;mathrm{nextTime} = &#92;dfrac{-&#92;ln U}{&#92;lambda}' title='&#92;mathrm{nextTime} = &#92;dfrac{-&#92;ln U}{&#92;lambda}' class='latex' /></center></p>
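<p>As a quick sanity check of the formula, this Python sketch plugs in the value 0.2 from the graph example and confirms that the resulting time maps back to a cumulative probability of 0.8. The helper names are illustrative only.</p>

```python
import math


def time_from_uniform(u, rate):
    """Map a uniform value u in (0, 1] to a waiting time: -ln(u) / rate."""
    return -math.log(u) / rate


def exponential_cdf(x, rate):
    """Probability that at least one event occurs within x units of time."""
    return 1.0 - math.exp(-rate * x)


t = time_from_uniform(0.2, 1 / 40.0)
print(round(t, 2))                                # 64.38
print(round(exponential_cdf(t, 1 / 40.0), 6))     # 0.8
```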
<h2>The Implementation</h2>
<p>Here&#8217;s one way to implement it in Python. Note that you can&#8217;t pass zero to <code>math.log</code>, but we avoid that by subtracting the result of <code><a href="http://docs.python.org/library/random.html#random.random">random.random</a></code>, which is always less than one, from one.</p>
<pre>
import math
import random

def nextTime(rateParameter):
    return -math.log(1.0 - random.random()) / rateParameter
</pre>
<p><center></p>
<div style="border:1px solid #eeeeee;background-color:#fffff4;text-align:center;width:80%;padding-bottom:3px;"><strong>Update:</strong> After writing this post, I learned that Python has a standard library function which does exactly the same thing as <code>nextTime</code>. It&#8217;s called <code><a href="http://docs.python.org/library/random.html#random.expovariate">random.expovariate</a></code>.</div>
<p></center></p>
<p>Here are a few sample calls. The values look pretty reasonable:</p>
<pre>
>>> nextTime(1/40.0)
91.074923814190498
>>> nextTime(1/40.0)
46.88573030224817
>>> nextTime(1/40.0)
14.965086245136733
>>> nextTime(1/40.0)
26.902965535881194
</pre>
<p>Let&#8217;s run some tests to make sure that the average time returned by this function really is 40. The following expression calculates the average of one million calls, and the results are pretty consistent. I&#8217;m always amazed to see randomness behaving the way we want!</p>
<pre>
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
39.985564565743751
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
40.029018385760551
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
40.016843319423266
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
39.965097296560664
</pre>
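<p>The same check can be sketched for Python 3, where <code>xrange</code> no longer exists, using the standard library&#8217;s <code>random.expovariate</code>. The seed and the sample count below are arbitrary choices that only make the run repeatable.</p>

```python
import random

rng = random.Random(2011)   # arbitrary seed, only to make the run repeatable
samples = [rng.expovariate(1 / 40.0) for _ in range(200000)]
average = sum(samples) / len(samples)
print(average)
```

<p>The average should land close to 40, as in the runs above.</p>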
<p>Just for fun, here&#8217;s a series of points spaced according to the output of <code>nextTime</code>. This is basically what a Poisson process looks like when plotted along a timeline:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/timeline.png" alt="" title="timeline" width="485" height="11" class="aligncenter size-full wp-image-2024" /></p>
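<p>A timeline like this one can be produced by adding up successive intervals. Here&#8217;s a Python 3 sketch of that idea; it uses <code>random.expovariate</code> in place of <code>nextTime</code>, and the function name and seed are assumptions made for illustration.</p>

```python
import itertools
import random


def event_times(rate, count, rng):
    """Return count absolute event times by summing successive exponential gaps."""
    gaps = (rng.expovariate(rate) for _ in range(count))
    return list(itertools.accumulate(gaps))


times = event_times(1 / 40.0, 10, random.Random(7))
print([round(t, 1) for t in times])
```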
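<p>And here&#8217;s an implementation of <code>nextTime</code> in C, using the standard library&#8217;s random number generator, <code>rand</code>. Again, we&#8217;re careful not to pass zero to the logarithm: the division is done in double precision, so the quotient is always strictly less than one.</p>
<pre>
#include &lt;math.h>
#include &lt;stdlib.h>

float nextTime(float rateParameter)
{
    /* rand() / (RAND_MAX + 1.0) lies in [0, 1), so the argument to log() is in (0, 1].
       Adding 1.0 (a double) also avoids integer overflow when RAND_MAX == INT_MAX. */
    double u = rand() / (RAND_MAX + 1.0);
    return (float) (-log(1.0 - u) / rateParameter);
}
</pre>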
<p>This technique could have various applications in a game engine, such as spawning particles from a particle emitter, or choosing moments when an AI could take a decision. I also use it in my <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">next post</a>, to measure the performance of threads which hold a lock for various intervals of time.</p>
<p>Any stats experts out there? If I&#8217;ve abused any terminology, or if you see any way to improve this post, I&#8217;d be interested in your comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>High-Resolution Mandelbrot in Obfuscated Python</title>
		<link>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python</link>
		<comments>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python#comments</comments>
		<pubDate>Mon, 26 Sep 2011 10:23:17 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1846</guid>
		<description><![CDATA[Here&#8217;s a followup to last month&#8217;s post about Penrose Tiling in Obfuscated Python. The Mandelbrot set is a traditional favorite among authors of obfuscated code. You can find obfuscated code in C, Perl, Haskell, Python and many other languages. Nearly &#8230; <a href="http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a followup to last month&#8217;s post about <a href="http://preshing.com/20110822/penrose-tiling-in-obfuscated-python">Penrose Tiling in Obfuscated Python</a>.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a> is a traditional favorite among authors of obfuscated code. You can find obfuscated code in <a href="http://www.iwriteiam.nl/SigProgM.html">C</a>, <a href="http://www.maths.tcd.ie/~mkerrin/Programs/usr/others/mandelbrot">Perl</a>, <a href="http://snakelemma.blogspot.com/2009/08/mandelbrot-set-in-haskell.html">Haskell</a>, <a href="http://forums.thedailywtf.com/forums/p/5518/118328.aspx#118328">Python</a> and many other languages. Nearly all examples render the Mandelbrot set as ASCII art.</p>
<p>The following Python script, on the other hand, begins as ASCII art:</p>
<pre>
_                                      =   (
                                        255,
                                      lambda
                               V       ,B,c
                             :c   and Y(V*V+B,B,  c
                               -1)if(abs(V)&lt;6)else
               (              2+c-4*abs(V)**-0.4)/i
                 )  ;v,      x=1500,1000;C=range(v*x
                  );import  struct;P=struct.pack;M,\
            j  ='&lt;QIIHHHH',open('M.bmp','wb').write
for X in j('BM'+P(M,v*x*3+26,26,12,v,x,1,24))or C:
            i  ,Y=_;j(P('BBB',*(lambda T:(T*80+T**9
                  *i-950*T  **99,T*70-880*T**18+701*
                 T  **9     ,T*i**(1-T**45*2)))(sum(
               [              Y(0,(A%3/3.+X%v+(X/v+
                               A/3/3.-x/2)/1j)*2.5
                             /x   -2.7,i)**2 for  \
                               A       in C
                                      [:9]])
                                        /9)
                                       )   )
</pre>
<p><span id="more-1846"></span>It renders the Mandelbrot set as a full-color, anti-aliased, 1500&#215;1000 image. Click to enlarge:</p>
<p><a href="http://preshing.com/wp-content/uploads/2011/09/M.jpg"><img src="http://preshing.com/wp-content/uploads/2011/09/M-small.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1851" /></a></p>
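<p>For anyone trying to follow the script, the recursive lambda bound to <code>Y</code> appears to be an escape-time iteration that returns a smoothed value. Here&#8217;s a readable Python sketch of that reading. The function and parameter names are invented, and the constants (255 iterations, a bailout radius of 6, the 0.4 exponent) are copied from the listing above rather than from any documentation.</p>

```python
def escape_value(point, max_iter=255, bailout=6.0):
    """Iterate z = z*z + point and return a smoothed escape value.

    Returns 0.0 when the orbit is still inside the bailout radius after
    max_iter steps, which is treated as being inside the set.
    """
    z = 0j
    remaining = max_iter
    while bailout > abs(z):
        if remaining == 0:
            return 0.0
        z = z * z + point
        remaining -= 1
    # The fractional term smooths the bands between iteration counts.
    return (2 + remaining - 4 * abs(z) ** -0.4) / max_iter


print(escape_value(0j), escape_value(10 + 0j))
```

<p>In the script, this value is squared and averaged over nine sub-pixel samples before being turned into a color, which is where the anti-aliasing comes from.</p>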
<p>No third-party libraries are required &#8212; just pure Python. However, it will only run on Python 2.5 &#8211; 2.7; Python 3 is not supported. The output file is written to <code>M.bmp</code>, in Windows bitmap format.</p>
<p>It runs very slowly, taking about 18 minutes on my 1.86 GHz Core 2 Duo (or 9 minutes using <a href="http://pypy.org/">PyPy</a>). With some modifications, it&#8217;s possible to make this code run up to 20 times faster. However, doing so requires sacrificing either code size or image quality.</p>
<p>If you&#8217;re willing to leave the script running for a few hours, you can increase the image resolution on line 8. (Just make sure the width is divisible by 4.) The resulting detail is quite nice. Here are some 1:1 pixel excerpts from an image rendered at 7200&#215;4800:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/detail.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1861" /></p>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/detail2.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1862" /></p>
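<p>The divisible-by-4 rule mentioned above comes from the BMP format, which pads each pixel row to a multiple of four bytes. The script seems to write exactly three bytes per pixel with no padding, so the rows only line up when the width times three is already a multiple of four. A small Python sketch, with a made-up helper name, shows the difference:</p>

```python
def padded_row_size(width, bytes_per_pixel=3):
    """Bytes per pixel row in a BMP file: rows are padded up to a multiple of 4."""
    raw = width * bytes_per_pixel
    return -(-raw // 4) * 4   # round up to the next multiple of 4


for width in (1500, 1501, 7200):
    raw = width * 3
    print(width, raw, padded_row_size(width), padded_row_size(width) == raw)
```

<p>For widths of 1500 and 7200 the padded and unpadded row sizes agree, while a width of 1501 would need one byte of padding per row that the script never writes.</p>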
<p>The entire 7200&#215;4800 image is too large to share here, but it&#8217;s perfect for making prints. So that&#8217;s what I did! Notice the Python script superimposed in the lower-left corner. Is this the first poster to include its own source code?</p>
<p><a href="http://www.cafepress.com/preshing"><img src="http://preshing.com/wp-content/uploads/2011/09/poster-wall.jpg" alt="" title="" width="320" height="275" class="aligncenter size-full wp-image-1915" /></a></p>
<p>If this kind of thing gives you kicks, you can order your own print (or a coffee mug) at <a href="http://www.cafepress.com/preshing">CafePress</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python/feed</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
	</channel>
</rss>

