<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Preshing on Programming</title>
	<atom:link href="http://preshing.com/feed" rel="self" type="application/rss+xml" />
	<link>http://preshing.com</link>
	<description></description>
	<lastBuildDate>Sat, 11 Feb 2012 21:38:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>A Look Back at Single-Threaded CPU Performance</title>
		<link>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance</link>
		<comments>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance#comments</comments>
		<pubDate>Wed, 08 Feb 2012 11:28:47 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2558</guid>
		<description><![CDATA[Throughout the &#8217;80s and &#8217;90s, CPUs were able to run virtually any kind of software twice as fast every 18-20 months. The rate of change was incredible. Your 486SX-16 was almost obsolete by the time you got it through the &#8230; <a href="http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Throughout the &#8217;80s and &#8217;90s, CPUs were able to run virtually any kind of software twice as fast every 18-20 months. The rate of change was incredible. Your <a href="http://www.x86-guide.com/en/cpu/Intel-486SX-16-PGA-cpu-no24.html">486SX-16</a> was almost obsolete by the time you got it through the door. But eventually, at some point in the mid-2000s, progress slowed down considerably for single-threaded software, which was most software.</p>
<p>Perhaps the turning point came in May 2004, when Intel <a href="http://www.eetimes.com/electronics-news/4048847/Intel-cancels-Tejas-moves-to-dual-core-designs">canceled its latest single-core development effort</a> to focus on multicore designs. Later that year, Herb Sutter wrote his now-famous article, <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">The Free Lunch Is Over</a>. Not all software will run remarkably faster year-over-year anymore, he warned us. Concurrent software would continue its meteoric rise, but single-threaded software was about to get left in the dust.</p>
<p>So, what&#8217;s happened since 2004? Clearly, multicore computing has become mainstream. Everybody acknowledges that single-threaded CPU performance no longer increases as quickly as it previously did &#8212; but at what rate is it <em>actually</em> increasing?</p>
<p>It’s tough to find an answer. Bill Dally of NVIDIA threw out a few numbers in a recent <a href="http://mediasite.colostate.edu/Mediasite/SilverlightPlayer/Default.aspx?peid=22c9d4e9c8cf474a8f887157581c458a1d#">presentation</a>: He had predicted 19% per year, but says it&#8217;s turned out closer to 5%. Last year, Chuck Moore of AMD <a href="http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf">presented</a> this graph, suggesting that single-threaded CPU performance recently started going backwards:</p>
<p><a href="http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf"><img src="http://preshing.com/wp-content/uploads/2012/01/dally-slide.png" alt="" title="" width="361" height="224" class="aligncenter size-full wp-image-2559" /></a></p>
<p><span id="more-2558"></span>These figures aren&#8217;t really consistent, and both struck me as a little low. Moreover, I couldn&#8217;t find another source to corroborate them. So I decided to crunch the numbers myself. I turned to <a href="http://www.spec.org/">SPEC</a>, an industry-standard benchmark that&#8217;s been going strong since 1989. It&#8217;s the same benchmark used to plot a few data points on the above graph.</p>
<p>SPEC licenses their benchmarking software to various companies, collects results back from those licensees, and makes those results available on their website. One of their benchmark series, <a href="http://en.wikipedia.org/wiki/SPECint">SPECint</a>, was designed to measure the single-threaded integer performance of a machine. That sounds perfect, except for one catch: many licensees use <a href="http://en.wikipedia.org/wiki/Automatic_parallelization">automatic parallelization</a>. I took some pains to remove those results from the dataset. I&#8217;ll share the method at the end of this post, and you can let me know if you think it&#8217;s valid.</p>
<p>I fetched SPEC&#8217;s data on Feb. 7, grouped the results by CPU brand, and generated the following graph. It consists of 5052 test results from 715 different CPU models, all gathered over the last 17 years:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/integer-perf.png" alt="" title="" width="556" height="454" class="aligncenter size-full wp-image-2624" /></p>
<p>Each test result is plotted according to its hardware availability date, and the vertical axis uses a <a href="http://en.wikipedia.org/wiki/Logarithmic_scale">logarithmic scale</a>. The graph incorporates results from three different benchmark suites (CPU95, CPU2000 and CPU2006), but I&#8217;ve <a href="http://www.spec.org/fairuse.html#NormalizedHistoricalComparisons">normalized the results</a> in order to see historical trends.</p>
<p>The red line is meant to represent <strong>mainstream</strong> CPU performance. I drew it manually, using the less-than-scientific method of eyeballing the points for Pentium, PowerPC, Athlon and Core. If you&#8217;re willing to trust this line, it seems that in the eight years since January 2004, mainstream performance has increased by a factor of about <strong>4.6x</strong>, which works out to 21% per year. Compare that to the <strong>28x</strong> increase between 1996 and 2004! Things have really slowed down.</p>
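<p>For anyone who wants to check the per-year figure, here is a minimal sketch of the arithmetic. The function name <code>annualizedGrowth</code> is made up for this illustration, and treating 1996 to 2004 as exactly eight years is an assumption.</p>

```cpp
#include <cassert>
#include <cmath>

// Annualized growth rate implied by an overall speedup over a span of years:
//     rate = factor^(1 / years) - 1
// The result is a fraction, e.g. 0.21 means 21% per year.
double annualizedGrowth(double factor, double years)
{
    assert(factor > 0.0 && years > 0.0);
    return std::pow(factor, 1.0 / years) - 1.0;
}
```

<p>Calling <code>annualizedGrowth(4.6, 8.0)</code> gives roughly 0.21, matching the 21% above. Under the eight-year assumption, <code>annualizedGrowth(28.0, 8.0)</code> comes out to roughly 0.52 for the earlier period.</p>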
<p>Here are a few machines located along the red line in the graph:</p>
<table class="grid">
<tr>
<th>Hardware<br />Availability</th>
<th>Adjusted<br />Result</th>
<th>CPU Model</th>
<th>Clock<br />Rate</th>
<th>CPU Cache</th>
</tr>
<tr>
<td>Feb 2004</td>
<td>8.1</td>
<td><a href="http://www.spec.org/cpu2000/results/res2004q1/cpu2000-20040126-02769.html">Intel Pentium 4</a></td>
<td>3200 MHz</td>
<td>28KB L1, 1MB L2</td>
</tr>
<tr>
<td>Jun 2005</td>
<td>10.5</td>
<td><a href="http://www.spec.org/cpu2000/results/res2005q2/cpu2000-20050613-04262.html">AMD Athlon 64 FX-57</a></td>
<td>2800 MHz</td>
<td>128KB L1, 1MB L2</td>
</tr>
<tr>
<td>Jul 2006</td>
<td>11.4</td>
<td><a href="http://www.spec.org/cpu2000/results/res2006q3/cpu2000-20060904-07202.html">Intel Core 2 Duo E6300</a></td>
<td>1867 MHz</td>
<td>64KB L1, 2MB L2</td>
</tr>
<tr>
<td>Jul 2007</td>
<td>13.3</td>
<td><a href="http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070806-01732.html">Intel Core 2 Duo T7700</a></td>
<td>2400 MHz</td>
<td>64KB L1, 4MB L2</td>
</tr>
<tr>
<td>Sep 2008</td>
<td>17.9</td>
<td><a href="http://www.spec.org/cpu2006/results/res2008q3/cpu2006-20080902-05222.html">Intel Core 2 Duo T9600</a></td>
<td>2800 MHz</td>
<td>64KB L1, 6MB L2</td>
</tr>
<tr>
<td>May 2009</td>
<td>21.8</td>
<td><a href="http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090608-07726.html">Intel Core 2 Duo E7600</a></td>
<td>3066 MHz</td>
<td>64KB L1, 3MB L2</td>
</tr>
<tr>
<td>Jul 2010</td>
<td>24.3</td>
<td><a href="http://www.spec.org/cpu2006/results/res2010q3/cpu2006-20100812-12853.html">Intel Core i3-540</a></td>
<td>3067 MHz</td>
<td>64KB L1, 256KB L2, 4MB L3</td>
</tr>
<tr>
<td>Jun 2011</td>
<td>31.7</td>
<td><a href="http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111010-18687.html">Intel Pentium G850</a></td>
<td>2900 MHz</td>
<td>64KB L1, 256KB L2, 3MB L3</td>
</tr>
</table>
<p>As you can see, Intel deserves credit for squeezing out the most single-threaded performance since 2004. If you remove all Intel CPUs from the data, a different picture emerges:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/no-intel.png" alt="" title="" width="280" height="190" class="aligncenter size-full wp-image-2636" /></p>
<p>This is not too surprising, as AMD is <a href="http://blogs.amd.com/play/2011/10/13/our-take-on-amd-fx/">pretty open</a> about their stance on single-threaded performance. <a href="http://en.wikipedia.org/wiki/Bulldozer_(processor)">Bulldozer</a>, their latest microarchitecture, is meant to shine in multithreaded workloads.</p>
<p>So far we&#8217;ve only looked at integer performance. SPEC also publishes <a href="http://en.wikipedia.org/wiki/SPECfp">SPECfp</a>, an equivalent benchmark for floating-point performance. Floating-point performance has always been important for heavy-duty computation such as scientific simulation or 3D rendering. Here are the results, which I&#8217;ve also adjusted to eliminate autoparallelization:</p>
<p><img src="http://preshing.com/wp-content/uploads/2012/02/float-point-perf.png" alt="" title="" width="556" height="454" class="aligncenter size-full wp-image-2623" /></p>
<p>Prior to 2004, floating-point performance climbed even faster than integer performance, at 64% per year: a doubling period of 73 weeks. After that, it leveled off at the same 21% per year.</p>
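<p>The doubling period follows from the annual rate by a logarithm. The sketch below shows one way to compute it; <code>doublingPeriodWeeks</code> is a hypothetical helper, and the average year length of 365.25 days is an assumption.</p>

```cpp
#include <cassert>
#include <cmath>

// Time, in weeks, for performance to double at a constant annual growth rate.
// annualRate is a fraction, e.g. 0.64 for 64% per year.
double doublingPeriodWeeks(double annualRate)
{
    assert(annualRate > 0.0);
    const double weeksPerYear = 365.25 / 7.0;   // assumption: average year length
    // Solve (1 + rate)^years = 2 for years.
    double years = std::log(2.0) / std::log(1.0 + annualRate);
    return years * weeksPerYear;
}
```

<p>With <code>doublingPeriodWeeks(0.64)</code> the result is about 73 weeks. At 21% per year, the same formula gives a doubling period of roughly 3.6 years.</p>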
<p>Up until 2002, we see a huge difference in floating-point performance between mainstream and workstation CPUs. The Alpha, SPARC and MIPS all ran up to 8x faster. Of course, you had to pay $10,000 or more to get your hands on such a workstation. This is an interesting reminder that CPUs are, in fact, things created by businesses to make money! They don&#8217;t become faster purely through technological forces; they become faster through economic forces.</p>
<p>Which brings us back to the present day. For reasons which others understand better than me, involving <a href="http://en.wikipedia.org/wiki/Thermal_design_power">thermal design power</a> and <a href="http://en.wikipedia.org/wiki/Instruction_level_parallelism">ILP</a>, it&#8217;s now more cost-effective for manufacturers to pack additional cores onto a die than to push the single-threaded performance envelope much further.</p>
<p>Given the significance of this shift away from single-threaded performance, I was surprised to not find more information about the actual trajectory of performance since 2004. At the same time, I can&#8217;t guarantee that the data I&#8217;ve presented perfectly reflects single-threaded CPU performance. I think my conclusions are fair, but any feedback or criticism about the approach is more than welcome.</p>
<h2>How These Graphs Were Generated</h2>
<p>All Python scripts are <a href="https://github.com/preshing/analyze-spec-benchmarks">available on GitHub</a>. These scripts will download, analyze and adjust SPEC&#8217;s data, and render the graphs. If you&#8217;d like to run them yourself, see the README file for exact instructions.</p>
<p>As already mentioned, recent compilers like <a href="http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/">Intel C++</a> and <a href="http://www-01.ibm.com/software/awdtools/xlcpp/aix/features/?S_CMP=rnav">IBM XL</a> feature <a href="http://en.wikipedia.org/wiki/Automatic_parallelization">automatic parallelization</a>, and it greatly skews the results on certain benchmarks. For example, check out the performance of <code>462.libquantum</code> in <a href="http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20111219-19210.html">this</a> result! SPEC permits the use of autoparallelization as long as it&#8217;s clearly indicated. Unfortunately, this compiler feature is so widely enabled that I couldn&#8217;t simply exclude all such results. If I had done so, I would have been left with zero results for Intel&#8217;s Core i3, i5 and i7 processor families.</p>
<p>The compromise I chose was to identify the top six benchmarks which seem to benefit from automatic parallelization, disqualify those benchmarks from the test suite, and take the geometric mean of the remaining ones. This approach assumes that automatic parallelization does not work on every benchmark. For the list of disqualified benchmarks, and the algorithm which identifies them, check the files on GitHub.</p>
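<p>To make the adjustment concrete, here is a small sketch of a geometric mean that skips a set of disqualified benchmarks. It is not the code from the repository, which is written in Python; the function name, the container types and any benchmark names passed in are assumptions for illustration only.</p>

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <string>

// Geometric mean of per-benchmark ratios, skipping every benchmark whose name
// appears in `disqualified`. The mean is computed in log space.
double adjustedGeometricMean(const std::map<std::string, double>& ratios,
                             const std::set<std::string>& disqualified)
{
    double logSum = 0.0;
    int count = 0;
    for (const auto& entry : ratios)
    {
        if (disqualified.count(entry.first) != 0)
            continue;                   // benchmark is excluded from the suite
        assert(entry.second > 0.0);     // ratios must be positive
        logSum += std::log(entry.second);
        ++count;
    }
    assert(count > 0);                  // at least one benchmark must remain
    return std::exp(logSum / count);
}
```

<p>For example, with made-up ratios of 2 and 8 for two ordinary benchmarks and 1000 for a disqualified one, the adjusted result is 4, whereas including the outlier would pull the mean up to about 25.</p>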
<p>In the end, you&#8217;ll find that even if you leave the disqualified benchmarks in the results, it doesn&#8217;t significantly change the conclusions in this post. It shifts most of the CPU2006 results upwards &#8212; up to 25% &#8212; which simultaneously shifts the conversion ratios from CPU95 and CPU2000 upwards, keeping everything roughly in line.</p>
<p>In the future, it would be interesting for licensees to submit more results without automatic parallelization. That would help us more easily observe the performance trend of single CPU cores.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>A C++ Profiling Module for Multithreaded APIs</title>
		<link>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis</link>
		<comments>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis#comments</comments>
		<pubDate>Sat, 03 Dec 2011 23:14:59 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2378</guid>
		<description><![CDATA[In my post about lock contention, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling &#8230; <a href="http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In my post about <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">lock contention</a>, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling module, which I&#8217;ll share here.</p>
<p>A profiling module is different from conventional profilers like <a href="http://blogs.msdn.com/b/pigscanfly/archive/2008/03/02/using-the-windows-sample-profiler-with-xperf.aspx">xperf</a> or <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">VTune</a> in that no third-party tools are required. You just drop the module into any C++ application, and the process collects and reports performance data by itself.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/12/api-profiler.png" alt="" title="" width="157" height="127" class="alignright size-full wp-image-2545" />This particular profiling module is meant to act on one or more <em>target modules</em> in the application. A target module can be anything which exposes a well-defined <a href="http://en.wikipedia.org/wiki/Application_programming_interface">API</a>, such as a memory allocator. To make it work, you must insert a macro named <code>API_PROFILER</code> into every public function exposed by that API. Below, I&#8217;ve added it to <code>dlmalloc</code>, one of the functions in the <a href="http://g.oswego.edu/dl/html/malloc.html">Doug Lea Malloc</a> API. The same macro should be added to <code>dlrealloc</code>, <code>dlfree</code>, and other public functions as well.</p>
<pre>
DEFINE_API_PROFILER(dlmalloc);

void* dlmalloc(size_t bytes)
{
    <span class="highlight">API_PROFILER(dlmalloc);</span>

#if USE_LOCKS
    ensure_initialization();
#endif

    if (!PREACTION(gm))
    {
        void* mem;
        size_t nb;
        if (bytes &lt;= MAX_SMALL_REQUEST)
        {
            ...
</pre>
<p><span id="more-2378"></span>The macro takes a single argument, which is just an identifier for the target module being profiled. For this to be a valid identifier, you must place exactly one <code>DEFINE_API_PROFILER</code> macro at global scope, as seen above. You can also insert <code>DECLARE_API_PROFILER</code> anywhere at global scope, perhaps in a header file, in the same way that you'd forward declare a global variable or function.</p>
<p>When the application runs, each thread will automatically log performance statistics once per second, including the thread identifier (TID), time spent inside the target module, and the number of calls. Here, we see performance statistics across six different threads:</p>
<pre>
TID 0x13bc time spent in "dlmalloc": 7/1001 ms 0.7% 6481x
TID 0x1244 time spent in "dlmalloc": 6/1000 ms 0.6% 6166x
TID 0x198 time spent in "dlmalloc": 0/3072 ms 0.0% 2x
TID 0x11d0 time spent in "dlmalloc": 0/1113 ms 0.0% 6x
TID 0x12a4 time spent in "dlmalloc": 0/1000 ms 0.0% 20x
TID 0xc14 time spent in "dlmalloc": 4/1011 ms 0.4% 3243x
</pre>
<p>To identify each thread, simply break in the debugger and look for the TID in the Threads view.</p>
<p>Most of the profiling module is implemented in a single header file, as follows. For simplicity, I've only provided the Windows version, but you could easily port the code to other platforms.</p>
<div class="cpp"><pre class="de1"><span class="co2">#define ENABLE_API_PROFILER 1     // Comment this line to disable the profiler</span>
&nbsp;
<span class="co2">#if ENABLE_API_PROFILER</span>
&nbsp;
<span class="co1">//------------------------------------------------------------------</span>
<span class="co1">// A class for local variables created on the stack by the API_PROFILER macro:</span>
<span class="co1">//------------------------------------------------------------------</span>
<span class="kw2">class</span> APIProfiler
<span class="br0">&#123;</span>
<span class="kw2">public</span><span class="sy4">:</span>
    <span class="co1">//------------------------------------------------------------------</span>
    <span class="co1">// A structure for each thread to store information about an API:</span>
    <span class="co1">//------------------------------------------------------------------</span>
    <span class="kw4">struct</span> ThreadInfo
    <span class="br0">&#123;</span>
        INT64 lastReportTime<span class="sy4">;</span>
        INT64 accumulator<span class="sy4">;</span>   <span class="co1">// total time spent in target module since the last report</span>
        INT64 hitCount<span class="sy4">;</span>      <span class="co1">// number of times the target module was called since last report</span>
        <span class="kw4">const</span> <span class="kw4">char</span> <span class="sy2">*</span>name<span class="sy4">;</span>    <span class="co1">// the name of the target module</span>
    <span class="br0">&#125;</span><span class="sy4">;</span>
&nbsp;
<span class="kw2">private</span><span class="sy4">:</span>
    INT64 m_start<span class="sy4">;</span>
    ThreadInfo <span class="sy2">*</span>m_threadInfo<span class="sy4">;</span>
&nbsp;
    <span class="kw4">static</span> <span class="kw4">float</span> s_ooFrequency<span class="sy4">;</span>      <span class="co1">// 1.0 divided by QueryPerformanceFrequency()</span>
    <span class="kw4">static</span> INT64 s_reportInterval<span class="sy4">;</span>   <span class="co1">// length of time between reports</span>
    <span class="kw4">void</span> Flush<span class="br0">&#40;</span>INT64 end<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
<span class="kw2">public</span><span class="sy4">:</span>
    __forceinline APIProfiler<span class="br0">&#40;</span>ThreadInfo <span class="sy2">*</span>threadInfo<span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER start<span class="sy4">;</span>
        QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>start<span class="br0">&#41;</span><span class="sy4">;</span>
        m_start <span class="sy1">=</span> start.<span class="me1">QuadPart</span><span class="sy4">;</span>
        m_threadInfo <span class="sy1">=</span> threadInfo<span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    __forceinline ~APIProfiler<span class="br0">&#40;</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER end<span class="sy4">;</span>
        QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy2">+</span><span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> m_start<span class="br0">&#41;</span><span class="sy4">;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount<span class="sy2">++</span><span class="sy4">;</span>
        <span class="kw1">if</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">&gt;</span> s_reportInterval<span class="br0">&#41;</span>
            Flush<span class="br0">&#40;</span>end.<span class="me1">QuadPart</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
<span class="br0">&#125;</span><span class="sy4">;</span>
&nbsp;
<span class="co1">//----------------------</span>
<span class="co1">// Profiler is enabled</span>
<span class="co1">//----------------------</span>
<span class="co2">#define DECLARE_API_PROFILER(name) \
    extern __declspec(thread) APIProfiler::ThreadInfo __APIProfiler_##name;</span>
&nbsp;
<span class="co2">#define DEFINE_API_PROFILER(name) \
    __declspec(thread) APIProfiler::ThreadInfo __APIProfiler_##name = { 0, 0, 0, #name };</span>
&nbsp;
<span class="co2">#define TOKENPASTE2(x, y) x ## y</span>
<span class="co2">#define TOKENPASTE(x, y) TOKENPASTE2(x, y)</span>
<span class="co2">#define API_PROFILER(name) \
    APIProfiler TOKENPASTE(__APIProfiler_##name, __LINE__)(&amp;__APIProfiler_##name)</span>
&nbsp;
<span class="co2">#else</span>
&nbsp;
<span class="co1">//----------------------</span>
<span class="co1">// Profiler is disabled</span>
<span class="co1">//----------------------</span>
<span class="co2">#define DECLARE_API_PROFILER(name)</span>
<span class="co2">#define DEFINE_API_PROFILER(name)</span>
<span class="co2">#define API_PROFILER(name)</span>
&nbsp;
<span class="co2">#endif</span></pre></div>
<p>The <code>DEFINE_API_PROFILER</code> macro defines a thread-local variable using the <code><a href="http://msdn.microsoft.com/en-us/library/9w1sdazb%28v=vs.80%29.aspx">__declspec(thread)</a></code> modifier. This gives each thread its own private data, independent of other threads, so the whole system works in a multithreaded environment with little performance penalty. In GCC, the equivalent storage class modifier would be <code><a href="http://gcc.gnu.org/onlinedocs/gcc-3.3.1/gcc/Thread-Local.html">__thread</a></code>. The overhead for such storage is low, but on Windows, there's one catch: <a href="http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx">you can't use it across DLLs</a>.</p>
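<p>If your compiler supports C++11, the standard <code>thread_local</code> keyword is another way to get the same per-thread storage. The sketch below is only a demonstration of that behavior, not part of the profiling module; <code>Counter</code>, <code>t_counter</code> and <code>countOnNewThread</code> are hypothetical names, and C++11 support is an assumption.</p>

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread statistics, loosely analogous to ThreadInfo.
struct Counter { long long hitCount; };

// C++11 thread_local: every thread gets its own copy, initialized to zero.
thread_local Counter t_counter = { 0 };

// Records `calls` hits on a new thread and returns that thread's private total.
long long countOnNewThread(int calls)
{
    assert(calls >= 0);
    long long result = 0;
    std::thread worker([&result, calls]
    {
        for (int i = 0; i < calls; ++i)
            t_counter.hitCount++;       // touches only the worker's copy
        result = t_counter.hitCount;
    });
    worker.join();                      // after join, result is safe to read
    return result;
}
```

<p>Whatever value the calling thread has stored in <code>t_counter</code>, the worker starts from zero and the caller's copy is left untouched, which is the property the profiler relies on.</p>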
<p>The <code>API_PROFILER</code> macro creates a C++ object on the stack, taking advantage of the constructor to signal the beginning and the destructor to signal the end of the section being measured. The macro uses a <a href="http://stackoverflow.com/a/1597129">token-pasting trick</a>, using the current line number, to create unique local variable names.</p>
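<p>The reason the trick needs two macro levels can be seen in a small experiment. In this sketch, <code>DEMO_LINE</code> is a hypothetical stand-in for <code>__LINE__</code> so the result does not depend on where the code sits in the file, and the other macro and function names are made up as well.</p>

```cpp
#include <cassert>
#include <cstring>

#define PASTE_DIRECT(x, y) x ## y                 // pastes the tokens exactly as written
#define PASTE_EXPANDED(x, y) PASTE_DIRECT(x, y)   // expands x and y first, then pastes

#define STRINGIZE2(x) #x
#define STRINGIZE(x) STRINGIZE2(x)

#define DEMO_LINE 42   // hypothetical stand-in for __LINE__

// Text of the identifier each approach would generate.
const char* directName()   { return STRINGIZE(PASTE_DIRECT(marker_, DEMO_LINE)); }
const char* expandedName() { return STRINGIZE(PASTE_EXPANDED(marker_, DEMO_LINE)); }

// Sanity check of the two behaviors.
void checkPasting()
{
    // One level: the macro name itself is pasted, so every use collides.
    assert(std::strcmp(directName(), "marker_DEMO_LINE") == 0);
    // Two levels: the macro is replaced by its value before pasting.
    assert(std::strcmp(expandedName(), "marker_42") == 0);
}
```

<p>With a single level, every marker would get the same name, so two markers in one scope would fail to compile. The extra level of indirection lets the preprocessor substitute the line number before the tokens are joined.</p>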
<p>It's important not to call this macro recursively. In other words, don't insert <code>API_PROFILER</code> anywhere that might be called within the scope of another <code>API_PROFILER</code> marker, using the same identifier. If you do, you'll end up counting the time spent inside the target module twice! If absolutely necessary, you could modify the profiling module to circumvent this limitation, at the cost of a little extra overhead.</p>
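<p>One possible modification, sketched here under several assumptions, is to keep a nesting depth per thread and only time the outermost marker. This is not the module's actual code: <code>NestedInfo</code> and <code>NestedProfiler</code> are hypothetical names, the sketch uses <code>std::chrono::steady_clock</code> instead of <code>QueryPerformanceCounter</code> to stay portable, and counting only outermost calls in <code>hitCount</code> is a design choice rather than a requirement.</p>

```cpp
#include <cassert>
#include <chrono>

// Hypothetical per-thread record with an added nesting depth.
struct NestedInfo
{
    long long accumulatorNs;   // time attributed to the target module, in ns
    long long hitCount;        // number of outermost calls
    int depth;                 // number of markers currently active on this thread
};

// Scoped marker that only measures the outermost entry into the module.
class NestedProfiler
{
    typedef std::chrono::steady_clock Clock;
    NestedInfo* m_info;
    Clock::time_point m_start;

public:
    explicit NestedProfiler(NestedInfo* info) : m_info(info)
    {
        assert(info != 0);
        if (m_info->depth++ == 0)
            m_start = Clock::now();    // start timing at the outermost marker only
    }

    ~NestedProfiler()
    {
        if (--m_info->depth == 0)      // leaving the outermost marker
        {
            m_info->accumulatorNs +=
                std::chrono::duration_cast<std::chrono::nanoseconds>(
                    Clock::now() - m_start).count();
            m_info->hitCount++;
        }
    }
};
```

<p>The extra overhead is one increment, one decrement and two branches per marker. In a real version, the <code>NestedInfo</code> record would need to be thread-local, just like <code>ThreadInfo</code>.</p>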
<p>The destructor sometimes calls a function named <code>Flush</code>. It's a heavier function, so we define it in a separate <code>.cpp</code> file, and make sure each thread calls it only about once per second:</p>
<div class="cpp"><pre class="de1"><span class="co2">#if ENABLE_API_PROFILER</span>
&nbsp;
<span class="kw4">static</span> <span class="kw4">const</span> <span class="kw4">float</span> APIProfiler_ReportIntervalSecs <span class="sy1">=</span> <span class="nu17">1.0f</span><span class="sy4">;</span>
&nbsp;
<span class="kw4">float</span> APIProfiler<span class="sy4">::</span><span class="me2">s_ooFrequency</span> <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
INT64 APIProfiler<span class="sy4">::</span><span class="me2">s_reportInterval</span> <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
&nbsp;
<span class="co1">//------------------------------------------------------------------</span>
<span class="co1">// Flush is called at the rate determined by APIProfiler_ReportIntervalSecs</span>
<span class="co1">//------------------------------------------------------------------</span>
<span class="kw4">void</span> APIProfiler<span class="sy4">::</span><span class="me2">Flush</span><span class="br0">&#40;</span>INT64 end<span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    <span class="co1">// Auto-initialize globals based on timer frequency:</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>s_reportInterval <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        LARGE_INTEGER freq<span class="sy4">;</span>
        QueryPerformanceFrequency<span class="br0">&#40;</span><span class="sy3">&amp;</span>freq<span class="br0">&#41;</span><span class="sy4">;</span>
        s_ooFrequency <span class="sy1">=</span> <span class="nu17">1.0f</span> <span class="sy2">/</span> freq.<span class="me1">QuadPart</span><span class="sy4">;</span>
        MemoryBarrier<span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>
        s_reportInterval <span class="sy1">=</span> <span class="br0">&#40;</span>INT64<span class="br0">&#41;</span> <span class="br0">&#40;</span>freq.<span class="me1">QuadPart</span> <span class="sy2">*</span> APIProfiler_ReportIntervalSecs<span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="co1">// Avoid garbage timing on first call by initializing a new interval:</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">==</span> <span class="nu0">0</span><span class="br0">&#41;</span>
    <span class="br0">&#123;</span>
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">=</span> m_start<span class="sy4">;</span>
        <span class="kw1">return</span><span class="sy4">;</span>
    <span class="br0">&#125;</span>
&nbsp;
    <span class="co1">// Enough time has elapsed. Print statistics to console:</span>
    <span class="kw4">float</span> interval <span class="sy1">=</span> <span class="br0">&#40;</span>end <span class="sy2">-</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime<span class="br0">&#41;</span> <span class="sy2">*</span> s_ooFrequency<span class="sy4">;</span>
    <span class="kw4">float</span> measured <span class="sy1">=</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy2">*</span> s_ooFrequency<span class="sy4">;</span>
    <span class="kw3">printf</span><span class="br0">&#40;</span><span class="st0">&quot;TID 0x%x time spent in <span class="es1">\&quot;</span>%s<span class="es1">\&quot;</span>: %.0f/%.0f ms %.1f%% %dx<span class="es1">\n</span>&quot;</span>,
        GetCurrentThreadId<span class="br0">&#40;</span><span class="br0">&#41;</span>,
        m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>name,
        measured <span class="sy2">*</span> <span class="nu0">1000</span>,
        interval <span class="sy2">*</span> <span class="nu0">1000</span>,
        <span class="nu0">100</span>.<span class="me1">f</span> <span class="sy2">*</span> measured <span class="sy2">/</span> interval,
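        <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount<span class="br0">&#41;</span><span class="sy4">;</span>   <span class="co1">// hitCount is INT64, but %d expects an int</span>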
&nbsp;
    <span class="co1">// Reset statistics and begin next timing interval:</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>lastReportTime <span class="sy1">=</span> end<span class="sy4">;</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>accumulator <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
    m_threadInfo<span class="sy2">-</span><span class="sy1">&gt;</span>hitCount <span class="sy1">=</span> <span class="nu0">0</span><span class="sy4">;</span>
<span class="br0">&#125;</span>
&nbsp;
<span class="co2">#endif</span></pre></div>
<p>In the above code, <code>printf</code> is used for logging, but you could easily replace it with calls to <code>sprintf</code> and <code>OutputDebugString</code>, or anything else. The nice thing about logging to a console is that it works even when there is no graphical display, such as during the loading screen of a game, or when the application is starting up. Those are moments when you might be particularly interested in profiling a specific API.</p>
<p>Another convenient thing about this profiling module is that no explicit initialization is required. The very first time a thread hits the macro, its <code>lastReportTime</code> is still zero, so the destructor will call <code>Flush</code>. The first thread to enter <code>Flush</code> will see that <code>s_reportInterval</code> is not yet initialized, and will initialize it along with <code>s_ooFrequency</code>. It doesn't matter if two threads end up trying to initialize the globals at the same time; they will both write the same result.</p>
<p>I measured the overhead introduced by the <code>API_PROFILER</code> macro on two processors: <strong>99 ns</strong> on a 1.86 GHz Core 2 Duo, and <strong>30.8 ns</strong> on a 2.66 GHz Xeon. That's just a little slower than an <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">uncontended Windows Critical Section</a>, making this a pretty good technique for fine-grained profiling. You could reduce the overhead further by calling <code><a href="http://msdn.microsoft.com/en-us/library/twchhe95%28v=vs.80%29.aspx">__rdtsc</a></code> instead of <code>QueryPerformanceCounter</code>, but the resulting numbers would be <a href="http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx">less reliable on multicore systems</a>, so I chose not to mess with that.</p>
<p>Built-in profiling modules are nothing new: Jeff Everett describes another in-game profiler in <a href="http://www.amazon.com/Game-Programming-Gems-CD-Vol/dp/1584500549">Game Programming Gems 2</a>. Hopefully, I've at least presented a few twists on the idea. I'd be interested to hear about any twists of your own. As far as I know, no third-party profiler is capable of profiling a multithreaded API as easily and accurately as the method I've described here, whether it's <a href="http://valgrind.org/">Valgrind</a>, <a href="http://blogs.msdn.com/b/pigscanfly/archive/2008/03/02/using-the-windows-sample-profiler-with-xperf.aspx">xperf</a>, <a href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/">VTune</a>, <a href="http://developer.apple.com/technologies/tools/">Shark</a>, <a href="http://msdn.microsoft.com/en-us/library/ee417062%28v=VS.85%29.aspx">PIX</a>, <a href="http://www.snsys.com/ps3/prodg.asp#tuner">Tuner</a>, <a href="http://msdn.microsoft.com/en-us/magazine/cc337887.aspx">Visual Studio Profiler</a>, or any other. Readers, correct me if I'm wrong!</p>
<p>Such profilers can, on the other hand, show you when a particular module becomes heavy &mdash; the module's internal functions will appear near the top of <a href="http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29#Statistical_profilers">PC sampling</a> summaries, for example. Sometimes, even <a href="http://preshing.com/20110723/finding-bottlenecks-by-random-breaking">random breaking</a> offers a similar clue. At that point, you might be compelled to use a built-in profiling module like this one, to drill deeper and to measure the impact of subsequent code changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Always Use a Lightweight Mutex</title>
		<link>http://preshing.com/20111124/always-use-a-lightweight-mutex</link>
		<comments>http://preshing.com/20111124/always-use-a-lightweight-mutex#comments</comments>
		<pubDate>Thu, 24 Nov 2011 14:34:15 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2248</guid>
		<description><![CDATA[In multithreaded programming, we often speak of locks (also known as mutexes). But a lock is only a concept. To actually use that concept, you need an implementation. As it turns out, there are many ways to implement a lock, &#8230; <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In multithreaded programming, we often speak of <a href="http://en.wikipedia.org/wiki/Lock_(computer_science)">locks</a> (also known as mutexes). But a lock is only a concept. To actually <em>use</em> that concept, you need an implementation. As it turns out, there are many ways to implement a lock, and those implementations vary wildly in performance.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/lightweight-mutex.png" alt="" title="" width="120" height="92" class="alignleft size-full wp-image-2542" />The Windows SDK provides two lock implementations for C/C++: the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684266%28v=vs.85%29.aspx">Mutex</a> and the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms682530%28v=vs.85%29.aspx">Critical Section</a>. (As Ned Batchelder <a href="http://nedbatchelder.com/blog/200304/mutexes_and_critical_sections.html">points out</a>, <em>Critical Section</em> is probably not the best name to give to the lock itself, but we&#8217;ll forgive that here.)</p>
<p>The Windows Critical Section is what we call a <strong>lightweight mutex</strong>. It&#8217;s optimized for the case when there are no other threads competing for the lock. To demonstrate using a simple example, here&#8217;s a single thread which locks and unlocks a Windows Mutex exactly one million times.</p>
<pre>
HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
CloseHandle(mutex);
</pre>
<p><span id="more-2248"></span>Here's the same experiment using a Windows Critical Section.</p>
<pre>
CRITICAL_SECTION critSec;
InitializeCriticalSection(&#038;critSec);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&#038;critSec);
    LeaveCriticalSection(&#038;critSec);
}
DeleteCriticalSection(&#038;critSec);
</pre>
<p>If you insert some timing code around the inner loop, and divide the result by one million, you'll find the average time required for a pair of lock/unlock operations in both cases. I did that, and ran the experiment on two different processors. The results:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/mutex-vs-critical-section.png" alt="" title="" width="508" height="80" class="aligncenter size-full wp-image-2322" /></p>
<p>The Critical Section is <strong>25 times</strong> faster. As <a href="http://blogs.msdn.com/b/larryosterman/archive/2005/08/24/455741.aspx">Larry Osterman explains</a>, the Windows Mutex enters the kernel every time you use it, while the Critical Section does not. The tradeoff is that you can't share a Critical Section between processes. But who cares? Most of the time, you just want to protect some data within a single process. (It is actually possible to share a lightweight mutex between processes, just not using a Critical Section.)</p>
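<p>The measurement procedure described above (time a loop of lock/unlock pairs, then divide by the iteration count) can be sketched in Python. This is only an illustration of the method: it times Python's <code>threading.Lock</code> rather than a Windows Mutex or Critical Section, so its numbers are not comparable to the ones in this post, and the function name <code>average_lock_pair_ns</code> is made up for the example.</p>

```python
import threading
import time

def average_lock_pair_ns(iterations=1000000):
    """Average cost of one uncontended acquire/release pair, in nanoseconds."""
    lock = threading.Lock()
    start = time.perf_counter()
    for _ in range(iterations):
        lock.acquire()
        lock.release()
    elapsed = time.perf_counter() - start
    # Note: the result also includes Python's own loop and call overhead.
    return elapsed / iterations * 1e9

if __name__ == "__main__":
    print("%.1f ns per lock/unlock pair" % average_lock_pair_ns())
```

<p>Only one thread ever touches the lock here, so the loop measures the uncontended case, which is the case a lightweight mutex is optimized for.</p>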
<p>Now, suppose you have a thread which acquires a Critical Section 100000 times per second, and there are no other threads competing for the lock. Based on the above figures, you can expect to pay between 0.2% and 0.6% in lock overhead. Not too bad! At lower frequencies, the overhead becomes negligible. I'm ignoring the hidden cost of synchronizing the processor's cache, which is something I'll write about in a future post, but it doesn't make a big difference.</p>
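<p>The overhead estimate is just the lock frequency multiplied by the cost of one lock/unlock pair. Here's a small sketch of that arithmetic. The 23.5 ns input is the Critical Section figure quoted in the previous post for the 2.66 GHz Xeon; the 60 ns input is not a measurement, only the per-pair cost that a 0.6% overhead would imply. The function name is hypothetical.</p>

```python
def lock_overhead_fraction(acquires_per_second, pair_cost_ns):
    """Fraction of each second spent acquiring and releasing the lock."""
    return acquires_per_second * pair_cost_ns * 1e-9

# 23.5 ns: Critical Section cost quoted for the Xeon in the previous post.
xeon = lock_overhead_fraction(100000, 23.5)
# 60 ns: NOT a measured value, just what a 0.6% overhead works out to.
upper = lock_overhead_fraction(100000, 60.0)

print("%.3f%%" % (xeon * 100))
print("%.3f%%" % (upper * 100))
```

<p>At 100000 acquires per second, the first case works out to roughly 0.2% of the thread's time, in line with the lower end of the range above.</p>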
<h2>Other Platforms</h2>
<p>In MacOS 10.6.6, a lock implementation is provided using the <a href="http://en.wikipedia.org/wiki/POSIX_Threads">POSIX Threads</a> API. It's a lightweight mutex which doesn't enter the kernel unless there's contention. A pair of uncontended calls to <code>pthread_mutex_lock</code> and <code>pthread_mutex_unlock</code> takes about <strong>92 ns</strong> on my 1.86 GHz Core 2 Duo. Interestingly, it detects when there's only one thread running, and in that case switches to a trivial codepath taking only 38 ns.</p>
<p>MacOS also offers <code><a href="http://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSLock_Class/Reference/Reference.html">NSLock</a></code>, an Objective-C class, but this is really just a wrapper around the aforementioned POSIX mutex. Because each operation must wind its way through <code>objc_msgSend</code>, the overhead is a little higher: <strong>155 ns</strong> on my Core 2 Duo, or 98 ns if there's only a single thread.</p>
<p>Naturally, Ubuntu 11.10 provides a lock implementation using the POSIX Threads API as well. It's another lightweight mutex, based on a Linux-specific construct known as a <a href="http://en.wikipedia.org/wiki/Futex">futex</a>. A pair of <code>pthread_mutex_lock</code>/<code>pthread_mutex_unlock</code> calls takes about <strong>66 ns</strong> on my Core 2 Duo. You can even share this implementation between processes, but I didn't test that.</p>
<p>Even the Playstation 3 SDK offers a choice between a lightweight mutex and a heavy one. Back in 2007, early in the development of a Playstation 3 game I worked on, we were using the heavy mutex. Switching to the lightweight mutex made the game start <strong>17</strong> seconds faster! For me, that's when the difference really hit home.</p>
<p>In my previous post, I <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">argued against the misconception that locks are slow</a> and provided some data to support the argument. At this point, it should be clear that if you aren't using a lightweight mutex, the entire argument goes out the window. I'm fairly sure that the existence of heavy lock implementations has only added to this misconception over the years.</p>
<p>Some of you old-timers may point out ancient platforms where a heavy lock was the only implementation available, or where a <a href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29">semaphore</a> had to be used for the job. But it seems all modern platforms offer a lightweight mutex. And even if they didn't, you could write your own lightweight mutex at the application level, even sharing it between processes, provided you're willing to live with certain caveats. In an upcoming post, I'll take a look at one implementation known as the <a href="http://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26">Benaphore</a>, and present a recursive variation of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111124/always-use-a-lightweight-mutex/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Locks Aren&#8217;t Slow; Lock Contention Is</title>
		<link>http://preshing.com/20111118/locks-arent-slow-lock-contention-is</link>
		<comments>http://preshing.com/20111118/locks-arent-slow-lock-contention-is#comments</comments>
		<pubDate>Fri, 18 Nov 2011 13:46:35 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=2159</guid>
		<description><![CDATA[Locks (also known as mutexes) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon wrote, &#8220;Most people have the misconception that locks are slow.&#8221; 25 years later, this misconception still seems to &#8230; <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Lock_(computer_science)">Locks</a> (also known as <strong>mutexes</strong>) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon <a href="http://groups.google.com/group/net.micro.mac/msg/752d18de371bd65c?dmode=source">wrote</a>, &#8220;Most people have the misconception that locks are slow.&#8221; 25 years later, this misconception still seems to <a href="http://www.cs.washington.edu/education/courses/cse451/03wi/section/prodcons.htm">pop up</a> once in a while.</p>
<p>It&#8217;s true that locking is slow on some platforms, or when the lock is highly contended. And when you&#8217;re developing a multithreaded application, it&#8217;s very common to find a huge performance bottleneck caused by a single lock. But that doesn&#8217;t mean all locks are slow. As I&#8217;ll show in this post, sometimes a locking strategy achieves excellent performance.</p>
<p>Perhaps the most easily-overlooked source of this misconception: Not all programmers may be aware of the difference between a lightweight mutex and a &#8220;kernel mutex&#8221;. I&#8217;ll talk about that in my next post, <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">Always Use a Lightweight Mutex</a>. For now, let&#8217;s just say that if you&#8217;re programming in C/C++ on Windows, the <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms682530.aspx">Critical Section</a> object is the one you want.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/lock-competition-thumbnail.png" alt="" title="" width="154" height="95" class="alignright size-full wp-image-2539" />Other times, the conclusion that locks are slow is supported by a benchmark. For example, <a href="http://ridiculousfish.com/blog/posts/barrier.html">this post</a> measures the performance of a lock under heavy conditions: each thread must hold the lock to do any work (high contention), and the lock is held for an extremely short interval of time (high frequency). It&#8217;s a good read, but in a real application, you generally want to avoid using locks in that way. To put things in context, I&#8217;ve devised a benchmark which includes both best-case and worst-case usage scenarios for locks.</p>
<p><span id="more-2159"></span>Locks may be frowned upon for other reasons. There&#8217;s a whole other family of techniques out there known as lock-free (or <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx">lockless</a>) programming. Lock-free programming is extremely challenging, but delivers huge performance gains in a lot of real-world scenarios. I know programmers who spent days, even weeks fine-tuning a lock-free algorithm, subjecting it to a battery of tests, only to discover hidden timing bugs several months later. The combination of danger and reward can be very enticing to a certain kind of programmer &#8212; and this includes me, as you&#8217;ll see in future posts! With lock-free techniques beckoning us to use them, locks can begin to feel boring, slow and busted.</p>
<p>But don&#8217;t disregard locks yet. One good example of a place where locks perform admirably, in real software, is when protecting the memory allocator. <a href="http://g.oswego.edu/dl/html/malloc.html">Doug Lea&#8217;s Malloc</a> is a popular memory allocator in video game development, but it&#8217;s single threaded, so we need to protect it using a lock. During gameplay, it&#8217;s not uncommon to see multiple threads hammering the memory allocator, say around 15000 times per second. While loading, this figure can climb to 100000 times per second or more. It&#8217;s not a big problem, though. As you&#8217;ll see, locks handle the workload like a champ.</p>
<h2>Lock Contention Benchmark</h2>
<p>In this test, we spawn a thread which generates random numbers, using a custom <a href="http://en.wikipedia.org/wiki/Mersenne_twister">Mersenne Twister</a> implementation. Every once in a while, it acquires and releases a lock. The lengths of time between acquiring and releasing the lock are random, but they tend towards average values which we decide ahead of time. For example, suppose we want to acquire the lock 15000 times per second, and keep it held 50% of the time. Here&#8217;s what part of the timeline would look like. Red means the lock is held, grey means it&#8217;s released:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/single-thread-timeline.png" alt="" title="" width="466" height="16" class="aligncenter size-full wp-image-2241" /></p>
<p>This is essentially a Poisson process. If we know the average amount of time to generate a single random number &#8212; <strong>6.349 ns</strong> on a 2.66 GHz quad-core Xeon &#8212; we can measure time in <em>work units</em>, rather than seconds. We can then use the technique described in my previous post, <a href="http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process">How to Generate Random Timings for a Poisson Process</a>, to decide how many work units to perform between acquiring and releasing the lock. Here&#8217;s the implementation in C++. I&#8217;ve left out a few details, but if you like, you can download the complete source code <a href="http://preshing.com/files/LockCompetition.zip">here</a>.</p>
<div class="cpp"><pre class="de1">QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>start<span class="br0">&#41;</span><span class="sy4">;</span>
<span class="kw1">for</span> <span class="br0">&#40;</span><span class="sy4">;;</span><span class="br0">&#41;</span>
<span class="br0">&#123;</span>
    <span class="co1">// Do some work without holding the lock</span>
    workunits <span class="sy1">=</span> <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> <span class="br0">&#40;</span>random.<span class="me1">poissonInterval</span><span class="br0">&#40;</span>averageUnlockedCount<span class="br0">&#41;</span> <span class="sy2">+</span> <span class="nu17">0.5f</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span> i <span class="sy1">&lt;</span> workunits<span class="sy4">;</span> i<span class="sy2">++</span><span class="br0">&#41;</span>
        random.<span class="me1">integer</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>       <span class="co1">// Do one work unit</span>
    workDone <span class="sy2">+</span><span class="sy1">=</span> workunits<span class="sy4">;</span>
&nbsp;
    QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
    elapsedTime <span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> start.<span class="me1">QuadPart</span><span class="br0">&#41;</span> <span class="sy2">*</span> ooFreq<span class="sy4">;</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>elapsedTime <span class="sy1">&gt;=</span> timeLimit<span class="br0">&#41;</span>
        <span class="kw1">break</span><span class="sy4">;</span>
&nbsp;
    <span class="co1">// Do some work while holding the lock</span>
    EnterCriticalSection<span class="br0">&#40;</span><span class="sy3">&amp;</span>criticalSection<span class="br0">&#41;</span><span class="sy4">;</span>
    workunits <span class="sy1">=</span> <span class="br0">&#40;</span><span class="kw4">int</span><span class="br0">&#41;</span> <span class="br0">&#40;</span>random.<span class="me1">poissonInterval</span><span class="br0">&#40;</span>averageLockedCount<span class="br0">&#41;</span> <span class="sy2">+</span> <span class="nu17">0.5f</span><span class="br0">&#41;</span><span class="sy4">;</span>
    <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy1">=</span> <span class="nu0">1</span><span class="sy4">;</span> i <span class="sy1">&lt;</span> workunits<span class="sy4">;</span> i<span class="sy2">++</span><span class="br0">&#41;</span>
        random.<span class="me1">integer</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy4">;</span>       <span class="co1">// Do one work unit</span>
    workDone <span class="sy2">+</span><span class="sy1">=</span> workunits<span class="sy4">;</span>
    LeaveCriticalSection<span class="br0">&#40;</span><span class="sy3">&amp;</span>criticalSection<span class="br0">&#41;</span><span class="sy4">;</span>
&nbsp;
    QueryPerformanceCounter<span class="br0">&#40;</span><span class="sy3">&amp;</span>end<span class="br0">&#41;</span><span class="sy4">;</span>
    elapsedTime <span class="sy1">=</span> <span class="br0">&#40;</span>end.<span class="me1">QuadPart</span> <span class="sy2">-</span> start.<span class="me1">QuadPart</span><span class="br0">&#41;</span> <span class="sy2">*</span> ooFreq<span class="sy4">;</span>
    <span class="kw1">if</span> <span class="br0">&#40;</span>elapsedTime <span class="sy1">&gt;=</span> timeLimit<span class="br0">&#41;</span>
        <span class="kw1">break</span><span class="sy4">;</span>
<span class="br0">&#125;</span></pre></div>
<p>Now suppose we launch two such threads, each running on a different core. Each thread will hold the lock during 50% <em>of the time when it can perform work</em>, but if one thread tries to acquire the lock while the other thread is holding it, it will be forced to wait. This is known as <strong>lock contention</strong>.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/double-thread-timeline.png" alt="" title="" width="473" height="40" class="aligncenter size-full wp-image-2239" /></p>
<p>In my opinion, this is a pretty good simulation of the way a lock might be used in a real application. When we run the above scenario, we find that each thread spends roughly 25% of its time waiting, and 75% of its time doing actual work. Together, both threads achieve a net performance of <strong>1.5x</strong> compared to the single-threaded case.</p>
<p>I ran several variations of the test on a 2.66 GHz quad-core Xeon, from 1 thread, 2 threads, all the way up to 4 threads, each running on its own core. I also varied the duration of the lock, from the trivial case where the lock is never held, all the way up to the maximum where each thread must hold the lock for 100% of its workload. In all cases, the lock frequency remained constant &#8212; threads acquired the lock 15000 times for each second of work performed.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/thread-parallelism.png" alt="" title="" width="440" height="274" class="aligncenter size-full wp-image-2236" /></p>
<p>The results were interesting. For short lock durations, up to say 10%, the system achieved very high parallelism. Not perfect parallelism, but close. Locks are fast!</p>
<p>To put the results in perspective, I analyzed the memory allocator lock in a multithreaded game engine. During gameplay, with 15000 locks per second coming from 3 threads, the lock duration was in the neighborhood of just <strong>2%</strong>. That&#8217;s well within the comfort zone on the left side of the diagram.</p>
<p>These results also show that once the lock duration passes 90%, there&#8217;s no point using multiple threads anymore. A single thread performs better. Most surprising is the way the performance of 4 threads drops off a cliff around the 60% mark! This looked like an anomaly, so I re-ran the tests several additional times, even trying a different testing order. The same behavior happened consistently. My best hypothesis is that the experiment hits some kind of snag in the Windows scheduler, but I didn&#8217;t investigate further.</p>
<h2>Lock Frequency Benchmark</h2>
<p>Even a lightweight mutex has overhead. As my <a href="http://preshing.com/20111124/always-use-a-lightweight-mutex">next post</a> shows, a pair of acquire/release operations on a Windows Critical Section takes about <strong>23.5 ns</strong> on the CPU used in these tests. Therefore, 15000 locks per second is low enough that lock overhead does not significantly impact the results. But what happens as we turn up the dial on lock frequency?</p>
<p>The above tests offer very fine control over the amount of work performed between one lock and the next, so I performed a new batch of tests using smaller amounts: from a very fine-grained 10 ns between locks, all the way up to 31 &mu;s, which corresponds to roughly 32000 acquires per second. Each test used exactly two threads:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/11/two-thread-granularities.png" alt="" title="" width="473" height="185" class="aligncenter size-full wp-image-2283" /></p>
<p>As you might expect, for very high lock frequencies, the overhead of the lock itself begins to dwarf the actual work being done. Several benchmarks you&#8217;ll find online, including the one linked earlier, fall into the bottom-right corner of this chart. At such frequencies, you&#8217;re talking about some seriously short lock times &#8212; on the scale of a few CPU instructions. The good news is that, when the work between locks is that simple, a lock-free implementation is more likely to be feasible.</p>
<p>At the same time, the results show that locking up to 320000 times per second (3.1 &mu;s between successive locks) is not unreasonable. In game development, the memory allocator may flirt with this frequency during load times. You can still achieve more than 1.5x parallelism if the lock duration is short.</p>
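<p>As a rough back-of-the-envelope check on those figures, here's a sketch that converts the work interval between locks into a lock frequency, and compares the 23.5 ns lock cost quoted above against the work done between locks. It deliberately ignores contention and cache effects, and the helper names are made up for the example.</p>

```python
LOCK_PAIR_NS = 23.5  # Critical Section cost quoted above for the test CPU

def acquires_per_second(interval_ns):
    """Lock frequency if one acquire happens every interval_ns of work."""
    return 1e9 / interval_ns

def lock_cost_relative_to_work(interval_ns, pair_cost_ns=LOCK_PAIR_NS):
    """Size of the lock overhead compared to the work between locks."""
    return pair_cost_ns / interval_ns

for interval_ns in (10.0, 3100.0, 31000.0):
    print("%8.0f ns between locks: %9.0f acquires/s, lock cost = %.4f x work"
          % (interval_ns, acquires_per_second(interval_ns),
             lock_cost_relative_to_work(interval_ns)))
```

<p>With only 10 ns of work between locks, the lock pair costs more than twice as much as the work itself. At 3.1 &mu;s between locks, it's under 1% of the work.</p>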
<p>We&#8217;ve now seen a wide spectrum of lock performance: cases where it performs great, and cases where the application slows to a crawl. I&#8217;ve argued that the lock around the memory allocator in a game engine will often achieve excellent performance. Given this example from the real world, it cannot be said that <em>all</em> locks are slow. Admittedly, it&#8217;s very easy to abuse locks, but one shouldn&#8217;t live in too much fear &#8212; any resulting bottlenecks will show up during careful profiling. When you consider how reliable locks are, and the relative ease of understanding them (compared to lock-free techniques), locks are actually pretty awesome sometimes.</p>
<p>The goal of this post was to give locks a little respect where deserved &#8212; corrections are welcome. I also realize that locks are used in a wide variety of industries and applications, and it may not always be so easy to strike a good balance in lock performance. If you&#8217;ve found that to be the case in your own experience, I would love to hear from you in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111118/locks-arent-slow-lock-contention-is/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How to Generate Random Timings for a Poisson Process</title>
		<link>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process</link>
		<comments>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process#comments</comments>
		<pubDate>Fri, 07 Oct 2011 06:01:09 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1948</guid>
		<description><![CDATA[What&#8217;s a Poisson process, and how is it useful? Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process. For &#8230; <a href="http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What&#8217;s a Poisson process, and how is it useful?</p>
<p>Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process.</p>
<p>For example, the <a href="http://earthquake.usgs.gov/earthquakes/eqarchives/year/eqstats.php">USGS</a> estimates that each year, there are approximately 13000 earthquakes of magnitude 4+ around the world. Those earthquakes are scattered randomly throughout the year, but there are more or less 13000 per year. That&#8217;s one example of a Poisson process. The <a href="http://en.wikipedia.org/wiki/Poisson_process#Examples">Wikipedia page</a> lists several others.</p>
<p>In statistics, there are a bunch of functions and equations to help model a Poisson process. I&#8217;ll present one of those functions in this post, and demonstrate its use in writing a simulation. </p>
<h2>The Exponential Distribution</h2>
<p>If 13000 such earthquakes happen every year, it means that, on average, one earthquake happens every 40 minutes. So, let&#8217;s define a variable &lambda; = <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' /> and call it the <em>rate parameter</em>. The rate parameter &lambda; is a measure of frequency: the average rate of events (in this case, earthquakes) per unit of time (in this case, minutes).</p>
<p><span id="more-1948"></span>Knowing this, we can ask questions like, what is the probability that an earthquake will happen within the next minute? What&#8217;s the probability within the next 10 minutes? There&#8217;s a well-known function to answer such questions. It&#8217;s called the <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a> for the <a href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>, and it looks like this:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=F%28x%29+%3D+1+-+e%5E%7B-%5Clambda+x%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='F(x) = 1 - e^{-&#92;lambda x}' title='F(x) = 1 - e^{-&#92;lambda x}' class='latex' /></center></p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/exponential-curve.png" alt="" title="" width="425" height="227" class="aligncenter size-full wp-image-2104" /></p>
<p>Basically, the more time passes, the more likely it is that, somewhere in the world, an earthquake will occur. The word &#8220;exponential&#8221;, in this context, actually refers to <a href="http://en.wikipedia.org/wiki/Exponential_decay">exponential decay</a>. As time passes, the probability of having <em>no</em> earthquake decays towards zero &#8212; and correspondingly, the probability of having at least one earthquake increases towards one.</p>
<p>Plugging in a few values, we find that:</p>
<ul>
<li>The probability of having an earthquake within the next minute is <img src='http://s0.wp.com/latex.php?latex=F%281%29+%5Capprox+0.0247&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(1) &#92;approx 0.0247' title='F(1) &#92;approx 0.0247' class='latex' />. This value is pretty close to <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' />, our prescribed earthquake frequency, but it&#8217;s not equal.</li>
<li>The probability of having an earthquake within the next 10 minutes is <img src='http://s0.wp.com/latex.php?latex=F%2810%29+%5Capprox+0.221&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(10) &#92;approx 0.221' title='F(10) &#92;approx 0.221' class='latex' />.</li>
</ul>
<p>In particular, note that after 40 minutes &#8212; the prescribed average time between earthquakes &#8212; the probability is only <img src='http://s0.wp.com/latex.php?latex=F%2840%29+%5Capprox+0.632&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(40) &#92;approx 0.632' title='F(40) &#92;approx 0.632' class='latex' />. So, given any 40 minute interval of time, it&#8217;s pretty likely that we&#8217;ll have an earthquake within that time interval, but it won&#8217;t always happen.</p>
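<p>These values are easy to reproduce. Here's a minimal sketch that evaluates the cumulative distribution function directly; the function name <code>exponential_cdf</code> is just one I picked for the example.</p>

```python
import math

def exponential_cdf(x, rate):
    """Probability that at least one event occurs within x units of time."""
    return 1.0 - math.exp(-rate * x)

rate = 1 / 40.0   # one earthquake every 40 minutes, on average
for minutes in (1, 10, 40):
    print("F(%d) = %.4f" % (minutes, exponential_cdf(minutes, rate)))
```

<p>The three printed values round to the 0.0247, 0.221 and 0.632 quoted above.</p>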
<h2>Writing a Simulation</h2>
<p>Now, suppose we want to simulate the occurrence of earthquakes in a game engine, or some other kind of program. First, we need to figure out when each earthquake should begin.</p>
<p>One approach is to loop, and after each interval of X minutes, sample a random floating-point value between 0 and 1. If this number is less than <img src='http://s0.wp.com/latex.php?latex=F%28X%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='F(X)' title='F(X)' class='latex' />, then start an earthquake! X could even be a fractional value, so you could sample several times per minute, or even several times per second. This approach will probably work just fine, as long as your random number generator is uniform and offers enough numerical precision. However, if you intend to sample 60 times per second, with &lambda; = <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B40%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;frac{1}{40}' title='&#92;frac{1}{40}' class='latex' />, you&#8217;ll need at least 18 bits of precision from the random number generator, which the Standard C Runtime Library doesn&#8217;t always offer.</p>
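<p>Here's a rough sketch of that sampling approach, along with the precision estimate, written for Python 3. The function names are hypothetical, and the bit count is simply the base-2 logarithm of one over the per-sample probability, rounded up. I use <code>math.expm1</code> because the per-sample probability is tiny, and computing <code>1 - exp(...)</code> directly would lose digits.</p>

```python
import math
import random

def cdf(x, rate):
    """Exponential CDF: probability of at least one event within x time units."""
    return -math.expm1(-rate * x)   # equals 1 - exp(-rate * x), accurate for tiny x

def count_events_by_sampling(total_minutes, step_minutes, rate, rng):
    """Step through time; at each step, start an event with probability F(step)."""
    p = cdf(step_minutes, rate)
    steps = int(total_minutes / step_minutes)
    return sum(1 for _ in range(steps) if rng.random() < p)

rate = 1 / 40.0

# Sampling 60 times per second means one sample every 1/3600 of a minute.
p_per_sample = cdf(1.0 / 3600, rate)
bits_needed = math.ceil(math.log2(1.0 / p_per_sample))
print("probability per sample: %.3g, bits needed: %d" % (p_per_sample, bits_needed))

# A coarser run, sampling once per simulated minute.
rng = random.Random(42)   # fixed seed so the run is repeatable
print(count_events_by_sampling(1000000, 1.0, rate, rng), "earthquakes")
```

<p>Note that this approach can start at most one earthquake per sample, so coarse sampling intervals will slightly undercount compared to a true Poisson process.</p>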
<p>Another approach is to sidestep the whole sampling strategy, and simply write a function to determine the exact amount of time until the next earthquake. This function should return random numbers, but not the uniform kind of random number produced by most generators. We want to generate random numbers in a way that follows our exponential distribution.</p>
<p>Donald Knuth describes a way to generate such values in section 3.4.1 (D) of <a href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming</a>. Simply choose a random point on the y-axis between 0 and 1, distributed uniformly, and locate the corresponding time value on the x-axis. For example, if we choose the point 0.2 from the top of the graph, the time until our next earthquake would be 64.38 minutes.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/inverse-lookup.png" alt="" title="" width="288" height="140" class="aligncenter size-full wp-image-2103" /></p>
<p>Given that the inverse of the exponential function is ln, it&#8217;s pretty easy to write this analytically, where U is the random value between 0 and 1:</p>
<p><center><img src='http://s0.wp.com/latex.php?latex=%5Cmathrm%7BnextTime%7D+%3D+%5Cdfrac%7B-%5Cln+U%7D%7B%5Clambda%7D&#038;bg=ffffff&#038;fg=000&#038;s=2' alt='&#92;mathrm{nextTime} = &#92;dfrac{-&#92;ln U}{&#92;lambda}' title='&#92;mathrm{nextTime} = &#92;dfrac{-&#92;ln U}{&#92;lambda}' class='latex' /></center></p>
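<p>To spell out the step: measuring U from the top of the graph means U is the probability of having <em>no</em> earthquake yet, so U = e<sup>-&lambda;x</sup>. Taking ln of both sides and solving for x gives the formula above. Here's a quick sketch that checks it against the 0.2 example; the function name is made up for the illustration.</p>

```python
import math

def time_until_next(u, rate):
    """Solve exp(-rate * x) = u for x."""
    return -math.log(u) / rate

rate = 1 / 40.0
t = time_until_next(0.2, rate)
print("%.2f minutes" % t)              # the 0.2 example from the text
print("%.3f" % math.exp(-rate * t))    # plugging t back in recovers 0.2
```

<p>The first line prints the same 64.38 minutes read off the graph earlier.</p>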
<h2>The Implementation</h2>
<p>Here&#8217;s one way to implement it in Python. Note that you can&#8217;t pass zero to <code>math.log</code>, but we avoid that by subtracting the result of <code><a href="http://docs.python.org/library/random.html#random.random">random.random</a></code>, which is always less than one, from one.</p>
<pre>
import math
import random

def nextTime(rateParameter):
    return -math.log(1.0 - random.random()) / rateParameter
</pre>
<p><center></p>
<div style="border:1px solid #eeeeee;background-color:#fffff4;text-align:center;width:80%;padding-bottom:3px;"><strong>Update:</strong> After writing this post, I learned that Python has a standard library function which does exactly the same thing as <code>nextTime</code>. It&#8217;s called <code><a href="http://docs.python.org/library/random.html#random.expovariate">random.expovariate</a></code>.</div>
<p></center></p>
<p>Here are a few sample calls. The values look pretty reasonable:</p>
<pre>
>>> nextTime(1/40.0)
91.074923814190498
>>> nextTime(1/40.0)
46.88573030224817
>>> nextTime(1/40.0)
14.965086245136733
>>> nextTime(1/40.0)
26.902965535881194
</pre>
<p>Let&#8217;s run some tests to make sure that the average time returned by this function really is 40. The following expression calculates the average of one million calls, and the results are pretty consistent. I&#8217;m always amazed to see randomness behaving the way we want!</p>
<pre>
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
39.985564565743751
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
40.029018385760551
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
40.016843319423266
>>> sum([nextTime(1/40.0) for i in xrange(1000000)]) / 1000000
39.965097296560664
</pre>
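<p>The same check can be run against <code>random.expovariate</code>. This is only a sketch: it uses <code>range</code> so that it also runs on Python 3, and the sample size <code>n</code> is an arbitrary choice.</p>
<pre>
import math
import random

def nextTime(rateParameter):
    return -math.log(1.0 - random.random()) / rateParameter

n = 200000
rate = 1 / 40.0
mean_a = sum(nextTime(rate) for _ in range(n)) / n
mean_b = sum(random.expovariate(rate) for _ in range(n)) / n
print('%.2f %.2f' % (mean_a, mean_b))   # both should land near 40
</pre>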
<p>Just for fun, here&#8217;s a series of points spaced according to the output of <code>nextTime</code>. This is basically what a Poisson process looks like when plotted along a timeline:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/10/timeline.png" alt="" title="timeline" width="485" height="11" class="aligncenter size-full wp-image-2024" /></p>
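<p>A timeline like this one can be produced by accumulating the gaps into absolute times. Here&#8217;s a minimal sketch; <code>event_times</code> is a hypothetical helper, and it draws its gaps from <code>random.expovariate</code>, which follows the same distribution as <code>nextTime</code>.</p>
<pre>
import random

def event_times(rate, count):
    """Accumulate exponentially distributed gaps into absolute event times."""
    t = 0.0
    times = []
    for _ in range(count):
        t += random.expovariate(rate)   # time until the next event
        times.append(t)
    return times

times = event_times(1 / 40.0, 10)
print(['%.1f' % t for t in times])
</pre>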
<p>And here&#8217;s an implementation of <code>nextTime</code> in C, using the standard library&#8217;s random number generator. Again, we&#8217;re careful not to pass zero to <code>logf</code>.</p>
<pre>
#include &lt;math.h>
#include &lt;stdlib.h>

float nextTime(float rateParameter)
{
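    /* Divide in double precision: RAND_MAX + 1 can overflow an int, and a float can round the ratio up to 1. */
    return -logf((float) (1.0 - rand() / (RAND_MAX + 1.0))) / rateParameter;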
}
</pre>
<p>This technique could have various applications in a game engine, such as spawning particles from a particle emitter, or choosing moments when an AI could take a decision. I also use it in my <a href="http://preshing.com/20111118/locks-arent-slow-lock-contention-is">next post</a>, to measure the performance of threads which hold a lock for various intervals of time.</p>
<p>Any stats experts out there? If I&#8217;ve abused any terminology, or if you see any way to improve this post, I&#8217;d be interested in your comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20111007/how-to-generate-random-timings-for-a-poisson-process/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>High-Resolution Mandelbrot in Obfuscated Python</title>
		<link>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python</link>
		<comments>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python#comments</comments>
		<pubDate>Mon, 26 Sep 2011 10:23:17 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1846</guid>
		<description><![CDATA[Here&#8217;s a followup to last month&#8217;s post about Penrose Tiling in Obfuscated Python. The Mandelbrot set is a traditional favorite among authors of obfuscated code. You can find obfuscated code in C, Perl, Haskell, Python and many other languages. Nearly &#8230; <a href="http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a followup to last month&#8217;s post about <a href="http://preshing.com/20110822/penrose-tiling-in-obfuscated-python">Penrose Tiling in Obfuscated Python</a>.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a> is a traditional favorite among authors of obfuscated code. You can find obfuscated code in <a href="http://www.iwriteiam.nl/SigProgM.html">C</a>, <a href="http://www.maths.tcd.ie/~mkerrin/Programs/usr/others/mandelbrot">Perl</a>, <a href="http://snakelemma.blogspot.com/2009/08/mandelbrot-set-in-haskell.html">Haskell</a>, <a href="http://forums.thedailywtf.com/forums/p/5518/118328.aspx#118328">Python</a> and many other languages. Nearly all examples render the Mandelbrot set as ASCII art.</p>
<p>The following Python script, on the other hand, begins as ASCII art:</p>
<pre>
_                                      =   (
                                        255,
                                      lambda
                               V       ,B,c
                             :c   and Y(V*V+B,B,  c
                               -1)if(abs(V)&lt;6)else
               (              2+c-4*abs(V)**-0.4)/i
                 )  ;v,      x=1500,1000;C=range(v*x
                  );import  struct;P=struct.pack;M,\
            j  ='&lt;QIIHHHH',open('M.bmp','wb').write
for X in j('BM'+P(M,v*x*3+26,26,12,v,x,1,24))or C:
            i  ,Y=_;j(P('BBB',*(lambda T:(T*80+T**9
                  *i-950*T  **99,T*70-880*T**18+701*
                 T  **9     ,T*i**(1-T**45*2)))(sum(
               [              Y(0,(A%3/3.+X%v+(X/v+
                               A/3/3.-x/2)/1j)*2.5
                             /x   -2.7,i)**2 for  \
                               A       in C
                                      [:9]])
                                        /9)
                                       )   )
</pre>
<p><span id="more-1846"></span>It renders the Mandelbrot set as a full-color, anti-aliased, 1500&#215;1000 image. Click to enlarge:</p>
<p><a href="http://preshing.com/wp-content/uploads/2011/09/M.jpg"><img src="http://preshing.com/wp-content/uploads/2011/09/M-small.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1851" /></a></p>
<p>No third-party libraries are required &#8212; just pure Python. However, it will only run on Python 2.5 &#8211; 2.7; Python 3 is not supported. The output file is written to <code>M.bmp</code>, in Windows bitmap format.</p>
<p>It runs very slowly, taking about 18 minutes on my 1.86 GHz Core 2 Duo (or 9 minutes using <a href="http://pypy.org/">PyPy</a>). With some modifications, it&#8217;s possible to make this code run up to 20 times faster. However, doing so requires sacrificing either code size or image quality.</p>
<p>If you&#8217;re willing to leave the script running for a few hours, you can increase the image resolution on line 8. (Just make sure the width is divisible by 4.) The resulting detail is quite nice. Here are some 1:1 pixel excerpts from an image rendered at 7200&#215;4800:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/detail.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1861" /></p>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/detail2.jpg" alt="" title="" width="535" height="357" class="aligncenter size-full wp-image-1862" /></p>
<p>The entire 7200&#215;4800 image is too large to share here, but it&#8217;s perfect for making prints. So that&#8217;s what I did! Notice the Python script superimposed in the lower-left corner. Is this the first poster to include its own source code?</p>
<p><a href="http://www.cafepress.com/preshing"><img src="http://preshing.com/wp-content/uploads/2011/09/poster-wall.jpg" alt="" title="" width="320" height="275" class="aligncenter size-full wp-image-1915" /></a></p>
<p>If this kind of thing gives you kicks, you can order your own print (or a coffee mug) at <a href="http://www.cafepress.com/preshing">CafePress</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python/feed</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
		<item>
		<title>Timing Your Code Using Python&#8217;s &#8220;with&#8221; Statement</title>
		<link>http://preshing.com/20110924/timing-your-code-using-pythons-with-statement</link>
		<comments>http://preshing.com/20110924/timing-your-code-using-pythons-with-statement#comments</comments>
		<pubDate>Sat, 24 Sep 2011 22:32:41 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1805</guid>
		<description><![CDATA[It&#8217;s common to want to time a piece of code. In Python, the with statement provides a convenient way to do so. If you&#8217;ve followed my previous post, The Python with Statement by Example, you should have no problem following &#8230; <a href="http://preshing.com/20110924/timing-your-code-using-pythons-with-statement">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s common to want to time a piece of code. In Python, the <code>with</code> statement provides a convenient way to do so.</p>
<p>If you&#8217;ve followed my previous post, <a href="http://preshing.com/20110920/the-python-with-statement-by-example">The Python with Statement by Example</a>, you should have no problem following along here. All you need is a class which implements the <code>__enter__</code> and <code>__exit__</code> methods:</p>
<pre>
import time

class Timer:
    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.interval = self.end - self.start
</pre>
<p>Of course, this is not a revolutionary idea. We&#8217;re just subtracting a couple of <code><a href="http://docs.python.org/library/time.html#time.clock">time.clock</a></code> calls. If you google around, you&#8217;ll find <a href="http://code.activestate.com/recipes/498113-time-sections-of-code-by-using-with-statement/">several</a> <a href="http://www.daniweb.com/software-development/python/code/216610">people</a> suggesting the same use of the <code>with</code> statement. I&#8217;ve only tweaked the implementation details.</p>
<p><span id="more-1805"></span>In the above class, the <code>__enter__</code> method returns the <code>Timer</code> object itself, allowing us to assign it to a local variable using the <code><a href="http://docs.python.org/reference/compound_stmts.html#the-with-statement">"as" target</a></code> part of the <code>with</code> statement. So we can write the following:</p>
<pre>
import httplib

with Timer() <span class="highlight">as t</span>:
    conn = httplib.HTTPConnection('google.com')
    conn.request('GET', '/')

print('Request took %.03f sec.' % <span class="highlight">t</span>.interval)
</pre>
<p>The main advantage of using the <code>with</code> statement is that the <code>__exit__</code> method will be called regardless of how the nested block exits. Even if an exception is raised in the middle of the nested block &#8212; as would happen if network problems interfere with the above <code><a href="http://docs.python.org/library/httplib.html">HTTPConnection</a></code> &#8212; the <code>__exit__</code> method will be called. To see the result, though, we&#8217;d have to wrap the <code>with</code> statement in a <code>try/finally</code> block, so that the <code>print</code> still runs while the exception propagates:</p>
<pre>
<span class="highlight">try:</span>
    with Timer() as t:
        conn = httplib.HTTPConnection('google.com')
        conn.request('GET', '/')
<span class="highlight">finally:</span>
    print('Request took %.03f sec.' % t.interval)
</pre>
<p>Now, even if your network cable comes unplugged, you&#8217;ll still see the running time of this code.</p>
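<p>For readers on newer versions of Python, note that <code>time.clock</code> was removed in Python 3.8. Here&#8217;s a sketch of the same idea using <code>time.perf_counter</code> instead (a substitution made for this example), with a deliberately raised <code>IOError</code> standing in for the failed network request:</p>
<pre>
import time

class Timer(object):
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        # Runs even when the nested block raises
        self.end = time.perf_counter()
        self.interval = self.end - self.start

caught = None
try:
    with Timer() as t:
        time.sleep(0.05)
        raise IOError('simulated network failure')   # stand-in for a failed request
except IOError as e:
    caught = e
finally:
    print('Block took %.03f sec.' % t.interval)
</pre>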
<p>Of course, Python has its own set of <a href="http://docs.python.org/library/debug.html">debugging and profiling</a> modules in the standard library. Use whatever suits your purpose.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110924/timing-your-code-using-pythons-with-statement/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Python &#8220;with&#8221; Statement by Example</title>
		<link>http://preshing.com/20110920/the-python-with-statement-by-example</link>
		<comments>http://preshing.com/20110920/the-python-with-statement-by-example#comments</comments>
		<pubDate>Tue, 20 Sep 2011 13:09:10 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1651</guid>
		<description><![CDATA[Python&#8217;s with statement was first introduced five years ago, in Python 2.5. It&#8217;s handy when you have two related operations which you&#8217;d like to execute as a pair, with a block of code in between. The classic example is opening &#8230; <a href="http://preshing.com/20110920/the-python-with-statement-by-example">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Python&#8217;s <code><a href="http://docs.python.org/reference/compound_stmts.html#the-with-statement">with</a></code> statement was first introduced five years ago, in Python 2.5. It&#8217;s handy when you have two related operations which you&#8217;d like to execute as a pair, with a block of code in between. The classic example is opening a file, manipulating the file, then closing it:</p>
<pre>
with open('output.txt', 'w') as f:
    f.write('Hi there!')
</pre>
<p>The above <code>with</code> statement will automatically close the file after the nested block of code. (Continue reading to see exactly how the close occurs.) The advantage of using a <code>with</code> statement is that it is guaranteed to close the file no matter <em>how</em> the nested block exits. If an exception occurs before the end of the block, it will close the file before the exception is caught by an outer exception handler. If the nested block were to contain a <code>return</code> statement, or a <code>continue</code> or <code>break</code> statement, the <code>with</code> statement would automatically close the file in those cases, too.</p>
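<p>Here&#8217;s a small sketch of that guarantee. The temporary path and the deliberately raised <code>ValueError</code> are choices made for this example only:</p>
<pre>
import os
import tempfile

# Scratch location picked for this sketch
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

try:
    with open(path, 'w') as f:
        f.write('Hi there!')
        raise ValueError('something went wrong')   # leave the block abruptly
except ValueError:
    pass

print(f.closed)   # True: the file was closed on the way out
</pre>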
<p><span id="more-1651"></span>Here&#8217;s another example. The <a href="http://cairographics.org/pycairo/">pycairo</a> drawing library contains a <code>Context</code> class which exposes a <code><a href="http://cairographics.org/documentation/pycairo/2/reference/context.html#cairo.Context.save">save</a></code> method, to push the current drawing state on an internal stack, and a <code><a href="http://cairographics.org/documentation/pycairo/2/reference/context.html#cairo.Context.restore">restore</a></code> method, to restore the drawing state from the stack. These two functions are always called in a pair, with some code in between.</p>
<p>This code sample uses a <code>Context</code> object (&#8220;cairo context&#8221;) to draw six rectangles, each with a different rotation. Each call to <code><a href="http://cairographics.org/documentation/pycairo/2/reference/context.html#cairo.Context.rotate">rotate</a></code> is actually combined with the current transformation, so we use a pair of calls to <code>save</code> and <code>restore</code> to preserve the drawing state on each iteration of the loop. This prevents the rotations from combining with each other:</p>
<pre>
cr.translate(68, 68)
for i in xrange(6):
    cr.<span class="highlight">save</span>()
    cr.<span class="highlight">rotate</span>(2 * math.pi * i / 6)
    cr.rectangle(-25, -60, 50, 40)
    cr.stroke()
    cr.<span class="highlight">restore</span>()
</pre>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/six-rectangles.png" alt="" title="six-rectangles" width="136" height="136" class="aligncenter size-full wp-image-1722" /></p>
<p>That&#8217;s a fairly simple example, but for larger scripts, it can become cumbersome to keep track of which <code>save</code> goes with which <code>restore</code>, and to keep them correctly matched. The <code>with</code> statement can help tidy things up a bit.</p>
<p>By themselves, pycairo&#8217;s <code>save</code> and <code>restore</code> methods do not support the <code>with</code> statement, so we&#8217;ll have to add the support on our own. There are two ways to support the <code>with</code> statement: by implementing a context manager class, or by writing a generator function. I&#8217;ll demonstrate both approaches.</p>
<h2>Implementing the Context Manager as a Class</h2>
<p>Here&#8217;s the first approach. To implement a context manager, we define a class containing an <code>__enter__</code> and <code>__exit__</code> method. The class below accepts a cairo context, <code>cr</code>, in its constructor:</p>
<pre>
class Saved():
    def __init__(self, cr):
        self.cr = cr
    def <span class="highlight">__enter__</span>(self):
        self.cr.save()
        return self.cr
    def <span class="highlight">__exit__</span>(self, type, value, traceback):
        self.cr.restore()
</pre>
<p>Thanks to those two methods, it&#8217;s valid to instantiate a <code>Saved</code> object and use it in a <code>with</code> statement. The <code>Saved</code> object is considered to be the <a href="http://docs.python.org/reference/datamodel.html#context-managers">context manager</a>.</p>
<pre>
cr.translate(68, 68)
for i in xrange(6):
    <span class="highlight">with Saved(cr):</span>
        cr.rotate(2 * math.pi * i / 6)
        cr.rectangle(-25, -60, 50, 40)
        cr.stroke()
</pre>
<p>Here are the exact steps taken by the Python interpreter when it reaches the <code>with</code> statement:</p>
<ol>
<li>The <code>with</code> statement stores the <code>Saved</code> object in a temporary, hidden variable, since it&#8217;ll be needed later. (Actually, it only stores the bound <code>__exit__</code> method, but that&#8217;s a detail.)</li>
<li>The <code>with</code> statement calls <code>__enter__</code> on the <code>Saved</code> object, giving the context manager a chance to do its job.</li>
<li>The <code>__enter__</code> method calls <code>save</code> on the cairo context.</li>
<li>The <code>__enter__</code> method returns the cairo context, but as you can see, we have not specified the optional <code><a href="http://docs.python.org/reference/compound_stmts.html#the-with-statement">"as" target</a></code> part of the <code>with</code> statement. Therefore, the return value is not saved anywhere. We don&#8217;t need it; we know it&#8217;s the same cairo context that we passed in.</li>
<li>The nested block of code is executed. It sets up the rotation and draws a rectangle.</li>
<li>At the end of the nested block, the <code>with</code> statement calls the <code>Saved</code> object&#8217;s <code>__exit__</code> method, passing the arguments <code>(None, None, None)</code> to indicate that no exception occurred.</li>
<li>The <code>__exit__</code> method calls <code>restore</code> on the cairo context.</li>
</ol>
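<p>The steps above can be sketched by spelling out the <code>with</code> statement by hand. This is only an approximation of what the interpreter does, and <code>FakeContext</code> is a hypothetical stand-in for a cairo context, so that the sketch runs without pycairo:</p>
<pre>
import sys

class FakeContext(object):
    """Hypothetical stand-in for a cairo context; it only records calls."""
    def __init__(self):
        self.calls = []
    def save(self):
        self.calls.append('save')
    def restore(self):
        self.calls.append('restore')
    def stroke(self):
        self.calls.append('stroke')

class Saved(object):
    def __init__(self, cr):
        self.cr = cr
    def __enter__(self):
        self.cr.save()
        return self.cr
    def __exit__(self, type, value, traceback):
        self.cr.restore()

cr = FakeContext()

# Roughly what "with Saved(cr): cr.stroke()" does:
mgr = Saved(cr)
exit_method = mgr.__exit__          # step 1: remember the bound __exit__
mgr.__enter__()                     # steps 2-4: __enter__ runs, its result is discarded
try:
    cr.stroke()                     # step 5: the nested block
except:
    # Exception path: re-raise unless __exit__ returns a true value
    if not exit_method(*sys.exc_info()):
        raise
else:
    exit_method(None, None, None)   # steps 6-7: normal exit

print(cr.calls)                     # ['save', 'stroke', 'restore']
</pre>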
<p>Once we understand what the Python interpreter is doing, we can make better sense of the example at the beginning of this blog post, where we opened a file in the <code>with</code> statement: File objects expose their own <code>__enter__</code> and <code>__exit__</code> methods, and can therefore act as their own context managers. Specifically, the <code>__exit__</code> method closes the file.</p>
<h3>Exception Handling</h3>
<p>Returning to the drawing example, what happens if an exception occurs within the nested code block? For example, suppose we mistakenly passed the wrong number of arguments to the <code>rectangle</code> call. In that case, the steps taken by the Python interpreter would be:</p>
<ol>
<li>The <code>rectangle</code> method raises a <code>TypeError</code> exception: &#8220;Context.rectangle() takes exactly 4 arguments.&#8221;</li>
<li>The <code>with</code> statement catches this exception.</li>
<li>The <code>with</code> statement calls <code>__exit__</code> on the <code>Saved</code> object. It passes information about the exception in three arguments: (<em>type</em>, <em>value</em>, <em>traceback</em>) &#8212; the same values you&#8217;d get by calling <code><a href="http://docs.python.org/library/sys.html#sys.exc_info">sys.exc_info</a></code>. This tells the <code>__exit__</code> method everything it could possibly need to know about the exception that occurred.</li>
<li>In this case, our <code>__exit__</code> method doesn&#8217;t particularly care. It calls <code>restore</code> on the cairo context anyway, and returns <code>None</code>. (In Python, when no <code>return</code> statement is specified, the function actually returns <code>None</code>.)</li>
<li>The <code>with</code> statement checks to see whether this return value is true. Since it isn&#8217;t, the <code>with</code> statement re-raises the <code>TypeError</code> exception to be handled by someone else.</li>
</ol>
<p>In this manner, we can guarantee that <code>restore</code> will always be called on the cairo context, whether an exception occurs or not.</p>
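<p>To watch this happen without installing pycairo, here&#8217;s a sketch that again uses a hypothetical <code>FakeContext</code> class in place of a real cairo context. Its <code>rectangle</code> method takes four arguments, so passing three triggers the <code>TypeError</code>:</p>
<pre>
class FakeContext(object):
    """Hypothetical stand-in for a cairo context; it only records calls."""
    def __init__(self):
        self.calls = []
    def save(self):
        self.calls.append('save')
    def restore(self):
        self.calls.append('restore')
    def rectangle(self, x, y, width, height):
        self.calls.append('rectangle')

class Saved(object):
    def __init__(self, cr):
        self.cr = cr
    def __enter__(self):
        self.cr.save()
        return self.cr
    def __exit__(self, type, value, traceback):
        self.cr.restore()           # returns None, so the exception is re-raised

cr = FakeContext()
error = None
try:
    with Saved(cr):
        cr.rectangle(-25, -60, 50)  # one argument short
except TypeError as e:
    error = e
    print('re-raised: %s' % e)

print(cr.calls)                     # ['save', 'restore']
</pre>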
<h2>Implementing the Context Manager as a Generator</h2>
<p>That brings us to the second approach for supporting the <code>with</code> statement. Instead of implementing a class for the context manager, we can write a <a href="http://docs.python.org/tutorial/classes.html#generators">generator function</a>. Here&#8217;s a simplified example of such a generator function. Let me point out right away that this example is incomplete, since it does not handle exceptions very well. Read on for more details:</p>
<pre>
from contextlib import contextmanager

@contextmanager
def saved(cr):
    cr.save()
    <span class="highlight">yield</span> cr
    cr.restore()
</pre>
<p>There is a certain charm to writing a generator like this one. At first glance, it appears simpler than the previous approach: A single function takes the place of an entire class definition. But don&#8217;t be fooled! This approach involves many more steps, and a lot more complexity than the previous approach. It took me several reads of <a href="http://www.python.org/dev/peps/pep-0343/">PEP 343</a> &#8212;  which is more of a historical document than a reference &#8212; before I could claim to understand it completely. It requires familiarity with Python decorators, generators, iterators and functions-returning-functions, in addition to the object-oriented programming and exception handling we&#8217;ve already seen.</p>
<p>To make this generator work, two entities from <code><a href="http://docs.python.org/library/contextlib.html">contextlib</a></code>, a standard Python module, are required: the <code>contextmanager</code> function, and an internal class named <code>GeneratorContextManager</code>. The source code, <code><a href="http://hg.python.org/cpython/file/2.7/Lib/contextlib.py">contextlib.py</a></code>, is a bit hairy, but at least it&#8217;s short. I&#8217;ll simply describe what happens, and you are free to refer to the source code, and any other supplementary materials, as needed.</p>
<p>Let&#8217;s start with the generator itself. Here&#8217;s what happens when the above code snippet runs:</p>
<ol>
<li>The Python interpreter recognizes the <code><a href="http://docs.python.org/reference/expressions.html#yield-expressions">yield</a></code> statement in the middle of the function definition. As a result, the <code>def</code> statement does not create a normal function; it creates a generator function.</li>
<li>Because of the presence of the <code>@contextmanager</code> <a href="http://docs.python.org/glossary.html#term-decorator">decorator</a>, <code>contextmanager</code> is called with the generator function as its argument.</li>
<li>The <code>contextmanager</code> function returns a &#8220;factory&#8221; function, which creates <code>GeneratorContextManager</code> objects wrapped around the provided generator. (line 83 of <code>contextlib.py</code>)</li>
<li>Finally, the factory function is assigned to <code>saved</code>. From this point on, when we call <code>saved</code>, we&#8217;ll actually be calling the factory function.</li>
</ol>
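<p>Here&#8217;s a sketch of what the decorator line amounts to, applied by hand. The names <code>saved_generator</code> and <code>FakeContext</code> are made up for this example, and <code>FakeContext</code> merely records calls in place of a real cairo context:</p>
<pre>
from contextlib import contextmanager

class FakeContext(object):
    """Hypothetical stand-in for a cairo context; it only records calls."""
    def __init__(self):
        self.calls = []
    def save(self):
        self.calls.append('save')
    def restore(self):
        self.calls.append('restore')

def saved_generator(cr):
    cr.save()
    yield cr
    cr.restore()

# Equivalent to placing @contextmanager above the def
saved = contextmanager(saved_generator)

cr = FakeContext()
mgr = saved(cr)                   # calls the factory; the generator body has not started
print(cr.calls)                   # []
mgr.__enter__()                   # runs the body up to the yield
print(cr.calls)                   # ['save']
mgr.__exit__(None, None, None)    # resumes the body after the yield
print(cr.calls)                   # ['save', 'restore']
</pre>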
<p>Equipped with all that good stuff, we can now write:</p>
<pre>
for i in xrange(6):
    <span class="highlight">with saved(cr):</span>
        cr.rotate(2 * math.pi * i / 6)
        cr.rectangle(-25, -60, 50, 40)
        cr.stroke()
</pre>
<p>Here are all the steps taken by the Python interpreter when it reaches the <code>with</code> statement.</p>
<ol>
<li>The <code>with</code> statement calls <code>saved</code>, which of course, calls the factory function, passing <code>cr</code>, a cairo context, as its only argument.</li>
<li>The factory function passes the cairo context to our generator function, creating a <a href="http://docs.python.org/reference/expressions.html#yield-expressions">generator</a>.</li>
<li>The generator is passed to the constructor of <code>GeneratorContextManager</code>, an internal class which will act as our context manager.</li>
<li>The <code>with</code> statement saves the <code>GeneratorContextManager</code> object in a temporary hidden variable. (Actually, it only stores the bound <code>__exit__</code> method, but that&#8217;s a detail.)</li>
<li>The <code>with</code> statement calls <code>__enter__</code> on the <code>GeneratorContextManager</code> object.</li>
<li><code>__enter__</code> calls <code><a href="http://docs.python.org/reference/expressions.html#generator.next">next</a></code> on the generator.</li>
<li>Our generator function &#8212; the block of code we defined under <code>def saved(cr)</code> &#8212; runs up until the <code>yield</code> statement. This calls <code>save</code> on the cairo context.</li>
<li>The <code>yield</code> statement yields the cairo context, which becomes the return value for the call to <code>next</code> on the iterator.</li>
<li>The <code>__enter__</code> method returns the cairo context, but as you can see, we have not specified the optional <code>"as" target</code> part of the <code>with</code> statement. Therefore, the return value is not saved anywhere. We don&#8217;t need it; we know it&#8217;s the same cairo context that we passed in.</li>
<li>The nested code block is executed. It sets up the rotation and draws a rectangle.</li>
<li>At the end of the nested block, the <code>with</code> statement calls the <code>__exit__</code> method on the <code>GeneratorContextManager</code> object, passing the arguments <code>(None, None, None)</code> to indicate that no exception occurred.</li>
<li>The <code>__exit__</code> method calls <code>next</code> on the iterator (expecting a <code>StopIteration</code> exception).</li>
<li>Our generator resumes execution after the <code>yield</code> statement. This calls <code>restore</code> on the cairo context.</li>
<li>The generator returns, raising a <code>StopIteration</code> exception (as expected).</li>
<li>The <code>__exit__</code> method catches the <code>StopIteration</code> exception, and returns normally.</li>
</ol>
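<p>To make those steps more concrete, here&#8217;s a stripped-down imitation of the machinery. It is not the code in <code>contextlib.py</code>: the names <code>SimpleGeneratorContextManager</code>, <code>simple_contextmanager</code> and <code>FakeContext</code> are invented for this sketch, and only the no-exception path is implemented.</p>
<pre>
import functools

class SimpleGeneratorContextManager(object):
    """Simplified, hypothetical imitation of contextlib's internal class."""
    def __init__(self, gen):
        self.gen = gen

    def __enter__(self):
        # Run the generator up to its yield; the yielded value is returned
        return next(self.gen)

    def __exit__(self, type, value, traceback):
        if type is None:
            # Normal exit: resume after the yield and expect the generator to finish
            try:
                next(self.gen)
            except StopIteration:
                return False
            raise RuntimeError("generator didn't stop")
        # Exception path left out of this sketch; returning None lets it propagate

def simple_contextmanager(func):
    @functools.wraps(func)
    def factory(*args, **kwds):
        return SimpleGeneratorContextManager(func(*args, **kwds))
    return factory

class FakeContext(object):
    """Hypothetical stand-in for a cairo context; it only records calls."""
    def __init__(self):
        self.calls = []
    def save(self):
        self.calls.append('save')
    def restore(self):
        self.calls.append('restore')
    def stroke(self):
        self.calls.append('stroke')

@simple_contextmanager
def saved(cr):
    cr.save()
    yield cr
    cr.restore()

cr = FakeContext()
with saved(cr):
    cr.stroke()
print(cr.calls)   # ['save', 'stroke', 'restore']
</pre>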
<p>And that&#8217;s it! We&#8217;ve successfully used this generator function as a <code>with</code> statement context manager. In this example, it helped that no exceptions occurred. To correctly deal with exceptions, we&#8217;ll have to improve the generator function a little bit.</p>
<h3>Exception Handling</h3>
<p>Now, what happens if an exception occurs within the nested block while using this approach? Again, let&#8217;s suppose we&#8217;ve mistakenly passed the wrong number of arguments to the <code>rectangle</code> call. Here&#8217;s what would happen:</p>
<ol>
<li>The <code>rectangle</code> method raises a <code>TypeError</code> exception: &#8220;Context.rectangle() takes exactly 4 arguments.&#8221;</li>
<li>The <code>with</code> statement catches this exception.</li>
<li>The <code>with</code> statement calls <code>__exit__</code> on the <code>GeneratorContextManager</code> object. It passes information about the exception in three arguments: (<em>type</em>, <em>value</em>, <em>traceback</em>).</li>
<li><code>__exit__</code> calls <code><a href="http://docs.python.org/reference/expressions.html#generator.throw">throw</a></code> on the iterator, passing the same three arguments.</li>
<li>The <code>TypeError</code> exception is raised in the context of our generator function, on the line containing the <code>yield</code> statement.</li>
</ol>
<p>Uh oh! At this point, our current generator function has a problem: <code>restore</code> will <em>not</em> be called on the cairo context. An exception has been raised on the line containing the <code>yield</code> statement, so the rest of the generator function will not be executed. We need to make the generator more robust, by inserting a <code><a href="http://docs.python.org/tutorial/errors.html#defining-clean-up-actions">try/finally</a></code> block around the <code>yield</code>:</p>
<pre>
@contextmanager
def saved(cr):
    cr.save()
    <span class="highlight">try</span>:
        yield cr
    <span class="highlight">finally</span>:
        cr.restore()
</pre>
<p>Continuing where we left off:</p>
<ol start="6">
<li>Inside our generator, the <code>finally</code> block executes. This calls <code>restore</code> on the cairo context.</li>
<li>The <code>TypeError</code> exception went unhandled by the generator, so it is re-raised in the <code>__exit__</code> method, on the line containing the call to <code>throw</code> on the iterator. (line 35 of <code>contextlib.py</code>)</li>
<li>The <code>TypeError</code> exception is caught by <code>__exit__</code>.</li>
<li><code>__exit__</code> sees that the exception caught is the same exception that was passed in, and as a result, returns <code>None</code>.</li>
<li>The <code>with</code> statement checks to see whether this return value is true. Since it isn&#8217;t, the <code>with</code> statement re-raises the <code>TypeError</code> exception, to be handled by someone else.</li>
</ol>
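<p>The difference between the two generator functions can be checked with a short sketch. As before, <code>FakeContext</code> is a hypothetical stand-in for a cairo context, and <code>saved_fragile</code> and <code>draw_badly</code> are names invented for this example:</p>
<pre>
from contextlib import contextmanager

class FakeContext(object):
    """Hypothetical stand-in for a cairo context; it only records calls."""
    def __init__(self):
        self.calls = []
    def save(self):
        self.calls.append('save')
    def restore(self):
        self.calls.append('restore')
    def rectangle(self, x, y, width, height):
        self.calls.append('rectangle')

@contextmanager
def saved_fragile(cr):
    cr.save()
    yield cr
    cr.restore()          # skipped when the nested block raises

@contextmanager
def saved(cr):
    cr.save()
    try:
        yield cr
    finally:
        cr.restore()      # runs either way

def draw_badly(manager):
    cr = FakeContext()
    try:
        with manager(cr):
            cr.rectangle(-25, -60, 50)   # one argument short: TypeError
    except TypeError:
        pass
    return cr.calls

print(draw_badly(saved_fragile))   # ['save']
print(draw_badly(saved))           # ['save', 'restore']
</pre>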
<p>Thus concludes our journey through the Python <code>with</code> statement. If, like me, you&#8217;ve had a hard time understanding this statement completely &#8212; especially if you were attracted to the generator form of writing context managers &#8212; don&#8217;t feel bad. It&#8217;s complicated! It cleverly ties together several of Python&#8217;s language features, many of which were themselves introduced fairly recently in Python&#8217;s history. If any Pythonistas out there spot an error or oversight in the above explanation, please let me know in the comments.</p>
<h2>Drawing a Fractal Tree</h2>
<p>For those of you who have endured the entire blog post up to this point, here&#8217;s a small bonus script. It uses our newly minted cairo context manager to recursively draw a fractal tree.</p>
<pre>
import cairo
from contextlib import contextmanager

@contextmanager
def saved(cr):
    cr.save()
    try:
        yield cr
    finally:
        cr.restore()

def Tree(angle):
    cr.move_to(0, 0)
    cr.translate(0, -65)
    cr.line_to(0, 0)
    cr.stroke()
    cr.scale(0.72, 0.72)
    if angle > 0.12:
        for a in [-angle, angle]:
            <span class="highlight">with saved(cr):</span>
                cr.rotate(a)
                Tree(angle * 0.75)

surf = cairo.ImageSurface(cairo.FORMAT_ARGB32, 280, 204)
cr = cairo.Context(surf)
cr.translate(140, 203)
cr.set_line_width(5)
Tree(0.75)
surf.write_to_png('fractal-tree.png')
</pre>
<p><img src="http://preshing.com/wp-content/uploads/2011/09/tree.png" alt="" title="tree" width="280" height="204" class="aligncenter size-full wp-image-1696" /></p>
<p>For yet another example of <code>with</code> statement usage in Python, see <a href="http://preshing.com/20110924/timing-your-code-using-pythons-with-statement">Timing Your Code Using Python’s &#8220;with&#8221; Statement</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110920/the-python-with-statement-by-example/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Penrose Tiling Explained</title>
		<link>http://preshing.com/20110831/penrose-tiling-explained</link>
		<comments>http://preshing.com/20110831/penrose-tiling-explained#comments</comments>
		<pubDate>Wed, 31 Aug 2011 11:05:47 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1505</guid>
		<description><![CDATA[Last week, I posted some obfuscated Python which generates Penrose tiling. Today, I&#8217;ll explain the basic algorithm behind that Python script, and share the non-obfuscated version. The algorithm manipulates a list of red and blue isosceles triangles. Each red triangle &#8230; <a href="http://preshing.com/20110831/penrose-tiling-explained">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Last week, I posted some <a href="http://preshing.com/20110822/penrose-tiling-in-obfuscated-python">obfuscated Python which generates Penrose tiling</a>. Today, I&#8217;ll explain the basic algorithm behind that Python script, and share the non-obfuscated version.</p>
<p>The algorithm manipulates a list of red and blue isosceles triangles. Each red triangle has a 36&deg; angle at its apex, while each blue triangle has a 108&deg; angle.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/red-blue-triangle.png" alt="" title="red-blue-triangle" width="345" height="121" class="aligncenter size-full wp-image-1586" /></p>
<p>In Python, we can represent such triangles as tuples of the form <code>(color, A, B, C)</code>. For the first element, <code>color</code>, a value of 0 indicates a red triangle, while 1 indicates blue. The rest of the tuple gives the co-ordinates of the <strong>A</strong>, <strong>B</strong> and <strong>C</strong> vertices, expressed as <a href="http://docs.python.org/library/stdtypes.html#numeric-types-int-float-long-complex">complex numbers</a>. Complex numbers work well here since they can represent any point on the 2D plane &#8212; the real component gives the x co-ordinate, while the imaginary component gives the y co-ordinate.</p>
<p><span id="more-1505"></span>As you can see, we draw an outline along the sides of the triangle, but not along the base. This allows each triangle to connect with another triangle of the same color, forming the <a href="http://en.wikipedia.org/wiki/Rhombus">rhombus</a>-shaped tiles that are visible in the final Penrose tiling.</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/connection.png" alt="" title="connection" width="130" height="96" class="aligncenter size-full wp-image-1544" /></p>
<p>Now here&#8217;s the fun part. Given a list of such triangles, we can subdivide each one to generate another triangle list. A red triangle is subdivided into two smaller triangles as follows:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/red-triangle-subdivision.png" alt="" title="red-triangle-subdivision" width="234" height="121" class="aligncenter size-full wp-image-1536" /></p>
<p>The above subdivision introduces a new vertex <strong>P</strong>, located along the edge <strong>AB</strong> at the point which divides that edge according to the <a href="http://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>, <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1+%2B+%5Csqrt%7B5%7D%7D%7B2%7D&#038;bg=ffffff&#038;fg=000&#038;s=1' alt='&#92;frac{1 + &#92;sqrt{5}}{2}' title='&#92;frac{1 + &#92;sqrt{5}}{2}' class='latex' />: the distance from <strong>A</strong> to <strong>P</strong> is the length of <strong>AB</strong> divided by this ratio.</p>
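<p>As a quick numeric check of where <strong>P</strong> lands, here&#8217;s a small sketch. The coordinates chosen for <strong>A</strong> and <strong>B</strong> are arbitrary assumptions, and the snake_case names are mine; the formula for <code>P</code> is the one used in the subdivision code in this post.</p>
<pre>
import math

golden_ratio = (1 + math.sqrt(5)) / 2

# Arbitrary example edge: A at the origin, B one unit to the right.
A = 0j
B = complex(1, 0)

# P lies on AB, at distance |AB| / golden_ratio from A.
P = A + (B - A) / golden_ratio

AP = abs(P - A)   # the longer piece of the edge
PB = abs(B - P)   # the shorter piece of the edge

print(abs(B - A) / AP)   # whole edge compared to the longer piece
print(AP / PB)           # longer piece compared to the shorter piece
</pre>
<p>Both printed ratios come out equal to the golden ratio, roughly 1.618, which is the defining property of a golden section.</p>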
<p>Similarly, each blue triangle is subdivided into three smaller triangles:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/blue-triangle-subdivision.png" alt="" title="blue-triangle-subdivision" width="405" height="95" class="aligncenter size-full wp-image-1534" /></p>
<p>This subdivision introduces two new vertices: <strong>Q</strong> along the edge <strong>BA</strong>, and <strong>R</strong> along the edge <strong>BC</strong>, each placed so that its distance from <strong>B</strong> is the edge length divided by the golden ratio. In addition, two of the resulting triangles are <em>mirrored</em>. I&#8217;ve drawn a highlight in the corner of each triangle to help identify which ones are mirrored and which are not.</p>
<p>All of the above steps can be performed using just a few lines of Python. This function accepts a list of triangles represented as tuples, subdivides each one, and returns the new triangle list:</p>
<pre>
import math

goldenRatio = (1 + math.sqrt(5)) / 2

def subdivide(triangles):
    result = []
    for color, A, B, C in triangles:
        if color == 0:
            # Subdivide red triangle
            P = A + (B - A) / goldenRatio
            result += [(0, C, P, B), (1, P, C, A)]
        else:
            # Subdivide blue triangle
            Q = B + (A - B) / goldenRatio
            R = B + (C - B) / goldenRatio
            result += [(1, R, C, A), (1, Q, R, B), (0, R, Q, A)]
    return result
</pre>
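<p>To get a feel for how quickly the list grows, here&#8217;s a short usage sketch. It repeats the subdivision rules from above so that it runs on its own; the starting triangle&#8217;s coordinates are an arbitrary choice of mine (unit-length sides with a 36&deg; apex at the origin), and the snake_case name <code>golden_ratio</code> is just for this example.</p>
<pre>
import cmath
import math

golden_ratio = (1 + math.sqrt(5)) / 2

def subdivide(triangles):
    # Same subdivision rules as the function above.
    result = []
    for color, A, B, C in triangles:
        if color == 0:
            P = A + (B - A) / golden_ratio
            result += [(0, C, P, B), (1, P, C, A)]
        else:
            Q = B + (A - B) / golden_ratio
            R = B + (C - B) / golden_ratio
            result += [(1, R, C, A), (1, Q, R, B), (0, R, Q, A)]
    return result

# One red triangle: apex A at the origin, B and C on the unit circle, 36 degrees apart.
triangles = [(0, 0j, cmath.rect(1, -math.pi / 10), cmath.rect(1, math.pi / 10))]

counts = []
for step in range(4):
    triangles = subdivide(triangles)
    reds = sum(1 for t in triangles if t[0] == 0)
    counts.append((reds, len(triangles) - reds))

print(counts)  # [(1, 1), (2, 3), (5, 8), (13, 21)]
</pre>
<p>Each red triangle yields one red and one blue, and each blue yields one red and two blues, so the (red, blue) counts walk through consecutive Fibonacci numbers.</p>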
<p>And here&#8217;s some code to actually draw the triangle list. It uses <a href="http://cairographics.org/pycairo/">pycairo</a>, a Python wrapper around the excellent <a href="http://cairographics.org/">cairo</a> drawing library.</p>
<pre>
# Draw red triangles
for color, A, B, C in triangles:
    if color == 0:
        cr.move_to(A.real, A.imag)
        cr.line_to(B.real, B.imag)
        cr.line_to(C.real, C.imag)
        cr.close_path()
cr.set_source_rgb(1.0, 0.35, 0.35)
cr.fill()

# Draw blue triangles
for color, A, B, C in triangles:
    if color == 1:
        cr.move_to(A.real, A.imag)
        cr.line_to(B.real, B.imag)
        cr.line_to(C.real, C.imag)
        cr.close_path()
cr.set_source_rgb(0.4, 0.4, 1.0)
cr.fill()

# Determine line width from size of first triangle
color, A, B, C = triangles[0]
cr.set_line_width(abs(B - A) / 10.0)
cr.set_line_join(cairo.LINE_JOIN_ROUND)

# Draw outlines
for color, A, B, C in triangles:
    cr.move_to(C.real, C.imag)
    cr.line_to(A.real, A.imag)
    cr.line_to(B.real, B.imag)
cr.set_source_rgb(0.2, 0.2, 0.2)
cr.stroke()
</pre>
<p>Using all of the above code, we can, for example, start with a single red triangle, subdivide it several times, and draw the result after each subdivision. You can see the tiling pattern begin to emerge:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/triangle-sequence.png" alt="" title="triangle-sequence" width="504" height="102" class="aligncenter size-full wp-image-1547" /></p>
<p>You can even begin the sequence using another triangle list. Here&#8217;s some code to start with a &#8220;wheel&#8221; shape consisting of 10 red triangles:</p>
<pre>
import cmath
import math

# Create wheel of red triangles around the origin
triangles = []
for i in range(10):
    B = cmath.rect(1, (2*i - 1) * math.pi / 10)
    C = cmath.rect(1, (2*i + 1) * math.pi / 10)
    if i % 2 == 0:
        B, C = C, B  # Make sure to mirror every second triangle
    triangles.append((0, 0j, B, C))
</pre>
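<p>As a sanity check on the wheel, the sketch below rebuilds it and measures the signed angle at each triangle&#8217;s apex. The list name <code>apex_angles</code> is my own, and I&#8217;ve used <code>range</code> so the check runs on both Python 2 and Python 3.</p>
<pre>
import cmath
import math

# Build the wheel of 10 red triangles around the origin.
triangles = []
for i in range(10):
    B = cmath.rect(1, (2*i - 1) * math.pi / 10)
    C = cmath.rect(1, (2*i + 1) * math.pi / 10)
    if i % 2 == 0:
        B, C = C, B  # Mirror every second triangle
    triangles.append((0, 0j, B, C))

# Signed angle at the apex A, measured from side AB to side AC, in degrees.
apex_angles = []
for color, A, B, C in triangles:
    apex_angles.append(math.degrees(cmath.phase((C - A) / (B - A))))

print([round(a, 6) for a in apex_angles])
</pre>
<p>Every angle has a magnitude of 36 degrees, as a red triangle requires, and the sign flips from one triangle to the next, which is the mirroring at work.</p>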
<p>If we subdivide this wheel shape repeatedly, we get the following sequence of tilings. Notice how symmetric each tiling is, with five-fold rotational symmetry as well as reflective symmetry across 5 different axes:</p>
<p><img src="http://preshing.com/wp-content/uploads/2011/08/wheel-sequence.png" alt="" title="wheel-sequence" width="519" height="334" class="aligncenter size-full wp-image-1549" /></p>
<p>If you study either the top or bottom row of this sequence carefully, you&#8217;ll notice that for each tiling except the first, an upside-down copy appears in the tiling to the right. I&#8217;ve drawn some yellow outlines to make this more obvious. Looking at it another way: if you take any of these tilings, subdivide it twice, flip it vertically and enlarge the result, you&#8217;ve basically <em>added another ring</em> around the tiling. By repeating this process indefinitely, you can see how a Penrose tiling could be made to completely fill the entire plane.</p>
<p>Finally, here&#8217;s a (non-obfuscated) Python script which ties everything together: download <a href="http://preshing.com/files/penrose.py">penrose.py</a>. It starts with a wheel pattern, subdivides it 10 times, and renders the enlarged, cropped result inside a 1000&#215;1000 image.</p>
<p>I pieced this explanation together from various sources: mainly <a href="http://www.math.ubc.ca/~cass/courses/m308-02b/projects/schweber/penrose.html">this page at UBC</a> and the <a href="http://en.wikipedia.org/wiki/Penrose_tiling">Wikipedia entry</a>. Mind you, this is <em>not</em> the only algorithm which can generate a Penrose tiling. Another method involves <a href="http://www.quadibloc.com/math/pen06.htm">projecting a 5-dimensional set of lattice points onto a 2D plane</a>. I haven&#8217;t taken the time to fully understand that one, but it seems to open up the possibility of <a href="http://condellpark.com/kd/q530center.gif">interesting color patterns</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110831/penrose-tiling-explained/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Penrose Tiling in Obfuscated Python</title>
		<link>http://preshing.com/20110822/penrose-tiling-in-obfuscated-python</link>
		<comments>http://preshing.com/20110822/penrose-tiling-in-obfuscated-python#comments</comments>
		<pubDate>Mon, 22 Aug 2011 11:51:10 +0000</pubDate>
		<dc:creator>Jeff Preshing</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://preshing.com/?p=1436</guid>
		<description><![CDATA[Who says you can&#8217;t write obfuscated Python? Here&#8217;s a Python script which renders some Penrose tiling. Yes, this is valid Python code: _ =\ """if! 1:"e,V=100 0,(0j-1)**-.2; v,S=.5/ V.real, [(0,0,4 *e,4*e* V)];w=1 -v"def! E(T,A, B,C):P ,Q,R=B*w+ A*v,B*w+C *v,A*w+B*v;retur n[(1,Q,C,A),(1,P ,Q,B),(0,Q,P,A)]*T+[(0,C &#8230; <a href="http://preshing.com/20110822/penrose-tiling-in-obfuscated-python">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Who says you <a href="http://blog.delaguardia.com.mx/obfuscated-python-contest">can&#8217;t</a> <a href="http://code.activestate.com/lists/python-list/16171/">write</a> obfuscated Python?</p>
<p>Here&#8217;s a Python script which renders some <a href="http://en.wikipedia.org/wiki/Penrose_tiling">Penrose tiling</a>. Yes, this is valid Python code:</p>
<pre>
_                                 =\
                                """if!
                              1:"e,V=100
                            0,(0j-1)**-.2;
                           v,S=.5/  V.real,
                         [(0,0,4      *e,4*e*
                       V)];w=1          -v"def!
                      E(T,A,              B,C):P
                  ,Q,R=B*w+                A*v,B*w+C
            *v,A*w+B*v;retur              n[(1,Q,C,A),(1,P
     ,Q,B),(0,Q,P,A)]*T+[(0,C            ,R,B),(1,R,C,A)]*(1-T)"f
or!i!in!_[:11]:S       =sum([E          (*x)for       !x!in!S],[])"imp
  ort!cair               o!as!O;      s=O.Ima               geSurfac
   e(1,e,e)               ;c=O.Con  text(s);               M,L,G=c.
     move_to                ,c.line_to,c.s                et_sour
       ce_rgb                a"def!z(f,a)                :f(-a.
        imag,a.       real-e-e)"for!T,A,B,C!in[i       !for!i!
          in!S!if!i[""";exec(reduce(lambda x,i:x.replace(chr
           (i),"\n "[34-i:]),   range(   35),_+"""0]]:z(M,A
             );z(L,B);z         (L,C);         c.close_pa
             th()"G             (.4,.3             ,1);c.
             paint(             );G(.7             ,.7,1)
             ;c.fil             l()"fo             r!i!in
             !range             (9):"!             g=1-i/
             8;d=i/          4*g;G(d,d,d,          1-g*.8
             )"!def     !y(f,a):z(f,a+(1+2j)*(     1j**(i
             /2.))*g)"!for!T,A,B,C!in!S:y(M,C);y(L,A);y(M
             ,A);y(L,B)"!c.st            roke()"s.write_t
             o_png('pen                        rose.png')
             """                                       ))
</pre>
<p><span id="more-1436"></span>When this program runs, it outputs a 1000&#215;1000 image file to <code>penrose.png</code>, consisting of about 2212 Penrose tiles rendered with a 3D relief effect. Here&#8217;s a slice of the image (click to enlarge):</p>
<p><a href="http://preshing.com/wp-content/uploads/2011/08/penrose.jpg"><img src="http://preshing.com/wp-content/uploads/2011/08/penrose-cropped.jpg" alt="" width="535" height="463" class="aligncenter size-full wp-image-1454" /></a></p>
<p>The script requires <a href="http://cairographics.org/pycairo/">Pycairo</a>. It runs only on Python 2.7 or earlier; Python 3 is not supported. It started life as a regular Python script, but in my effort to make the code more compact, I got a bit carried away.</p>
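<p>For the curious, the decoding step buried in the middle of the script can be written out more plainly. This is a sketch rather than the original source: <code>decode</code> and <code>sample</code> are names made up for illustration, and the <code>reduce(...)</code> expression is rewritten as an ordinary loop so that it also runs on Python 3.</p>
<pre>
def decode(art):
    """Undo the whitespace-art encoding used by the obfuscated script."""
    for i in range(35):
        # i from 0 to 32: the slice is empty, so whitespace and control
        #                 characters are deleted
        # i == 33:        "!" becomes a single space
        # i == 34:        a double quote becomes a newline plus a one-space indent
        art = art.replace(chr(i), "\n "[34 - i:])
    return art

# A tiny made-up program, laid out with arbitrary whitespace.
sample = '''if!
   1:"x   =4
 0"pri   nt(x
     +2)'''

decoded = decode(sample)
print(decoded)
exec(decoded)  # prints 42
</pre>
<p>Since every real space and line break is thrown away, the text can be arranged into any shape; the <code>if 1:</code> at the top is what makes the one-space indent on each following line legal.</p>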
<p>Penrose tilings are cool because they cover the entire plane in an aperiodic way &mdash; a shifted copy of the image never matches the original. They were invented by Sir Roger Penrose after a series of attempts to tile the plane with pentagonal shapes. For an explanation of the algorithm behind this script, see my next post, <a href="http://preshing.com/20110831/penrose-tiling-explained">Penrose Tiling Explained</a>.</p>
<p>Python never got much credit as an obfuscated programming language, compared to C or Perl. It seems a contest never took place, and there aren&#8217;t too many examples of obfuscated Python on the web. You&#8217;ll find a few examples <a href="http://docs.python.org/faq/programming.html#is-it-possible-to-write-obfuscated-one-liners-in-python">in the official Python FAQ</a> and on various pages such as <a href="http://p-nand-q.com/python/obfuscated_python.html">here</a> and <a href="http://c2.com/cgi/wiki?ObfuscatedPython">here</a>. There was also a <a href="http://pycon.tv/video/46/">talk at PyCon 2011</a>.</p>
<p>I believe this is the first example of obfuscated Python which outputs a high-resolution image. You&#8217;ll find another example in my followup post, <a href="http://preshing.com/20110926/high-resolution-mandelbrot-in-obfuscated-python">High-Resolution Mandelbrot in Obfuscated Python</a>. If you know of any others, let me know in the comments!</p>
]]></content:encoded>
			<wfw:commentRss>http://preshing.com/20110822/penrose-tiling-in-obfuscated-python/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

