Preshing on Programming

Implementing a Recursive Mutex

When optimizing code for multiple CPU cores, sometimes you need to write a new synchronization primitive. I don’t mean to encourage it, but it does happen. And if you’re going to do it, you might as well start by looking at a few examples. This won’t save you from shooting yourself in the foot, but it may help reduce the number of times, so you can walk away with a few toes remaining.

In my previous post, I showed how to implement a synchronization primitive known as the Benaphore in C++ on Win32. The Benaphore is not lock-free (being a lock itself), but it does serve as a simple yet instructive example of writing a synchronization primitive in user space. It also offers very low overhead when there’s no lock contention.

One limitation of the implementation I showed was that it was non-recursive. This means that if the same thread attempts to obtain the same lock twice, it will deadlock. In this post, I’ll show how to extend the implementation to support recursive locking.
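To illustrate the idea before diving into the Win32 details, here's a minimal sketch of recursive locking in Python rather than the post's C++: track the owning thread and a recursion count on top of a plain non-recursive lock. The class and names are mine, for illustration only.

```python
import threading

class RecursiveLock:
    """Sketch of a recursive lock built on a plain (non-recursive) lock.

    Tracks the owning thread and a recursion count, so the same thread
    can acquire the lock multiple times without deadlocking.
    """
    def __init__(self):
        self._lock = threading.Lock()   # the underlying non-recursive lock
        self._owner = None              # thread id of the current owner
        self._count = 0                 # times the owner has acquired it

    def acquire(self):
        me = threading.get_ident()
        # Reading _owner without holding the lock is safe: it can only
        # equal our own id if we ourselves stored it there.
        if self._owner == me:
            self._count += 1            # already held: just bump the count
            return
        self._lock.acquire()
        self._owner = me
        self._count = 1

    def release(self):
        assert self._owner == threading.get_ident()
        self._count -= 1
        if self._count == 0:
            self._owner = None          # clear ownership before unlocking
            self._lock.release()
```

Only the final release by the owning thread actually unlocks the underlying lock, which is the same structure the real implementation uses.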

Roll Your Own Lightweight Mutex

In an earlier post, I pointed out the importance of using a lightweight mutex. I also mentioned it was possible to write your own, provided you can live with certain limitations.

Why would you do such a thing? Well, in the past, some platforms (like BeOS) didn’t provide a lightweight mutex in the native API. Today, that’s not really a concern. I’m mainly showing this because it’s an interesting look at implementing synchronization primitives in general. As a curiosity, it just so happens this implementation shaves almost 50% off the overhead of the Windows Critical Section in the uncontended case. [Update: A closer look shows that under very high contention, a Windows Critical Section still performs much better.]

For the record, there are numerous ways to write your own mutex – or lock – entirely in user space, each with its own tradeoffs:

  • Spin locks. These employ a busy-wait strategy which has the potential to waste CPU time, and in the worst case, can lead to livelock when competing threads run on the same core. Still, some programmers have found measurable speed improvements switching to spin locks in certain cases.
  • Peterson’s algorithm is like a spinlock for two threads. A neat trick, but seems useless on today’s platforms. I find it noteworthy because Bartosz Milewski used this algorithm as a case study while discussing the finer points of the x86 memory model. Others have discussed Peterson’s lock in similar contexts.
  • Charles Bloom has a long writeup describing various mutex implementations. Excellent information, but possibly Greek to anyone unfamiliar with C++11’s atomics library and Relacy’s ($) notation.
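Of the approaches above, the spin lock is the simplest to sketch. Here's a toy version in Python (illustrative only; the busy-wait is what matters, and in CPython the GIL makes a real spin lock pointless), which loops on a non-blocking acquire instead of putting the thread to sleep:

```python
import threading

class SpinLock:
    """Toy spin lock: busy-waits instead of putting the thread to sleep."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Spin until the non-blocking acquire succeeds. This burns CPU
        # time while waiting -- the tradeoff described above.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()
```

A real spin lock on x86 would use an atomic compare-and-swap plus a pause instruction in the loop; the structure, though, is exactly this.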

A Look Back at Single-Threaded CPU Performance

Throughout the ’80s and ’90s, CPUs were able to run virtually any kind of software twice as fast every 18 to 20 months. The rate of change was incredible. Your 486SX-16 was almost obsolete by the time you got it through the door. But eventually, at some point in the mid-2000s, progress slowed down considerably for single-threaded software – which was most software.

Perhaps the turning point came in May 2004, when Intel canceled its latest single-core development effort to focus on multicore designs. Later that year, Herb Sutter wrote his now-famous article, The Free Lunch Is Over. Not all software will run remarkably faster year-over-year anymore, he warned us. Concurrent software would continue its meteoric rise, but single-threaded software was about to get left in the dust.

So, what’s happened since 2004? Clearly, multicore computing has become mainstream. Everybody acknowledges that single-threaded CPU performance no longer increases as quickly as it previously did – but at what rate is it actually increasing?

It’s tough to find an answer. Bill Dally of NVIDIA threw out a few numbers in a recent presentation: he had predicted 19% per year, but says it’s turned out closer to 5%. Last year, Chuck Moore of AMD presented a graph suggesting that single-threaded CPU performance has recently started going backwards.

A C++ Profiling Module for Multithreaded APIs

In my post about lock contention, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling module, which I’ll share here.

A profiling module is different from conventional profilers like xperf or VTune in that no third-party tools are required. You just drop the module into any C++ application, and the process collects and reports performance data by itself.

This particular profiling module is meant to act on one or more target modules in the application. A target module can be anything that exposes a well-defined API, such as a memory allocator. To make it work, you must insert a macro named API_PROFILER into every public function exposed by that API. Below, I’ve added it to dlmalloc, one of the functions in the Doug Lea Malloc API. The same macro should be added to dlrealloc, dlfree, and the other public functions as well.

DEFINE_API_PROFILER(dlmalloc);

void* dlmalloc(size_t bytes)
{
    API_PROFILER(dlmalloc);

#if USE_LOCKS
    ensure_initialization();
#endif

    if (!PREACTION(gm))
    {
        void* mem;
        size_t nb;
        if (bytes <= MAX_SMALL_REQUEST)
        {
            ...

Always Use a Lightweight Mutex

In multithreaded programming, we often speak of locks (also known as mutexes). But a lock is only a concept. To actually use that concept, you need an implementation. As it turns out, there are many ways to implement a lock, and those implementations vary wildly in performance.

The Windows SDK provides two lock implementations for C/C++: the Mutex and the Critical Section. (As Ned Batchelder points out, Critical Section is probably not the best name to give to the lock itself, but we’ll forgive that here.)

The Windows Critical Section is what we call a lightweight mutex. It’s optimized for the case when there are no other threads competing for the lock. To demonstrate using a simple example, here’s a single thread which locks and unlocks a Windows Mutex exactly one million times.

HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
CloseHandle(mutex);
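For a rough analogue outside Win32, here's the same measurement pattern sketched in Python. The absolute numbers aren't comparable to the Windows Mutex above, since threading.Lock is itself a lightweight lock, but the shape of the experiment is identical:

```python
import threading
import time

# Lock and unlock an uncontended threading.Lock one million times,
# mirroring the Windows Mutex loop above.
lock = threading.Lock()

start = time.perf_counter()
for _ in range(1_000_000):
    lock.acquire()
    lock.release()
elapsed = time.perf_counter() - start

print(f"{elapsed * 1e9 / 1_000_000:.1f} ns per lock/unlock pair")
```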

Locks Aren’t Slow; Lock Contention Is

Locks (also known as mutexes) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon wrote, “Most people have the misconception that locks are slow.” 25 years later, this misconception still seems to pop up once in a while.

It’s true that locking is slow on some platforms, or when the lock is highly contended. And when you’re developing a multithreaded application, it’s very common to find a huge performance bottleneck caused by a single lock. But that doesn’t mean all locks are slow. As I’ll show in this post, sometimes a locking strategy achieves excellent performance.

Perhaps the most easily overlooked source of this misconception is that not all programmers are aware of the difference between a lightweight mutex and a “kernel mutex”. I’ll talk about that in my next post, Always Use a Lightweight Mutex. For now, let’s just say that if you’re programming in C/C++ on Windows, the Critical Section object is the one you want.

Other times, the conclusion that locks are slow is supported by a benchmark. For example, this post measures the performance of a lock under heavy conditions: each thread must hold the lock to do any work (high contention), and the lock is held for an extremely short interval of time (high frequency). It’s a good read, but in a real application, you generally want to avoid using locks in that way. To put things in context, I’ve devised a benchmark which includes both best-case and worst-case usage scenarios for locks.
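As a toy sketch of those two extremes (in Python, where the GIL muddies the picture; a real benchmark like the post's would be native code), the worst case is every thread taking the same lock for every unit of work, while the best case is the same lock taken with no other thread competing:

```python
import threading
import time

lock = threading.Lock()
N = 200_000

def hammer():
    # High frequency, high contention: every unit of work holds the lock.
    for _ in range(N):
        with lock:
            pass

# Best case for a lock: a single thread, so acquisition never contends.
start = time.perf_counter()
hammer()
uncontended = time.perf_counter() - start

# Worst case: two threads fighting over the same lock at high frequency.
threads = [threading.Thread(target=hammer) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
contended = time.perf_counter() - start

print(f"uncontended: {uncontended:.3f}s  contended: {contended:.3f}s")
```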

How to Generate Random Timings for a Poisson Process

What’s a Poisson process, and how is it useful?

Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process.

For example, the USGS estimates that each year, there are approximately 13000 earthquakes of magnitude 4+ around the world. Those earthquakes are scattered randomly throughout the year, but there are more or less 13000 per year. That’s one example of a Poisson process. The Wikipedia page lists several others.

In statistics, there are a bunch of functions and equations to help model a Poisson process. I’ll present one of those functions in this post, and demonstrate its use in writing a simulation.

The Exponential Distribution

If 13000 such earthquakes happen every year, it means that, on average, one earthquake happens every 40 minutes. So, let’s define a variable λ = 1/40 and call it the rate parameter. The rate parameter λ is a measure of frequency: the average rate of events (in this case, earthquakes) per unit of time (in this case, minutes).
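Given λ, the gap between consecutive events follows the exponential distribution, and a random gap can be generated by inverse transform sampling: take a uniform random number U in [0, 1) and compute −ln(1 − U)/λ. A sketch of that function (the name next_time is mine):

```python
import math
import random

def next_time(rate_parameter):
    """Sample the waiting time until the next event of a Poisson process.

    Inverse transform sampling on the exponential distribution:
    -ln(1 - U) / lambda, where U is uniform on [0, 1).
    """
    return -math.log(1.0 - random.random()) / rate_parameter

# With the earthquake rate lambda = 1/40 (one event per 40 minutes on
# average), the sampled gaps should average out near 40 minutes.
samples = [next_time(1.0 / 40) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 40
```

Summing successive calls to next_time gives the event timestamps for a simulation.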

High-Resolution Mandelbrot in Obfuscated Python

Here’s a followup to last month’s post about Penrose Tiling in Obfuscated Python.

The Mandelbrot set is a traditional favorite among authors of obfuscated code. You can find obfuscated code in C, Perl, Haskell, Python and many other languages. Nearly all examples render the Mandelbrot set as ASCII art.

The following Python script, on the other hand, begins as ASCII art:

_                                      =   (
                                        255,
                                      lambda
                               V       ,B,c
                             :c   and Y(V*V+B,B,  c
                               -1)if(abs(V)<6)else
               (              2+c-4*abs(V)**-0.4)/i
                 )  ;v,      x=1500,1000;C=range(v*x
                  );import  struct;P=struct.pack;M,\
            j  ='<QIIHHHH',open('M.bmp','wb').write
for X in j('BM'+P(M,v*x*3+26,26,12,v,x,1,24))or C:
            i  ,Y=_;j(P('BBB',*(lambda T:(T*80+T**9
                  *i-950*T  **99,T*70-880*T**18+701*
                 T  **9     ,T*i**(1-T**45*2)))(sum(
               [              Y(0,(A%3/3.+X%v+(X/v+
                               A/3/3.-x/2)/1j)*2.5
                             /x   -2.7,i)**2 for  \
                               A       in C
                                      [:9]])
                                        /9)
                                       )   )

Timing Your Code Using Python’s “with” Statement

It’s common to want to time a piece of code. In Python, the with statement provides a convenient way to do so.

If you’ve followed my previous post, The Python with Statement by Example, you should have no problem following along here. All you need is a class which implements the __enter__ and __exit__ methods:

import time

class Timer:
    def __enter__(self):
        # time.clock was removed in Python 3.8; perf_counter is the
        # recommended high-resolution timer for measuring intervals.
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.interval = self.end - self.start

Of course, this is not a revolutionary idea. We’re just subtracting a couple of timer readings. If you google around, you’ll find several people suggesting the with statement be used in the same way. I’ve only tweaked the implementation details.
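Putting the class to use looks like this. The snippet repeats the class so it runs standalone, using time.perf_counter, since time.clock was removed in Python 3.8:

```python
import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.interval = self.end - self.start

# Time an arbitrary chunk of work:
with Timer() as t:
    total = sum(range(1_000_000))

print(f"Took {t.interval:.6f} seconds")
```

After the block exits, t.interval holds the elapsed time in seconds, even if the block raised an exception.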

The Python “with” Statement by Example

Python’s with statement was first introduced five years ago, in Python 2.5. It’s handy when you have two related operations which you’d like to execute as a pair, with a block of code in between. The classic example is opening a file, manipulating the file, then closing it:

with open('output.txt', 'w') as f:
    f.write('Hi there!')

The above with statement will automatically close the file after the nested block of code. (Continue reading to see exactly how the close occurs.) The advantage of using a with statement is that it is guaranteed to close the file no matter how the nested block exits. If an exception occurs before the end of the block, it will close the file before the exception is caught by an outer exception handler. If the nested block were to contain a return statement, or a continue or break statement, the with statement would automatically close the file in those cases, too.
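To see roughly how the close occurs, the with statement above behaves much like this hand-written try/finally (a simplified sketch that ignores the exception-suppression machinery of __exit__):

```python
f = open('output.txt', 'w')
try:
    f.write('Hi there!')
finally:
    # Runs no matter how the block exits: normally, via return/break/
    # continue, or because an exception is propagating.
    f.close()
```

The with statement packages this pattern up so the cleanup call can never be forgotten.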