Always Use a Lightweight Mutex

November 24, 2011

In multithreaded programming, we often speak of locks (also known as mutexes). But a lock is only a concept. To actually use that concept, you need an implementation. As it turns out, there are many ways to implement a lock, and those implementations vary wildly in performance.

The Windows SDK provides two lock implementations for C/C++: the Mutex and the Critical Section. (As Ned Batchelder points out, Critical Section is probably not the best name to give to the lock itself, but we’ll forgive that here.)

The Windows Critical Section is what we call a lightweight mutex. It’s optimized for the case when there are no other threads competing for the lock. To demonstrate using a simple example, here’s a single thread which locks and unlocks a Windows Mutex exactly one million times.

HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
CloseHandle(mutex);

Here's the same experiment using a Windows Critical Section.

CRITICAL_SECTION critSec;
InitializeCriticalSection(&critSec);
for (int i = 0; i < 1000000; i++)
{
    EnterCriticalSection(&critSec);
    LeaveCriticalSection(&critSec);
}	
DeleteCriticalSection(&critSec);

If you insert some timing code around the inner loop, and divide the result by one million, you'll find the average time required for a pair of lock/unlock operations in both cases. I did that, and ran the experiment on two different processors. The results:

The Critical Section is 25 times faster. As Larry Osterman explains, the Windows Mutex enters the kernel every time you use it, while the Critical Section does not. The tradeoff is that you can't share a Critical Section between processes. But who cares? Most of the time, you just want to protect some data within a single process. (It is actually possible to share a lightweight mutex between processes - just not using a Critical Section. See Roll Your Own Lightweight Mutex for example.)

Now, suppose you have a thread which acquires a Critical Section 100000 times per second, and there are no other threads competing for the lock. Based on the above figures, you can expect to pay between 0.2% and 0.6% in lock overhead. Not too bad! At lower frequencies, the overhead becomes negligible.

Other Platforms

In MacOS 10.6.6, a lock implementation is provided using the POSIX Threads API. It's a lightweight mutex which doesn't enter the kernel unless there's contention. A pair of uncontended calls to pthread_mutex_lock and pthread_mutex_unlock takes about 92 ns on my 1.86 GHz Core 2 Duo. Interestingly, it detects when there's only one thread running, and in that case switches to a trivial codepath taking only 38 ns.

MacOS also offers NSLock, an Objective-C class, but this is really just a wrapper around the aforementioned POSIX mutex. Because each operation must wind its way through objc_msgSend, the overhead is a little higher: 155 ns on my Core 2 Duo, or 98 ns if there's only a single thread.

Naturally, Ubuntu 11.10 provides a lock implementation using the POSIX Threads API as well. It's another lightweight mutex, based on a Linux-specific construct known as a futex. A pair of pthread_mutex_lock/pthread_mutex_unlock calls takes about 66 ns on my Core 2 Duo. You can even share this implementation between processes, but I didn't test that.

Even the Playstation 3 SDK offers a choice between a lightweight mutex and a heavy one. Back in 2007, early in the development of a Playstation 3 game I worked on, we were using the heavy mutex. Switching to the lightweight mutex made the game start 17 seconds faster! For me, that's when the difference really hit home.

In my previous post, I argued against the misconception that locks are slow and provided some data to support the argument. At this point, it should be clear that if you aren't using a lightweight mutex, the entire argument goes out the window. I'm fairly sure that the existence of heavy lock implementations has only added to this misconception over the years.

Some of you old-timers may point out ancient platforms where a heavy lock was the only implementation available, or when a semaphore had to be used for the job. But it seems all modern platforms offer a lightweight mutex. And even if they didn't, you could write your own lightweight mutex at the application level, even sharing it between processes, provided you're willing to live with certain caveats. You'll find one example in my followup post, Roll Your Own Lightweight Mutex.

9 Comments

  • Reply Jean-Francois Dube on November 25, 2011

    Nice posts Jeff, very enlightening. I also came to the conslusion, as described in this post (http://jfdube.wordpress.com/2011/09/24/lessons-learnt-while-spinning), that using the OS provided locking primitives is almost always a benefit, especially on platforms without thread priority inversion.

    Also, once again, very nice graphs! ;)

  • Reply DAN THE MAN on December 13, 2011

    Very interesting indeed….

    I’m wondering if this would help me unlock Elisabeth Taylor’s rings and diamonds locks ?

    But seriouly you should think of translating all this in the language of Moliere you would get over a hundred hits a day…Peut etre !

    Lol ,I hope you’re well and will have a good time for christmas…Hope more people get interested in your great work….

  • Reply Mariusz Lapinski on December 14, 2011

    Mutexes can be shared across the processes, however critical sections and slim reader/writer locks can be used on synchronization primitives in the shared memory.

    I would suggest using slim reader/writer locks whenever applicable (they are not recursive). On the second place, I would place critical sections, but with spin locks. If the performance of slim r/w is still not satisfying, one can consider using lockless mechanisms.

    Please refer to: http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx

  • Reply Andrew on January 17, 2012

    Very interesting post Jeff. I was able to replicate your test with a single thread locking and unlocking, but when I tried to apply it to actual code, switching from the mutex to the windows “critical section” actually made it slower – about 10%.

    I then ran the actual code with a single thread, and noticed that the CS was faster with the one thread until the OS would switch threads, in which case there would be a significant slowdown. I should note that my test machine has two actual CPUs (and multiple cores on each CPU), so maybe the cost of synchronizing the cache isn’t actually that negligible?

    • Reply Jeff Preshing on January 17, 2012

      That sounds bizarre Andrew. You’re certain the mutex is working? Perhaps you could take a look at the thread activity in xperf to get some clue as to what’s happening.

      As a sanity check, I tried re-running the test suite from Locks Aren’t Slow; Lock Contention Is using a mutex instead of a critical section. As expected, the critical section was consistently faster.

      • Reply Andrew on January 18, 2012

        So I profiled it, and the critical sections did have less waiting time (and less waits total), but the code was slower. I discovered that it’s an issue with the incremental compiling (using the Intel compiler). If I do full rebuilds, critical sections are in fact faster. D’oh.

  • Reply Rich on January 24, 2012

    As I was reading through your post I also remembered the old school multi-process mutex in the PlayStation3 SDK. My guess is that at the start of a platform’s life, the libraries include all of the components that were needed to prepare the OS for use by the platform to run lots of processes and do lots of exciting things. It’s only later that the more optimal lightweight mutexes arrive at which point it’s a nice free performance boost for those who upgrade.

  • Reply Gemera on October 24, 2012

    Overall v.interesting and chimes with other sources.

    Only one small thing – you say about Critical Section:

    “It’s optimized for the case when there are no other threads competing for the lock.”

    I’m assuming that’s a typo and should read “processes” rather than “threads”.

Leave a Reply