Roll Your Own Lightweight Mutex

February 26, 2012

In an earlier post, I pointed out the importance of using a lightweight mutex. I also mentioned it was possible to write your own, provided you can live with certain limitations.

Why would you do such a thing? Well, in the past, some platforms (like BeOS) didn’t provide a lightweight mutex in the native API. Today, that’s not really a concern. I’m mainly showing this because it’s an interesting look at implementing synchronization primitives in general. As a curiosity, it just so happens this implementation shaves almost 50% off the overhead of the Windows Critical Section in the uncontended case. [Update: A closer look shows that under very high contention, a Windows Critical Section still performs much better.]

For the record, there are numerous ways to write your own mutex — or lock — entirely in user space, each with its own tradeoffs:

  • Spin locks. These employ a busy-wait strategy which has the potential to waste CPU time, and in the worst case, can lead to livelock when competing threads run on the same core. Still, some programmers have found measurable speed improvements switching to spin locks in certain cases.
  • Peterson’s algorithm is like a spinlock for two threads. A neat trick, but seems useless on today’s platforms. I find it noteworthy because Bartosz Milewski used this algorithm as a case study while discussing the finer points of the x86 memory model. Others have discussed Peterson’s lock in similar contexts.
  • Charles Bloom has a long writeup describing various mutex implementations. Excellent information, but possibly greek to anyone unfamiliar with C++11′s atomics library and Relacy‘s ($) notation.

Some of those implementations are pretty advanced. Here’s a relatively simple technique, using a semaphore and some atomic operations. I came up with it while writing my post about lock contention, but soon afterwards learned it was already in use as far back as 1996, when some engineers referred to it as the Benaphore. Here’s a C++ implementation for Win32, assuming an x86 CPU architecture:

#include <windows.h>
#include <intrin.h>

class Benaphore
{
private:
    LONG m_counter;
    HANDLE m_semaphore;

public:
    Benaphore()
    {
        m_counter = 0;
        m_semaphore = CreateSemaphore(NULL, 0, 1, NULL);
    }

    ~Benaphore()
    {
        CloseHandle(m_semaphore);
    }

    void Lock()
    {
        if (_InterlockedIncrement(&m_counter) > 1) // x86/64 guarantees acquire semantics
        {
            WaitForSingleObject(m_semaphore, INFINITE);
        }
    }

    void Unlock()
    {
        if (_InterlockedDecrement(&m_counter) > 0) // x86/64 guarantees release semantics
        {
            ReleaseSemaphore(m_semaphore, 1, NULL);
        }
    }
};

This implementation also serves as a convenient introduction to atomics, which are at the heart of many lock-free algorithms.

_InterlockedIncrement is an atomic read-modify-write (RMW) operation on Win32. If multiple threads attempt an atomic RMW at the same time, on the same piece of data, they will all line up in a row and execute one-at-a-time. This makes it possible to reason about what happens, and ensure correctness. It even works on multicore and multiprocessor systems. (For more information about atomic operations, check out this post by JF Dubé.)

Every modern processor provides some atomic RMW instructions, though the APIs provided by the toolchain may expose different meanings for the return values. On Win32, _InterlockedIncrement adds 1 to the specified integer and returns the new value. Since m_counter is initialized to 0, the first thread to call Lock will receive a return value of 1 from _InterlockedIncrement. As such, it skips over the WaitForSingleObject call and returns immediately. The lock now belongs to this thread, and life is peachy.

If another thread calls Lock while the first thread is still holding it, it will receive a return value of 2 from _InterlockedIncrement. This is a clue that the lock is already busy. At this point, it’s not safe to continue, so we bite the bullet and jump into one of those expensive kernel calls: WaitForSingleObject. This performs a decrement on the semaphore. We specified an initial count of 0 in CreateSemaphore, so the thread is now forced to wait until someone else comes along and increments this semaphore before it can proceed.

Next, suppose the first thread calls Unlock. The return value of _InterlockedDecrement will be 1. This is a clue that another thread is waiting for the lock, and that we should increment the semaphore using ReleaseSemaphore. The second thread is then able to continue, and it effectively obtains ownership of the lock.

Even if the timing is very tight, and the first thread calls ReleaseSemaphore before the second calls WaitForSingleObject, everything will function normally. And if you add a third, fourth or any number of other threads into the picture, that’s fine too. For good measure, you can even add a TryLock function to the implementation:

    bool TryLock()
    {
        LONG result = _InterlockedCompareExchange(&m_counter, 1, 0);
        return (result != 0);
    }

Performance and Caveats

You might have noticed the underscore in front of _InterlockedIncrement. This is the compiler intrinsic version of InterlockedIncrement. It outputs a lock xadd instruction in place. And since Lock is defined right inside the class definition of Benaphore, the compiler treats it as an inline function. A call to Benaphore::Lock compiles down to 10 instructions using default Release settings, and there are no function calls in the uncontended case. Here’s the disassembly:

In the uncontended case, the Benaphore even outperforms a Critical Section on Win32. I timed a pair of uncontended lock/unlock operations, just as I previously did for the Mutex and Critical Section, and found the following timings:

If you have an application which is hitting a lock millions of times per second on Win32, this Benaphore implementation just might boost your overall performance by a couple of percent. [Update: I tried it. It doesn't.] If you forego the intrinsics and just use regular, non-intrinsic version of the atomics, an indirect call into kernel32.dll is involved, so the Benaphore loses some of its edge: 49.8 ns on my Core 2 Duo.

Furthermore, with some (fairly heavy) code modifications, you could even share this Benaphore between processes — something Critical Section isn’t capable of. You’d have to put m_counter in shared memory, and use a named semaphore. However, depending on your use case, this might defeat the purpose of having separate processes in the first place. If one process terminates at the wrong moment, it could leave the Benaphore in an inconsistent state, and take the other processes down with it.

This implementation is non-recursive. If the same thread attempts to lock the same Benaphore twice, it will deadlock. In my next post, I’ll extend the implementation to allow recursion.

If you port this code to another CPU architecture, such as that in the Xbox 360 or a multicore iOS device, you’ll have two replace the _InterlockedIncrement macro with something which explicitly guarantees acquire and release semantics.

Finally, if you do use the Benaphore on another platform, be aware that it may introduce the possibility of priority inversion. For example, MacOS X and Linux avoid priority inversion by performing priority inheritance when you take a POSIX lock. If you use a Benaphore, you’ll bypass this OS mechanism, and it won’t be able to help you. Things are different on Windows, though: Windows fights priority inversion by randomly boosting the priority of starving threads, which behaves the same regardless of your choice of synchronization primitive.

13 Comments

  • Reply Phil Teschner on February 26, 2012

    Just figured I’d let you know that you referred to Charles Bloom as Chris Bloom in your article.

  • Reply Jean-Francois Dube on February 27, 2012

    Nice article, Jeff!

    Some notes:

    1. When working on Splinter Cell: Conviction on Xbox 360, I replaced a couple of heavily-used lightweight mutexes by spinlocks to save a some micro-seconds. It worked as expected, but due to priority inversion, we had very rare and random freezes that were nightmares to track. After a week or so, we decided to revert the changes.

    2. Spinlocks are a bad idea when concurrently used by threads running on the same core (especially on OS that don’t changes threads affinity, ie Xbox 360), as they just spend their time busy-waiting until the OS decides to schedule the other threads. Lightweith mutexes, on the other hand, will loop n times then enter the kernel.

    For those reasons, I think it’s always better to use lightweight mutexes (see http://jfdube.wordpress.com/2011/09/24/lessons-learnt-while-spinning) when the OS provides an implemention.

  • Reply Arseny Kapoulkine on April 20, 2012

    If you have an application which is hitting a lock millions of times per second on Win32, you should change locking policies.

    Also, your Benaphore class has one important difference when compared to critical section – critical section lazily allocates the kernel object (event) that allows it to sleep; i.e. it may be faster, but it’s not recursive and it consumes more memory.

  • Reply Bruce Dawson on May 17, 2012

    Another reason to not roll your own is that the OS locks are likely to have additional features that your lock won’t have. On Windows I can think of several:

    1) When you create a critical section you can specify a spin count, which can give a nice balance between spinning and doing a kernel transition to wait on a semaphore.
    2) Windows 7 has a flag you can set (RTL_CRITICAL_SECTION_FLAG_DYNAMIC_SPIN) which automates the process of choosing the correct spin count.
    3) Critical sections in Windows are ETW instrumented so that you can follow wait chains — to allow sophisticated profiling.

    As I said in a recent post:
    “The ability to profile your code trumps any small performance improvement.”
    http://randomascii.wordpress.com/2012/05/05/xperf-wait-analysisfinding-idle-time/

    • Reply Jeff Preshing on May 20, 2012

      Benaphores use semaphores, which are also ETW instrumented. I created some xperf logs using the capture settings recommended on your blog, and the resulting “ReadyThreadStack” data was just as informative as when Critical Sections were used. Still, I agree with your general sentiment!

      Either way, I think if a reader is able to grasp how the Benaphore works, they’ve (hopefully) gained something, which was my real motivation for this post.

  • Reply Perry on June 12, 2012

    Hello,
    First: great blog, congrats!
    Second: why does this implementation use semaphore on Windows? why doesn’t it use Mutex or, better, Events?
    Perry.

    • Reply Jeff Preshing on June 12, 2012

      On Windows, you actually could use an auto-reset event, I think. I only chose the semaphore because it’s available on other platforms. However, you can’t substitute a mutex in place of the semaphore, because the thread which decrements the semaphore count is different from the thread which increases it.

  • Reply Bruce Dawson on June 12, 2012

    Why a semaphore? Because it works.

    A mutex doesn’t make sense.

    An event is basically a semaphore with a maximum count of one (http://blogs.msdn.com/b/oldnewthing/archive/2006/06/22/642849.aspx) and while I *think* it works I find it harder to reason about than the semaphore.

    Let me reverse the question: why *not* a semaphore? A semaphore works. What advantage would a mutex or event have?

  • Reply Perry on June 12, 2012

    Hello,
    thank you for replying!
    What I am thinking about regards the performance when you unlock the mutex after another thread tried to lock it. In this case the Benaphore has the same performance of a Semaphone, then, why don’t use a better performance primitive?
    Honestly after post the question, I answer me that is because the semaphore is a primitive that can be locked by a thread and unlocked by another one, instead a mutex must be unlocked by the same thread that locked it.

    • Reply Bruce Dawson on June 13, 2012

      Why do you think a mutex or event would have better performance than a semaphore? They are all kernel objects and they all pay the cost of a kernel transition and potential context switch. I would expect their performance to be virtually identical, although it’s difficult to compare because they do different things.

      Remember that the cost of the semaphore only matters when there is contention, and the goal should be to minimize that.

Leave a Reply