This Is Why They Call It a Weakly-Ordered CPU

On this blog, I’ve been rambling on about lock-free programming subjects such as acquire and release semantics and weakly-ordered CPUs. I’ve tried to make these subjects approachable and understandable, but at the end of the day, talk is cheap! Nothing drives the point home better than a concrete example.

If there’s one thing that characterizes a weakly-ordered CPU, it’s that one CPU core can see values change in shared memory in a different order than another core wrote them. That’s what I’d like to demonstrate in this post using pure C++11.

For normal applications, the x86/64 processor families from Intel and AMD do not have this characteristic. So we can forget about demonstrating this phenomenon on pretty much every modern desktop or notebook computer in the world. What we really need is a weakly-ordered multicore device. Fortunately, I happen to have one right here in my pocket:

The iPhone 4S fits the bill. It runs on a dual-core ARM-based processor, and the ARM architecture is, in fact, weakly-ordered.

The Experiment

Our experiment will consist of an single integer, sharedValue, protected by a mutex. We’ll spawn two threads, and each thread will run until it has incremented sharedValue 10000000 times.

We won’t let our threads block waiting on the mutex. Instead, each thread will loop repeatedly doing busy work (ie. just wasting CPU time) and attempting to lock the mutex at random moments. If the lock succeeds, the thread will increment sharedValue, then unlock. If the lock fails, it will just go back to doing busy work. Here’s some pseudocode:

count = 0
while count < 10000000:
    doRandomAmountOfBusyWork()
    if tryLockMutex():
        // The lock succeeded
        sharedValue++
        unlockMutex()
        count++
    endif
endwhile

With each thread running on a separate CPU core, the timeline should look something like this. Each red section represents a successful lock and increment, while the dark blue ticks represent lock attempts which failed because the other thread was already holding the mutex.

It bears repeating that a mutex is just a concept, and there are many ways to implement one. We could use the implementation provided by std::mutex, and of course, everything will function correctly. But then I’d have nothing to show you. Instead, let’s implement a custom mutex – then let’s break it to demonstrate the consequences of weak hardware ordering. Intuitively, the potential for memory reordering will be highest at those moments when there is a “close shave” between threads – for example, at the moment circled in the above diagram, when one thread acquires the lock just as the other thread releases it.

The latest version of Xcode has terrific support for C++11 threads and atomic types, so let’s use those. All C++11 identifiers are defined in the std namespace, so let’s assume using namespace std; was placed somewhere earlier in the code.

A Ridiculously Simple Mutex

Our mutex will consist of a single integer flag, where 1 indicates that the mutex is held, and 0 means it isn’t. To ensure mutual exclusivity, a thread can only set flag to 1 if the previous value was 0, and it must do so atomically. To achieve this, we’ll define flag as a C++11 atomic type, atomic<int>, and use a read-modify-write operation:

int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_acquire))
{
    // The lock succeeded
}

The memory_order_acquire argument used above is considered an ordering constraint. We’re placing acquire semantics on the operation, to help guarantee that we receive the latest shared values from the previous thread which held the lock.

To release the lock, we perform the following:

flag.store(0, memory_order_release);

This sets flag back to 0 using the memory_order_release ordering constraint, which applies release semantics. Acquire and release semantics must be used as a pair to ensure that shared values propagate completely from one thread to the next.

If We Don’t Use Acquire and Release Semantics…

Now, let’s write the experiment in C++11, but instead of specifying the correct ordering constraints, let’s put memory_order_relaxed in both places. This means no particular memory ordering will be enforced by the C++11 compiler, and any kind of reordering is permitted.

void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
    int count = 0;
    while (count < 10000000)
    {
        randomDelay.doBusyWork();
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, memory_order_relaxed))
        {
            // Lock was successful
            sharedValue++;
            flag.store(0, memory_order_relaxed);
            count++;
        }
    }
}

At this point, it’s informative to look at the resulting ARM assembly code generated by the compiler, in Release, using the Disassembly view in Xcode:

If you aren’t very familiar with assembly language, don’t worry. All we want to know is whether the compiler has reordered any operations on shared variables. This would include the two operations on flag, and the increment of sharedValue in between. Above, I’ve annotated the corresponding sections of assembly code. As you can see, we got lucky: The compiler chose not to reorder those operations, even though the memory_order_relaxed argument means that, in all fairness, it could have.

I’ve put together a sample application which repeats this experiment indefinitely, printing the final value of sharedValue at the end of each trial run. It’s available on GitHub if you’d like to view the source code or run it yourself.

Here’s the iPhone, hard at work, running the experiment:

And here’s the output from the Output panel in Xcode:

Check it out! The final value of sharedValue is consistently less than 20000000, even though both threads perform exactly 10000000 increments, and the order of assembly instructions exactly matches the order of operations on shared variables as specified in C++.

As you might have guessed, these results are entirely due to memory reordering on the CPU. To point out just one possible reordering – and there are several – the memory interaction of str.w r0, [r11] (the store to sharedValue) could be reordered with that of str r5, [r6] (the store of 0 to flag). In other words, the mutex could be effectively unlocked before we’re finished with it! As a result, the other thread would be free to wipe out the change made by this one, resulting in a mismatched sharedValue count at the end of the experiment, just as we’re seeing here.

Using Acquire and Release Semantics Correctly

Fixing our sample application, of course, means putting the correct C++11 memory ordering constraints back in place:

void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
    int count = 0;
    while (count < 10000000)
    {
        randomDelay.doBusyWork();
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, memory_order_acquire))
        {
            // Lock was successful
            sharedValue++;
            flag.store(0, memory_order_release);
            count++;
        }
    }
}

As a result, the compiler now inserts a couple of dmb ish instructions, which act as memory barriers in the ARMv7 instruction set. I’m not an ARM expert – comments are welcome – but it’s safe to assume this instruction, much like lwsync on PowerPC, provides all the memory barrier types needed for acquire semantics on compare_exchange_strong, and release semantics on store.

This time, our little home-grown mutex really does protect sharedValue, ensuring all modifications are passed safely from one thread to the next each time the mutex is locked.

If you still don’t grasp intuitively what’s going on in this experiment, I’d suggest a review of my source control analogy post. In terms of that analogy, you can imagine two workstations each having local copies of sharedValue and flag, with some effort required to keep them in sync. Personally, I find visualizing it this way very helpful.

I’d just like to reiterate that the memory reordering we saw here can only be observed on a multicore or multiprocessor device. If you take the same compiled application and run it on an iPhone 3GS or first-generation iPad, which use the same ARMv7 architecture but have only a single CPU core, you won’t see any mismatch in the final count of sharedValue.

Interesting Notes

You can build and run this sample application on any Windows, MacOS or Linux machine with a x86/64 CPU, but unless your compiler performs reordering on specific instructions, you won’t witness any memory reordering at runtime – even on a multicore system! Indeed, when I tested it using Visual Studio 2012, no memory reordering occurred. That’s because x86/64 processors are what is usually considered strongly-ordered: When one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.

This goes to show how easy it is to use C++11 atomics incorrectly without knowing it, simply because it appears to work correctly on a specific processor and toolchain.

Incidentally, the release candidate of Visual Studio 2012 generates rather poor x86 machine code for this sample. It’s nowhere near as efficient as the ARM code generated by Xcode. Meanwhile, performance is the main reason to use lock-free programming on multicore in the first place! It’s enough to turn me off using C++11 atomics on Windows for the time being. [Update Feb. 2013: As mentioned in the comments, the latest version of VS2012 Professional now generates much better machine code.]

This post is a followup to an earlier post where I demonstrated StoreLoad reordering on x86/64. In my experience, however, the need for #StoreLoad does not come up quite as often in practice as the ordering constraints demonstrated here.

Finally, I’m not the first person to demonstrate weak hardware ordering in practice, though I might be the first to demonstrate it using C++11. There are earlier posts by Pierre Lebeaupin and ridiculousfish which use different experiments to demonstrate the same phenomenon.