On this blog, I’ve been rambling on about lock-free programming subjects such as acquire and release semantics and weakly-ordered CPUs. I’ve tried to make these subjects approachable and understandable, but at the end of the day, talk is cheap! Nothing drives the point home better than a concrete example.
If there’s one thing that characterizes a weakly-ordered CPU, it’s that one CPU core can see values change in shared memory in a different order than another core wrote them. That’s what I’d like to demonstrate in this post using pure C++11.
For normal applications, the x86/64 processor families from Intel and AMD do not have this characteristic. So we can forget about demonstrating this phenomenon on pretty much every modern desktop or notebook computer in the world. What we really need is a weakly-ordered multicore device. Fortunately, I happen to have one right here in my pocket:

The iPhone 4S fits the bill. It runs on a dual-core ARM-based processor, and the ARM architecture is, in fact, weakly-ordered.
The Experiment
Our experiment will consist of an single integer, sharedValue, protected by a mutex. We’ll spawn two threads, and each thread will run until it has incremented sharedValue 10000000 times.
We won’t let our threads block waiting on the mutex. Instead, each thread will loop repeatedly doing busy work (ie. just wasting CPU time) and attempting to lock the mutex at random moments. If the lock succeeds, the thread will increment sharedValue, then unlock. If the lock fails, it will just go back to doing busy work. Here’s some pseudocode:
count = 0
while count < 10000000:
doRandomAmountOfBusyWork()
if tryLockMutex():
// The lock succeeded
sharedValue++
unlockMutex()
count++
endif
endwhile
With each thread running on a separate CPU core, the timeline should look something like this. Each red section represents a successful lock and increment, while the dark blue ticks represent lock attempts which failed because the other thread was already holding the mutex.

It bears repeating that a mutex is just a concept, and there are many ways to implement one. We could use the implementation provided by std::mutex, and of course, everything will function correctly. But then I'd have nothing to show you. Instead, let's implement a custom mutex — then let's break it to demonstrate the consequences of weak hardware ordering. Intuitively, the potential for memory reordering will be highest at those moments when there is a "close shave" between threads — for example, at the moment circled in the above diagram, when one thread acquires the lock just as the other thread releases it.
The latest version of Xcode has terrific support for C++11 threads and atomic types, so let's use those. All C++11 identifiers are defined in the std namespace, so let's assume using namespace std; was placed somewhere earlier in the code.
A Ridiculously Simple Mutex
Our mutex will consist of a single integer flag, where 1 indicates that the mutex is held, and 0 means it isn't. To ensure mutual exclusivity, a thread can only set flag to 1 if the previous value was 0, and it must do so atomically. To achieve this, we'll define flag as a C++11 atomic type, atomic<int>, and use a read-modify-write operation:
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_acquire))
{
// The lock succeeded
}
The memory_order_acquire argument used above is considered an ordering constraint. We're placing acquire semantics on the operation, to help guarantee that we receive the latest shared values from the previous thread which held the lock.
To release the lock, we perform the following:
flag.store(0, memory_order_release);
This sets flag back to 0 using the memory_order_release ordering constraint, which applies release semantics. Acquire and release semantics must be used as a pair to ensure that shared values propagate completely from one thread to the next.
If We Don't Use Acquire and Release Semantics...
Now, let's write the experiment in C++11, but instead of specifying the correct ordering constraints, let's put memory_order_relaxed in both places. This means no particular memory ordering will be enforced by the C++11 compiler, and any kind of reordering is permitted.
void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
int count = 0;
while (count < 10000000)
{
randomDelay.doBusyWork();
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_relaxed))
{
// Lock was successful
sharedValue++;
flag.store(0, memory_order_relaxed);
count++;
}
}
}
At this point, it's informative to look at the resulting ARM assembly code generated by the compiler, in Release, using the Disassembly view in Xcode:

If you aren't very familiar with assembly language, don't worry. All we want to know is whether the compiler has reordered any operations on shared variables. This would include the two operations on flag, and the increment of sharedValue in between. Above, I've annotated the corresponding sections of assembly code. As you can see, we got lucky: The compiler chose not to reorder those operations, even though the memory_order_relaxed argument means that, in all fairness, it could have.
I've put together a sample application which repeats this experiment indefinitely, printing the final value of sharedValue at the end of each trial run. It's available on GitHub if you'd like to view the source code or run it yourself.
Here's the iPhone, hard at work, running the experiment:

And here's the output from the Output panel in Xcode:

Check it out! The final value of sharedValue is consistently less than 20000000, even though both threads perform exactly 10000000 increments, and the order of assembly instructions exactly matches the order of operations on shared variables as specified in C++.
As you might have guessed, these results are entirely due to memory reordering on the CPU. To point out just one possible reordering — and there are several — the memory interaction of str.w r0, [r11] (the store to sharedValue) could be reordered with that of str r5, [r6] (the store of 0 to flag). In other words, the mutex could be effectively unlocked before we're finished with it! As a result, the other thread would be free to wipe out the change made by this one, resulting in a mismatched sharedValue count at the end of the experiment, just as we're seeing here.
Using Acquire and Release Semantics Correctly
Fixing our sample application, of course, means putting the correct C++11 memory ordering constraints back in place:
void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
int count = 0;
while (count < 10000000)
{
randomDelay.doBusyWork();
int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_acquire))
{
// Lock was successful
sharedValue++;
flag.store(0, memory_order_release);
count++;
}
}
}
As a result, the compiler now inserts a couple of dmb ish instructions, which act as memory barriers in the ARMv7 instruction set. I'm not an ARM expert — comments are welcome — but it's safe to assume this instruction, much like lwsync on PowerPC, provides all the memory barrier types needed for acquire semantics on compare_exchange_strong, and release semantics on store.

This time, our little home-grown mutex really does protect sharedValue, ensuring all modifications are passed safely from one thread to the next each time the mutex is locked.

If you still don't grasp intuitively what's going on in this experiment, I'd suggest a review of my source control analogy post. In terms of that analogy, you can imagine two workstations each having local copies of sharedValue and flag, with some effort required to keep them in sync. Personally, I find visualizing it this way very helpful.
I'd just like to reiterate that the memory reordering we saw here can only be observed on a multicore or multiprocessor device. If you take the same compiled application and run it on an iPhone 3GS or first-generation iPad, which use the same ARMv7 architecture but have only a single CPU core, you won't see any mismatch in the final count of sharedValue.
Interesting Notes
You can build and run this sample application on any Windows, MacOS or Linux machine with a x86/64 CPU, but unless your compiler performs reordering on specific instructions, you won't witness any memory reordering at runtime — even on a multicore system! Indeed, when I tested it using Visual Studio 2012, no memory reordering occurred. That's because x86/64 processors are what is usually considered strongly-ordered: When one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.

This goes to show how easy it is to use C++11 atomics incorrectly without knowing it, simply because it appears to work correctly on a specific processor and toolchain.
Incidentally, the release candidate of Visual Studio 2012 generates rather poor x86 machine code for this sample. It's nowhere near as efficient as the ARM code generated by Xcode. Meanwhile, performance is the main reason to use lock-free programming on multicore in the first place! It's enough to turn me off using C++11 atomics on Windows for the time being. [Update Feb. 2013: As mentioned in the comments, the latest version of VS2012 Professional now generates much better machine code.]
This post is a followup to an earlier post where I demonstrated StoreLoad reordering on x86/64. In my experience, however, the need for #StoreLoad does not come up quite as often in practice as the ordering constraints demonstrated here.
Finally, I'm not the first person to demonstrate weak hardware ordering in practice, though I might be the first to demonstrate it using C++11. There are earlier posts by Pierre Lebeaupin and ridiculousfish which use different experiments to demonstrate the same phenomenon.
Fantastic article. There actually are other uses for lock-free programming besides performance, though. Its required for async-signal safety, for example. But for the other 99.9% of use-cases, I agree.
I’d love to see a quick follow-up with a C11 implementation. Clang seems to have good support for this, though it does not ship with stdatomic.h, so it is slightly annoying to use.
You’re right about lock-free programming being needed for signals and interrupts. Thanks for pointing that out. I revised the text a little bit with that in mind.
Also, locking approaches are generally less composable than lock-free approaches. When we hide a blocking call deep down in an abstraction we increase the possibility of accidental deadlocks and livelocks at the higher level.
I remember there was a sudden rash of bug reports in multithreaded libraries and applications around April 2011 – just after the iPad 2 was released. That was the first mass market hardware with a multicore ARM CPU, and it gave a lot of supposedly threadsafe code a workout it had never had before.
I always wondered about that.
Nice work as always.
Interesting post, all the more so because I had just finished reading a post on @synchronized at refactr.com. This is a great way to start the weekend. Thanks for a good read.
So what’s the performance difference?
Without memory barriers, each trial takes a minimum of 1.712 sec to complete on my iPhone 4S. With memory barriers, each trial takes a minimum of 2.428 sec. From this we can estimate the cost of each
dmb ishinstruction in this experiment at 36 ns.and the perf if using the native std::mutex is?
This is the most important part of perf – writing a new lock type mechanism is all very well (and loads of people do it), but I always question whether they’re simply wasting everyone’s time. (not to say that your article is bad, its great, and I hope it acts as a lesson to the people who think they know better than everyone else
)
I’ve never seen a lock-free algorithm outperform anything on non-massively multicore hardware. Lock-free algorithms require many more atomic instructions, even for the uncontended cases. Those are usually several times slower than regular instructions. The iPhone is definitely not massively multi-core, and neither are your typical Linux or Windows server.
Everybody and their cousin implements lock-free algorithms, and everybody and their cousin ends up with a piece of software that runs slower than using a simple mutex.
As has been mentioned, though, there are some legitimate cases, like in asynchronous signal contexts on Unix. But there you don’t do it for performance, but because you have no choice.
Anonymous: Maybe it depends on your definition of performance…
In desktop real-time audio applications, you should always use lock-free programming when interacting with the real-time audio thread because otherwise you can end up with priority inversion. Using lock-free programming for “performance” in this case means we do it to avoid inadvertently delaying the realtime thread (through priority inversion), which leads to a dropout in audio.
You might not notice the difference right away, but spin up a background task hogging 100% of your CPU, and then you’ll notice a difference.
In the games industry there are many examples where a lock-free algorithm measurably outperforms a locking one, on a small number of cores — though it’s a method of last resort. If it didn’t make our framerate go up, we wouldn’t do it.
What did VS2012 make of it and how does it compare in performance to `lock inc [somevar]`?
Instead of generating a single
lock cmpxchginstruction, it goes two nested function calls deep. If you step through, it executes between 15-30 instructions where one would suffice, many of which are branches.I just checked again using the latest version of Visual Studio Professional 2012 (Update 1), and the generated code is much better. The call to
compare_exchange_strong()finally generates a singlelock cmpxchginstruction.You may try it on the apple’s new swift core or ARM’s A15, or Qualcomm’s Krait core. ARM v7 is a weakly ordering memory system but these cores act very different on coherency.
Hi
I think there is a bug in your code, the g_randomValues is not initialized because RandomDelay::Initialize is not been invoked.
And if RandomDelay::Initialize is been called, it will cause infinite loop in busywork.
I tested it on my K860i with Samsung Exynos 4412 for a long time, but it always output
sharedValue=20000000
These cpu are all base on ARM Cortex-A9 so I think it should be worked on Exynos 4412.
Do you have any ideas, thanks
wow, it works after long time wait.
Thanks
I think following code is better to illustrate illustrate weak order.
const long repeat_times=10000;
std::atomic_bool x[repeat_times],y[repeat_times];
void* write_x_then_y_relaxed(void*) {
for (int i=0; i<repeat_times; i++) {
x[i].store(true,std::memory_order_relaxed);// #4
y[i].store(true,std::memory_order_relaxed);// #5
usleep(1);
}
return (void*)1;
}
void* read_y_then_x_relaxed(void*) {
for (int i=0; i<repeat_times; i++) {
while (!y[i].load(std::memory_order_relaxed)); //#3
if (!x[i].load(std::memory_order_relaxed)) { //#2
assert(false);
}
}
return (void*)2;
}
I think you’re right, there is a bug. On this line I incremented by
m_wrap(instead of say, 1) which meant it was always testing the same value. I guess I was lucky it ever returned from the delay. I’ll submit a fix once I have a chance to test.Thanks for you reply.
I think another key point is keep your phone awake.
I tried my own code and your sample for a long time but it never happen. But when I check my sms, it fire the assert.
Maybe there is only one core working when the phone is idle, it every threads are running on same core.
But I can’t find the specification demonstrate how these cores works.
As far as ‘dmb ish’ is concerned, this instruction performs a Data Memory Barrier in the Inner Shareable barrier domain. ARM architecture allows you to specify what other components of the system need to see this update, or what other components may heve changed the value that the next read needs to see. A dmb instruction can either affect both loads and stores, or just stores (if the option includes ‘ST’).
I found an explanation of barrier domains and the barrier instructions at http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/ It’s unclear whether there is a *requirement* that all processor cores are in the inner shareable domain, and other devices on the chip in the outer shareable domain, but that seems to be the implication.
Why you said that this code will work on single core processors? If those two operations would be reordered it will also not work on single core. In case of that reordering store to sharedVaraibke would be not protected by flag.
You seem to be describing a case where we have an out-of-order processor, and the instruction at 0x51a26 (the store of 0 to flag) has executed, but the instruction at 0x51a22 (the store to sharedValue) has not.
These situations are extremely temporary. In particular, it’s impossible for a single-core CPU to switch to another thread without completing that instruction at 0x51a22. A single core will never see its own memory interactions happen out of order. If it does, it’s broken!
Keep in mind that 0x51a26 comes after 0x51a22 in program order. It’s impossible to execute 0x51a26 without fetching 0x51a22 first. And once it’s fetched, that’s a commitment. The CPU will complete that instruction (from its own point of view) before it screws anything up, even if the two instructions happen out-of-order under the hood.
The only way this code could not work on a single core processor (whether it’s an out-of-order processor or not) is if the compiler had reordered the instructions, which in this case, it did not.
all posts
If you like this blog, and you've found the posts valuable to you in some way, consider leaving a tip!
© 2011-2012 Jeff Preshing. Powered by WordPress.