Preshing on ProgrammingPreshing on Programming

Can Reordering of Release/Acquire Operations Introduce Deadlock?

I wasn’t planning to write about lock-free programming again, but a commenter named Mike recently asked an interesting question on my Acquire and Release Semantics post from 2012. It’s a question I wondered about years ago, but could never really reconcile until (possibly) now.

A quick recap: A read-acquire operation cannot be reordered, either by the compiler or the CPU, with any read or write operation that follows it in program order. A write-release operation cannot be reordered with any read or write operation that precedes it in program order.

Those rules don’t prevent the reordering of a write-release followed by a read-acquire. For example, in C++, if A and B are std::atomic<int>, and we write:, std::memory_order_release);
int b = B.load(std::memory_order_acquire);

…the compiler is free to reorder those statements, as if we had written:

int b = B.load(std::memory_order_acquire);, std::memory_order_release);

And that’s fair. Why the heck not? On many architectures, including x86, the CPU could perform this reordering anyway.

Here’s a Standalone Cairo DLL for Windows

Cairo is an open source C library for drawing vector graphics. I used it to create many of the diagrams and graphs on this blog.

Cairo is great, but it’s always been difficult to find a precompiled Windows DLL that’s up-to-date and that doesn’t depend on a bunch of other DLLs. I was recently unable to find such a DLL, so I wrote a script to simplify the build process for one. The script is shared on GitHub:

If you just want a binary package, you can download one from the Releases page:

The binary package contains Cairo header files, import libraries and DLLs for both x86 and x64. The DLLs are statically linked with their own C runtime and have no external dependencies. Since Cairo’s API is pure C, these DLLs should work with any application built with any version of MSVC. I configured these DLLs to render text using FreeType because I find the quality of FreeType-rendered text better than Win32-rendered text, which Cairo normally uses by default. FreeType also supports more font formats and gives text a consistent appearance across different operating systems.

Learn CMake’s Scripting Language in 15 Minutes

As explained in my previous post, every CMake-based project must contain a script named CMakeLists.txt. This script defines targets, but it can also do a lot of other things, such as finding third-party libraries or generating C++ header files. CMake scripts have a lot of flexibility.

Every time you integrate an external library, and often when adding support for another platform, you’ll need to edit the script. I spent a long time editing CMake scripts without really understanding the language, as the documentation is quite scattered, but eventually, things clicked. The goal of this post is to get you to the same point as quickly as possible.

How to Build a CMake-Based Project

CMake is a versatile tool that helps you build C/C++ projects on just about any platform you can think of. It’s used by many popular open source projects including LLVM, Qt, KDE and Blender.

All CMake-based projects contain a script named CMakeLists.txt, and this post is meant as a guide for configuring and building such projects. This post won’t show you how to write a CMake script – that’s getting ahead of things, in my opinion.

As an example, I’ve prepared a CMake-based project that uses SDL2 and OpenGL to render a spinning 3D logo. You can build it on Windows, MacOS or Linux.

Using Quiescent States to Reclaim Memory

If you want to support multiple readers for a data structure, while protecting against concurrent writes, a read-write lock might seem like the only way – but it isn’t! You can achieve the same thing without a read-write lock if you allow several copies of the data structure to exist in memory. You just need a way to delete old copies when they’re no longer in use.

Let’s look at one way to achieve that in C++. We’ll start with an example based on a read-write lock.

Using a Read-Write Lock

Suppose you have a network server with dozens of threads. Each thread broadcasts messages to dozens of connected clients. Once in a while, a new client connects or an existing client disconnects, so the list of connected clients must change. We can store the list of connected clients in a std::vector and protect it using a read-write lock such as std::shared_mutex.

Leapfrog Probing

A hash table is a data structure that stores a set of items, each of which maps a specific key to a specific value. There are many ways to implement a hash table, but they all have one thing in common: buckets. Every hash table maintains an array of buckets somewhere, and each item belongs to exactly one bucket.

To determine the bucket for a given item, you typically hash the item’s key, then compute its modulus – that is, the remainder when divided by the number of buckets. For a hash table with 16 buckets, the modulus is given by the final hexadecimal digit of the hash.

Inevitably, several items will end up belonging to same bucket. For simplicity, let’s suppose the hash function is invertible, so that we only need to store hashed keys. A well-known strategy is to store the bucket contents in a linked list:

A Resizable Concurrent Map

In an earlier post, I showed how to implement the “world’s simplest lock-free hash table” in C++. It was so simple that you couldn’t even delete entries or resize the table. Well, a few years have passed since then, and I’ve recently written some concurrent maps without those limitations. You’ll find them in my Junction project on GitHub.

Junction contains several concurrent maps – even the ‘world’s simplest’ is there, under the name ConcurrentMap_Crude. For brevity, let’s call that one the Crude map. In this post, I’ll explain the difference between the Crude map and Junction’s Linear map. Linear is the simplest Junction map that supports both resize and delete.

You can review the original post for an explanation of how the Crude map works. To recap: It’s based on open addressing and linear probing. That means it’s basically a big array of keys and values using a linear search. When inserting or looking up a given key, you hash the key to determine where to begin the search. Concurrent inserts and lookups are permitted.

New Concurrent Hash Maps for C++

A map is a data structure that maps a collection of keys to a collection of values. It’s a common concept in computer programming. You typically manipulate maps using functions such as find, insert and erase.

A concurrent map is one that lets you call some of those functions concurrently – even in combinations where the map is modified. If it lets you call insert from multiple threads, with no mutual exclusion, it’s a concurrent map. If it lets you call insert while another thread is calling find, with no mutual exclusion, it’s a concurrent map. Other combinations might be allowed, too. Traditional maps, such as std::map and std::unordered_map, don’t allow that.

Today I’m releasing Junction, a C++ library that contains several new concurrent maps. It’s BSD-licensed, so you can use the source code freely in any project, for any purpose.

On my Core i7-5930K, Junction’s two fastest maps outperform all other concurrent maps.

You Can Do Any Kind of Atomic Read-Modify-Write Operation

Atomic read-modify-write operations – or “RMWs” – are more sophisticated than atomic loads and stores. They let you read from a variable in shared memory and simultaneously write a different value in its place. In the C++11 atomic library, all of the following functions perform an RMW:


fetch_add, for example, reads from a shared variable, adds another value to it, and writes the result back – all in one indivisible step. You can accomplish the same thing using a mutex, but a mutex-based version wouldn’t be lock-free. RMW operations, on the other hand, are designed to be lock-free. They’ll take advantage of lock-free CPU instructions whenever possible, such as ldrex/strex on ARMv7.

Safe Bitfields in C++

In my cpp11-on-multicore project on GitHub, there’s a class that packs three 10-bit values into a 32-bit integer.

I could have implemented it using traditional bitfields…

struct Status
    uint32_t readers : 10;
    uint32_t waitToRead : 10;
    uint32_t writers : 10;

Or with some bit twiddling…

uint32_t status = readers | (waitToRead << 10) | (writers << 20);