Preshing on Programming

Acquire and Release Fences

Acquire and release fences, in my opinion, are rather misunderstood on the web right now. That’s too bad, because the C++11 Standards Committee did a great job specifying the meaning of these memory fences. They enable robust algorithms which scale well across multiple cores, and map nicely onto today’s most common CPU architectures.

First things first: Acquire and release fences are considered low-level lock-free operations. If you stick with higher-level, sequentially consistent atomic types, such as volatile variables in Java 5+, or default atomics in C++11, you don’t need acquire and release fences. The tradeoff is that sequentially consistent types are slightly less scalable or performant for some algorithms.

On the other hand, if you’ve developed for multicore devices in the days before C++11, you might feel an affinity for acquire and release fences. Perhaps, like me, you remember struggling with the placement of some lwsync intrinsics while synchronizing threads on Xbox 360. What’s cool is that once you understand acquire and release fences, you can actually see what we were trying to accomplish using those platform-specific fences all along.

Acquire and release fences, as you might imagine, are standalone memory fences, which means that they aren’t coupled with any particular memory operation. So, how do they work?

An acquire fence prevents the memory reordering of any read which precedes it in program order with any read or write which follows it in program order.

A release fence prevents the memory reordering of any read or write which precedes it in program order with any write which follows it in program order.
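
To make the definitions concrete, here is a minimal C++11 sketch (my own illustration, with made-up names) of the classic message-passing pattern built from a release fence and an acquire fence:

#include <atomic>

int g_payload = 0;                  // plain, non-atomic data
std::atomic<int> g_guard(0);        // guard variable, accessed with relaxed atomics

void SendMessage()
{
    g_payload = 42;                                          // write the payload
    std::atomic_thread_fence(std::memory_order_release);     // release fence: keeps the write above the guard store
    g_guard.store(1, std::memory_order_relaxed);
}

void ReceiveMessage()
{
    if (g_guard.load(std::memory_order_relaxed) == 1)
    {
        std::atomic_thread_fence(std::memory_order_acquire); // acquire fence: keeps the read below the guard load
        int value = g_payload;                               // guaranteed to observe the payload write
        (void) value;
    }
}

If the relaxed load in ReceiveMessage observes the value written by SendMessage, this pair of fences guarantees the payload write is visible as well, which is exactly the synchronizes-with relation described next.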

The Synchronizes-With Relation

In an earlier post, I explained how atomic operations let you manipulate shared variables concurrently without any torn reads or torn writes. Quite often, though, a thread only modifies a shared variable when there are no concurrent readers or writers. In such cases, atomic operations are unnecessary. We just need a way to safely propagate modifications from one thread to another once they’re complete. That’s where the synchronizes-with relation comes in.

“Synchronizes-with” is a term invented by language designers to describe ways in which the memory effects of source-level operations – even non-atomic operations – are guaranteed to become visible to other threads. This is a desirable guarantee when writing lock-free code, since you can use it to avoid unwelcome surprises caused by memory reordering.

“Synchronizes-with” is a fairly modern computer science term. You’ll find it in the specifications of C++11, Java 5+ and LLVM, all of which were published within the last 10 years. Each specification defines this term, then uses it to make formal guarantees to the programmer. One thing they have in common is that whenever there’s a synchronizes-with relationship between two operations, typically on different threads, there’s a happens-before relationship between those operations as well.

The Happens-Before Relation

Happens-before is a modern computer science term which is instrumental in describing the software memory models behind C++11, Java, Go and even LLVM.

You’ll find a definition of the happens-before relation in the specifications of each of the above languages. Conveniently, the definitions given in those specifications are basically the same, though each specification has a different way of saying it. Roughly speaking, the common definition can be stated as follows:

Let A and B represent operations performed by a multithreaded process. If A happens-before B, then the memory effects of A effectively become visible to the thread performing B before B is performed.

When you consider the various ways in which memory reordering can complicate lock-free programming, the guarantee that A happens-before B is a desirable one. There are several ways to obtain this guarantee, differing slightly from one programming language to the next – though obviously, all languages must rely on the same mechanisms at the processor level.
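
For example, one common way to obtain the guarantee in C++11 is a write-release/read-acquire pair on a guard variable. The sketch below is my own (the names are invented), but the pattern is standard:

#include <atomic>
#include <string>

std::string g_message;                  // ordinary, non-atomic data
std::atomic<bool> g_ready(false);       // guard variable

void Producer()
{
    g_message = "hello";                              // A: modify the data
    g_ready.store(true, std::memory_order_release);   // write-release
}

void Consumer()
{
    if (g_ready.load(std::memory_order_acquire))      // read-acquire
    {
        // The write-release synchronizes-with the read-acquire,
        // so A happens-before this read and the message is fully visible.
        std::string copy = g_message;                 // B
        (void) copy;
    }
}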

Atomic vs. Non-Atomic Operations

Much has already been written about atomic operations on the web, usually with a focus on atomic read-modify-write (RMW) operations. However, those aren’t the only kinds of atomic operations. There are also atomic loads and stores, which are equally important. In this post, I’ll compare atomic loads and stores to their non-atomic counterparts at both the processor level and the C/C++ language level. Along the way, we’ll clarify the C++11 concept of a “data race”.

An operation acting on shared memory is atomic if it completes in a single step relative to other threads. When an atomic store is performed on a shared variable, no other thread can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. Non-atomic loads and stores do not make those guarantees.
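
As a small illustration (my own sketch, not from the original post), consider a 64-bit value shared between threads. On a typical 32-bit processor, a plain store of that value may be performed as two 32-bit writes, so a concurrent reader can observe it half-complete; unsynchronized concurrent access to it is also a data race in C++11. The std::atomic version has neither problem:

#include <atomic>
#include <cstdint>

uint64_t g_plainValue = 0;                  // non-atomic: concurrent readers may see a torn value
std::atomic<uint64_t> g_atomicValue(0);     // atomic: loads and stores are indivisible

void Writer()
{
    g_plainValue = 0x100000002ULL;          // may be observed half-written by another thread (and is a data race)
    g_atomicValue.store(0x100000002ULL, std::memory_order_relaxed);    // never observed half-written
}

uint64_t Reader()
{
    return g_atomicValue.load(std::memory_order_relaxed);  // reads the whole value as it appeared at one moment
}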

The World’s Simplest Lock-Free Hash Table

A lock-free hash table is a double-edged sword. There are applications where it can provide a performance improvement that would be impossible to achieve otherwise. The downside is that it’s complicated.

The first working lock-free hash table I heard about was written in Java by Dr. Cliff Click. He released the source code back in 2007 and gave a presentation about it at Google that same year. When I first watched that presentation, I’ll admit, I didn’t understand most of it. The main conclusion I took away from it was that Dr. Cliff Click must be some kind of wizard.

Luckily, six years has given me enough time to (mostly) catch up to Cliff on this subject. It turns out that you don’t have to be a wizard to understand and implement a very basic, but perfectly functional, lock-free hash table. I’ll share the source code for one here. I’m pretty sure that anyone with experience writing multithreaded C++, and a willingness to comb through previous information on this blog, can fully understand it.

A Lock-Free… Linear Search?

Why yes. In this post, I’ll present a C++ class which maps integer keys to integer values. Both its SetItem and GetItem methods are thread-safe and lock-free. The catch is that both operations are implemented using a linear search.

As you might suspect, this class is not very efficient for large collections. Obviously, I don’t recommend using it in practice – at least not for storing more than a few items. But I think it’s worth studying anyway, as it provides a nice context in which to discuss several lock-free programming concepts. As a bonus, once you grasp how this class works, you’ll be ready to tackle a basic lock-free hash table.

The class, which I’ve imaginatively named ArrayOfItems, contains an ordinary array of key/value pairs. The array is preallocated to a size sufficiently large to hold all the items we’re going to store. We declare both fields as mint_atomic32_t, since we’re going to manipulate them using Mintomic, a portable library for low-level lock-free programming which I released earlier this month.

struct Entry
{
    mint_atomic32_t key;
    mint_atomic32_t value;
};
Entry *m_entries;   // preallocated array, large enough to hold every item we'll store
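
To give a feel for the approach before diving in, here’s a rough sketch of what SetItem and GetItem could look like. It’s my own approximation rather than the exact code from the post: it assumes key 0 and value 0 are reserved to mean “unused” and “not found”, and that Mintomic’s relaxed load is available as mint_load_32_relaxed.

void ArrayOfItems::SetItem(uint32_t key, uint32_t value)
{
    // Linear search for the key, claiming the first free slot if the key isn't found.
    for (uint32_t idx = 0;; idx++)
    {
        // Atomically claim the slot if it's unused (key 0); returns the previous key either way.
        uint32_t prevKey = mint_compare_exchange_strong_32_relaxed(&m_entries[idx].key, 0, key);
        if ((prevKey == 0) || (prevKey == key))
        {
            mint_store_32_relaxed(&m_entries[idx].value, value);
            return;
        }
    }
}

uint32_t ArrayOfItems::GetItem(uint32_t key)
{
    // Linear search; a key of 0 marks the end of the used entries.
    for (uint32_t idx = 0;; idx++)
    {
        uint32_t probedKey = mint_load_32_relaxed(&m_entries[idx].key);
        if (probedKey == key)
            return mint_load_32_relaxed(&m_entries[idx].value);
        if (probedKey == 0)
            return 0;   // not found
    }
}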

Introducing Mintomic: A Small, Portable Lock-Free API

Today, I’m releasing an open source library called Mintomic. Mintomic is an API for low-level lock-free programming in C and C++. It runs on a variety of platforms including Windows, Linux, MacOS, iOS and Xbox 360. Mintomic’s goals are to be efficient, straightforward, and (mostly) compatible with older compilers.


View the documentation

View on GitHub

Mintomic (short for “minimal atomic”) draws a lot of inspiration from the C/C++11 atomic library standards, with an important exception: In Mintomic, all atomic operations are “relaxed”. The only way to enforce memory ordering is with explicit fences. Here’s an example taken from my post about weak hardware ordering, rewritten using Mintomic:

// Define the shared variables:
mint_atomic32_t flag;       // guards access to g_sharedValue
int g_sharedValue = 0;      // plain (non-atomic) counter, protected by the flag

void IncrementSharedValue10000000Times(TimeWaster& tw)
{
    int count = 0;
    while (count < 10000000)
    {
        tw.wasteRandomCycles();

        // Atomic read-modify-write operation:
        if (mint_compare_exchange_strong_32_relaxed(&flag, 0, 1) == 0)
        {
            mint_thread_fence_acquire();    // Acquire fence: keeps the increment from moving above the CAS
            g_sharedValue++;
            mint_thread_fence_release();    // Release fence: keeps the increment from moving below the store to flag

            // Atomic store:
            mint_store_32_relaxed(&flag, 0);
            count++;
        }
    }
}

View Your Filesystem History Using Python

Sometimes, it’s useful to look back on your filesystem history.

For example, after installing some new software, you might want to know which files have changed on your hard drive. Or, if you’re a programmer getting started on a new project, you may need to follow a complex and unfamiliar build process. A list of recently modified files can reveal a lot about how that build process works.

Here’s a short Python script to create such a list. It lists the contents of a folder recursively, sorted by modification time.

This Hash Table Is Faster Than a Judy Array

In game development, we use associative maps for many things. Dynamic loading, object reflection, rendering, shader management.

A lot of them fit the following usage pattern:

  • The keys and values have simple integer or pointer types.
  • The keys are effectively random.
  • The basic operation is lookup-or-insert: Look for an existing item first, and add one if not found. Usually, the item is found.
  • Deletes happen infrequently, and when they do, they tend to happen in bulk.

Occasionally, one of these maps is identified as a performance bottleneck. It raises the question: What’s the best data structure to use in this scenario?

How to Generate a Sequence of Unique Random Integers

Suppose we wish to generate a sequence of 10,000,000 random 32-bit integers with no repeats. How can we do it?

I faced this problem recently, and considered several options before finally implementing a custom, non-repeating pseudo-random number generator which runs in O(1) time, requires just 8 bytes of storage, and has pretty good distribution. I thought I’d share the details here.

Approaches Considered

There are already several well-known pseudo-random number generators (PRNGs) such as the Mersenne Twister, an excellent PRNG which distributes integers uniformly across the entire 32-bit range. Unfortunately, calling this PRNG 10,000,000 times does not tend to generate a sequence of 10,000,000 unique values. According to Hash Collision Probabilities, the probability of all 10,000,000 random numbers being unique is essentially zero.
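
As a rough back-of-the-envelope check (my own arithmetic, based on the standard birthday-problem approximation rather than the post’s exact expression):

P(all unique) ≈ exp(-n(n-1) / (2N))
              = exp(-10,000,000 × 9,999,999 / (2 × 2^32))
              ≈ exp(-11,641)

With n = 10,000,000 draws from a space of N = 2^32 values, the exponent is around -11,641, so the chance of avoiding every collision is unimaginably small.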