A Lock-Free... Linear Search?

Why yes. In this post, I’ll present a C++ class which maps integer keys to integer values. Both its SetItem and GetItem methods are thread-safe and lock-free. The catch is that both operations are implemented using a linear search.

As you might suspect, this class is not very efficient for large collections. Obviously, I won’t recommend using it in practice – at least not for storing more than a few items. But I think it’s worth studying anyway, as it provides a nice context in which to discuss several lock-free programming concepts. As a bonus, once you grasp how this class works, you’ll be ready to tackle a basic lock-free hash table.

The class, which I’ve imaginatively named ArrayOfItems, contains an ordinary array of key/value pairs. The array is preallocated to a size sufficiently large to hold all the items we’re going to store. We declare both fields as mint_atomic32_t, since we’re going to manipulate them using Mintomic, a portable library for low-level lock-free programming which I released earlier this month.

struct Entry
{
    mint_atomic32_t key;
    mint_atomic32_t value;
};
Entry *m_entries;

An entry is considered unused if its key is 0, and initially, the array is all zeros. We forbid 0 from being stored as an actual key or value. Since new keys are added after a linear search, the array fills from left to right. And as mentioned already, we intend to call SetItem and GetItem from multiple threads in parallel.

You’d think this would be complicated, but here’s some lock-free C++ code which implements SetItem:

void ArrayOfItems::SetItem(uint32_t key, uint32_t value)
{
    for (uint32_t idx = 0;; idx++)
    {
        uint32_t prevKey = mint_compare_exchange_strong_32_relaxed(&m_entries[idx].key, 0, key);
        if ((prevKey == 0) || (prevKey == key))
        {
            mint_store_32_relaxed(&m_entries[idx].value, value);
            return;
        }
    }
}

This code is strikingly similar to what you would do in the single-threaded case: Scan through the entire array, one-at-a-time, and store the new key in the first entry whose existing key is 0. If that succeeds – or if the key is already found – store the new value in the same entry, and Bob’s your uncle.

There are several considerations making this code correct, thread-safe and lock-free. Let’s look at atomicity, then the role of the read-modify-write operation, then at memory ordering.

Atomicity

Atomic operations guarantee that no thread will ever read a partially-updated value from shared memory, even when two threads manipulate the same variable from different CPU cores simultaneously. In SetItem, even the store uses an atomic library function: mint_store_32_relaxed. It may seem silly to wrap the store in a function named mint_store_32_relaxed, since we know that on all modern CPUs, a plain, natively-aligned 32-bit assignment is already atomic. But in Mintomic, it’s good practice to code as though all memory operations are non-atomic unless they use an atomic library function. (Indeed, in the C++11 atomic library standard, it’s required to code this way.) For example, if these were 64-bit integers, a plain assignment on x86 would most definitely not be atomic.

Read-Modify-Write

Another consideration is the use of mint_compare_exchange_strong_32_relaxed in SetItem. Many readers will recognize this function as a compare-and-swap, which is often refered to as CAS for short. This is a more elaborate kind of atomic operation; one which reads, modifies and writes (RMW) all in one indivisible step. The first argument is the address of the shared integer; the second argument is the value we expect to find in that address; and the third argument is the new value we’d like to store if the expected value is found. The CAS returns the previous value stored at that address, which is sort of like getting a report card back telling us what happened.

More importantly, when several threads are hammering on the same variable, they will all line up in a row and execute the CAS one-at-a time, resulting in a single, global order of all CAS operations (indeed, all RMW operations) on that variable. This is nice! It means that no SetItem call will ever corrupt the data structure by stepping on another call’s toes. Once every SetItem call has completed, the array is always left in a consistent state.

mint_compare_exchange_strong_32_relaxed generates a lock cmpxchg instruction on x86. As CPU instructions go, this is an expensive one. The above implementation actually uses it more often than necessary, so it’s possible to do better, as I’ll demonstrate further down.

Memory Ordering

The final consideration is memory ordering. SetItem and GetItem use Mintomic’s relaxed atomic primitives, so they are totally susceptible to memory reordering – that is, changes to the order in which modifications to shared memory become visible across different threads. Here’s the code for GetItem:

uint32_t ArrayOfItems::GetItem(uint32_t key)
{
    for (uint32_t idx = 0;; idx++)
    {
        uint32_t probedKey = mint_load_32_relaxed(&m_entries[idx].key);
        if (probedKey == key)
            return mint_load_32_relaxed(&m_entries[idx].value);
        if (probedKey == 0)
            return 0;          
    }
}

GetItem does another linear search, stopping once it finds an entry with a matching key, or a zero key, which is considered the end of the array. A zero value is returned when the key is not found. You’ll notice that neither SetItem nor GetItem use any fence instructions, which are the only way to enforce memory ordering in Mintomic. The beauty of these functions is that they don’t need to enforce memory ordering – they work fine even with the presence of memory reordering!

To see how, let’s return to the source control analogy. Think of each thread as having its own private copy of the shared array. Modifications to the shared array will eventually propagate to each thread’s private copy, but not necessarily in the same order that they were written. Therefore, when GetItem examines each array entry, its private copy of that entry may contain any of the following combinations:

(0, 0) - The end of the item list. The key passed to GetItem was not found.
(key, 0) - An item that has not yet been fully initialized by a parallel call to SetItem. If the key matches the one passed to GetItem, 0 is returned, which is fine.
(key, value) - A fully initialized, valid item in the collection.
(0, value) - Only possible if the stores performed by a single SetItem call got reordered either by the compiler or the processor, and have not yet fully propagated to this thread’s private copy. GetItem will treat this as the end of the item list, which means the key passed to GetItem was not found, which is fine.

Every possible combination is valid. An important part of the trick is that once SetItem assigns a non-zero key to an entry, that entry’s key will never change again. Therefore, once GetItem locates that key, it will either load a corresponding value of zero, or it will load a value which some thread actually intended to store.

Don’t feel too safe, however. Internally, ArrayOfItems is resilient against memory reordering, but externally, the caller may still need to enforce memory ordering using one of Mintomic’s fence instructions. For example, if you publicize the availability of some non-atomic data to other threads by storing a flag in the collection, you must place release semantics on that store by issuing a release fence immediately beforehand.

// Shared variables
char message[256];
ArrayOfItems collection;

void PublishMessage()
{
    // Write to shared memory non-atomically.
    strcpy(message, "I pity the fool!");
    
    // Release fence: The only way to safely pass non-atomic data between threads using Mintomic.
    mint_thread_fence_release();
    
    // Set a flag to indicate to other threads that the message is ready.
    collection.SetItem(SHARED_FLAG_KEY, 1)
}

Similarly, the call to GetItem would need an acquire fence in the receiving thread. (It’s actually tempting to rename these functions SetItemRelaxed and GetItemRelaxed as a reminder of this.) In this way, manipulating an item in the collection is not much different from performing relaxed operations on a shared atomic variable, such as a mint_atomic32_t or a std::atomic<int> in C++11, at a fixed memory address.

Reducing the Number of CAS

The above implementations of SetItem and GetItem are perfectly functional, but we can do better. As mentioned above, CAS, or mint_compare_exchange_strong_32_relaxed, is a relatively expensive operation. Currently, we perform a CAS on each iteration of the loop inside SetItem, but as the array contents grow, most of these CAS will not actually modify their target variable. We can easily optimize SetItem by first checking whether the CAS is necessary, and avoiding it when possible, as follows:

void ArrayOfItems::SetItem(uint32_t key, uint32_t value)
{
    for (uint32_t idx = 0;; idx++)
    {
        // Load the key that was there.
        uint32_t probedKey = mint_load_32_relaxed(&m_entries[idx].key);
        if (probedKey != key)
        {
            // The entry was either free, or contains another key.
            if (probedKey != 0)
                continue;           // Usually, it contains another key. Keep probing.
                
            // The entry was free. Now let's try to take it using a CAS.
            uint32_t prevKey = mint_compare_exchange_strong_32_relaxed(&m_entries[idx].key, 0, key);
            if ((prevKey != 0) && (prevKey != key))
                continue;       // Another thread just stole it from underneath us.

            // Either we just added the key, or another thread did.
        }
        
        // Store the value in this array entry.
        mint_store_32_relaxed(&m_entries[idx].value, value);
        return;
    }
}

The check is performed using mint_load_32_relaxed, which is less expensive than a CAS on all CPUs which Mintomic supports. If the array entry already contains a non-matching, non-zero key, we skip to the next entry without performing a CAS. Otherwise, we must still check whether the subsequent CAS was successful; another thread could (and occasionally will) steal the entry between the mint_load_32_relaxed and the CAS.

The above optimization makes the following sample application run more than ten times faster on my Core 2 Quad Q6600.

Sample Application

To increase your confidence (and mine) that ArrayOfItems works as advertised, I’ve written a sample application which repeatedly performs the following experiment: Two threads are launched in parallel, each adding 2000 unique items using SetItem, then verifying those items using GetItem. The application is based on Mintomic, so it builds and runs on Windows, Linux, MacOS, iOS and Xbox 360.

All the code is on GitHub, so you can compile and run it yourself. See the README.md file for build instructions.

With minimal changes, you could easily adapt ArrayOfItems to use C++11 instead of Mintomic, though I doubt this class is useful enough that anyone would want to do that. For large collections, ArrayOfItems is probably the worst-performing lock-free associative map you could design.

Kind of surprising, then, that with fairly minor modifications, you can transform it into one of the best! That’ll be the subject of the next post.