Preshing on Programming

Atomic vs. Non-Atomic Operations

Much has already been written about atomic operations on the web, usually with a focus on atomic read-modify-write (RMW) operations. However, those aren’t the only kinds of atomic operations. There are also atomic loads and stores, which are equally important. In this post, I’ll compare atomic loads and stores to their non-atomic counterparts at both the processor level and the C/C++ language level. Along the way, we’ll clarify the C++11 concept of a “data race”.

An operation acting on shared memory is atomic if it completes in a single step relative to other threads. When an atomic store is performed on a shared variable, no other thread can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. Non-atomic loads and stores do not make those guarantees.
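To make that concrete, here's a minimal C++11 sketch of my own (the variable names are illustrative) contrasting the two kinds of access:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> g_atomicValue(0);   // atomic loads and stores
uint64_t g_plainValue = 0;                // non-atomic; concurrent access is a data race

void Writer()
{
    // No other thread can ever observe this store half-complete:
    g_atomicValue.store(0x100000002ULL, std::memory_order_relaxed);

    // On a 32-bit platform, this store may happen in two steps, so a
    // concurrent reader could observe a torn value:
    g_plainValue = 0x100000002ULL;
}

uint64_t Reader()
{
    // Reads the entire value as it appeared at a single moment in time:
    return g_atomicValue.load(std::memory_order_relaxed);
}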

The World’s Simplest Lock-Free Hash Table

A lock-free hash table is a double-edged sword. There are applications where it can provide a performance improvement that would be impossible to achieve otherwise. The downside is that it’s complicated.

The first working lock-free hash table I heard about was written in Java by Dr. Cliff Click. He released the source code back in 2007 and gave a presentation about it at Google that same year. When I first watched that presentation, I’ll admit, I didn’t understand most of it. The main conclusion I took away from it was that Dr. Cliff Click must be some kind of wizard.

Luckily, six years has given me enough time to (mostly) catch up to Cliff on this subject. It turns out that you don’t have to be a wizard to understand and implement a very basic, but perfectly functional, lock-free hash table. I’ll share the source code for one here. I’m pretty sure that anyone with experience writing multithreaded C++, and a willingness to comb through previous information on this blog, can fully understand it.

A Lock-Free… Linear Search?

Why yes. In this post, I’ll present a C++ class which maps integer keys to integer values. Both its SetItem and GetItem methods are thread-safe and lock-free. The catch is that both operations are implemented using a linear search.

As you might suspect, this class is not very efficient for large collections. Obviously, I won’t recommend using it in practice – at least not for storing more than a few items. But I think it’s worth studying anyway, as it provides a nice context in which to discuss several lock-free programming concepts. As a bonus, once you grasp how this class works, you’ll be ready to tackle a basic lock-free hash table.

The class, which I’ve imaginatively named ArrayOfItems, contains an ordinary array of key/value pairs. The array is preallocated to a size sufficiently large to hold all the items we’re going to store. We declare both fields as mint_atomic32_t, since we’re going to manipulate them using Mintomic, a portable library for low-level lock-free programming which I released earlier this month.

struct Entry
{
    mint_atomic32_t key;
    mint_atomic32_t value;
};
Entry *m_entries;
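Here's a sketch of how the two methods can be written over that array using Mintomic's relaxed atomics. It matches the linear-search behavior described above; it assumes keys (and, for GetItem's return convention, values) are never zero, so zero can mark an empty slot, and it omits bounds checking for brevity:

void ArrayOfItems::SetItem(uint32_t key, uint32_t value)
{
    for (uint32_t idx = 0;; idx++)
    {
        // Try to claim this slot for our key; if it already holds our key,
        // that's fine too. mint_compare_exchange_strong_32_relaxed returns
        // the value the slot held before the operation.
        uint32_t prevKey = mint_compare_exchange_strong_32_relaxed(&m_entries[idx].key, 0, key);
        if ((prevKey == 0) || (prevKey == key))
        {
            mint_store_32_relaxed(&m_entries[idx].value, value);
            return;
        }
    }
}

uint32_t ArrayOfItems::GetItem(uint32_t key)
{
    for (uint32_t idx = 0;; idx++)
    {
        uint32_t probedKey = mint_load_32_relaxed(&m_entries[idx].key);
        if (probedKey == key)
            return mint_load_32_relaxed(&m_entries[idx].value);
        if (probedKey == 0)
            return 0;   // reached the end of the used entries; not found
    }
}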

Introducing Mintomic: A Small, Portable Lock-Free API

Today, I’m releasing an open source library called Mintomic. Mintomic is an API for low-level lock-free programming in C and C++. It runs on a variety of platforms including Windows, Linux, MacOS, iOS and Xbox 360. Mintomic’s goals are to be efficient, straightforward, and (mostly) compatible with older compilers.


View the documentation

View on GitHub

Mintomic (short for “minimal atomic”) draws a lot of inspiration from the C11 and C++11 atomic library standards, with an important exception: in Mintomic, all atomic operations are “relaxed”. The only way to enforce memory ordering is with explicit fences. Here’s an example taken from my post about weak hardware ordering, rewritten using Mintomic:

#include <mintomic/mintomic.h>

// Define the shared counter and the atomic flag that guards it:
uint32_t g_sharedValue = 0;
mint_atomic32_t flag;

void IncrementSharedValue10000000Times(TimeWaster& tw)
{
    int count = 0;
    while (count < 10000000)
    {
        tw.wasteRandomCycles();

        // Atomic read-modify-write operation:
        if (mint_compare_exchange_strong_32_relaxed(&flag, 0, 1) == 0)
        {
            mint_thread_fence_acquire();    // Acquire fence
            g_sharedValue++;
            mint_thread_fence_release();    // Release fence

            // Atomic store:
            mint_store_32_relaxed(&flag, 0);
            count++;
        }
    }
}

View Your Filesystem History Using Python

Sometimes, it’s useful to look back on your filesystem history.

For example, after installing some new software, you might want to know which files have changed on your hard drive. Or, if you’re a programmer getting started on a new project, you may need to follow a complex and unfamiliar build process. A list of recently modified files can reveal a lot about how that build process works.

Here’s a short Python script to create such a list. It lists the contents of a folder recursively, sorted by modification time.
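Here's a sketch of what such a script can look like; the function name and output format below are my own, but the approach, walking the tree with os.walk, collecting modification times, and sorting, is the natural one:

import os
import time

def list_files_by_mtime(top):
    # Recursively collect (modification time, path) for every file under 'top'.
    entries = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((os.path.getmtime(path), path))
            except OSError:
                pass    # the file may have disappeared mid-scan
    # Most recently modified first:
    entries.sort(reverse=True)
    return entries

if __name__ == '__main__':
    for mtime, path in list_files_by_mtime('.'):
        print(time.ctime(mtime), path)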

This Hash Table Is Faster Than a Judy Array

In game development, we use associative maps for many things: dynamic loading, object reflection, rendering, shader management.

A lot of them fit the following usage pattern:

  • The keys and values have simple integer or pointer types.
  • The keys are effectively random.
  • The basic operation is lookup-or-insert: Look for an existing item first, and add one if not found. Usually, the item is found.
  • Deletes happen infrequently, and when they do, they tend to happen in bulk.

Occasionally, one of these maps is identified as a performance bottleneck. It raises the question: What’s the best data structure to use in this scenario?
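Before answering, it helps to pin down the core operation. Here's lookup-or-insert expressed against std::unordered_map, purely for illustration; the interesting question is what specialized structure can beat a general-purpose one under the usage pattern above:

#include <cstdint>
#include <unordered_map>

typedef std::unordered_map<uint32_t, uint32_t> Map;

uint32_t& LookupOrInsert(Map& map, uint32_t key)
{
    // operator[] performs exactly this operation: it finds the existing
    // item, or default-constructs one (value 0) if the key isn't present.
    return map[key];
}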

How to Generate a Sequence of Unique Random Integers

Suppose we wish to generate a sequence of 10,000,000 random 32-bit integers with no repeats. How can we do it?

I faced this problem recently, and considered several options before finally implementing a custom, non-repeating pseudo-random number generator which runs in O(1) time, requires just 8 bytes of storage, and has pretty good distribution. I thought I’d share the details here.

Approaches Considered

There are already several well-known pseudo-random number generators (PRNGs) such as the Mersenne Twister, an excellent PRNG which distributes integers uniformly across the entire 32-bit range. Unfortunately, calling this PRNG 10,000,000 times does not tend to generate a sequence of 10,000,000 unique values. According to Hash Collision Probabilities, the probability of all 10,000,000 random numbers being unique is approximately e^(−k(k−1)/2N), with k = 10,000,000 and N = 2^32. That works out to about e^(−11,641), which is vanishingly close to zero.
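For reference, here's a sketch of one way to meet the constraints mentioned above, O(1) time and 8 bytes of state: run a counter through a bijective permutation of the 32-bit range. The quadratic-residue permutation below is one known construction for this; treat it as an illustration rather than the exact generator this post develops:

#include <stdint.h>

// x -> x^2 mod p maps [0, p) onto itself one-to-one when p is a prime with
// p % 4 == 3, provided we take p - residue for inputs in the upper half.
// p = 4294967291 is the largest such prime below 2^32.
static uint32_t permuteQPR(uint32_t x)
{
    static const uint32_t prime = 4294967291u;
    if (x >= prime)
        return x;   // the 5 values in [p, 2^32) map to themselves
    uint32_t residue = (uint32_t) (((uint64_t) x * x) % prime);
    return (x <= prime / 2) ? residue : prime - residue;
}

struct UniqueRandom
{
    uint32_t m_index;    // 8 bytes of state total
    uint32_t m_offset;

    UniqueRandom(uint32_t seedIndex, uint32_t seedOffset)
        : m_index(permuteQPR(seedIndex)), m_offset(permuteQPR(seedOffset)) {}

    uint32_t next()
    {
        // Every step below is a bijection on 32-bit values, so the outputs
        // don't repeat until the index wraps after 2^32 calls:
        return permuteQPR((permuteQPR(m_index++) + m_offset) ^ 0x5bf03635);
    }
};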

Arithmetic Encoding Using Fixed-Point Math

My previous post acted as an introduction to arithmetic coding, using the 1MB Sorting Problem as a case study. I ended that post with the question of how to work with fractional values having millions of significant binary digits.

The answer is that we don’t have to. Let’s work with 32-bit fixed-point math instead, and see how far that gets us.

A 32-bit fixed-point number can represent fractions in the interval [0, 1) using 32 bits of precision. In other words, it can encode the first 32 bits of any binary fraction. To define such a fixed-point number, we simply take a regular 32-bit unsigned integer and imagine it as the numerator of a fraction over 2^32.
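For instance (the probability 1/101 is my own example value), encoding a fraction this way and decoding it again looks like this:

#include <stdint.h>
#include <stdio.h>

int main()
{
    // Encode a probability as the numerator of a fraction over 2^32:
    double probability = 1.0 / 101;    // approximately 0.00990099
    uint32_t fixed = (uint32_t) (probability * 4294967296.0 + 0.5);   // round to nearest
    printf("0x%08x\n", fixed);         // prints 0x0288df0d

    // Decode back to a floating-point value:
    double decoded = fixed / 4294967296.0;
    printf("%.10f\n", decoded);        // prints 0.0099009902, close to 1/101
    return 0;
}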

For example, 0x0288df0d represents a fixed-point number approximately equal to D, which, in the previous post, we estimated as the probability of encountering a delta value of 0 in one of our sorted sequences. It’s actually not such a bad approximation, either: the error is within 0.00000023%.

Arithmetic Coding and the 1MB Sorting Problem

It’s been two weeks since the 1MB Sorting Problem was originally featured on Reddit, and in case you don’t think this artificial problem has been thoroughly stomped into the ground yet, here’s a continuation of last post’s explanation of the working C++ program which solves it.

In that post, I gave a high-level outline of the approach, and showed an encoding scheme – which I’ve since learned is Golomb coding – that comes close to meeting the memory requirements, but doesn’t quite fit. Arithmetic coding, on the other hand, does fit. It’s interesting, because this problem, as it was phrased, almost seems designed to force you into arithmetic coding (though Nick Cleaton’s solution manages to avoid it).

I had read about arithmetic coding before, but I never had any reason to sit down and implement it. It always struck me as kind of mystical: how the heck do you encode information in a fraction of a bit, anyway? The 1MB Sorting Problem turned out to be a great excuse to learn how arithmetic coding works.

How Many Sorted Sequences Even Exist?

It’s important to note that the whole reason we’re able to represent a sorted sequence of one million 8-digit numbers in less than 1 MB of memory is that, mathematically, there simply aren’t that many different sorted sequences.
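To put a rough number on it (a back-of-the-envelope estimate of my own, assuming each number is an 8-digit value in [0, 99999999] and the sequence has length one million): a sorted sequence is fully determined by how many times each value appears, so counting sorted sequences is the same as counting multisets of size k = 10^6 drawn from N = 10^8 possibilities:

    C(N + k − 1, k) = C(10^8 + 10^6 − 1, 10^6)

Taking the base-2 logarithm gives roughly 8.1 million bits, or a little over a million bytes. That's why squeezing the sequence into 1 MB is hard, but not impossible.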

1MB Sorting Explained

In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn’t explain how, leaving it as a kind of puzzle for the reader.

I had promised to explain it in a followup post, and in the meantime, there’s been a flurry of discussion in the comments and on Reddit. In particular, commenter Ben Wilhelm (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I’ll write up another explanation as promised.

Data Structures

In this implementation, there is a staging area and a circular buffer. The staging area is just big enough to hold 8000 plain 32-bit integers. The circular buffer always holds a sorted sequence of numbers, using a compact representation which we’ll get to shortly.
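In outline (the 8000-integer staging capacity comes from the text; the other sizes and names below are my assumptions, not taken from the program), the two structures can be declared like this:

#include <stdint.h>

static const int STAGING_CAPACITY = 8000;

// Staging area: incoming numbers accumulate here as plain 32-bit integers
// until it fills up, at which point they're sorted and merged into the
// circular buffer.
uint32_t g_staging[STAGING_CAPACITY];
int g_stagingCount = 0;

// Circular buffer: always holds the entire sorted sequence in a compact
// encoding. CIRCULAR_CAPACITY is a placeholder; the real program sizes it
// to use whatever memory remains within the 1MB budget.
static const int CIRCULAR_CAPACITY = 1000000;
uint8_t g_circular[CIRCULAR_CAPACITY];
int g_readPos = 0;    // oldest byte in the buffer
int g_writePos = 0;   // next byte to be written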