Preshing on Programming

Lightweight In-Memory Logging

When debugging multithreaded code, it’s not always easy to determine which codepath was taken. You can’t always reproduce the bug while stepping through the debugger, nor can you always sprinkle printfs throughout the code, as you might in a single-threaded program. There might be millions of events before the bug occurs, and printf can easily slow the application to a crawl, mask the bug, or create a spam fest in the output log.

One way of attacking such problems is to instrument the code so that events are logged to a circular buffer in memory. This is similar to adding printfs, except that only the most recent events are kept in the log, and the performance overhead can be made very low using lock-free techniques.

Here’s one possible implementation. I’ve written it specifically for Windows in 32-bit C++, but you could easily adapt the idea to other platforms. The header file contains the following:

#include <windows.h>
#include <intrin.h>

namespace Logger
{
    struct Event
    {
        DWORD tid;        // Thread ID
        const char* msg;  // Message string
        DWORD param;      // A parameter which can mean anything you want
    };

    static const int BUFFER_SIZE = 65536;   // Must be a power of 2
    extern Event g_events[BUFFER_SIZE];
    extern LONG g_pos;

    inline void Log(const char* msg, DWORD param)
    {
        // Get next event index
        LONG index = _InterlockedIncrement(&g_pos);
        // Write an event at this index
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // Wrap to buffer size
        e->tid = ((DWORD*) __readfsdword(24))[9];           // Thread ID: fs:[0x18] points to the TEB, and DWORD index 9 (offset 0x24) holds it
        e->msg = msg;
        e->param = param;
    }
}

#define LOG(m, p) Logger::Log(m, p)
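The two Win32-specific pieces above (_InterlockedIncrement and the TEB thread-ID trick) map directly onto C++11 atomics. Here's a minimal cross-platform sketch of the same circular-buffer logger; the PortableLogger names are mine, not part of the original header:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

namespace PortableLogger
{
    struct Event
    {
        std::thread::id tid;   // Thread ID
        const char* msg;       // Message string
        uint32_t param;        // A parameter which can mean anything you want
    };

    static const int BUFFER_SIZE = 65536;   // Must be a power of 2
    static Event g_events[BUFFER_SIZE];
    static std::atomic<uint32_t> g_pos(0);

    inline void Log(const char* msg, uint32_t param)
    {
        // Atomically claim the next slot; relaxed ordering suffices because
        // only the increment itself needs to be atomic.
        uint32_t index = g_pos.fetch_add(1, std::memory_order_relaxed);
        // Write an event at this index
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // Wrap to buffer size
        e->tid = std::this_thread::get_id();
        e->msg = msg;
        e->param = param;
    }
}
```

As in the Win32 version, two threads racing on the same slot can tear an event, which is an accepted tradeoff for keeping the hot path to a single atomic increment.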

Memory Reordering Caught in the Act

When writing lock-free code in C or C++, one must often take special care to enforce correct memory ordering. Otherwise, surprising things can happen.

Intel lists several such surprises in Volume 3, §8.2.3 of the Intel 64 and IA-32 Architectures Software Developer's Manual. Here’s one of the simplest examples. Suppose you have two integers X and Y somewhere in memory, both initially 0. Two processors, running in parallel, execute the following machine code:
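The listing is the classic example from the Intel manual: processor 1 executes mov [X], 1 followed by mov r1, [Y], while processor 2 executes mov [Y], 1 followed by mov r2, [X]. Intuitively, r1 and r2 can never both end up 0, yet on x86 they can, because each store may sit in a store buffer while the subsequent load goes ahead. Here's a portable C++11 sketch of the experiment, counting how often both loads return 0 (relaxed atomics stand in for plain x86 mov; the harness is mine):

```cpp
#include <atomic>
#include <thread>

// One batch of trials of the Intel example: in each trial, both stores
// precede both loads in program order, yet r1 == r2 == 0 remains a
// permitted outcome on x86. Returns how many trials hit that outcome.
static int runTrials(int trials)
{
    int reorders = 0;
    for (int i = 0; i < trials; i++)
    {
        std::atomic<int> X(0), Y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            X.store(1, std::memory_order_relaxed);   // mov [X], 1
            r1 = Y.load(std::memory_order_relaxed);  // mov r1, [Y]
        });
        std::thread t2([&] {
            Y.store(1, std::memory_order_relaxed);   // mov [Y], 1
            r2 = X.load(std::memory_order_relaxed);  // mov r2, [X]
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)
            reorders++;  // the "surprising" reordered outcome
    }
    return reorders;
}
```

Thread startup tends to serialize each trial, so you may see zero reorderings with this naive harness; to catch reordering reliably you need to release both threads into the race simultaneously, for example with semaphores.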

Don’t be thrown off by the use of assembly language in this example. It’s really the best way to illustrate CPU ordering. Each processor stores 1 into one of the integer variables, then loads the other integer into a register. (r1 and r2 are just placeholder names for actual x86 registers, such as eax.)

How to Remove Camera Shake from iPhone 4S Videos

Cell phone videos are notoriously shaky. It’s always been difficult to get a steady picture. So when Apple introduced a video stabilization feature on the iPhone 4S, I was really interested. I knew that the state of the art in digital video stabilization was capable of some amazing results. Finally, a few weeks back, I upgraded my phone to an iPhone 4S and took it on a two-week trip to Costa Rica.

By the end of the trip, I had shot hundreds of photos and 80 video clips – about 20 minutes of video in total. The iPhone 4S has a great camera, but I quickly learned that its built-in video stabilization feature is, well, not state of the art. It can perform some modest stabilization in cases where you attempt to hold the camera still and point it in a single direction. But other times, it doesn’t seem to help at all.

That’s OK, because with a little patience, and a free utility called Deshaker by Gunnar Thalin, you can remove the camera shake on your PC when you get home. For example, here’s some footage I shot of a coati at the top of Cerro Chato. The first part of the video shows the original, shaky iPhone 4S video – I blame the tasty Costa Rican coffee – and the second part shows the result after running it through Deshaker. While the result may not be flawless, I find it to be a significant improvement:

Implementing a Recursive Mutex

When optimizing code for multiple CPU cores, sometimes you need to write a new synchronization primitive. I don’t mean to encourage it, but it does happen. And if you’re going to do it, you might as well start by looking at a few examples. This won’t save you from shooting yourself in the foot, but it may help reduce the number of times, so you can walk away with a few toes remaining.

In my previous post, I showed how to implement a synchronization primitive known as the Benaphore in C++ on Win32. The Benaphore is not lock-free (being a lock itself), but it does serve as a simple yet instructive example of writing a synchronization primitive in user space. It also offers very low overhead when there’s no lock contention.

One limitation of the implementation I showed was that it was non-recursive. This means that if the same thread attempts to obtain the same lock twice, it will deadlock. In this post, I’ll show how to extend the implementation to support recursive locking.
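The recursion logic itself is independent of the underlying lock: record the owning thread's ID and a nesting depth, acquire the real lock only on the outermost call, and release it only when the depth returns to zero. Here's a sketch of that logic layered over std::mutex rather than the post's Win32 Benaphore (the class and member names are mine):

```cpp
#include <atomic>
#include <mutex>
#include <thread>

class RecursiveLock
{
public:
    void lock()
    {
        std::thread::id self = std::this_thread::get_id();
        if (m_owner.load(std::memory_order_relaxed) == self)
        {
            m_recursion++;   // Re-entrant call from the owning thread
            return;
        }
        m_mutex.lock();      // Outermost call: take the real lock
        m_owner.store(self, std::memory_order_relaxed);
        m_recursion = 1;
    }

    void unlock()
    {
        if (--m_recursion == 0)
        {
            m_owner.store(std::thread::id(), std::memory_order_relaxed);
            m_mutex.unlock();
        }
    }

private:
    std::mutex m_mutex;
    std::atomic<std::thread::id> m_owner;  // Default-constructed id = no owner
    int m_recursion = 0;                   // Only touched by the owning thread
};
```

Note that m_owner must be read atomically, since a non-owning thread compares it against its own ID while the owner may be writing it.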

Roll Your Own Lightweight Mutex

In an earlier post, I pointed out the importance of using a lightweight mutex. I also mentioned it was possible to write your own, provided you can live with certain limitations.

Why would you do such a thing? Well, in the past, some platforms (like BeOS) didn’t provide a lightweight mutex in the native API. Today, that’s not really a concern. I’m mainly showing this because it’s an interesting look at implementing synchronization primitives in general. As a curiosity, it just so happens this implementation shaves almost 50% off the overhead of the Windows Critical Section in the uncontended case. [Update: A closer look shows that under very high contention, a Windows Critical Section still performs much better.]

For the record, there are numerous ways to write your own mutex – or lock – entirely in user space, each with its own tradeoffs:

  • Spin locks. These employ a busy-wait strategy which has the potential to waste CPU time, and in the worst case, can lead to livelock when competing threads run on the same core. Still, some programmers have found measurable speed improvements switching to spin locks in certain cases.
  • Peterson’s algorithm is like a spinlock for two threads. A neat trick, but seems useless on today’s platforms. I find it noteworthy because Bartosz Milewski used this algorithm as a case study while discussing the finer points of the x86 memory model. Others have discussed Peterson’s lock in similar contexts.
  • Charles Bloom has a long writeup describing various mutex implementations. Excellent information, but possibly Greek to anyone unfamiliar with C++11’s atomics library and Relacy’s $ notation.

A Look Back at Single-Threaded CPU Performance

Throughout the ’80s and ’90s, CPUs were able to run virtually any kind of software twice as fast every 18-20 months. The rate of change was incredible. Your 486SX-16 was almost obsolete by the time you got it through the door. But eventually, at some point in the mid-2000s, progress slowed down considerably for single-threaded software – which was most software.

Perhaps the turning point came in May 2004, when Intel canceled its latest single-core development effort to focus on multicore designs. Later that year, Herb Sutter wrote his now-famous article, The Free Lunch Is Over. Not all software will run remarkably faster year-over-year anymore, he warned us. Concurrent software would continue its meteoric rise, but single-threaded software was about to get left in the dust.

So, what’s happened since 2004? Clearly, multicore computing has become mainstream. Everybody acknowledges that single-threaded CPU performance no longer increases as quickly as it previously did – but at what rate is it actually increasing?

It’s tough to find an answer. Bill Dally of NVIDIA threw out a few numbers in a recent presentation: he had predicted 19% per year, but says it’s turned out closer to 5%. Last year, Chuck Moore of AMD presented this graph, suggesting that single-threaded CPU performance recently started going backwards:

A C++ Profiling Module for Multithreaded APIs

In my post about lock contention, I gave some statistics for the memory allocator in a multithreaded game engine: 15000 calls per second coming from 3 threads, taking around 2% CPU. To collect those statistics, I wrote a small profiling module, which I’ll share here.

A profiling module is different from conventional profilers like xperf or VTune in that no third-party tools are required. You just drop the module into any C++ application, and the process collects and reports performance data by itself.

This particular profiling module is meant to act on one or more target modules in the application. A target module can be anything which exposes a well-defined API, such as a memory allocator. To make it work, you must insert a macro named API_PROFILER into every public function exposed by that API. Below, I’ve added it to dlmalloc, one of the functions in the Doug Lea Malloc API. The same macro should be added to dlrealloc, dlfree, and other public functions as well.

DEFINE_API_PROFILER(dlmalloc);

void* dlmalloc(size_t bytes)
{
    API_PROFILER(dlmalloc);

#if USE_LOCKS
    ensure_initialization();
#endif

    if (!PREACTION(gm))
    {
        void* mem;
        size_t nb;
        if (bytes <= MAX_SMALL_REQUEST)
        {
            ...
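The excerpt above doesn't show the macro's internals, but the usual shape for this kind of instrumentation is a scope-based (RAII) timer: the macro declares a stack object whose constructor reads a timestamp and whose destructor adds the elapsed time to a per-API accumulator. Here's a portable guess at that shape; the names mirror the macros above, but this is a sketch of the general technique, not the Windows-specific module from the post:

```cpp
#include <chrono>

struct ApiProfiler
{
    // Accumulated statistics for one profiled API entry point.
    struct Info
    {
        long long accumulatedNs;  // Total nanoseconds spent inside the API
        long long hitCount;       // Number of calls
        const char* name;
    };

    Info* m_info;
    std::chrono::steady_clock::time_point m_start;

    explicit ApiProfiler(Info* info)
        : m_info(info), m_start(std::chrono::steady_clock::now()) {}

    ~ApiProfiler()  // Runs when the profiled function returns
    {
        auto elapsed = std::chrono::steady_clock::now() - m_start;
        m_info->accumulatedNs +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        m_info->hitCount++;
    }
};

// One Info record per profiled API function, at file scope.
#define DEFINE_API_PROFILER(name) \
    ApiProfiler::Info g_apiProfiler_##name = { 0, 0, #name }

// A scoped timer attributing this call to the named record.
#define API_PROFILER(name) \
    ApiProfiler apiProfiler_##name(&g_apiProfiler_##name)

// Example of instrumenting a function, mirroring the dlmalloc usage above.
DEFINE_API_PROFILER(sampleWork);

int sampleWork(int n)
{
    API_PROFILER(sampleWork);
    int s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}
```

The real module also has to aggregate per thread and periodically report, which is where most of its complexity lives.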

Always Use a Lightweight Mutex

In multithreaded programming, we often speak of locks (also known as mutexes). But a lock is only a concept. To actually use that concept, you need an implementation. As it turns out, there are many ways to implement a lock, and those implementations vary wildly in performance.

The Windows SDK provides two lock implementations for C/C++: the Mutex and the Critical Section. (As Ned Batchelder points out, Critical Section is probably not the best name to give to the lock itself, but we’ll forgive that here.)

The Windows Critical Section is what we call a lightweight mutex. It’s optimized for the case when there are no other threads competing for the lock. To demonstrate using a simple example, here’s a single thread which locks and unlocks a Windows Mutex exactly one million times.

HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
for (int i = 0; i < 1000000; i++)
{
    WaitForSingleObject(mutex, INFINITE);
    ReleaseMutex(mutex);
}
CloseHandle(mutex);
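To turn that loop into a measurement, wrap it with a steady clock and divide by the iteration count. Here's a portable sketch; std::mutex stands in for the kernel Mutex, so the absolute numbers won't match the Windows comparison, but the harness shape is the same:

```cpp
#include <chrono>
#include <mutex>

// Times n uncontended lock/unlock pairs and returns nanoseconds per pair.
static double timeLockUnlock(int n)
{
    std::mutex m;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++)
    {
        m.lock();
        m.unlock();
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
           / double(n);
}
```

Since no other thread ever contends here, this measures exactly the uncontended fast path that a lightweight mutex optimizes.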

Locks Aren’t Slow; Lock Contention Is

Locks (also known as mutexes) have a history of being misjudged. Back in 1986, in a Usenet discussion on multithreading, Matthew Dillon wrote, “Most people have the misconception that locks are slow.” 25 years later, this misconception still seems to pop up once in a while.

It’s true that locking is slow on some platforms, or when the lock is highly contended. And when you’re developing a multithreaded application, it’s very common to find a huge performance bottleneck caused by a single lock. But that doesn’t mean all locks are slow. As I’ll show in this post, sometimes a locking strategy achieves excellent performance.

Perhaps the most easily overlooked source of this misconception is that not all programmers are aware of the difference between a lightweight mutex and a “kernel mutex”. I’ll talk about that in my next post, Always Use a Lightweight Mutex. For now, let’s just say that if you’re programming in C/C++ on Windows, the Critical Section object is the one you want.

Other times, the conclusion that locks are slow is supported by a benchmark. For example, this post measures the performance of a lock under heavy conditions: each thread must hold the lock to do any work (high contention), and the lock is held for an extremely short interval of time (high frequency). It’s a good read, but in a real application, you generally want to avoid using locks in that way. To put things in context, I’ve devised a benchmark which includes both best-case and worst-case usage scenarios for locks.

How to Generate Random Timings for a Poisson Process

What’s a Poisson process, and how is it useful?

Any time you have events which occur individually at random moments, but which tend to occur at an average rate when viewed as a group, you have a Poisson process.

For example, the USGS estimates that each year, there are approximately 13000 earthquakes of magnitude 4+ around the world. Those earthquakes are scattered randomly throughout the year, but the yearly total stays close to 13000. That’s one example of a Poisson process. The Wikipedia page lists several others.

In statistics, there are a bunch of functions and equations to help model a Poisson process. I’ll present one of those functions in this post, and demonstrate its use in writing a simulation.

The Exponential Distribution

If 13000 such earthquakes happen every year, it means that, on average, one earthquake happens every 40 minutes. So, let’s define a variable \( \lambda = \frac{1}{40} \) and call it the rate parameter. The rate parameter \( \lambda \) is a measure of frequency: the average rate of events (in this case, earthquakes) per unit of time (in this case, minutes).
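Given \( \lambda \), the waiting time until the next event follows the exponential distribution, and a uniform random number \( U \) in \( [0, 1) \) can be converted into such a waiting time by inverse transform sampling: \( t = \frac{-\ln(1 - U)}{\lambda} \). Here's a C++ sketch of that conversion (the function name is mine):

```cpp
#include <cmath>
#include <random>

// Returns a random waiting time (in minutes) until the next event of a
// Poisson process with the given rate parameter (events per minute),
// via inverse transform sampling of the exponential distribution.
static double nextEventTime(double rateParameter, std::mt19937& gen)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    // uniform(gen) is in [0, 1), so 1 - U is in (0, 1] and the log
    // argument is never zero.
    return -std::log(1.0 - uniform(gen)) / rateParameter;
}
```

With \( \lambda = \frac{1}{40} \), averaging many samples should come out near 40 minutes, matching the earthquake rate above. (The standard library also provides std::exponential_distribution, which does the same job.)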