# 1MB Sorting Explained

October 26, 2012

In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn’t explain how, leaving it as a kind of puzzle for the reader.

I had promised to explain it in a followup post, and in the meantime, there’s been a flurry of discussion in the comments and on Reddit. In particular, commenter Ben Wilhelm (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I’ll write up another explanation as promised.

## Data Structures

In this implementation, there is a staging area and a circular buffer. The staging area is just big enough to hold 8000 plain 32-bit integers. The circular buffer always holds a sorted sequence of numbers, using a compact representation which we’ll get to shortly.

We read numbers into the staging area until it’s full. Once it’s full, we sort it in-place, then merge it with the sorted contents of the circular buffer. The merge is basically the same merge operation performed in Mergesort, but we conserve memory using a trick: The new sequence is written to the buffer immediately following the previous one, and the previous sequence is erased while it’s read. This being a circular buffer, we wrap around to the beginning of the buffer once we’ve reached the end.

These steps are repeated until all of the input is read and the final staging area is merged. At that point, we have the entire sorted sequence stored in our circular buffer.
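The staging/merge pipeline described above can be sketched in miniature like this. It's a toy, not the actual program: here a plain `std::vector` stands in for the compact circular buffer (the real program stores the sorted sequence delta-encoded as bits), and `STAGING_SIZE` is shrunk so the control flow is visible:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative only; the real staging area holds 8000 integers.
static const size_t STAGING_SIZE = 4;

// Toy sketch of the staging/merge pipeline. A plain vector stands in
// for the circular buffer so the control flow is easy to follow.
std::vector<uint32_t> stagedSort(const std::vector<uint32_t>& input) {
    std::vector<uint32_t> sorted;   // stands in for the circular buffer
    std::vector<uint32_t> staging;
    auto flush = [&]() {
        std::sort(staging.begin(), staging.end());   // sort staging in place
        std::vector<uint32_t> merged;
        std::merge(sorted.begin(), sorted.end(),     // Mergesort-style merge
                   staging.begin(), staging.end(),
                   std::back_inserter(merged));
        sorted.swap(merged);
        staging.clear();
    };
    for (uint32_t x : input) {
        staging.push_back(x);
        if (staging.size() == STAGING_SIZE)
            flush();                // staging area full: sort and merge
    }
    flush();                        // merge the final, partial staging area
    return sorted;
}
```

The real program's memory trick happens inside the merge: the new encoded sequence overwrites the old one in the circular buffer as it is consumed.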

The staging area is really just an optimization. Theoretically, we don’t need it — we could merge each input number directly into the circular buffer as it comes in. The only problem is that as the sorted sequence grows, the merge operation becomes very slow, mainly due to the decoding and re-encoding process that’s involved. In fact, if you increase the staging area to hold 20000 integers instead of just 8000, the program runs twice as fast. That’s why I made the staging area as large as possible in the available memory.

## A Sequence of Deltas

Next, we need a way to fit the sorted number sequence into a circular buffer significantly less than 1MB (1048576 bytes) in size. Many people on Stack Overflow and Reddit came up with this idea, which is indeed where I got it: Instead of storing the actual numbers, we store a delta sequence — the differences from one number to the next. Because the sequence is already sorted, these delta values tend to be quite small. In fact, given a sequence of N 8-digit numbers, the average delta value is just

$\frac{100000000}{N}$

By the time we’ve reached a million numbers, the average delta shrinks to just 100. Deltas around this size can be represented using 7 bits, since $2^7 = 128$, so intuitively, you can begin to see the potential to save memory. It’s also OK to use a lot more bits to represent larger delta values, since those appear far less frequently. The bigger the delta, the sooner we reach the end of the list!
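To make the delta idea concrete, here’s a minimal sketch (not the post’s actual code) that converts a sorted sequence into deltas and back. The first delta is measured from zero:

```cpp
#include <cstdint>
#include <vector>

// Convert a sorted sequence into deltas: the difference between each
// number and its predecessor (the first delta is measured from zero).
std::vector<uint32_t> toDeltas(const std::vector<uint32_t>& sorted) {
    std::vector<uint32_t> deltas;
    uint32_t prev = 0;
    for (uint32_t x : sorted) {
        deltas.push_back(x - prev);
        prev = x;
    }
    return deltas;
}

// Recover the original sorted sequence by keeping a running sum.
std::vector<uint32_t> fromDeltas(const std::vector<uint32_t>& deltas) {
    std::vector<uint32_t> sorted;
    uint32_t value = 0;
    for (uint32_t d : deltas) {
        value += d;
        sorted.push_back(value);
    }
    return sorted;
}
```

For example, the sorted sequence 3, 10, 10, 250 becomes the deltas 3, 7, 0, 240 — and the sum of all deltas can never exceed the largest possible number, which is what makes the encoding compact.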

## Encoding Each Delta

All that’s left to explain is how we actually encode each delta value. Since each one will use a variable number of bits, it helps to look at the circular buffer as a bit stream, which is what BitReader and BitWriter are for.

There are many ways to encode them, and it’s tough to find a strategy which respects our extremely limited memory budget in every case. For example, this comment describes an encoding strategy which takes 1000000 bytes in the case where every delta is less than 128, but up to 1564452 bytes when other delta values are mixed in.

It turns out that the optimal encoding strategy is arithmetic coding. But before jumping into that, let’s first take a look at an alternate encoding strategy which requires a bit more memory than we’re allowed, but is much easier to understand. This alternate coding also has a few similarities to the arithmetic coding implementation I wrote, so it’ll serve as a pretty good warm-up to that one.

In this alternate strategy, we decode the next delta value as follows. Define an integer variable accumulator, initialize it to 0, then look at the incoming bit stream:

• If the next bit is 1, add 64 to accumulator and repeat.
• If the next bit is 0, read a 6-bit integer from the bitstream, add it to accumulator and return accumulator as the delta.

How much memory does this strategy need to represent an entire number sequence? Well, each delta value will require at least one 0 followed by 6 bits, and we’re going to have a million of those. On top of that, each leading 1 causes our sequence value to increase by 64, which obviously can’t happen more than $\frac{100000000}{64}$ times. Using this strategy, every possible sequence should fit within

$7 \times 1000000 + \displaystyle\frac{100000000}{64} = 8562500 \text{ bits} \approx \mathbf{1070312.5} \text{ bytes}$

Indeed, a sequence where all deltas are zero except for a single 99999999 comes within one bit of that bound, since $\lfloor 99999999 / 64 \rfloor = 1562499$. That’s more than 21KB over budget, so we have to do better.

Effectively, the above strategy follows a path determined by the bit stream to arrive at the next delta value. Here’s a graph to illustrate several possible bit paths:

For those who don’t already recognize it, this turns out to be a Huffman encoding tree. Huffman coding is really just a special case of arithmetic coding. (And as commenter Stergios Stergiou points out, this particular style of Huffman coding is known as Golomb coding.)

When arithmetic coding is used, the entire sequence always fits safely within 1013000 bytes. This strategy proved kind of challenging to implement, so it took the most time to get right. In the next two posts, I’ll describe how it was implemented using all that fancy bit magic you see in Decoder::decode and Encoder::encode.

Excellent. Congrats!

I was working on almost the same approach, but you coded it and explained it before I could get halfway there.

Beware: there’s a problem with the size of the delta bits if you just pick a fixed size. There are some crafted inputs that won’t fit the solution. With a variable bit length for the coded delta it’s OK, because you remove the vulnerable inflexion point.

Again, amazing solution, execution and explanation. I wish there were more like you on StackOverflow.

Alecco

Well done Jeff, this is very impressive considering your quick response time. Kudos!

This, of course, is Golomb(64) coding.

The alternate encoding, you mean? Reading up on it, I think you’re right. Cool.

Tell me what I am missing. If you have 1 million 27-bit numbers (8 decimal digits) in a highly entropic distribution, wouldn’t it be impossible to compress them regardless of method? Or does the structure of this particular problem avoid that?

I’m stupid. It’s the sorting that makes it possible. It isn’t compressing and restoring the original.

Very good. BTW, you could get rid of a separate fixed-size staging area, take the whole memory for the circular buffer, and use the gap between the reading and writing positions for staging (with a starting position right before the buffer’s reading position, proceeding backwards). This allows you to maximize memory utilization and increase average speed.

That’s a terrific idea! Thanks for sharing. You’ve made me wonder what the overall performance improvement would be.

Thinking it through a little, it would add some complexity to the circular buffer. Imagine if all the staged numbers are greater than the contents of the existing sorted list. The merge operation would bump up against the staging area pretty quickly, so you’d need to find another empty part of the buffer to continue writing. You’d likely need to introduce some buffer partitioning schemes and/or a post-merge defragmentation step.

I don’t see much additional complexity here. Given the current min/max and the number of collected input values, you can dynamically calculate a theoretical limit on the merged/compressed circular buffer at every data point, and stop staging when this limit is reached, switching to a sort/merge cycle. Thus the areas will remain separated. Yes, the available staging memory may shrink to the “fixed” ~30K rather quickly (it actually depends on input statistics), but at least you can postpone the first sort/merge until the first large chunk of input is staged (my guesstimate is ~1/6 or more of the total input size).

Just noticed this post, and found it interesting. There is a much easier way to get an arithmetic code that works even better than the one here, however. Instead of using 64 symbols, just use 2. Let the deltas to be encoded be d1, d2, …, dN where N=10^6. Encode value di as 1111…1110, with di 1’s and a terminating 0. Then compute the probabilities of 0 and 1 in the stream:

There are N di’s, so there are N 0’s. The sum of the di’s is at most 10^8-1, so there are at most this many 1’s. Then the probability of a 1 is p1=(10^8-1)/(10^8-1+10^6), which is approx 100/101 as above. And the probability of a 0 is p0=1-p1, which is approx 1/101.

Then the bits used are b0=-log(p0)/log(2) and b1=-log(p1)/log(2), and the max bit length is 10^6 * b0 + (10^8-1)*b1 < 8093741 bits, a little under 252930 32-bit uints, less than your 253250 uints.
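This bound can be checked numerically; a quick sketch (not part of the original program), with the function name `maxEncodedBits` made up for illustration:

```cpp
#include <cmath>

// Worst-case bit count for the binary (unary-expansion) code sketched
// above: exactly 10^6 zero-bits and at most 10^8 - 1 one-bits.
double maxEncodedBits() {
    const double ones  = 1e8 - 1;            // sum of all deltas, at most 10^8 - 1
    const double zeros = 1e6;                // one terminating 0 per delta
    const double p1 = ones / (ones + zeros); // probability of a 1, approx 100/101
    const double p0 = 1.0 - p1;              // probability of a 0, approx 1/101
    const double b0 = -std::log2(p0);        // ideal bits to encode a 0
    const double b1 = -std::log2(p1);        // ideal bits to encode a 1
    return zeros * b0 + ones * b1;           // claimed bound: < 8093741 bits
}
```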

May test later for fun. Good work!

Ah – one more thing. Your solution will likely be much faster if it writes larger blocks. So an interesting question is what block size (your 64 or my 2) gives enough space to presort or do other work to maximize the speed.

Not to kill the mood

In the original task the data was received through a TCP stream.

Since a byte can hold 256 possible values, why would you even keep the whole data in memory when you can just count how many times each byte appears in the stream? That info would be enough to build a “sorted stream”.

Memory usage 256×4 = 1 KB

Not to kill your excitement, but TCP stream format is irrelevant.

Each number is 10 digits. This means that for instance 9,999,999,999 (a 10 digit number) is potentially represented as 2 540B E3FF in hex, which would need 256×5 using your weird memory calculation.

Next, there are a million of them, so your count needs to accommodate 20 bits per number for the case that all the numbers are the same.

This costs 10,000,000,000 * 20 bits ≈ 23 GB, if my math is correct.

This exceeds the allowed memory of 1MB by a rather large amount. In fact, I would guess that even if you compress your counts through some scheme, you will still be massively over budget.

Now your cryptic message goes on about byte based “sorted stream” counts held in memory. Perhaps if you wrote more of your work down in the message and did less of it in your head we could evaluate your algorithm? It seems like you are trying to use a bucket sort without calculating the true memory overhead which is too large.

Very nice work!

Can you speed up the program by changing the size of the staging area and the circular buffer during the run?

In the first step, the stage is half the memory and the circular buffer the other half.
In the second step, the stage is smaller and the circular buffer bigger (I don’t know how much smaller), and so on.
By adapting the size of the stage/buffer at runtime, you should be able to process more numbers per step.

Is this possible?

Rob